I did a small and fun assembler SIMD optimization job last week. The target architecture was ARMv6, but since the code will run on the iPhone I tried to keep it fast on the Cortex-A8 as well.
When I did some profiling on my BeagleBoard I got a surprising result: the code ran faster than it should. This was odd. That had never happened to me before.
Fast forward 5 hours and lots of micro-benchmarking:
The 16 bit SMULxy multiplies on the Cortex-A8 are a cycle faster than documented!
They take one cycle to issue and have a result latency of three cycles (rule of thumb; it's a bit more complicated than that). And this applies to all variants of the instruction: SMULBB, SMULBT, SMULTT and SMULTB.
The multiply-accumulate variants of the 16 bit multiplies execute as documented: two cycles issue and three cycles result latency.
This is nice. I have used the 16 bit multiplies a lot in the past but stopped using them because I thought they offered no benefit over the more general MUL instruction on the Cortex-A8. The SMULxy multiplies mix very well with the ARMv6 SIMD multiplies. Both work on 16 bit integers, but while the SIMD instructions operate on packed pairs of 16 bit values, SMULxy multiplies just a single 16 bit half from each argument, and you can specify whether you want the top or bottom 16 bits of each argument. Very flexible.
All this leads to nice code sequences. For example, a three element dot product of signed 16 bit quantities, something that is used quite a lot for color-space conversion.
Assume these register values on entry:
         MSB16        LSB16
      +----------+----------+
  r1: |  ignored |    a1    |
      +----------+----------+
      +----------+----------+
  r2: |  ignored |    a2    |
      +----------+----------+
      +----------+----------+
  r3: |    b1    |    c1    |
      +----------+----------+
      +----------+----------+
  r4: |    b2    |    c2    |
      +----------+----------+
And this code sequence:
  smulbb  r0, r1, r2
  smlad   r0, r3, r4, r0

This gives the result: r0 = (a1*a2) + (b1*b2) + (c1*c2)
On the Cortex-A8 this will schedule like this:
  Cycle0:  smulbb r0, r1, r2      Pipe0
           nop                    Pipe1 (free, can be used for non-multiplies)
  Cycle1:  smlad  r0, r3, r4, r0  Pipe0
           nop                    Pipe1 (free, can be used for non-multiplies)
  Cycle2:  blocked, because smlad is a multi-cycle instruction.

The result (r0) will be available three cycles later (cycle 6) for most instructions. You can execute whatever you want in between as long as you don't touch r0.
Note that this is a special case: the SMULBB instruction in cycle 0 generates its result in r0. If the next instruction is one of the multiply-accumulate family and that register is used as the accumulator argument, a special forwarding path in the Cortex-A8 kicks in and the result latency is lowered to just one cycle. Cool, ain't it?
Btw: thanks to Måns/MRU. He was so kind as to verify the timing on his BeagleBoard.