Faster Cortex-A8 16-bit Multiplies

Last week I did a small, fun SIMD optimization job in assembler. The target architecture was ARMv6, but since the code will run on the iPhone I tried to keep it fast on the Cortex-A8 as well.

When I did some profiling on my BeagleBoard, I got a surprising result: the code ran faster than it should. This was odd. That had never happened to me before.

Fast forward 5 hours and lots of micro-benchmarking:

The 16-bit multiplies SMULxy on the Cortex-A8 are a cycle faster than documented!

They take one cycle for issue and have a result-latency of three cycles (rule of thumb, it’s a bit more complicated than that). And this applies to all variants of this instruction: SMULBB, SMULBT, SMULTT and SMULTB.

The multiply-accumulate variants of the 16-bit multiplies execute as documented: two cycles issue and three cycles result-latency.

This is nice. I have used the 16-bit multiplies a lot in the past but stopped using them because I thought they offered no benefit over the more general MUL instruction on the Cortex-A8. The SMULxy multiplies mix very well with the ARMv6 SIMD multiplies. Both work on 16-bit integers, but the SIMD instructions take packed 16-bit arguments (two values per register) while SMULxy uses only a single 16-bit value from each register, and you can specify whether you want the top or bottom 16 bits of each argument. Very flexible.
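
To make the top/bottom selection concrete, here is a minimal C sketch of what the four SMULxy variants compute. It models only the arithmetic, not the timing. The helper names bottom16 and top16 are made up for this illustration, and the narrowing cast to int16_t assumes the usual two's-complement behavior of common compilers.

```c
#include <stdint.h>

/* Signed 16-bit value held in bits 15..0 of a register. */
static inline int32_t bottom16(uint32_t r) { return (int16_t)(r & 0xFFFFu); }

/* Signed 16-bit value held in bits 31..16 of a register. */
static inline int32_t top16(uint32_t r) { return (int16_t)(r >> 16); }

/* SMUL<x><y> Rd, Rm, Rs: x picks the half of Rm, y picks the half of Rs.
 * The product of two signed 16-bit values always fits in 32 bits. */
static inline int32_t smulbb(uint32_t rm, uint32_t rs) { return bottom16(rm) * bottom16(rs); }
static inline int32_t smulbt(uint32_t rm, uint32_t rs) { return bottom16(rm) * top16(rs); }
static inline int32_t smultb(uint32_t rm, uint32_t rs) { return top16(rm) * bottom16(rs); }
static inline int32_t smultt(uint32_t rm, uint32_t rs) { return top16(rm) * top16(rs); }
```

With rm = 0x0003FFFE (top 3, bottom -2) and rs = 0xFFFB0007 (top -5, bottom 7), the four variants pick out the four possible products of one half from each register.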

All this leads to nice code sequences. For example, a three-element dot product of signed 16-bit quantities, something that is used quite a lot for color-space conversion.

Assume these register values on entry:

              MSB16      LSB16

          +----------+----------+
      r1: | ignored  |    a1    |
          +----------+----------+
          +----------+----------+
      r2: | ignored  |    a2    |
          +----------+----------+
          +----------+----------+
      r3: |    b1    |    c1    |
          +----------+----------+
          +----------+----------+
      r4: |    b2    |    c2    |
          +----------+----------+

And this code sequence:

    smulbb      r0, r1, r2      
    smlad       r0, r3, r4, r0
 

This gives the result: r0 = (a1*a2) + (b1*b2) + (c1*c2)
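
For readers who want to check the arithmetic off-target, the two-instruction sequence can be modeled in plain C. This is only a sketch of the data flow under the register layout above: pack16 and dot3 are hypothetical helper names, the junk value placed in the ignored top halves is arbitrary, and the overflow flag behavior of SMLAD is not modeled (the sum simply wraps modulo 2^32).

```c
#include <stdint.h>

static inline int32_t bottom16(uint32_t r) { return (int16_t)(r & 0xFFFFu); }
static inline int32_t top16(uint32_t r)    { return (int16_t)(r >> 16); }

/* Build a register image with `hi` in bits 31..16 and `lo` in bits 15..0. */
static inline uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* smulbb rd, rm, rs: bottom half times bottom half. */
static inline int32_t smulbb(uint32_t rm, uint32_t rs)
{
    return bottom16(rm) * bottom16(rs);
}

/* smlad rd, rm, rs, rn: both halfword products added to the accumulator. */
static inline int32_t smlad(uint32_t rm, uint32_t rs, int32_t rn)
{
    uint32_t lo = (uint32_t)(bottom16(rm) * bottom16(rs));
    uint32_t hi = (uint32_t)(top16(rm) * top16(rs));
    return (int32_t)((uint32_t)rn + lo + hi);   /* wraps like the hardware add */
}

/* Model of the sequence: smulbb r0, r1, r2 ; smlad r0, r3, r4, r0 */
static inline int32_t dot3(int16_t a1, int16_t a2,
                           int16_t b1, int16_t b2,
                           int16_t c1, int16_t c2)
{
    uint32_t r1 = pack16(0x7FFF, a1);   /* top half is ignored by smulbb */
    uint32_t r2 = pack16(0x7FFF, a2);   /* top half is ignored by smulbb */
    uint32_t r3 = pack16(b1, c1);
    uint32_t r4 = pack16(b2, c2);

    int32_t r0 = smulbb(r1, r2);        /* r0 = a1*a2 */
    r0 = smlad(r3, r4, r0);             /* r0 += b1*b2 + c1*c2 */
    return r0;
}
```

Calling dot3(1, 2, 3, 4, 5, 6) walks through the same steps as the assembly: the SMULBB model produces 2, and the SMLAD model adds 12 and 30 on top of it.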

On the Cortex-A8 this will schedule like this:

  Cycle0:
  
    smulbb      r0, r1, r2          Pipe0
    nop                             Pipe1   (free, can be used for non-multiplies)
  
  Cycle1:        
  
    smlad       r0, r3, r4, r0      Pipe0
    nop                             Pipe1   (free, can be used for non-multiplies)
  
  Cycle2:        
  
    blocked, because smlad is a multi-cycle instruction.

The result (r0) will be available three cycles later (cycle 6) for most instructions. You can execute whatever you want in-between as long as you don’t touch r0.

Note that this is a special case: The SMULBB instruction in cycle 0 generates its result in r0. If the next instruction is one of the multiply-accumulate family and that register is used as the accumulate argument, a special forwarding path of the Cortex-A8 kicks in and the result latency is lowered to just one cycle. Cool, ain’t it?

Btw: Thanks to Måns/MRU. He was kind enough to verify the timing on his BeagleBoard.


5 Responses to Faster Cortex-A8 16-bit Multiplies

  1. Mans says:

    One should be careful when talking about result latency on ARM. Most simple instructions need their source operands in stage E2, while more complicated operations (e.g. multiplies and variably shifted operands) require the sources already in E1. The multiplication instructions all (both per the TRM and the ones I measured) provide the result in E5. This gives an observed latency of 4 or 5 cycles depending on the stage in which the following dependent instruction requires the operand.

    With back-to-back independent SMULBB instructions, I measured one cycle per instruction. With each SMULBB using the result of the previous as one of its sources, I got 5 cycles per instruction, consistent with the result from E5 required in E1 as per the TRM. Back-to-back SMLABB instructions with dependency only on the accumulator take 2 cycles each, again in agreement with the TRM.

  2. Nils says:

    Hi Mans,

    I tried to keep the latency information simple.

    Most people don’t know how to read the pipeline stage tables, and speaking of E5 and E2 would just confuse them. That’s why I simplified and wrote about the simple 3 cycle latency. By 3 cycles I mean that you have to fill the next three cycles with instructions if you don’t want to see a stall. Technically that’s four cycles, of course.

  3. Etienne says:

    Hi.
    Interesting example.
    Let me be sure I’ve understood what you said correctly.

    smulbb r0, r1, r2
    uses 1 cycle but has 3 cycles of latency.

    and the
    smlad r0, r3, r4, r0
    uses 2 cycles but will start the final addition in cycle 3, so it will be able to use the r0 result of the first mul!

    Is that it?

  4. Pro Soccer says:

    I’d need to check with you here. Which isn’t one thing I usually do! I enjoy reading a submit that will make people think. Additionally, thanks for allowing me to remark!

  5. Hum.

    Thank you for this interesting post.
    I’ve run some tests.

    And I can confirm that you seem to be right. SMULxy (smulbb, smultb, smulbt, smultt) takes only one cycle.
    It seems logical that SMULxy could not take more time than SMULWy.

    I don’t know why ARM does not update the documentation. The MUL cycle page has many errors:
    – There is no reference to the MLA instruction
    – They wrote SMLALxy instead of SMLAxy
    – The cycle information for the MLA shortcuts is quite strange. If you apply the stage rules, you should not be able to execute a MUL and an MLA back-to-back.
    – …
