ARM NEON Optimization. An Example

Since there is so little information about NEON optimizations out there I thought I’d write a little about it.

Some weeks ago someone on the beagle-board mailing-list asked how to optimize a color to grayscale conversion for images. I haven’t done much pixel processing with ARM NEON yet, so I gave it a try. The results I got were quite spectacular, but more on this later.

For the color to grayscale conversion I used a very simple conversion scheme: a weighted average of the red, green and blue components. The weights 77, 151 and 28 sum up to 256, so the result only needs a final division by 256. This conversion ignores the effect of gamma but works well enough in practice. Also I decided not to do proper rounding. It’s just an example after all.

First a reference implementation in C:

#include <stdint.h>

void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    int r = *src++; // load red
    int g = *src++; // load green
    int b = *src++; // load blue

    // build weighted average:
    int y = (r*77)+(g*151)+(b*28);

    // undo the scale by 256 and write to memory:
    *dest++ = (y>>8);
  }
}

Optimization with NEON Intrinsics

Let’s start optimizing the code using the compiler intrinsics. Intrinsics are nice to use because they behave just like C-functions but compile to a single assembler statement. At least in theory, as I’ll show you later…

Since NEON works in 64 or 128 bit registers it’s best to process eight pixels in parallel. That way we can exploit the parallel nature of the SIMD-unit. Here is what I came up with:

#include <arm_neon.h>

void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  uint8x8_t rfac = vdup_n_u8 (77);
  uint8x8_t gfac = vdup_n_u8 (151);
  uint8x8_t bfac = vdup_n_u8 (28);
  n /= 8;  // each iteration processes eight pixels

  for (i=0; i<n; i++)
  {
    uint16x8_t  temp;
    uint8x8x3_t rgb  = vld3_u8 (src);
    uint8x8_t result;

    temp = vmull_u8 (rgb.val[0],      rfac);
    temp = vmlal_u8 (temp,rgb.val[1], gfac);
    temp = vmlal_u8 (temp,rgb.val[2], bfac);

    result = vshrn_n_u16 (temp, 8);
    vst1_u8 (dest, result);
    src  += 8*3;
    dest += 8;
  }
}
Let’s take a look at it step by step:

First off I load my weight factors into three NEON registers. The vdup.8 instruction does this and also replicates the byte into all 8 bytes of the NEON register.

    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28); 

Now I load 8 pixels at once into three registers.

    uint8x8x3_t rgb  = vld3_u8 (src);

The vld3.8 instruction is a specialty of the NEON instruction set. With NEON you can not only do loads and stores of multiple registers at once, you can de-interleave the data on the fly as well. Since I expect my pixel data to be interleaved the vld3.8 instruction is a perfect fit for a tight loop.

After the load, I have all the red components of 8 pixels in the first loaded register. The green components end up in the second and blue in the third.
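
In case the de-interleaving is hard to visualize: here is a scalar sketch of what that single load does (semantics only, of course — illustration, not how the hardware executes it):

    // scalar equivalent of vld3_u8 (illustration only):
    uint8_t r[8], g[8], b[8];
    for (int i = 0; i < 8; i++)
    {
        r[i] = src[3*i + 0];  // every third byte, starting at offset 0
        g[i] = src[3*i + 1];
        b[i] = src[3*i + 2];
    }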

Now calculate the weighted average:

    temp = vmull_u8 (rgb.val[0],      rfac);
    temp = vmlal_u8 (temp,rgb.val[1], gfac);
    temp = vmlal_u8 (temp,rgb.val[2], bfac);

vmull.u8 multiplies each byte of the first argument by the corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned in a 128 bit NEON register pair.

vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result.

So we end up with just three instructions for the weighted average of eight pixels. Nice.
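
If the widening is hard to picture, here is what those three intrinsics compute per lane, written out in scalar C (using the r/g/b arrays from the sketch above). The products and sums are done in 16 bit, which is why nothing can overflow: the worst case is 255*(77+151+28) = 65280.

    uint16_t temp[8];
    for (int i = 0; i < 8; i++)
    {
        temp[i]  = (uint16_t)(r[i] * 77);   // vmull.u8: widening multiply
        temp[i] += (uint16_t)(g[i] * 151);  // vmlal.u8: widening multiply-accumulate
        temp[i] += (uint16_t)(b[i] * 28);   // vmlal.u8
    }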

Now it’s time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits, which is equivalent to a division by 256. ARM NEON has lots of instructions to do the shift, but there is also a “narrow” variant. This one does two things at once: it does the shift and afterwards converts the 16 bit integers back to 8 bit by removing the high byte of each result. We get back from the 128 bit register pair to a single 64 bit register.

    result = vshrn_n_u16 (temp, 8);

And finally store the result.

    vst1_u8 (dest, result);

First Results:

How do the reference C-function and the NEON-optimized version compare? I did a test on my OMAP3 Cortex-A8 CPU on the beagle-board and got the following timings:

C-version:       15.1 cycles per pixel.
NEON-version:     9.9 cycles per pixel.
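
(In case you want to reproduce such numbers: here is a minimal timing sketch, not the exact harness I used. The clock frequency is an assumption: the beagle-board’s Cortex-A8 runs at 500 or 600 MHz depending on board revision and settings, so adjust CPU_HZ to yours. Link with -lrt on older glibc versions.)

    #include <stdint.h>
    #include <time.h>

    #define CPU_HZ 500e6  /* assumed core clock -- adjust to your board */

    double cycles_per_pixel (void (*convert)(uint8_t*, uint8_t*, int),
                             uint8_t *dest, uint8_t *src, int pixels, int reps)
    {
      int i;
      double secs;
      struct timespec t0, t1;

      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (i=0; i<reps; i++)
        convert (dest, src, pixels);
      clock_gettime (CLOCK_MONOTONIC, &t1);

      /* elapsed wall-clock time in seconds: */
      secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

      /* cycles spent per pixel, assuming the core ran at CPU_HZ: */
      return secs * CPU_HZ / ((double)pixels * reps);
    }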

That’s only a speed-up factor of 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions, after all. What’s going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:

 160:   f46a040f        vld3.8  {d16-d18}, [sl]
 164:   e1a0c005        mov     ip, r5
 168:   ecc80b06        vstmia  r8, {d16-d18}
 16c:   e1a04007        mov     r4, r7
 170:   e2866001        add     r6, r6, #1      ; 0x1
 174:   e28aa018        add     sl, sl, #24     ; 0x18
 178:   e8bc000f        ldm     ip!, {r0, r1, r2, r3}
 17c:   e15b0006        cmp     fp, r6
 180:   e1a08005        mov     r8, r5
 184:   e8a4000f        stmia   r4!, {r0, r1, r2, r3}
 188:   eddd0b06        vldr    d16, [sp, #24]
 18c:   e89c0003        ldm     ip, {r0, r1}
 190:   eddd2b08        vldr    d18, [sp, #32]
 194:   f3c00ca6        vmull.u8        q8, d16, d22
 198:   f3c208a5        vmlal.u8        q8, d18, d21
 19c:   e8840003        stm     r4, {r0, r1}
 1a0:   eddd3b0a        vldr    d19, [sp, #40]
 1a4:   f3c308a4        vmlal.u8        q8, d19, d20
 1a8:   f2c80830        vshrn.i16       d16, q8, #8
 1ac:   f449070f        vst1.8  {d16}, [r9]
 1b0:   e2899008        add     r9, r9, #8      ; 0x8
 1b4:   caffffe9        bgt     160

Note the store at offset 168? The compiler decides to write the three registers onto the stack. After a bit of useless memory traffic on the integer side, it reloads them (offsets 188, 190 and 1a0) into exactly the same physical NEON registers.

What do all the ordinary integer instructions do? I have no idea. Lots of memory accesses target the stack for no good reason. There is definitely no shortage of registers anywhere. For reference: I used the GCC 4.3.3 (CodeSourcery 2009q1 lite) compiler.

NEON and assembler

Since the compiler can’t generate good code I wrote the same loop in assembler. In a nutshell I just took the intrinsic based loop and converted the instructions one by one. The loop-control is a bit different, but that’s all.


      # r0: Ptr to destination data
      # r1: Ptr to source data
      # r2: Iteration count:

      push        {r4-r5,lr}
      lsr         r2, r2, #3

      # build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5

  .loop:
      # load 8 pixels:
      vld3.8      {d0-d2}, [r1]!

      # do the weight average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5

      # shift and store:
      vshrn.u16   d6, q3, #8
      vst1.8      {d6}, [r0]!

      subs        r2, r2, #1
      bne         .loop

      pop         { r4-r5, pc }

Final Results:

Time for some benchmarking again. How does the hand-written assembler version compare? Well – here are the results:

  C-version:       15.1 cycles per pixel.
  NEON-version:     9.9 cycles per pixel.
  Assembler:        2.0 cycles per pixel.

That’s roughly a factor of five over the intrinsic version and 7.5 times faster than my not-so-bad C implementation. And keep in mind: I didn’t even optimize the assembler loop.

My conclusion: if you want performance out of your NEON unit, stay away from the intrinsics. They are nice as a prototyping tool. Use them to get your algorithm working and then rewrite the NEON-parts of it in assembler.
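
If you want to call the assembler version from C: it follows the standard EABI calling convention (arguments in r0, r1 and r2), so a plain prototype is all you need. A small sketch — the symbol name here is made up, use whatever label you export with .global in your .s file:

    /* hypothetical name -- must match the .global label in the .s file: */
    extern void neon_convert_asm (uint8_t * dest, uint8_t * src, int n);

    /* usage: n is the pixel count, the routine processes 8 pixels per step */
    neon_convert_asm (dest, src, 256*256);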

Btw: Sorry for the ugly syntax-highlighting. I’m still looking for a nice wordpress plug-in.


37 Responses to ARM NEON Optimization. An Example

  1. Ranjith Parakkal says:

    Your post is very interesting. I was reading up about Neon intrinsics. And I came across your post. This information is very useful. BTW I have a few doubts/comments.

    1. Where are the src and dst pointers pointing to? Is it DDR, or some faster memory like the L3 memory in OMAP3? Can you achieve 2 cycles per pixel performance if the data is in DDR and the A8 accesses it through the D-cache? My guess was that, if the data is in DDR, D-cache misses would be a gating factor and would result in worse performance than 2 cycles per pixel. Can you please clarify?

    2. Can you try declaring the following variables outside the loop?

    uint16x8_t temp;
    uint8x8x3_t rgb = vld3_u8 (src);
    uint8x8_t result;

    I am wondering if declaring the variables inside the loop causes the compiler to do strange things like reading/writing from the stack every iteration. Only a guess, it isn’t necessarily right. Have you tried declaring them outside the loop?

    3. Have you tried any other compiler? TI and ARM have A8 compilers that support NEON intrinsics. I have heard the TI compiler is pretty good.

    Thanks and Regards

  2. admin says:

    Hi Ranjith,

    The two pointers point to DDR memory. Source is 192kb and dest 64kb in size (256*256 pixels each). No internal memory has been used, and the memory blocks are much larger than the cache. Also, as far as I know, internal or tightly coupled memory does not exist on the ARM-side of the OMAP3530.

    I’ll try to move the variable declaration out of the loop this evening. It shouldn’t make a difference, but well… you never know until you’ve tried. As for trying the TI-compiler: that’s a good idea.

  3. Ranjith Parakkal says:


    Thanks a LOT for your reply. I will wait for you to try moving the declaration outside the loop and share the results here.. 😉

    Just thinking out loud on how you have achieved 2 cycles per pixel.
    Since you are processing 8 pixels every iteration, on average the following code should take 16 cycles.

    vld3.8 {d0-d2}, [r1]! — ?? cycle

    # do the weight average:
    vmull.u8 q3, d0, d3 — one cycle
    vmlal.u8 q3, d1, d4 — one cycle
    vmlal.u8 q3, d2, d5 — one cycle

    # shift and store:
    vshrn.u16 d6, q3, #8 — one cycle
    vst1.8 {d6}, [r0]! — ?? cycle

    subs r2, r2, #1 — one cycle
    bne .loop — ?? cycle

    The vector arithmetic operations and the subs operation should account for 5 cycles out of the 16, assuming each of them is single-cycle when pipelined. So that leaves about 9 cycles for the branch and the vector load and store operations. I think this should sort of account for the cache misses. I don’t know much about ARM’s cache structure. Let me go read up about it.

    BTW there is an L3 memory on OMAP3, which is not closely coupled with A8, but may still give better performance than cached-DDR.

  4. Mike says:


    I need some help using the NEON instructions of the Cortex-A8 in my application.
    I am writing an image algorithm where I want to use the NEON instructions of the Cortex-A8.
    I have tested the program using RVDS, where I used init.s and init_cache.s taken from the RVDS examples. We also provided a scatter file for placement of stack and heap while linking in RVDS.
    I want to run my program on a Beagle Board running Linux.
    My questions are:
    1. Do we need to use init.s and init_cache.s in my program to run it on the Beagle board?
    2. Do we need a scatter file to run the program on the Beagle board? If yes, how do we provide a scatter file using GNU ld?
    3. When we removed ‘1’ and ‘2’, the linker gave the error “uses VFP registers”.

    Please help.


  5. Nils says:

    Hi Ranjith,

    It’s not that simple to count cycles in mixed ARM and NEON code. The NEON pipeline is very special. In a nutshell, the NEON unit sits logically behind the ARM unit and lags 5 cycles behind execution. It also has its own instruction queue and only executes one instruction per cycle (in practice).

    In the code example the following will happen: the instruction decoder decodes two instructions per cycle. The ARM-pipeline can execute two instructions per cycle. NEON instructions are treated as NOPs, so to say. So all the NEON instructions flow through the ARM pipeline and do nothing. After the 5 cycles they get stuffed into the NEON instruction queue.

    The NEON-unit will now start to execute the queued instructions. It can only do one instruction per cycle (not exactly true, there are some pipeline possibilities, but I doubt my code uses any). Anyway, in practice the speed is just half of the ARM-unit.

    This has the following effect on timing: during the first two or three iterations of the conversion function the ARM unit generates NEON instructions much faster than the NEON unit can execute them. The queue fills up fast. Once the queue is filled, the NEON-unit will always be able to execute while the ARM-unit is mostly waiting for a free queue entry. All other ARM instructions will execute in parallel with the NEON-unit, and the NEON-unit will never run out of work.

    So effectively all ARM instructions that are mixed between the NEON-code are free. The entire performance is dominated by just the NEON-timing.

    We end up with the six instructions executing within 16 cycles. I haven’t measured, but I guess that the load-instruction accounts for at least 50% of the time (due to cache misses) and the data-processing and store do the other half.

    It would be fun to run the code on zero wait-state memory. Unfortunately I don’t know of any such memory, even on the L3. Well – I could abuse the internal memory of the DSP, but I don’t know how fast access from the ARM side is.

  6. Nils says:

    Hi Mike.

    I’ve never worked with RVDS, but if you use init.s and init_cache.s you’re compiling your program for use without any OS, e.g. bare-metal or boot-loader style.

    Since you want to execute the code on the BeagleBoard using Linux you need an executable in ELF-format.

    Check the manual how to generate such an output-file. I’m sure it’s documented somewhere.

  7. Mike says:


    Thanks for your reply.
    I built my application without init.s, init_cache.s & scatter file using -mfpu=neon. It built successfully.
    When I ran my application on the beagleboard, it exited giving “Illegal Instruction”, which I suspect happens when we encounter the first NEON instruction. We are using beagleboard version B4 & kernel 2.6.22.

    Could you tell me how to solve this issue?

    Please help.


  8. Nils says:

    Hi Mike.

    Have you tried debugging the executable with gdb to find out which instruction triggers the “Illegal Instruction”-fault? That could give some clues about what is going wrong. Maybe there is still some code in your application that tries to directly access the hardware? Moves to the co-processor that configure the cache and the like will trigger an illegal instruction fault.

    You may get better answers on the beagleboard mailing list. I suggest you ask there. I’ve never worked with RVDS, and I don’t have any problems compiling my code using GCC on the PC or the beagleboard.

    Link to the beagleboard community:

    I’m using kernel 2.6.29 by the way… 2.6.22 is rather old. How come you’re still using such an old kernel?

  9. Venkat says:

    Hi Nils,

    How did you get the assembler output generated? Did you use any specific compiler flag to generate assembler output?
    I am using the cross-compiler arm-none-linux-gnueabi-gcc to compile my programs and test them on the OMAP Zoom 2 platform.

    I tried using the -S option but it does not generate any assembler output file.

    Thanks and Best Regards,

  10. Venkat says:

    Hi Nils,

    I have one more question. How did you build an executable using a hand-coded assembler file? Could you please let me know the steps?

    Thanks and Best Regards,

  11. Nils says:

    Hi Venkat,

    I generated the assembler output using the gnu disassembler.

       arm-none-linux-gnueabi-objdump -d yourfile.o 

    will do the trick!

    To compile a raw assembler file, do the same as you do with your .c files. E.g.

        arm-none-linux-gnueabi-gcc -c yourfile.s 

    will generate a yourfile.o object file. I’ve uploaded the file onto my webspace, so you can use it as a boilerplate: rgb_to_gray.s
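
    And to produce the final executable, just hand the object file to gcc together with the rest of your sources, e.g. something like:

        arm-none-linux-gnueabi-gcc main.c yourfile.o -o myprogram 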


  12. Chris says:

    Nils (and everyone),

    I’m an open source advocate at TI for OMAP. A couple of our engineers pointed me to this article. We are working with ARM and CodeSourcery to improve the quality of ARM compilers. We are pushing everything directly to gcc, and we hope to have some real progress by the end of 2010.

    We’ll keep you in the loop on the progress.


    Chris P

  13. Nils says:

    Hi Chris,

    That’s great news..

    I’ve been following the gcc-dev mailing list for two years now and have developed a feeling for how long/difficult some changes are and how gcc works on the inside. If you don’t mind I would like to contact you privately and give some additional insights on what goes wrong inside gcc when it comes to performant ARM code. There are a couple of low-hanging fruits that would give instant performance improvements for Cortex-A8 code.

    Also – since you’ve contacted me via the blog: don’t miss Mans’ blog on ARM-code. He is _very_ competent when it comes to ARM and NEON. I’m sure he would be glad to share some of his insights with you as well.


  14. Samuel says:

    The inefficient code generated by the vldX where X>1 intrinsics is a known problem in gcc:

  15. Saurabh says:


    I want to understand how the NEON pipeline works. I have read the Cortex-A8 technical reference manual but still want to understand instruction-level scheduling.


  16. Orthochronous says:

    Hi, just a speculative post under an old blog entry in case you happen to know the answer to this… :-)

    I’ve been poring over various RealView documents about NEON, trying to understand the similarities and differences from Intel SSE-x. Am I right in saying that there’s no equivalent to SSE’s _mm_movemask_epi8 (which basically takes a SIMD vector and generates a standard integer bitmask based upon a predicate (in this case hardwired to be “top-bit set”) applied to each vector element)? There doesn’t seem to be but maybe I’m missing one hidden away.

    This is probably an instruction that’s not that important for multimedia generation, but I’m interested in image analysis where one often wants to see if a “thing” is in some “computationally defined” set (eg, is an RGB pixel within some box in RGB space) and use SIMD parallelisation. Without this kind of instruction, you can do the testing in parallel but then you’ve got to build a mask of results by manually extracting each vector element to a scalar and build the bitmap, so it seems an odd thing to leave out of an instruction set design.


  17. Nils says:

    Hi Orthochronous,

    An instruction like this seems to be missing. The best way to simulate it looks roughly like this:

    @ d1 = input
    @ d2 = mask (128, 64, 32, 16, 8, 4, 2, 1)

    vand d1, d1, d2
    vpadd.i8 d1, d1, d1
    vpadd.i8 d1, d1, d1
    vpadd.i8 d1, d1, d1

    That fills each byte of d1 with the sum of all elements of d1. Since we’ve masked out the bits before addition no overflow can happen and we end up with a mask similar to movemask. The result will be stored in each byte of d1, but you can just extract the last byte later.

    The same trick can be extended to 128 bits as well.
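
    If you prefer to stay in C, the same trick written with intrinsics looks roughly like this (a sketch, I haven’t benchmarked it, but the mapping to the instructions above is one-to-one):

    static const uint8_t bits[8] = { 128, 64, 32, 16, 8, 4, 2, 1 };

    uint8_t movemask_u8 (uint8x8_t input)  /* input: 0xFF or 0x00 per lane */
    {
      uint8x8_t t = vand_u8 (input, vld1_u8 (bits));  /* vand     */
      t = vpadd_u8 (t, t);                            /* vpadd.i8 */
      t = vpadd_u8 (t, t);
      t = vpadd_u8 (t, t);
      return vget_lane_u8 (t, 0);  /* every lane now holds the mask */
    }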

  18. Orthochronous says:


    Thanks for the insight. I’d never have thought of this approach, so it’s very useful.

  19. eDave says:

    I have some performance overhead when I add the NEON intrinsics to my array operation code.

    The code fragment is shown below.

    1. C Code Fragment

    kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
    *Original++ = (byte) CLIPS(255, kvalue);
    kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
    *Original++ = (byte) CLIPS(255, kvalue);
    kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
    *Original++ = (byte) CLIPS(255, kvalue);
    kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
    *Original++ = (byte) CLIPS(255, kvalue);

    2. Neon Specific Code

    /* The Predicted array is unsigned char, so a type cast is done by assigning it to the integer array AryPred */
    AryPred[0] = *Predicted++; AryPred[1] = *Predicted++; AryPred[2] = *Predicted++; AryPred[3] = *Predicted++;
    neonResidue = vld1q_s32(Residue);
    /* Code for SHIFTR */
    neonResidue = vaddq_s32(neonResidue, addconst);
    neonResidue = vshrq_n_s32(neonResidue, 6);

    neonPredict = vld1q_s32(AryPred);
    addedResult = vaddq_s32(neonResidue, neonPredict);
    *Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 0));
    *Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 1));
    *Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 2));
    *Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 3));
    Residue += 4;

    #define CLIPS(iF, iS) (iS > 0 ? (iS < iF ? iS : iF) : 0)

    The data types of the arrays above are:

    Residue -> int
    Predicted -> unsigned char
    Original -> unsigned char

    The following data types are used to define the above array operations with Neon Operations

    AryPred -> int
    addedResult, neonPredict, neonResidue ->int32x4_t

    What is the reason for the overhead when I repeat the code 4 times? Is there any memory alignment issue, or do any extra pipeline stalls come in here?
    Is there any performance gain if we change the data type of each array to the same type (neglecting the chance of overflow)?


  20. Nice post,

    I’ve also made the same observations when using the compiler vs hand-optimized neon assembler. C functions that contain inline asm often produce very unpredictable and unwanted results when ‘optimized’ with both -O2 and -O3. Even -Os still suffers slightly in performance compared to a hand-optimized assembly routine. In short… we’ll get there, just not yet 😛

  21. shawn says:

    Would you please tell me how to measure the timing of the code in a Linux environment?
    It seems that you execute the code on the BeagleBoard using Linux through an executable in ELF-format.
    How do you make sure the timing is not affected by other tasks in Linux?


  22. jani says:

    How did you measure or estimate the number of cycles each of the variants takes without looking at the assembly output?

  23. srini says:

    I am very much new to Cortex-A9 and NEON coding.
    I have written NEON code for a C function and executed it on the beagle board; the performance of the C version is better than the NEON code.

    What could be the problem?
    Please help me in this regard.
    My compiler options are:
    -O3 -march=armv7-a -ftree-vectorize -mvectorize-with-neon-quad -fprefetch-loop-arrays -mfloat-abi=softfp -mfpu=neon -funroll-loops

    Thank you

  24. Just by removing pipeline bubbles, your code can run 2 times faster.

  25. Michael says:

    Hi Mike,
    I can’t enable NEON in VS2008 with Compact 7, do you know how to do this?
    Also, I can’t compile your sample NEON code successfully in VS2008 with Compact 7.
    Do you know how to enable the NEON functionality?

  26. srinivas says:

    I’m new to assembly programming and I tried to understand by converting each intrinsic into ARM assembly. I cannot find any corresponding assembly instructions for “src += 8*3; dest += 8;” in rgb_to_gray.s (although the assembly file works as expected when executed)

    Can someone kindly throw some light on it.

  27. Nils says:

    The vld and vst instructions increment the pointers as a side-effect. That’s what the ‘!’ char does in the instruction.

  28. srinivas says:

    Thanks Nils :)

  29. Ankit says:

    Hi everybody,

    I am new in the field of playing with assembly language and thus challenging the optimization of professional NEON compilers. Here are my questions:
    1) Suppose I have a C code (e.g. fft.c), then how can I view its NEON-optimized assembly code using the RealView Development System? A command line syntax would also work, but for Windows only.
    2) Is there any general guideline for converting C code into its corresponding NEON assembly counterpart?

    Pardon me for my unprofessional way of asking questions, but it would be very gracious of all of you people if anyone can answer my queries ASAP.

    Thank You

  30. Tom says:

    Hi all!

    I have a question on the timings discussed here:
    I took all implementations from here and from “pulsar” (incl. the “16 pixels at a time” version)
    and compared the runtime on a 600 MHz ARM with NEON and 256k L2 cache.
    If I ensured that none of the image data had been used before (i.e. it was not cached), I indeed got the same order of execution times
    for the different implementations, however the gain was significantly lower than the values discussed here.
    For instance, the “pulsar 16 pixel” implementation (compare comment 24 on this page) was only about 15% faster than the “fastest” implementation on this page
    (whereas Etienne claims it to be twice as fast).

    Using a small image size and repeating the calculation (so the L2 cache should work perfectly),
    the differences in execution time match the expectations mentioned here much better.

    Am I doing anything wrong?

  31. sophana says:

    As said earlier, the bug was filed in gcc bugzilla.
    I tested 4.6.1; the generated assembly was much better, but still suboptimal.
    Latest 4.6 snapshot: same result.

    But the latest 4.7 snapshot generates optimal code.

    Tested with Android on a Samsung Galaxy S2: the C NEON code is 4 times faster than the C reference.

  32. Pingback: Mac OSX and Assembly Programming | Alauda Projects

  33. Pingback: Maximizing performance grayscale color conversion using NEON and cv::parallel_for « Computer Vision Talks

  34. nguns says:

    Hi Nils,

    It’s great reading your blogs. I’ve got a typical problem here. I thought you would be able to answer it better, I’d be more than happy even if you could have a look at this and give me some pointers. Here’s my problem on SO:


  35. Pingback: From zero to cross-compiling for pandaboard | Stan Manilov

  36. Wu Xiaoyong says:


    This post is really helpful. I am a newbie with NEON intrinsics, and right now I have to optimize my code, such as the following pseudo-code:

    if (a > b + c) {
        a = b + c;
    } else if (a < b - c) {
        a = b - c;
    }

    How can I transform this to NEON intrinsics? It seems that we cannot do an 8-way parallel operation in such a case, can we?

  37. sophana says:


    I’d just like to say that I tried compiling the example with clang-3.0, and the generated assembly looks optimal:

    cmp r2, #8
    bxlt lr
    asr r3, r2, #31
    vmov.i8 d16, #0x97
    vmov.i8 d17, #0x4D
    add r2, r2, r3, lsr #29
    vmov.i8 d18, #0x1C
    mov r3, #0
    asr r2, r2, #3
    .LBB0_1:
    vld3.8 {d20, d21, d22}, [r1]!
    add r3, r3, #1
    cmp r3, r2
    vmull.u8 q12, d21, d16
    vmlal.u8 q12, d20, d17
    vmlal.u8 q12, d22, d18
    vshrn.i16 d19, q12, #8
    vst1.8 {d19}, [r0]!
    blt .LBB0_1
    bx lr
