Since there is so little information about NEON optimizations out there I thought I’d write a little about it.
Some weeks ago someone on the beagle-board mailing-list asked how to optimize a color to grayscale conversion for images. I haven’t done much pixel processing with ARM NEON yet, so I gave if a try. The results I got where quite spectacular, but more on this later.
For the color to grayscale conversion I used a very simple conversion scheme: A weighted average of the red, green and blue components. This conversion ignores the effect of gamma but works good enough in practice. Also I decided not to do proper rounding. It’s just an example after all.
First a reference implementation in C:
void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n) { int i; for (i=0; i<n; i++) { int r = *src++; // load red int g = *src++; // load green int b = *src++; // load blue // build weighted average: int y = (r*77)+(g*151)+(b*28); // undo the scale by 256 and write to memory: *dest++ = (y>>8); } }
Optimization with NEON Intrinsics
Lets start optimizing the code using the compiler intrinsics. Intrinsics are nice to use because you they behave just like C-functions but compile to a single assembler statement. At least in theory as I’ll show you later..
Since NEON works in 64 or 128 bit registers it’s best to process eight pixels in parallel. That way we can exploit the parallel nature of the SIMD-unit. Here is what I came up with:
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n) { int i; uint8x8_t rfac = vdup_n_u8 (77); uint8x8_t gfac = vdup_n_u8 (151); uint8x8_t bfac = vdup_n_u8 (28); n/=8; for (i=0; i<n; i++) { uint16x8_t temp; uint8x8x3_t rgb = vld3_u8 (src); uint8x8_t result; temp = vmull_u8 (rgb.val[0], rfac); temp = vmlal_u8 (temp,rgb.val[1], gfac); temp = vmlal_u8 (temp,rgb.val[2], bfac); result = vshrn_n_u16 (temp, 8); vst1_u8 (dest, result); src += 8*3; dest += 8; } }
Lets take a look at it step by step:
First off I load my weight factors into three NEON registers. The vdup.8 instruction does this and also replicates the byte into all 8 bytes of the NEON register.
uint8x8_t rfac = vdup_n_u8 (77); uint8x8_t gfac = vdup_n_u8 (151); uint8x8_t bfac = vdup_n_u8 (28);
Now I load 8 pixels at once into three registers.
uint8x8x3_t rgb = vld3_u8 (src);
The vld3.8 instruction is a specialty of the NEON instruction set. With NEON you can not only do loads and stores of multiple registers at once, you can de-interleave the data on the fly as well. Since I expect my pixel data to be interleaved the vld3.8 instruction is a perfect fit for a tight loop.
After the load, I have all the red components of 8 pixels in the first loaded register. The green components end up in the second and blue in the third.
Now calculate the weighted average:
temp = vmull_u8 (rgb.val[0], rfac); temp = vmlal_u8 (temp,rgb.val[1], gfac); temp = vmlal_u8 (temp,rgb.val[2], bfac);
vmull.u8 multiplies each byte of the first argument with each corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair.
vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result.
So we end up with just three instructions for weighted average of eight pixels. Nice.
Now it’s time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits. This equals to a division by 256. ARM NEON has lots of instructions to do the shift, but also a “narrow” variant exists. This one does two things at once: It does the shift and afterwards converts the 16 bit integers back to 8 bit by removing all the high-bytes from the result. We get back from the 128 bit register pair to a single 64 bit register.
result = vshrn_n_u16 (temp, 8);
And finally store the result.
vst1_u8 (dest, result);
First Results:
How does the reference C-function and the NEON optimized version compare? I did a test on my Omap3 CortexA8 CPU on the beagle-board and got the following timings:
C-version: 15.1 cycles per pixel. NEON-version: 9.9 cycles per pixel.
That’s only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all. What’s going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:
160: f46a040f vld3.8 {d16-d18}, [sl]
164: e1a0c005 mov ip, r5
168: ecc80b06 vstmia r8, {d16-d18}
16c: e1a04007 mov r4, r7
170: e2866001 add r6, r6, #1 ; 0x1
174: e28aa018 add sl, sl, #24 ; 0x18
178: e8bc000f ldm ip!, {r0, r1, r2, r3}
17c: e15b0006 cmp fp, r6
180: e1a08005 mov r8, r5
184: e8a4000f stmia r4!, {r0, r1, r2, r3}
188: eddd0b06 vldr d16, [sp, #24]
18c: e89c0003 ldm ip, {r0, r1}
190: eddd2b08 vldr d18, [sp, #32]
194: f3c00ca6 vmull.u8 q8, d16, d22
198: f3c208a5 vmlal.u8 q8, d18, d21
19c: e8840003 stm r4, {r0, r1}
1a0: eddd3b0a vldr d19, [sp, #40]
1a4: f3c308a4 vmlal.u8 q8, d19, d20
1a8: f2c80830 vshrn.i16 d16, q8, #8
1ac: f449070f vst1.8 {d16}, [r9]
1b0: e2899008 add r9, r9, #8 ; 0x8
1b4: caffffe9 bgt 160
Note the store at offset 168? The compiler decides to write the three registers onto the stack. After a bit of useless memory accesses from the GPP side the compiler reloads them (offset 188, 190 and 1a0) in exactly the same physical NEON register.
What all the ordinary integer instructions do? I have no idea. Lots of memory accesses target the stack for no good reason. There is definitely no shortage of registers anywhere. For reference: I used the GCC 4.3.3 (CodeSourcery 2009q1 lite) compiler .
NEON and assembler
Since the compiler can’t generate good code I wrote the same loop in assembler. In a nutshell I just took the intrinsic based loop and converted the instructions one by one. The loop-control is a bit different, but that’s all.
convert_asm_neon: # r0: Ptr to destination data # r1: Ptr to source data # r2: Iteration count: push {r4-r5,lr} lsr r2, r2, #3 # build the three constants: mov r3, #77 mov r4, #151 mov r5, #28 vdup.8 d3, r3 vdup.8 d4, r4 vdup.8 d5, r5 .loop: # load 8 pixels: vld3.8 {d0-d2}, [r1]! # do the weight average: vmull.u8 q3, d0, d3 vmlal.u8 q3, d1, d4 vmlal.u8 q3, d2, d5 # shift and store: vshrn.u16 d6, q3, #8 vst1.8 {d6}, [r0]! subs r2, r2, #1 bne .loop pop { r4-r5, pc }
Final Results:
Time for some benchmarking again. How does the hand-written assembler version compares? Well – here are the results:
C-version: 15.1 cycles per pixel. NEON-version: 9.9 cycles per pixel. Assembler: 2.0 cycles per pixel.
That’s roughly a factor of five over the intrinsic version and 7.5 times faster than my not-so-bad C implementation. And keep in mind: I didn’t even optimized the assembler loop.
My conclusion: If you want performance out of your NEON unit stay away from the intrinsics. They are nice as a prototyping tool. Use them to get your algorithm working and then rewrite the NEON-parts of it in assembler.
Btw: Sorry for the ugly syntax-highlighting. I’m still looking for a nice wordpress plug-in.
Your post is very interesting. I was reading up about Neon intrinsics. And I came across your post. This information is very useful. BTW I have a few doubts/comments.
1. Where are the src and dst pointers pointing to ? Is it DDR ? or some faster memory like L3 memory in OMAP3 ? Can you achieve 2 cycles per pixel performance if data is in DDR, and A8 accesses it through Dcache. My guess was that, if data is in DDR, Dcache misses would be a gating factor and will result in worse performance, than 2 cycles per pixel. Can you please clarify.
2. Can you try declaring the following variables outside the loop.
uint16x8_t temp;
uint8x8x3_t rgb = vld3_u8 (src);
uint8x8_t result;
I am wondering if declaring the variables inside the loop will cause the compliler to do strange things like reading/writing from stack every iteration. Only a guess. Isnt neccesarily right. Have you tried decaring them outside the loop ?
3. Have you tried any other compiler. TI and ARM has A8 compilers that support Neon intrinsics. I have heard the TI compiler is pretty good.
Thanks and Regards
Ranjith
Comment by Ranjith Parakkal — January 11, 2010 @ 12:45 pm
Hi Ranjith,
The two pointers point to DDR memory. Source is 192kb and dest 64kb in size (256*256 pixels each). No internal has been used, and the memory blocks are much larger than the cache. Also as far as I know internal or tightly coupled memory does not exist on the ARM-side of the OMAP3530.
I’ll try to move the variable declaration out of the loop this evening. It shouldn’t make a difference, but well… You’ll never know until you’ve tried. For trying the TI-compiler: That’s a good idea.
Comment by admin — January 11, 2010 @ 7:28 pm
Nils,
Thanks a LOT for u reply. I will wait for you to try moving the declaration outside the loop and declare the results here ..
Just thinking out loud on how you have achieved 2 cycles per pixel.
Since you are processing 8 cycles per pixel every iteration. On an average the following code should take 16 cycles.
.loop
vld3.8 {d0-d2}, [r1]! — ?? cycle
# do the weight average:
vmull.u8 q3, d0, d3 — one cycle
vmlal.u8 q3, d1, d4 — one cycle
vmlal.u8 q3, d2, d5 — one cycle
# shift and store:
vshrn.u16 d6, q3, #8 — one cycle
vst1.8 {d6}, [r0]! — ?? cycle
subs r2, r2, #1 — one cycle
bne .loop — ?? cycle
The vector arithmetic operations and the subs operation should account for 5 cycles out of the 16, assuming each of them are single cycle when they are pipelined. So that leaves about 9 cycles for the branch and the vector load and the store operations. I think this should sort of account for the cache misses. I dont know much about ARMs cache structure. Lemme try go read up about it.
BTW there is an L3 memory on OMAP3, which is not closely coupled with A8, but may still give better performance than cached-DDR.
Comment by Ranjith Parakkal — January 12, 2010 @ 10:18 am
Hi,
I need some help for using the Neon instructions of Cortex-A8 in my application.
I am writing a image algorithm where I want to use NEON instructions of Cortex A8.
I have tested the program using RVDS, where I have used init.s and init_cache.s taken from Examples of RVDS. We had also given one scatter file for placement of stack and heap while linking in RVDS.
I want to run my program on Beagle Board running Linux.
My questions are:
1. Do we need to use init.s and init_cache.s in my program to run it on Beagle board?
2. Do we need scatter file to run the program on Beagle board? If yes, how to give scatter file using gnu ld.
3. When we removed ’1′ and ’2′, the linker gives error “uses VFP registers”.
Please help.
Mike
Comment by Mike — January 12, 2010 @ 12:53 pm
Hi Ranjith,
It’s not that simple to count cycles in mixed ARM and NEON code. The NEON-Pipeline is very special. In a nutshell the NEON unit sits logically behind the ARM unit and lags 5 cycles behind execution. It also has it’s own instruction queue and only executes one instruction per cycle (in practice).
In the code example the following will happen: The instruction decoder decodes two instructions per cycle. The ARM-pipeline can execute two instructions per cycle. NEON instructions are treated as a NOPs so to say. So all the NEON instructions flow through the ARM pipeline and do nothing. After the 5 cycles they get stuffed into the NEON instruction queue.
The NEON-unit will now start to execute the queued instructions. It can only do one instruction per cycle (not exactly true, there are some pipeline possibilities, but I doubt my code uses any). Anyway, in practice the speed is just half of the ARM-unit.
This has the following effect on timing: During the first two or three iterations of the conversion function the ARM unit generates NEON instructions much faster than the NEON unit can execute them. The queue gets filled up fast. Once the queue is filled, the NEON-unit will always able to execute while the ARM-unit is mostly waiting for a free queue entry. All other ARM instruction will execute in parallel with the NEON-unit, and the NEON-unit will never run out of work.
So effectively all ARM instructions that are mixed between the NEON-code are free. The entire performance is dominated by just the NEON-timing.
We end up with the six instructions executing within 16 cycles. I haven’t measured, but I guess that the the load-instruction accounts for at least 50% of the time (due to cache misses) and the data-processing and store does the other half.
It would be fun to run the code on zero wait-state memory. Unfortunately I don’t know about any of such memory, even on the L3. Well – I could abuse the internal memory of the DSP but I don’t know how fast access from the ARM is.
Comment by Nils — January 12, 2010 @ 4:14 pm
Hi Mike.
I’ve never worked with RVDS, but if you use init.s and init_cache.s it you’re compiling your program for use without any OS. E.g. bare-bone or boot-loader style.
Since you want to execute the code on the BeagleBoard using Linux you need an executable in ELF-format.
Check the manual how to generate such an output-file. I’m sure it’s documented somewhere.
Comment by Nils — January 12, 2010 @ 4:20 pm
Hi,
Thanks for your reply.
I build my application without init.s, init_cache.s & scatter file using -mfu=neon. It was build successfully.
When I ran my application on beagleboard, it exited giving “Illegal Instruction”, which I suspect when we encounter first neon instruction.We are using beagleboard version B4 & Kernal 2.6.22.18.
Could you tell me how to solve this issue?
Please help.
Mike
Comment by Mike — January 13, 2010 @ 6:44 am
Hi Mike.
Have you tried debugging the executable with gdb to find out which instruction triggers the “Ilegal Instrucion”-fault? That could give some clues what is going wrong. Maybe there is still some code in your application that tries to directly access the hardware? Moves to the co-processor that configure cache and the like will trigger an illegal instruction fault.
You may get better answers on the beaglboard mailing list. I suggest you try to ask there. I’ve never worked with RVDS, and I don’t have any problems compiling my code using GCC on the PC or the beagleboard.
Link to the beagleboard communiy: http://groups.google.com/group/beagleboard
I’m using kernel 2.6.29 by the way… 2.6.22 is rather old. How does it come that you’re still using such an old kernel?
Comment by Nils — January 13, 2010 @ 5:49 pm
Hi Nils,
How did you get the assembler output generated? Did you use any specific compiler flag to generate assembler output.
I am using Cross Compiler arm-none-linux-gnueabi-gcc to compile my programs and test them on OMAP Zoom 2 platform.
I tried using -S option but it does not generate any assembler output file.
Thanks and Best Regards,
Venkat
Comment by Venkat — January 21, 2010 @ 2:42 pm
Hi Nils,
I have one more question. How did you build an executable using hand coded assembler file? Could you please let me know the steps?
Thanks and Best Regards,
Venkat
Comment by Venkat — January 21, 2010 @ 5:04 pm
Hi Venkat,
I generated the assembler output using the gnu disassembler.
will do the trick!
To compile a raw assembler file do the same what you do with your .c files. E.g.
arm-none-linux-gnueabi-gcc -c yourfile.swill generate a yourfile.o object file. I've uploaded the file onto my webspace, so you can use it as a boilerplate: rgb_to_gray.s
Cheers,
Nils
Comment by Nils — January 21, 2010 @ 10:03 pm
Nils (and everyone),
I’m an open source advocate at TI for OMAP. A couple of our engineers pointed me to this article. We are working with ARM and CodeSourcery to improve the quality of ARM compilers. We are pushing everything directly to gcc, and we hope to have some real progress by the end of 2010.
We’ll keep you in the loop on the progress.
Cheers,
Chris P
Comment by Chris — March 8, 2010 @ 3:00 pm
Hi Chris,
That’s great news..
I follow the gcc-dev mailing list for two years now and I developed a feeling how long/difficult some changes are and how gcc works on the inside. If you don’t mind I would like to contact you privately and give some additional insights on what goes wrong inside gcc when it comes to performant ARM code. There are a couple of low hanging fruits that would give instant performance improvements for Cortex-A8 code.
Also – since you’ve contacted me via the blog: Don’t miss Mans blog on ARM-code: http://www.hardwarebug.org He is _very_ competent when it comes to ARM and NEON. I’m sure he would be glad to share some of his insights with you as well.
Cheers,
Nils
Comment by Nils — March 8, 2010 @ 4:50 pm
The inefficient code generated by the vldX where X>1 intrinsics is a known problem in gcc:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43118
Comment by Samuel — March 15, 2010 @ 1:13 pm
Hi,
I want to understand about how neon pipeline works. I have read cortex-a8 technical referance manual but still want to understand instruction level sceduling.
Thanks
Saurabh
Comment by Saurabh — March 26, 2010 @ 6:21 am
Hi, just a speculative post under an old blog entry in case you happen to know the answer to this…
I’ve been poring over various RealView documents about NEON, trying to understand the similarities and differences from Intel SSE-x. Am I right in saying that there’s no equivalent to SSE’s _mm_movemask_epi8 (which basically takes a SIMD vector and generates a standard integer bitmask based upon a predicate (in this case hardwired to be “top-bit set”) applied to each vector element)? There doesn’t seem to be but maybe I’m missing one hidden away.
This is probably an instruction that’s not that important for multimedia generation, but I’m interested in image analysis where one often wants to see if a “thing” is in some “computationally defined” set (eg, is an RGB pixel within some box in RGB space) and use SIMD parallelisation. Without this kind of instruction, you can do the testing in parallel but then you’ve got to build a mask of results by manually extracting each vector element to a scalar and build the bitmap, so it seems an odd thing to leave out of an instruction set design.
Cheers,
Orthochronous
Comment by Orthochronous — March 27, 2010 @ 6:01 pm
Hi Orthochronous,
An instruction like this seems to be missing. The best thing to simulate it looks roughly like this:
: d1 = input
: d2 = mask (128, 64, 32, 16, 8, 4, 2, 1)
vand d1, d1, d2
vpadd.i8 d1, d1, d1
vpadd.i8 d1, d1, d1
vpadd.i8 d1, d1, d1
That fills each byte of d1 with the sum of all elements of d1. Since we’ve masked out the bits before addition no overflow can happen and we end up with a mask similar to movemask. The result will be stored in each byte of d1, but you can just extract the last byte later.
The same trick can be extended to 128 bits as well.
Comment by Nils — March 27, 2010 @ 10:22 pm
Hi,
Thanks for the insight. I’d never have thought of this approach, so it’s very useful.
Comment by Orthochronous — March 29, 2010 @ 6:10 am
I have some performence overhead while i add the Neon Intrinsics with my Array Operation Code
The Code fragment shown below
1. C Code Fragment
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
kvalue = SHIFTR(*Residue++, 6) + *Predicted++;
*Original++ = (byte) CLIPS(255, kvalue);
2. Neon Specific Code
/* Predicted Array is unsigned char type so a type cast activity done by assign it to Integer AryPred*/
AryPred[0] = *Predicted++; AryPred[1] = *Predicted++; AryPred[2] = *Predicted++; AryPred[3] = *Predicted++;
neonResidue = vld1q_s32(Residue);
/* Code for SHIFTR */
neonResidue = vaddq_s32(neonResidue, addconst);
neonResidue = vshrq_n_s32(neonResidue, 6);
neonPredict = vld1q_s32(AryPred);
addedResult = vaddq_s32(neonResidue, neonPredict);
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 0));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 1));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 2));
*Original++ = (byte) CLIPS(255, vgetq_lane_s32(addedResult, 3));
Residue += 4;
#define CLIPS(iF, iS)(iS > 0 ? (iS < iF ? iS : iF): (0 int
Residue-> int
Predicted->unsigned char
Original->unsigned char
The following data types are used to define the above array operations with Neon Operations
AryPred -> int
addedResult, neonPredict, neonResidue ->int32x4_t
What is the reason for the overhead when i repeat the code 4 times? Is there any memory alignment issue or any extra pipeline stalls comes here ?
Is there any performence gain while we change the data type of each array in to same (Neglect the Chance of Overflow)?
Rgds
eDave
Comment by eDave — May 6, 2010 @ 12:33 pm
Nice post,
I’ve also made the same observations when using the compiler vs hand-optimized neon assembler. C functions that contain inline asm often produce very unpredictable and unwanted results when ‘optimized’ with both -O2 and -O3. Even -Os still suffers slightly in performance compared to a hand-optimized assembly routine. In short… we’ll get there, just not yet
Comment by Christopher Friedt — June 6, 2010 @ 10:51 am
Nils,
would you please tell me how to measure the timing of the code in a Linux environment?
it seems that you execute the code on the BeagleBoard using Linux through an executable in ELF-format.
How to make sure the timeing is not occupied by other tasks in Linux?
Thanks,
Shawn
Comment by shawn — October 20, 2010 @ 10:13 am
Hello,
how did you measure or estimate the number of cycles each of the variant takes without looking at the assembly output?
Comment by jani — December 21, 2010 @ 12:10 pm
HI,
iam very much new to cortex a9 and neon codging,
i have written neon code for a C fucntion and exectute on beagle board performance of C version is better than neon code.
what could be the problem?
please help me in this regard
my compiler options are
-O3 -march=armv7-a -ftree-vectorize -mvectorize-with-neon-quad -fprefetch-loop-arrays -mfloat-abi=softfp -mfpu=neon -funroll-loops
Thank you
Comment by srini — March 30, 2011 @ 2:11 pm
Just by removing pipeline bubbles, your code can run 2 times faster.
http://pulsar.webshaker.net/ccc/result.php?lng=fr&sample=3
Comment by Etienne SOBOLE — May 14, 2011 @ 7:37 pm
hi, mike
I can’t enable NEON in vs2008 with Compact 7, do you know how to do this?
Also, I can’t compile your sample NEON code successfully in vs2008 with Compact 7,
Do you know how to enable NEON function?
Comment by Michael — May 25, 2011 @ 11:41 am
Hi,
I’m new to assembly programming and I tried to understand by converting each intrinsic into ARM assembly. I cannot find any corresponding assembly instructions for “src += 8*3; dest += 8;” in rgb_to_gray.s (although the assembly file works as expected when executed)
Can someone kindly throw some light on it.
Comment by srinivas — June 4, 2011 @ 9:23 pm
The vld and vst instructions increment the pointers as a side-effect. That’s what the ‘!’ char does in the instruction.
Comment by Nils — June 5, 2011 @ 11:19 am
Thanks Nils
Comment by srinivas — June 6, 2011 @ 1:32 pm
Hi everybody,
I am new in the field of playing with assembly language and thus challenging the optimization of professional Neon Compilers. Here are my questions?
1)Suppose if I have a C Code (e.g. fft.c), then how can I view its Neon optimized assembly code using Real View Development System? A command line syntax can also work, but for windows only.
2)Is there any general guideline for converting C codes into their corresponding Neon assembly counterparts?
Pardon me for my unprofessional way of asking questions, but it would be very gracious of all of you people if anyone can answer my queries ASAP.
Thank You
Comment by Ankit — July 1, 2011 @ 11:50 am
Hi all!
I have a question on the timings discussed here:
I took all implementations from here and from “pulsar” (incl. the “16 pixels at a time” version)
and compared the runtime on a 600Mhz ARM with NEON and 256k L2 cache.
If I assured, none of the image data has been used before (e.g. is not cached), I indeed got the same order of execution times
for the different implementations, however the gain was significantly lower than the values discussed here.
It was something about the “pulsar 16 pixel”-implementation (compare comment 24 on this page) being 15% faster than the “fastest” implementation on this page, for instance.
(whereas Etienne claims it to be twice as fast)
Using a small image size and repeating the calculation (=L2 cache should work perfectly),
the differences in the execution time compare better to the expectation mentioned here.
Am I doing anything wrong?
Comment by Tom — August 26, 2011 @ 5:44 pm
As said earlier, the bug was filed in gcc bugzilla.
I tested 4.6.1, the generate assembly was much better but still suboptimal.
latest 4.6 snapshot: same result.
but latest 4.7 snapshot generates optimal code.
tested with android on a samsung galaxy s2: the c neon code is 4 times better than the C reference
Comment by sophana — October 5, 2011 @ 11:04 am
[...] An example using NEON for optimization at Hilbert-Space.de, [...]
Pingback by Mac OSX and Assembly Programming | Alauda Projects — April 30, 2012 @ 11:21 am
[...] ARM NEON Optimization. An Example [...]
Pingback by Maximizing performance grayscale color conversion using NEON and cv::parallel_for « Computer Vision Talks — November 6, 2012 @ 9:04 pm
Hi Nils,
It’s great reading your blogs. I’ve got a typical problem here. I thought you would be able to answer it better, I’d be more than happy even if you could have a look at this and give me some pointers. Here’s my problem on SO: http://stackoverflow.com/questions/14869693/neon-vld-consuming-more-cycles-than-what-is-expected
Regards,
nguns
Comment by nguns — February 19, 2013 @ 9:05 am
When you make a comparison between board shoes and common shoes, the former one’s flat soles are different from the latter one. It is easy for feet to be able to completely stick to the flat bottom of the skateboard. It can absorb the shock as well. They very possess the garments, no unified uniform, the coloration of the clothing also is distinct, fights, typically in [url=http://www.jeremyscottwingsshop.us/]Adidas Jeremy Scott Wings[/url] opposition to it, frequently fire oneself individual, adverse operations. The fifteen th century, Western Europe there the mercenaries, just appear the uniform. Louis xiv time time period, the French to formally proven of unified uniform, and distinct uniform color varied: the prefects wear white [url=http://www.jeremyscottsale.us/]Jeremy Scott Shoes[/url] uniform, cavalry put on crimson uniforms, infantry putting on grey uniforms. Selling tickets to the [url=http://www.jeremyscottonline.us/]Jeremy Scott Leopard[/url] game is the easiest way to generate revenue for the team. Franchises in most major leagues charge upwards of $200 for a ticket depending on the sport. Even the cheap seats go for $50 in most stadiums. 2. Limits on occupancy. Your agreement should [url=http://www.jswingsonlineshop.us/]Jeremy Scott Wings[/url] clearly specify that the rental unit is the residence of only the tenants who have signed the lease and their minor children. You can find a number of large hammocks that are capable of accommodating more than one person. Many people even list these hammocks on their wish lists for birthdays, anniversaries and wedding presents. Nothing is more relaxing and romantic than spending a special moment in the summer sun with someone you love. Cambodge. Cameroun. Cap-vert. Later in 1971, Philip Knightrealized the importance of design ideas and for this he approachedDavidson who created the logo which is globally known as Swoosh. Thiswas first used in the running shoes at the US Track Field OlympicTrials (Oregon – Eugene). Nike was first introduced as afootball shoe in 1971. The big announcement Tuesday was that this will be a balanced budget – a forecast that in the past would have been a guaranteed vote-getter. No more. Given recent history, with wild, 10-figure swings between what was promised and what was delivered, we have more confidence in Lance Armstrong’s autobiographies than we do in any black-ink projections.
Comment by joefel parroco — April 10, 2013 @ 5:57 am
It’s awesome designed for me to have a site, which is beneficial designed for my knowledge. thanks admin
Comment by Phoebe — April 20, 2013 @ 6:34 pm
Yesterday, while I was at work, my sister stole my iPad and tested to see if it can survive
a forty foot drop, just so she can be a youtube sensation.
My apple ipad is now destroyed and she has 83 views.
I know this is completely off topic but I had to share it with someone!
Comment by best seo consultants — April 23, 2013 @ 10:14 am
Fantastic goods from you, man. I have understand your stuff previous to and you are just extremely fantastic.
Comment by Jones sabo inhale created by hearth — May 3, 2013 @ 2:25 pm