<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>hilbert-space</title>
	<atom:link href="http://hilbert-space.de/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://hilbert-space.de</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Sun, 07 Feb 2010 17:40:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>DSP default cache-sizes not optimal?</title>
		<link>http://hilbert-space.de/?p=77</link>
		<comments>http://hilbert-space.de/?p=77#comments</comments>
		<pubDate>Fri, 05 Feb 2010 18:33:56 +0000</pubDate>
		<dc:creator>Nils</dc:creator>
				<category><![CDATA[Beagleboard]]></category>
		<category><![CDATA[DSP]]></category>
		<category><![CDATA[OMAP3530]]></category>

		<guid isPermaLink="false">http://hilbert-space.de/?p=77</guid>
		<description><![CDATA[While debugging some DSP code yesterday I came across a performance oddity. Adding more code lowered the performance of an unrelated function. By itself this is not *that* odd. It happens if the size of your code is larger than your first level code-cache and different functions start to kick each other out of [...]]]></description>
			<content:encoded><![CDATA[<p>While debugging some DSP code yesterday I came across a performance oddity: adding more code lowered the performance of an unrelated function.</p>
<p>By itself this is not *that* odd. It happens when your code is larger than the first-level code cache and different functions start to kick each other out of the cache. In my little toy program, however, this was unlikely: I had only around 20 KB of code, and the code cache is 32 KB in size.</p>
<p>Better safe than sorry, I thought, and took a look at how the caches are configured. Big and pleasant surprise: two of them were running at half their maximum size for no good reason.</p>
<p>In my case after DSP-boot I got:</p>
<pre>Level 1 Data-Cache 32k
Level 1 Code-Cache 16k
Level 2 Cache      32k</pre>
<p>However, the maximum possible cache sizes on the BeagleBoard are:</p>
<pre>Level 1 Data-Cache 32k (no change)
Level 1 Code-Cache 32k (16k larger)
Level 2 Cache      64k (32k larger)</pre>
<p>So 48 KB of valuable cache (16 KB of L1 code cache plus 32 KB of L2) was left unused. Changing the cache sizes is easy:</p>
<pre>  #include &lt;bcache.h&gt;

  // and somewhere at the start of main()
  BCACHE_Size size;
  size.l1dsize = BCACHE_L1_32K;
  size.l1psize = BCACHE_L1_32K;
  size.l2size  = BCACHE_L2_64K;
  BCACHE_setSize (&amp;size);</pre>
<p>That still leaves you the 48 KB of L1DSRAM for single-cycle access and 32 KB of L2RAM to talk to the video accelerators. Oh &#8211; and it gave a noticeable performance boost.</p>
<p>Btw, it&#8217;s quite possible that this only applies to the DspLink configuration that I am using.</p>
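<p>The cache arithmetic above can be double-checked with a few lines of C. This is only a sketch: the on-chip totals (80 KB of L1D, 96 KB of L2) are inferred from the numbers in this post, and the <tt>cache_cfg</tt> type and the helper names are made up for the illustration; they are not part of DSP/BIOS.</p>
<pre>/* On-chip memory totals in KB (assumption, inferred from this post:
   L1D = 32 cache + 48 SRAM, L2 = 64 cache + 32 RAM). */
enum { L1D_TOTAL_KB = 80, L2_TOTAL_KB = 96 };

/* Hypothetical helper type: configured cache sizes in KB. */
typedef struct { int l1p_kb; int l1d_kb; int l2_kb; } cache_cfg;

/* Extra cache (KB) gained when switching from 'from' to 'to'. */
int cache_gain_kb(cache_cfg from, cache_cfg to)
{
    return (to.l1p_kb - from.l1p_kb)
         + (to.l1d_kb - from.l1d_kb)
         + (to.l2_kb  - from.l2_kb);
}

/* L1D memory (KB) that stays directly addressable SRAM. */
int l1d_sram_left_kb(cache_cfg c) { return L1D_TOTAL_KB - c.l1d_kb; }

/* L2 memory (KB) that stays directly addressable RAM. */
int l2_ram_left_kb(cache_cfg c) { return L2_TOTAL_KB - c.l2_kb; }</pre>
<p>With the boot configuration {16, 32, 32} and the maximum configuration {32, 32, 64}, <tt>cache_gain_kb</tt> yields 48, and the maximum configuration leaves 48 KB of L1D SRAM and 32 KB of L2 RAM, matching the numbers above.</p>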
<h3>Update:</h3>
<p>It turned out that the reason for the smaller cache sizes is the default DspLink configuration. You can override it by adding the following lines to your project&#8217;s TCF file. Just put them somewhere between utils.importFile(&#8220;dsplink-omap3530-base.tci&#8221;); and prog.gen():</p>
<pre>
prog.module("GBL").C64PLUSL2CFG  = "64k";
prog.module("GBL").C64PLUSL1DCFG = "32k";
prog.module("GBL").C64PLUSL1PCFG = "32k";

var IRAM = prog.module("MEM").instance("IRAM");
IRAM.len = 32768;</pre>
<p>This will configure the OMAP3530 DSP with:</p>
<pre>L2 Cache      64 KB
L1 Data-Cache 32 KB
L1 Code-Cache 32 KB
L1DSRAM       48 KB
IRAM (L2 RAM) 32 KB</pre>
]]></content:encoded>
			<wfw:commentRss>http://hilbert-space.de/?feed=rss2&#038;p=77</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Faster Cortex-A8 16-bit Multiplies</title>
		<link>http://hilbert-space.de/?p=66</link>
		<comments>http://hilbert-space.de/?p=66#comments</comments>
		<pubDate>Sun, 31 Jan 2010 06:08:03 +0000</pubDate>
		<dc:creator>Nils</dc:creator>
				<category><![CDATA[Beagleboard]]></category>
		<category><![CDATA[OMAP3530]]></category>

		<guid isPermaLink="false">http://hilbert-space.de/?p=66</guid>
		<description><![CDATA[I did a small and fun assembler SIMD optimization job last week. The target architecture was ARMv6, but since the code will run on the iPhone I tried to keep the code fast on the Cortex-A8 as well. When I did some profiling on my BeagleBoard, I got some surprising results: The code [...]]]></description>
			<content:encoded><![CDATA[<p>I did a small and fun assembler SIMD optimization job last week. The target architecture was ARMv6, but since the code will run on the iPhone I tried to keep the code fast on the Cortex-A8 as well. </p>
<p>When I did some profiling on my BeagleBoard, I got some surprising results: the code ran faster than it should. This was odd. It had never happened to me before. </p>
<p>Fast forward 5 hours and lots of micro-benchmarking: </p>
<h3>The 16 bit multiplies SMULxy on the Cortex-A8 are a cycle faster than documented!</h3>
<p>They take one cycle for issue and have a result-latency of three cycles (rule of thumb, it&#8217;s a bit more complicated than that). And this applies to all variants of this instruction: SMULBB, SMULBT, SMULTT and SMULTB. </p>
<p>The multiply-accumulate variants of the 16 bit multiplies execute as documented: two cycles issue and three cycles result-latency. </p>
<p>This is nice. I have used the 16 bit multiplies a lot in the past but stopped using them because I thought they offered no benefit over the more general MUL instruction on the Cortex-A8. The SMULxy multiplies mix very well with the ARMv6 SIMD multiplies. Both of them work on 16 bit integers, but the SIMD instructions take packed 16 bit arguments while the SMULxy instructions use a single 16 bit value from each argument, and you can specify whether you want the top or bottom 16 bits of each. Very flexible. </p>
<p>All this leads to nice code sequences. For example a three element dot-product of signed 16 bit quantities. Something that is used quite a lot for color-space conversion.</p>
<p>Assume these register values on entry:</p>
<pre>
              MSB16      LSB16

          +----------+----------+
      r1: | ignored  |    a1    |
          +----------+----------+
          +----------+----------+
      r2: | ignored  |    a2    |
          +----------+----------+
          +----------+----------+
      r3: |    b1    |    c1    |
          +----------+----------+
          +----------+----------+
      r4: |    b2    |    c2    |
          +----------+----------+
</pre>
<p>And this code sequence:</p>
<pre>
    smulbb      r0, r1, r2
    smlad       r0, r3, r4, r0
 </pre>
<p>Gives a result: r0 = (a1*a2) + (b1*b2) + (c1*c2)</p>
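<p>To check what the two instructions compute without an ARM board at hand, here is a small portable C model. It is only a sketch: the function names are made up for this example, it relies on the usual two&#8217;s complement conversions (which is what common compilers do), and it does not model the Q flag that SMLAD sets on accumulate overflow.</p>
<pre>#include &lt;stdint.h&gt;

/* Pack two signed 16 bit values into one 32 bit register image. */
uint32_t pack16(int16_t top, int16_t bottom)
{
    return ((uint32_t)(uint16_t)top &lt;&lt; 16) | (uint16_t)bottom;
}

/* Model of SMULBB: signed product of the two bottom halfwords. */
int32_t smulbb_model(uint32_t rn, uint32_t rm)
{
    return (int32_t)(int16_t)(rn &amp; 0xFFFF) * (int32_t)(int16_t)(rm &amp; 0xFFFF);
}

/* Model of SMLAD: bottom*bottom + top*top + accumulator (wrapping add). */
int32_t smlad_model(uint32_t rn, uint32_t rm, int32_t acc)
{
    int32_t lo = (int32_t)(int16_t)(rn &amp; 0xFFFF) * (int32_t)(int16_t)(rm &amp; 0xFFFF);
    int32_t hi = (int32_t)(int16_t)(rn &gt;&gt; 16) * (int32_t)(int16_t)(rm &gt;&gt; 16);
    return (int32_t)((uint32_t)acc + (uint32_t)lo + (uint32_t)hi);
}

/* The two-instruction sequence from above with the same register layout. */
int32_t dot3_model(int16_t a1, int16_t a2, int16_t b1, int16_t b2,
                   int16_t c1, int16_t c2)
{
    uint32_t r1 = pack16(0, a1);   /* top half is ignored by smulbb */
    uint32_t r2 = pack16(0, a2);
    uint32_t r3 = pack16(b1, c1);
    uint32_t r4 = pack16(b2, c2);
    int32_t  r0 = smulbb_model(r1, r2);
    return smlad_model(r3, r4, r0);
}</pre>
<p>For example, <tt>dot3_model(1, 2, 3, 4, 5, 6)</tt> evaluates 1*2 + 3*4 + 5*6 and returns 44.</p>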
<p>On the Cortex-A8 this will schedule like this:</p>
<pre>
  Cycle0:

    smulbb      r0, r1, r2          Pipe0
    nop                             Pipe1   (free, can be used for non-multiplies)

  Cycle1:        

    smlad       r0, r3, r4, r0      Pipe0
    nop                             Pipe1   (free, can be used for non-multiplies)

  Cycle2:        

    blocked, because smlad is a multi-cycle instruction.
</pre>
<p>The result (r0) will be available three cycles later (cycle 6) for most instructions. You can execute whatever you want in-between as long as you don&#8217;t touch r0.</p>
<p>Note that this is a special case: the SMULBB instruction in cycle 0 generates its result in r0. If the next instruction is one of the multiply-accumulate family and that register is used as the accumulate argument, a special forwarding path of the Cortex-A8 kicks in and the result latency is lowered to just one cycle. Cool, ain&#8217;t it?</p>
<p>Btw: Thanks to <a href="http://www.hardwarebug.org">Måns/MRU</a>, who was kind enough to verify the timing on his BeagleBoard. </p>
]]></content:encoded>
			<wfw:commentRss>http://hilbert-space.de/?feed=rss2&#038;p=66</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>C64x+ DSP MMU faults, and how to disable the MMU.</title>
		<link>http://hilbert-space.de/?p=55</link>
		<comments>http://hilbert-space.de/?p=55#comments</comments>
		<pubDate>Wed, 20 Jan 2010 23:39:59 +0000</pubDate>
		<dc:creator>Nils</dc:creator>
				<category><![CDATA[Beagleboard]]></category>
		<category><![CDATA[DSP]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[OMAP3530]]></category>

		<guid isPermaLink="false">http://hilbert-space.de/?p=55</guid>
		<description><![CDATA[Two days ago, while testing some image processing algorithms on the DSP I got the following message for the first time: DSP MMU Error Fault! MMU_IRQSTATUS = [0x1]. Virtual DSP addr reference that generated the interrupt = [0x85000000]. Ouch! I was aware that the DSP of the OMAP3530 has a memory management unit, but so [...]]]></description>
			<content:encoded><![CDATA[<p>Two days ago, while testing some image processing algorithms on the DSP I got the following message for the first time:</p>
<pre><strong>
DSP MMU Error Fault!  MMU_IRQSTATUS = [0x1]. Virtual DSP addr reference that generated the interrupt = [0x85000000].
</strong></pre>
<p>Ouch!</p>
<p>I was aware that the DSP of the OMAP3530 has a memory management unit, but so far I had never had to deal with it. DspLink initialized the MMU and enabled access to all the DSP memory and all the peripherals I had accessed so far. </p>
<p>However, this time I passed a pointer to a memory block allocated via CMEM to the DSP. This triggered a page fault. Now what? I did some research and figured out that the CodecEngine enables access to the CMEM memory. I don&#8217;t use the CodecEngine, so I have to do it on my own.</p>
<p>Fortunately the TI folks have thought about that! The PROC module has functionality to modify the DSP MMU entries. Here are some ready to use functions:</p>
<pre><tt><font color="#009900">int</font> <b><font color="#000000">dsp_mmu_map</font></b> <font color="#990000">(</font><font color="#009900">unsigned</font> <font color="#009900">long</font> physical_ptr<font color="#990000">,</font> <font color="#009900">int</font> size<font color="#990000">)</font>
<i><font color="#9A1900">///////////////////////////////////////////////////////////</font></i>
<i><font color="#9A1900">// Maps a physical memory region into the DSP address-space</font></i>
<font color="#FF0000">{</font>
  <font color="#008080">ProcMemMapInfo</font> mapInfo<font color="#990000">;</font>
  mapInfo<font color="#990000">.</font>dspAddr <font color="#990000">=</font> <font color="#990000">(</font>Uint32<font color="#990000">)</font>physical_ptr<font color="#990000">;</font>
  mapInfo<font color="#990000">.</font>size    <font color="#990000">=</font> size<font color="#990000">;</font>
  <b><font color="#0000FF">return</font></b> <b><font color="#000000">DSP_SUCCEEDED</font></b><font color="#990000">(</font><b><font color="#000000">PROC_control</font></b><font color="#990000">(</font><font color="#993399">0</font><font color="#990000">,</font> PROC_CTRL_CMD_MMU_ADD_ENTRY<font color="#990000">,</font> <font color="#990000">&amp;</font>mapInfo<font color="#990000">));</font>
 <font color="#FF0000">}</font>

<font color="#009900">int</font> <b><font color="#000000">dsp_mmu_unmap</font></b> <font color="#990000">(</font><font color="#009900">unsigned</font> <font color="#009900">long</font> dspAddr<font color="#990000">,</font> <font color="#009900">int</font> size<font color="#990000">)</font>
<i><font color="#9A1900">//////////////////////////////////////////////////////////////</font></i>
<i><font color="#9A1900">// Unmaps a physical memory region from the DSP address-space</font></i>
<font color="#FF0000">{</font>
  <font color="#008080">ProcMemMapInfo</font> mapInfo<font color="#990000">;</font>
  mapInfo<font color="#990000">.</font>dspAddr <font color="#990000">=</font> <font color="#990000">(</font>Uint32<font color="#990000">)</font>dspAddr<font color="#990000">;</font>
  mapInfo<font color="#990000">.</font>size <font color="#990000">=</font> size<font color="#990000">;</font>
  <b><font color="#0000FF">return</font></b> <b><font color="#000000">DSP_SUCCEEDED</font></b><font color="#990000">(</font><b><font color="#000000">PROC_control</font></b><font color="#990000">(</font><font color="#993399">0</font><font color="#990000">,</font> PROC_CTRL_CMD_MMU_DEL_ENTRY<font color="#990000">,</font> <font color="#990000">&amp;</font>mapInfo<font color="#990000">));</font>
<font color="#FF0000">}</font>

</tt></pre>
<p>All nice and dandy now? No, it&#8217;s not. You can only map a limited number of memory regions (around 32, I think). That&#8217;s not enough for my needs. Also, I don&#8217;t feel like tracking memory regions and swapping them as needed. So the way CodecEngine does it is probably better: enable access to the whole CMEM address-space. These two functions do this, but without CodecEngine:</p>
<pre><tt><font color="#009900">int</font> <b><font color="#000000">dsp_mmu_map_cmem</font></b> <font color="#990000">(</font><font color="#009900">void</font><font color="#990000">)</font>
<i><font color="#9A1900">/////////////////////////////////////////</font></i>
<i><font color="#9A1900">// map first cmem block into the DSP MMU:</font></i>
<font color="#FF0000">{</font>
  <font color="#008080">CMEM_BlockAttrs</font> info<font color="#990000">;</font>

  <b><font color="#0000FF">if</font></b> <font color="#990000">(</font><b><font color="#000000">CMEM_getBlockAttrs</font></b><font color="#990000">(</font><font color="#993399">0</font><font color="#990000">,</font> <font color="#990000">&amp;</font>info<font color="#990000">)</font> <font color="#990000">!=</font> <font color="#993399">0</font><font color="#990000">)</font>
    <b><font color="#0000FF">return</font></b> <font color="#993399">0</font><font color="#990000">;</font>

  <b><font color="#0000FF">return</font></b> <b><font color="#000000">dsp_mmu_map</font></b> <font color="#990000">(</font>info<font color="#990000">.</font>phys_base<font color="#990000">,</font> info<font color="#990000">.</font>size<font color="#990000">);</font>
<font color="#FF0000">}</font>

<font color="#009900">int</font> <b><font color="#000000">dsp_mmu_unmap_cmem</font></b> <font color="#990000">(</font><font color="#009900">void</font><font color="#990000">)</font>
<i><font color="#9A1900">//////////////////////////////////////////</font></i>
<i><font color="#9A1900">// unmap first cmem block from the DSP MMU:</font></i>
<font color="#FF0000">{</font>
  <font color="#008080">CMEM_BlockAttrs</font> info<font color="#990000">;</font>

  <b><font color="#0000FF">if</font></b> <font color="#990000">(</font><b><font color="#000000">CMEM_getBlockAttrs</font></b><font color="#990000">(</font><font color="#993399">0</font><font color="#990000">,</font> <font color="#990000">&amp;</font>info<font color="#990000">)</font> <font color="#990000">!=</font> <font color="#993399">0</font><font color="#990000">)</font>
    <b><font color="#0000FF">return</font></b> <font color="#993399">0</font><font color="#990000">;</font>

  <b><font color="#0000FF">return</font></b> <b><font color="#000000">dsp_mmu_unmap</font></b> <font color="#990000">(</font>info<font color="#990000">.</font>phys_base<font color="#990000">,</font> info<font color="#990000">.</font>size<font color="#990000">);</font>
<font color="#FF0000">}</font>

</tt></pre>
<p>All problems solved. Great! </p>
<p>I could have stopped here, but I was eager to know if the MMU has any impact on the memory throughput. Is it possible to completely disable the MMU? Sure, this opens a can of worms. A bug in my code or a wrong DMA transfer could write to nearly any location. It could even erase the flash. But on the DaVinci I didn&#8217;t have an MMU and I never ran into such problems. So I did some research, and:</p>
<p>It is simple!</p>
<p>The MMU has a disable bit, and the TRM says that you have to do a soft-reset of the MMU if you fiddle with the settings. I gave it a try and it worked the first time! You don&#8217;t even need a kernel module for it. The following code does all the magic from Linux user mode, with the restriction that you need read and write access to /dev/mem. </p>
<p>Call this between PROC_load and PROC_start:</p>
<pre><tt><font color="#009900">int</font> <b><font color="#000000">dsp_mmu_disable</font></b> <font color="#990000">(</font><font color="#009900">void</font><font color="#990000">)</font>
<i><font color="#9A1900">//////////////////////////</font></i>
<i><font color="#9A1900">// Disables the DSP MMU.</font></i>
<i><font color="#9A1900">// Sets virtual = physical mapping.</font></i>
<font color="#FF0000">{</font>
  <b><font color="#0000FF">volatile</font></b> <font color="#009900">unsigned</font> <font color="#009900">long</font> <font color="#990000">*</font> mmu<font color="#990000">;</font>
  <font color="#009900">int</font> result <font color="#990000">=</font> <font color="#993399">0</font><font color="#990000">;</font>

  <i><font color="#9A1900">// physical address of the MMU2 and some register offsets (32 bit word indices):</font></i>
  <b><font color="#0000FF">const</font></b> <font color="#009900">unsigned</font> MMU2_PHYSICAL <font color="#990000">=</font> <font color="#993399">0x5d000000</font><font color="#990000">;</font>
  <b><font color="#0000FF">const</font></b> <font color="#009900">unsigned</font> MMU_SYSCONFIG <font color="#990000">=</font> <font color="#993399">4</font><font color="#990000">;</font>
  <b><font color="#0000FF">const</font></b> <font color="#009900">unsigned</font> MMU_SYSSTATUS <font color="#990000">=</font> <font color="#993399">5</font><font color="#990000">;</font>
  <b><font color="#0000FF">const</font></b> <font color="#009900">unsigned</font> MMU_CNTL      <font color="#990000">=</font> <font color="#993399">17</font><font color="#990000">;</font>

  <i><font color="#9A1900">// needs Read+Write access to /dev/mem, so you'd better run this as root.</font></i>
  <font color="#009900">int</font> fd <font color="#990000">=</font> <b><font color="#000000">open</font></b><font color="#990000">(</font><font color="#FF0000">"/dev/mem"</font><font color="#990000">,</font> O_RDWR<font color="#990000">);</font>

  <b><font color="#0000FF">if</font></b> <font color="#990000">(</font>fd<font color="#990000">&gt;=</font><font color="#993399">0</font><font color="#990000">)</font>
  <font color="#FF0000">{</font>
    <i><font color="#9A1900">// get a pointer to the MMU2 register space:</font></i>
    mmu <font color="#990000">=</font> <font color="#990000">(</font><font color="#009900">unsigned</font> <font color="#009900">long</font> <font color="#990000">*)</font> <b><font color="#000000">mmap</font></b><font color="#990000">(</font>NULL<font color="#990000">,</font> <font color="#993399">4096</font><font color="#990000">,</font> PROT_READ <font color="#990000">|</font> PROT_WRITE<font color="#990000">,</font> MAP_SHARED<font color="#990000">,</font> fd<font color="#990000">,</font> MMU2_PHYSICAL<font color="#990000">);</font>

    <b><font color="#0000FF">if</font></b> <font color="#990000">(</font>mmu <font color="#990000">!=</font> MAP_FAILED<font color="#990000">)</font>
    <font color="#FF0000">{</font>
      <font color="#008080">clock_t</font> start<font color="#990000">;</font>

      <i><font color="#9A1900">// A timeout of 10 milliseconds is more than plenty.</font></i>
      <i><font color="#9A1900">//</font></i>
      <i><font color="#9A1900">// Usually the reset takes about 10 microseconds.</font></i>
      <i><font color="#9A1900">// The reset has never failed for me,</font></i>
      <i><font color="#9A1900">// but better safe than sorry.</font></i>
      <font color="#008080">clock_t</font> timeout <font color="#990000">=</font> <font color="#990000">(</font>CLOCKS_PER_SEC<font color="#990000">/</font><font color="#993399">100</font><font color="#990000">);</font>

      <i><font color="#9A1900">// start MMU soft-reset:</font></i>
      mmu<font color="#990000">[</font>MMU_SYSCONFIG<font color="#990000">]</font> <font color="#990000">|=</font> <font color="#993399">1</font><font color="#990000">;</font>

      <i><font color="#9A1900">// wait (with timeout) until the reset is complete.</font></i>
      start <font color="#990000">=</font> <b><font color="#000000">clock</font></b><font color="#990000">();</font>
      <b><font color="#0000FF">while</font></b> <font color="#990000">((!</font>mmu<font color="#990000">[</font>MMU_SYSSTATUS<font color="#990000">])</font> <font color="#990000">&amp;&amp;</font> <font color="#990000">(</font><b><font color="#000000">clock</font></b><font color="#990000">()-</font>start <font color="#990000">&lt;</font> timeout<font color="#990000">))</font> <font color="#FF0000">{}</font>

      <b><font color="#0000FF">if</font></b> <font color="#990000">(</font>mmu<font color="#990000">[</font>MMU_SYSSTATUS<font color="#990000">])</font>
      <font color="#FF0000">{</font>
        <i><font color="#9A1900">// disable MMU</font></i>
        mmu<font color="#990000">[</font>MMU_CNTL<font color="#990000">]</font> <font color="#990000">=</font><font color="#993399">0</font><font color="#990000">;</font>

        <i><font color="#9A1900">// set result to SUCCESS.</font></i>
        result <font color="#990000">=</font> <font color="#993399">1</font><font color="#990000">;</font>
      <font color="#FF0000">}</font>
      <i><font color="#9A1900">// remove mapping:</font></i>
      <b><font color="#000000">munmap</font></b><font color="#990000">((</font><font color="#009900">void</font><font color="#990000">*)</font>mmu<font color="#990000">,</font> <font color="#993399">4096</font><font color="#990000">);</font>
    <font color="#FF0000">}</font>
    <b><font color="#000000">close</font></b> <font color="#990000">(</font>fd<font color="#990000">);</font>
  <font color="#FF0000">}</font>

  <i><font color="#9A1900">// 1 on success, 0 on failure:</font></i>
  <b><font color="#0000FF">return</font></b> result<font color="#990000">;</font>
<font color="#FF0000">}</font>

</tt></pre>
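<p>A note on the register constants in dsp_mmu_disable: they are indices into an array of 32 bit words, not byte offsets. Multiplying by four gives the byte offsets, which makes it easier to compare them against the register table in the TRM. The sketch below shows the conversion; the helper name is made up for this example. It also points at a hidden assumption in the function above: indexing through an <tt>unsigned long</tt> pointer only hits the right registers where <tt>unsigned long</tt> is 32 bits wide, as on 32 bit ARM Linux.</p>
<pre>#include &lt;stdint.h&gt;

/* Register word indices as used in dsp_mmu_disable(). */
enum { MMU_SYSCONFIG = 4, MMU_SYSSTATUS = 5, MMU_CNTL = 17 };

/* Byte offset of a register, given its index into an array of 32 bit words. */
unsigned mmu_reg_byte_offset(unsigned word_index)
{
    return word_index * (unsigned)sizeof(uint32_t);
}</pre>
<p>This gives byte offsets 0x10, 0x14 and 0x44 for the three registers.</p>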
<p>And to answer my own question: No, the MMU does not have any negative impact on performance. Although the MMU tables reside in DDR2 memory, the pages are so large that the extra memory traffic for the MMU table-walks can&#8217;t even be measured.</p>
<p>Btw &#8211; I&#8217;ve made a little easy-to-use library out of the above functions and I&#8217;m releasing it under the BSD license, so everyone can use it. Get it here: <a href="http://torus.untergrund.net/code/dsp_mmu_util.tgz">dsp_mmu_util.tgz</a></p>
<p>The next question would be: Can the DSP jail-break and disable its own MMU? That would be of little practical use but interesting to know. </p>
]]></content:encoded>
			<wfw:commentRss>http://hilbert-space.de/?feed=rss2&#038;p=55</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>More on EDMA3 on the BeagleBoard/OMAP3530</title>
		<link>http://hilbert-space.de/?p=47</link>
		<comments>http://hilbert-space.de/?p=47#comments</comments>
		<pubDate>Wed, 06 Jan 2010 06:05:28 +0000</pubDate>
		<dc:creator>Nils</dc:creator>
				<category><![CDATA[Beagleboard]]></category>
		<category><![CDATA[DSP]]></category>
		<category><![CDATA[OMAP3530]]></category>

		<guid isPermaLink="false">http://hilbert-space.de/?p=47</guid>
		<description><![CDATA[Didn&#8217;t I mention that the EDMA3 on the OMAP3530 is identical to the EDMA3 of the DaVinci? As I found out this is not exactly true. There is a subtle but important difference: The order of the transfer-controllers has been reversed. On the DaVinci TPTC0 was meant to be used for system critical controls with [...]]]></description>
			<content:encoded><![CDATA[<p>Didn&#8217;t I mention that the EDMA3 on the OMAP3530 is identical to the EDMA3 of the DaVinci? As I found out this is not exactly true. There is a subtle but important difference:</p>
<p>The order of the transfer-controllers has been reversed. On the DaVinci, TPTC0 was meant to be used for system-critical, low-latency transfers and TPTC1 for longer background tasks. <strong>On the OMAP3530 this order is exactly reversed. </strong>And by the way: Ever wondered what the difference between those two controllers is? On the OMAP3530 the first controller has a FIFO length of 256 bytes while the second has only 128 bytes. The transfer speed is the same, but transfers issued on the controller with the shorter FIFO have lower latency, so the data reaches the destination a tad earlier.</p>
<p>Btw, while I fooled around with the EDMA I made some speed measurements. I think these may be of interest:</p>
<ul>
<li>DSP DMA transfer, internal to DDR2 RAM:    550 MB/s</li>
<li>DSP CPU transfer (memset) to DDR2 RAM:     123 MB/s (ouch!)</li>
<li>DSP CPU transfer (memset) to internal RAM:  3550 MB/s</li>
</ul>
<p>For reference I made the same memset test on the Cortex-A8:</p>
<ul>
<li>Cortex-A8 DDR2 memset (cached):              417 MB/s</li>
<li>Cortex-A8 DDR2 memset (uncached):             25 MB/s</li>
</ul>
<p>All numbers were taken with the GPP clock at 500 MHz and the DSP clock at 360 MHz. Caches were enabled and the transfer size was one megabyte.</p>
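<p>If you want to repeat the memset part of these measurements on your own hardware, something along the following lines should do. This is a sketch and not the code I used for the numbers above: the function name, the repetition count and the use of clock() for timing are assumptions made for this example.</p>
<pre>#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;time.h&gt;

/* Returns the memset throughput in MB/s for a buffer of 'bytes' bytes
   that is filled 'reps' times. Returns a negative value on failure. */
double memset_mb_per_s(size_t bytes, int reps)
{
    unsigned char *buf;
    volatile unsigned char sink = 0;
    clock_t start, ticks;
    int i;

    if (bytes == 0 || reps &lt;= 0)
        return -1.0;
    buf = malloc(bytes);
    if (buf == NULL)
        return -1.0;

    memset(buf, 0, bytes);          /* touch all pages before timing */
    start = clock();
    for (i = 0; i &lt; reps; i++) {
        memset(buf, i &amp; 0xFF, bytes);
        sink ^= buf[bytes - 1];     /* read back so the stores are kept */
    }
    ticks = clock() - start;
    free(buf);
    (void)sink;

    if (ticks &lt;= 0)
        return -1.0;                /* run too short for the timer resolution */
    return ((double)bytes * reps / (1024.0 * 1024.0))
         / ((double)ticks / CLOCKS_PER_SEC);
}</pre>
<p>Calling <tt>memset_mb_per_s(1 &lt;&lt; 20, 200)</tt> matches the one megabyte transfer size used above. Use enough repetitions that the run takes well over a few milliseconds, otherwise the resolution of clock() dominates the result.</p>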
]]></content:encoded>
			<wfw:commentRss>http://hilbert-space.de/?feed=rss2&#038;p=47</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>2009/2010 Status Update</title>
		<link>http://hilbert-space.de/?p=35</link>
		<comments>http://hilbert-space.de/?p=35#comments</comments>
		<pubDate>Sun, 03 Jan 2010 17:42:33 +0000</pubDate>
		<dc:creator>Nils</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://hilbert-space.de/?p=35</guid>
		<description><![CDATA[So, 2010 has arrived. Time for a little status update on my multi-effect project for the BeagleBoard. Well &#8211; I still don&#8217;t have sound output, but I have made some significant steps: I got the DSP working! That does not sound like much, but it is a key component for my software-architecture. Getting everything working [...]]]></description>
			<content:encoded><![CDATA[<p>So, 2010 has arrived. Time for a little status update on my multi-effect project for the BeagleBoard.</p>
<p>Well &#8211; I still don&#8217;t have sound output, but I have made some significant steps:</p>
<ul>
<li>I got the DSP working!
<p>That does not sound like much, but it is a key component of my software architecture. Getting everything working smoothly wasn&#8217;t that hard, but there were some nice and unexpected pitfalls on the road. I may write later about some of the things that will go wrong on the first try if you want to run DSP code on the OMAP3530.</p></li>
<li>I can now talk to the TWL4030 codec via I2C.
<p>That has been an unexpected task as well. My assumption was that I could simply use the I2C driver and talk to the chip from Linux user-space. Unfortunately it was not that easy. Long story short: A bunch of drivers is blocking the I2C bus to the TWL4030, so you can&#8217;t just send the I2C commands without confusing the existing drivers. However, the drivers expose an interface to the kernel, so all it took was to write a kernel module that exposes the interface to user-space.</p></li>
<li>I got the EDMA3 controller on the DSP working as well.
<p>The EDMA3 peripheral on the OMAP3530 is exactly the same as on the DaVinci family. I&#8217;ve already worked with this DMA, so that part was easy.</p></li>
</ul>
<h3>And why all the hassle?</h3>
<p>My plan is to move the entire sound output code to the DSP side of the OMAP. That way I can do all sound processing on a system without any operating system. The good realtime capabilities of DSP/BIOS and the performance of the DSP will allow me to do my sound processing with minimal latency. I estimate that 2 ms latency will be no problem, but we&#8217;ll see where I end up.</p>
<h3>Next steps:</h3>
<p>Compile a kernel with all McBSP and sound support disabled. Then write a McBSP driver for the DSP and make some noise <img src='http://hilbert-space.de/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://hilbert-space.de/?feed=rss2&#038;p=35</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ARM NEON Optimization. An Example</title>
		<link>http://hilbert-space.de/?p=22</link>
		<comments>http://hilbert-space.de/?p=22#comments</comments>
		<pubDate>Fri, 18 Dec 2009 18:15:01 +0000</pubDate>
		<dc:creator>Nils</dc:creator>
				<category><![CDATA[Beagleboard]]></category>
		<category><![CDATA[OMAP3530]]></category>

		<guid isPermaLink="false">http://hilbert-space.de/?p=22</guid>
		<description><![CDATA[Since there is so little information about NEON optimizations out there I thought I&#8217;d write a little about it. Some weeks ago someone on the beagle-board mailing-list asked how to optimize a color to grayscale conversion for images. I haven&#8217;t done much pixel processing with ARM NEON yet, so I gave it a try. The [...]]]></description>
			<content:encoded><![CDATA[<p>Since there is so little information about NEON optimizations out there I thought I&#8217;d write a little about it.</p>
<p>Some weeks ago someone on the beagle-board mailing-list asked how to optimize a color to grayscale conversion for images. I haven&#8217;t done much pixel processing with ARM NEON yet, so I gave it a try. The results I got were quite spectacular, but more on this later.</p>
<p>For the color to grayscale conversion I used a very simple conversion scheme: a weighted average of the red, green and blue components. This conversion ignores the effect of gamma but works well enough in practice. I also decided not to do proper rounding. It&#8217;s just an example after all.</p>
<p>First a reference implementation in C:</p>
<pre><tt><span style="color: #009900;">void</span> <strong><span style="color: #000000;">reference_convert</span></strong> <span style="color: #990000;">(</span>uint8_t <span style="color: #990000;">*</span> __<span style="color: #008080;">restrict</span> dest<span style="color: #990000;">,</span> uint8_t <span style="color: #990000;">*</span> __<span style="color: #008080;">restrict</span> src<span style="color: #990000;">,</span> <span style="color: #009900;">int</span> n<span style="color: #990000;">)</span>
<span style="color: #ff0000;">{</span>
  <span style="color: #009900;">int</span> i<span style="color: #990000;">;</span>
  <strong><span style="color: #0000ff;">for</span></strong> <span style="color: #990000;">(</span>i<span style="color: #990000;">=</span><span style="color: #993399;">0</span><span style="color: #990000;">;</span> i<span style="color: #990000;">&lt;</span>n<span style="color: #990000;">;</span> i<span style="color: #990000;">++)</span>
  <span style="color: #ff0000;">{</span>
    <span style="color: #009900;">int</span> r <span style="color: #990000;">=</span> <span style="color: #990000;">*</span>src<span style="color: #990000;">++;</span> <em><span style="color: #9a1900;">// load red</span></em>
    <span style="color: #009900;">int</span> g <span style="color: #990000;">=</span> <span style="color: #990000;">*</span>src<span style="color: #990000;">++;</span> <em><span style="color: #9a1900;">// load green</span></em>
    <span style="color: #009900;">int</span> b <span style="color: #990000;">=</span> <span style="color: #990000;">*</span>src<span style="color: #990000;">++;</span> <em><span style="color: #9a1900;">// load blue </span></em>

    <em><span style="color: #9a1900;">// build weighted average:</span></em>
    <span style="color: #009900;">int</span> y <span style="color: #990000;">=</span> <span style="color: #990000;">(</span>r<span style="color: #990000;">*</span><span style="color: #993399;">77</span><span style="color: #990000;">)+(</span>g<span style="color: #990000;">*</span><span style="color: #993399;">151</span><span style="color: #990000;">)+(</span>b<span style="color: #990000;">*</span><span style="color: #993399;">28</span><span style="color: #990000;">);</span>

    <em><span style="color: #9a1900;">// undo the scale by 256 and write to memory:</span></em>
    <span style="color: #990000;">*</span>dest<span style="color: #990000;">++</span> <span style="color: #990000;">=</span> <span style="color: #990000;">(</span>y<span style="color: #990000;">&gt;&gt;</span><span style="color: #993399;">8</span><span style="color: #990000;">);</span>
  <span style="color: #ff0000;">}</span>
<span style="color: #ff0000;">}</span>

</tt></pre>
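<p>For the curious: the reference above simply truncates when it shifts. A minimal sketch of what rounding to nearest could look like is shown here. It is not part of the code I benchmarked, and the name reference_convert_rounded is made up for this example. Since the weights 77, 151 and 28 sum up to 256, the result still fits into a byte after adding half of the scale:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant of reference_convert that rounds to nearest
   instead of truncating. Not used for any of the timings in this post. */
void reference_convert_rounded (uint8_t * dest, const uint8_t * src, int n)
{
  int i;
  assert (n >= 0);
  for (i = 0; i < n; i++)
  {
    int r = *src++; // load red
    int g = *src++; // load green
    int b = *src++; // load blue

    // weighted sum, scaled by 256 (77 + 151 + 28 == 256):
    int y = (r * 77) + (g * 151) + (b * 28);

    // add half of the scale (128) before the shift to round to nearest:
    *dest++ = (uint8_t) ((y + 128) >> 8);
  }
}
```

<p>With truncation a pixel of (2, 0, 0) ends up as 0, with the rounded variant it becomes 1. A white pixel stays at 255 in both cases.</p>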
<h3>Optimization with NEON Intrinsics</h3>
<p>Let&#8217;s start optimizing the code using the compiler intrinsics. Intrinsics are nice to use because they behave just like C-functions but compile to a single assembler instruction. At least in theory, as I&#8217;ll show you later.</p>
<p>Since NEON works on 64 or 128 bit registers it&#8217;s best to process eight pixels in parallel: eight 8 bit color components fill a 64 bit register, and the eight 16 bit intermediate results fill a 128 bit register. That way we can exploit the parallel nature of the SIMD-unit. Here is what I came up with:</p>
<pre><tt><span style="color: #009900;">void</span> <strong><span style="color: #000000;">neon_convert</span></strong> <span style="color: #990000;">(</span>uint8_t <span style="color: #990000;">*</span> __<span style="color: #008080;">restrict</span> dest<span style="color: #990000;">,</span> uint8_t <span style="color: #990000;">*</span> __<span style="color: #008080;">restrict</span> src<span style="color: #990000;">,</span> <span style="color: #009900;">int</span> n<span style="color: #990000;">)</span>
<span style="color: #ff0000;">{</span>
  <span style="color: #009900;">int</span> i<span style="color: #990000;">;</span>
  <span style="color: #008080;">uint8x8_t</span> rfac <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vdup_n_u8</span></strong> <span style="color: #990000;">(</span><span style="color: #993399;">77</span><span style="color: #990000;">);</span>
  <span style="color: #008080;">uint8x8_t</span> gfac <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vdup_n_u8</span></strong> <span style="color: #990000;">(</span><span style="color: #993399;">151</span><span style="color: #990000;">);</span>
  <span style="color: #008080;">uint8x8_t</span> bfac <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vdup_n_u8</span></strong> <span style="color: #990000;">(</span><span style="color: #993399;">28</span><span style="color: #990000;">);</span>
  n<span style="color: #990000;">/=</span><span style="color: #993399;">8</span><span style="color: #990000;">;</span>

  <strong><span style="color: #0000ff;">for</span></strong> <span style="color: #990000;">(</span>i<span style="color: #990000;">=</span><span style="color: #993399;">0</span><span style="color: #990000;">;</span> i<span style="color: #990000;">&lt;</span>n<span style="color: #990000;">;</span> i<span style="color: #990000;">++)</span>
  <span style="color: #ff0000;">{</span>
    <span style="color: #008080;">uint16x8_t</span>  temp<span style="color: #990000;">;</span>
    <span style="color: #008080;">uint8x8x3_t</span> rgb  <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vld3_u8</span></strong> <span style="color: #990000;">(</span>src<span style="color: #990000;">);</span>
    <span style="color: #008080;">uint8x8_t</span> result<span style="color: #990000;">;</span>

    temp <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vmull_u8</span></strong> <span style="color: #990000;">(</span>rgb<span style="color: #990000;">.</span>val<span style="color: #990000;">[</span><span style="color: #993399;">0</span><span style="color: #990000;">],</span>      rfac<span style="color: #990000;">);</span>
    temp <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vmlal_u8</span></strong> <span style="color: #990000;">(</span>temp<span style="color: #990000;">,</span>rgb<span style="color: #990000;">.</span>val<span style="color: #990000;">[</span><span style="color: #993399;">1</span><span style="color: #990000;">],</span> gfac<span style="color: #990000;">);</span>
    temp <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vmlal_u8</span></strong> <span style="color: #990000;">(</span>temp<span style="color: #990000;">,</span>rgb<span style="color: #990000;">.</span>val<span style="color: #990000;">[</span><span style="color: #993399;">2</span><span style="color: #990000;">],</span> bfac<span style="color: #990000;">);</span>

    result <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vshrn_n_u16</span></strong> <span style="color: #990000;">(</span>temp<span style="color: #990000;">,</span> <span style="color: #993399;">8</span><span style="color: #990000;">);</span>
    <strong><span style="color: #000000;">vst1_u8</span></strong> <span style="color: #990000;">(</span>dest<span style="color: #990000;">,</span> result<span style="color: #990000;">);</span>
    src  <span style="color: #990000;">+=</span> <span style="color: #993399;">8</span><span style="color: #990000;">*</span><span style="color: #993399;">3</span><span style="color: #990000;">;</span>
    dest <span style="color: #990000;">+=</span> <span style="color: #993399;">8</span><span style="color: #990000;">;</span>
  <span style="color: #ff0000;">}</span>
<span style="color: #ff0000;">}</span>

</tt></pre>
<p>Let&#8217;s take a look at it step by step:</p>
<p>First off I load my weight factors into three NEON registers. The vdup.8 instruction does this and also replicates the byte into all 8 bytes of the NEON register.</p>
<pre><tt>    <span style="color: #008080;">uint8x8_t</span> rfac <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vdup_n_u8</span></strong> <span style="color: #990000;">(</span><span style="color: #993399;">77</span><span style="color: #990000;">);</span>
    <span style="color: #008080;">uint8x8_t</span> gfac <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vdup_n_u8</span></strong> <span style="color: #990000;">(</span><span style="color: #993399;">151</span><span style="color: #990000;">);</span>
    <span style="color: #008080;">uint8x8_t</span> bfac <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vdup_n_u8</span></strong> <span style="color: #990000;">(</span><span style="color: #993399;">28</span><span style="color: #990000;">);</span> 

</tt></pre>
<p>Now I load 8 pixels at once into three registers.</p>
<pre><tt>    <span style="color: #008080;">uint8x8x3_t</span> rgb  <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vld3_u8</span></strong> <span style="color: #990000;">(</span>src<span style="color: #990000;">);</span>

</tt></pre>
<p>The vld3.8 instruction is a specialty of the NEON instruction set. With NEON you can not only do loads and stores of multiple registers at once, you can de-interleave the data on the fly as well. Since I expect my pixel data to be interleaved (red, green, blue, red, green, blue and so on), the vld3.8 instruction is a perfect fit for a tight loop.</p>
<p>After the load, I have all the red components of 8 pixels in the first loaded register. The green components end up in the second and blue in the third.</p>
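<p>To make the de-interleaving a bit more concrete, here is a plain C model of what the load does to the data. It only illustrates the resulting layout and says nothing about how the hardware does it. The names planes8 and deinterleave8 are made up for this sketch:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for uint8x8x3_t: three "registers" of 8 bytes. */
typedef struct
{
  uint8_t val[3][8];
} planes8;

/* Scalar model of vld3_u8: reads 24 interleaved bytes (RGBRGB...) and
   sorts them into one group of 8 lanes per color component. */
planes8 deinterleave8 (const uint8_t * src)
{
  planes8 out;
  int lane, c;
  assert (src != 0);
  for (lane = 0; lane < 8; lane++)
    for (c = 0; c < 3; c++)
      out.val[c][lane] = src[lane * 3 + c];
  return out;
}
```

<p>After the call, val[0] holds the eight red values, val[1] the green ones and val[2] the blue ones, which is the layout rgb.val[0..2] has in the intrinsic code.</p>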
<p>Now calculate the weighted average:</p>
<pre><tt>    temp <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vmull_u8</span></strong> <span style="color: #990000;">(</span>rgb<span style="color: #990000;">.</span>val<span style="color: #990000;">[</span><span style="color: #993399;">0</span><span style="color: #990000;">],</span>      rfac<span style="color: #990000;">);</span>
    temp <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vmlal_u8</span></strong> <span style="color: #990000;">(</span>temp<span style="color: #990000;">,</span>rgb<span style="color: #990000;">.</span>val<span style="color: #990000;">[</span><span style="color: #993399;">1</span><span style="color: #990000;">],</span> gfac<span style="color: #990000;">);</span>
    temp <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vmlal_u8</span></strong> <span style="color: #990000;">(</span>temp<span style="color: #990000;">,</span>rgb<span style="color: #990000;">.</span>val<span style="color: #990000;">[</span><span style="color: #993399;">2</span><span style="color: #990000;">],</span> bfac<span style="color: #990000;">);</span>

</tt></pre>
<p>vmull.u8 multiplies each byte of the first argument with the corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair.</p>
<p>vmlal.u8 does the same thing as vmull.u8 but also adds the products to the content of the accumulator register. Because the three weights sum up to 256, the accumulated value can&#8217;t exceed 255 * 256 = 65280 and still fits into 16 bits.</p>
<p>So we end up with just three instructions for the weighted average of eight pixels. Nice.</p>
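<p>If you want to convince yourself that the 16 bit lanes are wide enough, a scalar model of a single lane helps. This is just a sketch in plain C; the function names mull_lane, mlal_lane and weighted_sum are made up and only mimic what one lane of the NEON instructions computes:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of vmull_u8: 8 bit x 8 bit -> 16 bit. */
uint16_t mull_lane (uint8_t a, uint8_t b)
{
  return (uint16_t) (a * b);
}

/* Scalar model of one lane of vmlal_u8: acc + a * b, kept in 16 bit.
   The lane does not saturate, so the model wraps modulo 65536 as well. */
uint16_t mlal_lane (uint16_t acc, uint8_t a, uint8_t b)
{
  return (uint16_t) (acc + a * b);
}

/* Weighted sum of one pixel, built the same way as in the NEON loop. */
uint16_t weighted_sum (uint8_t r, uint8_t g, uint8_t b)
{
  uint16_t temp;

  // worst case is 255 * (77 + 151 + 28) = 65280, so nothing can wrap:
  assert (r * 77 + g * 151 + b * 28 <= UINT16_MAX);

  temp = mull_lane (r, 77);
  temp = mlal_lane (temp, g, 151);
  temp = mlal_lane (temp, b, 28);
  return temp;
}
```

<p>A white pixel gives 65280, which is the largest value the accumulator ever sees. With weights that sum up to more than 257 the accumulator could wrap around.</p>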
<p>Now it&#8217;s time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits. This is equivalent to a division by 256. ARM NEON has lots of instructions to do the shift, but there is also a &#8220;narrow&#8221; variant. This one does two things at once: It does the shift and afterwards converts the 16 bit integers back to 8 bit by removing all the high-bytes from the result. We get back from the 128 bit register pair to a single 64 bit register.</p>
<pre><tt>    result <span style="color: #990000;">=</span> <strong><span style="color: #000000;">vshrn_n_u16</span></strong> <span style="color: #990000;">(</span>temp<span style="color: #990000;">,</span> <span style="color: #993399;">8</span><span style="color: #990000;">);</span>

</tt></pre>
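<p>Again as a plain C model (the name shift_narrow8 is made up, this is only meant to show the arithmetic): shift first, then keep the low byte of each lane.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vshrn_n_u16 on 8 lanes: shift each 16 bit value to the
   right by `shift` bits and keep only the low byte of the shifted value. */
void shift_narrow8 (uint8_t out[8], const uint16_t in[8], int shift)
{
  int lane;

  // for 16 bit lanes the instruction accepts shift amounts from 1 to 8
  assert (shift >= 1 && shift <= 8);

  for (lane = 0; lane < 8; lane++)
    out[lane] = (uint8_t) (in[lane] >> shift);
}
```

<p>With a shift of 8 the value 65280 from a white pixel becomes 255 again, and nothing is lost by dropping the high byte because it is zero after the shift.</p>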
<p>And finally store the result.</p>
<pre><tt>    <strong><span style="color: #000000;">vst1_u8</span></strong> <span style="color: #990000;">(</span>dest<span style="color: #990000;">,</span> result<span style="color: #990000;">);</span>

</tt></pre>
<h3>First Results:</h3>
<p>How do the reference C-function and the NEON optimized version compare? I did a test on the OMAP3 Cortex-A8 CPU of my beagle-board and got the following timings:</p>
<pre>C-version:       15.1 cycles per pixel.
NEON-version:     9.9 cycles per pixel.</pre>
<p>That&#8217;s only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all. What&#8217;s going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:</p>
<pre> 160:   f46a040f        vld3.8  {d16-d18}, [sl]
 164:   e1a0c005        mov     ip, r5
 168:   ecc80b06        vstmia  r8, {d16-d18}
 16c:   e1a04007        mov     r4, r7
 170:   e2866001        add     r6, r6, #1      ; 0x1
 174:   e28aa018        add     sl, sl, #24     ; 0x18
 178:   e8bc000f        ldm     ip!, {r0, r1, r2, r3}
 17c:   e15b0006        cmp     fp, r6
 180:   e1a08005        mov     r8, r5
 184:   e8a4000f        stmia   r4!, {r0, r1, r2, r3}
 188:   eddd0b06        vldr    d16, [sp, #24]
 18c:   e89c0003        ldm     ip, {r0, r1}
 190:   eddd2b08        vldr    d18, [sp, #32]
 194:   f3c00ca6        vmull.u8        q8, d16, d22
 198:   f3c208a5        vmlal.u8        q8, d18, d21
 19c:   e8840003        stm     r4, {r0, r1}
 1a0:   eddd3b0a        vldr    d19, [sp, #40]
 1a4:   f3c308a4        vmlal.u8        q8, d19, d20
 1a8:   f2c80830        vshrn.i16       d16, q8, #8
 1ac:   f449070f        vst1.8  {d16}, [r9]
 1b0:   e2899008        add     r9, r9, #8      ; 0x8
 1b4:   caffffe9        bgt     160</pre>
<p>Note the store at offset 168? The compiler decides to write the three registers onto the stack. After a bit of useless memory copying on the GPP (general purpose ARM core) side the compiler reloads them (offsets 188, 190 and 1a0) into nearly the same NEON registers they came from.</p>
<p>What do all the ordinary integer instructions do? I have no idea. Lots of memory accesses target the stack for no good reason. There is definitely no shortage of registers anywhere. For reference: I used the GCC 4.3.3 (CodeSourcery 2009q1 lite) compiler.</p>
<h3>NEON and assembler</h3>
<p>Since the compiler can&#8217;t generate good code, I wrote the same loop in assembler. In a nutshell, I just took the intrinsic based loop and converted the instructions one by one. The loop-control is a bit different, but that&#8217;s all.</p>
<pre><tt>convert_asm_neon<span style="color: #990000;">:</span>

<strong><span style="color: #000080;">      # r0</span></strong><span style="color: #990000;">:</span> Ptr to destination data
<strong><span style="color: #000080;">      # r1</span></strong><span style="color: #990000;">:</span> Ptr to source data
<strong><span style="color: #000080;">      # r2</span></strong><span style="color: #990000;">:</span> Number of pixels (divided by 8 below to get the iteration count)

      push        <span style="color: #ff0000;">{</span>r4<span style="color: #990000;">-</span>r5<span style="color: #990000;">,</span> lr<span style="color: #ff0000;">}</span>
      <span style="color: #008080;">lsr</span>         r2<span style="color: #990000;">,</span> r2<span style="color: #990000;">,</span> #<span style="color: #993399;">3</span>

<strong><span style="color: #000080;">      # build</span></strong> the <span style="color: #008080;">three</span> constants<span style="color: #990000;">:</span>
      <span style="color: #008080;">mov</span>         r3<span style="color: #990000;">,</span> #<span style="color: #993399;">77</span>
      <span style="color: #008080;">mov</span>         r4<span style="color: #990000;">,</span> #<span style="color: #993399;">151</span>
      <span style="color: #008080;">mov</span>         r5<span style="color: #990000;">,</span> #<span style="color: #993399;">28</span>
      vdup<span style="color: #990000;">.</span><span style="color: #993399;">8</span>      d3<span style="color: #990000;">,</span> r3
      vdup<span style="color: #990000;">.</span><span style="color: #993399;">8</span>      d4<span style="color: #990000;">,</span> r4
      vdup<span style="color: #990000;">.</span><span style="color: #993399;">8</span>      d5<span style="color: #990000;">,</span> r5

  <span style="color: #990000;">.</span>loop<span style="color: #990000;">:</span>

<strong><span style="color: #000080;">      # load</span></strong> <span style="color: #993399;">8</span> pixels<span style="color: #990000;">:</span>
      vld3<span style="color: #990000;">.</span><span style="color: #993399;">8</span>      <span style="color: #ff0000;">{</span>d0<span style="color: #990000;">-</span>d2<span style="color: #ff0000;">}</span><span style="color: #990000;">,</span> <span style="color: #990000;">[</span>r1<span style="color: #990000;">]!</span>

<strong><span style="color: #000080;">      # do</span></strong> the <span style="color: #008080;">weighted</span> average<span style="color: #990000;">:</span>
      vmull<span style="color: #990000;">.</span><span style="color: #008080;">u8</span>    q3<span style="color: #990000;">,</span> d0<span style="color: #990000;">,</span> d3
      vmlal<span style="color: #990000;">.</span><span style="color: #008080;">u8</span>    q3<span style="color: #990000;">,</span> d1<span style="color: #990000;">,</span> d4
      vmlal<span style="color: #990000;">.</span><span style="color: #008080;">u8</span>    q3<span style="color: #990000;">,</span> d2<span style="color: #990000;">,</span> d5

<strong><span style="color: #000080;">      # shift</span></strong> <span style="color: #008080;">and</span> store<span style="color: #990000;">:</span>
      vshrn<span style="color: #990000;">.</span><span style="color: #008080;">u16</span>   d6<span style="color: #990000;">,</span> q3<span style="color: #990000;">,</span> #<span style="color: #993399;">8</span>
      vst1<span style="color: #990000;">.</span><span style="color: #993399;">8</span>      <span style="color: #ff0000;">{</span>d6<span style="color: #ff0000;">}</span><span style="color: #990000;">,</span> <span style="color: #990000;">[</span>r0<span style="color: #990000;">]!</span>

      <span style="color: #008080;">subs</span>        r2<span style="color: #990000;">,</span> r2<span style="color: #990000;">,</span> #<span style="color: #993399;">1</span>
      bne         <span style="color: #990000;">.</span>loop

      pop         <span style="color: #ff0000;">{</span> r4<span style="color: #990000;">-</span>r5<span style="color: #990000;">,</span> pc <span style="color: #ff0000;">}</span>

</tt></pre>
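<p>One thing to keep in mind: the loop only processes full groups of eight pixels, and it is entered even if the count is zero after the shift. In that case the counter wraps around in the subs instruction and the loop keeps running far beyond the end of the image. A possible way to deal with both problems is a small C wrapper that passes the multiple-of-eight part to the fast routine and converts the remaining pixels in plain C. The wrapper, the convert_fn type and scalar_convert are my own sketch and not part of the benchmarked code. The prototype assumed for convert_asm_neon (destination, source, pixel count) follows the register comments above:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Signature shared by reference_convert, neon_convert and (assumed)
   convert_asm_neon: destination, source, number of pixels. */
typedef void (*convert_fn) (uint8_t * dest, uint8_t * src, int n);

/* Plain C conversion, same arithmetic as reference_convert. */
void scalar_convert (uint8_t * dest, uint8_t * src, int n)
{
  int i;
  for (i = 0; i < n; i++)
  {
    int r = *src++;
    int g = *src++;
    int b = *src++;
    *dest++ = (uint8_t) ((r * 77 + g * 151 + b * 28) >> 8);
  }
}

/* Hypothetical wrapper: `bulk` gets the largest multiple of 8 pixels,
   the remaining 0..7 pixels are converted with scalar code. */
void convert_any_length (convert_fn bulk, uint8_t * dest, uint8_t * src, int n)
{
  int bulk_n;

  assert (n >= 0);
  bulk_n = n & ~7; // round down to a multiple of 8

  // never call the fast routine with a count of zero
  if (bulk_n > 0)
    bulk (dest, src, bulk_n);

  // scalar tail for whatever is left over
  scalar_convert (dest + bulk_n, src + bulk_n * 3, n - bulk_n);
}
```

<p>On the board you would pass convert_asm_neon as the bulk routine. On a PC you can pass scalar_convert instead to check that the wrapper itself does the right thing.</p>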
<h3>Final Results:</h3>
<p>Time for some benchmarking again. How does the hand-written assembler version compare? Well &#8211; here are the results:</p>
<pre>  C-version:       15.1 cycles per pixel.
  NEON-version:     9.9 cycles per pixel.
  Assembler:        2.0 cycles per pixel.</pre>
<p>That&#8217;s roughly a factor of five over the intrinsic version and 7.5 times faster than my not-so-bad C implementation. And keep in mind: I didn&#8217;t even optimize the assembler loop.</p>
<p>My conclusion: If you want performance out of your NEON unit stay away from the intrinsics. They are nice as a prototyping tool. Use them to get your algorithm working and then rewrite the NEON-parts of it in assembler.</p>
<p>Btw: Sorry for the ugly syntax-highlighting. I&#8217;m still looking for a nice wordpress plug-in.</p>
]]></content:encoded>
			<wfw:commentRss>http://hilbert-space.de/?feed=rss2&#038;p=22</wfw:commentRss>
		<slash:comments>32</slash:comments>
		</item>
		<item>
		<title>Compiling CMEM for the Beagleboard&#8230;</title>
		<link>http://hilbert-space.de/?p=14</link>
		<comments>http://hilbert-space.de/?p=14#comments</comments>
		<pubDate>Tue, 03 Nov 2009 08:51:19 +0000</pubDate>
		<dc:creator>Nils</dc:creator>
				<category><![CDATA[Beagleboard]]></category>
		<category><![CDATA[DSP]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[OMAP3530]]></category>

		<guid isPermaLink="false">http://hilbert-space.de/?p=14</guid>
		<description><![CDATA[Since I tend to forget these things, here&#8217;s a little tutorial how to compile the Texas Instruments CMEM and SDMA kernel-modules for the beagleboard. I don&#8217;t like the codec-engine build process, therefore I&#8217;ll compile the kernels by hand. So what&#8217;s CMEM all about? In a nutshell CMEM is a kernel-module that allows you to allocate [...]]]></description>
			<content:encoded><![CDATA[<p>Since I tend to forget these things, here&#8217;s a little tutorial on how to compile the Texas Instruments CMEM and SDMA kernel-modules for the beagleboard. I don&#8217;t like the codec-engine build process, therefore I&#8217;ll compile the kernel-modules by hand.</p>
<h3><span style="text-decoration: underline;">So what&#8217;s CMEM all about?</span></h3>
<p>In a nutshell CMEM is a kernel-module that allows you to allocate contiguous memory on the OMAP3 and map this memory into the address-space of a user-mode program so you can read and write to it.</p>
<p>CMEM also gives you the physical address of these memory-blocks.</p>
<p>This is important if you want to share some memory with the C64x+ DSP as the DSP has no idea what the memory manager of the Cortex-A8 is doing. It also allows linux user-mode programs to allocate memory that can be used with DMA.</p>
<h3>Things you need:</h3>
<ul>
<li>The sources of the linuxutils package from the <a href="http://software-dl.ti.com/dsps/dsps_registered_sw/sdo_sb/targetcontent/linuxutils/index.html" target="_blank">TI website</a> (registration is required but free). I&#8217;ve used release 2.24 which works fine with my 2.6.29-omap1 kernel image.</li>
<li>The linux kernel-sources for the beagleboard. If you use OpenEmbedded and you have already compiled an image you&#8217;ll most likely find them at $OE_HOME/tmp/staging/beagleboard-angstrom-linux-gnueabi/kernel/.</li>
<li>A cross-compiler toolchain for ARM. I still use the CodeSourcery 2007q3 lite release. Works for me.</li>
<li>A beagleboard. Although not strictly required, it makes perfect sense to have one.</li>
</ul>
<h3>How to compile CMEM:</h3>
<ol>
<li>Untar the linuxutils package. Where you untar it is not important.</li>
<li>Go into the CMEM subfolder. For the 2.24 release it&#8217;s the ./packages/ti/sdo/linuxutils/cmem/ folder.</li>
<li>Take a look at the Rules.make file. Messy, ain&#8217;t it? Remove the write protection (chmod +w Rules.make will do that). You now have to adjust the paths in that file or, if you&#8217;re like me, delete it and write it from scratch. Here is my copy with everything not needed removed:
<pre>
# path to your toolchain. Yes, you need to set it twice (don't ask...)
MVTOOL_PREFIX=/opt/CodeSourcery/bin/arm-none-linux-gnueabi-
UCTOOL_PREFIX=/opt/CodeSourcery/bin/arm-none-linux-gnueabi-

# path to the kernel-sources:
LINUXKERNEL_INSTALL_DIR=${OE_HOME}/tmp/staging/beagleboard-angstrom-linux-gnueabi/kernel

# some config things:
USE_UDEV=1
MAX_POOLS=128</pre>
</li>
<li>That&#8217;s it. If all paths are correct &#8220;make release&#8221; should build the kernel module and some test applications.</li>
</ol>
<h3>How to test CMEM:</h3>
<ol>
<li>Copy the kernel-module to the beagleboard. For the test I&#8217;ve just copied it into /home/root/. You&#8217;ll find the kernel-module at ./src/module/cmemk.ko.</li>
<li>On the board, check your U-Boot boot-parameters. Since CMEM manages physical memory you have to restrict the amount of memory managed by linux. To put aside some memory add the mem=80M directive to the bootargs. You can of course use a different setting if you want to, but the following examples assume 80M for the linux-kernel and the rest for DSP and CMEM.</li>
<li>Boot the beagle and login as root.</li>
<li>Load the kernel-module. Let&#8217;s keep things simple and create a single 16 MB memory pool. To do so load the module like this:
<pre>/sbin/insmod cmemk.ko pools=1x1000000 phys_start=0x85000000 phys_end=0x86000000</pre>
<p>If everything worked as expected you&#8217;ll find the following line in the kernel-log (type dmesg to get it):</p>
<pre>cmem initialized 1 pools between 0x85000000 and 0x86000000</pre>
<p>If not &#8211; well &#8211; CMEM will give you a bunch of hints in the kernel-log if it had problems during initialization. Most likely you&#8217;ve got the addresses wrong. As the start-address you should pass 0x80000000 plus the size you&#8217;ve specified in the u-boot bootargs. Add the sizes of all of your CMEM-pools to the start address and use the result as the end address.</p></li>
<li>While the module is loaded you&#8217;ll find a file under /proc/cmem with some statistics.</li>
<li>If everything worked so far you can run some of the demo-applications like apitest. They are located in the ./apps/apitest/ folders.</li>
</ol>
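<p>The address arithmetic from the module-loading step can be written down in a few lines of C. This is just a worked example of the rule described there. It assumes that RAM starts at 0x80000000 as in the text above, and the function names are made up:</p>

```c
#include <assert.h>
#include <stdint.h>

#define RAM_BASE 0x80000000u      /* base address used in the text above */
#define MIB      (1024u * 1024u)

/* phys_start: first byte behind the memory given to linux via mem=<n>M */
uint32_t cmem_phys_start (uint32_t linux_mem_mib)
{
  return RAM_BASE + linux_mem_mib * MIB;
}

/* phys_end: start address plus the sum of all pool sizes (in bytes) */
uint32_t cmem_phys_end (uint32_t phys_start, uint32_t total_pool_bytes)
{
  // the end address must not wrap around the 32 bit address space
  assert (total_pool_bytes <= UINT32_MAX - phys_start);
  return phys_start + total_pool_bytes;
}
```

<p>For mem=80M the start address comes out as 0x85000000, and adding 16 MB on top of it gives 0x86000000. These are the values used in the insmod line.</p>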
<h3>Compile an ARM program that uses CMEM:</h3>
<p style="padding-left: 30px;">This is easy. Copy ./src/interface/cmem.h to a place where the cross-compiler will find it and add one of the cmem.a libraries to your project. Since I like to keep things simple I&#8217;ve just added the interface source to my project. It&#8217;s ./src/interface/cmem.c.</p>
<p>Now you can allocate contiguous memory and get the physical address of it. Big deal, eh? Honestly, like I said CMEM only makes sense if you want to make use of the C64x+ DSP or the SDMA of the OMAP3.</p>
]]></content:encoded>
			<wfw:commentRss>http://hilbert-space.de/?feed=rss2&#038;p=14</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
	</channel>
</rss>

