SWP Tracer/Sniffer

It’s possible to modify the SWP transceiver front-end circuit presented earlier into a fully functional SWP tracer/sniffer.

swp-for-web-tracer

This is almost the same circuit as the SWP transceiver but I moved some parts around to make the signal flow a bit clearer.

I don’t drive the SWP_TX signal myself anymore. Instead, SWP_TX becomes an output that is directly connected to an SWP master (aka an NFC controller chip).

The SWP slave / SWP sim part remains the same, and so does the TX/RX signal splitter circuit.

The trick to getting it working is to feed the extracted RX signal back as a current sink on the SWP master side.

Here it’s done with a fast BFS20 transistor as a switch.

R1 defines the current. At the SWP signaling voltage of 1.8V the 1.8kOhm resistor sinks roughly 1mA. Due to the VCE(sat) of Q1 the actual current is slightly lower, but that’s okay: the specification allows anything down to 600µA, so we’re on the safe side here.
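
For reference, a quick back-of-the-envelope check. This is only a sketch; the 0.2V VCE(sat) figure is an assumed typical value, not taken from the BFS20 datasheet:

#include <stdio.h>

int main(void)
{
    const double v_swp   = 1.8;    /* SWP signaling voltage [V]  */
    const double vce_sat = 0.2;    /* assumed VCE(sat) of Q1 [V] */
    const double r1      = 1800.0; /* R1 [Ohm]                   */

    /* current sunk from the SWP line while Q1 is switched on */
    double i_sink = (v_swp - vce_sat) / r1;

    printf("sink current: %.0f uA (spec minimum: 600 uA)\n", i_sink * 1e6);
    return 0;
}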

R2 is the usual base resistor with C1 acting as a speed-up capacitor to improve the switching speed.

The IO voltages are 1.8V for SWP_TX and 3.3V for SWP_RX.

You can now connect SWP_TX and SWP_RX to a logic-analyzer/micro-controller and trace away.

For this circuit you can’t replace the parts with slower devices: the signal already takes a complete round-trip through the opamp and the comparator, and the overall propagation delay has to stay small enough not to cause any confusion on the SWP master side.

Happy Hacking!


SWP Reader – The Analog Part

I think it’s time to lift the curtain and show what the analog part of my SWP reader project looks like. This is the exact same circuit that I’ve used in the last two prototypes. I’m going to describe the circuit block by block and show the whole thing at the very end.

Signals

There are a bunch of signals and voltages that connect to the circuit. Those are:

  • 5V: main supply voltage, taken from USB
  • V+: 9.5V, a higher voltage to supply the opamps
  • V-: -4.5V, negative supply for the opamp and comparator
  • SWP_TX: SWP transmit signal, 3.3V
  • SWP_RX: SWP receive signal, 3.3V
  • SIM_SWP: connects to the C6 pin of the SIM-card
  • DAC: supplies a reference voltage between 0 and 3.3V

Transceiver Front-End
swp-analog-1

This is the heart of the SWP transceiver. It takes the digital TX signal and sends it to the SIM-card while extracting the SWP RX signal by measuring the current drawn by the SIM. The architecture is built around a transimpedance amplifier circuit with some tweaks.

R1/R2 form a voltage divider that converts the incoming 3.3V signal to 1.8V. R2 also does double duty as a pull-down. The level-shifted signal feeds directly into the non-inverting input of U1.

The SIM-card SWP pin is directly connected to the inverting opamp input. Yes, I’m using an opamp input as an output here. Since negative feedback is present via R3, the voltage at the two inputs will always be very close to each other, so the SIM will always see the SWP_TX signal.

R3 and the opamp itself are where the magic takes place. Any current that flows into the SIM card causes a voltage drop across R3, and we see this voltage drop at the opamp output, added on top of the input signal.
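
In other words, the opamp output is simply the TX voltage plus the IR drop across R3. Here is a minimal sketch of that relation; the R3 value of roughly 1.8kOhm is inferred from the 1mA/1.8V figure below, not read off the schematic:

/* Mixed TX/RX voltage at the opamp output, assuming R3 ~ 1.8 kOhm. */
double swp_mixed_output(double v_tx, double i_sim)
{
    const double r3 = 1800.0;   /* assumed transimpedance resistor [Ohm] */
    return v_tx + i_sim * r3;   /* TX level plus the IR drop across R3   */
}

/* TX high (1.8V) while the SIM signals a one (1mA)  -> 3.6V */
/* TX high (1.8V) while the SIM signals a zero (0mA) -> 1.8V */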

With a maximum SWP signaling current of 1mA we’ll see 1.8V for RX plus 1.8V for the TX signal. Here is a simulation screen-shot with a pulse-train of one-bits on the TX and alternating ones and zeros on the RX:

real-out

The spikes on the edges are caused by the parasitic capacitance of the SIM card and its socket. When switching, the charge stored in this capacitance causes a very short burst of current flow. This manifests itself as the spikes on the signal transitions. The ringing is not present at the SIM card terminal though.

If you want to substitute another opamp, make sure that the gain-bandwidth product and the slew rate are sufficient. You need a fast part. I would not go below 100 MHz GBW and 100 V/µs slew-rate. The LT1227 does a really good job here.

RX Signal Extractor

This part is straightforward. The mixed RX/TX signal gets converted back to a digital signal with the help of a comparator.

swp-analog-2

First, the R4/R5 voltage divider brings the mixed RX/TX signal voltage down into a safe range. This is necessary because the power-hungry LT1016 gets powered from the 5V rail instead of the (rather weak) V+. The reference signal from the DAC gets a bit of noise-filtering via R10/C5. Finally, R7 provides a good deal of hysteresis for a clean output signal.

The conversion of the comparator output down to 3.3V level is a bit dirty but has worked fine so far. I just load the output of the comparator with R8. This reduces the output voltage to the required level and also provides some termination. If you want to be on the safe side you’d rather put a zener diode clamp in here. R9 and C4 limit the slew-rate to something sane and reduce EMI.

Power Supply

Not much to see here. The power-supply is built around a LT1054 buck-boost converter. That’s pretty much the same chip as the ICL7660, MAX1044 and others. I used low ESR ceramic chip capacitors exclusively.

There is some ripple left on the generated voltages, but that is not causing any issues. The digital outputs look clean and communication over SWP works perfectly.

swp-analog-3

The SD103BW are very cool Schottky diodes by the way. Good spec and *cheap*. They also survive 1.5A peak current. Robust little buggers.

Closing

That’s all the analog stuff you need to talk SWP with a SIM-card. How you generate the SWP signals is up to you. I use a Xilinx CPLD for this which talks SPI to a micro-controller and drives/samples the SWP signals from this analog circuit. I’ll likely write about this another day.

If you want to build something upon this circuit don’t forget that you also need to control the power and reset line of the SIM.

The circuit – as is – has no issues reaching the full 1.69 Mbit/s data-rate of the SWP bus. You can even run it at a higher speed without degrading the signal much. The SIM-card that I’m testing with stops responding above about 2 Mbit/s (way out of spec) but the signals themselves still look fine.

For completeness’ sake, here is the entire schematic in one image with the required decoupling capacitors added:

swp-for-web


SWP Reader Evolution

I’ve been working on my SWP reader for about a year now, so I think it’s a good time to dump some photos and show how the project evolved:

The first “proof of concept” prototype:

final-highres-web

For this prototype I decided to stay with plain old through-hole packages and build the analog part in a modular way. The restriction to through-hole packages had a great influence on what parts I could use because most of the good stuff is in SMD these days.

On the top I’ve used a XuLA-200 FPGA board, a really nice breakout for the Spartan 3A FPGA family. The analog part consists of (left to right): an LT1227 current feedback OpAmp as the main SWP transceiver and the good old LM311 as a one-bit A/D converter (too slow for high-speed SWP, but good enough up to 400 kbit/s). On the right there is a local power supply based around the LT1054 that generates the supply voltages for the OpAmp.

This prototype worked right away and was great for doing the first data exchanges between the SIM and my PC. In the end I decided to abandon the FPGA in favor of a microprocessor with better connectivity to the PC side.

Entering Prototype 2:

proto1

The FPGA is gone and has been replaced by a Cortex-M3 CPU. The blue board is an mbed LPC1768, which is quite nice and easy to work with (and no, mbed doesn’t force you into their online compiler anymore). The red board below is a Xilinx CoolRunner-II CPLD breakout from Dangerous Prototypes which I use to translate the data-stream from SPI to SWP and vice versa. Going from 200k FPGA gates down to 64 macrocells was not a big deal because most of the complex stuff is now running in software on the Cortex-M3. I even have plenty of macrocells and flip-flops available, so I can add some more shenanigans if I want to.

Finally, all the analog stuff is now in SMD packages on the black board. The circuit is almost the same as in the first prototype except that I’ve upgraded the comparator from the LM311 to the LT1016. Since I had to order samples from Linear anyway, I thought: “Let’s get one of the finest comparators as well”. This turned out to be a mistake because the chip was *way* too fast for my needs, so I had to slow it down with some external circuitry. In one of the next revisions I’ll change that chip to something cheaper and more sane.

Along with the stuff already seen I’ve also added a voltage tracker to power the SIM and a PWM to DC converter to set the comparator threshold voltage (both based on a good old and trusty LM358).

Here is another shot of the same board with USB connectors attached. Most of the wiring between boards is on the backside:

 

swp

I was really happy with how prototype 2 turned out. It was working fine; however, due to all the long connections running on the backside of the board the signal quality was questionable:

nullbits-web

These are SWP zero-bits transmitted at around 1.7 MHz, taken at the opamp output. You can clearly see how the clocks from the CPU and from the CPLD leak into the received signal. The signal was good enough not to cause any transmission errors though.

Nonetheless I decided to bring everything onto one board. The end result is this:

The Single Board Prototype:

swp-rev2-web

This was the first board I designed with KiCad and it turned out pretty well. Due to a bug in the version of KiCad I was using I missed two unrouted wires, so I had to patch the board. Nonetheless it’s working great. The micro-controller has been changed from the LPC1768 to the low pin count version, the LPC1758. The analog circuit is still almost the same except that there is no PWM to DC converter anymore. The LM358 is still there but now works exclusively as a voltage tracker for the SIM supply.

I’ve also added the ISO7816 interface, so the board is feature complete.

The Ethernet interface has not been assembled and probably never will be. Due to the two air-wires I’ll do another revision of this board anyway, and I’ll change the Ethernet PHY chip from the QFN package (not shown, it’s on the back of the board) to something in an LQFP package for easier hand assembly.

The next version will also get a new (cheaper) analog part. The chips from Linear are really nice and all, but they are so damn expensive. I’ll probably change the OpAmp to a TSH82 dual opamp, one half for the SWP transceiver job and the other to power the SIM. The comparator could be a TS3011. These parts are much cheaper and will still be more than fast enough. I’ll also get rid of the LT1054 buck-boost converter because no supply other than 5V will be needed anymore.

So, that’s it so far. Right now I’m working on the software side of things. I’m busy porting the mbed based C++ code to C and the new micro-controller. I’ll probably do a FreeRTOS port as well but I haven’t decided on this yet.


Ultra simple ISO-7816 Interface

While laying out a PCB for my SWP reader project I realized that I hadn’t ever tested the ISO-7816 (aka contact) interface. I probably forgot about it because it’s not all that difficult and not that interesting, but I’d rather see it working before I order PCBs.

So I spent an hour or two on the internet looking for inspiration on how other people did it. There are lots of specialized chips for this purpose out there, but sourcing is always a problem, and since it’s “just” a simple serial interface I was more interested in a simple hack that works.

Turns out there are a lot of simple SIM/Smart-card readers out there that just do this, and they pretty much all look like this:

iso1

Here Q1 is running as an open collector driver with R1 as a pull-up resistor. This transistor inverts the signal, which is why there is an additional inverter A1 in front of the base. R2 and C1 are the usual base-resistor and speed-up capacitor.

You’ll find variations of this basic circuit all over the net. Sometimes they omit the speed-up capacitor, sometimes you find buffers in the RX-UART path, but that’s it basically.

I heated up my soldering iron and gave this circuit a test drive, and lo and behold: It works as expected (aka good enough in practice).
So problem solved, move on.

Not so. In the middle of the night it came to me that almost the entire circuit is unnecessary. What does it really do? On the Q1 collector we see a replica of the TX-UART signal. The SIM card IO pin (which is just an open collector IO pin) is able to pull the signal down at will without causing a short to TX-UART.

RX-UART picks up this signal and echoes back either what comes from TX-UART or what comes from IO. Neither pin pulls the signal up; that’s what R1 is for.

So how about this:

iso2

It’s working just the same, just faster and with fewer parts.

When TX-UART is transmitting and IO is listening, the signal just passes through R1. If TX-UART stops transmitting, the UART goes into its idle state (logic high). This effectively ties R1 to VCC and we get exactly the same behaviour as with the more “complex” circuit.

If the SIM transmits something, SIM-IO just pulls the line down to ground. A bit of current will flow out of TX-UART, but that’s fine. Compared to driving an LED from a GPIO pin that’s nothing.

If you feel inspired to try this out, here is a short how-to:

  • Configure the UART on the micro-controller side for 9600 baud, 8 data bits, two stop-bits and even parity.
  • Power the SIM VCC pin and keep reset low.
  • Apply a clock signal of 372 times the UART baud-rate: 3.57 MHz.
  • Wait a little for the SIM/SmartCard to stabilize.
  • Raise the reset line to take the SIM-card out of reset.

And then watch the Answer to Reset (ATR) signature arriving at your micro-controller UART-RX pin. You’re now ready to implement the ISO7816-3 T=0 or T=1 protocol and do some real data-exchange. With practically any micro-controller and just a simple resistor.
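
In pseudo-C the activation sequence boils down to something like this. It’s only a sketch: uart_init(), sim_set_vcc(), sim_set_reset(), sim_set_clock() and delay_ms() are hypothetical board-support functions, not a real API, so replace them with whatever your micro-controller provides:

#include <stdint.h>

/* hypothetical board-support functions */
void uart_init(uint32_t baud, int databits, int stopbits, int even_parity);
void sim_set_vcc(int on);
void sim_set_reset(int level);
void sim_set_clock(uint32_t hz);
void delay_ms(uint32_t ms);

void iso7816_activate(void)
{
    uart_init(9600, 8, 2, 1);   /* 9600 baud, 8 data bits, 2 stop bits, even parity */
    sim_set_reset(0);           /* keep the card in reset                           */
    sim_set_vcc(1);             /* power the SIM VCC pin                            */
    sim_set_clock(3571200);     /* 372 x 9600 baud = ~3.57 MHz                      */
    delay_ms(10);               /* let the SIM/SmartCard stabilize                  */
    sim_set_reset(1);           /* release reset; the ATR follows on RX-UART        */
}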

Oh, by the way: you’re allowed to let the IO pin of the SIM sink up to 500µA, so if you get problems with stray capacitance just lower R1. The minimum values are (the quick check below reproduces them):

  • 3.6k for 1.8V supply
  • 6.6k for 3.3V supply
  • 10k for 5V supply.
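
A quick check of these values, assuming the full 500µA flows through R1:

#include <stdio.h>

int main(void)
{
    const double supplies[] = { 1.8, 3.3, 5.0 };  /* SIM supply voltages [V]   */
    const double i_max      = 500e-6;             /* max pull-down current [A] */

    for (int i = 0; i < 3; i++)
        printf("%.1fV supply -> R1 >= %.1f kOhm\n",
               supplies[i], supplies[i] / i_max / 1000.0);

    return 0;
}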

NFC SWP Physical Layer – How it works

As I’m currently building an NFC-SWP reader device I have to tackle quite a few challenges, simply because there is no single-chip solution out there that you can simply connect to USB and a SIM card. Most NFC controllers can of course talk SWP, but they will not work as a simple and transparent bridge. Therefore I’ll design my own solution.

To do so, it is crucial to understand how the protocol works on the lowest level.

The SWP physical layer is quite a unique thing. It allows the NFC SIM-card and the NFC controller to exchange data at a rate of 1.7 megabit/second full duplex over just a single wire. And not only do they transmit bidirectional data, they transmit a clock signal for synchronization as well.

 

How did they pull that off?

First have a look at the S1 signal. This is the signal that transmits the clock and the data from the NFC controller towards the NFC SIM. Each bit gets transmitted using a full cycle, and the pulse-width of the signal defines whether a zero or a one bit gets transmitted.

Here is a picture of the S1 signal transmitting a stream of zeroes:

s1-zeroes

Each bit starts with the rising edge. If the voltage-high duration is 25% of the cycle length, this is interpreted as a zero-bit. Likewise, if the voltage-high duration is 75% of the cycle length, a one-bit gets transmitted. Again a picture of one-bits to illustrate this:

s1-ones

Extracting the clock and data from this signal is easy. Each clock-cycle starts with the rising edge of the signal. For the clock extraction the falling edge can simply be ignored.

Getting data-bits is easy as well. All you have to do to extract the bits from this signal is to take a look at the voltage at the middle of the cycle. In the images I’ve aligned the numbers denoting the bit-value to this position. I use this method when I debug SWP signals taken with a logic-analyzer. In a hardware design it is probably not feasible because you never know where exactly the middle of the cycle is until the cycle has ended. I don’t know for sure, but I bet they measure the durations of the high and low periods and extract the bit from that.
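
Here is a minimal sketch of that duration-based decoding, as it could be applied when post-processing a logic-analyzer capture (the 0/1 sample-buffer format is an assumption, not any particular tool’s output):

#include <stddef.h>

/* Decode S1 bits from a capture: samples[] holds one 0/1 entry per sample.
 * A cycle runs from one rising edge to the next; the bit is a one if the
 * high phase is longer than the low phase (nominal 75% vs. 25% duty cycle),
 * a zero otherwise. Returns the number of decoded bits. */
size_t swp_decode_s1(const unsigned char *samples, size_t n,
                     unsigned char *bits, size_t max_bits)
{
    size_t nbits = 0;
    size_t i = 1;

    while (i < n && nbits < max_bits) {
        /* find the next rising edge (start of a cycle) */
        while (i < n && !(samples[i] && !samples[i - 1]))
            i++;
        if (i >= n)
            break;

        size_t t_high = 0, t_low = 0;
        while (i < n && samples[i])  { t_high++; i++; }  /* high phase */
        while (i < n && !samples[i]) { t_low++;  i++; }  /* low phase  */

        if (i < n)                    /* only count completed cycles */
            bits[nbits++] = (t_high > t_low);
    }
    return nbits;
}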

For completeness’ sake, here is a picture of a signal with some ones and zeros:

s1-bits

This is how one side of the communication works. I’ve simplified a bit and left out things like fall- and rise-times, voltage-levels, tolerances and so on. Also it is worth noting that the clock-rate is not fixed. The NFC-controller is allowed to change the clock rate at will as long as it stays within the allowed range.

How does the SIM transmit data?

Now we have seen how the NFC-controller talks to the SIM-card and how the synchronization works. But the SIM card probably wants to transmit data as well. How does this work?

The NFC-SIM cannot transmit data by applying a voltage to the SWP link because the link is always driven by the NFC-controller. However, the NFC-SIM can draw current from the SWP link without interfering with the NFC-controller.

Take a look at one of the S1 signal images again. In each clock cycle there will always be a voltage-high period. During these high periods the SIM can load the SWP link and draw some current. During the voltage-low periods it can’t, because to draw a current a voltage must be present.

This leads to the fact that the signal from the SIM (S2) will always be modulated by the signal S1 generated by the NFC-controller.

To make things a bit easier to understand, here is a picture of signals S1 and S2. The voltage domain is shown in blue while the current domain is shown in red.

First, S1 transmitting some one and zero bits while S2 (the SIM) transmits a stream of zeros:

s2-zeroes

Not much going on on the S2 signal. That’s how zero-bits look.

And now S1 transmitting the same data again while S2 is transmitting a stream of ones:

s2-ones

On the S2 signal, one bits become a copy of the S1 bits.

And finally both signals transmitting a bunch of bits:

s2-bits

You probably already guessed it: one bits in S2 become copies of S1, zero bits are just a flat line.

The NFC-controller can read the S2 bit-stream by measuring the current consumption of the SWP link just before it generates the falling edge.
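
The same idea as a small sketch: walk over synchronous S1/S2 captures and sample S2 right before each falling edge of S1 (again assuming plain 0/1 sample buffers from a logic analyzer; S2 is the current signal already converted to a digital level):

#include <stddef.h>

/* For every S1 cycle the S2 bit is whatever level S2 shows just before
 * S1's falling edge. Returns the number of decoded bits. */
size_t swp_decode_s2(const unsigned char *s1, const unsigned char *s2,
                     size_t n, unsigned char *bits, size_t max_bits)
{
    size_t nbits = 0;

    for (size_t i = 1; i < n && nbits < max_bits; i++) {
        if (!s1[i] && s1[i - 1])        /* falling edge of S1      */
            bits[nbits++] = s2[i - 1];  /* S2 level just before it */
    }
    return nbits;
}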

 

Timing, Timing and Levels:

In the general case, communication over SWP must be done at a bit-rate between 200 kilobit/s and 1 megabit/s. A SIM card may also announce the capability to go faster (up to 1.69 megabit/s) or slower (down to 100 kilobit/s).

And they mean it! While troubleshooting I tried to run SWP at a slower clock-rate. It didn’t work at all. There is some wiggle room, but not much.

The S1 (voltage) is in practice a 1.8V digital signal regardless of the SIM-card supply voltage. The levels that define the high and low regions change somewhat with the different supply voltage classes, but if you provide a clean digital 1.8V signal you won’t run into issues.

In practice it seems that NFC SIM-cards are very forgiving about the voltage applied to the SWP-pin as long as you don’t exceed the supply voltage.

Note that these are ballpark figures, within 10% or so of the real thing. You’ll find the exact values in the specification ETSI TS 102 613 (you’ll find it via Google), chapter 7.1.3.

The S2 (current) signaling definition is much simpler. Independent of the class, a logic high is defined by drawing 600µA or more. The SIM should not draw more than 1mA though.


Why did they come up with such an oddball protocol?

To be honest, I don’t know exactly.

A couple of years ago, while I was working on an NFC middleware, I had good contacts with an NFC chip manufacturer. While on site, during lunch, I asked exactly this question. The answer was more pragmatic than I expected:

During the stone ages of smart-cards the contact interface (aka the gold pads on your SIM that make contact with the phone) was defined. The same interface has been re-used by SIM-cards because SIMs *are* just small smart-cards.

Back then, writing to smart-cards required a dedicated programming voltage. Nowadays no one needs the programming voltage anymore, so the pin was always left unconnected. Since they wanted NFC functionality in their SIMs they repurposed this pin and use it for SWP now.

If they had two free pins we would probably have something much simpler. Doing stuff in the current domain has a price: it consumes power. Now SIM cards usually end up in mobile phones. Going low power and having a long standby time is a thing, so I’ve heard. Fortunately the SIM is not communicating over SWP most of the time.

There is one thing that I don’t understand at all though: going current domain and doing three things over a single wire: all fine. But if you read the SWP specification you’ll find a lot of places where some crazy timing constraints are required. The required bit-rates of up to 1 megabit per second, for example, let alone 1.7 megabit/second in the high-speed case: I can’t come up with any use-case that even remotely needs that much bandwidth.


About a failed Circuit Idea

I had a circuit idea in mind that I’ve never been able to try out. I’m working on a SWP reader device right now (that’s a device that should directly talk to NFC enabled SIM cards).

So recently, while browsing through semiconductor lists, I came across the TSH70 OpAmp. This is a great part, it’s much cheaper than the OpAmp I’m currently using, and the specs just about meet my minimum requirements.

Also, contrary to the OpAmp I’m currently using, it’s a normal voltage-feedback OpAmp. This allows for a new transceiver design that I came up with.

Schematic1

Here you’ll see the idea in action: the SWP-TX signal (1.8V digital logic) goes directly into the non-inverting input. The OpAmp circuit itself is just a voltage follower with a transistor booster (Q1).

The voltage seen at the C6 input of the SIM card should be identical to the SWP-TX signal, and in fact it is.

If the SIM card wants to transmit data it does so by drawing current (rule of thumb: 1mA equals a logic one). So the task is to measure current at high speed.

If the SIM card draws current, this current will be sourced by the Q1 emitter and not the OpAmp output. The majority of the current that passes through the Q1 emitter is in turn sourced from the collector.

At the top of the circuit you’ll see Q2 and Q3. These form a current mirror: whatever current is drawn from the Q2 collector will also be drawn from the Q3 collector.

Long story short: the current drawn by the SIM card will appear at the collector of Q3 (just mirrored). Adding a load resistor R5 converts this current into a voltage and we can measure it.
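
As a rough sketch of that last step (R5 and the decision threshold are placeholder values, not taken from the schematic):

/* The SIM current reappears (mirrored) at Q3's collector, and R5 turns it
 * into a voltage we can slice against a threshold. */
int swp_rx_bit(double i_sim)
{
    const double r5        = 1800.0;  /* assumed load resistor [Ohm]    */
    const double threshold = 0.9;     /* assumed decision threshold [V] */

    double v_rx = i_sim * r5;         /* ~1.8V for a 1mA one, ~0V for a zero */
    return v_rx > threshold;
}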

And here is how it looks in a SPICE simulation:

SimTrace1

Blue is the control signal that controls the SIM current. If it’s high the current flowing into C6 is 1mA.

Red is the SWP-TX signal. I’ll show it along with the other curves so you can see that the actual voltage across the SIM does not affect the output much.

Green is the SWP-RX signal.

This green signal looks great eh? Nice, defined edges, just a little bit of ripple. Very low propagation delay. I could directly connect this signal to a micro-processor pin and start reading data.

Except it won’t work like this. I completely forgot to add some parasitic capacitance across the emulated SIM card.

Here is the same circuit with C1 added in. It’s just a tiny 10pF capacitor that should emulate the capacitance of the SIM card itself along with sockets and so on.

Schematic2

And this is the signal response after the capacitor has been added:

SimTrace2

Now the received signal rings a lot, and the propagation delay also went completely through the roof. It’s almost half as long as the pulse itself!

What happened? Once the capacitor has been charged and the SWP-TX signal goes back to zero, there is no quick way for the capacitor to discharge. Q1 can only source current into C1, not sink any. The only way to lose charge and lower the voltage across C1 is to slowly leak it away through R4. And this completely messes up the negative feedback loop of the OpAmp (not its fault!).

I could lower R4 to allow for faster discharge, but then more current will flow through the transistors and mess up my nice green output signal.

I could probably replace Q1 with a proper push-pull stage. That’s something I’ll try one day. Right now I’m staying with my “tried and trusted” SWP analog front-end. It has a different topology where the parasitic capacitance actually speeds things up! It requires the much more expensive current-feedback OpAmp I’ve mentioned, but it doesn’t show this defect.

Lesson learned: small parasitics can mess up things much more than expected.


DSP default cache-sizes not optimal?

While debugging some DSP code yesterday I came across a performance oddity. Adding more code lowered the performance of an unrelated function.

By itself this is not *that* odd. It happens if the size of your code is larger than your first level code-cache and different functions start to kick each other out of the cache. However, in my little toy program this was unlikely. I had only around 20kb of code and the code-cache is 32kb in size.

Better safe than sorry, I thought, and took a look at how the caches are configured. Big and pleasant surprise: two of them are running at half their maximum size for no good reason.

In my case after DSP-boot I got:

Level 1 Data-Cache 32k
Level 1 Code-Cache 16k
Level 2 Cache      32k

However, the maximum possible cache sizes for the BeagleBoard are

Level 1 Data-Cache 32k (no change)
Level 1 Code-Cache 32k (16kb larger)
Level 2 Cache      64k (32kb larger)

So 48kb of valuable cache has been left unused. Changing the cache sizes is easy:

  #include <bcache.h>

  // and somewhere at the start of main()
  BCACHE_Size size;
  size.l1dsize = BCACHE_L1_32K;
  size.l1psize = BCACHE_L1_32K;
  size.l2size  = BCACHE_L2_64K;
  BCACHE_setSize (&size);

That still leaves you the 48kb of L1DSRAM for single cycle access and 32kb of L2RAM to talk with the video accelerators. Oh – and it gave a noticeable performance boost.

Btw- it’s very possible that this only applies to the DspLink configuration that I am using.

Update:

It turned out that the reason for the smaller cache-sizes is the default DspLink configuration. You can override this if you add the following lines to your project’s TCF-file. Just put them somewhere between utils.importFile("dsplink-omap3530-base.tci"); and prog.gen():

 
prog.module("GBL").C64PLUSL2CFG  = "64k";
prog.module("GBL").C64PLUSL1DCFG = "32k";
prog.module("GBL").C64PLUSL1PCFG = "32k";

var IRAM = prog.module("MEM").instance("IRAM");
IRAM.len = 32768;

This will configure the OMAP3530 DSP with:

L2 Cache      64kb
L1 Data-Cache 32kb
L1 Code-Cache 32kb
L1DSRAM       48kb
IRAM (L2 RAM) 32kb

Faster Cortex-A8 16-bit Multiplies

I did a small and fun assembler SIMD optimization job last week. The target architecture was ARMv6, but since the code will run on the iPhone I tried to keep the code fast on the Cortex-A8 as well.

When I did some profiling on my BeagleBoard, I got some surprising results: the code ran faster than it should. This was odd. Never happened to me before.

Fast forward 5 hours and lots of micro-benchmarking:

The 16 bit multiplies SMULxy on the Cortex-A8 are a cycle faster than documented!

They take one cycle for issue and have a result-latency of three cycles (rule of thumb, it’s a bit more complicated than that). And this applies to all variants of this instruction: SMULBB, SMULBT, SMULTT and SMULTB.

The multiply-accumulate variants of the 16 bit multiplies execute as documented: two cycles issue and three cycles result-latency.

This is nice. I have used the 16 bit multiplies a lot in the past but stopped using them because I thought they offered no benefit over the more general MUL instruction on the Cortex-A8. The SMULxy multiplies mix very well with the ARMv6 SIMD multiplies. Both of them work on 16 bit integers, but the SIMD instructions take packed 16 bit arguments while SMULxy works on a single 16 bit value per register, and you can specify whether you want the top or bottom 16 bits of each argument. Very flexible.

All this leads to nice code sequences. For example a three element dot-product of signed 16 bit quantities. Something that is used quite a lot for color-space conversion.

Assume this register-values on entry:

              MSB16      LSB16

          +----------+----------+
      r1: | ignored  |    a1    |
          +----------+----------+
          +----------+----------+
      r2: | ignored  |    a2    |
          +----------+----------+
          +----------+----------+
      r3: |    b1    |    c1    |
          +----------+----------+
          +----------+----------+
      r4: |    b2    |    c2    |
          +----------+----------+

And this code sequence:

    smulbb      r0, r1, r2      
    smlad       r0, r3, r4, r0
 

Gives a result: r0 = (a1*a2) + (b1*b2) + (c1*c2)

On the Cortex-A8 this will schedule like this:

  Cycle0:
  
    smulbb      r0, r1, r2          Pipe0
    nop                             Pipe1   (free, can be used for non-multiplies)
  
  Cycle1:        
  
    smlad       r0, r3, r4, r0      Pipe0
    nop                             Pipe1   (free, can be used for non-multiplies)
  
  Cycle2:        
  
    blocked, because smlad is a multi-cycle instruction.

The result (r0) will be available three cycles later (cycle 6) for most instructions. You can execute whatever you want in-between as long as you don’t touch r0.

Note that this is a special case: the SMULBB instruction in cycle 0 generates the result in R0. If the next instruction is one of the multiply-accumulate family and the register is used as the accumulate argument, a special forwarding path of the Cortex-A8 kicks in and the result latency is lowered to just one cycle. Cool, ain’t it?

Btw: thanks to Måns/MRU. He was so kind as to verify the timing on his BeagleBoard.


C64x+ DSP MMU faults, and how to disable the MMU.

Two days ago, while testing some image processing algorithms on the DSP I got the following message for the first time:


DSP MMU Error Fault!  MMU_IRQSTATUS = [0x1]. Virtual DSP addr reference that generated the interrupt = [0x85000000].

Outch!

I was aware that the DSP of the OMAP3530 has a memory management unit, but so far I had never had to deal with it. DspLink initialized the MMU and enabled access to all the DSP memory and all the peripherals I had accessed so far.

However, this time I passed a pointer to a memory block allocated via CMEM to the DSP. This triggered a page fault. Now what? I did some research and figured out that the CodecEngine enables access to the CMEM memory. I don’t use the CodecEngine, so I have to do it on my own.

Fortunately the TI folks have thought about that! The PROC module has functionality to modify the DSP MMU entries. Here are some ready to use functions:

int dsp_mmu_map (unsigned long physical_ptr, int size)
///////////////////////////////////////////////////////////
// Maps a physical memory region into the DSP address-space
{
  ProcMemMapInfo mapInfo;
  mapInfo.dspAddr = (Uint32)physical_ptr;
  mapInfo.size    = size;
  return DSP_SUCCEEDED(PROC_control(0, PROC_CTRL_CMD_MMU_ADD_ENTRY, &mapInfo));
 }


int dsp_mmu_unmap (unsigned long dspAddr, int size)
//////////////////////////////////////////////////////////////
// Unmaps a physical memory region from the DSP address-space
{
  ProcMemMapInfo mapInfo;
  mapInfo.dspAddr = (Uint32)dspAddr;
  mapInfo.size = size;
  return DSP_SUCCEEDED(PROC_control(0, PROC_CTRL_CMD_MMU_DEL_ENTRY, &mapInfo));
}

All nice and dandy now? No, it’s not. You can only map a limited number of memory regions (around 32 I think). That’s not enough for my needs. Also I don’t feel like tracking memory regions and swapping them as needed. So the way CodecEngine does it is probably better: enable access to the whole CMEM address-space. These two functions do this, but without CodecEngine:

int dsp_mmu_map_cmem (void)
/////////////////////////////////////////
// map first cmem block into the DSP MMU:
{
  CMEM_BlockAttrs info;

  if (CMEM_getBlockAttrs(0, &info) != 0)
    return 0;

  return dsp_mmu_map (info.phys_base, info.size);
}


int dsp_mmu_unmap_cmem (void)
//////////////////////////////////////////
// unmap the first cmem block from the DSP MMU:
{
  CMEM_BlockAttrs info;

  if (CMEM_getBlockAttrs(0, &info) != 0)
    return 0;

  return dsp_mmu_unmap (info.phys_base, info.size);
}

All problems solved. Great!

I could have stopped here, but I was eager to know if the MMU has any impact on the memory throughput. Is it possible to completely disable the MMU? Sure, this opens a can of worms: a bug in my code or a wrong DMA transfer could write to nearly any location. It could even erase the flash. But on the DaVinci I didn’t have an MMU and I never ran into such problems. So I did some research, and:

It is simple!

The MMU has a disable bit, and the TRM says that you have to do a soft-reset of the MMU if you fiddle with the settings. I gave it a try and it worked on the first attempt! You don’t even need a kernel-module for it. The following code will do all the magic from Linux user mode, under the restriction that you need read and write access to /dev/mem.

Call this between PROC_load and PROC_start:

// headers needed for open(), mmap() and clock():
#include <fcntl.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int dsp_mmu_disable (void)
//////////////////////////
// Disables the DSP MMU.
// Sets virtual = physical mapping.
{
  volatile unsigned long * mmu;
  int result = 0;

  // physical address of the MMU2 and some register offsets:
  const unsigned MMU2_PHYSICAL = 0x5d000000;
  const unsigned MMU_SYSCONFIG = 4;
  const unsigned MMU_SYSSTATUS = 5;
  const unsigned MMU_CNTL      = 17;

  // needs read+write access to /dev/mem, so you'd better run this as root.
  int fd = open("/dev/mem", O_RDWR);

  if (fd>=0)
  {
    // get a pointer to the MMU2 register space:
    mmu = (unsigned long *) mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, MMU2_PHYSICAL);

    if (mmu != MAP_FAILED)
    {
      clock_t start;

      // A timeout of 10 milliseconds is more than plenty.
      //
      // Usually the reset takes about 10 microseconds.
      // It never happened to me that the reset didn't
      // succeed, but better safe than sorry.
      clock_t timeout = (CLOCKS_PER_SEC/100);

      // start MMU soft-reset:
      mmu[MMU_SYSCONFIG] |= 1;

      // wait (with timeout) until the reset is complete.
      start = clock();
      while ((!mmu[MMU_SYSSTATUS]) && (clock()-start < timeout)) {}

      if (mmu[MMU_SYSSTATUS])
      {
        // disable MMU
        mmu[MMU_CNTL] =0;
        
        // set result to SUCCESS.
        result = 1;
      }
      // remove mapping:
      munmap((void*)mmu, 4096);
    }
    close (fd);
  }

  // result is 1 on success, 0 on failure:
  return result;
}

And to answer my own question: no, the MMU does not have any negative impact on the performance. Although the MMU tables reside in DDR2 memory, the pages are so large that the extra memory traffic for the MMU table-walks can’t even be measured.

Btw – I’ve made a little, easy-to-use library out of the above functions and I’ll release it under the BSD license, so everyone can use it. Get it here: dsp_mmu_util.tgz

The next question would be: can the DSP jail-break and disable its own MMU? That would be of little practical use but interesting to know...


More on EDMA3 on the BeagleBoard/OMAP3530

Didn’t I mention that the EDMA3 on the OMAP3530 is identical to the EDMA3 of the DaVinci? As I found out this is not exactly true. There is a subtle but important difference:

The order of the transfer-controllers has been reversed. On the DaVinci, TPTC0 was meant to be used for system-critical, low-latency transfers and TPTC1 for longer background tasks. On the OMAP3530 this order is exactly reversed. And by the way: ever wondered what the difference between those two controllers is? On the OMAP3530 the first controller has a FIFO length of 256 bytes while the second only has 128 bytes. The transfer speed is the same, but transfers issued on the controller with the shorter FIFO have lower latency, so the data reaches the destination a tad earlier.

Btw, while I fooled around with the EDMA I made some speed measurements. I think they might be interesting:

  • DSP DMA transfer, internal to DDR2 RAM: 550 MB/s
  • DSP CPU transfer (memset) to DDR2 RAM: 123 MB/s (outch!)
  • DSP CPU transfer (memset) to internal RAM: 3550 MB/s

For reference I made the same memset test on the CortexA8:

  • Cortex-A8 DDR2 memset (cached): 417 MB/s
  • Cortex-A8 DDR2 memset (uncached): 25 MB/s

All numbers were taken with the GPP clock at 500 MHz and the DSP clock at 360 MHz. Caches were enabled and the transfer size was one megabyte.
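
A minimal sketch of this kind of memset benchmark (GPP side, plain POSIX timing; not the exact code used for the numbers above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t size   = 1024 * 1024;  /* one megabyte, as in the numbers above */
    const int    rounds = 100;
    char *buf = malloc(size);

    if (!buf)
        return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds; i++)
        memset(buf, i, size);               /* the transfer under test */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("memset: %.1f MB/s\n", (double)rounds * size / (1024.0 * 1024.0) / secs);

    free(buf);
    return 0;
}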
