Async(ish) SPI.write?

Working libraries, libraries being ported and related hardware
heisan
Posts: 111
Joined: Thu May 31, 2018 5:39 pm

Async(ish) SPI.write?

Post by heisan » Sat Aug 04, 2018 7:25 pm

(Using Roger's core)

Playing around today, I noticed something quite interesting:

In libmaple's spi_tx:

Code: Select all

while ( (regs->SR & SPI_SR_TXE)==0 ) ; //while ( spi_is_tx_empty(dev)==0 ); // wait Tx to be empty
regs->DR = *dp8++;
In libraries/SPI.c:

Code: Select all

spi_tx_reg(_currentSetting->spi_d, data); // write the data to be transmitted into the SPI_DR register (this clears the TXE flag)
while (spi_is_tx_empty(_currentSetting->spi_d) == 0); // "5. Wait until TXE=1 ..."
while (spi_is_busy(_currentSetting->spi_d) != 0); // "... and then wait until BSY=0 before disabling the SPI." 
Basically libmaple waits until there is space in the txfifo, then adds another byte/word.

SPI class rather writes a byte/word into the txfifo, then waits until transmission is 100% complete.

When generating pixels through a minor computational task and spitting them out with SPI.write, I found the process to be around 3x slower than writing the same number of pixels by DMA.

I then added some simple helpers to SPI:

Code: Select all

void SPIClass::writeAsync(uint16 data)
{
    while (spi_is_tx_empty(_currentSetting->spi_d) == 0); // "5. Wait until TXE=1 ..."
    spi_tx_reg(_currentSetting->spi_d, data); // write the data to be transmitted into the SPI_DR register (this clears the TXE flag)
}
void SPIClass::writeAsyncEnd(void)
{
    while (spi_is_tx_empty(_currentSetting->spi_d) == 0); // "5. Wait until TXE=1 ..."
    while (spi_is_busy(_currentSetting->spi_d) != 0); // "... and then wait until BSY=0 before disabling the SPI." 
}
Basically, use the libmaple style writing to keep the txfifo packed and then wait for completion right at the end. This is only 30% slower than DMA.

Is there any reason the original SPI library was not written the same as libmaple? The current implementation seems unnecessarily slow?

Thanks,
Justin

stevestrong
Posts: 3053
Joined: Mon Oct 19, 2015 12:06 am
Location: Munich, Germany
Contact:

Re: Async(ish) SPI.write?

Post by stevestrong » Sun Aug 05, 2018 6:36 am

Can you provide a simple sketch to test?
Do u use single writes or buffered (multiple) writes?
Because using buffered write spi.write(buf, nr_bytes) will be almost as fast as with dma, for less then ~70 bytes even faster than with dma.

EDIT
If you refer to these lines: https://github.com/rogerclarkmelbourne/ ... #L107-L108
this is for buffered write, and they are optimized so that it will not wait for the end of the last transferred byte.
But the wait for the last byte must be there somewhere, and therefore it is transferred to SPI.cpp: https://github.com/rogerclarkmelbourne/ ... #L357-L358.
I really don't see any problem with that.

EDIT2
I think you mixed the single byte writes with the buffered write.
Your extract from SPI.c is for single write: https://github.com/rogerclarkmelbourne/ ... #L327-l328, this calls the (inline) function spi_tx_reg, not to mix with spi_tx !!!

heisan
Posts: 111
Joined: Thu May 31, 2018 5:39 pm

Re: Async(ish) SPI.write?

Post by heisan » Sun Aug 05, 2018 10:37 am

I will be in and out throughout the day. If I manage to sit down long enough, I will prepare a stand-alone sketch.

This is where the code is intended to be used (added to Adafruit_ILI9341_STM.h):

Code: Select all

#ifdef _ADAFRUIT_FB_H_
 void drawFB(int x, int y, Adafruit_FB1BPP& fb, uint16_t pal[]) {
  if(x < 0 || y < 0 || x + fb.fbwidth() > WIDTH || y + fb.fbheight() > HEIGHT) return;
  setAddrWindow(x, y, x + fb.fbwidth() - 1, y + fb.fbheight() - 1);
  int iy, ix;
  uint16_t p;
  int bp = 0;
  uint8_t m = 1;
  int ms = 0;
  for(iy = 0; iy < fb.fbheight(); iy++) {
   for(ix = 0; ix < fb.fbwidth(); ix++) {
    mSPI.writeAsync(pal[(fb.buffer[bp++] & m)>>ms]);
   }
   m <<= 1;
   ms++;
   if(m) {
    bp -= fb.fbwidth();
   } else {
    m = 1;
    ms = 0;
   }
  }
  mSPI.writeAsyncEnd();
  cs_set();
 }
#endif
Adafruit_FB is a stripped down version of the SSD1306 driver, which just renders to a 1BPP frame buffer. So this code expands the 1BPP frame buffer to 16 bit colours (through a palette map), and streams it to the display.

Since each pixel is computed on the fly, it can not be streamed unless it is buffered first - but then you have the same async problem when streaming the buffers. So the correct solution is to use double buffered async DMA - which starts to get quite complicated for something which is supposed to be a light-weight plug in.

An asynchronous SPI.write() at least doubles the speed over the current synchronous write() function.

EDIT: Wow I am an idiot. Slight tweak:

Code: Select all

  uint8_t * bp = fb.buffer;
  for(iy = 0; iy < fb.fbheight(); iy++) {
   for(ix = 0; ix < fb.fbwidth(); ix++) {
    mSPI.writeAsync(pal[(*(bp++) & m)>>ms]);
   }
  }
Eliminating the extra calculation from the inner loop, and it is now only 30% slower than DMA (DMA speed is calculated from a fillRect() of the same dimensions).

stevestrong
Posts: 3053
Joined: Mon Oct 19, 2015 12:06 am
Location: Munich, Germany
Contact:

Re: Async(ish) SPI.write?

Post by stevestrong » Sun Aug 05, 2018 11:08 am

Your code seems reasonable.
I assume the array pal[] contains uint16 values, so you are using the SPI in 16bit mode, right?

You could speed up the code a bit if you declare the writeAsync() function as "inline" in a header file.

heisan
Posts: 111
Joined: Thu May 31, 2018 5:39 pm

Re: Async(ish) SPI.write?

Post by heisan » Sun Aug 05, 2018 11:25 am

stevestrong wrote:
Sun Aug 05, 2018 11:08 am
Your code seems reasonable.
I assume the array pal[] contains uint16 values, so you are using the SPI in 16bit mode, right?

You could speed up the code a bit if you declare the writeAsync() function as "inline" in a header file.
Thanks - I haven't started optimising that yet. This was a quick hack to try find out why SPI.write() was so slow.

I was wondering if there is a particular reason that the SPIClass wrapper was written in such a way, rather than using the libmaple primitives (which are already asynchronous)? Should I rather spend some time prepping a change to SPIClass, or just add a separate class of async functions?

heisan
Posts: 111
Joined: Thu May 31, 2018 5:39 pm

Re: Async(ish) SPI.write?

Post by heisan » Sun Aug 05, 2018 11:27 am

As a side note, rendering 240x240 dial type instruments is now 2x faster than drawing directly to the ILI9341 library, and flicker free ;) .

stevestrong
Posts: 3053
Joined: Mon Oct 19, 2015 12:06 am
Location: Munich, Germany
Contact:

Re: Async(ish) SPI.write?

Post by stevestrong » Sun Aug 05, 2018 5:40 pm

heisan wrote:
Sun Aug 05, 2018 11:25 am
I was wondering if there is a particular reason that the SPIClass wrapper was written in such a way, rather than using the libmaple primitives (which are already asynchronous)? Should I rather spend some time prepping a change to SPIClass, or just add a separate class of async functions?
The SPI lib functions are Arduino compatible, and I would not suggest to change them.
Your solution is a non-standard tweak, which helps in a particular situation.
As you said yourself, your application would be actually a good candidate for a dual buffered DMA transfer.

heisan
Posts: 111
Joined: Thu May 31, 2018 5:39 pm

Re: Async(ish) SPI.write?

Post by heisan » Sun Aug 05, 2018 8:56 pm

stevestrong wrote:
Sun Aug 05, 2018 5:40 pm
The SPI lib functions are Arduino compatible, and I would not suggest to change them.
Your solution is a non-standard tweak, which helps in a particular situation.
As you said yourself, your application would be actually a good candidate for a dual buffered DMA transfer.
SPIClass.write() is not a standard Arduino API function - it only has transfer(), which by definition must be synchronous. But I can see that modifying the standard write() function will not actually be practical - there is not enough discipline in the various libraries in calling endTransaction() - so data output would not be guaranteed in all circumstances.

As an alternative, what would be your thoughts on moving _currentSetting from private to protected, so that derived classes can be made with async functions? I would like to avoid the additional memory penalty of two DMA buffers (especially when the possible performance gain is so small), but I also don't want users to need a custom core to build the application...

Thanks,
Justin

heisan
Posts: 111
Joined: Thu May 31, 2018 5:39 pm

Re: Async(ish) SPI.write?

Post by heisan » Wed Aug 08, 2018 9:21 pm

Finally got to play with this some more, and I am making some progress.

First a note. The SPI clock is 36MHz - i.e. 2 CPU clocks per bit. SPI writes are 16 bits wide. So there are 32 CPU clocks available to keep the SPI buffer full. If you can do that then the final result will be faster and use less memory than a DMA implementation.

The problem is, 32 clocks is not a lot ;) .

As a performance reference I am using:

Code: Select all

tfto.fillRect(0,0,240,240,0);
which just does a zero increment DMA of 240x240 words - so this should be pretty much the fastest the screen area can be filled.

This takes 2.5 seconds for 100 cycles. The implementation I posted above takes 3.2s (700ms slower).

One simple change:

Code: Select all

void SPIClass::writeAsync(uint16& data)
i.e. passing by reference means the compiler does not write to an intermediate register (even when inlined), and this reduces it to 2.8s - this is something I did not think of before - used with care, pass by reference can give a big performance increase.

Next I looked at ways to implement this without changing the SPI core. This turned out to be easier than expected, as buried in SPI.h is the dev() method which returns the libmaple device, allowing for direct implementation of the comms.

This gets me down to 2.6s (only 100ms slower than the limit). So now I only need to shave off another clock cycle, or two, and I should be able to reach the hardware limit ;) .

Current test code:

Code: Select all

 void drawFB(int x, int y, Adafruit_FB& fb, uint16_t pal[]) {
  if(x < 0 || y < 0 || x + fb.fbwidth() > WIDTH || y + fb.fbheight() > HEIGHT) return;
  setAddrWindow(x, y, x + fb.fbwidth() - 1, y + fb.fbheight() - 1);
  spi_reg_map *regs = mSPI.dev()->regs;
  int iy, ix;
  uint8_t m = (1<>ms]);
    //while (spi_is_tx_empty(mSPI.dev()) == 0); // "5. Wait until TXE=1 ..."
    //spi_tx_reg(mSPI.dev(), pal[(*(bp++) & m)>>ms]); // write the data to be transmitted into the SPI_DR register (this clears the TXE flag)
    while(!(regs->SR & SPI_SR_TXE));
    regs->DR = pal[(*(bp++) & m)>>ms];
   }
   m <<= fb.fbdepth();
   ms += fb.fbdepth();
   if(m) {
    bp -= fb.fbwidth();
   } else {
    m = (1<SR & SPI_SR_TXE));
  while(regs->SR & SPI_SR_BSY);
  cs_set();
 }

stevestrong
Posts: 3053
Joined: Mon Oct 19, 2015 12:06 am
Location: Munich, Germany
Contact:

Re: Async(ish) SPI.write?

Post by stevestrong » Thu Aug 09, 2018 5:05 am

Use "-O3" compiler directive.

Post Reply