Be Driven
  Device Drivers in the BeOS
     
    Q & A : Data Transfer

Questions and Answers
Data Transfer Design Issues.

This section deals with more complex design decisions, as opposed to simple "will it work?" questions. It is mostly structured as questions and responses.

The information is primarily taken from the BeDevTalk archives. ;-)
They are very much worth reading through to learn from others' experiences.

Question : Fast I/O

Something I've been wondering about for a while now: suppose you have a driver for something that can generate a lot of data fast. Say, a 100base Ethernet driver (although these may still get a special non-kernel driver API in DR9, but it's just an example anyway). There may be 10 megabytes of data coming in each second, and you don't want to lose any of it, which means that to keep up you have to have a buffer available for DMA'ing into at all times, and you can't afford to copy the data too many times or you run out of memory bandwidth.

The optimal method is to have three or more equal-sized buffers posted to the driver at any time, the driver filling them up one by one and posting back to the app which cycles them back after processing the data. Trivial to do with a message queue. But how are you going to do it with a synchronous read()/write() API?

solution 1: one thread in a read() loop [sketched in the editor's note after this message]. Between the time read() returns and the thread manages to call it again with a new or emptied buffer, n Ethernet frames may have been lost.

solution 2: three threads blocked in read() loops, served one by one. Question is, how is the order enforced?

solution 3: a kernel thread managing a buffer queue in kernel space and doing memory copies to user space on read(). Risks running out of memory bandwidth.

solution 4: two user threads blocked in read() loops, managing a buffer queue with a third thread through a message port. The third thread does all the actual processing of the data. Complicated, with two layers of semaphores (one in driver, another in the message port).
-Osma Ahvenlampi (oahvenla@cc.hut.fi)
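
Editor's note: for reference, solution 1 boils down to a loop like the sketch below (plain C; the device path and the process() handler are hypothetical). The vulnerable window is the time spent outside read():

#include <fcntl.h>
#include <unistd.h>

extern void process(const char *data, ssize_t len);   /* the app's handler */

static void rx_loop(void)
{
    static char buffer[64 * 1024];
    int fd = open("/dev/net/mydriver/0", O_RDONLY);   /* hypothetical path */
    if (fd < 0)
        return;

    for (;;) {
        ssize_t n = read(fd, buffer, sizeof(buffer));
        if (n <= 0)
            break;
        /* Frames arriving while we are in here are lost unless the
           driver buffers them internally. */
        process(buffer, n);
    }
    close(fd);
}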

Response:

>The optimal method is to have three or more equal-sized buffers posted to
>the driver at any time, the driver filling them up one by one and posting
>back to the app which cycles them back after processing the data. Trivial
>to do with a message queue. But how are you going to do it with a
>synchronous read()/write() API?

You're confusing the driver implementation, which talks to the hardware, with the API through which the application talks to the driver.

>solution 3: a kernel thread managing a buffer queue in kernel space and doing
>memory copies to user space on read(). Risks running out of memory bandwidth.

Yes, well, such is life. You will need at least two memory copies: one from the device to the driver (even if it's DMA, it still counts as a "memory copy") and one from the driver to the application. That's not counting any copying or processing the application wants to do with the data.

Forcing user applications to use a non-standard I/O model is only needed for devices that come close to the machine's total capacity; typically, that is not expected to happen for "normal" devices. 100 Mbit Ethernet is somewhat fast for the baseline CPU of today, but that's good, because modern machines will not have a problem with it, and it'll take some time before it's perceived as "slow" (like 10baseT is in some circles).

The good thing about a driver is that you can put smarts into it; for a network card driver, you can instruct the driver to discard packets you don't care about, thus reducing memory bandwidth use. Some Ethernet hardware can even do this in hardware, which would be a potentially big win if you have lots of packets you don't care about addressed to your hardware interface.
- hplus@zilker.net

Response:

No, I'm not. I believe in designing device driver APIs so that the two could
interoperate as closely as possible.

> Yes, well, such is life. You will need at least two memory copies;
> from the device to the driver (even if it's DMA, it still counts
> as a "memory copy") and from the driver to the application. Not
> counting any copying or processing the application wants to do
> with the data.

An optimal device API would lock the application's own buffers (which the app would keep queued, triple-buffered) into memory and DMA right into them. No UNIX can do this, but Amiga's device driver APIs work this way (and the network API indeed requires the TCP/IP stack, or an equivalent app talking directly to the hardware, to triple-buffer its I/O requests).

> Forcing user applications to use a non-standard I/O model is only
> needed for devices that come close to the machines' total capacity;

It's unfortunate that the standard model can't handle this.

> typically, that is not expected to happen for "normal" devices.
> 100 MBit Ethernet is somewhat fast for the baseline CPU of today,
> but that's good, because modern machines will not have a problem
> with it, and it'll take some time before it's perceived as "slow"

Don't forget that gigabit Ethernet devices are already available, and FireWire also has 400 Mbit capacity. This is far more than the average desktop machine's memory bandwidth (see the c't article comparing the Pentium, PPro, and Klamath: 32 MB/s main-memory bandwidth for these 66 MHz EDO-RAM systems).

> The good thing about a driver is that you can put smarts into it;
> for a network card driver, you can instruct the driver to discard
> packets you don't care about, and thus reducing memory bandwidth.

Packets addressed to other Ethernet devices are discarded automatically by _every_ card available (unless you put the card in promiscuous mode, which is an immense performance hit on most busy LANs). The driver cannot discard anything that comes through from the card, because it is not the Ethernet driver's job to interpret IP/IPX/AppleTalk headers. Further filtering needs to be done by the network stack itself.

- Osma Ahvenlampi <mailto:oa@iki.fi>
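
Editor's note: on the BeOS side, the driver half of what Osma describes, pinning the caller's buffer and DMA'ing straight into it, roughly maps onto the kernel calls lock_memory() and get_memory_map(). A minimal sketch, assuming a hypothetical start_dma_to() in place of the actual controller programming:

#include <KernelExport.h>

extern void start_dma_to(physical_entry *table);   /* hypothetical */

/* Called from the driver's read hook with the caller's own buffer. */
static status_t dma_into_user_buffer(void *buf, size_t len)
{
    physical_entry table[8];
    status_t err;

    /* Pin the pages so they can't be paged out mid-transfer; B_READ_DEVICE
       is left out because the device writes to memory, not from it. */
    err = lock_memory(buf, len, B_DMA_IO);
    if (err != B_OK)
        return err;

    /* Translate the virtual range into physical scatter/gather entries. */
    get_memory_map(buf, len, table, 8);

    start_dma_to(table);
    /* ... wait for the completion interrupt ... */

    return unlock_memory(buf, len, B_DMA_IO);
}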

Response:

We should be able to do this with an ioctl and a few areas, right?
i.e.:
create three areas in your application;
open the device;
ioctl: send the driver the IDs of the areas, and tell it to associate them
with the stream you opened.

Then you just need to communicate back and forth somehow as to when the
areas are available, probably with some more ioctl calls. Sure, this isn't
a standard method, but it's highly optimized, requiring NO copying: the data
is stored directly into the buffer as it comes in.
--
Daniel Lakeland
dlakelan@iastate.edu
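
Editor's note: a minimal sketch of this scheme from the application side. The device path and the MYDRIVER_SET_BUFFERS opcode are hypothetical; a real driver would define its own opcodes above B_DEVICE_OP_CODES_END (from Drivers.h):

#include <OS.h>
#include <Drivers.h>
#include <fcntl.h>
#include <unistd.h>

#define MYDRIVER_SET_BUFFERS  (B_DEVICE_OP_CODES_END + 1)   /* hypothetical */
#define NUM_BUFS   3
#define BUF_SIZE   (256 * 1024)

int main(void)
{
    void    *addr[NUM_BUFS];
    area_id  areas[NUM_BUFS];
    int      i, fd;

    /* Create three locked areas for the driver to fill in rotation. */
    for (i = 0; i < NUM_BUFS; i++) {
        areas[i] = create_area("rx buffer", &addr[i], B_ANY_ADDRESS,
                               BUF_SIZE, B_FULL_LOCK,
                               B_READ_AREA | B_WRITE_AREA);
        if (areas[i] < 0)
            return 1;
    }

    fd = open("/dev/misc/mydriver/0", O_RDWR);   /* hypothetical path */
    if (fd < 0)
        return 1;

    /* Hand the area IDs to the driver, which associates them with this
       stream and stores incoming data straight into them, no copying. */
    ioctl(fd, MYDRIVER_SET_BUFFERS, areas, sizeof(areas));

    /* ... further ioctls (or a shared semaphore) would signal back and
       forth which area is currently filled and which are free ... */

    close(fd);
    return 0;
}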

An interesting email on data movement with minimal copying follows.
You could implement a similar protocol manually for your own device if required (e.g. for A/V capture ;-).

++++++++++++++++++++++++++++++++++++++++++++++
Re: Reading kernel ports without copying
Eric Berdahl (berdahl@serendipity.org)
Mon, 23 Feb 1998 14:36:44 -0800

I mention this only for the purpose of historical anecdote or interesting observation, and not necessarily because I think this is what Be should put in their kernel. Now that the disclaimers are done...

One of the really cool features of NuKernel (the kernel Apple wrote for Copland) was the optimizations it made in message passing. All message passing was synchronous (the sender blocked until the receiver replied to the message), and the client's message buffer was therefore guaranteed to be valid through the entire rendezvous. When the receiver blocked on a receive, it provided a buffer and the size of that buffer to the receive function, but when the receive function returned, it gave the receiver a pointer to the message data and the size of that message data. The message data was explicitly not guaranteed to be in the buffer the receiver provided.

The kernel had a heuristic for looking at the size of the message data, the size of the receiver's buffer, and some characteristics of the VM system and thereby determining what the most efficient way to pass the message data was.

Method 1: It turns out that there was a small amount of "extra" space in the message header returned by the receive function. If the message was small enough, copying it into that extra space was a big win. This worked great for small messages.

Method 2: If the message was larger than the small space, small enough to fit in the receiver's buffer, and not so large that it crossed some magic boundary (based on VM stats and other things the NuKernel team measured), the message was copied into the receiver's buffer.

Method 3: This was what I thought was the really cool part. If the message was REALLY BIG, or just too large to fit in the receiver's buffer, the message system would map the sender's buffer pages from the sender's address space into the receiver's address space as read-only pages. Because the pages were mapped read-only, the receiver could not trash the sender's address space, but because they were mapped instead of copied, the overhead was relatively (!) small. If the overhead of mapping pages was smaller than the overhead of doing the memcpy on the message, this method would end up being the fastest. It also put an upper limit on the amount of overhead imposed by the message system.

This was one of the features that made NuKernel one of the nicest kernels I've programmed.
Regards,
Eric
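
Editor's note: the receive-side heuristic Eric describes is essentially a three-way size check. A sketch in C, where the threshold values, the msg_header layout, and map_pages_read_only() are all invented for illustration (the real NuKernel logic also consulted VM statistics):

#include <stddef.h>
#include <string.h>

#define HEADER_SPARE   64           /* "extra" space in the message header */
#define MAP_THRESHOLD  (16 * 1024)  /* magic boundary where mapping wins   */

typedef struct { char spare[HEADER_SPARE]; } msg_header;

void *map_pages_read_only(const void *data, size_t size);  /* stand-in VM call */

/* Returns a pointer to the message data; note that it is explicitly not
   guaranteed to be inside the buffer the receiver provided. */
void *deliver(msg_header *hdr, const void *data, size_t size,
              void *recv_buf, size_t recv_size)
{
    if (size <= HEADER_SPARE) {
        /* Method 1: tiny message, tuck it into the header itself. */
        memcpy(hdr->spare, data, size);
        return hdr->spare;
    }
    if (size <= recv_size && size < MAP_THRESHOLD) {
        /* Method 2: medium message, copy it into the receiver's buffer. */
        memcpy(recv_buf, data, size);
        return recv_buf;
    }
    /* Method 3: big message. Remap the sender's pages read-only into the
       receiver's address space; the sender stays blocked until the reply,
       so the pages remain valid for the whole rendezvous. */
    return map_pages_read_only(data, size);
}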


The Communal Be Documentation Site
1999 - bedriven.miffy.org