Something I've been wondering about for a while now: suppose you
have a driver for something that can generate a lot of data fast.
Say, a 100base Ethernet driver (although these may still have a special
non-kernel driver API in DR9, but it's just an example anyway). There
may be 10 megabytes of data coming in each second, and you don't want
to lose any of it, which means that to keep up you have to have a
buffer available for DMA'ing into at all times, and you can't afford to
copy the data too many times or you run out of memory bandwidth.
The optimal method is to have three or more equal-sized buffers
posted to the driver at any time, the driver filling them up one by
one and posting back to the app which cycles them back after processing
the data. Trivial to do with a message queue. But how are you going
to do it with a synchronous read()/write() API?
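For concreteness, the message-queue version might look roughly like the
sketch below. create_port()/find_port()/read_port()/write_port() are real
BeOS kernel calls, but the driver's port name, the message codes and the
idea of handing the driver raw user pointers this way are assumptions for
illustration, not an existing driver API.

/* Hypothetical sketch: triple-buffering through a message queue.
 * Only the port calls themselves are real BeOS APIs; the driver port
 * name, message codes and buffer handoff protocol are invented. */
#include <OS.h>

#define NUM_BUFFERS 3
#define BUFFER_SIZE (64 * 1024)

enum { MSG_BUFFER_EMPTY = 'mpty', MSG_BUFFER_FULL = 'full' };

typedef struct {
    port_id reply_to;   /* where the driver posts the filled buffer back */
    void   *data;
    size_t  size;       /* filled in by the driver on the way back */
} net_buffer;

int main(void)
{
    static char storage[NUM_BUFFERS][BUFFER_SIZE];
    port_id driver_port = find_port("ether/in");               /* hypothetical */
    port_id reply_port  = create_port(NUM_BUFFERS, "app replies");
    int i;

    /* Post every buffer up front, so the driver always has one free for DMA. */
    for (i = 0; i < NUM_BUFFERS; i++) {
        net_buffer b;
        b.reply_to = reply_port;
        b.data     = storage[i];
        b.size     = BUFFER_SIZE;
        write_port(driver_port, MSG_BUFFER_EMPTY, &b, sizeof(b));
    }

    for (;;) {
        int32      code;
        net_buffer b;
        read_port(reply_port, &code, &b, sizeof(b));   /* blocks until one is full */
        /* ... process b.size bytes at b.data ... */
        write_port(driver_port, MSG_BUFFER_EMPTY, &b, sizeof(b));  /* recycle it */
    }
}

With a synchronous read()/write() API, the best alternatives I can think
of are these: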
solution 1: one thread in a read() loop. Between the time
read() returns and the thread manages to call it again with a new
or emptied buffer, n Ethernet frames may have been lost.
solution 2: three threads blocked in read() loops, served
one by one. The question is, how is the order enforced?
solution 3: a kernel thread managing a buffer queue in kernel
space and doing memory copies to user space on read(). Risks running
out of memory bandwidth.
solution 4: two user threads blocked in read() loops, managing
a buffer queue with a third thread through a message port. The third
thread does all the actual processing of the data. Complicated, with
two layers of semaphores (one in driver, another in the message port).
-Osma Ahvenlampi (oahvenla@cc.hut.fi)
Response:
>The optimal method is to have three or more equal-sized buffers posted to the
>driver at any time, the driver filling them up one by one and posting back to
>the app which cycles them back after processing the data. Trivial to do with a
>message queue. But how are you going to do it with a synchronous
>read()/write() API?
You're confusing the driver implementation, which talks to the hardware,
with the API through which the application talks to the driver.
>solution 3: a kernel thread managing a buffer queue in kernel space and doing
>memory copies to user space on read(). Risks running out of memory bandwidth.
Yes, well, such is life. You will need at
least two memory copies; from the device to the driver (even if it's
DMA, it still counts as a "memory copy") and from the driver
to the application. Not counting any copying or processing the application
wants to do
with the data.
Forcing user applications to use a non-standard I/O model is only
needed for devices that come close to the machine's total capacity;
typically, that is not expected to happen for "normal" devices.
100 MBit Ethernet is somewhat fast for the baseline CPU of today,
but that's good, because modern machines will not have a problem with
it, and it'll take some time before it's perceived as "slow"
(like 10baseT is in some circles).
The good thing about a driver is that you can put smarts into it;
for a network card driver, you can instruct the driver to discard
packets you don't care about, thus reducing memory bandwidth use.
Some Ethernet hardware can even do this filtering in hardware, which
would be a potentially big win if a lot of packets you don't care
about are addressed to your interface.
- hplus@zilker.net
Response:
No, I'm not. I believe in designing device driver APIs so that the
two could interoperate as closely as possible.
> Yes, well, such is life. You will need at least two memory copies;
> from the device to the driver (even if it's DMA, it still counts
> as a "memory copy") and from the driver to the application. Not
> counting any copying or processing the application wants to do
> with the data.
An optimal device API would lock the application's own buffers (which
the app would post as a triple-buffered queue) into memory and DMA
right into them. No UNIX can do this, but Amiga's device driver
APIs work this way (and the network API indeed requires a TCP/IP stack
or an equivalent app talking directly to the hardware to triple-buffer
I/O requests).
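On BeOS, the kernel-side half of that idea might look something like
the sketch below; lock_memory(), get_memory_map() and unlock_memory()
are real Kernel Kit calls, but the surrounding hook and what is done
with the physical fragments are assumptions for illustration.

/* Hypothetical driver-side sketch: pin a user buffer and find its physical
 * pages so the card can DMA straight into it.  Only lock_memory(),
 * get_memory_map() and unlock_memory() are real calls; queue_user_buffer()
 * and the descriptor-ring step are invented. */
#include <KernelExport.h>

#define MAX_FRAGMENTS 16

static status_t
queue_user_buffer(void *user_buf, size_t len)
{
    physical_entry table[MAX_FRAGMENTS];
    status_t err;

    /* Pin the pages so they can't be paged out while the card writes them;
       B_READ_DEVICE indicates data will flow from the device into memory. */
    err = lock_memory(user_buf, len, B_DMA_IO | B_READ_DEVICE);
    if (err != B_OK)
        return err;

    /* Translate the virtual range into physical fragments for the DMA engine. */
    err = get_memory_map(user_buf, len, table, MAX_FRAGMENTS);
    if (err != B_OK) {
        unlock_memory(user_buf, len, B_DMA_IO | B_READ_DEVICE);
        return err;
    }

    /* A real driver would now hand 'table' to the card's DMA descriptor ring
       and call unlock_memory() again once the transfer completes. */
    return B_OK;
}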
> Forcing user applications to use a non-standard I/O model is only
> needed for devices that come close to the machine's total capacity;
It's unfortunate that the standard model can't handle this.
> typically, that is not expected to happen for "normal" devices.
> 100 MBit Ethernet is somewhat fast for the baseline CPU of today,
> but that's good, because modern machines will not have a problem
> with it, and it'll take some time before it's perceived as "slow"
Don't forget that gigabit Ethernet devices are already available,
and Firewire also has 400Mbit capacity; a full gigabit is roughly
125 megabytes per second of raw data. This is far more than the
average desktop machine's memory bandwidth (see the c't article
comparing Pentium, PPro and Klamath: 32MB/s main-memory bandwidth
for these 66MHz EDO-RAM systems).
> The good thing about a driver is that you can put smarts into it;
> for a network card driver, you can instruct the driver to discard
> packets you don't care about, thus reducing memory bandwidth use.
Packets addressed to other Ethernet devices
are discarded automatically by _every_ card available (unless you
put them in promiscuous mode, which is an immense performance hit
on most busy LANs). The driver cannot discard anything that comes
through from the card, because it is not the Ethernet driver's job
to interpret IP/IPX/AppleTalk headers. Further filtering needs to
be done by the network stack itself.
- Osma Ahvenlampi <mailto:oa@iki.fi>
Response:
We should be able to do this with an ioctl and a few areas, right?
i.e.:
  create 3 areas in your application
  open the device
  ioctl: send the driver the IDs of the areas, and tell it to
  associate them with the stream you opened
Then you just need to somehow communicate back and forth as to when
the areas are available, probably with some more ioctl calls. Sure,
this isn't a standard method, but it's highly optimized, requiring NO
copying; the data is stored directly into the buffer as it comes in.
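A minimal user-space sketch of that, assuming a hypothetical driver:
create_area(), open() and ioctl() are real calls, but the device path,
the NET_ATTACH_AREAS / NET_WAIT_BUFFER / NET_RECYCLE_BUFFER opcodes and
their arguments are made up for illustration.

/* Hypothetical sketch of the area + ioctl scheme.  Only the device path
 * and the NET_* opcodes are invented; the rest are standard calls. */
#include <OS.h>
#include <fcntl.h>
#include <unistd.h>

#define NUM_BUFFERS 3
#define BUFFER_SIZE (B_PAGE_SIZE * 16)

enum { NET_ATTACH_AREAS = 'natt', NET_WAIT_BUFFER = 'nwai',
       NET_RECYCLE_BUFFER = 'nrec' };

int main(void)
{
    area_id areas[NUM_BUFFERS];
    void   *addr[NUM_BUFFERS];
    int     fd, i, which;

    /* 1. create three locked areas the driver can later DMA into */
    for (i = 0; i < NUM_BUFFERS; i++) {
        addr[i]  = NULL;
        areas[i] = create_area("net buffer", &addr[i], B_ANY_ADDRESS,
                               BUFFER_SIZE, B_FULL_LOCK,
                               B_READ_AREA | B_WRITE_AREA);
    }

    /* 2. open the device (path is illustrative) */
    fd = open("/dev/net/whatever0", O_RDWR);

    /* 3. hand the driver the area ids and associate them with this stream */
    ioctl(fd, NET_ATTACH_AREAS, areas, sizeof(areas));

    /* 4. loop: ask which buffer was filled, process it, hand it back */
    for (;;) {
        ioctl(fd, NET_WAIT_BUFFER, &which, sizeof(which));     /* blocks for data   */
        /* ... process the packet data sitting in addr[which] ... */
        ioctl(fd, NET_RECYCLE_BUFFER, &which, sizeof(which));  /* free for more DMA */
    }
}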
--
Daniel Lakeland
dlakelan@iastate.edu
++++++++++++++++++++++++++++++++++++++++++++++
Re: Reading kernel ports without copying
Eric Berdahl (berdahl@serendipity.org)
Mon, 23 Feb 1998 14:36:44 -0800
I mention this only for the purpose of historical anecdote or interesting
observation and not necessarily because I think this is what Be should
put in their kernel. Now that the disclaimers are done...
One of the really cool features of NuKernel (the kernel Apple wrote
for Copland) was the optimizations it made in message passing. All
message passing was synchronous (sender blocked until the receiver
replied to the message) and the client's message buffer was, therefore,
guaranteed to be valid through the entire rendezvous. When the receiver
blocked on a receive, it provided a buffer and the size of that buffer
to the receive function, but when the receive function returned, it
gave the receiver a pointer to the message data and the size of that
message data. The message data was explicitly not guaranteed to be
in the buffer the receiver provided.
The kernel had a heuristic for looking at the size of the message
data, the size of the receiver's buffer, and some characteristics
of the VM system and thereby determining what the most efficient way
to pass the message data was.
Method 1: It turns out that there was a small amount of "extra"
space in the message header returned by the receive function. If the
message was small enough, copying it into that extra space was a big
win. This worked great for small messages.
Method 2: If the message was larger than the small space,
small enough to fit in the receiver's buffer, and not so large that
it crossed some magic boundary (based on VM stats and other things
the NuKernel team measured), the message was copied into the receiver's
buffer.
Method 3: This was what I thought was the really cool part.
If the message was REALLY BIG or just too large to fit in the receiver's
buffer, the message system would map the sender buffer's pages from
the sender's address space into the receiver's address space as read-only
pages. Because the pages were mapped read-only, the receiver could
not trash the sender's address space, but because they were mapped
instead of being copied, there was relatively (!) small overhead. If the
overhead of
mapping pages was smaller than the overhead of doing the memcpy on
the message, this method would end up being the fastest. It also put
an upper limit on the amount of overhead imposed by the message system.
This was one of the features that made NuKernel one of the nicest kernels
I've programmed.
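A rough sketch of that decision in C; the thresholds and names below are
invented to illustrate the heuristic, not NuKernel's actual interfaces
(and the real kernel also consulted VM statistics not modelled here).

/* Illustration only: the thresholds and helper names are made up. */
#include <stddef.h>

typedef enum { COPY_INTO_HEADER, COPY_INTO_BUFFER, MAP_READ_ONLY } delivery_t;

#define HEADER_SPARE_BYTES  64           /* "extra" space in the message header  */
#define COPY_VS_MAP_CUTOFF  (16 * 1024)  /* above this, remapping beats a memcpy */

static delivery_t
choose_delivery(size_t msg_size, size_t receiver_buf_size)
{
    if (msg_size <= HEADER_SPARE_BYTES)
        return COPY_INTO_HEADER;  /* Method 1: tiny message, piggyback on the header */
    if (msg_size <= receiver_buf_size && msg_size < COPY_VS_MAP_CUTOFF)
        return COPY_INTO_BUFFER;  /* Method 2: cheap enough to memcpy                */
    return MAP_READ_ONLY;         /* Method 3: map the sender's pages read-only      */
}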
Regards,
Eric