Not much has been disclosed about the internals of the kernel scheduler.
It's time to explain in detail how it works, so that real-time application
programmers know what they can expect from it and can make better
use of it. In this article, I discuss the kernel scheduler as it
has been implemented since Developer Release 7.
The scheduler distinguishes between two classes of threads:
"time-sharing" and "real-time."
A thread's priority level determines which class it falls
into. The priority range 1-99 is used for time-sharing threads;
100-120 is used for real-time
threads. Within each class, threads are further differentiated using
their relative priorities.
A real-time thread is executed as soon as it's ready. If more
than one real-time thread is ready at the same time, the thread
with the higher priority is executed first. The thread is guaranteed
to run without being preempted (except by a real-time thread with
a higher priority) until it invokes a blocking
system call (semaphore, ports, snooze, I/O, and so on).
A time-sharing thread is executed only if there are no real-time
threads in the ready queue. When this condition is satisfied, a
time-sharing thread is elected in a probabilistic manner based on
its priority, using a base-2 logarithmic scale: each additional
priority level doubles a thread's chance of being elected.
For example, a thread with priority 8 is four times more likely
to be elected than a thread with priority 6. A new election takes
place regularly, once every "scheduler quantum"
(currently, every 3 milliseconds).
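To make the doubling rule concrete, here's a back-of-the-envelope
helper; it's an illustration of the rule as stated above, not the
scheduler's actual code:

#include <math.h>

/* Relative chance of two time-sharing threads being elected:
   2 raised to the difference of their priorities. */
double
relative_election_chance(int priority_a, int priority_b)
{
    return ldexp(1.0, priority_a - priority_b);   /* 2^(a - b) */
}

relative_election_chance(8, 6) returns 4.0, matching the example above.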
Because the BeBox is a multiprocessor machine, things are actually
a bit more complex. But the rules described still apply:
No time-sharing thread is allowed to stand in the way of a real-time
thread. For example, if CPU 1 is executing real-time thread A, it's
possible for a time-sharing thread B to run on CPU 2. (Note that
on a single-processor system, B would have to wait until A blocks.)
If real-time thread C wakes up, it
would immediately interrupt B on CPU 2 and run there, without interfering
with A.
If you look at the header file OS.h (in DR8), you will see some
predefined priorities. With each real-time priority, there's an
associated "maximum latency" (described in the comment
next to each priority constant). This latency is the maximum time
that a thread can afford to sit in the ready queue without being
executed. Note that these maximum latencies are *not*
enforced by the scheduler -- these are guidelines, not guarantees.
As a developer, you can use the latency guidelines to find the appropriate
priority for your real-time thread.
In the real-time class:
B_REAL_TIME_PRIORITY (120) -- latency 0.5-1 ms. Used for the
most demanding real-time tasks, such as audio. We currently use
this one for the threads in the chain feeding the DAC.

B_URGENT_PRIORITY (110) -- latency 5-10 ms. Used for slightly
less demanding tasks. We use it for MIDI.

B_REAL_TIME_DISPLAY_PRIORITY (100) -- latency 10-30 ms. We use
this one for the app_server threads poller and mouser, in
charge of reading the keyboard and displaying the cursor.
In the time-sharing class:
B_URGENT_DISPLAY_PRIORITY (20) We don't use this one yet.
You shouldn't use it either.

B_DISPLAY_PRIORITY (15) This is the priority of window
threads, where updates take place.

B_NORMAL_PRIORITY (10) This is the priority of all other
system-created threads.

B_LOW_PRIORITY (5) This is for threads that run in the
background, without disturbing any other threads.

When designing a real-time application, analyze its real-time
constraints. If the maximum tolerable latency falls below 0.5
milliseconds, then the BeBox isn't (yet) the right platform for that
application. For anything above that threshold, pick the priority
whose latency guideline matches your requirement.
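For example, here's a minimal sketch of how an application might spawn
a thread at one of these priorities (the thread function and names are
hypothetical):

#include <OS.h>

/* Hypothetical worker that feeds audio buffers to the DAC. */
static int32
audio_feeder(void *data)
{
    /* ... block on a semaphore, fill a buffer, repeat ... */
    return 0;
}

int
main(void)
{
    /* Pick the real-time priority whose latency guideline matches the task. */
    thread_id feeder = spawn_thread(audio_feeder, "audio feeder",
                                    B_REAL_TIME_PRIORITY, NULL);
    status_t result;

    resume_thread(feeder);
    /* ... */
    wait_for_thread(feeder, &result);
    return 0;
}

You can also change the priority of an existing thread with
set_thread_priority().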
Real-time threads must limit themselves to operations that truly
require real-time attention. A healthy respect for this rule is
essential to the proper behavior of the system. Memory allocation,
for example, should be done by a normal priority time-sharing thread.
The real-time thread should only access buffers that have already
been allocated (and locked in memory, if necessary).
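A sketch of that division of labor, with hypothetical names: a
normal-priority thread allocates and locks the buffer up front, so the
real-time thread only ever touches memory that's already resident:

#include <OS.h>

static void    *buffer;
static area_id  buffer_area;

/* Called from a normal-priority thread, before the real-time thread
   starts. B_FULL_LOCK keeps the pages in RAM, so the real-time thread
   never takes a page fault while touching the buffer. */
static status_t
setup_buffer(void)
{
    buffer_area = create_area("audio buffer", &buffer, B_ANY_ADDRESS,
                              16 * B_PAGE_SIZE, B_FULL_LOCK,
                              B_READ_AREA | B_WRITE_AREA);
    return buffer_area < 0 ? buffer_area : B_OK;
}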
Another factor to keep in mind when composing your real-time thread
is whether a particular function call will "lock" the
underlying system resource or component. Most of the components
of the BeOS(TM) are reentrant, so calls into them can safely be made
from a real-time application. But some components aren't reentrant;
functions that access such a component take a lock on it while it's
in use, and holding that lock can degrade performance. For example,
the file system (in its current state) may lock the I/O system for
long periods. This can be a problem for certain real-time
applications, as we can't guarantee reasonable latency for disk
I/O. We're quite aware of
these limitations and are constantly working to eliminate them.
If you run into any problem while designing or programming your
real-time application, or if you have a question, you can contact
me at cyril@be.com.
The sample code
<ftp://ftp.be.com/pub/samples/drivers/pcspeaker.zip>
demonstrates just how fine-grained you can get. The code is a driver
for the built-in PC speaker. If you have a file in mono 8-bit unsigned
20k raw format, you can cat it directly to /dev/audio/pcspeaker to
play it. If you don't have one, I've included a 'sine' program that
writes a sine wave to stdout, which can be redirected to this device.
This was tested on a 333 MHz Pentium II. On slower machines, you may have
to tweak the values to keep the machine usable.
The PC speaker is a primitive piece of hardware. It basically has
two states: on and off. You can drive it from a programmable rate
generator, or you can directly modify the state by setting a bit.
The classic way to mimic analog signals with this type of speaker
is to pulse it very quickly -- faster than the speaker cone can
physically move. The amount of time that current is applied to the
output vs. the amount of time that it is not applied is controlled
by the amplitude of the input signal, causing the speaker position
to approximate the actual waveform. The sample driver will use this
technique if it is compiled with the PLAY_IN_BACKGROUND macro
set to 0. The down side to this approach is that it requires constant
attention from the CPU. In order to get any kind of sound quality,
you have to shut off interrupts, rendering the machine unusable.
This is clearly undesirable.
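The arithmetic behind this trick is simple. As a sketch (not the
driver's actual code), an 8-bit unsigned sample maps to a duty cycle
like this, assuming some fixed pulse period:

#include <SupportDefs.h>

#define PULSE_PERIOD_US 50   /* assumed pulse period, in microseconds */

/* 0 -> speaker off for the whole period, 255 -> on for the whole period. */
static bigtime_t
on_time_for_sample(uint8 sample)
{
    return ((bigtime_t)sample * PULSE_PERIOD_US) / 255;
}

Within each period, the speaker is driven for that many microseconds
and left off for the remainder.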
The sample code takes an unusual approach to this problem.
It programs a timer to wake it up periodically so it can move the
speaker cone. Other threads continue to run normally, but at defined
intervals, an interrupt occurs and the speaker driver code is executed
(at interrupt level) to control the speaker.
To get any quality, the interrupts need to occur frequently -- around
every 30-60us. Note that, although the machine is still usable,
interrupting this frequently cuts down performance quite a bit.
This is a rather extreme case, but it does illustrate how
fine-grained you can get with kernel timers. You should also note that
the standard user space synchronization primitives use this underlying
mechanism, so you can now get very accurate delays using system
calls such as snooze(), read_port_etc(), or acquire_sem_etc(). You don't
need to write a device driver to get accurate timing.
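For example, a user-space thread can now wait on a semaphore with a
microsecond-granularity timeout. A sketch (the semaphore and the
500-microsecond figure are just for illustration):

#include <OS.h>

static void
wait_for_data(sem_id data_ready)
{
    /* Returns B_OK if the semaphore is released in time,
       B_TIMED_OUT after 500 microseconds. */
    status_t err = acquire_sem_etc(data_ready, 1, B_RELATIVE_TIMEOUT, 500);
    if (err == B_TIMED_OUT) {
        /* handle the missed deadline */
    }
}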
From the driver level, the timers are programmed by using a new
API added for Genki. The declarations for these functions are in
KernelExport.h.
Programming a timer in a device driver basically involves calling
the add_timer() function, like so:

ret = add_timer((timer *)&my_timer, handler_function, time,
                B_ONE_SHOT_ABSOLUTE_TIMER);
The first parameter is a pointer to a timer structure. You can
add your own parameters to this by defining your own structure and
making the kernel's timer struct the first element, for example:
struct my_timer {
    timer   kernel_timer;
    long    my_variable;
    ...
};
The second parameter to add_timer() is a hook function that will be
called when the timer expires. It has the form:

int32 timer_hook(timer *t);
Note that the timer struct that you originally passed to add_timer()
is also passed into this function, so you can access elements that
you've added to that struct from the interrupt handler.
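For instance, the hook might look something like this (a sketch using
the my_timer struct above; what you do with the extra field, and the
choice of return value, are just for illustration -- the hook runs at
interrupt level, so treat it like an interrupt handler):

#include <KernelExport.h>

static int32
handler_function(timer *t)
{
    /* The kernel's timer is the first member of my_timer, so the
       pointer can simply be cast back to the enclosing struct. */
    struct my_timer *mt = (struct my_timer *)t;

    mt->my_variable++;   /* use the state you added to the struct */

    return B_HANDLED_INTERRUPT;
}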
The third parameter is the time that this interrupt should occur.
How this is interpreted is determined by the fourth parameter. There
are three basic modes that a timer understands:
B_ONE_SHOT_ABSOLUTE_TIMER, as the name implies, is a single
interrupt that occurs at a specific system time.
B_ONE_SHOT_RELATIVE_TIMER is similar, but allows you to
specify an interval rather than an actual time.
B_PERIODIC_TIMER will cause the callback to be called repeatedly
at a given interval.
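As a sketch, a driver could use the periodic mode like this (the
1000-microsecond period and the names are made up; the timer must be
cancelled before the driver goes away):

#include <KernelExport.h>

static timer periodic;

static int32
tick_hook(timer *t)
{
    /* called at interrupt level once per period */
    return B_HANDLED_INTERRUPT;
}

static status_t
start_ticking(void)
{
    return add_timer(&periodic, tick_hook, 1000LL, B_PERIODIC_TIMER);
}

static void
stop_ticking(void)
{
    cancel_timer(&periodic);
}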
When playing in background mode, the speaker driver determines
the next time the speaker cone has to move and programs a timer for
that time. If the interval is under 20us, it doesn't actually program
the timer.
This is because there is some overhead to handling an interrupt.
If you program a timer that's too short, you may end up wasting
time and be late for the next event you want to handle. Likewise,
when it sets up the interrupt, it will set it a little early to
compensate for the interrupt latency. A macro called SPIN determines
whether the code spins in a loop when it arrives early for an
event, or just moves the speaker cone early. In the case of the speaker
driver, which is obviously not a high-fidelity component, spinning
isn't really necessary. In fact, since these interrupts occur so
frequently, machine performance degrades significantly when the
driver spins. In drivers where timing is critical, though, spinning
like this is a way to be accurate.
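Here's a sketch of the program-early-and-spin idea (not the driver's
actual code; the names and the 10-microsecond latency guess are
assumptions):

#include <KernelExport.h>
#include <OS.h>

#define LATENCY_GUESS_US 10

struct event_timer {
    timer     kernel_timer;   /* must be the first member */
    bigtime_t target;         /* when the event should really happen */
};

static int32
event_hook(timer *t)
{
    struct event_timer *et = (struct event_timer *)t;

    /* We asked to be woken a little early; if we really are early,
       burn the remaining microseconds so the event happens on time.
       spin() busy-waits. */
    if (system_time() < et->target)
        spin(et->target - system_time());

    /* ... act on the event ... */
    return B_HANDLED_INTERRUPT;
}

static void
arm_event(struct event_timer *et, bigtime_t when)
{
    et->target = when;
    /* Ask to be woken slightly before the event to absorb interrupt
       latency. */
    add_timer(&et->kernel_timer, event_hook, when - LATENCY_GUESS_US,
              B_ONE_SHOT_ABSOLUTE_TIMER);
}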
A quick note about the implementation of timers. You may be familiar
with the way timing is implemented on many operating systems (including
BeOS before Genki). Normally, an operating system sets up a periodic
interrupt that occurs every couple of milliseconds. At that time,
the timeout queue is checked to see if there are expired timeouts.
As a consequence, you can't get a dependable resolution of less
than the interrupt period. In Genki, the kernel dynamically programs
a hardware timer for the next thread to wake or sleep, allowing
much finer resolutions. This, coupled with the preemptive nature
of the kernel, is what allows the driver to accurately set timers
for 30-60us.
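In user space, this shows up as, for example, the ability to run a
drift-free periodic loop with a sub-millisecond period. A sketch (the
500-microsecond period and do_work() are hypothetical):

#include <OS.h>

static void do_work(void);   /* whatever needs to happen each period */

static void
periodic_loop(void)
{
    const bigtime_t period = 500;   /* microseconds */
    bigtime_t next = system_time() + period;

    for (;;) {
        snooze_until(next, B_SYSTEM_TIMEBASE);   /* sleep until an absolute time */
        do_work();
        next += period;   /* schedule from the previous deadline, not from "now" */
    }
}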
When it finishes playing a chunk of data, the interrupt handler
releases a semaphore that the calling thread was waiting for. Normally,
when a semaphore is released, the scheduler is invoked to wake up
threads that may have been waiting for the semaphore. As you're
probably aware, rescheduling from within an interrupt handler is
very bad. Driver writers must use the flag B_DO_NOT_RESCHEDULE
when releasing a semaphore from an interrupt handler. This, by itself,
was a limitation, however, because it meant that a thread waiting
on some device driver call had to wait until the scheduler was invoked
again (for some unrelated reason) before it would run. That could
be several milliseconds later. Generally, you want it to run as
soon as possible.
In Genki, an interrupt handler can now return B_INVOKE_SCHEDULER,
which invokes the scheduler immediately after your interrupt handler
returns. If you have released semaphores from an interrupt handler,
you can return this to make sure that waiting threads are woken
up soon. This is especially useful for drivers where low-latency
access to hardware is important.
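Putting the two flags together, an interrupt handler typically ends up
looking something like this sketch (the device specifics and names are
hypothetical):

#include <KernelExport.h>
#include <OS.h>

static sem_id io_done_sem;

static int32
my_interrupt_handler(void *data)
{
    /* ... acknowledge the hardware, note that the transfer completed ... */

    /* Never reschedule from inside the handler itself... */
    release_sem_etc(io_done_sem, 1, B_DO_NOT_RESCHEDULE);

    /* ...but ask the kernel to run the scheduler as soon as the handler
       returns, so the waiting thread wakes up with minimal latency. */
    return B_INVOKE_SCHEDULER;
}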
As you can see, there are a number of very useful facilities for
low-latency device access and high-precision timing. In addition,
many of the generic user-level synchronization primitives have become
much more accurate.