Be Driven
  Device Drivers in the BeOS
     
    Memory - x86

Memory Architecture for an Intel 80386

Go Buy a Book

Programming the 80386
John H. Crawford, Patrick P. Gelsinger
Sybex
ISBN: 0-89588-381-3

This is a very rough but broad overview.. If you need to understand anything more than this, go and buy a good reference manual. If you have any experience with x86 assembler, then this shouldn't be too painful.

Understanding the 80386 model is good enough for general Intel BeOS programming. Newer processors extend functionality, but do not change the layout of how things behave.. (As opposed to the changes between the 8088, 80286, and finally the 80386.)

Because BeOS only runs in 32 bit modes, I will avoid all discussion of 16 bit modes, which have a different memory layout.

In this section I won't be repeating myself as much as I normally do, so be more careful, and when I say re-read the Protection chapter, I mean it ;-)

AX or EAX and Endians..

In the transition between 16 bit and 32 bit processor development, the old 16 bit registers were Extended into 32 bit registers.. so any register with an E in front of it refers to the entire 32 bits, while the name without the E refers to the lower 16 bits of the same area. On certain registers, the 16 bit values are split into 2 more 8 bit areas, with H for high, and L for low. AH, CL, BX, etc.

31             16 15     8 7       0
-------------------------------------
|                 |   AH   |   AL   |
|                 -------------------
|                 |       AX        |
-------------------------------------
                 EAX

From the above, all true believers in Big-Endian byte order are already cursing the day Intel decided to go with the Little-Endian format.

?? Huh, what are you talking about ??
If you said that, you're in for a nasty surprise. ;-)

The smallest chunk of data you can read at any point is 8 bits. You can then read in groups of 16 bits, then 32 bits. (Any C programmer knows this.)

Now, unlike the old Motorola 68000 family, and some other processors, you are allowed to start your pointer ANYWHERE in memory, and it does not need to be aligned to the nearest 4 byte boundary. This at least is nice.
Now, when you access your 16 or 32 bit chunks, how are you actually reading the memory..

That's right, on Intel, the LOGICAL order of the number is not the way it is PHYSICALLY stored in the processor or in memory.

There are 2 ways to store a 16 bit value representing 0x1234

Most significant byte first : 0x12 0x34 (Motorola, Big-Endian)

or

Least significant byte first : 0x34 0x12 (Intel, Little-Endian)

Little-Endian might look easier to deal with on paper, but it can cause serious confusion when working with graphics, 16 bit audio, etc.

So, this is why you often see the following code....

long v = 0x12345678;
outchar( (v >> 24) & 0xff ); // 12 : Most significant byte
outchar( (v >> 16) & 0xff ); // 34
outchar( (v >> 8) & 0xff );  // 56
outchar( v & 0xff );         // 78 : Least significant byte

where on PPC, you might just write it straight into memory and deal with it as a 32 bit integer, with no conversion. (But you don't, so as to keep the code portable between platforms.)
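To make this concrete, here is a small sketch in C (the function names are my own invention, not a BeOS API): serializing a value byte-by-byte with shifts always produces the same order no matter what the host CPU does, and you can detect the host's order by peeking at the raw bytes.

```c
#include <stdint.h>

/* Store a 32-bit value in Big-Endian (Motorola) byte order,
   regardless of the host CPU's native byte order. */
void store_be32(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);  /* most significant byte first */
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;          /* least significant byte last */
}

/* Detect the host byte order by looking at the first byte of an int. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    return *(uint8_t *)&probe == 1;  /* low byte stored first on Intel */
}
```

On an Intel box, `host_is_little_endian()` returns 1; the buffer written by `store_be32()` is identical on both architectures, which is the whole point of doing the shifts by hand.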

I am not going any further into this.. Read some books, search the internet, and actually look at the memory when doing your operations on an Intel for the first time.. Get used to it.

Now this does cause issues with some hardware.
If they have a 16 bit I/O port, which way were they expecting the data???
;-)

The processor status and control flags register
- EFLAGS

The EFLAGS register is one of the two status / control registers. (The other being the EIP (Instruction Pointer).)

The 32bit EFLAGS register contains several status flag and control flag bits.

The program sets the control bits to control the operation of certain functions of the 80386. The processor itself sets the status bits, which are tested by the program after arithmetic operations to check for special conditions.

eg.,
DEC BX ; BX -= 1;
JZ some address ; Jump if last operation = Zero

Arithmetic Flags
Bit 00 : CF : Carry Flag
Bit 01 : 1  : Reserved as 1
Bit 02 : PF : Parity Flag 
Bit 03 : 0  : Reserved as 0
Bit 04 : AF : Auxiliary carry Flag
Bit 05 : 0  : Reserved as 0
Bit 06 : ZF : Zero Flag
Bit 07 : SF : Sign Flag
Bit 11 : OF : OverFlow Flag
Process Control Flags
Bit 08 : TF  : Trap Enable Flag [ single step instructions ]
Bit 09 : IF  : Interrupt Enable Flag
Bit 10 : DF  : Direction Flag [ string manipulation ]
Bit 12 : IOPL LO : I/O Privilege Level
Bit 13 : IOPL HI : (IOPL is a 2 bit field)
Bit 14 : NT  : Nested Task Flag [ is interrupt, or task switch]
Bit 15 : 0   : Reserved as 0
Bit 16 : RF  : Resume Flag
Bit 17 : VM  : Virtual 8086 Mode bit.
Bit 18 -> 31 : Reserved as 0

The RF, NT, DF, and TF flags may be set or cleared by a program running at any privilege level. The VM and IOPL fields may only be set when running at privilege level 0. The IF bit can only be changed when executing at a privilege level at least as privileged as the IOPL.

Bits marked as '0' or '1' are reserved. The reserved bits must be loaded as '0' or '1', as indicated, and must be ignored when examining the EFLAGS register.

Of primary interest in our conversations are the IF (Interrupt Flag) and the IOPL (I/O Privilege Level) fields in the EFLAGS register.
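As a sketch, decoding these two fields from a saved flags value looks like this in C (the macro and function names are mine, not from any header):

```c
/* EFLAGS bit positions, taken from the table above. */
#define FLAG_IF    (1u << 9)           /* Interrupt Enable Flag     */
#define IOPL_SHIFT 12
#define IOPL_MASK  (3u << IOPL_SHIFT)  /* 2-bit I/O Privilege Level */

/* Extract the IOPL field (0-3) from a saved EFLAGS value. */
unsigned eflags_iopl(unsigned flags)
{
    return (flags & IOPL_MASK) >> IOPL_SHIFT;
}

/* Are interrupts enabled in this saved EFLAGS value? */
int eflags_if_set(unsigned flags)
{
    return (flags & FLAG_IF) != 0;
}
```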

Memory Addressing Concepts

The 80386 uses a memory addressing technique called segmentation, which divides the memory space into one or more separate linear regions called 'segments'. A memory address consists of two parts; a segment part that identifies the containing segment, and an offset part that gives a simple byte offset inside the segment.

[ Before you 16 bit mode programmers go, Yeah I know this, the 32bit memory addressing scheme is DIFFERENT, even though it still uses a Seg:Offset combination. ]

Both parts must be specified by an instruction with a memory operand (one that accesses memory, as opposed to internal registers like AX/BX/CX, etc.)

The segment part is a 16bit segment selector, which contains a 14bit field that identifies one of 16,384 possible segments.

The 32 bit offset part gives a byte offset within a segment. This offset can access up to 4GB of consecutive byte addresses. (Which still proves to be inadequate for the MsOffice 2000 runtime.)

This 2 part memory notation is always represented as Segment:Offset. The Segment part is always represented by a register..

eg., ES:[EAX] is the Extended Segment register : offset EAX

Given this, most applications use only a few segments, and tend to place large chunks of data inside them.

There are 6 segment registers available to a programmer.

  • CS : Code Segment
    [current instruction is at CS:EIP]

  • SS : Stack Segment
    [commonly used as SS:ESP for the current stack position]

  • DS : Data Segment

  • ES : Destination Segment [where you want a result to go by convention]

  • FS : An extra segment register [no conventional name beyond the letter]

  • GS : Another extra segment register [available for whatever the compiler or OS wants]

You can use any segment register interchangeably, but convention dictates the most common behaviour.

Offsets

Offsets are represented in several ways that are of more interest to assembler programmers than C programmers. In short, a register is most often used to make up the offset address..

Memory-Management Features

There are 2 key parts to any complete memory-management system : protection and address translation. Protection is provided to prevent a task from accessing memory belonging to another task or the operating system. Address translation gives the OS flexibility in allocating memory to tasks, and it is also a key protection mechanism.

Address Translation

The physical memory, (accessed at pin levels across the bus), is a linear array of bytes, each byte having a unique address known as its 'physical address'.

But, as you have just read, every address is represented in a segmented form. This is called a Virtual Address, because it is translated into a physical address. Every last piece of code uses this Virtual Addressing scheme. (Read the instruction set if you want proof.)

Segmentation and Paging

To minimize the amount of information needed to specify the address translation function, large sequential blocks of memory are mapped as single units.

Now, BeOS uses the Paging method of translation... Even though a purely Segmentation-based method exists, it is not as good for virtual memory requirements, as chunks are dealt with as variable sized blocks. So only the method that involves Paging is described.

The 80386 uses both segmentation and paging in a two-stage virtual to physical address translation mechanism. Each stage of address translation requires a table to be maintained by BeOS for each Team.

The conversion is as follows

[Virtual Address (Seg:Offset)] ->> Segment Table ->>
[32 bit Linear Address] ->> Paging Table ->>
[32 bit Physical Address]

The first stage uses segmentation to translate a 2 part address in the virtual address space into an address in an intermediate address space, called the linear address space. The second stage uses paging to translate this linear address to a physical address.

Each team's Virtual Address space has up to 16,384 segments, each of which can be up to 4GB in size, making the virtual address space 64 terabytes in size. But, as the physical address space is only 32 bits in length, and is not segmented, the system can only address a mere 4GB of real memory at any point in time. (I believe this is the total memory capacity of the system, despite the addressable range of the virtual address space.)

The 2 translation stages are discrete and separate entities, with their own processes of translation. The Segment table is stored in linear memory space, whereas the paging tables are stored in physical memory. One consequence of this is that the segment translation tables can be relocated by the paging mechanism without the knowledge of the Segment Translation engine. Similarly the paging mechanism knows nothing about the virtual address space that is used by programs to generate their addresses. Neither system is aware of the behaviour of the other; they just do what they're told.

Virtual Memory

Virtual Memory comes into the picture at this point to provide an extension to physical memory. Virtual memory in Be is stored on the tape backup device, umm., I mean the hard drive, and is used to let you get your 4GB physical worth, by extending what little memory you actually have in reality. Virtual memory happens at the Page Translation stage.

How this works In English:

Virtual memory has the ability to pretend memory exists in 2 tricky ways. One is to mark the page as on disk in the tables, so that the operating system can go and fetch it for you. The other is to pretend the pages exist, but if they never get accessed, they never have to exist, thus you have memory for nothing ;-)

How this works in Tech Speak:

The address translation mechanism supports virtual memory in 2 ways. First, it is used to mark only the parts of the virtual memory actually in main memory as valid, and it is set up to translate virtual addresses corresponding to the resident parts of the virtual memory to their respective physical memory addresses. If a program references a virtual address corresponding to a part of the virtual memory that is not resident, the reference will cause an exception due to invalid mapping information. The operating system at this point can deal with the situation by loading in the requested data or creating a new page, at which point the application that generated the fault may continue. Now, this process on some systems (w95/nt) often seems to produce a nasty Freeze in the system, while the OS goes off and gets the data. I have noticed with Be, that it actually only blocks the one Thread that made the request, and allows other threads to continue.. [correct me if i'm wrong]

Given that we are moving memory to and fro between main memory and the hard drive, using fixed blocks of memory speeds up this process nicely. On the 80386, a 4096-byte chunk is used.. (you may have already come across this magic figure in the headers).
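Since everything is moved in whole pages, driver code constantly rounds sizes up to a page multiple. A minimal sketch (BeOS exposes this constant as B_PAGE_SIZE; the helper name is my own):

```c
#define PAGE_SIZE 4096UL  /* the 80386 page size; B_PAGE_SIZE on BeOS */

/* Round a byte count up to a whole number of pages.
   Works because PAGE_SIZE is a power of two. */
unsigned long round_to_pages(unsigned long bytes)
{
    return (bytes + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}
```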

Protection

Go Re-Read the Chapter on Protection, as it is for this processor. It may even make more sense now with the above info.
<grin>

The Segmentation Process

So the die-hard is back, (you did read it didn't you) and you want to know more..

If Not, don't forget to read about the I/O space and other important stuff below that IS important to what you want to do.. !!

So, when dealing with Segment translation, each segment has 3 pieces of information associated with it.

  1. Base Address in Linear Space.
  2. Size Limit... Umm., you guessed it, the largest offset for this segment.
  3. Attributes, which indicate segment characteristics such as whether the segment can be read from, written to, or executed as a program, the privilege level of the segment, etc.

The Size Limit is the maximum address you can read as an offset in a segment. Access past this limit and, you guessed it, a General Protection Fault!

Access violations also occur if you break the access rules according to the privilege level attributes for the segment. (Remember, the attributes are only relevant for the current team's address space.)

The base address, limit, and protection attributes for a segment are stored in a segment descriptor, which is referenced during the virtual-to-linear address translation process. Segment descriptors are stored in memory in Descriptor Tables, which are simply arrays of segment descriptors. The Segment Selector part of a virtual address indexes this table..

Segment Descriptor Tables

The Global Descriptor Table (GDT) and the Local Descriptor Table (LDT) are special segments that contain the segment descriptor tables. Descriptor tables are stored in special segments that are maintained by BeOS, to prevent application software from modifying the address translation information.
The virtual address space is divided into 2 equal halves, one mapped by the GDT, and one by the LDT ..... (remember what was said in the protection chapter, about kernel/user space)

When a task switch occurs, the Local Descriptor Table is changed, but the GDT stays the same...

The LDT contains descriptors for the segments private to a single thread. Several threads can share a common LDT, as is the case with BeOS applications (teams).

Segment Selectors : Tricky Stuff

About that 13 bit index, 3 bit extra thing in the segment selectors...

15                  3    2     1 0
+-------------------+----+-------+
|  Descriptor Index | TI |  RPL  |
+-------------------+----+-------+
RPL = Requested Privilege Level
TI = Table Indicator : 0 = Global Descriptor Table : 1 = LDT.
Descriptor Index = index into the table selected by TI.

Most things should be clear, except this mysterious RPL field.. Whenever the program attempts to access a segment, the current privilege level (CPL) is compared to the privilege level of the segment to see if we're allowed to be accessing it. The operating system can twiddle the RPL value of a selector handed to it, which ensures that the operating system does not access a segment on behalf of a calling program unless the caller itself has access to the segment. If you need to know how this happens or want to know more about Segment Descriptors, go buy a book.
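Pulling the three fields out of a selector is just shifts and masks; a sketch in C (for illustration only, these are not OS functions):

```c
/* Decode the fields of a 16-bit segment selector. */
unsigned sel_index(unsigned sel) { return sel >> 3; }       /* descriptor index */
unsigned sel_ti(unsigned sel)    { return (sel >> 2) & 1; } /* 0 = GDT, 1 = LDT */
unsigned sel_rpl(unsigned sel)   { return sel & 3; }        /* requested priv.  */
```

For example, selector 0x0F decodes as descriptor index 1, in the LDT, with RPL 3.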

Paging

Paging operates after segmentation to complete the virtual-to-physical address translation process. Paging translates the linear addresses put out by segmentation to physical addresses.

Unlike segmentation, which operates with variable-size chunks of memory, paging operates with fixed-size chunks of memory called pages. Paging divides both linear and physical address spaces into pages. Any page in the linear address space can map to any page in the physical address space. (Regardless of how the linear space is mapped by segmentation.)

How this happens In English:

You have your linear-address space cut up into chunks of 4K, which can then be swapped in and out of virtual memory. There will also be chunks of memory that don't map to ANYTHING, and thus have a little flag saying Not Valid, which causes nasty Access Violations if the user tries to access them! Also, we only need 20 bits to address this region, because the offset within a 4K chunk is represented using 12 bits, so this reduces the size of the lookup table...

How this happens In Tech Speak:

The 80386 uses a 4KB page size, aligned on a 4K boundary. This means that the paging mechanism divides the linear address space (4GB) into 2^20 pages.

Since an entire 4K page is mapped as a unit, and because of the alignment, the lower 12 bits of the linear address are passed through the paging mechanism directly as the lower 12 bits of the physical address. The relocation function performed by paging can be thought of as a function that translates the upper 20 bits of a linear address to the upper 20 bits of the corresponding physical address.

The linear-to-physical address translation function is extended to permit a linear address to be marked as invalid rather than producing a physical address. A page can be marked invalid either because it is simply a linear address not supported by the Operating System, or because it corresponds to a page in virtual memory. In the first case, the program generating the invalid address must be terminated (GPF!); otherwise, the OS goes and gets the memory for us.

The Two Level Page Structure

Because a single flat page table would take up 4MB of memory, implementing it naively would suck something chronic. So the lookup is divided into 2 levels.

We can represent the address by only the upper 20 bits (because the lower 12 bits map to an offset within a 4K chunk.)

This 20 bit address is split into 10 bits for one part, and 10 bits for the second part. Now, to reduce the memory needed to store things, the second half can be stored on disk, or it can be 'invalid', in which case it doesn't exist until it needs to.
Due to tricks with the layout of this memory, it is possible to map the global part of the address space in such a way that it is the same for every process. This reduces the amount of table that is separate for each process.

When a task swap occurs, they just change the process specific addresses.
So, we end up with a minimum of 4KB of memory per team spent in lookup tables (just the page directory), with a maximum of a bit over 4MB per team if they use all the address locations.. (Eeek).
So, if that satisfies your curiosity, skip the next 2 Tech Speak explanations and read about Page Level Protection.

The 2 Level Page Structure In Tech Speak

The page table contains 2^20 entries, each of which is 4 bytes wide. If stored as one table, it would occupy 4 megabytes of contiguous memory. Rather than dedicate this memory to the page table, the table is stored as a 2 level table. Furthermore, the linear-to-physical address translation for the upper 20 address bits is done in two steps, each using 10 bits.

The first level of the table is called the page directory. It is stored in a single 4K byte page and has 1024 four-byte entries that each point to a second-level table. The second-level tables are called page tables and are also exactly one page (4K) in size and contain 1024 four-byte entries. Each four-byte entry contains the physical address of a page. The page directory is indexed by the upper 10 linear address bits (bits 31...22), and the second-level page tables are indexed by the middle 10 linear address bits (bits 21...12) to obtain the page table entry containing the physical base address of a page. The upper 20 bits of this physical address are combined with the low-order 12 bits (the page offset) from the linear address to form the final physical address that is the output of the page translation process.
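The bit-slicing described above can be sketched in C; given a 32-bit linear address, the three fields fall out with shifts and masks (illustrative only, not a kernel API):

```c
#include <stdint.h>

/* Split a 32-bit linear address into the fields used by the
   two-level 80386 page translation. */
uint32_t dir_index(uint32_t lin)   { return lin >> 22; }           /* bits 31..22: page directory entry */
uint32_t table_index(uint32_t lin) { return (lin >> 12) & 0x3ff; } /* bits 21..12: page table entry     */
uint32_t page_offset(uint32_t lin) { return lin & 0xfff; }         /* bits 11..0 : offset within page   */
```

So the linear address 0x00403025 translates via directory entry 1, then table entry 3, then byte 0x25 within the resulting page.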

Non-Present Page Tables in Tech Speak

By using a 2 level table structure, we have not solved the problem of needing 4 megabytes of memory to store the page table. However, the 2 level structure allows the page table to be scattered through memory in pages rather than being stored in one contiguous 4MB chunk... Furthermore, second-level tables need not be allocated for nonexistent or unused parts of the linear address space. The directory page must always exist, but the second level tables can be allocated only as needed.

The present attribute in a directory entry indicates whether the corresponding 2nd level table is available for use in page translation. If it is not present, the OS can allocate a new one so we can access it, or swap the old one back in from virtual memory.

Global vs Local Page Tables

Unlike the segment table structure, there is no provision for splitting the page table into a global table and a local table. However, by arranging for each task to share a part of the linear-to-physical address mapping function, a global part of the linear address space can be defined. (By convention.)

This part of the linear address space is mapped the same in every task, resulting in the ability for Global Pages and Local Pages. Thus when we perform tasks swaps, we need only change half the table.

This is also really efficient in Virtual Memory terms, because if a Global Memory page is swapped in, it is thus available to all teams.

Page Level Protection

Pages only have 4 combinations for protection.. 1 bit for write access and 1 bit for the user/supervisor level; Execute access is always assumed valid, and is left to segmentation to control..

So when you are using the create_area() and clone_area() functions, you are controlling things from the Page Level. This is automatically mapped into the Segmented Area, at the virtual address you specify (or let the system automatically allocate for you.)

Only 2 privilege levels are recognized by pages: levels 0, 1, and 2 are grouped together as supervisor, and user level (3) stands alone.

Page Level control is OR'd with Segment Level control:
if either one says Locked, the access is locked, etc.

Pages and Processor Caches

To increase speed by avoiding access to the memory-resident page tables on every memory reference, the most recently used linear-to-physical address translations are stored in a page translation cache in the processor (the Translation Lookaside Buffer).. (thus why all the table entries are the same size!) This cache is consulted before the memory based tables are used. Coherence between the data in the paging translation cache and the actual tables in memory is the responsibility of the operating system..

And you don't need to worry about it ;-)

Virtual Memory

In case you were wondering how the system determines which pages get swapped out, the technique is quite blunt ...

  • If the P flag is set, we're in memory; else we're not..
  • The A and D bits assist the virtual memory implementation in determining who has been used recently.

By periodically examining and clearing all of the A bits, the operating system can determine which pages have not been referenced recently. These pages might be good candidates to move out to disk storage. If the D bit is set to 0 when a page is read in from disk, and it is still 0 when the page is to be moved out to disk, the page has not been modified, and need not be re-saved over the old copy on the disk.
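In C, one sweep of that scheme might look like this (the A and D bit positions are the real 80386 page-table-entry ones; the function names and the policy itself are my own sketch, not the kernel's):

```c
#include <stdint.h>

#define PTE_ACCESSED 0x20u  /* A bit: set by the CPU on any access to the page */
#define PTE_DIRTY    0x40u  /* D bit: set by the CPU on a write to the page    */

/* A page whose A bit is still clear since the last sweep is a
   good candidate to push out to disk. */
int is_eviction_candidate(uint32_t pte) { return (pte & PTE_ACCESSED) == 0; }

/* A clear D bit means the on-disk copy is still current, so the
   page can be dropped without writing it back. */
int needs_writeback(uint32_t pte) { return (pte & PTE_DIRTY) != 0; }

/* Clear the A bit so the CPU's next access shows up in the next sweep. */
uint32_t clear_accessed(uint32_t pte) { return pte & ~PTE_ACCESSED; }
```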

I/O Space

As its name implies, this area is optimized for the storage of control ports for input/output devices like keyboards, disks, CRT displays, printers, and so on. This space has a very different protection scheme to main memory.

"I/O space on x86 is wholly separate from memory. Accessing I/O space and memory space share the same physical lines (addr and data) but the state of one extra pin determines whether it's an i/o access or a memory access.
No translation is performed for I/O space."
-Gregory Gerard <ggerard@iname.com>

This area has an addressable range of 64KB in length.

Typically, an I/O device has only a few control ports, requiring only a small number of bytes of addressable storage, and there are only a small number of devices in the system (relative to the 64K space available). [some video cards try hard to prove me wrong, but that's another story]

I/O ports can not be relocated within the space by the processor. There are soft-addressing schemes employed with Plug and Play systems and PCI devices, but this is a matter of choosing which fixed I/O address to listen to through the use of an initialization protocol, as opposed to creating the virtual-to-physical lookup table that we see in main memory.

This area is accessed through a special set of instructions (IN, OUT, and friends) that transfer information between the I/O space and the processor registers/main memory. (These trigger the extra pin on the processor.)

There are two mechanisms employed to control access to the I/O address space and I/O related instructions.

  1. The IOPL field in the EFLAGS register.
  2. The I/O Permission bitmap in the TSS.

The EFLAGS register reports on processor status, and also doubles as a control flag register. The IOPL field is the I/O Privilege Level; it is two bits wide, and represents the layer of protection the system is in (0-3).

The TSS contains a bitmap with 1 bit per port address, masking the entire 64K region. If an I/O operation is performed on an address whose bit is set to 1, the processor generates an Exception.. The Operating System can catch this exception and determine what to do.
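A sketch of what the processor's check amounts to (BeOS manages the real TSS for you; this just shows the logic, and the function name is mine):

```c
#include <stdint.h>

/* Check the TSS I/O permission bitmap for a single port:
   a clear bit permits the access, a set bit raises an exception. */
int io_port_permitted(const uint8_t *bitmap, unsigned port)
{
    return (bitmap[port / 8] & (1u << (port % 8))) == 0;
}
```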

When running in virtual 8086 mode, the operating system is responsible for catching the I/O operations, switching into Protected mode, and actually doing the I/O operation. This results in large delays for I/O on systems such as win95. Because BeOS is always running in 32 bit mode, we don't have this problem.

Understanding the DMA Controller Model
The 8237a DMA Controller

Nasty Compatibility Problems

So, now that you are all souped up on what memory actually is, in a sort of roundabout way, you might be scared to hear that there is a 16MB physical memory address limitation in a lot of hardware (ISA land particularly) on Intel. And not only this, there are lots of silly problems with I/O access because of the equally archaic boards that still exist.

Remember that before the 386, there was the good old XT (8088).
I promised not to talk about the old memory layout, but unfortunately, you are going to have to understand at least a little bit... <sorry>

I/O Ports, and Nasty little Problems

"Accessing I/O space and memory space share the same physical lines (addr and data) but the state of one extra pin determines whether it's an io access or a memory access. " -Gregory Gerard <ggerard@iname.com>
...
To do:
Pin Layouts for an I/O operation in Hardware on an ISA bus.
...

Because many of the old cards believed that people wouldn't access I/O addresses any higher than some low address, they didn't check that the higher address bits were zero. The result is that a device living at one low port address may actually respond to any access with the same lower address lines, regardless of what the higher address lines were set to....

Thus when PCI devices map to higher port ranges, care has to be taken to avoid all these shadow addresses. That is, PCI assumes that every Legacy Device living at an address may shadow other addresses..

Device Drivers and the 16MB PHYSICAL memory problem

Again, largely because of the addressing scheme of the original 8237a DMA controller on the old AT, the memory addressable by this chip is only the lower 16MB of memory, and a maximum of 64K of data can be transferred per call..

This legacy issue is still around today.

This produces problems, in that if you wish to write from a network card into CPU memory, you need to allocate this space in the first 16MB of physical memory.. This is why when you create a new area, you may need to specify that it exists in this region, and of course, you should also specify that the address is mapped into Kernel address space so that it is always globally accessible, regardless of the team that the interrupt steps in on..
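A quick sanity check a driver might perform before handing a physical buffer to the legacy DMA controller, sketched under the constraints above plus the classic 8237a rule that a transfer may not cross a 64K physical boundary (the names and the check itself are my own, not a BeOS call):

```c
#include <stdint.h>

#define ISA_DMA_LIMIT    (16uL * 1024 * 1024)  /* 8237a reaches only the low 16MB */
#define ISA_DMA_MAX_XFER (64uL * 1024)         /* at most 64K per transfer        */

/* Can this physical buffer be used for a legacy ISA DMA transfer? */
int isa_dma_ok(uint32_t phys, uint32_t len)
{
    if (len == 0 || len > ISA_DMA_MAX_XFER)
        return 0;
    if (phys + len > ISA_DMA_LIMIT)                   /* must lie below 16MB */
        return 0;
    return (phys >> 16) == ((phys + len - 1) >> 16);  /* no 64K crossing     */
}
```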


The Communal Be Documentation Site
1999 - bedriven.miffy.org