Ignorance may be Strength : Rust

Friday, January 5, 2024

Unit Tests in Rust

Looking at that list of things to do, one seems obviously simpler than the rest and probably more useful in the short to medium term: getting the unit tests to run.

In fact, it's ridiculously easy. I'm not exactly sure where I copied this from now, but the problem is either I chose the wrong place or didn't copy it very well. The line I have which says

#[cfg(tests)]

is plural when in fact it should be singular

#[cfg(test)]

Yup, that's it. The tests now run. I'm not quite sure why Rust doesn't give a warning about this particular mistake. rust-analyzer says that the config "tests" is "disabled", but my thought had been that meant that there was a configuration "tests" which I had specifically - deliberately or accidentally - disabled, and I spent some time looking through my configuration settings before seeing enough examples on the web where it said cfg(test) that it finally sank in that I had a simple typo.
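For reference, here is a minimal sketch of the shape that cargo test does pick up. The gpf_select function and the test body are the ones from the UART post further down this page. The value 4 for ALT0 is an assumption taken from the C code's `4<<12`, and the extraneous `use alloc::...` line is left out.

```rust
// ALT0 is function code 4 in a 3-bit GPFSEL field
// (assumed from the C code's `4<<12`; the post does not show the constant).
const ALT0: u32 = 4;

// Set the 3-bit function field at position `pos` within a GPFSEL word.
fn gpf_select(flags: &mut u32, pos: u32, fun: u32) {
    let lsb = pos * 3;
    *flags = *flags & !(7 << lsb); // clear these bits
    *flags = *flags | (fun << lsb); // set these bits
}

// Singular `test`: this module is only compiled for `cargo test`.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_set_4_in_1_from_0() {
        let mut val = 0;
        gpf_select(&mut val, 1, ALT0);
        assert_eq!(val, 0b100000);
    }
}
```

With the plural spelling the whole module is silently compiled out, which is why cargo test reported no tests rather than an error.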

For those wondering, yes, my code under test passed first time. My test had another error in it, which is I had copied across this (extraneous) line as well which wouldn't compile:

use alloc::collections::btree_map::Values;

And now I get this output:

   Compiling homer_rust v0.1.0 (/home/gareth/Projects/IgnoranceBlog/homer_rust)
    Finished test [unoptimized + debuginfo] target(s) in 0.32s
     Running unittests src/lib.rs (target/debug/deps/homer_rust-6186d6a57da04f90)

running 1 test
test tests::test_set_4_in_1_from_0 ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

   Doc-tests homer_rust

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

This is chattier than I would like, but Rust seems to be that way. Also, for those scoring at home, it's worth noting that if I had had this working at the time, I would have written four or five tests to ensure I wrote the correct code. After the fact, especially with my application now working, I'm not going to bother.

This is checked in as RUST_BARE_METAL_TEST_NOT_TESTS.

Fixing Blue Homer

In the sample code we are copying, when configuring the framebuffer device, one of the options refers to a choice of byte order in the color setting: it can be "RGB" or "BGR". RGB was requested, but the sample accepts that the answer may come back "BGR". I ignored this, in part because the emulator accepts RGB.

My best guess is that the real hardware doesn't - and so Homer is inverted. Let's test this theory.

The response is (now) processed in lfb_init. So let's intercept that and send the RGB/BGR bit down the UART.

    let pixorder = unsafe { read_volatile(volbuf.add(24)) };
    write("pixel order is ");
    write_8_chars(hex32(pixorder));
    write("\r\n");

And we see:

pixel order is 00000000

That's zero, which our comments tell us is BGR, not RGB. So we need to store that in our struct and use it when rendering. This adds a certain amount of complexity, but here are the highlights:

We declare an enum to make it clearer what the two cases are. For some reason, this needs an annotation declaring that it derives PartialEq, although presumably we could not do that and explicitly define it. I assume this just means that we can test that two values with BGR are the same, as are two RGBs. We then add this into the struct of information about the Framebuffer:

#[derive(PartialEq)]
enum PixelOrder {
    BGR,
    RGB
}

struct FrameBufferInfo {
    width : u32,
    height : u32,
    pitch: u32,
    base_addr: u32,
    pixorder: PixelOrder
}

And then we need to set this to the correct enum value based on the 0 or 1 return value from the mailbox call:

fb.pixorder = if pixorder == 1 { PixelOrder::RGB } else { PixelOrder::BGR };

Note that, if (like me) you did not already know this, Rust eschews the ternary operator in favour of having functional-style if statements that can return values (as long as you don't put semicolons after the final expression in a block).
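As a small standalone sketch of that point, the same decision can be written either as an if expression or as a match. The function names here are hypothetical, and Debug is derived only so that assert_eq! can print the values.

```rust
#[derive(Debug, PartialEq)]
enum PixelOrder {
    BGR,
    RGB,
}

// `if` is an expression: the final expression of each branch
// (no trailing semicolon) becomes the value of the whole `if`.
fn pixel_order_from_raw(raw: u32) -> PixelOrder {
    if raw == 1 { PixelOrder::RGB } else { PixelOrder::BGR }
}

// The same decision as a `match`, which is also an expression.
fn pixel_order_from_raw_match(raw: u32) -> PixelOrder {
    match raw {
        1 => PixelOrder::RGB,
        _ => PixelOrder::BGR,
    }
}
```

Deriving PartialEq is what makes the `==` comparison on PixelOrder values compile. Without it, a match on the enum would still work.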

And, of course, when we come to draw Homer we need to take this into account:

            if fb.pixorder == PixelOrder::RGB {
                unsafe { *((ptr + x*4 + 0) as *mut u8) = homer[homer_index + 0]; }
                unsafe { *((ptr + x*4 + 1) as *mut u8) = homer[homer_index + 1]; }
                unsafe { *((ptr + x*4 + 2) as *mut u8) = homer[homer_index + 2]; }
                unsafe { *((ptr + x*4 + 3) as *mut u8) = homer[homer_index + 3]; }
            } else {
                unsafe { *((ptr + x*4 + 2) as *mut u8) = homer[homer_index + 0]; }
                unsafe { *((ptr + x*4 + 1) as *mut u8) = homer[homer_index + 1]; }
                unsafe { *((ptr + x*4 + 0) as *mut u8) = homer[homer_index + 2]; }
                unsafe { *((ptr + x*4 + 3) as *mut u8) = homer[homer_index + 3]; }
            }

(It's a very small change, but the offsets in the indices on the left hand side in the second block go 2,1,0,3).
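The eight assignments could also be expressed as a lookup of destination offsets. This is only a sketch of the idea using safe array indexing rather than the raw pointer writes in the real code, and put_pixel is a hypothetical helper, not something in the repository.

```rust
#[derive(PartialEq)]
enum PixelOrder {
    BGR,
    RGB,
}

// Copy one 4-byte source pixel into `dst`. For BGR the destination
// offsets go 2,1,0,3, i.e. bytes 0 and 2 swap places.
fn put_pixel(dst: &mut [u8; 4], src: &[u8; 4], order: &PixelOrder) {
    let map: [usize; 4] = if *order == PixelOrder::RGB {
        [0, 1, 2, 3]
    } else {
        [2, 1, 0, 3]
    };
    for (i, &d) in map.iter().enumerate() {
        dst[d] = src[i];
    }
}
```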

OK, let's try that. Yup, Homer is now the right colour. Let's check this in before it stops working! It's tagged as RUST_BARE_METAL_HOMER_PI.

And now that I have everything fairly stable, I'm going to finish the task I'd set myself and clean up all the dead code, removing unnecessary blocks of tracing and associated functions. The final version is tagged RUST_BARE_METAL_CLEANED_HOMER.

Conclusion

It's been painful (for me, at least) but we are now at the point where we can do a number of things on the bare metal of a Pi working almost entirely in Rust. But it's one big mess.

I think we've gone as far as Homer can really take us, but I do want to read about - and experiment with - some other concepts in this space, and improving Homer seems like the most sensible way forward.

As a list as much to myself as anything, I want to:

figure out how to get most of the standard library in, without pulling in dependencies on Linux;
after that, figure out how to get memory management to work so I can allocate blocks of memory;
using that, be able to allocate memory blocks of arbitrary alignment;
figure out how to get unit tests to run in this environment;
look into making the code more modular.

While doing all this, I hope to gain a better understanding of what "idiomatic" Rust looks like and then, within that, to find my own "voice". At the moment, I suspect I am essentially trying to write Java code in Rust, which is slightly odd because I feel Rust itself is closer to my natural intuition of how to code than Java is.

When I have done all that, I think I want to move on and try and build a proper console on the monitor which can display the messages that would otherwise go to the UART (although it should be configurable to do both).

Wednesday, January 3, 2024

UART and Real Hardware

One thing I have been trying to avoid is to go down the road of connecting a serial cable to the Pi and sending signals to a USB port on a real computer. It just seems too hard (I'm a software guy, not a hardware guy). However, at this point I need to admit I misjudged how hard it is to get the HDMI console working, so I'm backing off and trying something else.

I ordered an FTDI Chip 1m USB to UART Cable in Black from Radio Spares (RS) in the UK.

Just wiring the cable up reminds me of why I hate hardware so much. On the PC end, of course, you just shove the USB connector into your laptop and you're done. On the Pi end, however ...

First off, the cable comes with three wires which we will call black, orange and yellow. One of the things about serial connections that confuses me is the crossover: you connect "Rx" to "Tx" and "Tx" to "Rx" and then you have to remember which "Rx" and "Tx" you are thinking about. The Pi has a set of GPIO pins that (if you have the case the same way around I do) run down the left hand side of the box from the front to near the back. The pins are confusingly numbered twice: once by their physical location and once by their logical location. For now (we will come back to them), I'm going to ignore the GPIO numbers and just go with the physical location: the pins are numbered starting at "1" from the front, and each pair of pins has the lower, odd, pin on the right and the higher, even, pin on the left. Thus the first row has "2" and "1", the second "4" and "3", the third "6" and "5" and so on. We want to wire up the "black" lead to pin "6", the "yellow" lead to pin "8" and the "orange" lead to pin "10". That's the third, fourth and fifth pins from the front on the extreme left hand side of the box.

Doing this is not made easier by the fact that (for my cable at least), the orange and yellow leads "wanted" to be the other way around.

That's the clearest description I've been able to come up with - the one which would have helped me to get it set up before I started. The link to the website above gives the more technical description of how the cable itself is wired. This link explains in detail how the Pi is connected together.

Configuring the UART

Now that we have physically connected the two ends together, we need to set up the software on both ends. You need some terminal software to run on the PC end of things. I'm using Linux and after looking at what other people had done, decided to use picocom as a terminal. It's fairly simple to install and use:

$ sudo apt-get install picocom
$ sudo picocom --baud 115200 /dev/ttyUSB0
picocom v3.1

port is        : /dev/ttyUSB0
flowcontrol    : none
baudrate is    : 115200
parity is      : none
databits are   : 8
stopbits are   : 1

It comes back to you with all the settings it uses. Apart from the baud rate (which we specified), it is using no parity, no flow control, 8 data bits and 1 stop bit. I'm fairly optimistic that these are the settings that the code I'm copying from also used (to check, I installed the original code onto the SD card to try this and it worked as a teletype echo, so I know I have done everything correctly).

So now I need to go back and port that code to set up the port correctly.

Here is the original C code:

void uart_init()
{
    register unsigned int r;

    /* initialize UART */
    *UART0_CR = 0;         // turn off UART0

    /* set up clock for consistent divisor values */
    mbox[0] = 9*4;
    mbox[1] = MBOX_REQUEST;
    mbox[2] = MBOX_TAG_SETCLKRATE; // set clock rate
    mbox[3] = 12;
    mbox[4] = 8;
    mbox[5] = 2;           // UART clock
    mbox[6] = 4000000;     // 4Mhz
    mbox[7] = 0;           // clear turbo
    mbox[8] = MBOX_TAG_LAST;
    mbox_call(MBOX_CH_PROP);

    /* map UART0 to GPIO pins */
    r=*GPFSEL1;
    r&=~((7<<12)|(7<<15)); // gpio14, gpio15
    r|=(4<<12)|(4<<15);    // alt0
    *GPFSEL1 = r;
    *GPPUD = 0;            // enable pins 14 and 15
    wait_cycles(150);
    *GPPUDCLK0 = (1<<14)|(1<<15);
    wait_cycles(150);
    *GPPUDCLK0 = 0;        // flush GPIO setup

    *UART0_ICR = 0x7FF;    // clear interrupts
    *UART0_IBRD = 2;       // 115200 baud
    *UART0_FBRD = 0xB;
    *UART0_LCRH = 0x7<<4;  // 8n1, enable FIFOs
    *UART0_CR = 0x301;     // enable Tx, Rx, UART
}

All of this looks pretty hairy and none of it is completely transparent. (This, of course, is why I have avoided having anything to do with it - it feels as complicated as getting the main display to work.)

But let's take it slowly and see what we can get to work in our own Rust code.

First off, let's add a call to uart_init in kernel_main:

#[no_mangle]
pub extern fn kernel_main() {
    avoid_emulator_segv();
    uart_init();

In order to port this code, I've decided to do a limited amount of refactoring and cleaning up of the code. For example, we are going to reuse mbox_send to set the clock, so I've moved the code that was (wrongly) in there to check the response about the video buffer out to the lfb_init method. I've also bundled up the piece of code that is responsible for ensuring that the emulator doesn't SEGV into its own method (avoid_emulator_segv) and called that up front.

So what does the rest of this code do? Well, the first line claims (presumably correctly) to disable the UART by writing 0 to the UART0_CR. We can do that too:

const UART_CR: u32 = 0x3F201030;
...
fn uart_init() {
    // Turn off UART0 while we configure it
    mmio_write(UART_CR, 0);

The next block of code sets the UART clock. Setting the clocks in this way is described in the section of the wiki that deals with tagged mailbox messages.

    // Now, set the UART clock (yes, the Raspberry Pi seems
    // to have about 10 separate clocks) to 4MHz.
    let mut buf: [u32;36] = [0; 36];

    buf[0] = 9 * 4; // this message has 9 4-byte words
    buf[1] = 0;
    buf[2] = 0x38002; // set one of the clock rates
    buf[3] = 12; // request has three words of data
    buf[4] = 0;  // space for response length, but is zero for request
    buf[5] = 2;  // 2 selects the "UART" clock
    buf[6] = 4000000; // set it to 4MHz
    buf[7] = 0;  // avoid setting "turbo" mode
    buf[8] = 0;

    let mut msg = Message { buf: buf };
    mbox_send(8, &mut msg.buf);

Note that this reuses the (refactored) mbox_send that we used previously to configure the display.
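To make the layout of that message easier to see, here is a sketch that builds the same nine words from named constants. The constant names MBOX_REQUEST, MBOX_TAG_SETCLKRATE and MBOX_TAG_LAST come from the C code and their values from the Rust port above. CLOCK_ID_UART and set_clock_rate_message are hypothetical names introduced only for this illustration.

```rust
const MBOX_REQUEST: u32 = 0;
const MBOX_TAG_SETCLKRATE: u32 = 0x38002;
const MBOX_TAG_LAST: u32 = 0;
const CLOCK_ID_UART: u32 = 2; // hypothetical name for the "UART" clock id

// Build a "set clock rate" property message in a 36-word buffer.
fn set_clock_rate_message(clock_id: u32, rate_hz: u32) -> [u32; 36] {
    let mut buf = [0u32; 36];
    buf[0] = 9 * 4;               // total message size: 9 4-byte words
    buf[1] = MBOX_REQUEST;        // this is a request
    buf[2] = MBOX_TAG_SETCLKRATE; // tag: set one of the clock rates
    buf[3] = 12;                  // value buffer size: three words
    buf[4] = 0;                   // response length, zero for a request
    buf[5] = clock_id;            // which clock
    buf[6] = rate_hz;             // requested rate
    buf[7] = 0;                   // avoid setting "turbo" mode
    buf[8] = MBOX_TAG_LAST;       // end of tags
    buf
}
```

The result would then be wrapped in the aligned Message struct and passed to mbox_send exactly as before.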

Now, to get to the rest of it, we need to understand the GPIO configuration registers. In the ARM peripherals guide, chapter 6 (p89) describes the GPIO pins and the following page (p90) has a table with all the registers and their alleged addresses (again, these are in the right order, but with the wrong offset, for the actual Raspberry Pi boards). This is somewhat confusing, in part because the first row is duplicated.

Then Table 6-1 explains how the registers are used. The first six registers (GPFSEL0 to GPFSEL5) are used to control the meaning of the 54 GPIO pins (forty in the strip down the side of the board, the other fourteen in the header at the front). For each pin, three bits are used, giving eight possible options for the pin, so each 32-bit register covers ten pins. 000 means this pin is used as an input, 001 means this pin is used as an output, and the other six options identify special-purpose "alternative" functions. These alternative functions are specified in Table 6-31 in Section 6.2 on pp102-103.

Remember that earlier I said that the pins were numbered twice, once for their physical location and once for their "logical" location? Well, this time we use the logical location. Looking at the pinout for the GPIO header you can see that pin 8 is described as GPIO 14 (TXD) and pin 10 is described as GPIO 15 (RXD). Comparing this to Table 6-31, you can see that in the rows GPIO14 and GPIO15 the alternate functions in column "ALT0" are "TXD0" and "RXD0" respectively.

So what we need to do is to set the relevant bits of the second select register (GPFSEL1) to choose ALT0 without damaging any of the other bits. This is what the code does. My question is whether, in porting it, we can make it a little clearer? And whether this is the right time to introduce some unit tests?

I wrote this test:

#[cfg(tests)]
mod tests {
    use alloc::collections::btree_map::Values;

    use super::*;

    #[test]
    fn test_set_4_in_1_from_0() {
        let mut val = 0;
        gpf_select(&mut val, 1, ALT0);
        assert_eq!(val, 0b100000);
    }
}

and tried to run it using cargo test but it comes back and says 0 tests found. I think I'm going to put that on my list of things that aren't working in my environment and that I need to get working, and for now assume that I can write this function without help. So I end up with these three functions:

fn gpfsel_read(reg: u32) -> u32 {
    let addr = PERIPHERAL_BASE + GPIO_BASE + (reg*4);
    mmio_read(addr)
}

fn gpf_select(flags: &mut u32, pos: u32, fun: u32) {
    let lsb = pos * 3;
    *flags = *flags & !(7 << lsb); // clear these bits
    *flags = *flags | (fun << lsb);  // set these bits
}

fn gpfsel_write(reg: u32, value: u32) {
    let addr = PERIPHERAL_BASE + GPIO_BASE + (reg*4);
    mmio_write(addr, value);
}

and I can wire them up as follows inside uart_init:

    let mut fs1 = gpfsel_read(1);
    gpf_select(&mut fs1, 4, ALT0);
    gpf_select(&mut fs1, 5, ALT0);
    gpfsel_write(1, fs1);
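The register number 1 and the positions 4 and 5 used above follow from there being ten 3-bit fields per select register. A small sketch of that arithmetic (gpfsel_location and field_shift are hypothetical helper names, not part of the code in the repository):

```rust
// Each GPFSEL register holds ten 3-bit fields, so GPIO n lives in
// register n / 10, at field position n % 10 within that register.
fn gpfsel_location(gpio: u32) -> (u32, u32) {
    (gpio / 10, gpio % 10)
}

// The least significant bit of a field at position `pos`.
fn field_shift(pos: u32) -> u32 {
    pos * 3
}
```

For GPIO 14 and 15 this gives register 1 with shifts of 12 and 15, which matches the `7<<12` and `7<<15` masks in the C code.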

The next part of the code seems something between arcane and bizarre, but definitely matches the description given on p101 of the Broadcom peripherals guide. It appears that the above code sets the values we want in memory, but does not propagate our choices to the hardware. To achieve that, we need to go through a cycle of telling the chip to make the changes.

In both places, the "magic number" of 150 cycles is specified as being the amount of time that is needed for the change to take effect. I have to imagine that this means "at least" 150 cycles because, apart from anything else, you can't really be sure that any code you write will not be subject to interrupts. And the code that I am copying - unless it is unrolled - would seem to me to use 150 cycles in executing the nop operation, along with at least twice as many in handling the control loop. So I am going to assume that as long as we wait for at least 150 cycles, we will be fine.

It has to be said that while I think I understand what this code is trying to achieve, I don't understand why it works the way it does, and, specifically, I don't understand what is meant by "pull-up" and "pull-down" and why that has anything to do with selecting the function associated with the GPIO pins.

So what this description says (and that the code seems to say) is:

Write 0 to the register GPPUD to remove the current pull-up/down setting;
Wait for the system to recognize the change;
Write a word with bits 14 & 15 set to PUDCLK0;
Wait for the system to process the change;
Clean up by removing the GPPUD and PUDCLK - in our case, we don't need to clean up GPPUD, so we just need to write PUDCLK.

    mmio_write(GPPUD, 0);
    wait_a_while(150);
    mmio_write(GPPUDCLK0, (1<<14) | (1<<15));
    wait_a_while(150);
    mmio_write(GPPUDCLK0, 0);

And, finally, we need to configure the UART itself. The documentation for the UART starts on p175, and the section on the registers begins on p177.

I am somewhat confused by the first line of C code, which is supposed to clear the interrupts, because it seems to contradict what the documentation says on how the register is to be used. The C code writes a 1 into each of the 10 well-defined bits of the ICR, but it seems to me that the definition in the documentation expects that 0s will be written to clear the bits.

Given that the code works, and I don't have a great deal of faith in any of the documentation, I am going to copy the code rather than the documentation but I'm not sure I will be able to tell if this is "correct" or "just happens to work".

mmio_write(UART_ICR, 0x7ff);

Now onto setting the baud rate. We want to set the rate to 115,200 baud based on a clock speed of 4MHz: to set this, we need to provide the "divisor": basically a "wavelength". This is very poorly explained in the Broadcom documentation, so I searched the web for other documentation and found this which may in fact be a good source of documentation in general.

It's fairly obvious that the integer part needs to be an unsigned integer between 1 and 65535, while the fraction part is more obscure. It turns out that the value of the fractional part is a numerator where the denominator is always 64, so FBRD is the number of 64ths. In trying to repeat the calculation that I am copying, the clock speed of 4,000,000 is divided by the baud rate of 115,200 giving a divisor of 34.7222...; according to the documentation, this needs to be further divided by 16 which gives me 2.170138...; the integer portion is clearly 2, and the fractional portion (0.170138 x 64 = 10.89) is just under 11/64. So I want to write 2 and 11 into the IBRD and FBRD registers respectively.
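The same calculation can be sketched in integer arithmetic. The function name baud_divisors is hypothetical, and rounding the fraction to the nearest 64th is my reading of the PL011 documentation, so treat that detail as an assumption.

```rust
// Returns (IBRD, FBRD) for a divisor of clock / (16 * baud), with the
// fractional part expressed in 64ths and rounded to the nearest 64th.
fn baud_divisors(uart_clock_hz: u32, baud: u32) -> (u32, u32) {
    // divisor * 64 = clock * 64 / (16 * baud) = clock * 4 / baud;
    // adding baud / 2 before dividing rounds to nearest.
    let scaled = (uart_clock_hz as u64 * 4 + baud as u64 / 2) / baud as u64;
    ((scaled / 64) as u32, (scaled % 64) as u32)
}
```

For a 4MHz clock and 115,200 baud the scaled value is 139, which splits into 2 whole units and 11 sixty-fourths.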

As luck would have it, those are exactly the numbers in the code I am copying, but I am going to write them in decimal rather than hex for stylistic reasons: I think hex tends to suggest something which derives from bitwise operations.

mmio_write(UART_IBRD, 2);
mmio_write(UART_FBRD, 11);

OK, we're nearly there now. The Line Control Register sets the rest of the transmission parameters, and has eight significant bits. We want to turn most of these bits off, but we want to select 8 bit transmission, which involves setting bits 5 and 6, and enable the FIFO mode (so that we can transmit a buffer in one go and then have the UART do all the hard work), which is in bit 4. So we want to set the register to 0x70.

mmio_write(UART_LCRH, 0x70);

And finally, we re-enable the UART. The control register is described on pp185-187, and the relevant bits we need to set are 0, 8 and 9, which has the hex mask 0x301.

mmio_write(UART_CR, 0x301); // Enable UART with Rx and Tx
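The two magic numbers 0x70 and 0x301 can be built up from the individual bits described above. This is just a sketch. The constant names are hypothetical, loosely following the field names in the PL011 documentation.

```rust
// Line Control Register (LCRH) bits.
const LCRH_FEN: u32 = 1 << 4; // enable FIFOs
const LCRH_WLEN_8: u32 = 0b11 << 5; // 8-bit words (bits 5 and 6)

// Control Register (CR) bits.
const CR_UARTEN: u32 = 1 << 0; // UART enable
const CR_TXE: u32 = 1 << 8; // transmit enable
const CR_RXE: u32 = 1 << 9; // receive enable

// The combined values written in uart_init.
const LCRH_8N1_FIFO: u32 = LCRH_WLEN_8 | LCRH_FEN;
const CR_ENABLE: u32 = CR_UARTEN | CR_TXE | CR_RXE;
```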

And that all works in the emulator. But on the real hardware, not so much.

This is kind of what I was afraid of. In order to be able to debug things, I need to be able to have some means of knowing what is going on. My first approach was to think that I could write to the console. My second approach was to say, well, if I can't do that, can I at least write to the UART? Apparently, I can't do that either.

What's really interesting is that when I try and simplify this by eliminating some of the code, it still doesn't work. Even if I comment out the whole of the body of uart_init, the Pi starts up and does nothing. If I comment out the calls to avoid_emulator_segv and uart_init, it goes back to almost working (showing Homer in the wrong colors). What seems obvious to me is that something, probably either memory related or timing related, is going wrong in a way that is affected by the size of the kernel_main function. But I cannot guess what that could be, nor why it would happen on real hardware and not on the emulator. Interestingly, it does seem to be consistent: if I make one random change and then undo it, it goes back to the behaviour that it had before.

Having spent some time thinking about this away from the computer, I'm becoming increasingly convinced that it must be a timing thing and that it is something I did not copy correctly from the sample program.

Looking back at the C version of the program, I found these lines at the end of mbox_call, which has become mbox_send in my code:

    while(1) {
        /* is there a response? */
        do{asm volatile("nop");}while(*MBOX_STATUS & MBOX_EMPTY);
        /* is it a response to our message? */
        if(r == *MBOX_READ)
            /* is it a valid successful response? */
            return mbox[1]==MBOX_RESPONSE;
    }

What this is doing is checking that the data available to read in the message box is actually the answer to our message (rather than some other message). I suspect that because we are not doing this, it seems we have a response to the second mailbox message, but it's the response to the previous message (setting up the UART). So I'm now going to copy this across.

    loop {
        let rb = mmio_read(MBOX_READ);
        if rb == addr {
            break;
        }
    }
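Note that the Rust loop above drops the C code's check that the mailbox is non-empty before reading it. A sketch that keeps both checks might look like the following. The register reads are passed in as closures purely so the logic can be exercised off the hardware. wait_for_response is a hypothetical name, and the MBOX_EMPTY value of 0x40000000 is assumed from the C tutorial's header rather than shown in this post.

```rust
// Status bit meaning "nothing to read" (assumed value, see above).
const MBOX_EMPTY: u32 = 0x4000_0000;

// Spin until the mailbox delivers the word we sent (`expected`),
// discarding any other responses that arrive first.
fn wait_for_response(
    mut read_status: impl FnMut() -> u32,
    mut read_mailbox: impl FnMut() -> u32,
    expected: u32,
) {
    loop {
        // Is there a response at all?
        while read_status() & MBOX_EMPTY != 0 {
            core::hint::spin_loop();
        }
        // Is it the response to our message?
        if read_mailbox() == expected {
            return;
        }
    }
}
```

In the kernel the two closures would simply be `|| mmio_read(MBOX_STATUS)` and `|| mmio_read(MBOX_READ)`.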

Now, this is back to working and I am getting messages coming across on the console. Faffing with this, it appears to continue working, which is the most important thing to me. On the other hand, Homer is still blue in the face but I think I know why.

This is all checked in as RUST_BARE_METAL_INIT_UART.

Conclusion

With a few twists and turns, we managed to successfully wire up and set up the UART to communicate with the host PC. In doing this, we now have access to more standard tracing. I'm still a little nervous about how stable everything is, but I'm increasingly convincing myself that the problems are all me doing things wrong and not random instability.

Let's stop Homer being so blue.

Figuring out the Issues

So we find ourselves in a situation where the program runs in the emulator sometimes and not at other times. Why not?

Annoyingly, the fact that it works when we have tracing enabled, and not when we don't, means that we can't use tracing to find the problem.

But fortunately, we have determined that we can use the debugger. So let's try that again.

Continuing with our theme of optimization, let's add a new script, debug.sh, which contains the relevant commands to debug in the emulator. It will be recalled that this requires the emulator to start with "-s -S", and we can specify that to make with the GDB option as follows:

make "GDB=-s -S" run

We then need (again in a separate tab or window) to run the debugger, which we will put in a script gdb.sh:

gdb-multiarch -iex "file asm/kernel8.elf" -iex "target remote :1234"

where -iex is an option which specifies that the argument should be run as a command once gdb has started.

Sadly, when we try this, we are reminded that we switched to a release build to avoid some compilation/linking errors:

Reading symbols from asm/kernel8.elf...
(No debugging symbols found in asm/kernel8.elf)
Remote debugging using :1234
0x0000000000000000 in ?? ()
(gdb)

So, we need to go back to trying to make a debug build work. A few judicious edits to our scripts, and debug.sh can now build and link a debug version of our executable. When we do this, it shows the following errors:

aarch64-linux-gnu-ld -nostdlib boot.o ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib -T linker.ld -o kernel8.elf
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib(homer_rust-c9f0f32593953886.3qdmjppfu6iwhc43.rcgu.o): in function `homer_rust::lfb_init':
/home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:133: undefined reference to `memset'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:136: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:195: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib(homer_rust-c9f0f32593953886.3qdmjppfu6iwhc43.rcgu.o): in function `homer_rust::show_homer':
/home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:213: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:214: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:222: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib(homer_rust-c9f0f32593953886.3qdmjppfu6iwhc43.rcgu.o):/home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:222: more undefined references to `core::panicking::panic' follow

Boiled down, this comes to two things:

memset, which is defined in the C standard library, is undefined.
core::panicking::panic is undefined.

The latter is actually quite easy to fix given that we have already fixed the array bounds panic: we simply find the relevant (mangled) symbol and define it in boot.S.

.globl _ZN4core9panicking5panic17h8f06a2df29fa4962E
_ZN4core9panicking5panic17h8f06a2df29fa4962E:
b halt

memset is slightly trickier, since we actually need to implement it. But it's not that difficult an implementation, providing we do the right magic to make it seem to be a C function, not a Rust function:

#[no_mangle]
pub extern fn memset(mut buf: *mut u8, val: u8, cnt: usize) {
    let mut i=0;
    while i<cnt {
        unsafe {
            *buf = val;
            buf = buf.add(1);
        }
        i+=1;
    }
}
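One thing worth noting is that C's memset is declared as `void *memset(void *s, int c, size_t n)` and returns its destination pointer, whereas the version above returns nothing. A sketch that follows the C contract more closely might look like this. It is named memset_sketch here so that it can be compiled on a normal host without clashing with the real memset. In the kernel it would carry #[no_mangle] and the name memset. Using write_volatile is an assumption on my part, intended to discourage the optimizer from turning the loop back into a call to memset.

```rust
/// Fill `cnt` bytes at `dest` with the low byte of `val`, returning `dest`,
/// as C's memset does.
///
/// Safety: `dest` must be valid for writes of `cnt` bytes.
pub unsafe extern "C" fn memset_sketch(dest: *mut u8, val: i32, cnt: usize) -> *mut u8 {
    let byte = val as u8; // C passes an int but only the low byte is used
    let mut i = 0;
    while i < cnt {
        unsafe {
            core::ptr::write_volatile(dest.add(i), byte);
        }
        i += 1;
    }
    dest
}
```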

Now we can use gdb to set a breakpoint. Let's put one in at lfb_init and another at mbox_send:

Reading symbols from asm/kernel8.elf...
Remote debugging using :1234
0x0000000000000000 in ?? ()
(gdb) b lfb_init
Breakpoint 1 at 0x80360: file src/lib.rs, line 145.
(gdb) b mbox_send
Breakpoint 2 at 0x80e48: file src/lib.rs, line 249.
(gdb) c
Continuing.

Thread 1 hit Breakpoint 1, homer_rust::lfb_init (fb=0x7ffd8) at src/lib.rs:145
145            let mut buf: [u32;36] = [0; 36];
(gdb) n
148            buf[0] = 35 * 4; // the buffer has 35 4-byte words
(gdb) n
149            buf[1] = 0; // we indicate we are sending a MBOX_REQUEST as 0
(gdb) n
154            buf[2] = 0x48003;
(gdb) n
155            buf[3] = 8; // the number of bytes in the request value
(gdb) p/x buf[2]
$1 = 0x48003

So far, so good. We are going through the code and we seem to be correctly setting up the buffer. Let's carry on and see how mbox_send gets on.

(gdb) c
Continuing.
Thread 1 hit Breakpoint 2, homer_rust::mbox_send (ch=8, buf=0x7fedc) at src/lib.rs:249
249            while mmio_read(MBOX_STATUS) & MBOX_BUSY != 0 {
(gdb) n
253            let volbuf = buf as *const u32;
(gdb) n
256            let ptr:u32 = volbuf as u32;
(gdb) n
259            let addr = (ptr & !0x0F) | ((ch as u32) & 0x0f);
(gdb) p/x ptr
$2 = 0x7fedc

Wait a minute! This code is predicated on ptr being aligned to a 16-byte boundary, but it clearly isn't. Let's just go one line further and check what happens:

(gdb) n
262            mmio_write(MBOX_WRITE, addr);
(gdb) p/x addr
$3 = 0x7fed8
(gdb)

Yeah, that's not going to end up too well. The low four bits of that word carry the channel number, so the buffer address the mailbox sees is 0x7fed0: it's going to try and read and write a buffer starting 12 bytes before the one we've set up. What exactly happens will obviously depend on what it sees, but it's not going to be what we wanted (I guess we already knew that).
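The arithmetic from the debugger session can be reproduced in a few lines. This is a sketch with hypothetical function names, using the same masking expression as line 259 of the listing above.

```rust
// Combine a buffer address and a channel number into one mailbox word:
// the low four bits are replaced by the channel.
fn mailbox_word(buf_addr: u32, channel: u8) -> u32 {
    (buf_addr & !0x0F) | ((channel as u32) & 0x0F)
}

// The buffer address the other side recovers from that word.
fn buffer_seen_by_gpu(word: u32) -> u32 {
    word & !0x0F
}
```

With the unaligned address 0x7fedc and channel 8 this yields the word 0x7fed8, and the recovered buffer address 0x7fed0 is 12 bytes short of the real one. With a 16-byte-aligned buffer nothing is lost.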

OK, so the debugger helped us out there to see what the problem is, but why? And how do we fix it?

It would seem that the C compiler has different alignment rules to the Rust compiler. I'm not entirely sure about what either set are, and I'm too lazy to check, but I suspect that in C, arrays that have a size which is a multiple of 16 bytes are aligned to 16 byte boundaries (which is why the array is declared as 36 words when we only use 35). In Rust, this obviously does not happen.

We therefore need to manually force the alignment. I considered a number of ways of doing this. One is to allocate an array too big, and then to figure out an offset into the array which is aligned; another is to "do the job properly" and introduce memory allocation and have a method which returns a value with arbitrary alignment; and the third is to see what compatibility features Rust has. I decided that the last was probably the best compromise although, as you'll see, I fully accept I could be wrong.

Rust offers an attribute #[repr(align(16))] which says it aligns things to the appropriate boundary. This seems to be just what we need. However, applying it to the array gives an error:

error[E0517]: attribute should be applied to a struct, enum, function, associated function, or union
   --> src/lib.rs:145:12
    |
145 |     #[repr(align(16))]
    |            ^^^^^^^^^
146 |     let mut buf: [u32;36] = [0; 36];
    |     -------------------------------- not a struct, enum, function, associated function, or union

It doesn't come much clearer than that. You can only align some things, and arrays are not one of them. So let's package the array in a struct and align that.

#[repr(align(16))]
struct Message {
    pub buf: [u32; 36]
}
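A quick way to convince yourself that the attribute does what is needed is to ask the compiler for the alignment and to check an actual address. This is a sketch that can be run on a normal host. The two helper functions are hypothetical and not part of the kernel.

```rust
use core::mem::align_of;

#[repr(align(16))]
struct Message {
    pub buf: [u32; 36],
}

// The alignment the compiler guarantees for every Message value.
fn message_alignment() -> usize {
    align_of::<Message>()
}

// Check that a particular message's buffer starts on a 16-byte boundary.
fn is_16_byte_aligned(msg: &Message) -> bool {
    (msg.buf.as_ptr() as usize) % 16 == 0
}
```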

We can then create our buffer and create a Message from that, and then use that (aligned) buffer to send to the mailbox:

    let mut msg = Message { buf: buf };
    mbox_send(8, &mut msg.buf);

    let volbuf: *mut u32 = &mut msg.buf as *mut u32;

Sadly, compiling this throws up another linker error:

aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib(homer_rust-c9f0f32593953886.3qdmjppfu6iwhc43.rcgu.o): in function `homer_rust::lfb_init':
/home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:216: undefined reference to `memcpy'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:216: undefined reference to `memcpy'

But fortunately, we are starting to get good at this:

#[no_mangle]
pub extern "C" fn memcpy(dest: *mut u8, src: *const u8, cnt: usize) -> *mut u8 {
    let mut i = 0;
    while i < cnt {
        unsafe {
            // copy one byte at a time
            *dest.add(i) = *src.add(i);
        }
        i += 1;
    }
    dest // C's memcpy returns the destination pointer
}

And, now, it works! Great. Let's check that in before it stops working again. It's tagged as RUST_BARE_METAL_FIXED_EMULATOR

Conclusion

So, this is getting messier and messier, but at least it now works in the emulator somewhat reliably - with or without tracing turned on.

We learnt that alignment is different between C and Rust, and that we can (somewhat) hack that by using a struct with the repr attribute in Rust.

This also (coincidentally) seems to fix one of our problems on the physical box: Homer now appears. Sadly, he is once again the wrong color (blue this time). Looks like we still have work to do.

Friday, December 29, 2023

A Simple Bare Metal Program

Previously, I managed to install Rust and compile a simple, Linux-based, "hello world" program. Then I outlined the background for building ZiOS, an "alternative", experimental operating system to run on a Raspberry Pi. And I've done enough research to have a list of source material to draw on.

I have learnt that 32-bit and 64-bit ARM are radically different, down to having different instruction sets. And that means that all the 32-bit examples are not directly useful to me: I can extract information from them and use their resources but I cannot reuse any of the assembler code, and much of any C code may have references to addresses or layouts that are not compatible with the 64-bit model.

Furthermore, although I think it's probably a good idea in general, I don't want to fiddle with GPIO pins and serial UART ports - I want to use the framebuffer and HDMI. And I want to write my code in Rust, although I will happily port code written in C. So of the possible examples I have, it seems that Zoltan Baldaszti's repository is the best starting point. So, I'm going to copy the code from tutorial 9 and adapt it somewhat. This tutorial sets out to display an image of Homer Simpson on the display during boot.

Get it to Work

I have a sample, but I still need to actually reproduce the results myself. And before I can do anything else, I want to install the Pi Emulator.

Fortunately, there is a Linux package for that.

sudo apt install qemu-system-arm
[sudo] password for gareth:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  ibverbs-providers ipxe-qemu ipxe-qemu-256k-compat-efi-roms libcacard0 libdaxctl1 libfdt1 libgfapi0 libgfrpc0 libgfxdr0 libglusterfs0
  libibverbs1 libiscsi7 libndctl6 libpmem1 libpmemobj1 librados2 librbd1 librdmacm1 libslirp0 libspice-server1 libusbredirparser1
  libvirglrenderer1 qemu-block-extra qemu-efi-aarch64 qemu-efi-arm qemu-system-common qemu-system-data qemu-system-gui qemu-utils
Suggested packages:
  gstreamer1.0-libav gstreamer1.0-plugins-ugly samba vde2 debootstrap
The following NEW packages will be installed
  ibverbs-providers ipxe-qemu ipxe-qemu-256k-compat-efi-roms libcacard0 libdaxctl1 libfdt1 libgfapi0 libgfrpc0 libgfxdr0 libglusterfs0
  libibverbs1 libiscsi7 libndctl6 libpmem1 libpmemobj1 librados2 librbd1 librdmacm1 libslirp0 libspice-server1 libusbredirparser1
  libvirglrenderer1 qemu-block-extra qemu-efi-aarch64 qemu-efi-arm qemu-system-arm qemu-system-common qemu-system-data qemu-system-gui
  qemu-utils
0 to upgrade, 30 to newly install, 0 to remove and 31 not to upgrade.

We already have a suitable C compiler from our previous cross-compilation of Rust. One of my modifications is to update the Makefile to use that rather than aarch64-elf-gcc. After this, make will build our application and make run will run it in the emulator. So far, so easy.

At some point, I need to write a full post on the boot sequence for the Pi, but for now, the important thing to know is that the Pi looks for an SD Card with a "boot" partition which needs to be formatted for FAT32. On that card, there need to be some "standard" firmware files and a kernel7.img or kernel8.img file. All of this happens automatically when you create a standard Pi SD card, so I'm just reusing one of these. The difference between kernel7.img and kernel8.img is the architecture that the system will boot into: if it finds kernel8.img, it will use ARMv8 (64-bit) whereas if it finds kernel7.img, it will use ARMv7 (which is the 32-bit version). All very clever.

So to deploy this program on my actual Pi box, it is simply necessary to copy the kernel8.img file produced by make into the "boot" partition of an SD card which was previously formatted to run the standard OS. In spite of the name, this file doesn't need to be an OS kernel as such, just a program that can run without an existing operating system. Obviously, it is possible to start from scratch with a completely vanilla SD card and install all the necessary files: I'm sure that at some point I will do this.

When this card is put into the Pi and it is booted, a picture of Homer appears on the screen.

Excellent.

Introducing Rust

I'm now going to go back and start again with a completely new Rust project based on the Rust Bare Bones Tutorial which I have to admit I don't understand. Since I don't like just following instructions I don't understand, I'm going to try and skip most of the steps and see what does and doesn't work. If necessary, I'm going to come back around and fill in the blanks.

The objective of this tutorial is just to get something to build using Rust that will run on the emulator and on the Pi.

So, first off, I'm going to do the obvious and use cargo to create my new project:

$ cargo new homer_rust
Created binary (application) `homer_rust` package

I'm then going to follow the instruction to handle panic as abort in Cargo.toml:

[profile.dev]
panic = "abort"

And then I'm going to put this in src/lib.rs:

#![no_std]

#[no_mangle]
pub extern fn kernel_main() {}

This implies to me that we don't want the src/main.rs which was automatically generated by cargo, so I'm going to delete it.

And then I'm going to build it using a cross-compiling cargo target:

$ cargo build --target=aarch64-unknown-linux-gnu
Compiling homer_rust v0.1.0 (/home/gareth/Projects/homer_rust)
Finished dev [unoptimized + debuginfo] target(s) in 0.03s

We now have a target directory and, inside this, we have a subdirectory for our cross target aarch64-unknown-linux-gnu, and then, within that, a debug directory. In here are a number of files which I don't quite understand, but some judicious investigation suggests that the library has been built into libhomer_rust.rlib, where an rlib is essentially an ar archive with compiled Rust files in it (ar is the archiver used by C compilers, assemblers such as as, and other systems programming tools to build static libraries).

This library can now be linked together with a boot.S assembler script using an appropriate linker script to build an entire kernel image.

The instructions I'm following assume that you are familiar with the C version of the tutorial so I will go there to pull in the relevant files.

For now, just follow along. When we move on from just trying to copy things and get them to work, I'll come back and either write updated versions of these files or keep the same versions and explain how and why they work.

We need a linker.ld script which is appropriate for the Aarch64 architecture:

ENTRY(_start)

SECTIONS
{
    /* Starts at LOADER_ADDR. */
    . = 0x80000;
    __start = .;
    __text_start = .;
    .text :
    {
        KEEP(*(.text.boot))
        *(.text)
    }
    . = ALIGN(4096); /* align to page size */
    __text_end = .;

    __rodata_start = .;
    .rodata :
    {
        *(.rodata)
    }
    . = ALIGN(4096); /* align to page size */
    __rodata_end = .;

    __data_start = .;
    .data :
    {
        *(.data)
    }
    . = ALIGN(4096); /* align to page size */
    __data_end = .;

    __bss_start = .;
    .bss :
    {
        bss = .;
        *(.bss)
    }
    . = ALIGN(4096); /* align to page size */
    __bss_end = .;
    __bss_size = __bss_end - __bss_start;
    __end = .;
}

And we also need the appropriate boot.S taken from the same source.

// To keep this in the first portion of the binary.
.section ".text.boot"

// Make _start global.
.globl _start

    .org 0x80000
// Entry point for the kernel. Registers:
// x0 -> 32 bit pointer to DTB in memory (primary core only) / 0 (secondary cores)
// x1 -> 0
// x2 -> 0
// x3 -> 0
// x4 -> 32 bit kernel entry point, _start location
_start:
    // set stack before our code
    ldr     x5, =_start
    mov     sp, x5

    // clear bss
    ldr     x5, =__bss_start
    ldr     w6, =__bss_size
1:  cbz     w6, 2f
    str     xzr, [x5], #8
    sub     w6, w6, #1
    cbnz    w6, 1b

    // jump to C code, should not return
2:  bl      kernel_main
    // for failsafe, halt this core
halt:
    wfe
    b halt

We now need to assemble this, and then link everything together:

aarch64-linux-gnu-gcc -Wall -O2 -ffreestanding -fno-stack-protector -nostdinc -nostdlib -nostartfiles -c boot.S -o boot.o
aarch64-linux-gnu-ld -nostdlib -T linker.ld boot.o target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib -o kernel8.elf

This produces a "standard binary", that is, one which would load and run under Linux. But we are trying to execute something on bare metal, so we cannot expect to have a relocating loader to hand. So we need to extract all the code and build an image file that can be directly mapped into memory.

aarch64-linux-gnu-objcopy -O binary kernel8.elf kernel8.img

So this is the kernel image we will want to load. And then we can run this in the emulator using the same command as before:

qemu-system-aarch64 -M raspi3b -kernel kernel8.img -serial stdio

The emulator starts up and doesn't complain, so I'm going to consider that a moral victory.

Adding some actual code

This "kernel" is the absolutely minimal bare metal Rust program - it does nothing. It's probably possible to connect to the emulator in some way to see that the boot.S code at least is running. But I'm going to assume that everything is OK and press on to writing some more serious code.

At this point, the Rust bare bones tutorial uses the UART port (which is mapped to the console in the emulator) to write the string "Hello, Rust kernel world". We can copy this code into our example and see if we can see this in the emulator. Since I don't have the UART actually connected to my physical hardware, this won't work on the Pi box itself.

There is a lot of complex code to actually drive the UART, which I'm not interested in, but ends up being wrapped up in a function write. So the kernel_main function becomes:

#[no_mangle]
pub extern fn kernel_main() {
    write("Hello Rust Kernel world!");
    loop {
        writec(getc())
    }
}

With this in place, we can re-link and re-run the code:

aarch64-linux-gnu-ld -nostdlib -T linker.ld boot.o target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib -o kernel8.elf
aarch64-linux-gnu-objcopy -O binary kernel8.elf kernel8.img
qemu-system-aarch64 -M raspi3b -kernel kernel8.img -serial stdio

Sadly, it doesn't work.

It's not entirely clear why not. This is why we have debuggers.

It is possible to start the emulator QEMU with a couple of flags that say "don't start running the executable until the debugger says so" (-S) and "open a port to allow a debugger to connect" (-s). The difference between these two flags is the case that they are in.

qemu-system-aarch64 -M raspi3b -S -s -kernel kernel8.img -serial stdio

But what debugger do we have? There is a multi-architecture (i.e. cross) debugger available on Linux:

$ sudo apt install gdb-multiarch
[sudo] password for gareth:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed
  gdb-multiarch
0 to upgrade, 1 to newly install, 0 to remove and 31 not to upgrade.
Need to get 4,589 kB of archives.
After this operation, 18.2 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 gdb-multiarch amd64 12.1-0ubuntu1~22.04 [4,589 kB]
Fetched 4,589 kB in 1s (3,238 kB/s)
Selecting previously unselected package gdb-multiarch.
(Reading database ... 620618 files and directories currently installed.)
Preparing to unpack .../gdb-multiarch_12.1-0ubuntu1~22.04_amd64.deb ...
Unpacking gdb-multiarch (12.1-0ubuntu1~22.04) ...
Setting up gdb-multiarch (12.1-0ubuntu1~22.04) ...

We obviously need to open a new terminal for this (the emulator is running in a terminal and is waiting for us to connect) and then run the debugger giving it information about the symbols in the program:

$ gdb-multiarch -e kernel8.elf
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".

Once the debugger starts, we tell it that the running process has opened the port 1234 (the default for -s) and then say that we want the correct layout for assembler, since we want to start by stepping through the boot.S file.

(gdb) target remote :1234
Remote debugging using :1234
0x0000000000000000 in ?? ()
(gdb) layout asm

When we do this, we can see that the processor starts executing at address 0x0 and there is some "default" code there which executes and then transfers control to 0x80000. When we get there, we find that the code there is not the code from boot.S that we are expecting.

What can we do to find our code? Well, objdump is a low-level introspector (compare to javap if you have a Java background). It has a whole bunch of features, including the (-d) option to "disassemble" the code in a file, so let's see what we've got:

$ aarch64-linux-gnu-objdump -d kernel8.elf
kernel8.elf:     file format elf64-littleaarch64

Disassembly of section .text:

0000000000080000 <_start>:
        ...
  100000:        58000185         ldr        x5, 100030 <halt+0xc>
  100004:        910000bf         mov        sp, x5
  100008:        58000185         ldr        x5, 100038 <halt+0x14>
  10000c:        18000106         ldr        w6, 10002c <halt+0x8>
  100010:        34000086         cbz        w6, 100020 <_start+0x80020>
  100014:        f80084bf         str        xzr, [x5], #8
  100018:        510004c6         sub        w6, w6, #0x1
  10001c:        35ffffa6         cbnz        w6, 100010 <_start+0x80010>
  100020:        94000067         bl        1001bc <kernel_main>

OK, well that at least explains what is going on: _start has been correctly bound to 0x80000. But then the first instruction doesn't actually come along until 0x100000. Why not? I have to confess that it took me a while to figure this out, including fiddling with different things, until I realized that 0x100000 is exactly double 0x80000. What's happened is that we have applied two offsets of 0x80000 before the first instruction. Why? I'm not entirely sure, but I suspect that the instructions and code we have used have not all been entirely consistent. Looking at the linker file, we have this:

    . = 0x80000;
    __start = .;
    __text_start = .;
    .text :
    {
        KEEP(*(.text.boot))
        *(.text)
    }

while in the assembler file boot.S, we have:

.globl _start

.org 0x80000
_start:

And I think both of these are being applied, rather than being seen as saying the same thing. So the obvious thing is to comment out the .org in the boot.S file.

.globl _start

// .org 0x80000
_start:
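The suspicion that both offsets are being applied is easy to check arithmetically. As a minimal sketch (the constant and function names are mine, purely for illustration), if the linker places the section at 0x80000 and the assembler's .org then adds a further 0x80000 of padding inside that section, the first instruction lands at 0x100000, which is the address objdump reported:

```rust
// Address the linker script assigns to the start of .text (". = 0x80000;").
pub const LINKER_BASE: u64 = 0x80000;
// Offset requested by ".org 0x80000" in boot.S, applied within the section.
pub const ORG_OFFSET: u64 = 0x80000;

// Where the first instruction ends up when both settings are applied.
pub fn first_instruction_address(base: u64, org: u64) -> u64 {
    base + org
}
```

With the .org removed, the offset within the section is zero and the first instruction sits at the linker's base address, 0x80000.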

And with that fix, everything works!

$ qemu-system-aarch64 -M raspi3b -kernel kernel8.img -serial stdio
Hello Rust Kernel world!

Very good. Let's check that in for posterity, and tag it RUST_BARE_METAL_MINIMAL.

Making that reproducible

I'm not happy with either the file layout I've ended up with, or with the number of steps that it takes to build it. As far as I can tell, there is currently no standard way to perform "post-build" steps in a Rust project, such as the custom linking we are doing here. So I'm going to use a shell script to coordinate the cargo build of the Rust project with make doing the final assembly. To facilitate that, I'm going to copy the Makefile from my previous project and use that to assemble boot.S, do the linking and generate the final image. It also provides the run target to run the resultant image in the emulator.

The Makefile ends up looking like this:

# Allow run to connect to GDB by specifying GDB flags
# for example, -s -S
GDB =
DEBUG_WITH_GDB = -s -S
LIB = ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib

all: kernel8.img

boot.o: boot.S
        aarch64-linux-gnu-gcc -Wall -O2 -ffreestanding -fno-stack-protector -nostdinc -nostdlib -nostartfiles -c boot.S -o boot.o

kernel8.img: boot.o $(LIB)
        aarch64-linux-gnu-ld -nostdlib boot.o $(LIB) -T linker.ld -o kernel8.elf
        aarch64-linux-gnu-objcopy -O binary kernel8.elf kernel8.img

clean:
        rm -f kernel8.elf kernel8.img *.o

run: kernel8.img
        qemu-system-aarch64 $(GDB) -M raspi3b -kernel kernel8.img -serial stdio

And then the build script which first does the Rust compilation, then calls make, looks like this:

cargo build --target aarch64-unknown-linux-gnu
cd asm
make

At the same time, I'm going to move all the non-Rust code down into a new directory called asm, which is why the build script changes into that directory before calling make.

This is now available in RUST_BARE_METAL_MAKEFILE

Porting Homer from C to Rust

At this point, I have derived some code from two separate tutorials and have them both working. One is written in C and does what I want: displays a picture of Homer on the monitor. The other uses Rust but does none of this.

So what I want to do now is to port the C code to Rust and ultimately have a program which is written in Rust, runs on actual Pi hardware and displays Homer's picture. I'm actually not sure how hard that is going to be to do.

The various parts of this code vary from the trivial to the hairy, so, in spite of the fact that nothing really works until we have some of the hairy code done, I am going to start with the trivial. Homer's image in the tutorial is encoded in something called The Gimp Header File Format which, it turns out, is just a way of encoding 24-bit RGB values as four printable characters.

So I'm going to start by copying across the actual data into Rust, and then porting the decoding function and then write the resulting image to the UART in hex.

In order to do this, I am going to first write a function (or two) to convert 24-bit RGB values (in a u32) to a "string". But because we are working in a very constrained environment, it's hard to create Strings or Vecs, so I'm using a static buffer of u8s. Sadly, because of the limitations of Rust, this appears to be unsafe. Also, in order not to have panics when doing arithmetic, I have also switched to using a release build, rather than a debug build.

I'm slowly starting to realise why the various steps that I skipped in the "Bare Bones Rust tutorial" were there. In order to be able to use the standard library, it's necessary to have a build of Rust you can modify at runtime, so that you can use your own calls to replace the underlying calls (such as memory management) that are normally found in the operating system. For now, I'm going to continue to resist the temptation, but one day it will become inevitable.

static mut BUF:[u8;6] = [0;6];

pub fn hex(n : u32) -> &'static[u8;6] {
    unsafe {
        BUF[0] = digit((n >> 20) & 0xf);
        BUF[1] = digit((n >> 16) & 0xf);
        BUF[2] = digit((n >> 12) & 0xf);
        BUF[3] = digit((n >> 8) & 0xf);
        BUF[4] = digit((n >> 4) & 0xf);
        BUF[5] = digit(n & 0xf);
        &BUF
    }
}

fn digit(n: u32) -> u8 {
    if n < 10 {
        ((48 + n) & 0xff) as u8
    } else {
        (55 + n) as u8
    }
}

And we can test that this works in the lib.rs file:

pub extern fn kernel_main() {
    write_chars(hex(0x9a3cb0));
    ...
}
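For comparison, here is a minimal sketch of a variant that avoids the static mutable buffer by returning the six characters by value. It is an alternative, not the code in the repository; hex_by_value and hex_digit are names made up for this example. A fixed-size array of u8 lives on the stack, so this needs no allocator and no unsafe block.

```rust
// Convert a value in 0..=15 to its upper-case ASCII hex digit.
pub fn hex_digit(n: u32) -> u8 {
    if n < 10 {
        b'0' + n as u8
    } else {
        b'A' + (n - 10) as u8
    }
}

// Format the low 24 bits of `n` as six hex digits, most significant first.
pub fn hex_by_value(n: u32) -> [u8; 6] {
    let mut out = [0u8; 6];
    for (i, slot) in out.iter_mut().enumerate() {
        // shifts run 20, 16, 12, 8, 4, 0
        let shift = 20 - 4 * i as u32;
        *slot = hex_digit((n >> shift) & 0xF);
    }
    out
}
```

The trade-off is that the caller has to keep the returned array alive for as long as it is being written, for example by binding it to a local variable before passing a reference to the output routine.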

With that in place for debugging, we can go back and convert Homer into RGB format by converting each group of four characters into 24-bit RGB pixel values.

pub fn read_homer(homer: &str) -> &'static[u8;HOMER_BYTES] {
    unsafe {
        let mut pos : usize = 0;
        while pos < homer.len() {
            let c0 = homer.as_bytes()[pos];
            let c1 = homer.as_bytes()[pos+1];
            let c2 = homer.as_bytes()[pos+2];
            let c3 = homer.as_bytes()[pos+3];

            IMAGE[pos] = 0;
            IMAGE[pos+1] = ((c0-33) << 2) | ((c1-33) >> 4);
            IMAGE[pos+2] = ((c1-33) << 4) | ((c2-33) >> 2);
            IMAGE[pos+3] = ((c2-33) << 6) | ((c3-33));
            pos+=4;
        }
        &IMAGE
    }
}

This code is tagged as RUST_BARE_METAL_CONVERT_IMAGE.
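To make the decoding step easier to check in isolation, here is a small sketch that decodes a single group of four characters into its red, green and blue bytes. It follows the same arithmetic as read_homer, but decode_pixel is a hypothetical helper rather than part of the tagged code, and it masks explicitly where the original relies on u8 shifts discarding the high bits.

```rust
// Decode one four-character group from a GIMP header image into (r, g, b).
// Each character carries 6 bits of data, offset by 33 to keep it printable,
// so four characters hold the 24 bits of one pixel.
pub fn decode_pixel(group: [u8; 4]) -> (u8, u8, u8) {
    // wrapping_sub avoids a panic on malformed input in a debug build
    let d0 = group[0].wrapping_sub(33);
    let d1 = group[1].wrapping_sub(33);
    let d2 = group[2].wrapping_sub(33);
    let d3 = group[3].wrapping_sub(33);

    let r = (d0 << 2) | (d1 >> 4);          // 6 bits of d0 + top 2 bits of d1
    let g = ((d1 & 0x0F) << 4) | (d2 >> 2); // low 4 bits of d1 + top 4 of d2
    let b = ((d2 & 0x03) << 6) | d3;        // low 2 bits of d2 + 6 bits of d3
    (r, g, b)
}
```

For example, the group "!!!!" (four characters with value 33) decodes to black, and four backticks (value 96, i.e. 63 after subtracting 33) decode to white.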

Framebuffer Initialization

Before we can draw Homer on the screen, we need access to a "framebuffer". If you've not come across the word before, a framebuffer is a low-level block of memory that is mapped onto the screen by the GPU. So we need to ask the GPU to create a framebuffer for us. This requires us to use a "mailbox" to talk to the GPU.

A mailbox is a way in which the CPU can communicate with the hardware using a combination of Memory-Mapped I/O and DMA. That is, a program can create a request in a buffer in memory, and then pass the address of that buffer to a specific memory-mapped address (the "mailbox") which is then read by the GPU, which then reads and processes the request from the buffer at that address. A further memory-mapped status address is then used to determine when the operation has completed and the response has been written to the same buffer.

It seems to me that there is not a single source you can go to in order to find out all the information you need in order to do this, but rather there are a number of different reference documents which you need to consult to put the entire picture together.

Probably the best documentation for this I have come across so far is the Raspberry Pi firmware wiki which has a page describing the various mailboxes. In fact, there are only two actual "mailboxes", one to send messages from the ARM to the GPU, and one for the GPU to send messages to the ARM. But the mailbox has within it a set of "channels", where each channel is essentially a different mailbox.

The other key source of information is other people's code, which has the advantage of having been tested. Unfortunately, it is often the case that this code worked in a different environment or on a different box to the one that we are using. In this case, we have some C code which does exactly what we want and that we know works on our box in 64-bit mode. So, first, let's look at that code. For now, we are just going to try and initialize the frame buffer.

The first thing is that there is a declaration of the message array. This is declared to be 36 4-byte words (144 bytes). This magic number is the size of the request, as we will see in a moment.

extern volatile unsigned int mbox[36];

Among the challenges of working with devices on the Raspberry Pi is that everything is different everywhere depending on the specific device you have and how you are using it. I will try to be repeatedly clear that I am using a 3B+ in 64-bit (aarch64) mode. In this case, it would seem that the base address for all the memory-mapped devices is 0x3F000000. I don't have a definitive reference for this, but I have seen it used a number of times (including in this project) and it is the accepted answer in this question on the forum.

This is defined in gpio.h:

#define MMIO_BASE 0x3F000000

Separately, from somewhere has come the knowledge that the mailbox for the Videocore GPU is to be found at offset 0xB880 in this range - that is at 0x3f00b880. This is defined in mbox.c:

#define VIDEOCORE_MBOX (MMIO_BASE+0x0000B880)

As yet, I have been unable to find any original source for this, just examples of people using it.
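In Rust, the same two definitions can be written as constants. This is only a sketch of how they might be carried across; the names simply mirror the C macros, and the values are the ones quoted above for a Pi 3.

```rust
// Base of the memory-mapped peripherals (MMIO_BASE in gpio.h).
pub const MMIO_BASE: u32 = 0x3F00_0000;

// Start of the VideoCore mailbox registers (VIDEOCORE_MBOX in mbox.c).
pub const VIDEOCORE_MBOX: u32 = MMIO_BASE + 0x0000_B880;
```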

The next step is to build a request in the mbox message defined earlier. This consists of four set operations and two get operations. These requests I can track back at least as far as the Wiki, but I'm not sure where they get their information from.

The first two words (remember, each of the entries in the mbox array is a 4-byte word) are the size of the message in bytes and a signal that this is a request (which is a word consisting of all zeros). The message occupies words 0 to 34 of the array, which is 35 words including the length word itself, so the last of the 36 entries is never used. The same buffer is used for the request AND the reply, so the second word distinguishes between the two. Responses all have the top bit set (i.e. 0x8xxxxxxx).

mbox[0] = 35*4;
mbox[1] = MBOX_REQUEST;
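Since the request and the response share the second word, telling them apart comes down to testing the top bit. A minimal sketch, assuming the codes described on the firmware wiki (0 for a request, 0x80000000 for a successful response, 0x80000001 for an error); RESPONSE_BIT and is_response are names invented for this example:

```rust
// Value placed in word 1 when sending a request.
pub const MBOX_REQUEST: u32 = 0;

// Hypothetical name for the top bit, which the GPU sets in its reply.
pub const RESPONSE_BIT: u32 = 0x8000_0000;

// True once the GPU has overwritten the request code with a response code.
pub fn is_response(code: u32) -> bool {
    code & RESPONSE_BIT != 0
}
```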

The first request is to set the physical width and height of the framebuffer. "Physical" refers to how big we think the actual screen is in pixels.

0x48003 is the magic code which says that this is the operation we want to perform. The next word contains the space allocated for the value in bytes. We are going to be sending two words (a width and a height), so that is 8. As I read it, the following value is for flags sent with the request, which must have its top bit (b31) clear, and all the rest of the fields are reserved; I would use 0 myself (and will do later unless it doesn't work), but this code has repeated 8. When the request is overwritten by the response, this field will hold the length of the response value, again in bytes.

    mbox[2] = 0x48003;  //set phy wh
    mbox[3] = 8;
    mbox[4] = 8;
    mbox[5] = 1024;         //FrameBufferInfo.width
    mbox[6] = 768;          //FrameBufferInfo.height

To set the virtual width and height, we use the code 0x48004. This dictates how much memory is to be allocated for the display. In this case, the code has simply requested the same size as the physical screen, but it is common to request double the height to allow for double buffering of the display.

    mbox[7] = 0x48004;  //set virt wh
    mbox[8] = 8;
    mbox[9] = 8;
    mbox[10] = 1024;        //FrameBufferInfo.virtual_width
    mbox[11] = 768;         //FrameBufferInfo.virtual_height

Given that there are both physical and virtual framebuffers, it is necessary to specify where in the virtual framebuffer the origin of the physical framebuffer is to be found. Tag 0x48009 performs this operation. Obviously, since our physical and virtual framebuffers are the same size, the offset must be (0,0).

    mbox[12] = 0x48009; //set virt offset
    mbox[13] = 8;
    mbox[14] = 8;
    mbox[15] = 0;           //FrameBufferInfo.x_offset
    mbox[16] = 0;           //FrameBufferInfo.y_offset

0x48005 sets the depth of the framebuffer. This is the number of bits that are used to represent each pixel. This is set as 32. I'm not quite sure what that actually means - is it 24 rounded up to 32 or is it RGBA? Note that we only have 4 bytes of data with this tag.

    mbox[17] = 0x48005; //set depth
    mbox[18] = 4;
    mbox[19] = 4;
    mbox[20] = 32;          //FrameBufferInfo.depth

The Videocore apparently allows both RGB and BGR orderings on the pixels, so we have to request the one that we want (if available). The 0x48006 request chooses a particular order (0 = BGR, 1 = RGB).

    mbox[21] = 0x48006; //set pixel order
    mbox[22] = 4;
    mbox[23] = 4;
    mbox[24] = 1;           //RGB, not BGR preferably

Now we need a memory address for the framebuffer. The allocate buffer request allocates a framebuffer and responds with the address. This code does not seem to exactly match the specification, which says the request value size should be 4. On the other hand, the response value size is expected to be 8, so it makes sense if the response is going to be written in the same place that we need 8 bytes here.

The value that is passed here (4096) is the alignment, written as a modulus rather than a bit count or a mask. I'm assuming that's right. So whatever address we get back should be aligned to 4096, i.e. the bottom 12 bits will all be 0s (4096 is 2 to the power 12).

    mbox[25] = 0x40001; //get framebuffer, gets alignment on request
    mbox[26] = 8;
    mbox[27] = 8;
    mbox[28] = 4096;        //FrameBufferInfo.pointer
    mbox[29] = 0;           //FrameBufferInfo.size
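The relationship between an alignment modulus and the number of zero bits can be checked with a couple of one-liners. This is a sketch for illustration only (is_aligned and zero_bits are hypothetical names), and it assumes the modulus is a power of two:

```rust
// True if `addr` is a multiple of `modulus`, which must be a power of two.
pub fn is_aligned(addr: u32, modulus: u32) -> bool {
    addr & (modulus - 1) == 0
}

// Number of low address bits that the alignment forces to zero.
pub fn zero_bits(modulus: u32) -> u32 {
    modulus.trailing_zeros()
}
```

For a modulus of 4096 the mask is 0xFFF, so an aligned address has its bottom 12 bits clear and only the top 20 bits of a 32-bit address carry information.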

The description of the Get Pitch tag says that it is the number of bytes per line. I would expect this to be four times the width in pixels (so 4096), but hardware being hardware, I can understand why it is good to check.

    mbox[30] = 0x40008; //get pitch
    mbox[31] = 4;
    mbox[32] = 4;
    mbox[33] = 0;           //FrameBufferInfo.pitch
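The reason the pitch matters is that it, rather than the width, is what turns a pixel coordinate into a byte offset. A minimal sketch, assuming 4 bytes per pixel as requested by the depth tag; pixel_offset is a hypothetical helper, not something from the tutorial:

```rust
// Byte offset of pixel (x, y) from the start of the framebuffer,
// assuming 4 bytes per pixel (a depth of 32 bits).
pub fn pixel_offset(x: u32, y: u32, pitch: u32) -> u32 {
    y * pitch + x * 4
}
```

If the GPU reports the expected pitch of 4096 for a 1024-pixel line, the last pixel of a 1024 x 768 display sits 4 bytes before the end of a 3,145,728-byte buffer.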

MBOX_TAG_LAST (which has the value 0) says that there are no more tags in this request. Which is good, because we have run out of bytes in which to place them.

mbox[34] = MBOX_TAG_LAST;

OK. So now we know what we want the code to do, how do we write that in Rust?

Well, it's a little complicated, I think because we don't have proper memory allocation. But a mutable buffer seems able to handle the construction of the message.

    let mut buf: [u32;36] = [0; 36];

    // The header of the message has a length and a status (0 = REQUEST; 0x8000xxxx = RESPONSE)
    buf[0] = 35 * 4; // the buffer has 35 4-byte words
    buf[1] = 0; // we indicate we are sending a MBOX_REQUEST as 0

    // Now each of the tags

    // First, set the physical size of the framebuffer to 1024 x 768
    buf[2] = 0x48003;
    buf[3] = 8; // the number of bytes in the request value
    buf[4] = 0; // reserved in request - will be used for the length of the response value
    buf[5] = 1024; // the requested width
    buf[6] = 768; // the requested height

    // Now, set the virtual size of the framebuffer
    // This must be at least as big as the physical framebuffer, but can be bigger, e.g. to support double buffering
    // or to support scrolling
    buf[7] = 0x48004;
    buf[8] = 8;
    buf[9] = 0;
    buf[10] = 1024; // the requested virtual width
    buf[11] = 768; // the requested virtual height

    // Now specify where the physical framebuffer is in the virtual framebuffer
    buf[12] = 0x48009;
    buf[13] = 8;
    buf[14] = 0;
    buf[15] = 0;
    buf[16] = 0;

    // Now set the depth of the framebuffer in bits.  This is 32, presumably to get 24 bits aligned nicely
    buf[17] = 0x48005;
    buf[18] = 4;
    buf[19] = 0;
    buf[20] = 32; // allocate 32 bits per pixel

    // choose RGB over BGR
    buf[21] = 0x48006;
    buf[22] = 4;
    buf[23] = 0;
    buf[24] = 1; // 1 = RGB, 0 = BGR

    // Now we are ready to allocate the buffer.
    buf[25] = 0x40001;
    buf[26] = 8;  // we are only sending 4 bytes, but we want 8 bytes back
    buf[27] = 0;
    buf[28] = 4096; // the modulus of the alignment we want (i.e. only the top 20 bits are significant)
    buf[29] = 0;

    // And we ask for the pitch (the number of bytes per line)
    buf[30] = 0x40008;
    buf[31] = 4;
    buf[32] = 0;
    buf[33] = 0; // this is just a placeholder for the value to come back

    // We have no more tags
    buf[34] = 0;
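Every tag above follows the same pattern: an identifier, the size of the value buffer in bytes, a zero request code, and then the values. As a sketch of how that repetition could be factored out (push_tag and build_fb_request are hypothetical names, and this is not how the tagged code is written), a helper can append one tag and return the next free index:

```rust
// Append one tag to the message and return the next free index.
fn push_tag(buf: &mut [u32; 36], pos: usize, tag: u32, values: &[u32]) -> usize {
    buf[pos] = tag;
    buf[pos + 1] = (values.len() * 4) as u32; // size of the value buffer in bytes
    buf[pos + 2] = 0; // request code; the GPU overwrites it with the response length
    buf[pos + 3..pos + 3 + values.len()].copy_from_slice(values);
    pos + 3 + values.len()
}

// Build the same framebuffer request as above; returns the buffer and
// the number of words actually used.
pub fn build_fb_request() -> ([u32; 36], usize) {
    let mut buf = [0u32; 36];
    let mut pos = 2; // words 0 and 1 are the header, filled in at the end
    pos = push_tag(&mut buf, pos, 0x48003, &[1024, 768]); // physical size
    pos = push_tag(&mut buf, pos, 0x48004, &[1024, 768]); // virtual size
    pos = push_tag(&mut buf, pos, 0x48009, &[0, 0]);      // virtual offset
    pos = push_tag(&mut buf, pos, 0x48005, &[32]);        // depth in bits
    pos = push_tag(&mut buf, pos, 0x48006, &[1]);         // 1 = RGB
    pos = push_tag(&mut buf, pos, 0x40001, &[4096, 0]);   // allocate, 8 bytes back
    pos = push_tag(&mut buf, pos, 0x40008, &[0]);         // get pitch
    buf[pos] = 0; // end tag
    pos += 1;
    buf[0] = (pos * 4) as u32; // total message size in bytes
    buf[1] = 0;                // request
    (buf, pos)
}
```

Computing the length from the final position gives 35 words (140 bytes), which matches the hand-written 35 * 4 and makes it harder to get the count wrong when a tag is added or removed.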

So now we need to actually send that whole message to the GPU for processing.

Sadly, before we can do that, it seems that we need to add a spin loop to avoid a SEGV in the emulator qemu (this saddens me, but such is life). This doesn't do anything, it just wastes time. The mmio_read is just there to stop the compiler from optimizing the entire loop away.

    // avoid a SEGV in the emulator
    let mut y = 0;
    while y < 1000000 {
        y = y + 1;
        mmio_read(MBOX_STATUS); // do something to waste time
    }
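
For comparison, here is a sketch of another way to keep a delay loop alive, which I have not tried in the kernel. It assumes a toolchain recent enough to have black_box (it lives in core::hint for a no_std build; std::hint is used here so it runs on a desktop). The function name spin_delay is made up for the example, and black_box is documented as best effort only, so this is not a guarantee.

```rust
use std::hint::black_box;

/// Busy-waits for `iterations` turns of the loop and returns the final count.
/// `black_box` asks the compiler to treat the counter as opaque, which makes
/// it much less likely that the whole loop is optimized away.
fn spin_delay(iterations: u32) -> u32 {
    let mut count = 0u32;
    while count < iterations {
        count = black_box(count + 1);
    }
    count
}
```

In this sketch the call matching the loop above would be spin_delay(1000000).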

So now we can write to the mbox. We call a function mbox_send to make it clear where the separation is between creating the request and dealing with the hardware.

mbox_send(8, &mut buf);

This, of course, is where the magic happens. The "magic number" 8 here is the "channel" number that we want to send the message on. Again, this is poorly documented; I obtained it mainly from examples, but I would cite the Raspberry Pi wiki as a "definitive" source.

This method starts by polling the MBOX_STATUS port until it stops being "busy".

fn mbox_send(ch: u8, buf: &mut[u32; 36]) {
    // spin until the mailbox is no longer busy
    while mmio_read(MBOX_STATUS) & MBOX_BUSY != 0 {
    }

The message is passed in as a strongly typed array, but right now we are about to descend into the dark world of systems programming, and deal with "unsafe" raw pointers and integers, so that we can manipulate them. Note that this code does not need to be wrapped in an unsafe block, because converting pointers to integers is not inherently unsafe: what is unsafe is taking a "random" integer, treating it as a pointer and then dereferencing it.

    // obtain the address of buf as a raw pointer
    let volbuf = buf as *const u32;

    // then convert that to just a plain integer
    let ptr:u32 = volbuf as u32;
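
To make the safe/unsafe boundary concrete, here is a small sketch that runs on a desktop machine. The function names are invented for the example, and it uses usize instead of u32 because a 64-bit host address will not fit in 32 bits.

```rust
/// Turns the buffer's address into a plain integer. No `unsafe` is needed.
fn address_of(buf: &[u32; 36]) -> usize {
    buf.as_ptr() as usize
}

/// Going the other way is where `unsafe` appears: the cast back to a pointer
/// is still safe, but reading through it is not.
fn read_word_at(addr: usize, index: usize) -> u32 {
    let ptr = addr as *const u32; // creating the pointer is safe
    // SAFETY: the caller must pass an address taken from a live `[u32; 36]`
    // and an index below 36; nothing here can check that.
    unsafe { ptr.add(index).read() }
}
```

The compiler accepts any integer in read_word_at, which is exactly why the dereference has to be marked unsafe: only the caller knows whether the address is real.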

Please pay attention! I am going to describe this next bit as "the most important thing to note here" for two reasons: first, I didn't do this right when I first wrote this code and spent about an hour trying to figure out why nothing happened; and, secondly, it is so counterintuitive to me.

The mailbox interface requires the buffer to be aligned on a 16-byte boundary, so the bottom 4 bits of its address are all zeros (being a "word" pointer would, on its own, only guarantee the bottom 2). The mailbox interface wants to just read one word for ease, so it re-purposes those four bits to send the channel number. If you don't merge the channel number in here, it will be sent to channel 0, which (at least in this case) won't do what you expect. I don't actually know what it will do, but it definitely won't initialize the framebuffer.

    // what we pass to the mailbox is the address in the top 28 bits and the channel number (ch) in the bottom 4 bits
    let addr = (ptr & !0x0F) | ((ch as u32) & 0x0f);
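
The packing is easy to check in isolation. This sketch uses invented function names and example addresses, and assumes the buffer address has its bottom four bits clear:

```rust
/// Combines a buffer address with a channel number in the bottom 4 bits.
fn mbox_message(addr: u32, ch: u8) -> u32 {
    (addr & !0x0F) | ((ch as u32) & 0x0F)
}

/// Splits a mailbox word back into (address, channel).
fn split_message(msg: u32) -> (u32, u8) {
    (msg & !0x0F, (msg & 0x0F) as u8)
}
```

With an address of 0x81230 and channel 8 the word sent is 0x81238. If the address were 0x81234 instead, the word would still be 0x81238: the low bits of the address are silently discarded, so the GPU would read from the wrong place.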

Now we write this to the MBOX_WRITE address, which is a memory mapped device address, so it immediately causes the GPU (which is listening on the other end) to process our message. When it has finished processing, the flags we read from the MBOX_STATUS address change to tell us so, and we wait for that to happen.

    // send to the mailbox write address
    mmio_write(MBOX_WRITE, addr);

    // wait until we have a response from the GPU
    while mmio_read(MBOX_STATUS) & MBOX_PENDING != 0 {
    }
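
Both waits have the same form: read a status word until some bit clears. Here is a sketch of that pattern with an upper bound on the number of reads, so that a missing response cannot hang forever. The function name, the limit and the bit value used in the note below are all made up for illustration; the register read is passed in as a closure so the sketch runs without hardware.

```rust
/// Polls `read_status` until `(status & mask) == 0`.
/// Returns the number of reads it took, or None if `max_polls` was reached.
fn wait_until_clear(
    mut read_status: impl FnMut() -> u32,
    mask: u32,
    max_polls: u32,
) -> Option<u32> {
    for polls in 1..=max_polls {
        if read_status() & mask == 0 {
            return Some(polls);
        }
    }
    None
}
```

In the kernel the call might look like wait_until_clear(|| mmio_read(MBOX_STATUS), MBOX_BUSY, 1_000_000), assuming mmio_read returns a u32.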

We can then write the output to the tty, which will come out on the terminal screen when running in the emulator.

    // show the returned buffer contents
    write("returned buffer contents:\n");
    let mut x = 0;
    while x < 36 {
        write_8_chars(hex32(buf[x]));
        write("\n");
        x = x + 1;
    }

We now want to process the response. We know that the response is written back into the same buffer that we sent, so we should just be able to use buffer accesses. When I tried that, the values I read back were the ones I had written, and I concluded that the compiler had decided the buffer could not have changed and so handed back the constants I had put there. I'm not sure that was true, but I added a read_volatile anyway. Once we have the response in place, we can check that everything that should have come back in a particular way did, in fact, come back that way.

    // The compiler optimizes away (or something) reads into buf and returns what we wrote
    // We need to be sure we read what was written
    let stat = unsafe { read_volatile(volbuf.add(1)) };

    // test that we received valid data
    if stat != 0x80000000 {
        write("error returned from getfb ");
        write_8_chars(hex32(stat));
        write("\n");
        return;
    }

    let pixdepth = unsafe { read_volatile(volbuf.add(20)) };
    if pixdepth != 32 {
        write("pixel depth is not 32");
        return;
    }

    // on return, slot 28 holds the framebuffer address in place of the alignment we sent
    let alignment = unsafe { read_volatile(volbuf.add(28)) };
    if alignment == 0 {
        write("alignment is zero");
        return;
    }
}
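
The three checks can also be written as a function that returns a Result, which makes them testable on a desktop. This is a sketch under assumptions: check_response is an invented name, the error strings are mine, and the slot numbers (1, 20 and 28) are the ones used in the code above.

```rust
use std::ptr::read_volatile;

/// Checks a mailbox reply and returns the word in slot 28 on success.
fn check_response(buf: &[u32; 36]) -> Result<u32, &'static str> {
    let p = buf.as_ptr();
    // SAFETY: `p` points at 36 valid words and every index used is below 36.
    let (status, depth, base) = unsafe {
        (
            read_volatile(p.add(1)),
            read_volatile(p.add(20)),
            read_volatile(p.add(28)),
        )
    };
    if status != 0x8000_0000 {
        return Err("mailbox did not report success");
    }
    if depth != 32 {
        return Err("pixel depth is not 32");
    }
    if base == 0 {
        return Err("no framebuffer address returned");
    }
    Ok(base)
}
```

Returning the error instead of printing it leaves the decision about what to write to the tty with the caller.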

We can now run this in the emulator by using make:

cd asm
make run

Nothing very exciting happens, but we do get a bunch of numbers coming out confirming the buffer response and also showing us the converted RGB pixels of the image.

I have checked this in as RUST_BARE_METAL_INITIALIZE_FRAMEBUFFER. So now it's time to move on to actually rendering the image on the screen.

Rendering the Image

So now we have an image in a buffer that is in basic RGB format, and we have a connection to the framebuffer. We need to extract the address of the framebuffer from the mailbox reply, and then start copying our image data from the decoded buffer to the framebuffer.

Again, we have example C code to follow, and it's just a question of trying to convert that into Rust. I'm not going to reproduce all the C code here, although you can look at it in the relevant tutorial if you want.

The code to render Homer is fundamentally simple, so I'm going to present the function I ended up with all in a single slab here, and then go back and show the changes I had to make elsewhere in order to make everything fit together.

Basically, it is just a question of copying the array of pixels that make up the image of Homer into the block of memory we have been allocated as the framebuffer. The only caveat is that we need to copy it into the centre of the screen on a line-by-line basis.

fn show_homer(fb : &FrameBufferInfo, homer : &[u8; HOMER_BYTES]) {
    // Because we want to put Homer in the middle of the screen, we need to first figure out where
    // he should go.  We have a framebuffer width and height, and a homer width and height.
    // So we need to put half of the distance in each direction as an initial value for x and y

    let xoff = fb.width/2 - HOMER_WIDTH/2;
    let yoff = fb.height/2 - HOMER_HEIGHT/2;

So the first thing we do is to figure out how much blank space we need at the top and the left.

    // Now we want to go over each of the scan lines, having the ptr be the base (from fb.base_addr)
    // plus the current y multiplied by the pitch, plus the above xoff

    let mut homer_index: usize = 0;
    let mut y = 0;
    while y < HOMER_HEIGHT {
        let ptr = fb.base_addr + (yoff + y) * fb.pitch + xoff*4;

Each row is just copied sequentially from the source image (using homer_index) into a single scan line. To work out where the row starts, we take the base address of the framebuffer, add the number of bytes per line (the pitch) multiplied by the sum of the top margin and the current line, and finally add a left indent of the left margin multiplied by 4 (the number of bytes per pixel). It is important to remember here that both the framebuffer base address and ptr are byte addresses.

        let mut x: u32 = 0;
        while x < HOMER_WIDTH {
            unsafe { *((ptr + x*4 + 0) as *mut u8) = homer[homer_index + 0]; }
            unsafe { *((ptr + x*4 + 1) as *mut u8) = homer[homer_index + 1]; }
            unsafe { *((ptr + x*4 + 2) as *mut u8) = homer[homer_index + 2]; }
            unsafe { *((ptr + x*4 + 3) as *mut u8) = homer[homer_index + 3]; }
            x += 1;
            homer_index += 4;
        }
        y += 1;
    }
}

Finally, we copy the row across one pixel (four bytes) at a time, increasing the x position by one after each pixel (it is multiplied by 4 when computing the start location of the next pixel) and homer_index by 4. At the end of a row, we increase the line count by 1.
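
The address arithmetic is worth checking with real numbers. This sketch pulls it out into two small functions (the names are invented) and assumes a 1024x768 framebuffer whose pitch is 4096 bytes; the pitch is whatever the GPU reports, so that figure is only an example.

```rust
/// Top-left corner that centres an image of `img_w` x `img_h` pixels.
fn centre_offsets(fb_w: u32, fb_h: u32, img_w: u32, img_h: u32) -> (u32, u32) {
    (fb_w / 2 - img_w / 2, fb_h / 2 - img_h / 2)
}

/// Byte offset from the framebuffer base of image pixel (x, y),
/// with 4 bytes per pixel and `pitch` bytes per scan line.
fn pixel_offset(pitch: u32, xoff: u32, yoff: u32, x: u32, y: u32) -> u32 {
    (yoff + y) * pitch + (xoff + x) * 4
}
```

For the 96x64 Homer on a 1024x768 screen the margins come out as 464 pixels on the left and 352 lines at the top, so the first pixel lands 352 * 4096 + 464 * 4 = 1443648 bytes into the framebuffer.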

So far, so good, except that doesn't compile for a number of reasons. Most importantly, where did that new argument type FrameBufferInfo come from?

To group together all of the information about the framebuffer, which is collected from the reply to the mailbox call that initializes the framebuffer and then passed to the render function, I declared a struct.

struct FrameBufferInfo {
    width : u32,
    height : u32,
    pitch: u32,
    base_addr: u32
}

The four pieces of information we need here are the physical width and height of the framebuffer (which may be set differently to what we asked for), the pitch (the number of bytes in each line) and the allocated base address of the framebuffer.

Less obviously, we have reused HOMER_HEIGHT and HOMER_WIDTH which were defined in ghff.rs when we were reading the file. Items in a Rust module are private by default, so these were not accessible from outside it and we have made them pub.

pub const HOMER_HEIGHT : u32 = 64;
pub const HOMER_WIDTH : u32 = 96;

Furthermore, in the file where we are referencing them, they need to be explicitly imported:

use crate::ghff::HOMER_WIDTH;
use crate::ghff::HOMER_HEIGHT;
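
The visibility rule can be seen in a single file. In this sketch the module is declared inline and the path has no crate:: prefix so that it stands alone; the private constant and the function are invented for the example.

```rust
mod ghff {
    pub const HOMER_HEIGHT: u32 = 64;
    pub const HOMER_WIDTH: u32 = 96;

    // Hypothetical private item: without `pub` it cannot be named outside `ghff`.
    #[allow(dead_code)]
    const BYTES_PER_PIXEL: u32 = 4;
}

// Both constants can be imported with one grouped `use`.
use ghff::{HOMER_HEIGHT, HOMER_WIDTH};

/// Number of pixels in the image.
fn homer_pixels() -> u32 {
    HOMER_WIDTH * HOMER_HEIGHT
}
```

Trying to add BYTES_PER_PIXEL to the use line would be rejected by the compiler because the constant is private.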

Of course, we need to actually hook this code into the main flow somewhere, otherwise it will just be so much dead code.

So we add a call to it in kernel_main:

    let mut fb = FrameBufferInfo{width: 0, height: 0, pitch: 0, base_addr: 0};
    lfb_init(&mut fb);
    let homer: &[u8; HOMER_BYTES] = read_homer(HOMER_DATA);
    show_homer(&fb, &homer);

Of course, that's never quite enough: in order to pass in a FrameBufferInfo, we need to first declare it and then populate it, which requires us to pass it to lfb_init.

This, in turn, populates it when it has the answers back from the mailbox:

    let volbuf = &mut buf as *mut u32;
    fb.width = unsafe { read_volatile(volbuf.add(5)) };
    fb.height = unsafe { read_volatile(volbuf.add(6)) };
    fb.pitch = unsafe { read_volatile(volbuf.add(33)) };
    fb.base_addr = unsafe { read_volatile(volbuf.add(28)) } & 0x3fffffff;
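
The mask on the last line deserves a word. As I understand it, the GPU reports the framebuffer as a bus address, and clearing the top two bits turns that into an address the ARM core can use. Here is a sketch of the same extraction using plain indexing, so that it runs without hardware; bus_to_arm and fb_from_reply are invented names, and the struct is repeated so the example stands alone.

```rust
struct FrameBufferInfo {
    width: u32,
    height: u32,
    pitch: u32,
    base_addr: u32,
}

/// Clears the top two bits of the address reported by the GPU.
fn bus_to_arm(bus_addr: u32) -> u32 {
    bus_addr & 0x3fff_ffff
}

/// Picks the framebuffer details out of the reply, using the same slots as above.
fn fb_from_reply(reply: &[u32; 36]) -> FrameBufferInfo {
    FrameBufferInfo {
        width: reply[5],
        height: reply[6],
        pitch: reply[33],
        base_addr: bus_to_arm(reply[28]),
    }
}
```

For example, a reported address of 0xC0100000 becomes 0x00100000.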

OK, so we are ready to give this a go. But I'm bored of running multiple commands to build and run in the emulator, so I'm adding a second script, run.sh which combines both.

set -e

`dirname $0`/build.sh
cd asm
make run

Sadly, when we run this, bad things happen:

Compiling homer_rust v0.1.0 (/home/gareth/Projects/IgnoranceBlog/homer_rust)
Finished release [optimized] target(s) in 0.20s
aarch64-linux-gnu-ld -nostdlib boot.o ../target/aarch64-unknown-linux-gnu/release/libhomer_rust.rlib -T linker.ld -o kernel8.elf
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/release/libhomer_rust.rlib(homer_rust-2e5114238dd615aa.homer_rust.f56f16d2546ffac4-cgu.0.rcgu.o): in function `kernel_main':
homer_rust.f56f16d2546ffac4-cgu.0:(.text.kernel_main+0x9e4): undefined reference to `core::panicking::panic_bounds_check'
aarch64-linux-gnu-ld: homer_rust.f56f16d2546ffac4-cgu.0:(.text.kernel_main+0x9fc): undefined reference to `core::panicking::panic_bounds_check'
aarch64-linux-gnu-ld: homer_rust.f56f16d2546ffac4-cgu.0:(.text.kernel_main+0xa18): undefined reference to `core::panicking::panic_bounds_check'
aarch64-linux-gnu-ld: homer_rust.f56f16d2546ffac4-cgu.0:(.text.kernel_main+0xa34): undefined reference to `core::panicking::panic_bounds_check'
make: ** [Makefile:13: kernel8.img] Error 1

It would seem that the problem here is that some part of the render code indexes into an array, which Rust automatically bounds checks. In doing so, it calls panic_bounds_check if the index is out of bounds. But, as the core:: prefix shows, this lives in Rust's core library, which we are NOT linking in.

The simplest solution is to define it, but how and where? If we can find the mangled name (which has been demangled for the purposes of this error message), we can simply define it in boot.S.

objdump is our friend again, and we can use it to dump all the symbols found or referenced in our library:

aarch64-linux-gnu-objdump -t target/aarch64-unknown-linux-gnu/release/libhomer_rust.rlib
In archive target/aarch64-unknown-linux-gnu/release/libhomer_rust.rlib:

...
homer_rust-2e5114238dd615aa.homer_rust.f56f16d2546ffac4-cgu.0.rcgu.o: file format elf64-littleaarch64

SYMBOL TABLE:
...
0000000000000000 *UND* 0000000000000000 _ZN4core9panicking18panic_bounds_check17h75f9d87f2a814b7bE

Yes, that was my reaction too. Not the world's most obvious name. But we can take it and run with it.
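
The name is less random than it looks. In this older style of Rust mangling, the symbol starts with _ZN, then each path segment is written as its length followed by its text, and the whole thing ends with E; the final h-prefixed segment is a hash. The following sketch decodes that much and no more: it ignores the escape sequences the scheme also uses for special characters, and demangle_legacy is an invented name.

```rust
/// Splits a legacy-mangled Rust symbol into its path segments.
/// Returns None if the text does not follow the `_ZN<len><name>...E` pattern.
fn demangle_legacy(symbol: &str) -> Option<Vec<String>> {
    let mut rest = symbol.strip_prefix("_ZN")?;
    let mut parts = Vec::new();
    while !rest.starts_with('E') {
        // Read the decimal length, then take that many bytes as the segment.
        let digits = rest.chars().take_while(|c| c.is_ascii_digit()).count();
        let len: usize = rest.get(..digits)?.parse().ok()?;
        let segment = rest.get(digits..digits + len)?;
        parts.push(segment.to_string());
        rest = &rest[digits + len..];
    }
    Some(parts)
}
```

Joining the segments of our symbol with :: gives core::panicking::panic_bounds_check::h75f9d87f2a814b7b, which matches the demangled name in the linker error apart from the hash.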

.globl _ZN4core9panicking18panic_bounds_check17h75f9d87f2a814b7bE
_ZN4core9panicking18panic_bounds_check17h75f9d87f2a814b7bE:
b halt
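
As an aside, another approach I have not tried here is to avoid indexing expressions that can panic in the first place. This sketch shows two ways of reading and copying bytes that have no out-of-bounds failure path: iterators, and slice::get, which returns an Option. The function names are invented, and whether this would remove the undefined reference depends on what else in the crate can panic.

```rust
/// Copies bytes from `src` to `dst`, stopping at the shorter of the two.
/// Returns how many bytes were copied.
fn copy_row(dst: &mut [u8], src: &[u8]) -> usize {
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        *d = *s;
    }
    dst.len().min(src.len())
}

/// Reads one byte, returning 0 if `index` is past the end instead of panicking.
fn byte_at(data: &[u8], index: usize) -> u8 {
    data.get(index).copied().unwrap_or(0)
}
```

The trade-off is that a bad index now produces a silent default where indexing would have panicked, which may or may not be what you want when debugging.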

So, let's try running it again. And there he is! Homer appears in the middle of the screen.

But ... he looks a little weird. Sort of off-colour. I'm not 100% sure what, if any, tools can tell you what I've done wrong here, but given how close it is, it's fairly obvious the pixels are in the wrong places. Exhaustive testing would probably indicate the problem, but I did random trial-and-error and then went back and compared what I'd done to the original.

The problem is that we have three significant bytes (red, green and blue) in a 32-bit word (4 bytes) for each pixel. One of them has to be zero. I thought, given that it was really a 24-bit value, the most significant 8 bits would be zero and the three data bytes would be packed into the low 24 bits. But apparently not: the first three bytes are significant and the fourth is zero. Thus, I need to go back and rework the code that expands the GHFF into bytes for HOMER_DATA.

            IMAGE[pos] = ((c0-33) << 2) | ((c1-33) >> 4);
            IMAGE[pos+1] = ((c1-33) << 4) | ((c2-33) >> 2);
            IMAGE[pos+2] = ((c2-33) << 6) | ((c3-33));
            IMAGE[pos+3] = 0;
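
The same expansion can be written as a standalone function and checked with known values. This is a sketch: decode_pixel is an invented name, it assumes each input byte is a valid character of the format (33 plus a 6-bit value), and it spells out the masks that the u8 shifts above apply implicitly by dropping the high bits.

```rust
/// Decodes four header characters into [R, G, B, 0].
fn decode_pixel(c: [u8; 4]) -> [u8; 4] {
    // Each character carries 6 bits, offset by 33 ('!').
    let v = c.map(|b| b.wrapping_sub(33));
    [
        (v[0] << 2) | (v[1] >> 4),          // 6 bits of v0, top 2 bits of v1
        ((v[1] & 0x0F) << 4) | (v[2] >> 2), // low 4 bits of v1, top 4 bits of v2
        ((v[2] & 0x03) << 6) | v[3],        // low 2 bits of v2, all 6 bits of v3
        0,                                  // fourth byte is unused
    ]
}
```

Four '!' characters (value 33) decode to a black pixel, and the characters with values 96, 81, 33, 33 decode to pure red, [255, 0, 0, 0].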

And just like that, amazingly, we have Homer in the middle of the screen. While it's working, let's check it in as RUST_BARE_METAL_HOMER_APPEARS.

I say amazingly because, knowing what I know now, it is amazing that this works. More luck than judgment, some would say. I tried removing all the trace statements and it stopped working. I tried it in a real device: no joy. I tried this and tried that, none of them worked. It is amazing how fragile it is. This does not seem like what we would expect.

Conclusion

We have succeeded in building a bare metal program which runs, at least under certain conditions, in the emulator. It does not run under all conditions and not at all on real hardware.

It is a mess. As I looked over the code while writing up the description of it, I was appalled at how quickly it has spiralled out of control.

I am obviously not very experienced with Rust, and it shows: I am using all kinds of odd features in ways that I don't think the designers intended. We need to pull that back in.

I don't feel I have a stable methodology for development here.

All of that, of course, is considered acceptable on this blog: here we support ignorance and the ability to learn from that and to grow and discover things. The big problem is deciding which order we are going to deal with things in. There is so much to do here, and so little time.

For me (at least; speak up if you feel differently), the most important thing is stability: I want it to behave predictably. Most importantly, in a given environment (in this case, the emulator) it should not be influenced by such trivia as whether or not there are debugging statements in the code. Then I want to get it working on real hardware, which I think will involve getting the real UART to work, and only then will I be able to consider what an appropriate factoring might be.