Sunday, January 7, 2024

Getting to Grips with Memory Allocation

Now I want to go back and handle memory allocation properly. All of our buffers and arrays are statically allocated (and often reused). And we have the hack to address aligning message buffers that I want to remove.

It was at this point that I discovered the Embedded Rust Manual while searching for information about memory allocation. In particular, I became much more aware of the difference between the core crate and the std crate. Specifically, most of the hard work is done in the core crate and the main things that are missing are memory allocation and automated loading of libraries. Collections such as Vec are also excluded, but can be included separately by importing them from the collections crate. Macros such as println! also seem to be a problem, but it's one I've seen other people work around and will come back to.

Explicit memory allocation is covered in Chapter 7 of the Embedded Rust Manual and it seems at first glance to be really complicated. Now that I think I've understood it (having read numerous other references as well), let's test that by seeing if I can explain it while writing some code.

Let's start by writing the line of code that we really want, replacing the code that allocates a message buffer on the stack with one that allocates a 35-byte message on the heap and returns an array.
    let mut buf: &mut [u32] = allocate_message_buffer(35);
To implement this function we need to allocate a block of memory using the "global allocator" and then "turn that" into an array slice.

In order to use alloc directly, we need to add the following declaration at the top of our file:
extern crate alloc;
The documentation says that it is also necessary to specify #![feature(alloc)], but the compiler contradicts this, saying that this is only necessary for unstable features, and the alloc feature has been stable for a while, so the feature declaration is no longer needed.

The function to allocate the memory we need for our message buffer is then fairly easy to write, with the caveat that it involves some arcane Rust magic:
fn allocate_message_buffer(nwords: usize) -> &'static mut [u32] {
    unsafe {
        let ptr = alloc::alloc::alloc(alloc::alloc::Layout::from_size_align_unchecked(nwords * 4, 16)) as *mut u32;
        alloc::slice::from_raw_parts_mut(ptr, nwords)
    }
}
Starting in the middle, the normal way of referencing alloc would be std::alloc::alloc, that is, the alloc function in the alloc module of the std crate, but because we aren't using std but the alloc crate direcly, it becomes alloc::alloc::alloc. Likewise, the Layout class would normally be referenced as std::alloc::Layout but becomes alloc::alloc::Layout.

The key thing we want to achieve is 16-byte alignment, so we specify a Layout with a size of nwords words and an alignment of 16. It's my impression that both of these values are expected to be in bytes, but I can't claim that I have read anything which makes this absolutely clear. We then pass the resultant Layout object to the alloc method which returns a *mut u8 pointer, which we then cast to a *mut u32 pointer.

However, what we really want to do is to return an array slice, which is a "regular", stack-allocated object which wraps this pointer together with other meta-information (specifically, the size of each element and the number of elements that can be referenced without overflow) which can be used "safely", that is, with full compiler checks and no need for the unsafe keyword.

To effect this transition, we call the slice::from_raw_parts_mut function, passing in the pointer and the available number of words (I think it deduces the size of each entry from the type of the pointer). This constructs and returns an array slice, which we then return to the caller.

Most of the return type - &mut [u32] - makes perfect sense to me, but the compiler forced me to add the 'static lifetime. I'm still a Rust newbie, so I can't claim to understand this, but if it's what the compiler requires, it "must be" right - at least until it isn't or until I know more. My instinct is that we want to copy what we have created on the stack into the parent, not return something with the static lifecycle, but since I don't understand what I am doing, I will follow the compiler's advice until I run into trouble.

There are a couple more changes that I need to make in order for all the code to be consistent. Obviously, I want to remove the copy into the Message struct, and we need to then convert the array slice address to a raw pointer using the as_mut_ptr.
    mbox_send(8, &mut buf);

    let volbuf: *mut u32 = buf.as_mut_ptr();
We need to change the signature of the mbox_send to remove the explicit array length, and likewise change the code to extract the pointer.
fn mbox_send(ch: u8, buf: &mut[u32]) {
    while mmio_read(MBOX_STATUS) & MBOX_BUSY != 0 {
    }

    // obtain the address of buf as a raw pointer
    let volbuf = buf.as_mut_ptr();
So now we can try building this. As you might expect, we get linking errors:
   Compiling homer_rust v0.1.0 (/home/gareth/Projects/IgnoranceBlog/homer_rust)
    Finished release [optimized] target(s) in 0.20s
aarch64-linux-gnu-ld -nostdlib boot.o ../target/aarch64-unknown-linux-gnu/release/libhomer_rust.rlib -T linker.ld -o kernel8.elf
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/release/libhomer_rust.rlib(homer_rust-2e5114238dd615aa.homer_rust.f56f16d2546ffac4-cgu.0.rcgu.o): in function `kernel_main':
homer_rust.f56f16d2546ffac4-cgu.0:(.text.kernel_main+0xb8): undefined reference to `__rust_no_alloc_shim_is_unstable'
aarch64-linux-gnu-ld: homer_rust.f56f16d2546ffac4-cgu.0:(.text.kernel_main+0xbc): undefined reference to `__rust_no_alloc_shim_is_unstable'
aarch64-linux-gnu-ld: homer_rust.f56f16d2546ffac4-cgu.0:(.text.kernel_main+0xcc): undefined reference to `__rust_alloc'
make: *** [Makefile:14: kernel8.img] Error 1
Having said I was expecting linking errors, I don't think I was expecting these linking errors: I don't know what rust_no_alloc_shim_is_unstable even is, and I was thinking the rust_alloc would be defined and have some internal logic to handle the actual allocation process.

Anyway, I'm now going to go ahead and implement an allocator as laid out in the Embedded Rust Book chapter on alloc and see what happens.

I'm going to put the allocator in a new module, so in lib.rs, I added the following line toward the top of the file:
mod allocator;
and created a file allocator.rs.

The first thing I want to do is build enough code that the program links again, and then worry about whether it works or not - once it builds, I can always connect a debugger if I need to in order to see what is happening inside.

Following the instructions in the guide, the first thing we are going to do is use the global_allocator attribute to declare an allocator called HEAP.
use core::alloc::GlobalAlloc;

#[global_allocator]
static HEAP: HeapAllocator = HeapAllocator {
};

struct HeapAllocator {
}
In order to be able to fulfill the role of global_allocator, it is necessary to implement the GlobalAlloc trait. This has two methods, alloc and dealloc. For now, while we are just trying to get something to link, we are going to just return 0 in alloc. We don't need to implment dealloc at all yet.
unsafe impl GlobalAlloc for HeapAllocator {
    unsafe fn alloc(&self, _layout: core::alloc::Layout) -> *mut u8 {
        // simply return 0 - won't work but will link
        0x0 as *mut u8
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: core::alloc::Layout) {
        // we don't support deallocation yet
    }
}
Excellent. That links and we can try running it. Unsurprisingly, it doesn't resize the screen and doesn't display Homer. OK, I can live with that for now.

Moving on, let's try and allocate a heap and then have our allocator return the block at the beginning of that area.

It seems to me that the right way to do this is to allocate a .heap section in the linker script, linker.ld. I'm going to put it at the end, after the .bss section and before the .end (obviously):
    . = ALIGN(4096); /* align to page size */
    __bss_end = .;
    __bss_size = __bss_end - __bss_start;

    __heap_start = .;
    .heap :
    {

    }
    . = 0x100000;
    __heap_end = .;
    __heap_size = __heap_end - __heap_start;
    
    __end = .;
This declares a .heap section and three symbols: __heap_start, which marks the start of the block, immediately after the end of the .bss section, aligned to a 4096-byte page boundary; __heap_end, at the end of the heap, which I have explicitly forced to be at 0x100000; and __heap_size, the size of the heap, which the linker calculates as the difference between these two.

Now we can update the allocator to include __heap_start:
extern {
    pub static __heap_start : *mut u8;
}
and return it from the alloc function:
    unsafe fn alloc(&self, _layout: core::alloc::Layout) -> *mut u8 {
        // just return the top of heap
        __heap_start
    }
Strangely, this now goes back to issuing some linker errors:
   Compiling homer_rust v0.1.0 (/home/gareth/Projects/IgnoranceBlog/homer_rust)
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
aarch64-linux-gnu-ld -nostdlib boot.o ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib -T linker.ld -o kernel8.elf
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib(homer_rust-c9f0f32593953886.1y1lhhvr9vp1xcdg.rcgu.o): in function `alloc::alloc::alloc':
/rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/alloc.rs:92: undefined reference to `__rust_no_alloc_shim_is_unstable'
aarch64-linux-gnu-ld: /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/alloc.rs:92: undefined reference to `__rust_no_alloc_shim_is_unstable'
It's that symbol __rust_no_alloc_shim_is_unstable again. I suspect that this is something to do with unstable Rust builds; let's track that down.

This thread explains that it is there as a hint to remind implementors that some aspects of the API may change in the future. I'm not quite sure what that means, but it sounds a bit like signing a legal contract. I don't see how it will actually help me in the future, unless it is going to be replaced by _is_stable or something when the API changes, which will stop the program linking again until I make the necessary changes.

Anyway, it seems to be declared in the allocator API and we just need to define and export it:
#[allow(non_upper_case_globals)]
#[no_mangle]
pub static __rust_no_alloc_shim_is_unstable: u8 = 0;
This feels more wrangling than actual code: the first simply suppresses the warning associated with having to declare a C style variable: Rust expects static variables to be declared in upper snake case. The second stops the name (which has already been mangled) being mangled again.

Running the application with this allocator in place, we now see Homer again.

Adding in a trace statement to check on the address that is returned from our allocator, I'm staggered to see it returning 00000000. How can that be? And if so, why did it not work when I returned 0 explicitly?

I suspect that there is something funky going on between declaring the variable __heap_start in the linker and using it here. So let's attach the debugger again. Remember that we have scripts scripts/debug.sh to compile and start the program, and scripts/gdb.sh to run the debugger. When we do this, we can put a breakpoint in the allocate function:
(gdb) b allocator.rs:21
Breakpoint 1 at 0x83364: file src/allocator.rs, line 21.
(gdb) c
Continuing.

Thread 1 hit Breakpoint 1, homer_rust::allocator::{impl#0}::alloc (self=0x8a163, _layout=...) at src/allocator.rs:21
21                __heap_start
(gdb) x/5i $pc
=> 0x83364 <_ZN89_LThomer_rust..allocator..HeapAllocatoru20asu20core..alloc..global..GlobalAllocGT5alloc17h5b531fd15c42200cE+16>:        adrp        x8, 0x8b000
   0x83368 <_ZN89_LThomer_rust..allocator..HeapAllocatoru20asu20core..alloc..global..GlobalAllocGT5alloc17h5b531fd15c42200cE+20>:        ldr        x8, [x8, #16]
   0x8336c <_ZN89_LThomer_rust..allocator..HeapAllocatoru20asu20core..alloc..global..GlobalAllocGT5alloc17h5b531fd15c42200cE+24>:        ldr        x0, [x8]
   0x83370 <_ZN89_LThomer_rust..allocator..HeapAllocatoru20asu20core..alloc..global..GlobalAllocGT5alloc17h5b531fd15c42200cE+28>:        add        sp, sp, #0x20
   0x83374 <_ZN89_LThomer_rust..allocator..HeapAllocatoru20asu20core..alloc..global..GlobalAllocGT5alloc17h5b531fd15c42200cE+32>:        ret
So it seems that what is happening here is that it is treating __heap_start not as a pointer itself, but as the location where the desired value can be found. Consequently, it is dereferencing it and returning the value in memory at that location - which could be anything but happens to be 0. So we need to take the address of __heap_start and then work through all the insanity of casting it to the value it needs to be:
    unsafe fn alloc(&self, _layout: core::alloc::Layout) -> *mut u8 {
        // just return the top of heap
        (&__heap_start as *const _) as *mut u8
    }
In this context, *const _ refers to a "raw" pointer which is what we need to move from the pointer we have to the pointer we want. I think in the fullness of time the correct solution to declare the value of __heap_start not as a pointer but as an array or object or something from which we can more reasonably take the address. But this works, and the value shown by the debugger is now 00090000 which matches the value we can see using nm:
$ nm asm/kernel8.elf
...
0000000000090000 B __heap_start
...
This is checked in and tagged as RUST_BARE_METAL_MESSAGE_ALLOCATOR.

A Simple, Page-Based Allocator

So, the actual process of incorporating an allocator wasn't too bad. On the other hand, I haven't actually written an allocator yet - I have just implemented the API and it won't so much as handle two calls, let alone complexities such as reallocation or multi-threading.

Not surprisingly, the alloc documentation in the Embedded Rust Book suggests that you don't write your own allocator, but use a battle-hardened one. But that's not why we're here. We are here (as JFK would say) to "play Texas": these are precisely the things we want to do in order to see how they work and how we can deal with the problems that arise.

For now, at least, I want to write something very simple, which can handle both allocation and deallocation and can handle different sizes of memory, along with alignment.

One of the challenges of memory management is fragmentation, and knowing where the most suitable chunk of memory can be found. What I'm going to do to reduce this complexity is only offer three possible chunk sizes: 16, 256 and 4096 bytes. Each of these will be aligned to the same boundary as its size. If you want less memory than one of these, I'll give you the next size up. If you want more than 4096 bytes in a single allocation, you will see receive a series of pages, fresh from the heap, but at least they will be guaranteed to be consecutive.

Memory is often (generally?) issued in 4096-byte "pages". This is certainly the case in the ARM v8 MMU. So if we handle memory internally in pages, we will probably find life easier than if we do something else.

The struct associated with the global_allocator can hold arbitrary data items, although obviously this is all allocated in the data segment, not on the heap. We can use this to hold references to our currently available pages.

My basic strategy is to divide up the memory between __heap_start and __heap_end into pages, and to allocate one of these each time we have run out of free memory. We will call this next_page. We can either allocate the new page as a whole page, as 15 256-byte blocks, or as 255 16-byte blocks. If a new page is requested and the pointer is already at __heap_end, we will return nulll (indicating out of memory, which apparently causes a panic in _rust_alloc). There is no garbage collection here, we are only going to reuse memory which has been deallocated.

We also need pointers to the free lists, which we will call free_16, free_256 and free_4096. Note that even when someone requests a whole page, we may have pages on hand which have been previously allocated and then deallocated which we can hand them. We will initialize each of these pointers to 0, indicating that there are no free entries available for that size.

When an alloc request is received, if a suitable block is on the free list, we want to return that, and update the pointer to point to the next free block. If the pointer is 0, then we need to allocate a whole new page, and then initialize it. What is meant by initialization? For a whole page, nothing is required, but for the other cases, we need to do two things:
  • The first entry (somewhat wastefully) contains an integer which is the size of the entries in that block, i.e. either 16 or 256. We will use this in dealloc to determine how big the block was that we issued.
  • We need to set up a linked list through the remaining entries in the block, so that when we return one of the entries in the block, we know where the next one is.
OK, ready to start?

This is a non-trivial piece of code, and it is made more complicated by the fact that I am abusing Rust without really knowing what I'm doing. I am going to press on regardless, and get something to work which is in about the shape that we want. I'm then going to double back and use automated tests (now that I know how) to flesh out the implementation.

First off, I'm going to change __heap_start to be a u32, since we're only taking its address anyway. I'm also going to introduce __heap_end as a u32, so that we can detect when we run out of memory.
extern {
    pub static __heap_start : u32;
    pub static __heap_end : u32;
}
I'm going to rename HeapAllocator to PageAllocator, since that more accurately reflects what it is doing, rather than the role that it is playing. And I am going to add four members: one to track where the next unassigned page is, and the other three to track where the heads of the three free lists are.
struct PageAllocator {
    next_page: UnsafeCell<usize>,
    free_16: UnsafeCell<usize>,
    free_256: UnsafeCell<usize>,
    free_4096: UnsafeCell<usize>
}
I spent a lot of time trying to figure out how to store these values, but I ended up settling for the same UnsafeCell<usize> that is used in BumpPointerAlloc in the embedded Rust book. But doing this causes an error to appear:
error[E0277]: `UnsafeCell<usize>` cannot be shared between threads safely
  --> src/allocator.rs:14:14
   |
14 | static HEAP: PageAllocator = PageAllocator {
   |              ^^^^^^^^^^^^^ `UnsafeCell<usize>` cannot be shared between threads safely
   |
   = help: within `PageAllocator`, the trait `Sync` is not implemented for `UnsafeCell<usize>`
note: required because it appears within the type `PageAllocator`
  --> src/allocator.rs:21:8
   |
21 | struct PageAllocator {
   |        ^^^^^^^^^^^^^
   = note: shared static variables must have a type that implements `Sync`
I spent some time looking into this, and trying to understand how the BumpPointerAlloc class worked, until I spotted this line in the example:
unsafe impl Sync for BumpPointerAlloc {}
Basically, just wish it to be true! The code explicitly says that the memory allocator is only valid in a single-threaded environment. Presumably it would be possible to make the allocator thread-safe by appropriate use of thread-aware constructs, but since we are (currently) also in a single-threaded environment, I'm not going to worry about this too much, and just copy it.
// OK to hack this because we are single threaded for now
unsafe impl Sync for PageAllocator {}
Having declared the allocator, we need to initialize it. I again spent a non-trivial amount of time trying to figure out how to pass __heap_start into the initializer before finally giving up and saying "that is just a case I will need to handle in alloc". So we initialize all the fields to zero instead:
#[global_allocator]
static HEAP: PageAllocator = PageAllocator {
    next_page: UnsafeCell::new(0),
    free_16: UnsafeCell::new(0),
    free_256: UnsafeCell::new(0),
    free_4096: UnsafeCell::new(0)
};
Now it's just a question of writing the actual allocation function. This has four cases, which are:
  • allocate a small block (16 bytes or less)
  • allocate a middling block (17-256 byres)
  • allocate a page (for requests from 257-4096 bytes)
  • fail for now (for larger requests)
    unsafe fn alloc(&self, layout: core::alloc::Layout) -> *mut u8 {
        if layout.size() <= 16 && layout.align() <= 16 {
            self.next(&self.free_16)
        } else if layout.size() <= 256 && layout.align() <= 256 {
            self.next(&self.free_256)
        } else if layout.size() <= 4096 && layout.align() <= 4096 {
            self.next(&self.free_4096)
        } else {
            core::ptr::null_mut() // this is allegedly what will cause OOM exceptions
        }
    }
So, yes, this does just delegate all of the hard work to a function called next. The reason for this is that the three cases where we do allocate memory are all very similar and we just need to consider which list to use (we also need to know the sizes of the block, but that code isn't written yet).

That function does all the hard work. I think I'm going to present it first, then talk about it.
impl PageAllocator {
    unsafe fn next(&self, list: &UnsafeCell<usize>) -> *mut u8 {
        let lp = list.get();
        if *lp == 0 {
            let np = self.next_page.get();
            // we don't have any free memory, allocate next page
            if *np == 0 {
                *np = (&__heap_start as *const _) as usize;
            }
            if *np >= (&__heap_end as *const _) as usize {
                return core::ptr::null_mut(); // we are out of memory
            }

            // TODO: still need to initialize inside of block if not 4096
            *lp = *np;
            *np = *np + 4096;
        }
        let curr = *lp;
        *lp = *(curr as *const usize);
        curr as *mut u8
    }
}
First, note that this is not in the same impl block as alloc. That's because that block is exclusively for implementing the methods of the GlobalAlloc trait. This is a new method exclusive to PageAllocator and so goes in its own block.

An UnsafeCell is an object which can hold a value and return a pointer to it when desired. We are using these here to store the head of a linked list inside our heap. In next, we start by obtaining and dereferencing this pointer to obtain the usize value. If it is the value 0, then we know we don't have any available items on this free list and we need to allocate some. Leaving aside that block, when we have an entry on the free list, we assign it to curr, the current value we will return, and then assign the contents of that memory address to the "head" pointer. This assumes that we have stored the next free pointer (if any) in this location previously, or 0 if there are no more free locations.

Now let's return to the base case, in which there are no entries on the free list. Here, we look at the next_page pointer, which is where we are storing the pages that haven't been used yet. If this is 0, then this is the first call to alloc, and we need to assign the address of __heap_start to the next_page pointer before continuing.

The next check ensures that the value in the next_page pointer is not past the address of __heap_end. If it is, then we have completely run out of heap and cannot allocate the requested memory. Following the example we are basing our code on, we return core::ptr::null_mut() to indicate that we could not perform the allocation, and it would seem that the parent code deals with the rest of the issues.

We have two final steps to complete: first, we need to initialize this block to make sure that it contains all the right values in all the right places (just a comment right now); and then we need to update the current free list to point to the page just allocated, and move the next_page pointer on by 4096 to point to the next page (which must either be the next free page or __heap_end).

For all that complexity, this code does exactly the same thing as our prior code in the sample we are using: it just returns the address of __heap_start. So, unsurprisingly, the code still works.

I have checked this in as RUST_BARE_METAL_PAGE_ALLOCATOR_1.

Completing the Implementation

So up to this point, I have been mainly fiddling in the dark without really knowing what I was doing and mainly just trying to abuse Rust to the point where it let me compile something. But now I have something working, I want to switch gears and complete the implementation in the canonical way - by writing unit tests.

So let's write a test, which we will place in its own module and file, tests.rs:
#[cfg(test)]
mod tests {
    use crate::allocator::PageAllocator;
    use core::{cell::UnsafeCell, alloc::GlobalAlloc};

    #[test]
    fn test_allocate_first_page() {
        let pa = simple_allocator();
        unsafe {
            let addr = pa.alloc(alloc::alloc::Layout::fromsizealign(4096, 16).unwrap()) as usize;
            assert_eq!(addr, 4096);
        }
    }

    fn simple_allocator() -> PageAllocator {
        return PageAllocator{
            next_page: UnsafeCell::new(4096),
            free_16: UnsafeCell::new(0),
            free_256: UnsafeCell::new(0),
            free_4096: UnsafeCell::new(0)        
        }
    }
}
The problem with this is that when I go to run the test, the linker gives me errors. Why? Well, I can't say it's entirely unexpected. We have written code that depends on __heap_start and that is a symbol we bind in during our explicit linking process in linker.ld. The cargo script doesn't know how to find it. However, the error message is as impressive as usual for Rust, and in addition to reporting the entire command line which is passed to ld through cc, it gives these errors and suggestions:
          /usr/bin/ld: /home/gareth/Projects/IgnoranceBlog/homerrust/target/debug/deps/homerrust-6186d6a57da04f90.3ti03rv249xsljw4.rcgu.o: in function `homer_rust::allocator::PageAllocator::next':
          /home/gareth/Projects/IgnoranceBlog/homer_rust/src/allocator.rs:58: undefined reference to `__heap_start'
          /usr/bin/ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/allocator.rs:60: undefined reference to `__heap_end'
          collect2: error: ld returned 1 exit status
          
  = note: some `extern` functions couldn't be found; some native libraries may need to be installed or have their path specified
  = note: use the `-l` flag to specify native libraries to link
  = note: use the `cargo:rustc-link-lib` directive to specify the native libraries to link with Cargo (see https://doc.rust-lang.org/cargo/reference/build-scripts.html#cargorustc-link-libkindname)
I have checked the current state of the code in as RUST_BARE_METAL_PAGE_ALLOCATOR_TESTS_BROKEN, in large part so I can try different strategies and come back to a known state.

Interestingly, what my research reveals is that the solution to the whole problem is a lot simpler than seems to be indicated by the messages above. What I hadn't realized, but is clearly documented is that the #[cfg(test)] attribute is in fact a directive for conditional compilation comparable to using #ifdef in C. So we can put any code we want inside the #[cfg(test)] and it will be defined there and not in the live application. And, in contrast, we can define a block with #[cfg(not(test))] which will include code that is only defined when compiling not for tests.

So, I can add this directly into tests.rs:
    #[no_mangle]
    pub static __heap_start :u32 = 0;
    #[no_mangle]
    pub static __heap_end :u32 = 0;
And we can add yet one more attribute to the existing definition of the unstable shim variable in allocator.rs:
#[cfg(not(test))]
#[allow(non_upper_case_globals)]
#[no_mangle]
pub static __rust_no_alloc_shim_is_unstable: u8 = 0;
This stops it being defined more than once in the test scenario.

The tests still fail:
$ cargo test
    Finished test [unoptimized + debuginfo] target(s) in 0.00s
     Running unittests src/lib.rs (target/debug/deps/homer_rust-6186d6a57da04f90)
memory allocation of 48 bytes failed
error: test failed, to rerun pass `--lib`
This is not actually that surprising, because we have defined a #[global-allocator] and so the code we are trying to test is being used to allocate memory for the test. This was never going to end well.

But surely our new "get out of jail free" card can be played here as well:
#[cfg(not(test))]
#[global_allocator]
static HEAP: PageAllocator = PageAllocator {
    ...
};
Now it goes a little further, but still no joy:
$ cargo test
   Compiling homer_rust v0.1.0 (/home/gareth/Projects/IgnoranceBlog/homer_rust)
    Finished test [unoptimized + debuginfo] target(s) in 0.21s
     Running unittests src/lib.rs (target/debug/deps/homer_rust-6186d6a57da04f90)

running 2 tests
test tests::test_set_4_in_1_from_0 ... ok
error: test failed, to rerun pass `--lib`

Caused by:
  process didn't exit successfully: `/home/gareth/Projects/IgnoranceBlog/homer_rust/target/debug/deps/homer_rust-6186d6a57da04f90` (signal: 11, SIGSEGV: invalid memory reference)
SIGSEGV is never good. And it suggests that we have done something wrong with memory and pointers and all that unsafe stuff we've been messing around with. Basically, our code doesn't work. But at least it is compiling and running now.

Removing all the code from the test case gets the test to pass:
running 2 tests
test tests::test_set_4_in_1_from_0 ... ok
test allocator::tests::tests::test_allocate_first_page ... ok
So at least we have a baseline that compiles and links we can work from. I'm going to put the code back, and then check this in as RUST_BARE_METAL_PAGE_ALLOCATOR_TESTS_SEGV.

I don't know how to attach a debugger to cargo test, and, rather than find out right now, I've decided to fall back on the old tactic of commenting out code until I find the line that fails. It's this one:
        *lp = *(curr as *const usize);
But it's not obvious to me why. Presumably the variable curr is not correctly defined. Looking through my code, I see that in the test definition of the allocator, I set it to 4096 which is clearly not a valid address. I'd done that without thinking about the fact that I would be wanting to write to that address. So let's instead set it to zero, and then, now that we have __heap_start defined in the test, that should work.

Sadly, it still fails, but at least it's now an assertion failure:
assertion `left == right` failed
  left: 94405430325248
 right: 4096
So, now that we have reached this point, I think it's time to ask if we are going the way we want to go or not. So, again, I'm going to check in broken code as RUST_BARE_METAL_PAGE_ALLOCATOR_TESTS_FAILING.

I'm not really ready for how complicated this is becoming. But having floundered my way through it, I hope I can explain it in words of one syllable. It occurs to me that what I need to do (in the tests) is to use the standard allocator to allocate a block of (test) heap which I can then treat as if it were the entire heap available to the allocator under test.

So it seems to me that the right thing to do is to extract the section of code that handles initialisation when the initial value of next_page is 0. We can then wrap this up in a closure and pass it to the PageAllocator constructor.

So, in the test, this looks something like this:
            let blk = alloc::alloc::alloc(alloc::alloc::Layout::from_size_align(4096, 16).unwrap());
            let start = (blk as *const _) as usize;
            let f = || { (start, start+4096) };
The first line here allocates a block of memory on the heap which is exactly the size of a page. The second line converts the pointer to a usize value, and the third defines a closure which returns the start and end of the block.

Meanwhile, in the non-test case, we can extract the values of __heap_start and __heap_end as defined by the linker in a function:
    fn init_from_heap() -> (usize, usize) {
        unsafe { ((&__heap_start as *const _) as usize, (&__heap_end as *const _) as usize) }
    }    
We need to update the PageAllocator struct to store this field, as well as a field to hold the end value.
struct PageAllocator {
    next_page: UnsafeCell<usize>,
    end: UnsafeCell<usize>,
    free_16: UnsafeCell<usize>,
    free_256: UnsafeCell<usize>,
    free_4096: UnsafeCell<usize>,
    init: fn() -> (usize, usize)
}
What could possibly go wrong?

Well, quite a bit.

First, we have declared init to have a function type, but Rust views a closure as something different because it references (or "captures") the value of start.
error[E0308]: mismatched types
  --> src/allocator/tests.rs:27:23
   |
20 |             let f = || { (start, start+4096) };
   |                     -- the found closure
...
27 |                 init: f
   |                       ^ expected fn pointer, found closure
   |
   = note: expected fn pointer `fn() -> (usize, usize)`
                 found closure `[closure@src/allocator/tests.rs:20:21: 20:23]`
note: closures can only be coerced to `fn` types if they do not capture any variables
  --> src/allocator/tests.rs:20:27
   |
20 |             let f = || { (start, start+4096) };
   |                           ^^^^^ `start` captured here
This now sends us down a chain of corrections. OK, it's a closure, so we need to declare it as such. To declare a closure, you need to use a Rust trait, which, as I understand it, is similar to an interface in Java. But you can't just declare the closure type, as you can with a function, because each closure is different and requires the compiler to generate a different enclosing type (in this case PageAllocator) for each actual closure instance. So, we need to provide the type with a type attribute, and then express the constraint that it is a closure trait.
struct PageAllocator<F> where F: Fn() -> (usize,usize) {
    ...
    init: F
}
So far, so good. The problem is that we need to add a similar attribute everywhere we use PageAllocator.

In the actual initialization of the global_allocator:
    static HEAP: PageAllocator<fn()->(usize,usize)> = PageAllocator {
        ...
    };
In the test initialization:
    fn simple_allocator() -> (usize, PageAllocator<Fn() -> (usize,usize)>) {
        ...
    }
And then in each of the impl blocks for PageAllocator. Interestingly, in these cases, we need to declare the parameter all over again - it isn't automatically inherited from the struct declaration. I'm sure there are reasons for this, but I don't understand them.
impl<F> PageAllocator<F> where F: Fn() -> (usize,usize) {
    ...
}

unsafe impl<F> Sync for PageAllocator<F> where F: Fn() -> (usize,usize) {}

unsafe impl<F> GlobalAlloc for PageAllocator<F> where F: Fn() -> (usize,usize) {
    ...
}
OK, now does it work? Well, sadly not. The closure type cannot be used directly in the test initialization because the compiler cannot figure out what size it is. I think what it means is that it cannot figure out how to generate the code for the PageAllocator to store or invoke the closure, because I can't see how it doesn't know it when generating the code to call the constructor.
error[E0782]: trait objects must include the `dyn` keyword
  --> src/allocator/tests.rs:16:52
   |
16 |     fn simple_allocator() -> (usize, PageAllocator<Fn() -> (usize,usize)>) {
   |                                                    ^^^^^^^^^^^^^^^^^^^^^
   |
help: add `dyn` keyword before this trait
   |
16 |     fn simple_allocator() -> (usize, PageAllocator<dyn Fn() -> (usize,usize)>) {
   |                                                    +++
OK, so let's try adding a dyn as shown. This leads to another message:
error[E0277]: the size for values of type `(dyn Fn() -> (usize, usize) + 'static)` cannot be known at compilation time
  --> src/allocator/tests.rs:16:30
   |
16 |     fn simple_allocator() -> (usize, PageAllocator<dyn Fn() -> (usize,usize)>) {
   |                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ doesn't have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `(dyn Fn() -> (usize, usize) + 'static)`
note: required by a bound in `PageAllocator`
  --> src/allocator.rs:7:22
   |
7  | struct PageAllocator<F> where F: Fn() -> (usize,usize) {
   |                      ^ required by this bound in `PageAllocator`
help: you could relax the implicit `Sized` bound on `F` if it were used through indirection like `&F` or `Box<F>`
  --> src/allocator.rs:7:22
   |
7  | struct PageAllocator<F> where F: Fn() -> (usize,usize) {
   |                      ^ this could be changed to `F: ?Sized`...
...
13 |     init: F
   |           - ...if indirection were used here: `Box<F>`
All of these suggestions can be followed, and I did try a couple of them, but the Box seemed the simplest to get working:
    fn simple_allocator() -> (usize, PageAllocator<Box<dyn Fn() -> (usize,usize)>>) {
        unsafe {
            ...
            let f = || { (start, start+4096) };
            (start, PageAllocator{
                ...
                init: Box::new(f)
            })
        }
    }
So, now it works, right? Almost, but not quite.

Quite understandably, the compiler thinks that start will go out of scope before it is used in the closure:
error[E0373]: closure may outlive the current function, but it borrows `start`, which is owned by the current function
  --> src/allocator/tests.rs:20:21
   |
20 |             let f = || { (start, start+4096) };
   |                     ^^    ----- `start` is borrowed here
   |                     |
   |                     may outlive borrowed value `start`
   |
note: closure is returned here
  --> src/allocator/tests.rs:21:13
   |
21 | /             (start, PageAllocator{
22 | |                 next_page: UnsafeCell::new(0),
23 | |                 end: UnsafeCell::new(0),
24 | |                 free_16: UnsafeCell::new(0),
...  |
27 | |                 init: Box::new(f)
28 | |             })
   | |_______^
help: to force the closure to take ownership of `start` (and any other referenced variables), use the `move` keyword
   |
20 |             let f = move || { (start, start+4096) };
   |                     ++++
But at least the suggested solution is quite simple: add the move keyword to the definition of the closure.

And, yes, now everything works. And the test passes while the code keeps on running.

That's all checked in as RUST_BASE_METAL_PAGE_ALLOCATOR_FIRST_TEST. Before checking it in, I did a couple of quick refactorings to put all the non-test code in its own module, all neatly wrapped up in a single #[not(cfg(test))].

Tidying up

OK, I think everything from here on is straightforward TDD of a fairly normal piece of code - the fact that it will be used as a memory allocator can be safely ignored. Some refactoring happened as well, and I've replaced all the static memory allocation with dynamic allocation. I don't think any interesting issues came up.

The final version is checked in as RUST_BARE_METAL_PAGE_ALLOCATOR_COMPLETE.

Conclusion

We have successfully managed to incorporate memory management into our Homer example. That sorts out most of the issues we have right now and just leaves the fact that the code is a mess, but I'm not going to fix that in the Homer project. But before I start a whole new project, there are a couple of things I still want to try and understand how to do in the bare-metal context.

Friday, January 5, 2024

Unit Tests in Rust


Looking at that list of things to do, one seems obviously simpler than the rest and probably more useful in the short to medium term: getting the unit tests to run.

In fact, it's ridiculously easy. I'm not exactly sure where I copied this from now, but the problem is either I chose the wrong place or didn't copy it very well. The line I have which says
#[cfg(tests)]
is plural when in fact is should be singular
#[cfg(test)]
Yup, that's it. The tests now run. I'm not quite sure why Rust doesn't give a warning about this particular mistake. rust-analyzer says that the config "tests" is "disabled", but my thought had been that meant that there was a configuration "tests" which I had specifically - deliberately or accidentally - disabled, and I spent some time looking through my configuration settings before seeing enough examples on the web where it said cfg(test) that it finally sank in that I had a simple typo.

For those wondering, yes, my code under test passed first time. My test had another error in it, which is I had copied across this (extraneous) line as well which wouldn't compile:
    use alloc::collections::btree_map::Values;
And now I get this output:
   Compiling homer_rust v0.1.0 (/home/gareth/Projects/IgnoranceBlog/homer_rust)
    Finished test [unoptimized + debuginfo] target(s) in 0.32s
     Running unittests src/lib.rs (target/debug/deps/homer_rust-6186d6a57da04f90)

running 1 test
test tests::test_set_4_in_1_from_0 ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

   Doc-tests homer_rust

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
This is chattier than I would like, but Rust seems to be that way. Also, for those scoring at home, it's worth noting that if I had had this working at the time, I would have written four or five tests to ensure I wrote the correct code. After the fact, especially with my application now working, I'm not going to bother.

This is checked in as RUST_BARE_METAL_TEST_NOT_TESTS.

Fixing Blue Homer

In the sample code we are copying, when configuring the framebuffer device, one of the options refers to a choice of byte order in the color setting: it can be "RGB" or "BGR". RGB was requested, but the sample accepts that the answer may come back "BGR". I ignored this, in part because the emulator accepts RGB.

My best guess is that the real hardware doesn't - and so Homer is inverted. Let's test this theory.

The response is (now) processed in lfb_init. So let's intercept that and send the RGB/BGR bit down the UART.
    let pixorder = unsafe { read_volatile(volbuf.add(24)) };
    write("pixel order is ");
    write_8_chars(hex32(pixorder));
    write("\r\n");
And we see:
pixel order is 00000000
That's zero, which our comments tell us is BGR, not RGB. So we need to store that in our struct and use it when rendering. This adds a certain amount of complexity, but here are the highlights:

We declare an enum to make it clearer what the two cases are. For some reason, this needs an annotation declaring that it derives PartialEq, although presumably we could not do that and explicitly define it. I assume this just means that we can test that two values with BGR are the same, as are two RGBs. We then add this into the struct of information about the Framebuffer:
#[derive(PartialEq)]
enum PixelOrder {
    BGR,
    RGB
}

struct FrameBufferInfo {
    width : u32,
    height : u32,
    pitch: u32,
    base_addr: u32,
    pixorder: PixelOrder
}
And then we need to set this to the correct enum value based on the 0 or 1 return value from the mailbox call:
    fb.pixorder = if pixorder == 1 { PixelOrder::RGB } else { PixelOrder::BGR };
Note that, if (like me) you did not already know this, Rust eschews the ternary operator in favour of having functional-style if statements that can return values (as long as you don't put semicolons after the final expression in a block).

And, of course, when we come to draw Homer we need to take this into account:
            if (fb.pixorder == PixelOrder::RGB) {
                unsafe { *((ptr + x*4 + 0) as *mut u8) = homer[homer_index + 0]; }
                unsafe { *((ptr + x*4 + 1) as *mut u8) = homer[homer_index + 1]; }
                unsafe { *((ptr + x*4 + 2) as *mut u8) = homer[homer_index + 2]; }
                unsafe { *((ptr + x*4 + 3) as *mut u8) = homer[homer_index + 3]; }
            } else {
                unsafe { *((ptr + x*4 + 2) as *mut u8) = homer[homer_index + 0]; }
                unsafe { *((ptr + x*4 + 1) as *mut u8) = homer[homer_index + 1]; }
                unsafe { *((ptr + x*4 + 0) as *mut u8) = homer[homer_index + 2]; }
                unsafe { *((ptr + x*4 + 3) as *mut u8) = homer[homer_index + 3]; }
            }
(It's a very small change, but the offsets in the indices on the left hand side in the second block go 2,1,0,3).

OK, let's try that. Yup, Homer is now the right colour. Let's check this in before it stops working! It's tagged as RUST_BARE_METAL_HOMER_PI.

And now that I have everything fairly stable, I'm going to finish the task I'd set myself and clean up all the dead code, removing unnecessary blocks of tracing and associated functions. The final version is tagged RUST_BARE_METAL_CLEANED_HOMER.

Conclusion

It's been painful (for me, at least) but we are now at the point where we can do a number of things on the bare metal of a Pi working almost entirely in Rust. But it's one big mess.

I think we've gone as far as Homer can really take us, but I do want to read about - and experiment with - some other concepts in this space, and improving Homer seems like the most sensible way forward.

As a list as much to myself as anything, I want to:
  • figure out how to get most of the standard library in, without pulling in dependencies on Linux;
  • after that, figure out how to get memory management to work so I can allocate blocks of memory;
  • using that, be able to allocate memory blocks of arbitrary alignment;
  • figure out how to get unit tests to run in this environment
  • look into making the code more modular.
While doing all this, I hope to gain a better understanding of what "idiomatic" Rust looks like and then, within that, to find my own "voice" At the moment, I suspect I am essentially trying to write Java code in Rust, which is slightly odd because I feel Rust itself is closer to my natural intuition of how to code than Java is.

When I have done all that, I think I want to move on and try and build a proper console on the monitor which can display the messages that would otherwise go to the UART (although it should be configurable to do both).

Wednesday, January 3, 2024

UART and Real Hardware


One thing I have been trying to avoid is to go down the road of connecting a serial cable to the Pi and sending signals to a USB port on a real computer. It just seems too hard (I'm a software guy, not a hardware guy). However, at this point I need to admit I misjudged how hard it is to get the HDMI console working, so I'm backing off and trying something else.

I ordered an FTDI Chip 1m USB to UART Cable in Black from Radio Spares (RS) in the UK.

Just wiring the cable up reminds me of why I hate hardware so much. On the PC end, of course, you just shove the USB connector into your laptop and your done. On the Pi end, however ...

First off, the cable comes with three wires which we will call black, orange and yellow. One of the things about serial connections that confuses me is the crossover: you connect "Rx" to "Tx" and "Tx" to "Rx" and then you have to remember which "Rx" and "Tx" you are thinking about. The Pi has a set of GPIO pins that (if you have the case the same way around I do) run down the left hand side of the box from the front to near the back. The pins are confusingly numbered twice: once by their physical location and once by their logical location. For now (we will come back to them), I'm going to ignore the GPIO numbers and just go with the physical location: the pins are numbered starting at "1" from the front, and each pair of pins has the lower, odd, pin on the right and the higher, even, pin on the left. Thus the first row has "2" and "1", the second "4" and "3", the third "6" and "5" and so on. We want to wire up the "black" lead to pin "6", the "yellow" lead to pin "8" and the "orange" lead to pin "10". That's the third, fourth and fifth pins from the front on the extreme left hand side of the box.

Doing this is not made easier by the fact that (for my cable at least), the orange and yellow leads "wanted" to be the other way around.

That's the clearest description I've been able to come up with - the one which would have helped me to get it set up before I started. The link to the website above gives the more technical description of how the cable itself is wired. This link explains in detail how the Pi is connected together.

Configuring the UART

Now that we have physically connected the two ends together, we need to set up the software on both ends. You need some terminal software to run on the PC end of things. I'm using Linux and after looking at what other people had done, decided to use picocom as a terminal. It's fairly simple to install and use:
$ sudo apt-get install picocom
$ sudo picocom --baud 115200 /dev/ttyUSB0
picocom v3.1

port is        : /dev/ttyUSB0
flowcontrol    : none
baudrate is    : 115200
parity is      : none
databits are   : 8
stopbits are   : 1
It comes back to you with all the settings it uses. Apart from the baud rate (which we specified), it is using no parity, no flow control, 8 data bits and 1 stop bit. I'm fairly optimistic that these are the settings that the code I'm copying from also used (to check, I installed the original code onto the SD card to try this and it worked as a teletype echo, so I know I have done everything correctly).

So now I need to go back and port that code to set up the port correctly.

Here is the original C code:
void uart_init()
{
    register unsigned int r;

    /* initialize UART */
    *UART0_CR = 0;         // turn off UART0

    /* set up clock for consistent divisor values */
    mbox[0] = 9*4;
    mbox[1] = MBOX_REQUEST;
    mbox[2] = MBOX_TAG_SETCLKRATE; // set clock rate
    mbox[3] = 12;
    mbox[4] = 8;
    mbox[5] = 2;           // UART clock
    mbox[6] = 4000000;     // 4Mhz
    mbox[7] = 0;           // clear turbo
    mbox[8] = MBOX_TAG_LAST;
    mbox_call(MBOX_CH_PROP);

    /* map UART0 to GPIO pins */
    r=*GPFSEL1;
    r&=~((7<<12)|(7<<15)); // gpio14, gpio15
    r|=(4<<12)|(4<<15);    // alt0
    *GPFSEL1 = r;
    *GPPUD = 0;            // enable pins 14 and 15
    wait_cycles(150);
    *GPPUDCLK0 = (1<<14)|(1<<15);
    wait_cycles(150);
    *GPPUDCLK0 = 0;        // flush GPIO setup

    *UART0_ICR = 0x7FF;    // clear interrupts
    *UART0_IBRD = 2;       // 115200 baud
    *UART0_FBRD = 0xB;
    *UART0_LCRH = 0x7<<4;  // 8n1, enable FIFOs
    *UART0_CR = 0x301;     // enable Tx, Rx, UART
}
All of this looks pretty hairy and none of it is completely transparent. (This, of course, is why I have avoided having anything to do with it - it feels as complicated as getting the main display to work.)

But let's take it slowly and see what we can get to work in our own Rust code.

First off, let's add a call to uart_init in kernel_main:
#[no_mangle]
pub extern fn kernel_main() {
    avoid_emulator_segv();
    uart_init();
In order to port this code, I've decided to do a limited amount of refactoring and cleaning up of the code. For example, we are going to reuse mbox_send to set the clock, so I've moved the code that was (wrongly) in there to check the response about the video buffer out to the lfb_init method. I've also bundled up the piece of code that is responsible for ensuring that the emulator doesn't SEGV into its own method (avoid_emulator_segv) and called that up front.

So what does the rest of this code do? Well, the first line claims (presumably correctly) to disable the UART by writing 0 to the UART0_CR. We can do that too:
const UART_CR: u32 = 0x3F201030;
...
fn uart_init() {
    // Turn off UART0 while we configure it
    mmio_write(UART_CR, 0);
The next block of code sets the UART clock. Setting the clocks in this way is described in the section of the wiki that deals with tagged mailbox messages.
    // Now, set the UART clock (yes, the Raspberry Pi seems 
    // to have about 10 separate clocks) to 4MHz.
    let mut buf: [u32;36] = [0; 36];

    buf[0] = 9 * 4; // this message has 9 4-byte words
    buf[1] = 0;
    buf[2] = 0x38002; // set one of the clock rates
    buf[3] = 12; // request has three words of data
    buf[4] = 0;  // space for response length, but is zero for request
    buf[5] = 2;  // 2 selects the "UART" clock
    buf[6] = 4000000; // set it to 4MHz
    buf[7] = 0;  // avoid setting "turbo" mode
    buf[8] = 0;

    let mut msg = Message { buf: buf };
    mbox_send(8, &mut msg.buf);
Note that this reuses the (refactored) mbox_send that we used previously to configure the display.

Now, to get to the rest of it, we need to understand the GPIO configuration registers. In the ARM peripherals guide, chapter 6 (p89) describes the GPIO pins and the following page (p90) has a table with all the registers and their alleged addresses (again, these are in the right order, but with the wrong offset, for the actual Raspberry Pi boards). This is somewhat confusing, in part because the first row is duplicated.

Then Table 6-1 explains how the registers are used. The first five registers are used to control the meaning of the 54 GPIO pins (forty in the strip down the side of the board, the other fourteen in the header at the front). For each pin, three bits are used, giving eight possible options for the pin. 000 means this pin is used as an input, 001 means this pin is used as an output, and the other six options identify special-purpose "alternative" functions. These alternative functions are specified in Table 6-31 in Section 6.2 on pp102-103.

Remember that earlier I said that the pins were numbered twice, once for their physical location and once for their "logical" location? Well, this time we use the logical location. Looking at the pinout for the GPIO header you can see that pin 8 is described as GPIO 14 (RXD) and pin 10 is described as GPIO 15 (TXD). Comparing this to Table 6-31, you can see that in the rows GPIO14 and GPIO15 the alternate functions in column "ALT0" are "TXD0" and "RXD0" respectively.

So what we need to do is to set the relevant bits of the second select register (GPFSEL1) to choose ALT0 without damaging any of the other bits. This is what the code does. My question is whether, in porting it, we can make it a little clearer? And whether this is the right time to introduce some unit tests?

I wrote this test:
#[cfg(tests)]
mod tests {
    use alloc::collections::btree_map::Values;

    use super::*;

    #[test]
    fn test_set_4_in_1_from_0() {
        let mut val = 0;
        gpf_select(&mut val, 1, ALT0);
        assert_eq!(val, 0b100000);
    }
}
and tried to run it using cargo test but it comes back and says 0 tests found. For now, I think I'm going to put that on my list of things that aren't working in my environment and that I need to get working and for now assume that I can write this function without help. So I end up with these three functions:
fn gpfsel_read(reg: u32) -> u32 {
    let addr = PERIPHERAL_BASE + GPIO_BASE + (reg*4);
    mmio_read(addr)
}

fn gpf_select(flags: &mut u32, pos: u32, fun: u32) {
    let lsb = pos * 3;
    *flags = *flags & !(7 << lsb); // clear these bits
    *flags = *flags | (fun << lsb);  // set these bits
}

fn gpfsel_write(reg: u32, value: u32) {
    let addr = PERIPHERAL_BASE + GPIO_BASE + (reg*4);
    mmio_write(addr, value);
}
and I can wire them up as follows inside init_uart:
    let mut fs1 = gpfsel_read(1);
    gpf_select(&mut fs1, 4, ALT0);
    gpf_select(&mut fs1, 5, ALT0);
    gpfsel_write(1, fs1);
The next part of the code seems something between arcane and bizarre, but definitely matches the description given on p101 of the Broadcom peripherals guide. It appears that the above code sets the values we want in memory, but does not propagate our choices to the hardware. To achieve that, we need to go through a cycle of telling the chip to make the changes.

In both places, the "magic number" of 150 cycles is specified as being the amount of time that is needed for the change to take effect. I have to imagine that this means "at least" 150 cycles because, apart from anything else, you can't really be sure that any code you write will not be subject to interrupts. And the code that I am copying - unless it is unrolled - would seem to me to use 150 cycles in executing the nop operation, along with at least twice as many in handling the control loop. So I am going to assume that as long as we wait for at least 150 cycles, we will be fine.

It has to be said that while I think I understand what this code is trying to achieve, I don't understand why it works the way it does, and, specifically, I don't understand what is meant by "pull-up" and "pull-down" and why that has anything to do with selecting the function associated with the GPIO pins.

So what this description says (and that the code seems to say) is:
  • Write 0 to the register GPPUD to remove the current pull-up/down setting;
  • Wait for the system to recognize the change;
  • Write a word with bits 14 & 15 set to PUDCLK0;
  • Wait for the system to process the change;
  • Clean up by removing the GPPUD and PUDCLK - in our case, we don't need to clean up GPPUD, so we just need to write PUDCLK.
    mmio_write(GPPUD, 0);
    wait_a_while(150);
    mmio_write(GPPUDCLK0, (1<<14) | (1<<15));
    wait_a_while(150);
    mmio_write(GPPUDCLK0, 0);
And, finally, we need to configure the UART itself. The documentation for the UART starts on p175, and the section on the registers begins on p177.

I am somewhat confused by the first line of C code, which is supposed to clear the interrupts, because it seems to contradict what the documentation says on how the register is to be used. The C code writes a 1 into each of the 10 well-defined bits of the ICR, but it seems to me that the definition in the documentation expects that 0s will be written to clear the bits.

Given that the code works, and I don't have a great deal of faith in any of the documentation, I am going to copy the code rather than the documentation but I'm not sure I will be able to tell if this is "correct" or "just happens to work".
    mmio_write(UART_ICR, 0x7ff);
Now onto setting the baud rate. We want to set the rate to 115,200 baud based on a clock speed of 4MHz: to set this, we need to provide the "divisor": basically a "wavelength". This is very poorly explained in the Broadcom documentation, so I searched the web for other documentation and found this which may in fact be a good source of documentation in general.

It's fairly obvious that the integer part needs to be an unsigned integer between 1 and 65535, while the fraction part is more obscure. It turns out that the value of the fractional part is a numerator where the denominator is always 64, so FBRD is the number of 64ths. In trying to repeat the calculation that I am copying, the clock speed of 4,000,000 is divided by the baud rate of 115,200 giving a divisor of 3.7222...; according to the documentation, this needs to be further divided by 16 which gives me 2.170138...; the integer portion is clearly 2, and the fractional portion is just under 11/64. So I want to write 2 and 11 into the IBRD and FBRD registers respectively.

As luck would have it, those are exactly the numbers in the code I am copying, but I am going to write them in decimal rather than hex for stylistic reasons: I think hex tends to suggest something which derives from bitwise operations.
    mmio_write(UART_IBRD, 2);
    mmio_write(UART_FBRD, 11);
OK, we're nearly there now. The Line Control Register sets the rest of the transmission parameters, and has eight significant bits. We want to turn most of these bits off, but we want to select 8 bit transmission, which involves setting bits 5 and 6, and enable the FIFO mode (so that we can transmit a buffer in one go and then have the UART do all the hard work), which is in bit 4. So we want to set the register to 0x70.
    mmio_write(UART_LCRH, 0x70);
And finally, we re-enable the UART. The control register is described on pp185-187, and the relevant bits we need to set are 0, 8 and 9, which has the hex mask 0x301.
    mmio_write(UART_CR, 0x301); // Enable UART with Rx and Tx
And that all works in the emulator. But on the real hardware, not so much.

This is kind of what I was afraid of. In order to be able to debug things, I need to be able to have some means of knowing what is going on. My first approach was to think that I could write to the console. My second approach was to say, well, if I can't do that, can I at least write to the UART? Apparently, I can't do that either.

What's really interesting is that when I try and simplify this by eliminating some of the code, it still doesn't work. Even if I comment out the whole of the body of uart_init, the Pi starts up and does nothing. If I comment out the calls to avoid_emulator_segv and uart_init, it goes back to almost working (showing Homer in the wrong colors). What seems obvious to me is that something, probably eithr memory related or timing related, is going wrong in a way that is affected by the size of the kernel_main function. But I cannot guess what that could be, nor why it would happen on real hardware and not on the emulator. Interestingly, it does seem to be consistent: if I make one random change and then undo it, it goes back to the behaviour that it had before.

Having spent some time thinking about this away from the computer, I'm becoming increasingly convinced that it must be a timing thing and that it is something I did not copy correctly from the sample program.

Looking back at the C version of the program, I found these lines and the end of mbox_call, which has become send_mbox in my code:
    while(1) {
        /* is there a response? */
        do{asm volatile("nop");}while(*MBOX_STATUS & MBOX_EMPTY);
        /* is it a response to our message? */
        if(r == *MBOX_READ)
            /* is it a valid successful response? */
            return mbox[1]==MBOX_RESPONSE;
    }
What this is doing is checking that the data available to read in the message box is actually the answer to our message (rather than some other message). I suspect that because we are not doing this, it seems we have a response to the second mailbox message, but it's the response to the previous message (setting up the UART). So I'm now going to copy this across.
    loop {
        let rb = mmio_read(MBOX_READ);
        if rb == addr {
            break;
        }
    }
Now, this is back to working and I am getting messages coming across on the console. Taffing with this, it appears to continue working, which is the most important thing to me. On the other hand, Homer, is still blue in the face but I think I know why.

This is all checked in as RUST_BARE_METAL_INIT_UART.

Conclusion

With a few twists and turns, we managed to successfully wire up and set up the UART to communicate with the host PC. In doing this, we now have access to more standard tracing. I'm still a little nervous about how stable everything is, but I'm increasingly convincing myself that the problems are all me doing things wrong and not random instability.

Let's stop Homer being so blue.

Figuring out the Issues

So we find ourselves in a situation where the program runs in the emulator sometimes and not at other times. Why not?

Annoyingly, the fact that it works when we have tracing enabled, and not when it doesn't, means that we can't use tracing to find the problem.

But fortunately, we have determined that we can use the debugger. So let's try that again.

Continuing with our theme of optimization, let's add a new script, debug.sh, which contains the relevant commands to debug in the emulator. It will be recalled that this requires the debugger to start with "-s -S", and we can specify that to make with the GDB option as follows:
make "GDB=-s -S" run
We then need (again in a separate tab or window) to run the debugger, which we will put in a script gdb.sh:
gdb-multiarch -iex "file asm/kernel8.elf" -iex "target remote :1234"
where -iex is an option which specfies that the argument should be run as a command once gdb has started.

Sadly, when we try this, we are reminded that we switched to a release build to avoid some compilation/linking errors:
Reading symbols from asm/kernel8.elf...
(No debugging symbols found in asm/kernel8.elf)
Remote debugging using :1234
0x0000000000000000 in ?? ()
(gdb) 
So, we need to go back to trying to make a debug build work. A few judicious edits to our scripts, and debug.sh can now build and link a debug version of our executable. When we do this, it shows the following errors:
aarch64-linux-gnu-ld -nostdlib boot.o ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib -T linker.ld -o kernel8.elf
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomerrust.rlib(homerrust-c9f0f32593953886.3qdmjppfu6iwhc43.rcgu.o): in function `homerrust::lfbinit':
/home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:133: undefined reference to `memset'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:136: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:195: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomerrust.rlib(homerrust-c9f0f32593953886.3qdmjppfu6iwhc43.rcgu.o): in function `homerrust::showhomer':
/home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:213: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:214: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:222: undefined reference to `core::panicking::panic'
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomerrust.rlib(homerrust-c9f0f32593953886.3qdmjppfu6iwhc43.rcgu.o):/home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:222: more undefined references to `core::panicking::panic' follow
Boiled down, this comes to two things:
  • memset, which is defined in the C standard library, is undefined.
  • core::panicking::panic is undefined.
The latter is actually quite easy to fix given that we have already fixed the array bounds panic: we simply find the relevant (mangled) symbol and define it in boot.S.
.globl _ZN4core9panicking5panic17h8f06a2df29fa4962E
_ZN4core9panicking5panic17h8f06a2df29fa4962E:
    b halt
memset is slightly trickier, since we actually need to implement it. But it's not that difficult an implementation, providing we do the right magic to make it seem to be a C function, not a Rust function:
#[no_mangle]
pub extern fn memset(mut buf: *mut u8, val: u8, cnt: usize) {
    let mut i=0;
    while i<cnt {
        unsafe {
            *buf = val;
            buf = buf.add(1);
        }
        i+=1;
    }
}
Now we can use gdb to set a breakpoint. Let's put one in at lfb_init and another at mbox_send:
Reading symbols from asm/kernel8.elf...
Remote debugging using :1234
0x0000000000000000 in ?? ()
(gdb) b lfb_init 
Breakpoint 1 at 0x80360: file src/lib.rs, line 145.
(gdb) b mbox_send 
Breakpoint 2 at 0x80e48: file src/lib.rs, line 249.
(gdb) c
Continuing.

Thread 1 hit Breakpoint 1, homer_rust::lfb_init (fb=0x7ffd8) at src/lib.rs:145
145            let mut buf: [u32;36] = [0; 36];
(gdb) n
148            buf[0] = 35 * 4; // the buffer has 35 4-byte words
(gdb) n
149            buf[1] = 0; // we indicate we are sending a MBOX_REQUEST as 0
(gdb) n
154            buf[2] = 0x48003;
(gdb) n
155            buf[3] = 8; // the number of bytes in the request value
(gdb) p/x buf[2]
$1 = 0x48003
So far, so good. We are going through the code and we seem to be correctly setting up the buffer. Let's carry on and see how mbox_send gets on.
(gdb) c
Continuing.
Thread 1 hit Breakpoint 2, homer_rust::mbox_send (ch=8, buf=0x7fedc) at src/lib.rs:249
249            while mmio_read(MBOX_STATUS) & MBOX_BUSY != 0 {
(gdb) n
253            let volbuf = buf as *const u32;
(gdb) n
256            let ptr:u32 = volbuf as u32;
(gdb) n
259            let addr = (ptr & !0x0F) | ((ch as u32) & 0x0f);
(gdb) p/x ptr
$2 = 0x7fedc
Wait a minute! This code is predicated on ptr being aligned to a 16-byte boundary, but it clearly isn't. Let's just go one line further and check what happens:
(gdb) n
262            mmio_write(MBOX_WRITE, addr);
(gdb) p/x addr
$3 = 0x7fed8
(gdb) 
Yeah, that's not going to end up too well. It's going to try and read and write a buffer 12 bytes before we've set it up. What exactly happens will obviously depend on what it sees, but it's not going to be what we wanted (I guess we already knew that).

OK, so the debugger helped us out there to see what the problem is, but why? And how do we fix it?

It would seem that the C compiler has different alignment rules to the Rust compiler. I'm not entirely sure about what either set are, and I'm too lazy to check, but I suspect that in C, arrays that have a size which is a multiple of 16 bytes are aligned to 16 byte boundaries (which is why the array is declared as 36 words when we only use 35). In Rust, this obviously does not happen.

We therefore need to manually force the alignment. I considered a number of ways of doing this. One is to allocate an array too big, and then to figure out an offset into the array which is aligned; another is to "do the job properly" and introduce memory allocation and have a method which returns a value with arbitrary alignment; and the third is to see what compatibility features Rust has. I decided that the last was probably the best compromise although, as you'll see, I fully accept I could be wrong.

Rust offers an attribte #[repr(align(16))] which says it aligns things to the appropriate boundary. This seems to be just what we need. However, applying it to the array gives an error:
error[E0517]: attribute should be applied to a struct, enum, function, associated function, or union
   --> src/lib.rs:145:12
    |
145 |     #[repr(align(16))]
    |            ^^^^^^^^^
146 |     let mut buf: [u32;36] = [0; 36];
    |     -------------------------------- not a struct, enum, function, associated function, or union
It doesn't come much clearer than that. You can only align some things, and arrays are not one of them. So let's package the array in a struct and align that.
#[repr(align(16))]
struct Message {
    pub buf: [u32; 36]
}
We can then create our buffer and create a Message from that, and then use that (aligned) buffer to send to the mailbox:
    let mut msg = Message { buf: buf };
    mbox_send(8, &mut msg.buf);

    let volbuf: *mut u32 = &mut msg.buf as *mut u32;
Sadly, compiling this throws up another linker error:
aarch64-linux-gnu-ld: ../target/aarch64-unknown-linux-gnu/debug/libhomer_rust.rlib(homer_rust-c9f0f32593953886.3qdmjppfu6iwhc43.rcgu.o): in function `homer_rust::lfb_init':
/home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:216: undefined reference to `memcpy'
aarch64-linux-gnu-ld: /home/gareth/Projects/IgnoranceBlog/homer_rust/src/lib.rs:216: undefined reference to `memcpy'
But fortunately, we are starting to get good at this:
#[no_mangle]
pub extern fn memcpy(mut dest: *mut u8, mut src: *const u8, cnt: usize) {
    let mut i=0;
    while i<cnt {
        unsafe {
            *dest = *src;
            dest = dest.add(1);
            src = src.add(1);
        }
        i+=1;
    }
}
And, now, it works! Great. Let's check that in before it stops working again. It's tagged as RUST_BARE_METAL_FIXED_EMULATOR

Conclusion

So, this is getting messier and messier, but at least it now works in the emulator somewhat reliably - with or without tracing turned on.

We learnt that alignment is different between C and Rust, and that we can (somewhat) hack that by using a struct with the retr attribute in Rust.

This also (coincidentally) seems to fix one of our problems on the physical box: Homer now appears. Sadly, he is once again the wrong color (blue this time). Looks like we still have work to do.