Monday, March 24, 2025

A Very Simple Compiler


In this episode, we're going to write a compiler.

Now, when I was younger, I thought that writing compilers was the very pinnacle of software development. That if I could do that, I could do anything. After I had mastered the art of writing compilers, I then went and worked on real-time software with race conditions and realised that writing compilers was actually quite easy.

It's even easier if you leave out all of the hard parts such as type checking, helpful error messages, error cascade detection and so on. So if you're scared by the assertion that we are going to write a compiler in this blog post, don't be. On the other hand, if you're hoping to learn all the intricacies of doing so, that won't happen either.

If you are completely unfamiliar with compilers, they operate in "phases", with each phase feeding its output into the next. While a typical compiler may have a dozen phases, we are just going to implement the two most important. Parsing is the process by which we understand the human-readable text of the source code and convert it into an internal data structure. Code Generation is then the process by which we turn an internal data structure into something that can be executed (in our case a JSON file which can be read in the browser). Obviously, since we only have these two phases, the "internal data structure" must be the same in both cases.
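If that sounds abstract, here is a toy two-phase "compiler" in a few lines of Go. To be clear, this is nothing to do with till itself, just an illustration of the shape: parsing turns text into a data structure, and code generation turns that same data structure into JSON.

```go
package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// The one internal data structure shared by both phases.
type Command struct {
    Verb string   `json:"verb"`
    Args []string `json:"args"`
}

// Phase 1, parsing: human-readable text in, data structure out.
func parse(line string) Command {
    parts := strings.Fields(line)
    return Command{Verb: parts[0], Args: parts[1:]}
}

// Phase 2, code generation: data structure in, executable form (JSON) out.
func generate(cmd Command) []byte {
    bs, _ := json.Marshal(cmd)
    return bs
}

func main() {
    fmt.Println(string(generate(parse("enable Coffee Tea"))))
    // {"verb":"enable","args":["Coffee","Tea"]}
}
```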

A Toy Language

And if you want to make it really easy, you write a compiler for a toy language. I spent some time thinking about what the simplest language I could come up with was, and while I didn't quite get there, I think I've come up with something fairly simple. For my purposes, the key feature it has to have is that it mustn't look like JavaScript, and it mustn't be possible to just say "ah, use a source map". So:
  • Its source code mustn't map directly onto underlying JavaScript;
  • The runtime data layout must not be at all obvious from the debugger;
  • Ideally, the function call stack isn't either.
In other words, I want a language where you would be almost hopelessly lost trying to use the standard Chrome JavaScript debugger to try and debug programs. If that seems like an odd specification for a language, bear in mind that I'm trying to motivate building a whole new debugger plugin for Chrome, and that I already have a non-trivial language with these characteristics (I know, because I'm struggling to debug it with the Chrome debugger).

So the concept I ended up with was a language to program the kinds of tills that you see these days in coffee shops. Now, when I last worked in such a place (Burger King, 1986), these things were all pretty much "manual". I'm sure there was something that mapped the buttons to the orders, but it wasn't on an iPad.

My vision, then, is of something with (say) five rows and four columns - 20 buttons. Each of those buttons has some label on it. Some of them are "nouns" - items you can order like tea or coffee. Others are "adjectives" - things that describe how you would like it, such as "strong" or "milk". And then there are a couple of "verbs" - "item is complete" and "order is complete".

As you push one or more of the buttons, it changes which other buttons are available. For instance, once you push "coffee", you can't push "tea" anymore. Until you have selected one "noun", neither "NEXT" nor "DONE" is available. How does it know? Because we have written a program, of course. In a programming language, of course. A very simple, toy programming language that I have, imaginatively, named till.

So, before I start writing a compiler, I'm going to present a sample of what I naively think is going to be a working till program. I've checked this in, and we will see whether it actually ends up looking anything like this when I'm done.
A layout enables us to say where the buttons go on the screen.

In theory, we could have any number of layouts, but for ease, I'm only going to support one for now,
and thus the name is irrelevant.  If we did support multiple layouts, then the idea would be
that there would be a verb "show" that switched to a different layout.  I may find that adding
more introduces an essential complexity to the language that makes the debugger more interesting.

    layout main
        row1 <- Coffee Tea Steamer Water
        row2 <- Strong Black
        row3 <- Dairy blank Oat Almond
        row4 <- blank blank blank blank
        row5 <- NEXT blank blank DONE

All good programming languages need an entry point and this one is no different.

The entry point is always called init, and, as shown here, does not have any arguments.
All routines are the same in that they have a header at one level of indentation and then
nested operations.  We'll come back to what the operations mean later.

    init
        enable all
        disable NEXT
        disable DONE
        milks <- Oat Dairy Almond
        drinks <- Tea Coffee Steamer Water

Each button on the screen has a name which is both an identifier (used elsewhere in the program)
and the label that appears on the button.  For simplicity, spaces are not allowed.  When pressed,
the operations listed "inside" it are carried out.

    button Coffee
        style noun
        noun <- Coffee
        disable all
        enable Strong Black milks
        enable NEXT

    button Tea
        noun <- Tea
        disable all
        enable Black milks
        enable NEXT

    button Water
        disable all
        enable NEXT

    button Oat
        style adjective
        adjs <- Oat
        disable milks Water

    button Dairy
        adjs <- Dairy
        disable milks Water

    button Almond
        adjs <- Almond
        disable milks Water

    button NEXT
        order <- noun adjs
        clear noun adjs
        enable all
        disable NEXT

    button DONE
        submit order
        clear order
        enable all
        disable NEXT
        disable DONE

CDP_SAMPLE:cdp-till/samples/cafe.till

I have no idea how intuitive this is to you right now, or even if it looks like a program to you. Certainly don't sweat the details; we should come back to all of that before we're done. The key point is that this language has been designed to be very, very easy to compile to a JSON intermediate form that we can then load into the browser and "run", just so we can get around to trying to debug it.

I suspect that if you are familiar with functional languages such as Miranda or Haskell, the style will be a lot more familiar than if you are more used to C, Java, Go or JavaScript. To be clear, while I have designed this to be event driven and in some sense "declarative", it is most certainly not a functional language.

It does follow the "semi-literate" style of Miranda's "literate scripts". That is, any line which is not indented is treated as commentary and no attempt is made to parse it or use it in the compiler. Indentation is the exclusive means by which nesting is defined (no braces or keywords are used), with the caveat that non-indented lines, because they are ignored, may appear anywhere.

We are going to support three "top-level" keywords: layout, init and button. layout takes an argument which we will not (currently) use. init does not take an argument. button takes an argument that acts both as an identifier (by which it is referenced elsewhere) and a label. For simplicity, everything is case-sensitive and spaces are not permitted.

Inside the layout, we simply indicate for each row which buttons should appear on it.

Inside the other two blocks, we have a series of actions. Each action is either of the form keyword args or of the form var <- args. There are a bunch of keywords such as style, enable, disable, clear and submit which are "built in" to the language definition. The variables are not built in and you can use them as you wish.

The "assignment" operator <- is not an assignment operator as such, but a "create-and-append" operator. All variables in Till are lists, and the <- operator appends the items on the right to the list on the left. If it does not already exist, an empty list is created. A variable which is referenced but does not exist has the "empty" value. If you want to remove things from a list, you use the clear keyword.

This probably all sounds more complicated than it really is.
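As a sketch of those semantics (this is just an illustration, not the till runtime), a Go map of string slices behaves almost exactly this way, because append on a missing key starts from an empty list:

```go
package main

import "fmt"

// appendTo models the <- operator: every till variable is a list, and
// appending to a variable that doesn't exist yet creates it as empty first.
func appendTo(vars map[string][]string, dest string, items []string) {
    // append on a nil slice allocates, so a missing key "just works"
    vars[dest] = append(vars[dest], items...)
}

func main() {
    vars := make(map[string][]string)
    appendTo(vars, "milks", []string{"Oat", "Dairy"})
    appendTo(vars, "milks", []string{"Almond"})
    fmt.Println(vars["milks"]) // [Oat Dairy Almond]

    // a variable which is referenced but never assigned has the "empty" value
    fmt.Println(len(vars["drinks"])) // 0

    // and clear just removes the list entirely
    delete(vars, "milks")
    fmt.Println(len(vars["milks"])) // 0
}
```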

A Parser

One time, when the team I was on was writing a parser for a templating language (like Velocity or Freemarker), a colleague asked me why we didn't use "tell-don't-ask" to write parsers. I explained that there were reasons, but we had a go anyway ... and failed.

Since then, I've practised doing TDA a lot more, and I've learnt how to do this. There are still reasons why it is tricky, mainly involving "global state", and I'm still not sure whether I like it or not, but the resulting parsers are quite testable at a "unit" level.

So here is the basic tell-don't-ask parser pipeline:
package parser

import "github.com/gmmapowell/ignorance/cdp-till/internal/compiler"

func Parse(repo compiler.Repository, srcdir string) {
    scope := NewGlobalScope(repo)
    lineLexicator := NewLineLexicator()
    blocker := NewBlocker(scope, lineLexicator)
    splitter := NewSplitter(blocker)
    scanner := NewScanner(splitter)
    scanner.ScanFilesInDirectory(srcdir)
}

CDP_TDA_PARSER:cdp-till/internal/parser/parser.go

In the same commit, I have checked in all the minimal code to make this compile. It doesn't run - for a start, we haven't attached this to any code anywhere yet! - but it is enough to explain how we expect the parser to operate. In a little while, we will come back to the code generation pipeline.

As with all TDA code, this reads "backwards", so to understand what it does, you have to read from the bottom upwards. The Scanner is (going to be) a class that looks at a directory on the file system and finds files with the .till extension. It provides the paths to each of these files in turn to the splitter. The Splitter has the easy task of splitting these files into lines and telling its client about files starting, ending and the line number and text content of each line. Here, we specify its client as a Blocker, which is responsible for tracking indentation, throwing away comment lines and then keeping a "stack" of indentations and indented scopes. It has a scope which represents the "top level" scope, within which all the others will be nested, and a lineLexicator which is responsible for breaking up each line into tokens. Internally, it will pass the resulting tokens to the current scope and that will generate new scopes and build up the parse tree. This will become clearer when we look at the Blocker in detail.

All of these units communicate using interfaces to make it easy to test them in isolation. I am not going to do much of this here for two reasons: the didactic one is that it is just more code to present; the other is that I am stealing a lot of this code from other places and modifying it - in those other places there are unit tests.

Now, let's start filling in the blanks. I'm sure some or all of the assumptions I've made in this checkin will change, but no worries...

The Basic OS Stuff

The scanner and the splitter are so boring, I'm tempted to not even mention them. But they are there, so here we go.

The scanner just looks through a directory on the disk, given a path name.

(For ease, speed and clarity I am avoiding most of the usual error handling and just ignoring errors or panicking as seems appropriate.)
package parser

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

type FileHandler interface {
    ProcessFile(file string)
}

type Scanner struct {
    handler FileHandler
}

func (s *Scanner) ScanFilesInDirectory(srcdir string) {
    s.scanFor(srcdir, ".till")
}

func (s *Scanner) scanFor(srcdir, suffix string) {
    files, err := os.ReadDir(srcdir)
    if err != nil {
        panic(fmt.Sprintf("could not read src directory %s: %v", srcdir, err))
    }
    for _, f := range files {
        if !f.IsDir() && strings.HasSuffix(f.Name(), suffix) {
            s.handler.ProcessFile(filepath.Join(srcdir, f.Name()))
        }
    }
}

func NewScanner(handler FileHandler) *Scanner {
    return &Scanner{handler: handler}
}

CDP_PARSER_OS:cdp-till/internal/parser/scanner.go

This is just a simple piece of code that uses os.ReadDir to find all the files in a directory, sees if they have the desired suffix and, if so, joins the file name to the directory path and hands all of that off to the file handler.
package parser

import (
    "bufio"
    "fmt"
    "os"
)

type LineHandler interface {
    BeginFile(file string)
    ProcessLine(lineNo int, line string)
    EndFile(file string)
}

type Splitter struct {
    handler LineHandler
}

func (s *Splitter) ProcessFile(file string) {
    stream, err := os.Open(file)
    if err != nil {
        panic(fmt.Sprintf("could not process %s: %v", file, err))
    }
    defer stream.Close()

    s.handler.BeginFile(file)
    scanner := bufio.NewScanner(stream)
    i := 1
    for scanner.Scan() {
        text := scanner.Text()
        s.handler.ProcessLine(i, text)
        i++
    }
    s.handler.EndFile(file)
}

func NewSplitter(handler LineHandler) *Splitter {
    return &Splitter{handler: handler}
}

CDP_PARSER_OS:cdp-till/internal/parser/splitter.go

The splitter, likewise, takes a file and between notifying the LineHandler of the start and end of the file, splits it into lines and sends each line - along with its number - to the LineHandler.

The Blocker

The Blocker implements LineHandler and is responsible for grouping the incoming lines by indentation. Lines which are blank or have no leading white space are discarded. All other lines must begin with one or more tabs, followed by a non-white-space character. Anything else is an error (and we will panic).

We start off with some interface definitions, which provide placeholders for us to "send requests":
package parser

import (
    "fmt"
    "path/filepath"
    "unicode"
)

type Scope interface {
    PresentTokens([]string) Scope
    Close()
}

type LineLexer interface {
    Lexicate(line string) []string
}

type Blocker struct {
    lineLexer LineLexer
    scopes    []Scope
}

CDP_BLOCKER:cdp-till/internal/parser/blocker.go

The idea of a Scope is that it is something which knows what can be parsed there, and can attach any nested lines to it. We will have a total of about four different scopes (global, layout, methods and "invalid") in this language. The idea is that we present the tokenized version of a line to its immediate parent scope, which interprets it and updates the repository structures, and then returns a new, nested scope. If nothing can be legally nested inside the current command, an "invalid" scope is returned which will just panic if PresentTokens is called.
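The InvalidScope itself doesn't appear in the excerpts here, but a minimal version only needs to panic on nesting and do nothing on Close. This sketch (with the Scope interface repeated so it stands alone) is my guess at what it looks like, not necessarily the checked-in code:

```go
package main

import "fmt"

// The Scope interface from the parser, repeated here so the sketch compiles.
type Scope interface {
    PresentTokens(tokens []string) Scope
    Close()
}

// InvalidScope is returned by commands that allow no nesting,
// so any attempt to nest a line inside them panics.
type InvalidScope struct {
}

func (s *InvalidScope) PresentTokens(tokens []string) Scope {
    panic(fmt.Sprintf("nothing may be nested here, but got %v", tokens))
}

func (s *InvalidScope) Close() {
    // closing an invalid scope is harmless - there is nothing to finish
}

func main() {
    var s Scope = &InvalidScope{}
    s.Close() // fine
    defer func() { fmt.Println("recovered:", recover() != nil) }()
    s.PresentTokens([]string{"enable", "all"})
}
```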

Because this implements the LineHandler interface, we next have the three methods for that.
func (b *Blocker) BeginFile(file string) {
    fmt.Printf("%s:\n", filepath.Base(file))
}

func (b *Blocker) ProcessLine(lineNo int, line string) {
    indent, remaining := figureIndent(lineNo, line)
    if indent == 0 {
        return
    }

    b.closeScopes(indent)
    if indent > len(b.scopes) {
        panic(fmt.Sprintf("double indent at line %d", lineNo))
    }

    tokens := b.lineLexer.Lexicate(remaining)
    inner := b.scopes[len(b.scopes)-1].PresentTokens(tokens)
    b.scopes = append(b.scopes, inner)
}

func (b *Blocker) EndFile(file string) {
    b.closeScopes(1)
}

CDP_BLOCKER:cdp-till/internal/parser/blocker.go

The BeginFile method just echoes the filename so that we can see where we are up to in the compilation. The EndFile method closes off any scopes that may still be open (except the global scope). Most of the interesting code here is, of course, in ProcessLine.

It delegates to figureIndent (below) the task of figuring out if this line is blank, a comment, invalid or a code line. If it is blank or a comment, it returns an indent of zero, and we just give up. If it is invalid it panics, so we don't see the result. So the other possibility is that it returns a non-zero indent.

In this case, we close all the scopes which we created at this level of indentation or more. There is a quick check that we haven't "double indented" (e.g. going from 1 to 3 without 2), and then we tokenize the line. We present these tokens to the current scope and then append the returned scope to our list of current scopes.

figureIndent is just an exercise in string handling. It's supposed to do what I described above and while it isn't hard, it isn't clean, either.
func figureIndent(lineNo int, line string) (int, string) {
    if len(line) == 0 {
        // it's an empty line
        return 0, ""
    }

    runes := []rune(line)
    hasnontabs := false
    indent := -1
    for i, r := range runes {
        if unicode.IsSpace(r) {
            if r != '\t' {
                hasnontabs = true
            }
        } else {
            indent = i
            break
        }
    }
    if indent > 0 && hasnontabs {
        panic(fmt.Sprintf("line %d has non-tab characters in indent", lineNo))
    }
    if indent <= 0 {
        // it's blank or a comment
        return 0, ""
    }

    return indent, string(runes[indent:])
}

CDP_BLOCKER:cdp-till/internal/parser/blocker.go

Closing scopes just involves what you'd think: going around a loop until you've got the number down to what you want and telling each one that's going away that you're closing it, and then removing it from the end of the list.
func (b *Blocker) closeScopes(upto int) {
    for len(b.scopes) > upto {
        b.scopes[len(b.scopes)-1].Close()
        b.scopes = b.scopes[:len(b.scopes)-1]
    }
}

CDP_BLOCKER:cdp-till/internal/parser/blocker.go

And NewBlocker gets the show on the road by creating a new blocker with a list of scopes which has just the global scope in it and the line lexer.
func NewBlocker(scope Scope, lineLexicator LineLexer) *Blocker {
    return &Blocker{scopes: []Scope{scope}, lineLexer: lineLexicator}
}

CDP_BLOCKER:cdp-till/internal/parser/blocker.go

Tokenization

On the subject of that lexer, I've tried to keep it as simple as possible as well. Almost everything in this language is an alphanumeric string separated by spaces. The exception is the assign/append operator (<-), but that is easy to spot as well. So we are just going to rattle through the input string, gathering up alphanumeric strings, spotting the assign operator, and panicking if anything else happens. This is the kind of thing where unit testing really helps to get all the difficult cases right at the same time, so I'm going to write a few (look in lexicator_test.go if you want to see them) but just present the simple final code.
package parser

import "unicode"

type LineLexicator struct {
}

func (l *LineLexicator) Lexicate(line string) []string {
    ol := &OneLine{}
    for _, r := range line {
        if unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_' {
            ol.completeAssign()
            ol.alpha(r)
        } else if r == '<' || r == '-' {
            ol.completeAlpha()
            ol.assign(r)
        } else if unicode.IsSpace(r) {
            ol.completeAlpha()
            ol.completeAssign()
        } else {
            panic("huh?")
        }
    }
    ol.completeAlpha()
    ol.completeAssign()
    return ol.ret
}

func NewLineLexicator() *LineLexicator {
    return &LineLexicator{}
}

CDP_LEXER:cdp-till/internal/parser/lexicator.go

We need to manage a bunch of state here, and it's all interconnected, so the easiest thing to do is to bundle it off into an object. Since the state comes into being and goes away all within the course of a line, I've called it OneLine and we'll look at it below.

We then iterate through all the characters in the line, taking appropriate actions on the state depending on whether the character is alphanumeric, a valid operator character, a space character or an invalid character.

If it's alphanumeric, we observe that any "symbol" in progress must be completed, and then this character can be appended to an alphanumeric token. On the other hand, if it's a valid symbol character, we need to complete any ongoing alphanumeric token and then append this to a symbol token. If it's a space character, then we need to complete either the ongoing alphanumeric or symbol token. Anything else (e.g. + or =) will cause an immediate panic.

When we've finished processing, we tell the OneLine state object to complete either type of token, and then we return the collection of tokens it processed.

In OneLine, we maintain a current token (of any kind) as a slice of runes. We keep the current set of completed tokens as a slice of strings. And we keep track of whether we are "in" (i.e. have already processed at least one rune of) an alphanumeric or symbol token.
type OneLine struct {
    curr              []rune
    ret               []string
    inAlpha, inAssign bool
}

func (ol *OneLine) alpha(r rune) {
    ol.inAlpha = true
    ol.curr = append(ol.curr, r)
}

func (ol *OneLine) assign(r rune) {
    ol.inAssign = true
    ol.curr = append(ol.curr, r)
}

func (ol *OneLine) completeAlpha() {
    if !ol.inAlpha {
        return
    }
    tok := string(ol.curr)
    ol.curr = nil
    ol.ret = append(ol.ret, tok)
    ol.inAlpha = false
}

func (ol *OneLine) completeAssign() {
    if !ol.inAssign {
        return
    }
    op := string(ol.curr)
    ol.curr = nil
    if op != "<-" {
        panic("invalid operator: " + op)
    }
    ol.ret = append(ol.ret, op)
    ol.inAssign = false
}

CDP_LEXER:cdp-till/internal/parser/lexicator.go

We then have four methods: two to say "we have just seen a rune which belongs to a certain type of token" and two to say "if we are in this type of token finish up". The two methods in each pair are very similar, mainly just updating the current state. The complete functions also move any token onto the ret slice.

Note that the complete methods are careful not to "complete" a token which has not been started.

And then, depending on your view of what "parsing" is, we are done. We still need to "understand" what the parsed lines mean, which we'll do next, but we haven't had to faff with any of that "lookahead token" stuff or "shift/reduce conflicts". Keeping it simple really helps.
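To give a flavour of what those tests check, here is a compact, simplified re-implementation of the same lexing rules (it keeps its state in local variables rather than an OneLine object, and skips the <- validity check) together with the token streams it should produce:

```go
package main

import (
    "fmt"
    "unicode"
)

// lexicate is an illustrative sketch of the rules above: alphanumeric runs
// become tokens, '<' and '-' runs become the operator token, spaces end
// whichever token is in progress, anything else is a panic.
func lexicate(line string) []string {
    var toks []string
    var curr []rune
    inAssign := false
    flush := func() {
        if len(curr) > 0 {
            toks = append(toks, string(curr))
            curr = nil
        }
        inAssign = false
    }
    for _, r := range line {
        switch {
        case unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_':
            if inAssign {
                flush() // a letter ends any operator in progress
            }
            curr = append(curr, r)
        case r == '<' || r == '-':
            if !inAssign {
                flush() // an operator ends any identifier in progress
                inAssign = true
            }
            curr = append(curr, r)
        case unicode.IsSpace(r):
            flush()
        default:
            panic("unexpected character: " + string(r))
        }
    }
    flush()
    return toks
}

func main() {
    fmt.Println(lexicate("row1 <- Coffee Tea Steamer Water"))
    // [row1 <- Coffee Tea Steamer Water]
    fmt.Println(lexicate("noun<-Coffee")) // spaces around <- are optional
    // [noun <- Coffee]
}
```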

Interpreting the Tokens in a Scope

We are certainly not done with the front end. We need to build up a repository of top level objects with nested commands inside them. This involves each of the tokenized lines being understood (or interpreted) in a context (or scope).

This happens when the Blocker passes the list of tokens to an existing Scope. Since that is where the magic begins, let's first look at GlobalScope:
package parser

import "github.com/gmmapowell/ignorance/cdp-till/internal/compiler"

type GlobalScope struct {
    repo compiler.Repository
}

func (gs *GlobalScope) PresentTokens(lineNo int, tokens []string) Scope {
    switch tokens[0] {
    case "layout":
        return &LayoutScope{repo: gs.repo, hdrLine: lineNo, name: tokens[1]}
    case "init":
        return &MethodScope{repo: gs.repo, hdrLine: lineNo, name: "init"}
    case "button":
        return &MethodScope{repo: gs.repo, hdrLine: lineNo, name: tokens[1]}
    default:
        panic("cannot handle global command " + tokens[0])
    }
}

func (gs *GlobalScope) Close() {
    panic("you should not close the global scope")
}

func NewGlobalScope(repo compiler.Repository) Scope {
    return &GlobalScope{repo: repo}
}

CDP_INTERPRETATION:cdp-till/internal/parser/globalscope.go

All that really matters here is PresentTokens (by the way, I have quietly added a lineNo argument here because it turns out to be needed). The first token tells you what you've got, and then you need to construct an appropriate nested scope. We pass in the repo because they will need that to add any items they create, the line number associated with the start line, and the "name" of the thing - provided in the second argument.

Note that for simplicity, I have said that init is basically the same as button - both create a method. If there turns out to be a difference, it will be possible to split them.

The LayoutScope handles all those pesky row assignments:
package parser

import "github.com/gmmapowell/ignorance/cdp-till/internal/compiler"

type LayoutScope struct {
    repo    compiler.Repository
    hdrLine int
    name    string
    rows    []compiler.RowInfo
}

func (s *LayoutScope) PresentTokens(lineNo int, tokens []string) Scope {
    if tokens[1] == "<-" {
        s.rows = append(s.rows, compiler.RowInfo{LineNo: lineNo, Label: tokens[0], Tiles: tokens[2:]})
        return &InvalidScope{}
    } else {
        panic("layout cannot handle " + tokens[0])
    }
}

func (s *LayoutScope) Close() {
    s.repo.Layout(s.hdrLine, s.name, s.rows)
}

CDP_INTERPRETATION:cdp-till/internal/parser/layoutscope.go

If the second token is not <-, then we panic. Otherwise, we append a RowInfo object which has the row name (the first token) and all the tile arguments (the remaining tokens). We return an InvalidScope because nothing may be nested within this.

When the scope closes (in Close), we tell the repository we want to add a layout starting at the line number we were passed in, with the given name and the rows we have collected.

The MethodScope handles assignments and all the actions:
package parser

import "github.com/gmmapowell/ignorance/cdp-till/internal/compiler"

type MethodScope struct {
    repo    compiler.Repository
    hdrLine int
    name    string
    actions []compiler.Action
}

func (s *MethodScope) PresentTokens(lineNo int, tokens []string) Scope {
    if tokens[1] == "<-" {
        s.actions = append(s.actions, compiler.AssignAction{LineNo: lineNo, Dest: tokens[0], Append: tokens[2:]})
        return &InvalidScope{}
    }
    switch tokens[0] {
    case "clear":
        s.actions = append(s.actions, compiler.ClearAction{LineNo: lineNo, Vars: tokens[1:]})
        return &InvalidScope{}

    case "disable":
        s.actions = append(s.actions, compiler.DisableAction{LineNo: lineNo, Tiles: tokens[1:]})
        return &InvalidScope{}
    case "enable":
        s.actions = append(s.actions, compiler.EnableAction{LineNo: lineNo, Tiles: tokens[1:]})
        return &InvalidScope{}

    case "submit":
        s.actions = append(s.actions, compiler.SubmitAction{LineNo: lineNo, Var: tokens[1]})
        return &InvalidScope{}

    case "style":
        s.actions = append(s.actions, compiler.StyleAction{LineNo: lineNo, Styles: tokens[1:]})
        return &InvalidScope{}
    }
    panic("method cannot handle " + tokens[0])
}

func (s *MethodScope) Close() {
    s.repo.Method(s.hdrLine, s.name, s.actions)
}

CDP_INTERPRETATION:cdp-till/internal/parser/methodscope.go

If the second token is the assignment operator, then we create an AssignAction with the variable name from the first token and the items to append from the remaining tokens.

Otherwise, we create an appropriate action based on the keyword (first token) and an appropriate interpretation of the remaining tokens.

When we Close this, we pass the whole thing to the repository as a Method.

These are the Action classes:
package compiler

type Action interface {
}

type AssignAction struct {
    LineNo int
    Dest   string
    Append []string
}

type EnableAction struct {
    LineNo int
    Tiles  []string
}

type DisableAction struct {
    LineNo int
    Tiles  []string
}

type ClearAction struct {
    LineNo int
    Vars   []string
}

type SubmitAction struct {
    LineNo int
    Var    string
}

type StyleAction struct {
    LineNo int
    Styles []string
}

CDP_INTERPRETATION:cdp-till/internal/compiler/actions.go

And the RowInfo class used by Layout:
package compiler

type RowInfo struct {
    LineNo int
    Label  string
    Tiles  []string
}

CDP_INTERPRETATION:cdp-till/internal/compiler/layout.go

And I have added the relevant methods to the Repository interface:
package compiler

type Repository interface {
    Layout(lineNo int, name string, rows []RowInfo)
    Method(lineNo int, name string, actions []Action)
}

CDP_INTERPRETATION:cdp-till/internal/compiler/repository.go

And then I've gone ahead and implemented that interface in the same file with a simple map and some complicated structs:
type Entry interface {
}

type LayoutEntry struct {
    lineNo int
    name   string
    rows   []RowInfo
}

type MethodEntry struct {
    lineNo  int
    name    string
    actions []Action
}

type RepositoryStore struct {
    entries map[string]Entry
}

func (r *RepositoryStore) Layout(lineNo int, name string, rows []RowInfo) {
    _, ok := r.entries[name]
    if ok {
        panic("duplicate name: " + name)
    }
    r.entries[name] = LayoutEntry{lineNo: lineNo, name: name, rows: rows}
}

func (r *RepositoryStore) Method(lineNo int, name string, actions []Action) {
    _, ok := r.entries[name]
    if ok {
        panic("duplicate name: " + name)
    }
    r.entries[name] = MethodEntry{lineNo: lineNo, name: name, actions: actions}
}

CDP_INTERPRETATION:cdp-till/internal/compiler/repository.go

And then added a method to create a RepositoryStore, but this method is not used yet.
func NewRepository() Repository {
    return &RepositoryStore{entries: make(map[string]Entry)}
}

CDP_INTERPRETATION:cdp-till/internal/compiler/repository.go

Let's Take Stock

That completes the front end of the compiler. We have the ability to read all the files in a directory, parse them, interpret the commands and build up a repository of all the items we've created by name.

That leaves three more tasks in the compiler, which I'm hoping will be simpler:
  • Generate a JSON file, either in memory or on disk and enable the webserver to serve it to the client;
  • Add a handler to serve it to the client;
  • Add a task to our watcher to watch the samples directory and generate the appropriate code.
While that will complete the "compiler" portion of the project, we will still have to write a runtime library to read the code, execute it and render the tiles. That's for the next episode, though, so we're closer to the end of this episode than you might think.

Generate the Code

Now, as luck would have it, Go has very good JSON support, provided I can put my elements in the right form for it to work correctly. Sadly, when I first tried this, I was very close but not quite close enough. The fixes are mainly of the form of changing field names from being unexported (with lower case first letters) to exported, and adding explicit fields to identify the types of repository entries and actions.

Having done that, we can add a Json() method to the repository:
package compiler

import (
    "encoding/json"
    "log"
    "maps"
    "slices"
)

type Repository interface {
    Layout(lineNo int, name string, rows []RowInfo)
    Method(lineNo int, name string, actions []Action)
    Json() []byte
}

CDP_EXPORT_JSON:cdp-till/internal/compiler/repository.go

And implement it in the obvious way:
func (r *RepositoryStore) Json() []byte {
    bs, err := json.Marshal(slices.Collect(maps.Values(r.entries)))
    if err != nil {
        log.Printf("Error %v\n", err)
        return nil
    } else {
        return bs
    }
}

CDP_EXPORT_JSON:cdp-till/internal/compiler/repository.go

The malarkey inside the call to Marshal() is just turning the map into a slice. For updating, having a map is useful, but "as code", it doesn't make a lot of sense. Of course, we will probably turn it back into a map when it gets to the other end 🙂

And Wire Everything Up

Now we have all the pieces, and it's just a race to the finish to wire it all up. Let's start at the top in main:
func main() {
    reloader := chrome.NewReloader("http://localhost:"+port+"/", "http://localhost:9222")
    go watcher.Watch("website", reloader)
    repo := compiler.NewRepository()
    recompile := generator.NewCompiler(repo, "samples", reloader)
    go watcher.Watch("samples", recompile)
    server.StartServer(":"+port, repo)
}

CDP_COMPILER_WIRING:cdp-till/cmd/till/main.go

We need a "top level" action for the compiler to parse the code and store it in the repository. We wire this up to our watcher and also connect it to the reloader. If the sample file changes, we will regenerate the code and then reload the web page. In the fullness of time, that will turn around and reload the code. As with the reloader, we make sure this is invoked on startup so that there is always code available even if the file is not changed.

Let's look at that NewCompiler function:
package generator

import (
    "github.com/gmmapowell/ignorance/cdp-till/internal/compiler"
    "github.com/gmmapowell/ignorance/cdp-till/internal/parser"
    "github.com/gmmapowell/ignorance/cdp-till/internal/watcher"
)

type Compiler interface {
    watcher.FileChanged
}

type CodeGenerator struct {
    repo     compiler.Repository
    srcdir   string
    reloader watcher.FileChanged
}

func (c *CodeGenerator) Changed(file string) {
    c.repo.Clean()
    parser.Parse(c.repo, c.srcdir)
    c.reloader.Changed(file)
}

func NewCompiler(repo compiler.Repository, srcdir string, reloader watcher.FileChanged) Compiler {
    ret := &CodeGenerator{repo: repo, srcdir: srcdir, reloader: reloader}
    ret.Changed("")
    return ret
}

CDP_COMPILER_WIRING:cdp-till/internal/generator/generator.go

It creates a CodeGenerator, a wrapper around Parse which first makes sure the Repository is empty, then recompiles all the code and notifies the reloader.

Finally, we update the HTTP Server:
func StartServer(addr string, repo compiler.Repository) {
    handlers := http.NewServeMux()
    index := NewFileHandler("website/index.html", "text/html")
    handlers.Handle("/{$}", index)
    handlers.Handle("/index.html", index)
    favicon := NewFileHandler("website/favicon.ico", "image/x-icon")
    handlers.Handle("/favicon.ico", favicon)
    cssHandler := NewDirHandler("website/css", "text/css")
    handlers.Handle("/css/{resource}", cssHandler)
    jsHandler := NewDirHandler("website/js", "text/javascript")
    handlers.Handle("/js/{resource}", jsHandler)
    repoHandler := NewRepoHandler(repo, "application/json")
    handlers.Handle("/till-code", repoHandler)
    server := &http.Server{Addr: addr, Handler: handlers}
    err := server.ListenAndServe()
    if err != nil && !errors.Is(err, http.ErrServerClosed) {
        fmt.Printf("error starting server: %s\n", err)
    }
}

CDP_COMPILER_WIRING:cdp-till/internal/web/server.go

We create a RepoHandler, providing it with a pointer to the repository (which it shares with the parser), and then attach that to the Mux at /till-code.

The RepoHandler looks like this:
type RepoHandler struct {
    repo      compiler.Repository
    mediatype string
}

func (r *RepoHandler) ServeHTTP(resp http.ResponseWriter, req *http.Request) {
    bs := r.repo.Json()
    resp.Header().Set("Content-Type", r.mediatype)
    resp.Header().Set("Content-Length", fmt.Sprintf("%d", len(bs)))
    resp.Write(bs)
}

func NewRepoHandler(repo compiler.Repository, mediatype string) http.Handler {
    return &RepoHandler{repo: repo, mediatype: mediatype}
}
This simply recovers the JSON generated by the Repository and packages it up over HTTP.

And then we can test that we can retrieve the generated code:
$ curl http://localhost:1399/till-code
[{"EntryType":"layout","LineNo":9,"Name":"main","Rows":[{"LineNo":10,"Label":"row1","Tiles":["Coffee","Tea","Steamer","Water"]},{"LineNo":11,"Label":"row2","Tiles":["Strong","Black"]},{"LineNo":12,"Label":"row3","Tiles":["Dairy","blank","Oat","Almond"]},{"LineNo":13,"Label":"row4","Tiles":["blank","blank","blank","blank"]},{"LineNo":14,"Label":"row5","Tiles":["NEXT","blank","blank","DONE"]}]},{"EntryType":"method","LineNo":33,"Name":"Coffee","Actions":[{"ActionName":"style","LineNo":34,"Styles":["noun"]},{"ActionName":"assign","LineNo":35,"Dest":"noun","Append":["Coffee"]},{"ActionName":"disable","LineNo":36,"Tiles":["all"]},{"ActionName":"enable","LineNo":37,"Tiles":["Strong","Black","milks"]},{"ActionName":"enable","LineNo":38,"Tiles":["NEXT"]}]},{"EntryType":"method","LineNo":40,"Name":"Tea","Actions":[{"ActionName":"assign","LineNo":41,"Dest":"noun","Append":["Tea"]},{"ActionName":"disable","LineNo":42,"Tiles":["all"]},{"ActionName":"enable","LineNo":43,"Tiles":["Black","milks"]},{"ActionName":"enable","LineNo":44,"Tiles":["NEXT"]}]},{"EntryType":"method","LineNo":46,"Name":"Water","Actions":[{"ActionName":"disable","LineNo":47,"Tiles":["all"]},{"ActionName":"enable","LineNo":48,"Tiles":["NEXT"]}]},{"EntryType":"method","LineNo":50,"Name":"Oat","Actions":[{"ActionName":"style","LineNo":51,"Styles":["adjective"]},{"ActionName":"assign","LineNo":52,"Dest":"adjs","Append":["Oat"]},{"ActionName":"disable","LineNo":53,"Tiles":["milks","Water"]}]},{"EntryType":"method","LineNo":63,"Name":"NEXT","Actions":[{"ActionName":"assign","LineNo":64,"Dest":"order","Append":["noun","adjs"]},{"ActionName":"clear","LineNo":65,"Vars":["noun","adjs"]},{"ActionName":"enable","LineNo":66,"Tiles":["all"]},{"ActionName":"disable","LineNo":67,"Tiles":["NEXT"]}]},{"EntryType":"method","LineNo":22,"Name":"init","Actions":[{"ActionName":"enable","LineNo":23,"Tiles":["all"]},{"ActionName":"disable","LineNo":24,"Tiles":["NEXT"]},{"ActionName":"disable","LineNo":25,"Tiles":["DONE"]},{"ActionName":"assign","LineNo":26,"Dest":"milks","Append":["Oat","Dairy","Almond"]},{"ActionName":"assign","LineNo":27,"Dest":"drinks","Append":["Tea","Coffee","Steamer","Water"]}]},{"EntryType":"method","LineNo":55,"Name":"Dairy","Actions":[{"ActionName":"assign","LineNo":56,"Dest":"adjs","Append":["Dairy"]},{"ActionName":"disable","LineNo":57,"Tiles":["milks","Water"]}]},{"EntryType":"method","LineNo":59,"Name":"Almond","Actions":[{"ActionName":"assign","LineNo":60,"Dest":"adjs","Append":["Almond"]},{"ActionName":"disable","LineNo":61,"Tiles":["milks","Water"]}]},{"EntryType":"method","LineNo":69,"Name":"DONE","Actions":[{"ActionName":"submit","LineNo":70,"Var":"order"},{"ActionName":"clear","LineNo":71,"Vars":["order"]},{"ActionName":"enable","LineNo":72,"Tiles":["all"]},{"ActionName":"disable","LineNo":73,"Tiles":["NEXT"]},{"ActionName":"disable","LineNo":74,"Tiles":["DONE"]}]}]

Conclusion

When I was planning out this project, the idea of creating a language and compiler from scratch seemed somewhat daunting. Not in the thought of doing it per se, but doing it as scaffolding for writing a debugger plugin. But I decided that with enough simplifications and compromises, I could build a complete compiler in a day. I'm glad to say I've managed that.

I didn't help myself by frittering away two hours this morning finishing off the previous blog post, and writing this one certainly slowed me down, but it's important to recognise that I have made major compromises.

First off, this language is very simple. Secondly, the compiler's only reaction to bad input is to panic, and that won't help users a great deal. On top of that, there aren't any of the usual checks you would see: are things defined if they're referenced? What about types? In short, as a compiler, this is much like JavaScript was back in the days when it was LiveScript.

But the important thing is that I have something that works and that, I believe, can be used to effectively control a limited experience in the browser. Next time, we will take this JSON output, treat it as a program, and use it to simulate a till.

Then we can start building our debugger plugin.
