Ignorance may be Strength

Wednesday, June 19, 2019

Experiments with Formal Grammars

This post is technically a bit off topic, since the theme is supposed to be things I am ignorant about and I am in fact a compiler expert - I read the Dragon Book in 1986, after I'd already written a few parsers; I've got a PhD in the area; I've worked for a compiler company; I implemented the IDL spec for TIBCO's CORBA implementation; yada, yada, yada.

But most of the parsers I've written over the past 25 years have been done using Agile methods and TDD in particular while discovering the application domain, rather than relying on formal grammars. But while working on my FLAS compiler recently, I felt I'd hit a brick wall in terms of understanding the parser in terms of the wider context and looked back at that IDL spec I'd implemented in 1994 and thought, "I need something like this".

But I don't like writing documents like that, let alone maintaining them. So I thought "let's get the computer to write the document". And so, let the experimentation begin ...

Obviously the computer can't actually write documents ...

Computers can't write documents. At least not in general. But that isn't really what I meant. I want it to do all the hard formatting and repetition for me, based on an explanation of what the grammar should be. I want it to be flexible enough to change whenever I want to add or remove or rule, or just change the name of something. I want it to renumber everything and make sure all the references are accurate. I want it to tell me when I've made a mistake (omitted something, defined something twice, not used anything ...). In short, all I want to do is to specify the grammar formally, and that only in order to make myself commit to exactly what I want it to be and describe what the semantics are supposed to be.

To understand what I'm talking about, look at that IDL spec, particularly the grammar starting in section 3.4. Everything in the grammar is duplicated: section 3.4 has a "summary" of all the rules, and then section 3.5 goes back and looks at each group of rules together explaining both the intricacies of parsing them and the intended semantics of the various constructions.

Each of the rules references various other items, including tokens (described in section 3.2) and other rules in the grammar.

Each rule has a number associated with it, and fairly obviously, the numbers are presented in ascending order.

I want all of that, I just don't want to have to create it, and I certainly don't want to maintain it when I decide to add a new rule between 35 and 36. Or to change one name.

In short, I want a programming language. Except I don't want to use something like YACC, because I don't want to try and go straight from the specification to generating a parser; that never works for me.

So I'm going to use XML.

XML gets a lot of bad press, and has a lot of haters. Particularly among the "cool kids" who do front end development and think for some reason that YAML and JSON are so much better, apparently based on the fact that they have less angle brackets and more curly ones. But on this specific occasion, I think it's the right tool for the job. Let me briefly explain why.

First and foremost, it's very easy to parse. I've worked with various XML formats before for one reason or another (and often hated it) but have developed a very convenient XML processing library that wraps the standard Java one and makes this kind of work remarkably easy.

It is clear from the outset that this definition needs to combine "definitions" and "text". XML is very good at this - much better than JSON or YAML. Including HTML inside XML can be a bit tricky, but as long as you stick to XHTML and don't try and get fancy with DTDs, it all works fine.

The structure of the formal grammar is very repetitive and quite flat, and so the XML representation is remarkably easy both to represent and read (as I hope you'll agree with the examples below).

OK, enough of the XML sales pitch. Let's get down to the meat of it.

Representing a production as XML

So, all those rules in the grammar are basically just productions - that's the technical term for a name (specifically a non-terminal) on the left which can be produced when the parser sees one of the possible sequences of terminals (or tokens) and non-terminals on the right hand side.

Consider this rule from my grammar:

(50) list-literal ::= OSB CSB
| OSB <expression> <comma-expression>* CSB

This defines a list literal in the language, which is basically exactly the same as it is in JavaScript or any functional language. Things like this:

[] - the empty list
[1] - a simple list
[a,b] - a list with two elements
[2+3, "hello".length()] - a list with two expressions

Formally, it says that there are two cases for defining a list literal. It is possible to produce a list literal by either following the first case and having an "Open Square Bracket" followed by a "Close Square Bracket" (i.e. the empty list) or by having those with a non-terminal expression followed by zero or more non-terminal comma-expressions. You'd have to go and look for those rules (currently 48 and 51 respectively) to check, but informally it's fairly obvious that that covers all the other cases - one or more expressions separated by commas.

Note, however, that this formulation expressly prohibits the trailing comma which is allowed in JavaScript (but not in JSON). Changing the grammar to support that (in one of several different ways) is left as an exercise to the reader. What I'm interested in here is how that is represented in XML.

At the outermost level, it's a production and we want to call it list-literal, so let's do that:

<production name='list-literal'>
...
</production>

That's enough to define the production name, make it available as a reference, and give it a number. But that only defines the left hand side of the production.

The internals give more information.

For the more detailed description of the grammar, each production needs to be placed in a particular section, along with a description. It would be possible, indeed reasonable, to group the productions in XML inside a <section> block, but in the end I decided I wanted the flexibility to have the rules in one order and the sections potentially grouping from different areas. I haven't used that flexibility, but it led to me putting the section inside the production:

<section title='Expressions' />

One consequence of this choice is the odd feature that the first time a particular section is named, it has to include a <description> element; on subsequent usages, such as this one, that is forbidden and only the title is required.

Finally, the rule has a body; that is the set of productions that make it up. As you would expect, this is basically a tree of all the things you see on the right hand side.

As a tree, it consists of multiple "levels" of definition. At the top level, this is an "OR" definition, so we declare an <or> element, and then nest the cases inside it - one per line of the text representation. In this case, each of those is a "sequence" of other definitions, so we use a <seq> element for each line. The first line consists of just two, named tokens (using the <token> element) so we include those, in order, inside the first <seq>.

<or>

<seq>

</seq>

<seq>

<many>

</many>

</seq>

</or>

The second rule is more complicated (understandably). It has the same two tokens (at the start and the end). Inside this is a reference (<ref>) to another non-terminal, expression. This is followed by a <many> rule which says that whatever it contains will occur zero or more times; the content in this case is another reference to another non-terminal, in this case comma-expression.

Transforming into a Grammar Document

Having parsed this, we need to generate the grammar text. This is really not as complicated as it might seem.

The basic idea is to generate an HTML file, although we could also generate PDF with a little more effort. We can run through the productions we read from the XML and for each one apply a simple transformation to generate an appropriate div with an appropriate CSS class. We can then define an appropriate CSS stylesheet that lays out the rules in the way we want.

In order to generate both a summary and a detailed grammar description, it is necessary to run through the grammar twice. The first time generates the summary; on the second pass, the <section> tags are analyzed in detail and each section is generated with a header, the actual productions for that section and then the documentation from the section description.

The Gift that Keeps on Giving

While my original motivation for doing the grammar in this way was to simplify the documentation process, I have subsequently found that now that I have the grammar in machine-readable form, I find it an easy step to encourage the machine to do just "that one more thing" for me.

Testing the Grammar

When reading the grammar, the parsing tool builds a cross reference of all the rules and tokens. In doing so, it sometimes finds inconsistencies. It is possible to extend this to an automated grammar consistency test. For example, once all the grammar rules have been read, it is easy to tell whether there are some production references that refer to productions that are not defined anywhere in the grammar. Likewise, it is possible to tell if all the tokens are defined. With a minor extension, it is possible to check that all the productions and tokens are in fact referenced somewhere. And so on. Every time I change the grammar my automated build runs the process that generates the documentation (actually coded as a unit test) and the build fails if any of these invariants is not met.

Random Sentence Production

Moreover, it gives me the opportunity to automatically, randomly test the compiler's parser. By reversing the grammar, it is possible to generate random sentences by looking at each production and filling in the blanks. By doing so, it is possible to produce (admittedly meaningless) programs which should at the very least produce credible parse trees.

Again, my automated build contains a suite of unit tests which generate a few thousand random sentences based on the grammar and runs the parser on them. Any syntax errors represent failures in the parser and thus broken unit tests ... and a failed build.

Which Rule Went Wrong?

A step beyond this is statistical analysis pointing out which rules are failing. By knowing for each random sentence which production it involved and which sentences failed, it "should" be possible to identify which production(s) are not parsing correctly by analyzing the failures. While I have written code to do this, it is sufficiently hard to interpret the results that I have given up using it an reverted to the normal process of analyzing what I changed most recently as a guide as to what is likely to be broken ...

How Well Did I Do?

Judge for yourself ... this is the current version of my FLAS language specification (note this may well be different to what I've written above due to this blog post not being kept up to date).

TL;DR - Lessons Learned

The main lesson to be learnt here is what a "gift that keeps on giving" it is to make things machine readable. My experiences really did follow the path I outlined here - my original objective was just to save myself from having to lay out and update all of the rule numbers while continually adjusting the grammar. Everything else occurred to me much as I've stated it and became possible because of the decision to use a machine-readable format.

XML in this case worked out really well for me and I haven't regretted it at all. The virtues I extolled above were reinforced, rather than diminished, as I played the game.

Tuesday, June 4, 2019

Installing Haskell on a Mac

I'm not going to tell you how to install any Haskell implementation on any box. I'm just going to help you a little bit by telling you what I did to install ghc on my Mac.

Basically, I followed these instructions.

What this amounts to is running this curl script:

$ curl -sSL https://get.haskellstack.org/ | sh

Towards the end, it asks for permission to "sudo" in order to install to /usr/local/bin and you need to give your password.

In order to do the code generation, it needs you to carry out two additional steps, which it informs you about towards the end of the installation:

$ xcode-select --install

$ open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg

And then you need to run the compiler at least once to download the latest version and so on:

$ stack ghc

Sunday, May 19, 2019

Monads by Motivation

My background (long, long ago) is in functional programming. Back in the 80s and early 90s, there were no monads, but there was a lot of research into how to make state easier to deal with in functional programs. I did some of that research.

After I had left University and gone out into the real world, my supervisor sent me a couple of Phil Wadler's papers introducing the idea of monads. The idea seemed genius, but the presentation put me off. Because I've not really done anything more with pure functional languages over the years, I've never really wrangled with monads. And thus I've never really understood them. I've even regretted over the years not studying Category Theory when I was an undergraduate, on the grounds that that might have helped.

A friend recently pointed out that maybe I was just all too theoretical on the subject and maybe I should just try and "use" monads. I demurred, on the grounds that I like to understand what it is I'm doing, but then I went to Liam Griffin-Jowett's talk on "the bank kata in Haskell" and realized that he was just assuming practical knowledge of monads in order to solve a problem. And in following along, I felt I learnt a lot about monads - but not enough.

So, here, now, I'm going to try and put two and two together for the first time, and I'm going to do it by winding back the clock to 1990, declaring that monads don't exist and building a cafe loyalty card in a pure functional language. I'm going to use Haskell rather than Miranda, because that's what we do these days, and it will make the transition easier.

First off, make sure you have Haskell installed.

Secondly, if you want to follow along with code, you'll want my code which can be downloaded from github:

$ git clone git@github.com:gmmapowell/ignorance-may-be-strength.git

OK, let's get started.

A simple, purely functional, loyalty card

Let's consider a simple loyalty card - you know, the ones that you collect in coffee shops and have 10 pictures of coffee beans on them. Every time you go, you receive one stamp. When the card is full, you can have a free coffee and start again.

In Haskell, we can represent such a card quite simply:

data Loyalty = Card Int

And we can fairly easily represent the act of stamping a card like so:

stamp (Card n)

| stamped < 10 = ("You have " ++ (show stamped) ++ " stamps", Card stamped)

| otherwise = ("You have earned a free beverage", Card 0)

where
stamped = n+1

This takes a card and returns a pair - a message about the new state of the card and either an updated card, or a brand new one if the free beverage was earned.

So far, so good.

Visiting more than Once

Assume that Adam and I go for coffee and he collects both stamps. Using the definition above, that's easy, right? We just do:

stamp (stamp (Card 0))

Except when we do this, we see the following:

• Couldn't match expected type ‘Loyalty’

with actual type ‘([Char], Loyalty)’

• In the first argument of ‘stamp’, namely ‘(stamp (Card 0))’

In the expression: stamp (stamp (Card 0))

In an equation for ‘it’: it = stamp (stamp (Card 0))

The problem here is that we are returning both a message and the "current card" but when we go to apply the second stamp, stamp is only expecting a card to be passed in. We can work around this by pulling apart the pair and then returning a list of messages together with the final card, but this is starting to look ugly.

stampTwice c =

([msg1, msg2], card)

where

(msg1, tmp) = stamp c

(msg2, card) = stamp tmp

Fundamentally, this is a problem of composition. What we want to be able to do is to create a function stampTwice which can either be defined intuitively as the earlier (failing) example or using the compose operator as

stampTwiceC = stamp . stamp

I know function signatures can quickly get out of hand, and never more so than when composing functions, but let's look at this in a bit of detail. The simplest function signature is always a->b, and thus the simplest signature for compose is going to be (b->c)->(a->b)->(a->c).

In (almost) plain English, this is a function which takes two functions and returns a function. The input of the returned function is the same as that accepted by the second function (which is actually applied first); its return type will be the same as that of the first function; and the return type of the second function must match the input type of the first function. Again, so far, so good.

Our problem here is with that last clause. Our first function is returning the tuple (String, Card) and the second function is expecting just a Card. We somehow need to use sleight of hand to pass the card from one invocation to the next while collecting up the messages.

Card/0

stamp

1 stamp

Card/1

stamp

2 stamps

Card/2

A Composition Function of our Own

Of course, as any good computer scientist would tell you, there's more than one way to break down a problem - and more than one way to hide the complexity.

For me, one of the great things about Functional Languages has always been the cunning use of nested functions, helper functions, scopes and recursion to keep on moving the complexity off until it turns out it was never there in the first place: it's just a graph of obvious and simple steps.

But we have the ability here to shunt off some of the complexity by extracting stamp from the definition above. Quietly doing a small amount of concomitant renaming, we suddenly have:

composeState :: (c -> (b, d)) -> (a -> (b, c)) -> (a -> ([b], d))

composeState f g x =

([r1,r2], state)

where

(r1, tmp) = g x

(r2, state) = f tmp

This appears very similar to the above, but is in fact a composition function that takes two arbitrary functions which return tuples; the first element of the two tuples must be composable in a list, and the second arguments follow the same pattern as compose.

Applying this more general function to stamp and stamp behaves like so:

*Main> composeState stamp stamp (Card 0)

(["You have 1 stamps","You have 2 stamps"],Card 2)

Monads are all about Composition

Avid monad spotters will by now have noticed what we've done (of course, I wouldn't expect an avid monad spotter to be reading a basic introduction to the topic like this). We've recreated the State monad, albeit in a weird, limited and ugly form.

This is in fact, the very purpose of a monad. Monads are all about composition. Monads exist precisely to solve problems like this by wrapping up and "hiding" some ugly portion of the processing while allowing the "user" code to focus on the desired interactions - in this case the action of stamping cards and informing users how their rewards are stacking up.

So let's have a quick look at how we would do this in Haskell using the real State monad.

stampm :: State Loyalty String

stampm = do

card <- get

let (msg, next) = stamp card

put next

return msg

*Main> evalState ((++) <$> stampm <*> stampm) (Card 0)

"You have 1 stampsYou have 2 stamps"

You are probably thinking right now, "that looks pretty complicated and what does half of it mean?"; and you'd be right. But that's often the way with computer science solutions: for a simple example, the more general solution is more complex than a more specific solution. The payback comes as the system grows and you are only interested in the interactions, not the plumbing.

In future episodes, we'll come back (again and again) to the topic of monads, exploring what this means, what you can do and why it all works.