Ignorance may be Strength : June 2019

This post is technically a bit off topic, since the theme is supposed to be things I am ignorant about and I am in fact a compiler expert - I read the Dragon Book in 1986, after I'd already written a few parsers; I've got a PhD in the area; I've worked for a compiler company; I implemented the IDL spec for TIBCO's CORBA implementation; yada, yada, yada.

But most of the parsers I've written over the past 25 years have been done using Agile methods and TDD in particular while discovering the application domain, rather than relying on formal grammars. But while working on my FLAS compiler recently, I felt I'd hit a brick wall in terms of understanding the parser in terms of the wider context and looked back at that IDL spec I'd implemented in 1994 and thought, "I need something like this".

But I don't like writing documents like that, let alone maintaining them. So I thought "let's get the computer to write the document". And so, let the experimentation begin ...

Obviously the computer can't actually write documents ...

Computers can't write documents. At least not in general. But that isn't really what I meant. I want it to do all the hard formatting and repetition for me, based on an explanation of what the grammar should be. I want it to be flexible enough to change whenever I want to add or remove or rule, or just change the name of something. I want it to renumber everything and make sure all the references are accurate. I want it to tell me when I've made a mistake (omitted something, defined something twice, not used anything ...). In short, all I want to do is to specify the grammar formally, and that only in order to make myself commit to exactly what I want it to be and describe what the semantics are supposed to be.

To understand what I'm talking about, look at that IDL spec, particularly the grammar starting in section 3.4. Everything in the grammar is duplicated: section 3.4 has a "summary" of all the rules, and then section 3.5 goes back and looks at each group of rules together explaining both the intricacies of parsing them and the intended semantics of the various constructions.

Each of the rules references various other items, including tokens (described in section 3.2) and other rules in the grammar.

Each rule has a number associated with it, and fairly obviously, the numbers are presented in ascending order.

I want all of that, I just don't want to have to create it, and I certainly don't want to maintain it when I decide to add a new rule between 35 and 36. Or to change one name.

In short, I want a programming language. Except I don't want to use something like YACC, because I don't want to try and go straight from the specification to generating a parser; that never works for me.

So I'm going to use XML.

XML gets a lot of bad press, and has a lot of haters. Particularly among the "cool kids" who do front end development and think for some reason that YAML and JSON are so much better, apparently based on the fact that they have less angle brackets and more curly ones. But on this specific occasion, I think it's the right tool for the job. Let me briefly explain why.

First and foremost, it's very easy to parse. I've worked with various XML formats before for one reason or another (and often hated it) but have developed a very convenient XML processing library that wraps the standard Java one and makes this kind of work remarkably easy.

It is clear from the outset that this definition needs to combine "definitions" and "text". XML is very good at this - much better than JSON or YAML. Including HTML inside XML can be a bit tricky, but as long as you stick to XHTML and don't try and get fancy with DTDs, it all works fine.

The structure of the formal grammar is very repetitive and quite flat, and so the XML representation is remarkably easy both to represent and read (as I hope you'll agree with the examples below).

OK, enough of the XML sales pitch. Let's get down to the meat of it.

Representing a production as XML

So, all those rules in the grammar are basically just productions - that's the technical term for a name (specifically a non-terminal) on the left which can be produced when the parser sees one of the possible sequences of terminals (or tokens) and non-terminals on the right hand side.

Consider this rule from my grammar:

(50) list-literal ::= OSB CSB
| OSB <expression> <comma-expression>* CSB

This defines a list literal in the language, which is basically exactly the same as it is in JavaScript or any functional language. Things like this:

[] - the empty list
[1] - a simple list
[a,b] - a list with two elements
[2+3, "hello".length()] - a list with two expressions

Formally, it says that there are two cases for defining a list literal. It is possible to produce a list literal by either following the first case and having an "Open Square Bracket" followed by a "Close Square Bracket" (i.e. the empty list) or by having those with a non-terminal expression followed by zero or more non-terminal comma-expressions. You'd have to go and look for those rules (currently 48 and 51 respectively) to check, but informally it's fairly obvious that that covers all the other cases - one or more expressions separated by commas.

Note, however, that this formulation expressly prohibits the trailing comma which is allowed in JavaScript (but not in JSON). Changing the grammar to support that (in one of several different ways) is left as an exercise to the reader. What I'm interested in here is how that is represented in XML.

At the outermost level, it's a production and we want to call it list-literal, so let's do that:

<production name='list-literal'>
...
</production>

That's enough to define the production name, make it available as a reference, and give it a number. But that only defines the left hand side of the production.

The internals give more information.

For the more detailed description of the grammar, each production needs to be placed in a particular section, along with a description. It would be possible, indeed reasonable, to group the productions in XML inside a <section> block, but in the end I decided I wanted the flexibility to have the rules in one order and the sections potentially grouping from different areas. I haven't used that flexibility, but it led to me putting the section inside the production:

<section title='Expressions' />

One consequence of this choice is the odd feature that the first time a particular section is named, it has to include a <description> element; on subsequent usages, such as this one, that is forbidden and only the title is required.

Finally, the rule has a body; that is the set of productions that make it up. As you would expect, this is basically a tree of all the things you see on the right hand side.

As a tree, it consists of multiple "levels" of definition. At the top level, this is an "OR" definition, so we declare an <or> element, and then nest the cases inside it - one per line of the text representation. In this case, each of those is a "sequence" of other definitions, so we use a <seq> element for each line. The first line consists of just two, named tokens (using the <token> element) so we include those, in order, inside the first <seq>.

<or>

<seq>

</seq>

<seq>

<many>

</many>

</seq>

</or>

The second rule is more complicated (understandably). It has the same two tokens (at the start and the end). Inside this is a reference (<ref>) to another non-terminal, expression. This is followed by a <many> rule which says that whatever it contains will occur zero or more times; the content in this case is another reference to another non-terminal, in this case comma-expression.

Transforming into a Grammar Document

Having parsed this, we need to generate the grammar text. This is really not as complicated as it might seem.

The basic idea is to generate an HTML file, although we could also generate PDF with a little more effort. We can run through the productions we read from the XML and for each one apply a simple transformation to generate an appropriate div with an appropriate CSS class. We can then define an appropriate CSS stylesheet that lays out the rules in the way we want.

In order to generate both a summary and a detailed grammar description, it is necessary to run through the grammar twice. The first time generates the summary; on the second pass, the <section> tags are analyzed in detail and each section is generated with a header, the actual productions for that section and then the documentation from the section description.

The Gift that Keeps on Giving

While my original motivation for doing the grammar in this way was to simplify the documentation process, I have subsequently found that now that I have the grammar in machine-readable form, I find it an easy step to encourage the machine to do just "that one more thing" for me.

Testing the Grammar

When reading the grammar, the parsing tool builds a cross reference of all the rules and tokens. In doing so, it sometimes finds inconsistencies. It is possible to extend this to an automated grammar consistency test. For example, once all the grammar rules have been read, it is easy to tell whether there are some production references that refer to productions that are not defined anywhere in the grammar. Likewise, it is possible to tell if all the tokens are defined. With a minor extension, it is possible to check that all the productions and tokens are in fact referenced somewhere. And so on. Every time I change the grammar my automated build runs the process that generates the documentation (actually coded as a unit test) and the build fails if any of these invariants is not met.

Random Sentence Production

Moreover, it gives me the opportunity to automatically, randomly test the compiler's parser. By reversing the grammar, it is possible to generate random sentences by looking at each production and filling in the blanks. By doing so, it is possible to produce (admittedly meaningless) programs which should at the very least produce credible parse trees.

Again, my automated build contains a suite of unit tests which generate a few thousand random sentences based on the grammar and runs the parser on them. Any syntax errors represent failures in the parser and thus broken unit tests ... and a failed build.

Which Rule Went Wrong?

A step beyond this is statistical analysis pointing out which rules are failing. By knowing for each random sentence which production it involved and which sentences failed, it "should" be possible to identify which production(s) are not parsing correctly by analyzing the failures. While I have written code to do this, it is sufficiently hard to interpret the results that I have given up using it an reverted to the normal process of analyzing what I changed most recently as a guide as to what is likely to be broken ...

How Well Did I Do?

Judge for yourself ... this is the current version of my FLAS language specification (note this may well be different to what I've written above due to this blog post not being kept up to date).

TL;DR - Lessons Learned

The main lesson to be learnt here is what a "gift that keeps on giving" it is to make things machine readable. My experiences really did follow the path I outlined here - my original objective was just to save myself from having to lay out and update all of the rule numbers while continually adjusting the grammar. Everything else occurred to me much as I've stated it and became possible because of the decision to use a machine-readable format.

XML in this case worked out really well for me and I haven't regretted it at all. The virtues I extolled above were reinforced, rather than diminished, as I played the game.

Ignorance may be Strength

Wednesday, June 19, 2019

Experiments with Formal Grammars