Wednesday, October 7, 2020

Understanding TextMate Grammars

When I was working through the example VSCode syntax highlighter, I observed that the grammar was specified using "TextMate grammars" and noted that I had never heard of these. Research was called for.

My particular challenge was to figure out how to match the syntax of a language which leans heavily on indentation. It seems hard to find the end of a block using the definition that "it doesn't have more tabs than this one".

Excuse my ignorance (as it happens, this is a rule of reading this blog - I am exposing my ignorance precisely to help others), but it turns out somebody has already come up with a solution to this problem: it is called negative lookahead. The idea of "lookahead" regular expressions is that they can check whether or not a pattern occurs at the current position, but do not consume the matched characters. Thus, giving a rule an end pattern that is a negative lookahead for one more level of nesting will cause the block to end just before the first line that is not more deeply nested. I think.
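The non-consuming behaviour is easy to demonstrate outside the editor. A minimal sketch using Python's re module, whose lookahead semantics agree with the Oniguruma engine VSCode uses for these simple cases:

```python
import re

# A positive lookahead checks the text ahead without consuming it, so the
# characters it inspects remain available to whatever pattern runs next.
m = re.match(r"foo(?=bar)", "foobar")
assert m.group(0) == "foo"          # "bar" was checked but not consumed

# A negative lookahead succeeds only when its sub-pattern does NOT match here.
assert re.match(r"foo(?!bar)", "foobaz") is not None
assert re.match(r"foo(?!bar)", "foobar") is None
```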

I gained this insight from a separate web discussion which simply went ahead and used these expressions to achieve this objective without really explaining what they did.

Breaking Down the Example

So what does it do? Well, first off, I ran it to check that it did actually work in the way I was expecting. However, it doesn't really do what I want.

This is the specification of the rule in some cson format that I don't entirely understand but appears to be some form of simplified JSON:
'indent-block': {
  name: 'indent-block.test-grammar'
  begin: '^(\\s+)(?=\\S)'
  end: '^(?!\\1\\s+)(?!\\s*$)'
  patterns: [
    {include: '$$self'}
  ]
}
So this says that a block starts on a line that has a specific amount of leading whitespace and ends at the first non-blank line whose leading whitespace does not exceed that of the opener. This is good for matching blocks such as
  while true
     …
  done
because the begin rule will match while true and the end rule will match done.
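This can be checked in isolation. A sketch in Python's re (whose behaviour matches Oniguruma for these patterns), substituting the captured indentation for \1 by hand since the begin and end searches run separately here:

```python
import re

# The begin pattern from the CSON rule, with the escaping unwrapped.
begin = re.compile(r"^(\s+)(?=\S)")

lines = ["  while true", "     ...", "  done"]

m = begin.match(lines[0])
indent = m.group(1)                       # the captured indentation: "  "

# The end pattern refers back to that capture as \1; substitute it in.
end = re.compile(r"^(?!%s\s+)(?!\s*$)" % re.escape(indent))

assert end.match(lines[1]) is None        # more deeply indented: still inside
assert end.match(lines[2]) is not None    # same indent, non-blank: block ends
```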

Unpicking the patterns, the begin pattern wants to match text which:
  • starts at the beginning of the line;
  • has at least one whitespace (\\s) character;
  • is followed by a non-whitespace character, which the lookahead checks but does not include in the match.
This allows the block to contain arbitrary matches for things like while as well as inner blocks.

The end pattern wants to match text which:
  • starts at the beginning of the line;
  • does not begin with more whitespace than the begin line (\\1 refers back to the text captured by the parenthesized group in the begin pattern);
  • contains at least one non-whitespace character before the end of the line, i.e. is not blank.
None of this (except the beginning of the line) is actually part of the pattern, leaving the remaining text to belong to an outer block (although I'm not sure why this is desired).
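Because the end pattern consists only of lookaheads, its match is zero-width, which is easy to confirm. A Python sketch, with \1 written out as the literal two-space indent from the earlier example:

```python
import re

# End pattern from the rule above, with \1 replaced by the two-space
# indentation captured from "  while true".
end = re.compile(r"^(?!  \s+)(?!\s*$)")

m = end.match("  done")
assert m is not None
assert m.group(0) == ""   # both conditions are lookaheads: nothing is consumed
```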

I did not come to a definite conclusion, but it seemed to me several times during my debugging sessions that once the beginning-of-line anchor has been matched, it cannot be used again. That is, VSCode appears to treat the beginning-of-line anchor as it would any other character: once one pattern has matched it, no subsequent pattern can match it.

Making it Work for Me

But this is not the pattern I want to match.

First off, I want a block to include all the lines after this one that are indented more than this one. And I want the block to end the moment it sees a line that is indented the same amount as this one. But I want it to end before that line starts.

To add to the complexity, I'm happy for blank lines - or lines with non-whitespace text in column 1 - to appear within a block, so any line which is blank or doesn't start with whitespace is not an end marker.

An additional challenge for me personally is a mental model shift: I'm used to thinking outside-in: breaking up the document into large chunks, then breaking those up into smaller chunks and so on. But this mechanism for matching regular expressions approaches the problem in a more inside-out way: the outer blocks "do not get a look in" until the innermost block has reached its end.

So a block is a sequence of lines which starts with a line with at least one leading tab character and continues until it finds a line which:
  • has at least one leading tab character;
  • has no more leading tab characters than the begin line had;
  • is not blank (does not consist entirely of whitespace characters).
Finally, it does not actually consume any of this, leaving it all for the next start line.

So the begin pattern is fairly easy to write:
^(\\t+)(\\S+)
This only matches lines with at least one leading tab and at least one non-whitespace character after the leading tabs. The second capture group is simply there to capture the "keyword" for highlighting to make it clear where the blocks begin.
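With the grammar-file escaping unwrapped, the begin pattern can be exercised directly (again a sketch in Python's re, which behaves the same as Oniguruma here; the sample lines are my own):

```python
import re

# The begin pattern: at least one leading tab, then a "keyword".
begin = re.compile(r"^(\t+)(\S+)")

assert begin.match("\twhile true")            # one leading tab: a block opener
assert begin.match("\t\tsend msg")            # deeper openers also match
assert begin.match("no leading tab") is None  # column-1 text is not an opener
assert begin.match("\t\t") is None            # tabs with nothing after them

m = begin.match("\t\twhile true")
assert m.group(1) == "\t\t"      # the indentation, referred to later as \1
assert m.group(2) == "while"     # the captured keyword for highlighting
```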

The end pattern is considerably more complex:
(?=^\\1\\S)|(?=\\t(?!\\1\\S))
This is an OR of two positive lookahead conditions.

The left hand condition says that the block will end if the putative end line starts with exactly the same tab characters that were captured from the block opener, followed by a non-whitespace character. The right hand condition says it will also come to an end if the line begins with a tab that is not then followed by the opener's full indentation plus a non-whitespace character.

Put another way, the left hand condition will end the block when a line appears with the same indentation; the right hand condition will end the block when a line appears with less indentation (but at least one tab, thus ignoring blank lines and literate comments).
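Both cases can be checked outside the editor. In this sketch I substitute a two-tab opener for \1 by hand and anchor the test at column 0 with re.match for simplicity (the editor searches along the line, where nested begin rules also compete for the more-indented lines):

```python
import re

opener_tabs = "\t\t"    # stands in for \1, the indentation the begin rule captured

# The end pattern with \1 substituted: same indentation, OR at least one
# tab that is not followed by the full opening indentation.
end = re.compile(r"(?=^%s\S)|(?=\t(?!%s\S))" % (opener_tabs, opener_tabs))

assert end.match("\t\tnext stmt")              # same indent: block ends here
assert end.match("\tless indented")            # shallower, but >= one tab: ends
assert end.match("\t\t\tdeeper") is None       # more indented: still inside
assert end.match("") is None                   # blank line: not an end marker
assert end.match("literate text") is None      # column-1 text: not an end marker
```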

I have put this rule in the grammar for .st files and you can experiment with that.

Once again, I want to point out how very hard debugging all this is: in particular, there are some things you can do with lookahead patterns (I'm not sure exactly what) that cause the syntax highlighter to just "give up"; when you try to open the inspector, it just sits there saying "loading". Your only option at this point is to quit the editor and try a different pattern.

It also really annoys me, both from a development and a documentation standpoint, that you cannot add comments to a JSON file.

But What I Really Want is …

Having put all that hard work in, I discovered that it wasn't what I really wanted.

The reason, of course, is my mental model is wrong. As I noted above, I tend to think "outside-in", so I feel that if I can break the code up into blocks, then I can tackle one block at a time. This is not the right way to think about the problem.

Instead, what I need to do is to match each type of block that I have, and then provide it with a set of included patterns. Since my grammar does, in fact, distinguish quite significantly between blocks, this is actually fine.

But all the effort was not wasted, because pretty much every pattern is going to be a variant of what we saw above. One simplification that arises is that for many of the blocks, I know exactly what the acceptable indentation levels are, so I can just use that directly rather than the \\1 matching. On the other hand, I've made it more complex by trying to catch explicitly some versions of invalid indentation.

Another Challenge - Nesting

Another challenge I am facing is: "how do rules nest"? That is, if I specify multiple sub-patterns for a rule, in what order are they selected?

Again, a lot of this just doesn't matter. We are not fully parsing the code, trying to determine its structure and meaning, and rejecting invalid programs; we are just trying to highlight things according to their category. So, except for going deep into blocks, it is generally enough just to say "ah, this is a type" or "ah, this is a variable".

A couple of things do seem to matter, though. One is that I want end-of-line comments (starting with //) to be selected in preference to anything else. Another is that I want to handle blank and literate comments as a priority. And so, by trial and error, I determined that the rules seem to be processed in the order in which they are presented in the patterns block for a given rule. And, as noted above, once a rule has been selected, no other rule will be selected until it has finished (except for those nested within it).
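To make the ordering concrete, here is a hypothetical sketch of such a rule in the grammar's JSON form. The rule names, scope names and comment syntax here are my own illustrations, not the contents of the real grammar: the comment rule is included first so it wins, and the block includes itself to allow nesting.

```json
{
  "repository": {
    "indent-block": {
      "begin": "^(\\t+)(\\S+)",
      "end": "(?=^\\1\\S)|(?=\\t(?!\\1\\S))",
      "beginCaptures": {
        "2": { "name": "keyword.other" }
      },
      "patterns": [
        { "include": "#line-comment" },
        { "include": "#indent-block" }
      ]
    },
    "line-comment": {
      "name": "comment.line.double-slash",
      "match": "//.*$"
    }
  }
}
```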

Creating a grammar

Taking all these points together, I looked at the sample .fl file and created a grammar which seemed to cover the subset of rules that I had relied on.

This is now installed in flas.json and can be seen in the repository.

Making it real

My original plan was to generate a grammar file from my formal grammar, but having experimented in this way, I've realized two things:
  • the impedance mismatch between a formal grammar and the regular expressions used by VSCode is vast and an automatic conversion between the two would not be easy;
  • having sunk the investment that I have into figuring it out, it actually isn't that hard to define rules in this way, and I might either do it by hand or invest in a quick conversion tool that automates all of the hard work of JSON formatting and repetition and checks my work.

Conclusion

It's not really news, but doing anything with regular expressions is a pain.

That is particularly true when you are bending them out of shape to do something of the job of an actual parser.

But provided you can bend your mind around it all, it is perfectly possible to come up with an acceptable set of regular expressions for an indentation-based language.
