Thursday, September 24, 2020

Syntax Highlighting in VSCode

So, time to start writing some code. Or, at least copying it.

I created a new directory in the ignorance repository, called vscode-java. This is where I'm going to put the VSCode half of the language server - the client, if you will. As trailed in the last post, my starting point is to copy the contentprovider sample and simplify it, so that is the code I copied.

And then I went through "simplifying" it - i.e. deleting most of the actual code so that I was just left with the syntax highlighting portion. I then copied in a couple of sample text files from my own repository. And obviously I had to run npm install and open it in VSCode.

How Syntax Highlighting Works

The instructions on configuring syntax highlighting in the Microsoft documentation are actually quite clear, but not very exhaustive. They seem to defer most of the details to "it's the same as TextMate" without referencing anything.

The official word on TextMate grammars appears to be this document, but it's not very detailed itself. I haven't managed (yet) to find any more introductory work.

The key thing seems to be to configure language and grammar contributions in package.json, so I did this:
"contributes": {
  "languages": [
    {
      "id": "flas",
      "extensions": [ ".fl" ]
    },
    {
      "id": "flas-st",
      "extensions": [ ".st" ]
    }
  ],
  "grammars": [
    {
      "language": "flas",
      "path": "./syntax/flas.json",
      "scopeName": "source.flas"
    },
    {
      "language": "flas-st",
      "path": "./syntax/flas-st.json",
      "scopeName": "source.flas-st"
    }
  ]
}
Here I appear to be defining two languages, but that's just because I have two types of file for my language: the main files have extension .fl and the system tests have .st. Each of these has its own grammar. The grammars are placed in files under the syntax directory and each has its own scope name for theming purposes.

Defining the Grammars

The grammars are defined in JSON approximately in line with the description of "TextMate grammars" insofar as I understand it (not a lot as yet). I'm sure it will become clearer as I dig in more. Sadly, the syntax is sufficiently opaque as to discourage you from learning by example.

Still, here is an excerpt from one of the grammars I defined (the typename rule that it refers to is not shown).
{
  "name": "flas",
  "scopeName": "source.flas",
  "patterns": [
    { "include": "#comment" },
    { "include": "#contract-intro"}
  ],
  "repository" : {
    "comment": {
      "name": "comment.line.file",
      "match": "^[^ \t].*$"
    },
    "contract-intro": {
      "begin": "\tcontract\\b",
      "beginCaptures": {
        "0": { "name": "keyword.intro" }
      },
      "end": "(//.*$|$)",
      "name": "statement.contract",
      "patterns": [{"include":"#typename"}]
    }
  }
}
The name and scopeName match the language and scopeName from the package.json. As far as I can tell, if they do not match, the grammar is quietly ignored and will not be used for syntax highlighting. The patterns array defines a set of productions or rules that can occur at the top level. In spite of being defined by regular expressions, there is an element of grammar productions to this, and it is certainly NOT the case that each regular expression just matches what it feels like.
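Since nothing reports a mismatch, the cross-check between package.json and the grammar files can be scripted. What follows is only a sketch: checkContributions and the interface names are made up for illustration and are not part of VSCode, and it checks only the language id and the scopeName, because those are the two links visible in the snippets above.

```typescript
// Only the parts of package.json and of a grammar file that matter here.
interface LanguageContribution { id: string; extensions: string[]; }
interface GrammarContribution { language: string; path: string; scopeName: string; }
interface Contributes { languages: LanguageContribution[]; grammars: GrammarContribution[]; }
interface GrammarFile { name?: string; scopeName: string; }

// Hypothetical helper (not a VSCode API): returns one message per inconsistency.
function checkContributions(
  contributes: Contributes,
  grammarFiles: Map<string, GrammarFile>
): string[] {
  const problems: string[] = [];
  const languageIds = contributes.languages.map((l) => l.id);
  for (const g of contributes.grammars) {
    // Every grammar should point at a language declared in the same file.
    if (languageIds.indexOf(g.language) < 0) {
      problems.push(`${g.path}: language "${g.language}" is not declared`);
    }
    // The scopeName inside the grammar file should equal the one in package.json.
    const file = grammarFiles.get(g.path);
    if (file === undefined) {
      problems.push(`${g.path}: grammar file was not supplied`);
    } else if (file.scopeName !== g.scopeName) {
      problems.push(
        `${g.path}: scopeName "${file.scopeName}" differs from "${g.scopeName}" in package.json`
      );
    }
  }
  return problems;
}

const contributes: Contributes = {
  languages: [{ id: "flas", extensions: [".fl"] }],
  grammars: [{ language: "flas", path: "./syntax/flas.json", scopeName: "source.flas" }],
};

// One grammar file that agrees with package.json and one with a typo in its scopeName.
const agreeing = new Map<string, GrammarFile>([
  ["./syntax/flas.json", { name: "flas", scopeName: "source.flas" }],
]);
const mistyped = new Map<string, GrammarFile>([
  ["./syntax/flas.json", { name: "flas", scopeName: "source.fals" }],
]);

const agreeingProblems = checkContributions(contributes, agreeing); // empty
const mistypedProblems = checkContributions(contributes, mistyped); // one message
```

In a real project the map would be filled by reading each path from disk and parsing it as JSON; that step is left out so that the sketch stays self-contained.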

The repository allows you to define more complex, possibly recursively nested, rules. An include entry says that, instead of giving a pattern directly, the rule delegates to a rule in the repository. The reference to the rule name must begin with a #, while the rule name itself does not; I'm not sure why. Both the top-level patterns array and the patterns array within a repository definition list alternatives: any of the patterns may match some or all of the text being considered, but it is not possible for two of them to match overlapping text.
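A dropped # or a reference to a rule that does not exist is easy to detect mechanically. The sketch below is an illustration only: findBrokenIncludes and the Rule and Grammar shapes are invented names, it models just the fields used in this post, and it deliberately leaves alone any include that is neither a # reference nor the name of a repository rule, since those may legitimately refer to other grammars.

```typescript
// A deliberately small model of a grammar: just the fields used in this post.
interface Rule {
  name?: string;
  match?: string;
  begin?: string;
  end?: string;
  include?: string;
  patterns?: Rule[];
}
type Repository = { [ruleName: string]: Rule };
interface Grammar {
  scopeName: string;
  patterns: Rule[];
  repository?: Repository;
}

// Hypothetical helper: lists includes that cannot be resolved in the repository.
function findBrokenIncludes(grammar: Grammar): string[] {
  const repository: Repository = grammar.repository || {};
  const has = (ruleName: string): boolean =>
    Object.prototype.hasOwnProperty.call(repository, ruleName);
  const broken: string[] = [];

  const visit = (rules: Rule[], where: string): void => {
    for (const rule of rules) {
      const target = rule.include;
      if (target !== undefined) {
        if (target.charAt(0) === "#") {
          // "#foo" must name a key in the repository.
          if (!has(target.slice(1))) {
            broken.push(`${where}: no repository rule called "${target.slice(1)}"`);
          }
        } else if (has(target)) {
          // The bare name of a repository rule: almost certainly a forgotten #.
          broken.push(`${where}: "${target}" probably needs a leading #`);
        }
      }
      if (rule.patterns !== undefined) {
        visit(rule.patterns, where);
      }
    }
  };

  visit(grammar.patterns, "top level");
  for (const ruleName of Object.keys(repository)) {
    visit([repository[ruleName]], ruleName);
  }
  return broken;
}

// The excerpt above, with the # deliberately dropped from the second include.
const excerpt: Grammar = {
  scopeName: "source.flas",
  patterns: [{ include: "#comment" }, { include: "contract-intro" }],
  repository: {
    "comment": { name: "comment.line.file", match: "^[^ \t].*$" },
    "contract-intro": {
      begin: "\tcontract\\b",
      end: "(//.*$|$)",
      name: "statement.contract",
      patterns: [{ include: "#typename" }],
    },
  },
};

const brokenIncludes = findBrokenIncludes(excerpt);
// Two entries: the bare "contract-intro" at the top level, and "#typename",
// which has no definition in this excerpt.
```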

It's also important to realize that the begin and end patterns are not part of the body of the rule and so are not included in the sub-matching of the patterns array but rather have separate logic to style them (beginCaptures and endCaptures).
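To make the three regions concrete, here is a much simplified model of what a begin/end rule does to a single line. It is a sketch under loud assumptions: splitBeginEnd is an invented function, it uses JavaScript regular expressions rather than the engine VSCode actually uses, and it ignores the way the real tokenizer interleaves the inner patterns with the search for the end.

```typescript
// The three regions a begin/end rule carves out of a line.
interface Split { begin: string; body: string; end: string; }

// Hypothetical, simplified model: find begin, then the first end after it.
function splitBeginEnd(line: string, begin: string, end: string): Split | null {
  const b = new RegExp(begin).exec(line);
  if (b === null) {
    return null; // the rule does not start on this line
  }
  const rest = line.slice(b.index + b[0].length);
  const e = new RegExp(end).exec(rest);
  if (e === null) {
    return { begin: b[0], body: rest, end: "" };
  }
  // Only the text between the two matches is offered to the inner patterns.
  return { begin: b[0], body: rest.slice(0, e.index), end: e[0] };
}

const parts = splitBeginEnd(
  "\tcontract Ledger // the ledger",
  "\tcontract\\b", // the begin pattern from the excerpt
  "(//.*$|$)"      // the end pattern from the excerpt
);
// parts.begin is "\tcontract"      (styled through beginCaptures)
// parts.body  is " Ledger "        (the only text the patterns array sees)
// parts.end   is "// the ledger"   (styled through endCaptures)
```

Seen this way, a rule such as typename can never match the contract keyword or the trailing comment, because neither is in the body.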

Debugging

If reading (and writing) these grammars is hard, figuring out what is going on - and wrong - is insanely hard. First off, every time you make a change to any of the grammar files, you need to restart the extension instance (the second VSCode window, in which the extension is running). This is done by pressing Shift+F5 to stop the current instance and then F5 to start a new one.

It is then possible to see the consequences of your actions. If you're fortunate, you will see visual effects on the screen. If not, or if you just want clarity about what happened, it's possible to bring up the token inspector. In the extension window, press Cmd+Shift+P (Ctrl+Shift+P on Windows and Linux) to bring up the Command Palette and then type some portion of Developer: Inspect Editor Tokens and Scopes. Selecting this pops up a window which shows which rules were applied at the current location. To choose a different location, simply click there (unless it's under the popup window, in which case you may need to resort to trickery such as clicking elsewhere first or using the keyboard). To dismiss the window, press ESC.

On the upside, every time you restart, VSCode picks up from where it left off, so you don't need to go through the steps of re-opening the relevant windows. It also learns fairly quickly that you want to use the Token Inspector and suggests it sooner. And of course, if you are desperate, you can bind it to a simple keyboard shortcut.

Conclusion

Actually wiring up syntax highlighting was surprisingly easy. Getting the patterns to work was not. The complete lack of any tooling (at least in VSCode) that points out errors or failings was really annoying. Some things that particularly caught me out were: forgetting the # when referencing a rule in the repository; not doubling the backslash before regular expression escapes such as \b (this is not needed for \t, because JSON turns that into a literal tab character, which the regular expression then matches as itself); the begin and end syntax, along with the fact that they are not included in the inner patterns; and the fact that matches cannot overlap.
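The backslash problem is easier to see by decoding the JSON string and looking at what the regular expression engine receives. This is a sketch, not what VSCode does internally: it uses JSON.parse and JavaScript's RegExp as stand-ins, on the assumption that any JSON reader applies the same string escapes and that a pattern this simple behaves the same way in the real engine.

```typescript
// Each constant holds the text exactly as it would be typed in the grammar file,
// quotes included. String.raw stops TypeScript interpreting the backslashes itself.
const doubled = String.raw`"\tcontract\\b"`;
const single = String.raw`"\tcontract\b"`;

// Decode the JSON string escapes, as a JSON reader would.
const doubledPattern: string = JSON.parse(doubled); // tab, "contract", backslash, "b"
const singlePattern: string = JSON.parse(single);   // tab, "contract", backspace (U+0008)

const line = "\tcontract Ledger";
const doubledMatches = new RegExp(doubledPattern).test(line); // true: \b is a word boundary
const singleMatches = new RegExp(singlePattern).test(line);   // false: needs a literal backspace
```

The single-backslash version is still valid JSON, because \b is the JSON escape for a backspace character, which would explain why nothing complains when the pattern simply never matches.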

I need to spend considerably more time looking into the syntax and trying to figure out how to use it to reasonably describe a quite complex "context sensitive" grammar using this weird mixture of regular expressions and production rules. But that is more for the "real world" than it is for this blog (although I may come back here if I have any wisdom to distill), and in the meantime it is time to move on to the task of integrating with a Java back end.
