Now we come to the crux of why I am doing this refactoring. I currently have five separate front-end processors, three of which are related by being "prose processors", and two of those have a similar input structure. Currently, those three form an inheritance hierarchy. What I want to do is to simplify that to a single processor which is configurable by installing an array of "modules".
As I mentioned before, I don't think of any of this as "parsing" as such, more as "text processing". So the input files are broken into lines and each line is processed in turn. Currently, each of the processors takes a line of input at a time and consists of a sequence of "if-else" statements to find out what should be done with the line. My goal is to replace all of that with a single processor which asks each configured module if it can handle the line or not. Ultimately, there will always be a "default" handler which handles any input line which none of the others has; normally, this will apply to lines of text and to blank lines.
Which modules are installed into the processor is determined by two factors. Firstly, the name of the processor in the configuration file, for example processor blog or processor doc, determines the default handler and can also install a default set of handlers. Secondly, it is possible to specify additional modules in the configuration of the processor which will add more features to an extension point defined by the existing processor. And (and this is the most important part for my current purposes) these modules can also configure aspects of the system introduced by other modules.
Do what now?
OK, I can tell I've confused you (if I haven't, you're either doing extraordinarily well, or you really don't understand what's going on). So let's take an example. In my doc (documentation) formatter, I have two main control lines: lines beginning with @name introduce blocks (which control how subsequent text is formatted until a subsequent @/); and lines beginning with &cmd perform some command, replacing the line with some other line (or applying some special format to the line).
So, I might have this input:
@Chapter
title=Hello, World
This is the opening chapter of my book.
&example hello-world
Enough said.
@/
The @Chapter directive says that this is the start of a new chapter. Everything up until the @/ is part of this new chapter. The lines that immediately follow (up until a blank line) are key-value pairs, with the key being a simple alphanumeric string and the value being arbitrary text up until the end of the line. This is not at all dissimilar to the definition of an SGML or HTML element, just written in a form I find more elegant for my purposes. It is up to the code handling the @Chapter directive to decide what to do with this: as an example, chapters often carry numbers, and will be formatted appropriately (the chapter title is taken from the title parameter). The chapter may also be entered into a table of contents along with the current page number.
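To pin the format down, a minimal sketch of recognizing one of these key-value lines might look like this (the helper class and the choice of a regular expression are mine; the original parsing code is not shown):

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParamLines {
    // a simple alphanumeric key, then "=", then arbitrary text to end of line
    private static final Pattern KV = Pattern.compile("^([A-Za-z0-9]+)=(.*)$");

    // returns true if the line was a key-value pair and was recorded
    public static boolean parseParam(String line, Map<String, String> params) {
        Matcher m = KV.matcher(line);
        if (!m.matches())
            return false;
        params.put(m.group(1), m.group(2));
        return true;
    }
}

So title=Hello, World would record the key title with the value Hello, World.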
In this example, there are two lines of plain text, which are just processed as such and passed through to the back end along with the "current" paragraph style (as specified by the chapter for normal text).
The line beginning &example is an example of a command. In this case, the idea might be that we have somewhere a catalogue of examples to include in the book (extracted directly or indirectly from source code associated with the book in a repository). The argument to the &example command is the name of the example to be included - in this case hello-world. There needs to be a piece of code which does all this work and then spits out into the intermediate form a sequence of appropriately-formatted paragraphs containing the example code.
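As a sketch of the shape such a piece of code might take (every name here is hypothetical - the ExampleCatalogue, the Sink method, even the handler interface, whose real registrations appear later in the post):

public class ExampleAmp implements AmpCommandHandler {
    private final ExampleCatalogue catalogue; // hypothetical: knows where examples live
    private final Sink sink;                  // the intermediate form

    public ExampleAmp(ExampleCatalogue catalogue, Sink sink) {
        this.catalogue = catalogue;
        this.sink = sink;
    }

    @Override
    public void handle(String line) {
        // "&example hello-world" -> look up "hello-world" in the catalogue
        String name = line.substring("&example".length()).trim();
        for (String codeLine : catalogue.linesOf(name))
            sink.paragraph("example-code", codeLine);
    }
}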
All of this currently exists and is hardcoded into the doc processor. When reworking my configuration files I took the step of declaring that in fact I wanted the various commands that did this work to be configured as modules - I did this in large part because these operations themselves need to be configured (we need to know where to find the examples!) and I wanted to remove that configuration from the main doc processor.
In terms of this example, what I want to do is to have three levels of processing for &example:
- The main processor just breaks everything into lines and submits them for analysis by the handlers installed by the doc module;
- There is a handler which recognizes the & at the beginning of the line as the introduction to a command and then identifies the rest of the first token as a name; it then finds a handler for this command;
- This handler does all the hard work.
Note that there is nothing magical about the three levels here: some lines (blank lines and ordinary lines) will be handled by the "default" handler; meanwhile, some lines could require four (or more) levels, as each level identifies a customizable pattern and calls the appropriate nested handler.
The Goals
The main goal here is to eliminate all of the separate processors and replace them with just one configurable processor.
Additional goals include:
- Changing the way that the configuration is processed so that the argument to processor in the configuration file is used to find a class which will configure the default processor;
- Breaking up all of the big "if-else" statements and making them nested processors;
- Breaking up any sub-logic and making that nested-nested logic;
- Making the code more amenable to testing.
At the end of this phase, I would hope that everything would work again and the intermediate form output matches its original form.
In terms of testing, my goal is that in breaking the code up into modules, each module is going to follow a simple interface in which it is configured with a set of options and then is expected to process one or more lines. I would hope that I can test top-level modules by asserting that they delegate to the correct nested modules, and that bottom-level modules create the appropriate intermediate form. I would hope that a command such as &example can be tested by passing in an appropriate double of the file system and demonstrating that it finds the correct item in the file system and formats its contents correctly.
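To sketch what that might look like in practice (all of these names - ExampleCommand, the FileSystem double, the Sink assertion - are hypothetical placeholders, since none of this code exists yet; I'm using Mockito-style doubles purely for illustration):

import static org.mockito.Mockito.*;
import org.junit.jupiter.api.Test;

public class ExampleCommandTest {
    @Test
    public void itFindsAndFormatsTheNamedExample() {
        // a double of the file system, primed with one example
        FileSystem fs = mock(FileSystem.class);
        when(fs.contentsOf("examples/hello-world")).thenReturn("print(\"hello\")");
        Sink sink = mock(Sink.class);

        // the handler under test should find the example in the catalogue...
        ExampleCommand cmd = new ExampleCommand(fs, sink);
        cmd.invoke("hello-world");

        // ...and emit an appropriately-formatted paragraph
        verify(sink).paragraph("example-code", "print(\"hello\")");
    }
}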
Let's Get Started!
The ConfiguredProcessor
The first class I want to create is the ConfiguredProcessor. The idea is that this will be a drop-in replacement for DocProcessor, and when I have updated all the ConfigListeners which now create the various Processor subclasses to all use ConfiguredProcessor, all the other versions will be just so much dead code, and then I can delete all of it.
So I'm basically starting again, and in a fresh class I'm creating a simple processor:
public void process(FilesToProcess places) throws IOException {
    for (Place x : places.included()) {
        ConfiguredState state = new ConfiguredState();
        List<ProcessingScanner> all = createScannerList(state);
        // Each of the scanners gets a chance to act
        x.lines((n, s) -> {
            for (ProcessingScanner scanner : all) {
                if (scanner.handleLine(trim(s)))
                    return;
            }
            throw new CantHappenException("the default scanner at least should have fired");
        });
        for (ProcessingScanner scanner : all)
            scanner.placeDone();
    }
}
This is configured by giving it a list of classes which implement scanners (as well as two default handlers - one for blank lines and one for non-blank lines).
createScannerList is responsible for instantiating all of these and putting them in a list, ordered so that the scanners configured last are considered first, down to the handler for non-blank lines, which is considered last.
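A guess at the shape of createScannerList, just to make that ordering concrete (the field names and the instantiate helper are mine; only the ordering rule comes from the description above):

private List<ProcessingScanner> createScannerList(ConfiguredState state) {
    List<ProcessingScanner> all = new ArrayList<>();
    // scanners configured last are consulted first
    for (int i = configuredScanners.size() - 1; i >= 0; i--)
        all.add(instantiate(configuredScanners.get(i), state));
    // the two defaults always go last: blank lines, then everything else
    all.add(instantiate(blankHandler, state));
    all.add(instantiate(defaultHandler, state));
    return all;
}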
The main loop here considers each input file in turn. As currently written, each is processed individually and a new ConfiguredState is created as we process each file; we likewise create a new set of scanners for each file, sharing the same state. In the fullness of time, I expect to revisit this and have a "super-state" which can hold information which should persist across files (the table of contents is an obvious example of this, but many things are shared between the files in a book).
Each file is broken up into lines and each of these is processed by the inner lambda, which traverses the list of scanners until it finds a match. The scanner is responsible for handling the line if it matches. If nothing else matches, the non-blank line scanner should, so it is an error for the loop to finish.
Finally, when we have processed all the lines in the file, we allow all the scanners to do anything they feel is appropriate to finish up anything they were working on.
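It is worth writing down the contract that loop relies on. I haven't shown the ProcessingScanner interface itself, so this is inferred from the calls above:

public interface ProcessingScanner {
    // return true if this scanner recognized and consumed the line
    boolean handleLine(String line);

    // called once the whole file has been processed, to finish up
    void placeDone();
}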
Configuring It
So far, so good. But how do we configure it?
I'm still shying away from writing the actual code to include the "global" modules from declarations in the configuration file, but at this point, I definitely need to have that code somewhere. So I hacked into the code that reads the configuration file and pretended it loaded a couple of global modules. In the end, the module command will take a class name and then immediately instantiate it and call its install method. The interface GlobalModuleInstaller is used to define this contract.
public interface GlobalModuleInstaller {
    void install();
}
The implementations of this are then responsible for configuring everything that module needs - the availability of a configurable processor block, any sinks and intermediate form options, and of course, all the configuration for the processor - the scanners listed above as well as all of the commands that flow from that.
I started with the "doc" processor - the one that produces documents, books and documentation. I just started with one chapter of a manual to see how far I would get, and this is what I have (apologies for the length, but I think only with this amount of code does it become clear what I'm trying to do).
public class InstallDocModule implements GlobalModuleInstaller {
    private final ReadConfigState state;
    private final ScriptConfig config;

    public InstallDocModule(ReadConfigState state) {
        this.state = state;
        this.config = state.config;
    }

    @Override
    public void install() {
        state.registerProcessor("doc", DocProcessorConfigListener.class);
        installAtCommands();
        installAmpCommands();
    }

    // @ commands
    private void installAtCommands() {
        // structure
        this.config.extensions().bindExtensionPoint(AtCommandHandler.class, ChapterCommand.class);
        this.config.extensions().bindExtensionPoint(AtCommandHandler.class, SectionCommand.class);
        // should commentary be in a separate module?
        this.config.extensions().bindExtensionPoint(AtCommandHandler.class, CommentaryCommand.class);
    }

    // & commands
    private void installAmpCommands() {
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, FootnoteAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, FutureAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, LinkAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, OutrageAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, ReviewAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, TTAmp.class);
    }
}
As an aside, I need to point out that this code is unlikely to stay like this: both ReadConfigState and ScriptConfig are concrete classes where they should be interfaces, and both of them are rather too strongly connected to the whole configuration process rather than being more abstract entities. What we really want here is something that says "if you are a module, I know how to offer all the things you need in order to configure yourself".
The key method here is install(). Its first step is to say that the module contains a class DocProcessorConfigListener which should be activated if the configuration contains the line processor doc. The configuration listener is responsible for processing all the options and then instantiating the processor - which must, of course, be a ConfiguredProcessor - and then configuring it based on all of the instructions in the configuration file (see below). The other two lines call the methods that install the various command handlers for the @ and & commands needed for this chapter (I will add more as I expand the number of cases I am considering). Each of these methods consists of introducing a number of commands associated with an "extension point". This is simply a class name which is used as a unique identifier. Later, the code will ask for all of the commands associated with an extension point. The key here is that this enables us to extend not just the known parts of the system, but also the unknown parts of the system, and in a way that will later enable us to add modules to the modules. You will see already that I am questioning if the @Commentary command should in fact be moved off into a separate module.
The ConfigListener
As noted above, the install() method registers a handler for the processor doc command. This causes the associated class to be instantiated. It is then given each (parsed) line in the configuration file nested underneath that command to handle. This is done in the dispatch method:
@Override
public ConfigListener dispatch(Command cmd) {
    switch (cmd.name()) {
    case "module": {
        ModuleConfigListener nmc = new NestedModuleCreator(state).module(cmd.line().readArg());
        modules.add(nmc);
        return nmc;
    }
    case "joinspace":
    case "meta":
    case "scanmode": {
        vars.put(cmd.depth(), cmd.name(), cmd.line().readArg());
        return null;
    }
    default: {
        throw new NotImplementedException("doc processor does not have parameter " + cmd.name());
    }
    }
}
Again, this is a work in progress and I'm not overly happy with it. But there are two basic cases here: if the command is a module command, we want to hand off processing to a nested processor which will create the module and attach it to this processor; and in all other cases we have a simple key/value pair which we store in a dictionary. This latter fork of the code is in fact the legacy of how the whole configuration used to be processed: in the fullness of time, I expect to make it look the way you would expect, at least with each "key" having its own field in the class, if not its own processing to support an arbitrary number of parameters.
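To make the two cases concrete, the configuration fragment this method consumes presumably looks something like this (the indentation-based nesting and the placeholder values are my illustration; only the command names come from the code above):

processor doc
  joinspace ...
  scanmode ...
  module org.example.SomeDocModule

Each module line can then carry its own nested configuration, which is handed to the ModuleConfigListener returned from dispatch.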
The NotImplementedException here reflects a refactoring in progress: I don't know for sure that the parameter is not valid; I just know that I haven't handled it yet. At the end of the refactoring, I expect I will come back and change these to all be proper errors.
When all the configuration has been parsed, the appropriate objects will be instantiated by calling the complete method on the ConfigListener:
@Override
public void complete() throws ConfigException {
    try {
        Sink sink = state.config.makeSink();
        ConfiguredProcessor proc = new ConfiguredProcessor(state.config.extensions(), sink...);
        proc.setDefaultHandler(StandardLineProcessor.class);
        proc.setBlankHandler(NewParaProcessor.class);
        proc.addScanner(AmpSpotter.class);
        proc.addScanner(AtSpotter.class);
        proc.addScanner(FieldSpotter.class);
        state.config.processor(proc);
        for (ModuleConfigListener m : modules) {
            m.activate(proc);
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        throw new ConfigException("Error creating DocProcessor: " + ex.getMessage());
    }
}
Again, there are rough edges here as we find ourselves in the middle of a refactoring, but the structure is about right: we need to obtain a sink which represents the intermediate form; we create an instance of the ConfiguredProcessor, which needs to at least have the sink; we configure it by adding handlers and scanners for the various text lines; we register it as a processor in the repository; and then we take all the nested modules we collected during the configuration process and activate them by passing in the configured processor. This gives them the opportunity to attach their own scanners and also to attach to any extension points that have been created.
The processor is aware of the map of extension points because it is passed in to the constructor as state.config.extensions(). I'm again not happy with doing it this way - that may well be cleaned up as we go along and be passed into the ConfigListener as a separate argument; or else I may go the other way and pass around a state or a config pointer. It all really depends on what feels right when (if?) I get to that level of polish (or testing).
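Since AmpSpotter keeps coming up, here is a sketch of what a scanner at that middle level might look like inside. The real class isn't shown in this post, so the body below - and the handle method I've given AmpCommandHandler - is just my guess at the shape of the middle level described earlier:

import java.util.Map;

public class AmpSpotter implements ProcessingScanner {
    private final Map<String, AmpCommandHandler> commands;

    public AmpSpotter(Map<String, AmpCommandHandler> commands) {
        this.commands = commands;
    }

    @Override
    public boolean handleLine(String line) {
        // only claim lines that introduce an & command
        if (!line.startsWith("&"))
            return false;

        // the rest of the first token is the command name
        String name = line.substring(1).split("\\s+", 2)[0];
        AmpCommandHandler handler = commands.get(name);
        if (handler == null)
            return false; // unknown command: let a later scanner deal with it

        // the nested handler does all the hard work
        handler.handle(line);
        return true;
    }

    @Override
    public void placeDone() {
        // nothing to finish up
    }
}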
The Extension Point Repo
So what is the nature of this extensions() thing, then? I'm glad you asked. Although you may come to regret it.
It's basically a map from a class to a list of Creators. Providers can register either a class or a Creator as a possible instantiation of the class of the extension point (a class will be instantiated through reflection; a Creator is an object which is asked to do the creation). Consumers can then ask for a List or a Map of instantiated objects attached to the extension point.
public interface ExtensionPointRepo {
    <T, Z extends T, Q> void bindExtensionPoint(Class<T> ep, Creator<Z, Q> impl);
    <T, Z extends T> void bindExtensionPoint(Class<T> ep, Class<Z> impl);
    <T extends NamedExtensionPoint, Q> Map<String, T> forPointByName(Class<T> clz, Q ctorArg);
}
Here, T is the class of the extension point, Z is the class of the implementation, and Q is the class of the argument which will be passed in to (each) constructor or Creator. As yet, only the method to obtain a Map of instantiated objects has been implemented, because that is the only one I have needed. It obviously requires the keys in the Map to come from somewhere, so each extension point, when instantiated, must be able to return a String with its name in it. This is "enforced" by the NamedExtensionPoint interface:
public interface NamedExtensionPoint {
    String name();
}
(For Java generics nerds wondering why there is no Z in the final method of ExtensionPointRepo, it's because each member of the Map will have its own type Zi, and Java can't handle that, except to say that you can gloss over it and just give a common base class - in this case T.)
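Putting the two halves together, usage looks something like this (a sketch: the repo variable and the "footnote" key are illustrative, but the method names come from the interface above):

// a provider (such as InstallDocModule) binds implementations...
repo.bindExtensionPoint(AmpCommandHandler.class, FootnoteAmp.class);
repo.bindExtensionPoint(AmpCommandHandler.class, LinkAmp.class);

// ...and a consumer later asks for all of them, keyed by name()
Map<String, AmpCommandHandler> handlers =
    repo.forPointByName(AmpCommandHandler.class, scannerState);
AmpCommandHandler footnote = handlers.get("footnote");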
Sharing State
One of the big problems with modularizing things is that some of the state that you need can no longer be held in just one place. Multiple processors all need to access - and update - the same state. This is a pain.
For a start, it means that there are dependencies between the processors: if I want to change how one processor works, and thus what it stores and how, then any other processors that share that state need to change too. This removes the key feature of modularity: the ability to have truly third-party modules.
Secondly, it may limit the functionality of a plug-in module, because the module may want state from an existing module which "does not exist" or which "it cannot access" for one reason or another.
Thirdly, it may split up the functionality of a module, because various parts of its state may end up in different places.
I am going to share two examples of using state in this way. The first is footnotes, which is a fairly simple use case but shows that the state can get broken apart. The second will be to do with generating a table of contents.
Footnotes
In the document formatter, footnotes come in two "parts". First, there is a quick, inline command &footnote which appears in the middle of the sentence where the footnote marker is going to appear. Then, at some (probably later) point, there is an @Footnote...@/ block which provides the text of the footnote. "Logically", these two need to share a reference to the footnote number, which would involve sharing state. And it seems reasonable that the two of them should share a single state object. But in fact, these two commands are fundamentally independent, and so two counters can be used. It would probably be good at the end of processing to check that the same number of footnote markers and footnotes have been generated, but I'm not going to bother for now.
As with almost everything in software, there are multiple approaches to handling this, and which one is chosen depends on what factors are most important in the given situation. If this were genuinely a third-party module, I would probably create a new sub-state (FootnoteState) that I would store in the main ConfiguredState. But as it is part of the overall document module, I'm just going to put each counter in the state associated with each command type:
public class InlineDocCommandState {
    private int nextFnMkr = 1;
    ...
    public int nextFootnoteMarker() {
        return nextFnMkr++;
    }
}

public class ScannerAtState {
    private int nextFnText = 1;
    ...
    public int nextFootnoteText() {
        return nextFnText++;
    }
}
And then these can be used by the command handlers, for example, to insert the footnote marker into the running stream of text:
public class FootnoteNumHandler implements InlineCommandHandler {
    private final InlineDocCommandState ics;

    public FootnoteNumHandler(InlineDocCommandState state) {
        ics = state;
    }
    ...
    @Override
    public void invoke() {
        state.nestSpan("footnote-number");
        state.text(Integer.toString(ics.nextFootnoteMarker()));
        state.popSpan();
        state.op(new SyncAfterFlow("footnotes"));
    }
}
Table of Contents
The code to handle processing a table of contents is more complicated, because it depends on the various blocks that indicate the structure of the document (@Chapter, @Section, etc). These need to know in advance that someone out there is interested in knowing the levels of nesting and the text of the section title. At the same time, the code to process the table of contents needs to be a truly separate module, because it needs to be configured with the appropriate locations to store the table of contents between runs ("obviously" generating a table of contents from the contents of the document, and then inserting it at the beginning, requires at least two runs; subsequent runs may be needed if inserting the table of contents then changes the page numbers).
Earlier, we discussed extension points. I am now going to introduce a new extension point called DocumentOutline. This is going to have one method, entry, which takes an integer level (where 0 is the mythical top level, 1 is the biggest division, 2 is a subdivision of 1, etc), a text title (the text which is used in the heading), and a "style" (the numbering style, e.g. roman vs arabic vs none).
public interface DocumentOutline extends ExtensionPoint {
    void entry(int level, String title, String style);
}
Unlike the command extension points, this is not named. And, unlike those, we do not attempt to get a map of unique names to extension points, but rather a set of extension points, all of which will end up being called:
public class ScannerAtState {
    ...
    private Set<DocumentOutline> outline;

    public void configure(ConfiguredState state) {
        ...
        this.outline = state.extensions().forPoint(DocumentOutline.class, this);
    }
    ...
    public void outlineEntry(int level, String text, String style) {
        for (DocumentOutline e : outline) {
            e.entry(level, text, style);
        }
    }
}
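One wrinkle: this calls forPoint, which is not in the ExtensionPointRepo interface shown earlier. Presumably the repo grows an unnamed variant; my guess at the signature, by analogy with forPointByName, is:

<T, Q> Set<T> forPoint(Class<T> clz, Q ctorArg);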
The rest of the work is just copying code from where it used to be to where it needs to be.
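For illustration, the table of contents module then presumably contributes an implementation shaped something like this (the class and its TableOfContents collaborator are hypothetical; only the DocumentOutline interface comes from above):

public class TocOutline implements DocumentOutline {
    private final TableOfContents toc;

    public TocOutline(TableOfContents toc) {
        this.toc = toc;
    }

    @Override
    public void entry(int level, String title, String style) {
        // record the heading so that a later run can emit the TOC pages
        toc.record(level, title, style);
    }
}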
Bumps in the Road
I'm not going to deal with these right now, but I am becoming increasingly aware of various bumps in the road which stand in the way of truly clean modularity.
The first is the fact that I am handling "lifecycle" events in an ad-hoc way. This will not be sustainable. By "lifecycle" events, I mean things like: creating a state, finishing a state, resetting chapter and section numbers, etc. In the fullness of time, I will need to add something (probably an extra interface to be implemented by extension points) which enables the code handling extension points to be notified when these events occur. Obviously, creating new extension points won't help with this situation, since they will not be connected with the old ones.
The second has to do with argument processing: it seems to me that as I split the code across modules, the outlineEntry method is too closely coupled to the current use case (the table of contents). level and text seem natural things to pass to an outline, but the style feels a bit of a hack. But then, in moving code across, I found I also needed a field called anchor (no, I'm not sure what that's about, either...). It feels to me that this should probably be modularized in some way, although obviously it will be important to allow the processing of arbitrary variables while at the same time ensuring that anything the user supplies which is not used generates an error. I suspect I will come back to this when I have more cases.
And thirdly, some of the code I copied across was specific to a subcase of the TOC module, which really should be a module plugged in to the TOC module. It goes without saying that it should be possible to extract this module and have everything work properly. But I'm not quite sure how to do that yet. This specific case (I think) can probably be handled by having another module attached to DocumentOutline. So, again, I will wait for more cases to appear before trying to clean anything up further.
And then ...
Having done all of that, and worked through a whole bunch of cases, I am able to load the resulting PDF into Acrobat and compare with the original. And there are no differences! Amazing.
For full disclosure, that did not happen the first time. Among other things, I had not connected up the code to render the PDF from the intermediate form :-) And there were a number of other minor issues that turned up during the comparison. But, all in all, very successful.