Now we come to the crux of why I am doing this refactoring. I currently have five separate front-end processors, three of which are related by being "prose processors", and two of those have a similar input structure. Currently, those three form an inheritance hierarchy. What I want to do is to simplify that to a single processor which is configurable by installing an array of "modules".
As I mentioned before, I don't think of any of this as "parsing" as such, more as "text processing". So the input files are broken into lines and each line is processed in turn. Currently, each of the processors takes a line of input at a time and consists of a sequence of "if-else" statements to find out what should be done with the line. My goal is to replace all of that with a single processor which asks each configured module if it can handle the line or not. Ultimately, there will always be a "default" handler which handles any input line which none of the others has; normally, this will apply to lines of text and to blank lines.
Which modules are installed into the processor is determined by two factors. Firstly, the name of the processor in the configuration file, for example processor blog or processor doc, determines the default handler and can also install a default set of handlers. Secondly, it is possible to specify additional modules in the configuration of the processor which will add more features to an extension point defined by the existing processor. And (and this is the most important part for my current purposes) these modules can also configure aspects of the system introduced by other modules.
Do what now?
OK, I can tell I've confused you (if I haven't, you're either doing extraordinarily well, or you really don't understand what's going on). So let's take an example. In my doc (documentation) formatter, I have two main control lines: lines beginning with @name introduce blocks (which control how subsequent text is formatted until a subsequent @/); and lines beginning with &cmd perform some command, replacing the line with some other line (or applying some special format to the line).
So, I might have this input:
@Chapter
title=Hello, World
This is the opening chapter of my book.
&example hello-world
Enough said.
@/
The @Chapter directive says that this is the start of a new chapter. Everything up until the @/ is part of this new chapter. The lines that immediately follow (up until a blank line) are key-value pairs, with the key being a simple alphanumeric string and the value being arbitrary text up until the end of the line. This is not at all dissimilar to the definition of an SGML or HTML element, just written in a form I find more elegant for my purposes. It is up to the code handling the @Chapter directive to decide what to do with this: as an example, chapters often carry numbers, and will be formatted appropriately (the chapter title is taken from the title parameter). The chapter may also be entered into a table of contents along with the current page number.
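To pin the format down, a minimal sketch of recognizing one of these key-value lines might look like this (the helper class and the choice of a regular expression are mine; the original parsing code is not shown):

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParamLines {
    // a simple alphanumeric key, then "=", then arbitrary text to end of line
    private static final Pattern KV = Pattern.compile("^([A-Za-z0-9]+)=(.*)$");

    // returns true if the line was a key-value pair and was recorded
    public static boolean parseParam(String line, Map<String, String> params) {
        Matcher m = KV.matcher(line);
        if (!m.matches())
            return false;
        params.put(m.group(1), m.group(2));
        return true;
    }
}

So title=Hello, World would record the key title with the value Hello, World.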
In this example, there are two lines of plain text, which are just processed as such and passed through to the back end along with the "current" paragraph style (as specified by the chapter for normal text).
The line beginning &example is an example of a command. In this case, the idea might be that we have somewhere a catalogue of examples to include in the book (extracted directly or indirectly from source code associated with the book in a repository). The argument to the &example command is the name of the example to be included - in this case hello-world. There needs to be a piece of code which does all this work and then spits out into the intermediate form a sequence of appropriately-formatted paragraphs containing the example code.
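As a sketch of the shape such a piece of code might take (every name here is hypothetical - the ExampleCatalogue, the Sink method, even the handler interface, whose real registrations appear later in the post):

public class ExampleAmp implements AmpCommandHandler {
    private final ExampleCatalogue catalogue; // hypothetical: knows where examples live
    private final Sink sink;                  // the intermediate form

    public ExampleAmp(ExampleCatalogue catalogue, Sink sink) {
        this.catalogue = catalogue;
        this.sink = sink;
    }

    @Override
    public void handle(String line) {
        // "&example hello-world" -> look up "hello-world" in the catalogue
        String name = line.substring("&example".length()).trim();
        for (String codeLine : catalogue.linesOf(name))
            sink.paragraph("example-code", codeLine);
    }
}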
All of this currently exists and is hardcoded into the doc processor. When reworking my configuration files I took the step of declaring that in fact I wanted the various commands that did this work to be configured as modules - I did this in large part because these operations themselves need to be configured (we need to know where to find the examples!) and I wanted to remove that configuration from the main doc processor.
In terms of this example, what I want to do is to have three levels of processing for &example:
- The main processor just breaks everything into lines and submits them for analysis by the handlers installed by the doc module;
- There is a handler which recognizes the & at the beginning of the line as the introduction to a command and then identifies the rest of the first token as a name; it then finds a handler for this command;
- This handler does all the hard work.
Note that there is nothing magical about the three levels here: some lines (blank lines and ordinary lines) will be handled by the "default" handler; meanwhile, some lines could require four (or more) levels, as each level identifies a customizable pattern and calls the appropriate nested handler.
The Goals
The main goal here is to eliminate all of the separate processors and replace them with just one configurable processor.
Additional goals include:
- Changing the way that the configuration is processed so that the argument to processor in the configuration file is used to find a class which will configure the default processor;
- Breaking up all of the big "if-else" statements and making them nested processors;
- Breaking up any sub-logic and making that nested-nested logic;
- Making the code more amenable to testing.
At the end of this phase, I would hope that everything would work again and the intermediate form output matches its original form.
In terms of testing, my goal is that in breaking the code up into modules, each module is going to follow a simple interface in which it is configured with a set of options and then is expected to process one or more lines. I would hope that I can test top-level modules by asserting that they delegate to the correct nested modules, and that bottom-level modules create the appropriate intermediate form. I would hope that a command such as &example can be tested by passing in an appropriate double of the file system and demonstrating that it finds the correct item in the file system and formats its contents correctly.
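To sketch what that might look like in practice (all of these names - ExampleCommand, the FileSystem double, the Sink assertion - are hypothetical placeholders, since none of this code exists yet; I'm using Mockito-style doubles purely for illustration):

import static org.mockito.Mockito.*;
import org.junit.jupiter.api.Test;

public class ExampleCommandTest {
    @Test
    public void itFindsAndFormatsTheNamedExample() {
        // a double of the file system, primed with one example
        FileSystem fs = mock(FileSystem.class);
        when(fs.contentsOf("examples/hello-world")).thenReturn("print(\"hello\")");
        Sink sink = mock(Sink.class);

        // the handler under test should find the example in the catalogue...
        ExampleCommand cmd = new ExampleCommand(fs, sink);
        cmd.invoke("hello-world");

        // ...and emit an appropriately-formatted paragraph
        verify(sink).paragraph("example-code", "print(\"hello\")");
    }
}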
Let's Get Started!
The ConfiguredProcessor
The first class I want to create is the ConfiguredProcessor. The idea is that this will be a drop-in replacement for DocProcessor, and when I have updated all the ConfigListeners which now create the various Processor subclasses to all use ConfiguredProcessor, all the other versions will be just so much dead code, and then I can delete all of it.
So I'm basically starting again, and in a fresh class I'm creating a simple processor:
public void process(FilesToProcess places) throws IOException {
    for (Place x : places.included()) {
        ConfiguredState state = new ConfiguredState();
        List<ProcessingScanner> all = createScannerList(state);
        // Each of the scanners gets a chance to act
        x.lines((n, s) -> {
            for (ProcessingScanner scanner : all) {
                if (scanner.handleLine(trim(s)))
                    return;
            }
            throw new CantHappenException("the default scanner at least should have fired");
        });
        for (ProcessingScanner scanner : all)
            scanner.placeDone();
    }
}
This is configured by giving it a list of classes which implement scanners (as well as two default handlers - one for blank lines and one for non-blank lines).
createScannerList is responsible for instantiating all of these and putting them in a list, ordered so that the scanners configured last are considered first, down to the handler for non-blank lines, which is considered last.
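A guess at the shape of createScannerList, just to make that ordering concrete (the field names and the instantiate helper are mine; only the ordering rule comes from the description above):

private List<ProcessingScanner> createScannerList(ConfiguredState state) {
    List<ProcessingScanner> all = new ArrayList<>();
    // scanners configured last are consulted first
    for (int i = configuredScanners.size() - 1; i >= 0; i--)
        all.add(instantiate(configuredScanners.get(i), state));
    // the two defaults always go last: blank lines, then everything else
    all.add(instantiate(blankHandler, state));
    all.add(instantiate(defaultHandler, state));
    return all;
}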
The main loop here considers each input file in turn. As currently written, each is processed individually and a new ConfiguredState is created as we process each file; we likewise create a new set of scanners for each file, sharing the same state. In the fullness of time, I expect to revisit this and have a "super-state" which can hold information which should persist across files (the table of contents is an obvious example of this, but many things are shared between the files in a book).
Each file is broken up into lines and each of these is processed by the inner lambda, which traverses the list of scanners until it finds a match. The scanner is responsible for handling the line if it matches. If nothing else matches, the non-blank line scanner should, so it is an error for the loop to finish.
Finally, when we have processed all the lines in the file, we allow all the scanners to do anything they feel is appropriate to finish up anything they were working on.
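It is worth writing down the contract that loop relies on. I haven't shown the ProcessingScanner interface itself, so this is inferred from the calls above:

public interface ProcessingScanner {
    // return true if this scanner recognized and consumed the line
    boolean handleLine(String line);

    // called once the whole file has been processed, to finish up
    void placeDone();
}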
Configuring It
So far, so good. But how do we configure it?
I'm still shying away from writing the actual code to include the "global" modules from declarations in the configuration file, but at this point, I definitely need to have that code somewhere. So I hacked into the code that reads the configuration file and pretended it loaded a couple of global modules. In the end, the module command will take a class name and then immediately instantiate it and call its install method. The interface GlobalModuleInstaller is used to define this contract.
public interface GlobalModuleInstaller {
    void install();
}
The implementations of this are then responsible for configuring everything that module needs - the availability of a configurable processor block, any sinks and intermediate form options, and of course, all the configuration for the processor - the scanners listed above as well as all of the commands that flow from that.
I started with the "doc" processor - the one that produces documents, books and documentation. I just started with one chapter of a manual to see how far I would get, and this is what I have (apologies for the length, but I think only with this amount of code does it become clear what I'm trying to do).
public class InstallDocModule implements GlobalModuleInstaller {
    private final ReadConfigState state;
    private final ScriptConfig config;

    public InstallDocModule(ReadConfigState state) {
        this.state = state;
        this.config = state.config;
    }

    @Override
    public void install() {
        state.registerProcessor("doc", DocProcessorConfigListener.class);
        installAtCommands();
        installAmpCommands();
    }

    // @ commands
    private void installAtCommands() {
        // structure
        this.config.extensions().bindExtensionPoint(AtCommandHandler.class, ChapterCommand.class);
        this.config.extensions().bindExtensionPoint(AtCommandHandler.class, SectionCommand.class);
        // should commentary be in a separate module?
        this.config.extensions().bindExtensionPoint(AtCommandHandler.class, CommentaryCommand.class);
    }

    // & commands
    private void installAmpCommands() {
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, FootnoteAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, FutureAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, LinkAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, OutrageAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, ReviewAmp.class);
        this.config.extensions().bindExtensionPoint(AmpCommandHandler.class, TTAmp.class);
    }
}
As an aside, I need to point out that this code is unlikely to stay like this: both ReadConfigState and ScriptConfig are concrete classes where they should be interfaces, and both of them are rather too strongly connected to the whole configuration process rather than being more abstract entities. What we really want here is something that says "if you are a module, I know how to offer all the things you need in order to configure yourself".
The key method here is install(). Its first step is to say that the module contains a class DocProcessorConfigListener which should be activated if the configuration contains the line processor doc. The configuration listener is responsible for processing all the options and then instantiating the processor - which must, of course, be a ConfiguredProcessor - and then configuring it based on all of the instructions in the configuration file (see below). The other two lines call the methods that install the various command handlers for the @ and & commands needed for this chapter (I will add more as I expand the number of cases I am considering). Each of these methods consists of introducing a number of commands associated with an "extension point". This is simply a class name which is used as a unique identifier. Later, the code will ask for all of the commands associated with an extension point. The key here is that this enables us to extend not just the known parts of the system, but also the unknown parts of the system, and in a way that will later enable us to add modules to the modules. You will see already that I am questioning if the @Commentary command should in fact be moved off into a separate module.
The ConfigListener
As noted above, the install() method registers a handler for the processor doc command. This causes the associated class to be instantiated. It is then given each (parsed) line in the configuration file nested underneath that command to handle. This is done in the dispatch method:
@Override
public ConfigListener dispatch(Command cmd) {
    switch (cmd.name()) {
    case "module": {
        ModuleConfigListener nmc = new NestedModuleCreator(state).module(cmd.line().readArg());
        modules.add(nmc);
        return nmc;
    }
    case "joinspace":
    case "meta":
    case "scanmode": {
        vars.put(cmd.depth(), cmd.name(), cmd.line().readArg());
        return null;
    }
    default: {
        throw new NotImplementedException("doc processor does not have parameter " + cmd.name());
    }
    }
}
Again, this is a work in progress and I'm not overly happy with it. But there are two basic cases here: if the command is a module command, we want to hand off processing to a nested processor which will create the module and attach it to this processor; and in all other cases we have a simple key/value pair which we store in a dictionary. This latter fork of the code is in fact the legacy of how the whole configuration used to be processed: in the fullness of time, I expect to make it look the way you would expect, at least with each "key" having its own field in the class, if not its own processing to support an arbitrary number of parameters.
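To make the two cases concrete, the configuration fragment this method consumes presumably looks something like this (the indentation-based nesting and the placeholder values are my illustration; only the command names come from the code above):

processor doc
  joinspace ...
  scanmode ...
  module org.example.SomeDocModule

Each module line can then carry its own nested configuration, which is handed to the ModuleConfigListener returned from dispatch.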
The NotImplementedException here reflects a refactoring in progress: I don't know for sure that the parameter is not valid; I just know that I haven't handled it yet. At the end of the refactoring, I expect I will come back and change these to all be proper errors.
When all the configuration has been parsed, the appropriate objects will be instantiated by calling the complete method on the ConfigListener:
@Override
public void complete() throws ConfigException {
    try {
        Sink sink = state.config.makeSink();
        ConfiguredProcessor proc = new ConfiguredProcessor(state.config.extensions(), sink...);
        proc.setDefaultHandler(StandardLineProcessor.class);
        proc.setBlankHandler(NewParaProcessor.class);
        proc.addScanner(AmpSpotter.class);
        proc.addScanner(AtSpotter.class);
        proc.addScanner(FieldSpotter.class);
        state.config.processor(proc);
        for (ModuleConfigListener m : modules) {
            m.activate(proc);
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        throw new ConfigException("Error creating DocProcessor: " + ex.getMessage());
    }
}
Again, there are rough edges here as we find ourselves in the middle of a refactoring, but the structure is about right: we need to obtain a sink which represents the intermediate form; we create an instance of the ConfiguredProcessor, which needs to at least have the sink; we configure it by adding handlers and scanners for the various text lines; we register it as a processor in the repository; and then we take all the nested modules we collected during the configuration process and activate them by passing in the configured processor. This gives them the opportunity to attach their own scanners and also to attach to any extension points that have been created.
The processor is aware of the map of extension points because it is passed in to the constructor as state.config.extensions(). I'm again not happy with doing it this way - that may well be cleaned up as we go along and be passed into the ConfigListener as a separate argument; or else I may go the other way and pass around a state or a config pointer. It all really depends on what feels right when (if?) I get to that level of polish (or testing).
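Since AmpSpotter keeps coming up, here is a sketch of what a scanner at that middle level might look like inside. The real class isn't shown in this post, so the body below - and the handle method I've given AmpCommandHandler - is just my guess at the shape of the middle level described earlier:

import java.util.Map;

public class AmpSpotter implements ProcessingScanner {
    private final Map<String, AmpCommandHandler> commands;

    public AmpSpotter(Map<String, AmpCommandHandler> commands) {
        this.commands = commands;
    }

    @Override
    public boolean handleLine(String line) {
        // only claim lines that introduce an & command
        if (!line.startsWith("&"))
            return false;

        // the rest of the first token is the command name
        String name = line.substring(1).split("\\s+", 2)[0];
        AmpCommandHandler handler = commands.get(name);
        if (handler == null)
            return false; // unknown command: let a later scanner deal with it

        // the nested handler does all the hard work
        handler.handle(line);
        return true;
    }

    @Override
    public void placeDone() {
        // nothing to finish up
    }
}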
The Extension Point Repo
So what is the nature of this extensions() thing, then? I'm glad you asked. Although you may come to regret it.
It's basically a map from a class to a list of Creators. Providers can register either a class or a Creator as a possible instantiation of the class of the extension point (a class will be instantiated through reflection; a Creator is an object which is asked to do the creation). Consumers can then ask for a List or a Map of instantiated objects attached to the extension point.
public interface ExtensionPointRepo {
    <T, Z extends T, Q> void bindExtensionPoint(Class<T> ep, Creator<Z, Q> impl);
    <T, Z extends T> void bindExtensionPoint(Class<T> ep, Class<Z> impl);
    <T extends NamedExtensionPoint, Q> Map<String, T> forPointByName(Class<T> clz, Q ctorArg);
}
Here, T is the class of the extension point, Z is the class of the implementation, and Q is the class of the argument which will be passed in to (each) constructor or Creator. As yet, only the method to obtain a Map of instantiated objects has been implemented, because that is the only one I have needed. It obviously requires the keys in the Map to come from somewhere, so each extension point, when instantiated, must be able to return a String with its name in it. This is "enforced" by the NamedExtensionPoint interface:
public interface NamedExtensionPoint {
    String name();
}
(For Java generics nerds wondering why there is no Z in the final method of ExtensionPointRepo, it's because each member of the Map will have its own type Zi, and Java can't handle that, except to say that you can gloss over it and just give a common base class - in this case T.)
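Putting the two halves together, usage looks something like this (a sketch: the repo variable and the "footnote" key are illustrative, but the method names come from the interface above):

// a provider (such as InstallDocModule) binds implementations...
repo.bindExtensionPoint(AmpCommandHandler.class, FootnoteAmp.class);
repo.bindExtensionPoint(AmpCommandHandler.class, LinkAmp.class);

// ...and a consumer later asks for all of them, keyed by name()
Map<String, AmpCommandHandler> handlers =
    repo.forPointByName(AmpCommandHandler.class, scannerState);
AmpCommandHandler footnote = handlers.get("footnote");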
Sharing State
One of the big problems with modularizing things is that some of the state that you need can no longer be held in just one place. Multiple processors all need to access - and update - the same state. This is a pain.
For a start, it means that there are dependencies between the processors: if I want to change how one processor works, and thus what it stores and how, then any other processors that share that state need to change too. This removes the key feature of modularity: the ability to have truly third-party modules.
Secondly, it may limit the functionality of a plug-in module, because the module may want state from an existing module which "does not exist" or which "it cannot access" for one reason or another.
Thirdly, it may split up the functionality of a module, because various parts of its state may end up in different places.
I am going to share two examples of using state in this way. The first is footnotes, which is a fairly simple use case but shows that the state can get broken apart. The second will be to do with generating a table of contents.
Footnotes
In the document formatter, footnotes come in two "parts". First, there is a quick, inline command &footnote which appears in the middle of the sentence where the footnote marker is going to appear. Then, at some (probably later) point, there is an @Footnote...@/ block which provides the text of the footnote. "Logically", these two need to share a reference to the footnote number, which would involve sharing state. And it seems reasonable that the two of them should share a single state object. But in fact, these two commands are fundamentally independent, and so two counters can be used. It would probably be good at the end of processing to check that the same number of footnote markers and footnotes have been generated, but I'm not going to bother for now.
As with almost everything in software, there are multiple approaches to handling this, and which one is chosen depends on what factors are most important in the given situation. If this were genuinely a third-party module, I would probably create a new sub-state (FootnoteState) that I would store in the main ConfiguredState. But as it is part of the overall document module, I'm just going to put each counter in the state associated with each command type:
public class InlineDocCommandState {
    private int nextFnMkr = 1;
    ...
    public int nextFootnoteMarker() {
        return nextFnMkr++;
    }
}

public class ScannerAtState {
    private int nextFnText = 1;
    ...
    public int nextFootnoteText() {
        return nextFnText++;
    }
}
And then these can be used by the command handlers, for example, to insert the footnote marker into the running stream of text:
public class FootnoteNumHandler implements InlineCommandHandler {
    private final InlineDocCommandState ics;

    public FootnoteNumHandler(InlineDocCommandState state) {
        ics = state;
    }
    ...
    @Override
    public void invoke() {
        state.nestSpan("footnote-number");
        state.text(Integer.toString(ics.nextFootnoteMarker()));
        state.popSpan();
        state.op(new SyncAfterFlow("footnotes"));
    }
}
Table of Contents
The code to handle processing a table of contents is more complicated, because it depends on the various blocks that indicate the structure of the document (@Chapter, @Section, etc). These need to know in advance that someone out there is interested in knowing the levels of nesting and the text of the section title. At the same time, the code to process the table of contents needs to be a truly separate module, because it needs to be configured with the appropriate locations to store the table of contents between runs ("obviously" generating a table of contents from the contents of the document, and then inserting it at the beginning, requires at least two runs; subsequent runs may be needed if inserting the table of contents then changes the page numbers).
Earlier, we discussed extension points. I am now going to introduce a new extension point called DocumentOutline. This is going to have one method, entry, which takes an integer level (where 0 is the mythical top level, 1 is the biggest division, 2 is a subdivision of 1, etc), a text title (the text which is used in the heading), and a "style" (the numbering style, e.g. roman vs arabic vs none).
public interface DocumentOutline extends ExtensionPoint {
    void entry(int level, String title, String style);
}
Unlike the command extension points, this is not named. And, unlike those, we do not attempt to get a map of unique names to extension points, but rather a set of extension points, all of which will end up being called:
public class ScannerAtState {
    ...
    private Set<DocumentOutline> outline;

    public void configure(ConfiguredState state) {
        ...
        this.outline = state.extensions().forPoint(DocumentOutline.class, this);
    }
    ...
    public void outlineEntry(int level, String text, String style) {
        for (DocumentOutline e : outline) {
            e.entry(level, text, style);
        }
    }
}
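One wrinkle: this calls forPoint, which is not in the ExtensionPointRepo interface shown earlier. Presumably the repo grows an unnamed variant; my guess at the signature, by analogy with forPointByName, is:

<T, Q> Set<T> forPoint(Class<T> clz, Q ctorArg);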
The rest of the work is just copying code from where it used to be to where it needs to be.
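For illustration, the table of contents module then presumably contributes an implementation shaped something like this (the class and its TableOfContents collaborator are hypothetical; only the DocumentOutline interface comes from above):

public class TocOutline implements DocumentOutline {
    private final TableOfContents toc;

    public TocOutline(TableOfContents toc) {
        this.toc = toc;
    }

    @Override
    public void entry(int level, String title, String style) {
        // record the heading so that a later run can emit the TOC pages
        toc.record(level, title, style);
    }
}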
Bumps in the Road
I'm not going to deal with these right now, but I am becoming increasingly aware of various bumps in the road which stand in the way of truly clean modularity.
The first is the fact that I am handling "lifecycle" events in an ad-hoc way. This will not be sustainable. By "lifecycle" events, I mean things like: creating a state, finishing a state, resetting chapter and section numbers, etc. In the fullness of time, I will need to add something (probably an extra interface to be implemented by extension points) which enables the code handling extension points to be notified when these events occur. Obviously, creating new extension points won't help with this situation, since they will not be connected with the old ones.
The second has to do with argument processing: it seems to me that as I split the code across modules, the outlineEntry method is too closely coupled to the current use case (the table of contents). level and text seem natural things to pass to an outline, but the style feels a bit of a hack. But then, in moving code across, I found I also needed a field called anchor (no, I'm not sure what that's about, either...). It feels to me that this should probably be modularized in some way, although obviously it will be important to allow the processing of arbitrary variables while at the same time ensuring that anything the user supplies which is not used generates an error. I suspect I will come back to this when I have more cases.
And thirdly, some of the code I copied across was specific to a subcase of the TOC module, which really should be a module plugged in to the TOC module. It goes without saying that it should be possible to extract this module and have everything work properly. But I'm not quite sure how to do that yet. This specific case (I think) can probably be handled by having another module attached to DocumentOutline. So, again, I will wait for more cases to appear before trying to clean anything up further.
And then ...
Having done all of that, and worked through a whole bunch of cases, I am able to load the resulting PDF into Acrobat and compare with the original. And there are no differences! Amazing.
For full disclosure, that did not happen the first time. Among other things, I had not connected up the code to render the PDF from the intermediate form :-) And there were a number of other minor issues that turned up during the comparison. But, all in all, very successful.