Wednesday, August 21, 2024

Reading the Configuration

One of the things that is tricky in these situations is that you simultaneously want to change things and want them to stay the same while you are changing the way they work.

I know I definitely don't want to make any changes to any of the source files I have (dozens, at least). I want all of them to keep on working just as they have.

But I do know that the whole point of this is to make the configuration more modular, so I will ultimately have to change the configuration files so that they specify the modules in use. Part of me wants to do that up front, but the sensible part of me wants to do it last. So, for now, as I start to pull things out of the "main" code into modules, I am just going to "hardcode" the loading of the modules so that the configuration files do not change. Only at the last minute will I then quickly change all of the configuration files and strip out the hardcoded module definitions.

The ReadConfig class

I may have mentioned before that I write a lot of parsers. And in some sense, this project is a string of parsers connected in one way or another. But I would say that, in fact, it is more just a "text processor" at the moment. There is no abstract grammar involved - I just look at each line of text in context and do a switch based on the first token on the line.

And, tempting though it might be, I'm not going to change that. Quite simply because in making everything modular, I am making it less susceptible to an overarching model. Such a model would promote consistency (which is always a good thing), but would also place arbitrary constraints on how the modules work - and potentially on what they can do at all - and that is a bad thing. I don't know what every single module might want to do, so I will keep the interface as simple as possible.
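
To make that concrete, the heart of the processing is just a dispatch on the first token of each line. A minimal sketch - the keywords and handler names here are made up for illustration, not the real code:
        public void line(String text) throws ConfigException {
                if (text.isBlank())
                        return; // ignore blank lines
                // split off the first token; the rest of the line is its arguments
                String[] parts = text.trim().split("\\s+", 2);
                String rest = parts.length > 1 ? parts[1] : "";
                switch (parts[0]) {
                case "title":
                        handleTitle(rest);
                        break;
                case "sink":
                        handleSink(rest);
                        break;
                default:
                        throw new ConfigException("cannot handle " + parts[0]);
                }
        }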

The ReadConfig class itself, though, knows almost nothing about this. It just creates a ConfigParser class and pumps it the contents of the Place that it is passed in.
        public Config read() throws ConfigException {
                // the parser does all the real work; ReadConfig just pumps it the lines
                ConfigParser parser = new ConfigParser(universe, place.region());
                place.lines(parser);
                try {
                        return parser.config();
                } catch (IOException ex) {
                        throw new ConfigException("Could not read configuration " + place + ": " + ex.toString());
                } catch (Exception ex) {
                        throw WrappedException.wrap(ex);
                }
        }

The ConfigParser class

The ConfigParser as it currently stands has too many responsibilities: in reality, it should just take the input lines, turn them into a set of actions, and inform a listener that something has happened. Because the configuration is hierarchical, it should keep a hierarchy of listeners that reflects the level of nesting. At the various levels, some of the rules remain the same; others change.

In fact, this class has the responsibility for creating the final configuration, as well as constructing all the objects from the parsed configuration. My first task is going to be to break these out. While I'm about this, I'm going to move all of the parsing code into a new package config.reader, and try and leave the actual configuration classes in the config package.
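
As a sketch of the direction I have in mind, the parser would keep a stack of listeners, pushing one each time the nesting level increases and popping on the way back out, so that each line is offered to the listener for the current scope. The ConfigListener interface and the method names here are illustrative, not the real code:
import java.util.ArrayDeque;
import java.util.Deque;

public class ConfigParser {
        // one listener per level of nesting; both methods are hypothetical
        public interface ConfigListener {
                ConfigListener dispatch(String line); // returns a nested scope, or null
                void complete();                      // called when the scope closes
        }

        private final Deque<ConfigListener> scopes = new ArrayDeque<>();

        public ConfigParser(ConfigListener root) {
                scopes.push(root);
        }

        public void line(int indent, String text) {
                // unwind to the listener that owns this level of nesting ...
                while (scopes.size() > indent + 1)
                        scopes.pop().complete();
                // ... then hand it the line; it may open a nested scope
                ConfigListener nested = scopes.peek().dispatch(text);
                if (nested != null)
                        scopes.push(nested);
        }
}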

A Nasty Surprise

I was in the middle of reorganizing all of the configuration code, and was about halfway through, when suddenly it asked me to authenticate with Blogger. I wasn't expecting that. It turns out that the constructor for the BloggerSink immediately contacts Blogger to find out which posts are live. This seems out of order to me, but, when I thought about where I would expect it to go, I started looking for a phase where back ends are initialized, and realized there is no such place. So I think I will have to consider this when revisiting main() and make sure that for each phase there is a clear "initialize front end processors" and "initialize back end processors".

Interestingly, the drive loader did not seem to load anything from Google Drive during the configuration step. But did it load the index file from disk? Or is that something else again? It seems like there may be more gremlins hiding in this code than I had realized.

Losing Control ...

It's usually about this point that a major fork appears in the road and it feels like I am losing control of the changes. That is happening here. The fork appears when you have to choose between nailing down one section of the code and keeping the code working. There is no "one, true" path. This is the price that must be paid for not doing things right in the first place.

In this case, I have the choice between taking the configuration all the way up front, or building some "scaffolding" to keep all the other cases working while I work through the entirety of one case (say the blogger portion). Given that there are no satisfactory solutions, it won't be surprising to hear that I have never found one. And, this time, as usual, I have decided that I don't like the inefficiency of building scaffolding so that I can keep jumping around in the code; I would rather tackle one part of the code and nail it down properly. On the other hand, I am doing the minimum for the moment rather than truly "cleaning up" the code; I am just moving what I have to around to make the configuration more "modular": the rest of the code is still hard-coupled together.

In some ways, this is a problem - for a while at least, I'm not going to be able to test anything. And I'm certainly not going to be able to release anything. And there is a lot of work left undone. On the other hand, this always was a huge task, and the configuration files are the one thing in all the projects that needs to change (after all, we are trying to introduce modules into the configuration), so once we are through this phase, everything should settle down a little. So my "justification" is: we can't go all the way to our destination on a nice, smooth road, and this feels like the shortest amount of off-roading, so let's get on with it and back onto a road where all our (regression) tests pass again (i.e. all the existing cases where I have used the formatter).

One of the consequences of this choice is that some of the aspects of the configuration are going to have rough edges for now: for example, in refactoring the way in which we think about filesystems, the "Google Drive loader" is no longer specifically loading from Google Drive: it's loading from an abstract point in the filesystem. That needs to be cleaned up, but it's not urgent. We are also saying "why not allow multiple loaders?" but that is not going into the code just yet. So in these cases, we may have some scaffolding; and in others, the code may just be inconsistent. I'm not even 100% sure that by the end, when I declare "victory", it will all be done. Of course, if this were a professional project, the existence of end-to-end tests covering all possible cases would force me into making sure that was done. But in this case, I am just using my existing projects to drive the work forward, and when they all pass I will be happy.

Likewise, one of the biggest things I need to do is to make the system extensible, and that means pulling all of the module code out of the main body of code. As I'm going, I'm trying to move code into packages called "*.modules.name.*", but what I'm not doing yet is requiring the configuration file to have lines in it that look like:
module classname
But that is a step I need to go back to as soon as I have finished the first go-around of modifying the configuration processor.
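
When I do get to it, handling a module line should be a small piece of reflection; something along these lines, where ConfigModule is a made-up name for whatever interface modules end up implementing:
public class ModuleLoader {
        // hypothetical marker for whatever modules will actually implement
        public interface ConfigModule {
        }

        // instantiate the module named on a "module classname" line
        public static ConfigModule load(String classname) throws ConfigException {
                try {
                        Class<?> clz = Class.forName(classname);
                        return (ConfigModule) clz.getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException | ClassCastException ex) {
                        throw new ConfigException("could not load module " + classname + ": " + ex);
                }
        }
}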

Then Just Like That ...

And then, suddenly (as always), I make a couple of changes and I'm out the other side. All the configurations load and parse, and when I turn the rest of the formatter back on I'm just a couple of bugfixes away from everything (mainly) working.

The very fact that you are reading this says that I have all of the blogger code working at least to a first pass.

The one exception is that in reorganizing the configuration of the processor files, I explicitly said some variables would be used to configure "modules" - but nowhere did I specify those modules or make the code pass the variables to them. Consequently, those features simply don't work.

I'm not sure that was what I was going to work on next, but it is "divergent" - that is, I would expect it to create more problems than it fixes - so I think I'll tackle that next.

ScriptFormatter main()

Putting in place all the regression testing I wanted took more time than I would have liked. There were a number of reasons for this.
  • Actually isolating the intermediate form and writing it to a file took some time;
  • Reading this back in and decoding it to text, along with making sure everything was correct, took a while;
  • Finding all the places where I had used ScriptFormatter, adding them to a regression script and then getting them all to work again (ScriptFormatter has "drifted" over the years, not least because there was no automated testing :-));
  • A particular issue was the "presentation" module, which (understandably, if wrongly) worked completely differently and didn't split into a "front" and "back" end.
The last one required me to do considerably more surgery than I was comfortable with on a piece of code that I'm trying not to adjust. But there you go. The price you pay.

But once I had done all that, I had a regression suite I was happy with, and that seemed to work quite well.
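
For what it's worth, the regression tests themselves are nothing clever: run the front end over a known input and compare the captured intermediate form with a checked-in copy. A sketch, assuming JUnit; runFrontEnd and the file names are stand-ins for however the front end actually gets invoked:
import static org.junit.Assert.assertEquals;

import java.nio.file.Files;
import java.nio.file.Path;

import org.junit.Test;

public class GoldenTests {
        @Test
        public void theBlogCaseStillProducesTheSameIntermediateForm() throws Exception {
                Path actual = runFrontEnd(Path.of("golden/blog/input"));
                assertEquals(Files.readString(Path.of("golden/blog/expected.im")),
                                Files.readString(actual));
        }

        private Path runFrontEnd(Path input) {
                // stand-in: wire this up to the real front end
                throw new UnsupportedOperationException("not wired up in this sketch");
        }
}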

Replacing main()

My first goal was to try and pull main() into the shape I wanted. I suspected it would not be too hard, and it wasn't. It was basically in the right shape already, just with too many bits and pieces distributed in the wrong places for my liking. While I was about it, I moved "everything else" somewhere else. The idea, of course, is to be able to make all of "everything else" completely testable, and for main() to be the only thing that deals in "real" classes. It's worth pointing out that I haven't done that yet, but it at least gives that appearance. I will come back to that, although I haven't quite decided what gets shared here and what doesn't.

So, for now, main() looks like this:
package com.gmmapowell.script;


import com.gmmapowell.geofs.Universe;
import com.gmmapowell.geofs.lfs.LocalFileSystem;
import com.gmmapowell.geofs.simple.SimpleUniverse;
import com.gmmapowell.script.config.Config;
import com.gmmapowell.script.config.ConfigArgs;
import com.gmmapowell.script.intf.FilesToProcess;


public class Main {
        public static void main(String[] args) {
                Universe uv = new SimpleUniverse();
                LocalFileSystem lfs = new LocalFileSystem(uv);
                try {
                        // everything goes through the Config interface: read the
                        // configuration, collect the files, then generate/show/upload
                        Config cfg = ConfigArgs.processConfig(lfs, args).read();
                        FilesToProcess files = cfg.updateIndex();
                        cfg.generate(files);
                        cfg.show();
                        cfg.upload();
                } catch (Throwable t) {
                        ExceptionHandler.handleAllExceptions(t);
                }
        }
}
Yes, I've included the whole thing (imports and all) to show that there isn't any trickery going on. I did keep two other (exception-related) classes in the same package, but that's it. Yes, the ExceptionHandler class is static, but even so, it can be unit-tested if you are that way inclined (I'm not, at the moment, but the goal here is testability, not test coverage).

The eagle-eyed among you will notice that even now this does not quite reflect the outline I presented above. Partly because the names are wrong, but mainly because the steps themselves are wrong. But, remember, this is just the first step of a much bigger task: as that task unfolds, I will come back to this and make subtle adjustments.

The key thing to note is that everything already goes through a single object - the configuration cfg - and this is already an interface. The static class ConfigArgs is responsible for creating an instance of the ConfigReader interface and presenting it to us, at which point we call the exposed read() method. This separation allows us to test the ReadConfig class directly without worrying about the pesky details of file systems and the like.

Moving On

Having got this into shape, I can now tackle the thing I most want to tackle - reading the configuration file, breaking it up into "modules", and then introducing a class to handle modular configuration and dispatch. I'm hoping that this will turn out to be just what I need when I come to process the modules within my text files. Oh, and the thing that started me down this path: wanting to add nested modules within the modules.

I think this is the design I had in mind when I started this project, although after four years it's hard to be sure. But it certainly seems that a lot of the code falls that way. So let's get started on refactoring the code that reads the configuration.

Tuesday, August 13, 2024

Rescuing ScriptFormatter

I don't like duplication. And I generally don't like graphical things. I do, on the other hand, like things that are shared seamlessly between environments.

So when it comes to writing documents, I have moved towards writing everything in Google Docs. But I don't format anything there. At least, I don't use the Google Docs formatting tools.

One reason is that I prefer structure over appearance: while it's important to me to be able to indicate that I want certain words to be bold or italic at the drop of a hat, I want to be explicit about chapter and section breaks and not have to guess what's going on based on fonts.

So when I was writing a movie script during lockdown, I didn't "format" it in my word processor, I just wrote what came naturally and then said "I'll write a program to transform that to the appropriate standard". And I did.

It was a quick hack, and it was somewhat experimental - it involved downloading from Google Docs, and it involved generating PDFs. And while I had sort of done those things before, all of this was basically new to me, so I just went ahead and did it. And I hardcoded things like styles as well.

When I wanted to format some documentation, I thought "well, I've got that, so I could just...".

When I wanted to write a book, I said "well, it isn't so very different...".

When I became completely bored of formatting this blog in the Blogger tools, I said, "well, I could integrate with Blogger...".

When I wanted to work on a presentation, I said, "it would be cool if I could extend this..."

And now I have a mess.

The Good

The good news is that with the amount of experience I have, I did at least lay out a basic design and structure first time out that reflects a four-stage pipeline. There are the obvious two stages of "front end" (parsing all the input files into an intermediate form) and "back end" (turning that into something readable). There are also the stages of collecting the input files (e.g. from Google Drive) and uploading the final result somewhere (e.g. to Blogger). The process of configuring the tool might be considered a separate stage (or two). So we can say that there are six stages, and they are largely separated in the code:
  • Read a configuration file,
  • Configure the tool, finding any local files and loading modules,
  • Download source files from all appropriate remote locations,
  • Parse the source files,
  • Generate the output file(s),
  • Upload the files.
The individual processors are currently all classes in separate packages within the project, and have a flexible configuration mechanism. A lot of the processors use inheritance to share code, although in a haphazard way.

The Bad

Everything is currently monolithic and in the same project, and there is no ability to extend this.

There is quite a bit of duplication. For example, I have two different pieces of code (which work slightly differently) to access git repositories - one for the documentation tool and one for the blogger tool.

There are basically no tests of any description.

The Ugly

This is not really a piece of software that "does one thing well". In fact, it does a whole bunch of different things fairly well, but with no real consistency or vision. In other words, it's a hack.

I guess in a lot of ways, that's what I paid for when I built it. But it's not what I want now, and certainly not what I want going forward.

Often I up my game when I want to add new features and can't see how to add a new one without retrenching. On this occasion, however, it's simply that I've reached a point where I can see more clearly what "doing one thing well" would look like. And thus how to factor (or modularize) this whole thing correctly.

For now, I'm going to do it all within one project, but I want to build it out to be extensible across multiple projects.

The Larger Design Space

In the original version of his paper on the spineless, tagless, G-machine, Simon Peyton Jones described how different implementations of functional language machines originally "appeared as isolated 'islands'" but later it was possible to see how they related to each other in a "larger design space". Something very similar has happened here: I have five or six applications, all of which fall into the overall design space of "convert annotated input from somewhere (possibly with additional metadata from elsewhere) into some readable document somewhere".

That is the one thing we want to do well. Including, of necessity, the task of finding all the modules that the user wants to incorporate and plugging them all together.

Testing and Testability

One thing I'm going to insist on as we do this is that everything I build should be testable at multiple levels:
  • I want lots of unit tests wrapped around the individual "functions" within modules;
  • I want to be able to test that the configuration tool works correctly;
  • I want some tests which assert that the interfaces to the external world (e.g. Google Drive, Blogger, git) work;
  • I want the ability to create "golden" tests for whole phases - particularly the "front end" - that enable me to assert that certain collections of features work together nicely.
I tend to distinguish between testing and testability for a couple of reasons: one is that I am simply not very good at always "testing first" - I am often driven more by the overall goal or sweep of a story rather than specifically writing tests; the other is that it takes a lot more effort to make something testable than it does to write a specific test - so once you have made something "testable", it becomes a lot easier to actually test it.

My goal in this rescue project is to do the hard yards of making everything testable. I can then write tests as and when I feel like it.

Testability and the File System

"Always design to interfaces, not classes", we are told, and then we are given filesystem abstractions that are nothing like that. I don't understand why.

Over my life, I have tried fifty different ways to resolve and simplify the poor tools that we are given for dealing with files. So here is number 51. The main goal is testability, but I'm also keen on reducing bloat, making sure that it is easy to do the things I want to do, thus reducing duplication, and encouraging a "tell-don't-ask" style.

The Universe/World/Region/Place model

If you start off thinking about "content" rather than files, you immediately move to something like a URI notation rather than a path. I'm going for something similar, except I want more "active" objects rather than just a path that you then need to do something with.

So every content document is somewhere in "the Universe". This "Universe" consists of providers, some of which are going to be on your local filesystem, some could be on remote servers (such as NFS or SMB), some could be in databases, and some could be in the cloud (Google Drive, iCloud or S3). I am going to model each of these providers as a "World", although I accept that there might be situations where a single provider would offer multiple "Worlds" for some reason (e.g. AWS might offer S3v1 and S3v2, built against V1 and V2 of the S3 client library).

Each world can offer a default "root" and multiple named "roots". A root is basically the starting point for navigation and gives you access to a "Region". Loosely, a "Region" corresponds to a directory, a folder, a label or a prefix (in S3). Each Region can then nest other Regions, and finally, within some Regions it is possible to find "Place"s. A Place is simply a content file.

The only solid class you need to create to make all of this happen is the Universe - after that, you go to the Universe to ask it for a World; you ask the World for a Region; Regions for subregions and Places; and you ask Places for their contents.

Each of these objects also has other operations you can perform - such as creating new Regions and Places, or writing to a Place. And each of these operations happens by interacting with an object you have already been given, not by creating a new concrete reader or writer implementation.
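
I won't reproduce the real interfaces here, but the shape is roughly this (method names from memory and somewhat illustrative; each interface would be in its own file, and LineListener is whatever consumes lines - the ConfigParser above is one such):
public interface Universe {
        World world(String name);              // e.g. "lfs" or "gdrive"
}

public interface World {
        Region root();                         // the default root
        Region root(String name);              // a named root
}

public interface Region {
        Region subregion(String name);
        Region ensureSubregion(String name);   // create it if it isn't there
        Place place(String name);
        Place ensurePlace(String name);
}

public interface Place {
        Region region();                       // the Region this Place lives in
        void lines(LineListener lsnr);         // push the contents a line at a time
}

public interface LineListener {
        void line(int lineNo, String text);
}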

This, of course, makes the whole thing testable: simply have your main() implementation create the "real" Universe and pass it in to your code; alternatively, create a double of the Universe (or smaller object) in your test code and pass that in to a class under test.
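
For example, a double for Place can just serve its contents from a String. Again, this is written against the sketch interfaces above, not the real ones:
public class StringPlace implements Place {
        private final Region region;
        private final String contents;

        public StringPlace(Region region, String contents) {
                this.region = region;
                this.contents = contents;
        }

        @Override
        public void lines(LineListener lsnr) {
                // feed the canned contents to the listener, one line at a time
                int n = 0;
                for (String s : contents.split("\n"))
                        lsnr.line(++n, s);
        }

        @Override
        public Region region() {
                return region;
        }
}
Hand one of these to the class under test and no real filesystem is involved.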

In addition to being testable, this is also extensible: either explicitly or by using a service provider model, it is possible to have the "real" Universe pick up drivers for other Worlds and configure them appropriately. As long as they implement all the necessary methods, they should just slot straight in.

Initial Impressions

Basically, I'm really happy with this. It took me a couple of days (admittedly, spread out over a week or two, but that's just the pace at which I'm working at the moment) to convert everything and implement most of this, along with unit tests and integration tests against Google Drive.

One of the things that I struggled with was how much of the code was very similar, and with extracting that. I don't think I want all of these things to inherit from some base classes, so I created some utility classes instead that can take the abstract references (to Worlds, Regions and Places) and then do the hard work in a central place. One of the things I wanted Places to do was offer a lines method that did all the hard work of providing the contents of the Place one line at a time: I found myself duplicating code about LineNumberReaders all over the place. Of course, I don't want to do that; I want to share that code, and keep only the code that acquires a Reader in the Place itself.
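
The shared piece ends up looking something like this: each Place only has to know how to produce a Reader, and one utility does the LineNumberReader dance for everybody. A sketch - the utility class name is invented, and LineListener is the sketch interface from above:
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;

public class GeoFSUtils {
        // drive a listener with the contents of a Reader, one line at a time;
        // every Place implementation can delegate its lines() method here
        public static void lines(Reader r, LineListener lsnr) throws IOException {
                try (LineNumberReader lnr = new LineNumberReader(r)) {
                        String s;
                        while ((s = lnr.readLine()) != null)
                                lsnr.line(lnr.getLineNumber(), s);
                }
        }
}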

Configuration and Modularity

In every project, there is a "bootstrapping" problem. In this case, it is: "how do you find the configuration file?". There are two parts to the answer to this question. The first part is that the application is going to take exactly one command-line argument, which is the name of the configuration file. The second part is that the configuration file must be somewhere in "the Universe" before configuration happens. For me, for now, that means it is on my local file system, and the driver for the local filesystem is loaded into "the Universe" in main(). After that, other worlds can be configured as modules in the configuration file.

This is, of course, not the only way of doing this. One alternative is to use some kind of "service provider" pattern in which any World implementation found on the classpath is automatically added to the Universe before looking for the configuration file. I have done this in other places on other occasions. I have never been as happy with it as I would have liked. It gives the appearance of reducing duplication, because you are not specifying a driver both in the classpath and in the configuration file. And when everything works, this is great. But when things do not work, you can spend forever trying to track down an obscure bug (I had this recently trying to use the javax.mail replacement: there was a nested dependency I had not included, and none of my IMAP messages were being decoded; four hours of my life I will not get back). But the thing is, at the end of the day, we are not talking about true duplication here: the classpath specifies where to look for what drivers are available; the configuration file selects which of these should be used. The real advantage is that if you want to use something that is not available, you fail early: an error message tells you that the driver you want is not available.
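
Either way, the split is the same: the classpath says what could be used; the configuration says what is used. If the discovery half were done with ServiceLoader (each driver registered under META-INF/services), a sketch might look like this - written against the World interface from earlier, and glossing over the fact that real Worlds probably need a Universe handed to them:
import java.util.ServiceLoader;

public class WorldDrivers {
        // the classpath determines which drivers are *available*; the
        // configuration file names the ones that are actually *used*
        public static World find(String classname) throws ConfigException {
                for (World w : ServiceLoader.load(World.class))
                        if (w.getClass().getName().equals(classname))
                                return w;
                // fail early: the configuration asked for a driver we don't have
                throw new ConfigException("no driver available for " + classname);
        }
}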

A Modular Pipeline

So, basically the tool that we are building is a configurable pipeline which:
  • Reads the configuration file;
  • Identifies the modules to be loaded and loads them;
  • Figures out which modules should be used in this configuration;
  • Configures them with any options the user might have provided;
  • Asks the "download" modules to collect all the relevant files;
  • Asks the "front end" modules to interpret the files and produce an intermediate form;
  • Asks the "back end" modules to convert the intermediate form into a final form;
  • Asks the "distribution" modules to upload files as necessary.
Note the plurals there. I think it is very reasonable that you could want, for example, multiple sources for your input files. Possibly multiple locations from the same system, or possibly a combination of systems (e.g. some local files and some from Google Drive). It's also possible you want to generate PDF, EPUB and HTML from the same intermediate form.
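
So the module interfaces would come in (at least) four flavours, and the configuration would hold a list of each, since every stage can have several modules in play. A hypothetical sketch - all the names are invented, IntermediateForm stands for whatever the intermediate form becomes, and each interface would be in its own file:
public interface Loader {
        // collect source files into a Region we can process
        void download(Region into) throws IOException;
}

public interface FrontEnd {
        // parse one source Place into the intermediate form
        void process(Place source, IntermediateForm out) throws IOException;
}

public interface BackEnd {
        // render the intermediate form to an output Place (PDF, HTML, ...)
        void generate(IntermediateForm in, Place output) throws IOException;
}

public interface Distributor {
        // push a generated Place to wherever it needs to go
        void upload(Place output) throws IOException;
}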

The core code, then, should be very simple. Probably 20-30 classes with a total of about 1-2 kloc. All the hard work will be done in the modules. But this is great, because the more modular everything is, the easier it is to test close to the bone. The main code should be very short indeed (10 lines?) with only a couple of concrete class names being invoked in order to get everything started.

Testing from the Top

But the current implementation has few, if any, tests. How do I know if I'm going to break anything?

Good question. I'm going to have a three-part strategy:
  • First off, I can eyeball things if I want to. For example, I'm writing this blog post in parallel with making the changes, and I'm checking through it as I go. If I break something obvious in the Blogger pipeline, it will show up.
  • I can compare "before" and "after" files somewhat automatically. For example, I can generate (but not upload) all my historical blog posts and then compare the output HTML to the HTML actually on my blog. Actually, I think I could add a different "distribution" module which directly compared the generated HTML to what's already up there (rather than uploading it) - there's a sketch of this below. I love it when you can use something you're developing to help you solve its own problems! It is possible to do much the same for PDF using Acrobat.
  • I am going to (early on) add code to "capture" the intermediate form in a file. After doing this, I will be able to write "golden tests" for the front end which transform a known input in a test directory and then assert that it generates the correct intermediate form.
And then I hope to strangle the rest of it by adding more and more tests.
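
That comparing "distribution" module from the second bullet might look something like this - a sketch against the hypothetical Distributor above, with BloggerClient standing in for whatever the real Blogger integration exposes:
public class ComparingSink implements Distributor {
        // stand-in for the real Blogger integration
        public interface BloggerClient {
                String livePost(Place p);
        }

        private final BloggerClient blogger;

        public ComparingSink(BloggerClient blogger) {
                this.blogger = blogger;
        }

        @Override
        public void upload(Place generated) {
                // compare rather than upload: diff what we generated against
                // what is actually live on the blog
                String ours = contents(generated);
                String live = blogger.livePost(generated);
                if (!ours.equals(live))
                        System.out.println(generated + ": generated HTML differs from the live post");
        }

        private String contents(Place p) {
                StringBuilder sb = new StringBuilder();
                p.lines((n, s) -> sb.append(s).append('\n'));
                return sb.toString();
        }
}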

Right Now

I've implemented the "Universe" model - at least as far as I need to - although I still want to clean it up and factor more things out.

Before getting into the hard work, I want to save the intermediate form to a file to make "big picture" testing easier.

Then I want to try and separate out the "pipeline" from the modules, while currently hardwiring in the modules so that things keep on working. My goal is to try and reduce main() to about 10 lines which reflect the claims I made above about it being an 8-step pipeline.

I'm not going to blog continuously about this (or generally provide code samples), but you can find all of the code on github. From time to time, particularly if I have done anything interesting, or feel I have achieved something clear, I may follow up with another post.