You may recall from last year that I used Playwright to try and automate the process of filling out the US Government's Foreign Funds form FBAR.
The problem was that I had to collect all the data by hand and put it into a JSON document to start the process. Not ideal. Especially when I already keep all the information I need about what I own in a spreadsheet. That is, I keep all the data in a spreadsheet; there aren't any calculations in it.
For the simple reason that I hate spreadsheets. Yes, they are fine for entering and holding data, but they suck at doing calculations. When I have this strong an opinion (especially when it isn't widely held), I do like to check it with a little rational analysis. Something like this:
Interrogator: Why do you feel this way about spreadsheets?
Me: Data and code are different things. You shouldn't mix them. Spreadsheets jumble data and calculations together and then make you repeat the calculations over and over in different cells when they apply to different data elements. And if you ever want to change the way you do a calculation ... Life isn't supposed to be that way.
Interrogator: So what's the alternative?
Me: To use individual sheets to store your data in much the same way you store data in a (relational) database - each sheet represents a table and each row represents a single instance of a common item type: not dissimilar to a struct or class in a programming language.
Interrogator: If you like databases so much, and they already exist, why not use one?
Me: Because I do like spreadsheets from a data entry point of view; they are much easier to manipulate and use for visualization than a database is.
So, in short, I want to enter data in a spreadsheet, organise it like a relational database, and then access it using a programming language with a functional metaphor. And while I'm doing exactly what I want, I'm also going to invoke event sourcing and star schemas. If you think that star schemas are all about "big data", you may be confused by the fact that what I'm talking about here is just a few hundred rows of data. But what I'm after is the fact that a star schema almost perfectly models the concept of event sourcing: in particular, the ability to support metadata that changes slowly. If you own an asset for long enough, something will always change: the name, the registered address, the contact person, whatever. Not to mention the fact that what you own also changes, and thus you need to keep a record of what you currently own and what needs to be reported on. I may only own a few assets - and value each of them two or three times a year - but over a long enough period of time it becomes complicated enough that you want that level of help in tracking what is currently owned and thus needs to be reported.
So let's start off with a nice, simple abstraction. We can have a set of asset valuations. Say we have three assets: A1, A2 and A3. Each of these has a particular value on a particular date. We thus create a table with columns for date, symbol and value. So far, so good. When we say we are modelling this using event sourcing, what we mean is that we never delete or update a row. The table just keeps on growing. We have a complete history of all the valuations we have ever recorded.
So how do we know what the "current" value is? Well, that information may not be (and, in fact, generally is not) in the spreadsheet. But we can find the most recent valuation: we simply filter for the asset we're interested in and then sort by date to find the most recent entry. This is not hard using the UI. It is hard (although possible) using things like VLOOKUP and pivot tables. You may want to do that; I don't.
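To make that reduction concrete, here is a minimal sketch in Python (nothing to do with Modeller itself, and the rows are invented): given the append-only history, pick the most recent valuation for each asset.

    from datetime import date

    # Append-only "fact table": rows are only ever added, never updated or deleted.
    valuations = [
        {"date": date(2024, 3, 31), "code": "A1", "total": 1200.0},
        {"date": date(2024, 9, 30), "code": "A1", "total": 1350.0},
        {"date": date(2024, 6, 30), "code": "A2", "total": 800.0},
    ]

    def latest_per_asset(rows):
        """Reduce the full history to the most recent valuation for each asset code."""
        latest = {}
        for row in rows:
            current = latest.get(row["code"])
            if current is None or row["date"] > current["date"]:
                latest[row["code"]] = row
        return latest

    # latest_per_asset(valuations)["A1"]["total"]  ->  1350.0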
However, I claim that adding the notion of star schemas and slowly changing dimensions makes it simply impossible to do in a spreadsheet (although I'm more than happy to admit I'm wrong if you want to show me how; it still won't change the fact that I don't want to do it that way). So, for example, we can add a second sheet that indicates the kind of information - such as a registered address - that we need in order to report an FBAR. This can change over time, so we have each row in this sheet be keyed by date and asset - just like the main valuation table - and do the same "reduction" operation to find the most recent version of the information. We can then "join" these - in the sense intended by standard relational databases - on the asset symbol to come up with all the relevant information about that asset.
We can, of course, do more. And we need to in order to complete the FBAR. For example, we only want to submit information relevant to a given calendar year - so we need to trim our data set to exclude all the dates that are not in the last calendar year. Except, for the dimensional data, we need to make sure that we have the data which was up to date at the end of the year, even if it was from a previous year (the nature of slowly changing dimensions is that they change very slowly).
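As a sketch of that rule (plain Python again, not Modeller code, with hypothetical row shapes): keep only the facts dated inside the reporting year, but for each dimension take the latest row dated on or before the end of that year, however old it is, and then join on the asset code.

    from datetime import date

    def facts_in_year(rows, year):
        """Keep only the valuation events that fall inside the reporting year."""
        return [r for r in rows if r["date"].year == year]

    def dimension_as_of(rows, cutoff):
        """For each asset code, keep the latest dimension row dated on or before the cutoff."""
        as_of = {}
        for row in rows:
            if row["date"] > cutoff:
                continue
            current = as_of.get(row["code"])
            if current is None or row["date"] > current["date"]:
                as_of[row["code"]] = row
        return as_of

    # year = 2024
    # facts = facts_in_year(valuation_rows, year)
    # addresses = dimension_as_of(address_rows, date(year, 12, 31))
    # joined = [(f, addresses.get(f["code"])) for f in facts]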
The final thing that we need to talk about before launching into code is the fact that I want to approach this in a "functional" or a "pipeline" way. I've written before about how, in my mind, these two things are the same thing, just presented differently, much like "Polish" and "Reverse Polish" mathematical notation. So my approach is to use a functional language to describe what I want to do, and thus build up a model of a pipeline; the program will then interrogate that model and design an appropriate pipeline with the requisite delivery between components to try and ensure the minimum amount of rework is done.
Fortunately, I carry a functional language compiler with me everywhere I go: as part of my day job, I have the FLAS compiler, which is, conveniently enough, designed to be embeddable and extensible. Just the thing we need for this job.
Let's get started!
Modelling the Data
The whole point here is to store the data in a spreadsheet and, because we want to store different kinds of data, we will use multiple sheets for this. Each sheet has a name, and that is how we will find it to load it. For our purposes, it does not matter that much which spreadsheet program is used: I started off doing this using Excel but moved to Google Sheets a few months ago (transferring the data was actually harder than I'd expected, but I think I've sorted all the issues out now). Each sheet, then, contains rows of data of a particular type, with a header row indicating the fields. So we can model these in a programming language using a record-like construct. In FLAS, this is called a struct, just like it is in C.

Let's start with what I'm going to describe as "the fact table": the table where I record the value of each asset every time I look or they send me a statement:
struct Valuation
    Date date
    String code
    Number quant
    Number unitprice
    Number total
    String comment

There are a few more fields here than I outlined above, but hopefully the reasoning is obvious: some assets come as a quantity held and a price per unit, and the total is obtained by multiplying these together; others just come with a simple total. We need to support both kinds in this sheet and struct (yes, we could put them in different sheets and use different structs to hold them, but that is not the choice I made). There is also a field for comments, which I have used from time to time, in particular to record "odd" things that have happened, such as an asset being split or redeemed - anything that makes it appear to change value dramatically even though there is a good reason.
A historical note: I have some assets in British Pounds and some in US Dollars. In order to be able to do anything (and, in particular, complete an FBAR), it's necessary to convert between these. So from time to time I look up the exchange rate between the two and store it as the asset GBP, with a value of a Pound in Dollars ($1.26/£ last time I checked), although I'm starting to think that it's a sufficiently different beast that it should go in its own sheet as its own dimension. The clue is really in the fact that it has its own struct.
struct ExchRate
    Date date
    String symbol
    Number rate
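Nothing clever happens with that rate; conceptually it is just "find the most recent ExchRate row and multiply", as in this Python sketch (made-up numbers, using the $1.26/£ figure above):

    from datetime import date

    def to_dollars(total_gbp, exch_rates):
        """Convert a Sterling total to Dollars using the most recently recorded rate."""
        latest = max(exch_rates, key=lambda r: r["date"])   # most recent ExchRate row
        return total_gbp * latest["rate"]

    # to_dollars(1000.0, [{"date": date(2024, 12, 31), "symbol": "GBP", "rate": 1.26}])
    # -> 1260.0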
Querying the Sheets
Yes, I really do think about this as if it were a database. So I'm going to go ahead and write some "operations" that enable us to load data from the sheets. There are two "operations" for this: LoadFromXLS and LoadFromSheets. As noted, I'm fairly agnostic about where the data comes from - you just need to have the appropriate adapter. And these are the two that I have written: Excel and Google Sheets.

What do I mean by "operation"? Well, in building up a data processing pipeline, we are going to do all sorts of things such as loading, transforming, merging and printing. Each of these steps is an "operation". Except, in line with almost all things functional, we are going to describe our desire or ability to do an operation, but not do it yet. We will only do that when we get to the very end and know what output is needed. Don't waste work.
So we describe our desire to load information from Google Sheets by creating a struct of type LoadFromSheets defined as follows:
struct LoadFromSheets
    String fileId
    String sheetName
    Type emit

And so we can load our valuation fact table like so:

values = LoadFromSheets sheetId "Valuations" (type Valuation)

This says that we would like to load a spreadsheet from Google Sheets. Obviously, in order to do this, we need to tell it which Sheets document, which sheet within that document, and then the kind of struct we want to map from there. Ultimately, it will attempt to match up (as best it can) the field names in the struct to the column headers in the spreadsheet and be ready to load the table.
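As an illustration of what that header matching amounts to (a Python sketch, not the actual loader), the adapter only really needs the header row and the struct's field names:

    def rows_as_records(header, rows, field_names):
        """Turn raw spreadsheet rows into records by matching column headers to field names."""
        index = {name.strip().lower(): i for i, name in enumerate(header)}
        records = []
        for row in rows:
            record = {}
            for field in field_names:
                col = index.get(field.lower())
                record[field] = row[col] if col is not None and col < len(row) else None
            records.append(record)
        return records

    # header = ["Date", "Code", "Quant", "UnitPrice", "Total", "Comment"]
    # rows_as_records(header, sheet_rows, ["date", "code", "quant", "unitprice", "total", "comment"])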
Where did sheetId come from? Well, it's basically an environment variable. In this "Modeller" app, we can access external variables of this kind by asking the builtin object modeller to find a named variable, like so:
sheetId = modeller.getvar "sheetId"
Extracting the Exchange Rate
I noted above that, for historical reasons, the exchange rate is mixed in with everything else. In order to extract it, we need to filter and map the Valuation objects to find the ones that have the symbol GBP and then convert them into ExchRate objects. This is a mix of "operations" (Map and Filter) with vanilla functional code (isExch and toExch):

exch = Map toExch (Filter "code" isExch values)
isExch s = s == "GBP"
toExch (Valuation v) = ExchRate (v.date) "GBP" (v.unitprice)

This first finds all the entries in the Valuation table which have the code GBP. It then converts these from a Valuation to an ExchRate, giving the date, symbol and unit price.
Dimensions
I'm not going to claim that the FBAR is the most complicated form ever, but it does have a lot of requirements, and a lot of moving parts that need to be coupled together. To make all this happen, we have four additional dimensional tables. Again, these are tables which include information about the assets that just doesn't change very often: for example, your address or the name of the asset.

We load all of them and populate objects in much the same way we loaded the fact table.
The first is not FBAR-specific, but is the "core information" about an asset:
struct CoreInfo
    Date date
    String code
    String name
    Boolean active
    String currency
    String accountNo
coreInfo = LoadFromSheets sheetId "CoreInfo" (type CoreInfo)

FBARInfo is the FBAR-specific information associated with a given asset:
struct FBARInfo
    Date date
    String code
    String status
    String mergeAs
    String assetType
    String institution
    String accountNo
    String address
    String city
    String postCode
    String country
fbarraw = LoadFromSheets sheetId "FBARAssets" (type FBARInfo)

There are also two tables which are needed in order to complete the form: FBARBase contains the boring information that goes on the front page, and FBARUser gives information about the several owners of the assets.
Preparing the Data
There are many shenanigans that go on in preparing the data to be joined together: mainly around the task of choosing the most recent information. I'm going to skip most of it, because it's a lot of hard work, but I claim it is fairly well described in the code itself.

I am going to present one piece of the code which I think is fairly typical of these shenanigans. In order to report on last year's valuations, we need to trim the Valuations table down to just a mapping of asset name to highest value. This is handled in these three lines of code:
lastYear = Filter "date" isYear values

The first line extracts only the Valuation entries which have last year's date (the function isYear is not shown, for simplicity). Then highestLastYear is defined using the KeyedTable operation, which selects exactly one item for each matching key (the key in this case is just the code field, but it is possible to use multiple fields in the key). The function selectHigherValue is specified here to decide how to choose between any pair of source entries: in this case we choose the one with the higher value in the total field.
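Conceptually, KeyedTable is just a keyed reduction; a rough Python rendering (names and row shapes are illustrative, and nothing like the real operation's syntax) looks like this:

    def keyed_table(rows, key_fields, choose):
        """Keep exactly one row per key, using `choose` to arbitrate between candidates."""
        kept = {}
        for row in rows:
            key = tuple(row[f] for f in key_fields)
            kept[key] = row if key not in kept else choose(kept[key], row)
        return list(kept.values())

    def select_higher_value(a, b):
        """Prefer whichever row has the larger total."""
        return a if a["total"] >= b["total"] else b

    # highest_last_year = keyed_table(last_year_rows, ["code"], select_higher_value)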
Doing the Join
Now that we have all of our tables in place, we can do a star join. The result will be a stream of values, each of which will have one fact and one item from each of the dimensions with which it is joined. Each source fact will appear at most once; the dimensional data may be repeated. This is achieved by defining a new type which can hold the joined data:

struct FBARStar
    CoreInfo core
    FBARInfo fbar
    Valuation highest

Then we can specify the actual join itself. This requires the Star operation, which takes the output type (FBARStar, as above) together with a list of the field names where the various parts of the join are to be placed, the field in each table which will be used to do the join, the fact "table", a list of dimensional tables and, finally, the inclusion semantics (see below).

fbarstar = Star (type FBARStar) ["highest", "core", "fbar"] ["code", "code", "code"] highestLastYear [fbarCore, fbardim] [False, True]

Getting just slightly ahead of ourselves, this works by looking at each entry in the fact table in turn. An attempt is made to join it to each of the dimensions in turn. If the join fails for a given dimension, it may either produce a null, or the whole fact may be thrown out. This is determined by the "inclusion semantics": if set to True for a given dimension, then the dimension must join in order for an output row to be produced; if False, an output row will be produced anyway, but that dimension will show up as null. In this case, we want the presence of rows in the final output to be conditional on the FBARInfo being there for them, so we specify True for that table.
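If the inclusion semantics are easier to follow in code than in prose, here is a rough Python rendering of the same idea (nothing like the real Star implementation; names are illustrative): each fact is matched against every dimension, a True flag makes that dimension mandatory, and a False flag lets the fact through with a null in that slot.

    def star_join(facts, fact_key, dims):
        """Join each fact to each dimension; dims is a list of (rows, key_field, mandatory)."""
        out = []
        for fact in facts:
            joined = [fact]                   # each source fact appears at most once
            keep = True
            for rows, key_field, mandatory in dims:
                match = next((d for d in rows if d[key_field] == fact[fact_key]), None)
                if match is None and mandatory:
                    keep = False              # a mandatory dimension failed to join: drop the fact
                    break
                joined.append(match)          # may be None for an optional dimension
            if keep:
                out.append(joined)
        return out

    # star_join(highest_last_year, "code",
    #           [(core_rows, "code", False), (fbar_rows, "code", True)])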
Producing JSON
In order to connect to the automated Playwright tool for filling in the form (which I wrote last year), it will be necessary to produce a JSON file. Of course, this is not technically necessary; it would be possible to have this Modeller program directly talk to and control Playwright. But, for now at least, I am happy to run this as two separate steps, with a very visible (and checkable!) barrier between the two in the form of a JSON file.

fbar = JsonStore
    fbarFile // the output file
    {} // a base hash to become the document
    [ // a list of injections
        ("", injectbase, base), // inject the "base" information into the root of the object
        ("users", injectuser, users), // users
        ("solos", injectsolo, soloassets), // solo assets
        ("joints", injectjoint, jointassets) // joint assets
    ]

Wow! That looks complicated. But wait, and sit with it for a while.
The first line is basically just saying that the rule fbar is a JsonStore command; and the second identifies the file where the resultant JSON should be stored. But what is the JSON? Well, it's a Hash object (i.e. a JavaScript object) which is populated from the various different sources of information that we have. The third line defines a "base" JSON object, which in this case is the empty hash. The final six lines offer four "injections" into this object. That is, each of them is a triple of a location in which to store the information; a function which injects the information into that location; and a pipeline expression which has the data to be injected. There is, of course, a bunch of plumbing that goes into supporting that, but it is all just boilerplate functional code. It's there in the repo if you want to look at it, but I'm not going to dwell on it here.
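For readers who prefer code to prose, the shape of what JsonStore does with those triples can be sketched in Python along these lines (the injector names here are hypothetical, and the real plumbing is in the repo):

    import json

    def json_store(path, base, injections):
        """Apply each (location, inject, data) triple to the base object, then write it to disk."""
        doc = dict(base)
        for location, inject, data in injections:
            inject(doc, location, data)       # each injector decides how to graft its data in
        with open(path, "w") as out:
            json.dump(doc, out, indent=2)

    def inject_root(doc, _location, fields):
        """Inject the base information directly into the root of the document."""
        doc.update(fields)

    def inject_list(doc, location, rows):
        """Store a stream of rows as a JSON array under the given key."""
        doc[location] = list(rows)

    # json_store("fbar.json", {}, [("", inject_root, {"filingYear": 2024}),
    #                              ("solos", inject_list, solo_assets)])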
That variable fbar is not special, as such, but it is the one that we pass to Modeller on the command line to say that we want the effect of evaluating that pipeline. And that's what happens: the spreadsheets are read, the transformations done, the calculations calculated, and the JSON file is spat out at the end. It may be possible to do all of that in Excel, but I wouldn't like to try. It wasn't quick getting to this point this way, either, but most of that time was spent writing the tool, which is reusable work 🙂
How Does this Work?
OK, so now we've described what we think we want to do. How does the Modeller program deal with that? Well, it starts by finding and evaluating the constant specified on its command line (in our case, fbar). It expects that to form some kind of graph of operation instances (specifically, it expects it to be directed and without cycles). It then works its way through the graph until it reaches the leaves, which should be sources, that is, operations like LoadFromSheets that can magic information out of nowhere. It then inverts the graph, connecting each source to all the operations which reference it, until it comes back together at the root of the graph with an operation we expect to be a sink, that is, an operation such as JsonStore which knows how to consume data (in this case by storing it to disk) but doesn't produce any.

As it does this, it records all of the sources in a list to be invoked, and then asks each of them to start generating data (in this case, reading it from a spreadsheet). As each of them reads a row from the spreadsheet, it passes it to all of the interested parties (or "subscribers"). They, in turn, do all of the processing they want to do and spit out their outputs to their own subscribers. This process continues until there is no more data coming out of the sources and, as this knowledge trickles through the network, the activity slowly dries up until finally the JsonStore is told that it has all the information it needs; it can then stop updating the object and write the whole thing to disk.

Note that nowhere in here are there any guarantees about ordering: we don't say which sheets will be read in which order, or whether the rows in a sheet will be presented in any particular order. It is up to the components (and, to a lesser extent, the code) to make sure that they are agnostic to the ordering of information, by concentrating on the data itself and not the order in which it arrives. For example, if you want something sorted by date, choose the record with the appropriate date, not just the first or last record to arrive.
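A toy version of that push-based network might look like the following Python sketch (nothing like the real Modeller machinery, and it glosses over details such as a node waiting for all of its upstream sources before declaring itself finished):

    class Node:
        """A pipeline component: receives rows, processes them, pushes results to subscribers."""
        def __init__(self):
            self.subscribers = []
        def push(self, row):
            for s in self.subscribers:
                s.receive(row)
        def done(self):
            for s in self.subscribers:
                s.finished()
        def receive(self, row):               # default behaviour: pass rows straight through
            self.push(row)
        def finished(self):                   # default behaviour: propagate "no more data"
            self.done()

    class FilterNode(Node):
        def __init__(self, predicate):
            super().__init__()
            self.predicate = predicate
        def receive(self, row):
            if self.predicate(row):
                self.push(row)

    class CollectSink(Node):
        """Stands in for JsonStore: accumulate rows and only act once the stream has finished."""
        def __init__(self):
            super().__init__()
            self.rows = []
        def receive(self, row):
            self.rows.append(row)
        def finished(self):
            print("stream complete:", self.rows)

    source, keep_gbp, sink = Node(), FilterNode(lambda r: r["code"] == "GBP"), CollectSink()
    source.subscribers.append(keep_gbp)
    keep_gbp.subscribers.append(sink)
    for row in [{"code": "GBP", "rate": 1.26}, {"code": "A1", "total": 1200.0}]:
        source.push(row)
    source.done()                             # -> stream complete: [{'code': 'GBP', 'rate': 1.26}]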
There is a small wrinkle in all of this which is vaguely disquieting to me. Having worked on such systems in the past (at a bigger scale), I feel that this kind of flow should be continuous: the idea that we read from a sheet and then stop is unsettling. But it does mirror the small-scale workflow that we are targeting here. If we wanted to make it continuous, we would need to add a clock to the system which, from time to time, sent a message to all of the sources saying "act as if you've just finished". As each source received this message, it would send out the appropriate updates, followed by a message that said "this is a checkpoint, pass it on". This would carry on up the chain until the JsonStore wrote out a version of the file with the most current data. As I say, this could be done, but it really doesn't suit the kinds of very small workflows I have in mind. It is easier just to run this once a year (or once a month if you want a summary of your assets every month).
But That's Not All
Now, I claim that just being able to do all that is pretty cool. But actually, you can do so much more. For example, the monthly valuation report uses the same data (although just the Valuation fact table together with the CoreInfo dimension table). But it pulls it together differently and then summarizes it into a PDF file (using the MakePDF sink rather than JsonStore). This uses (some would say abuses) the notion of cards that is central to the conception of FLAS as a web programming language.

The user defines and then creates a card as a recipient for one or more streams of data. As that data is delivered, the card processes it and renders the output as HTML in real time. Modeller can then read that HTML and turn it into PDF using styles specified by the user in a format not dissimilar to CSS.
Final Notes
While this is an adventure into the unknown, it is not an adventure into the complete unknown, starting from scratch; it is in its own GitHub repository at https://github.com/gmmapowell/modeller, and everything here is current at the tag MODELLER_FOR_FBAR_BLOG.

The question is bound to arise as to why I did not simply do everything in a straightforward functional fashion - after all, if we can express our intent in a functional program, why not just code it that way? Again, the answer really comes down to history and experience. When I first started writing this (the evidence would seem to suggest over five years ago), the functional configuration was an aspiration (the FLAS compiler was in its infancy then) and I was basically trying to implement, on a small scale, an in-memory version of something like Ziggrid, which I had been working on back in 2013. So I built an engine that worked that way and hardcoded applications using it. And then, over the past year, I've backed my way into a more configurable thing.
But also, I have to give a plug to Phil Wadler's classic 1984 paper, and say that by inverting this as I have, I have once more shown that listlessness is better than laziness.
Conclusion
Even though I spent a lot of time building out the Modeller core, along with implementing all of this, entering the relevant data into sheets, and then debugging it all, I'm going to say that this has been my easiest FBAR filing to date. And I'm hoping that it will be even better in future years.

But more than that, I'm just really happy with separating code and data, and there could not be a clearer example of it than this.
I realise that this is at best a brief overview of what is (now) a large project. I may come back and tackle various aspects of it at other times, but to cover the whole project here would be an enormous task and, while I am very happy with the product, I am substantially less happy with the code, which is at best held together with string and sealing-wax.