Ignorance may be Strength : A More Functional Database

I have recently been using DynamoDB and, to be frank, its API (at least in Java) is, IMHO, awful. So awful, in fact, that in wrapping the API as described below, I was forced to write a mini-wrapper to call from within the wrapper.

Obviously, I was not going to put up with this. But rather than launching in and doing "something", I sat back and considered what I really wanted. If you know me or read this blog frequently, you would probably fairly rapidly come up with a list something like this:

Functional or functional-leaning;
Tell-Don't-Ask;
Transactional;
Asynchronous.

Now, as it happens, these things are not really in conflict in this case, although I ended up with three APIs:

A mini-wrapper I had to create just to tidy up the $DynamoDB$ interface;
A low-level "tell-don't-ask" API; and
A more transactional API with a functional feel.

What I really want to talk about here is the last of these; the other two are really just stepping stones on the way, although, for full disclosure, it was only when I had used the TDA API for a while and experienced the pain of doing it, that the third API occurred to me.

While I’m fully disclosing, this code didn't start here as an experiment, but grew organically out of one of my many “real life” complex and messy projects across a swathe of production code intended to support multiple databases as well as other integrated systems; for simplicity I’ve rewritten history a little and pulled out all the relevant classes for DynamoDB and assembled an example test case that steps through the basic CRUD flow using the functional API.

What am I thinking?

First things first. What do I even mean by a "functional" database? Almost by definition, databases are the opposite of functional. The very way we think about - and describe - databases is CRUD: a sequence of read/update operations with occasional creation and deletion to keep life interesting.

Given my long background in functional programming, a lot of things have been percolating in my brain over the decades and recently have experimented a little with (the somewhat-functional) FaunaDB (which I will eventually get around to writing up here) and in doing that I noticed something of a similarity between functional programs and how I might ideally write database logic.

Here is some functional code:

top = repeat x '*'
x = 5
repeat 0 c = []
repeat n c = c:repeat (n-1) c

Nothing particularly special here. But there is a rhythm to functional programs, which can be described as "define something; use it; repeat". I've often described writing functional programs as "do a little bit of the work now and push the rest off onto another function".

On the other hand, what drives a functional program is the fact that somebody somewhere wants to know a result and expresses this somehow. Generally, this either comes from a "main" method or from some kind of console or REPL. The "top level" expression is broken down into sub-expressions which are evaluated in turn until the basic blocks are reached.

(As an aside, TDA almost exactly inverts this logic: it takes all the basic blocks and says that the results should be sent to a consolidator that combines them and then promotes them to the next level until the top level handler is reached.)

So the question is: how does this relate to databases?

A "pull" model for databases

The normal way of dealing with databases is as an imperative begin-read-write-commit loop. But this transaction can also be viewed as a single operation in a REPL - the model used by functional programs. In this model the transaction becomes:

Decide what you want to do
Do all the reads and transformations
Do all the writes in a single step

How is this different? Most importantly, an important topic in standard database theory isread your writeswhich basically says that there is a choice when you have done a write in a transaction about whether when you use "the value" again - particularly if you choose to read it back from the database - whether you see the version you wrote or the one that existed before your transaction started. By deferring all the writes until the end of the transaction, we avoid this conundrum, which is exactly what you would expect from a functional model, in which values do not change over the course of a function's evaluation.

The other major change is in the way in which we describe the steps of the logic. As with a functional program, we name each value that we read in; as we name it, it becomes available for other steps in the logic. For example, a transaction which reads two values and then writes the result might ordinarily look like this:

begin tx
x = get 'A'
y = get 'B'
z = x + y
put 'C' z
end tx

In a functional notation, we might write something like this:

tx = [Store 'C' z]
z = x + y
x = get db 'A'
y = get db 'B'

Now, this is neither purely functional or really executable; but the idea is that we are describing the transaction rather than actually running it.

Handling the impedance mismatch with imperative languages

Now, while I want to use a "more functional" approach to databases, I in fact still want to do this from within Java - an imperative language. How do I handle that?

It's actually not all that hard. Java has functions, and I can say that each of my "steps" can map to a function. I have an entry point, which kicks off the initial GET operations and identifies the logic operation (z) that I want to be invoked when both of the return values are available. I then define z and annotate its parameter arguments to say where they come from. Basically, the transaction becomes a state that the functions can access through the parameter annotations. Once a value is read or calculated, it is available and any method that depends on that can now be executed.

Handling asynchronicity

It's obviously very important that the database access be asynchronous - the alternatives are to waste threads blocking or just to kill performance - which aren't really alternatives at all.

This is done by having all the functions call into an asynchronous layer in the database and provide a central place to call back. When they do, the current state is updated with the new value and the list of pending logic calls is examined. Any that are now "ready" - i.e. those that were just waiting for this value to arrive - can now execute, possibly launching more GET or LOGIC requests. Any of these that are ready to execute can be run the moment that the current method returns; others will be deferred until all their arguments are present. Every logic method can also add to the pool of "pending writes".

Eventually, all of the logic is complete and it is possible to do the writes - unless an error occurred in the transaction, in which case none of the writes will be done.

An implementation

The implementation (available in the git repository) is called GLS for get-logic-set/subscribe, describing this alternative loop.

Start with the simple test case (SimpleGLSUsage). The fundamental concept here is the UnitOfWork, which is created in the test initializer initTest. This corresponds to the conventional notion of a "transaction" (I have shied away from the word transaction in part because it is overloaded and in part because the semantics of the underlying database are unspecified; in the case of DynamoDB it is not transactional).

Within a unit of work, it is possible to create multiple parallel Relations. A relation represents a thread of work going on with its own namespace - a scope, if you will, within an outer definition in a functional program. This allows multiple lines of logic to coexist while still being part of the same "unit" - the same values are only read once, have the same value across all relations, and the unit succeeds or fails together.

This simple test case is not a series of unit tests. Rather, it is a "script" for executing tests. To make this work, JUnit 5 is used along with its OrderAnnotations. The first test ensures that we can create a trivial object. The enact method on the unit of work says that we have configured everything, and it is ready to go; waitForResult blocks until the unit has completed - success or failure.

The second test simply attempts to read this object back in.

The third test reads the object back in and then prints it. Because Java does not treat functions as "first-class" objects, the method printHello is referenced as a string in the call to logic. This is the name of a method in the RelationHandler class, which in this case is defined to be this, which is the current class.

The fourth to sixth tests check that the greeting is what we would expect it to be, update it, and check that it has changed. The final test cleans up the record by deleting the record (although if anything goes wrong, the whole thing will be deleted on restart).

It is obviously possible to run these tests, but some setup is required: you obviously need an AWS account with a DynamoDB instance configured; you need to create a table and pass its name to the test in the test.table.name property.

Conclusion

The normal database paradigm fits well with imperative languages, but has the normal drawbacks of those languages - most specifically, very slow, synchronous behaviour. It's hard in imperative languages to break out of that because of the complexity of dealing with all the asynchronous state. Changing the metaphor - and making a central agent responsible for the state management - simplifies the code and improves reliability.

Interestingly, in writing this, I can see that my actual implementation does not map perfectly onto my mental model - my implementation has turned out to be more imperative than my mental model. My entry point is fairly close to the "bottom" of the execution stack - as it would be with a TDA implementation. To match the mental model more closely, the "get" operations should not be invoked directly from the entry point function, but should be their own functions, and each of the functions should be named to match the "value" it produces. Maybe I should try again and report on that experiment.

Ignorance may be Strength

Wednesday, September 2, 2020

A More Functional Database