Tuesday, July 15, 2025

A Graph Database

This may be an uncommon opinion, but I like relational databases. In fact, when they were first around (possibly even first being mooted) the idea that they eliminated duplication by storing each piece of information exactly once seemed both radical and brilliant.

In the past 20 years or so, they has been supplanted in large part by "document" databases. This has been driven by two factors: one is the increase in the number of "amateur" developers, for whom the structure and processing of data in a relational way (or even any formal way) is simply "too hard"'; and the other is the increasing deployment of databases "at massive scale" (often called "web scale") which makes the idea of the relation - and in particular, the relation between two things on different machines - tricky. Such database technology also generally throws away transactions as being "just too hard".

The words "baby" and "bathwater" occur to me.

The usual justification for preferring document databases over relational databases has something to do with cars and wheels. The idea being that all of these things are inherently part of the same thing. What it doesn't consider is that often we are talking about things that are inherently different.

Take for example, cars and drivers. If you are running a car rental business, it makes no sense to model this with all the information about the people in each record about each car; or to store all the information about each car with each person who has rented it. There are cars, there are people, and there is a "relation" between them: this person has hired (or will hire) this car.

It makes sense to use a document in a document database to model "a single thing" and it makes sense to use a relational database to model relationships between things. Except that now you have two databases. Oh, well, with document databases you don't have transactions anyway, so how much does any of it matter?

(As an aside, while doing the research that led to this, I came across Azure Cosmos DB which seems to offer everything in one package. I'm not sure if that's true or not, and I don't use Azure on a daily basis, so while I might come back to that, it would probably require somebody to pay me to explore that option. Feel free to get in touch.)

The Problem Domain

This all comes to a head for me when I am dealing with problems that are inherently about the relationships rather than about the entities. In particular, I have two "live" problems that I have been wrestling with on and off for a few years which are very easy to model in a relational database but all-but-impossible to handle at scale in a document database. Over the years I have discussed these with friends/colleagues at, for example, Couchbase, but so far I haven't managed to get a full resolution of my problems.

These two problems are security and notifications. Both represent the idea that a system has users (or "principals" if you prefer) and objects (or "resources" if you prefer) and there are relationships between them.

In the security case, we observe that there is some set of actions that a user can "do" to an object, which is very easily modelled as a triple in a relational database:
  • Principal
  • Action
  • Resource
It is very easy to write and run queries in SQL that operate on a set of Principals and a set of Resources and concludes the set of Actions that the combination of Principals can do the Resources.

(This notion of "sets" may seem confusing to you, but in general, a Principal is not defined just as a user, but as a user, or a group, or a group of groups, etc, and so any given "user" may turn up as being in one of many groups, so you need to consider whether any of those groups have the rights that you are looking for. On the other hand, resources may be allocated permissions in a variety of ways, including being nested in compound structures, so you may need to consider all the possible ways in which a pemission can be granted. In the "simple" case we will consider here, then, it is reasonable to say that the resulting set of Actions should be just considered as an "OR" of all the possible relationships, and so if any given Action appears in the set, the initial user (who is represented by the set of Principals) has the right to perform that Action on the initial object (which is identified through the various Resource paths).)

The notification problem is similar: when a given object changes, a given set of users should be notified, possibly in different ways or with different rules depending on the user's preferences. The model is much the same as before:
  • Object
  • Rules
  • User
When the object is changed, we find all the matching relationships in this table, apply the rules and then update the user if appropriate. The astute will notice that a similar "set reduction" takes place as before: we select rows that match the object being updated, but then we need to group the rules to be applied per user. We apply all of the rules simultaneously to come up with a determination to notify or not notify, and then we send exactly zero or one notifications to each user.

At Scale

I cannot emphasize enough that the only valid solutions to this problem are ones that work "at scale". If you are thinking about something in JavaScript inside a browser, that is a fine mental model, but it is not a realistic implementation. I am only talking about implementations which are global scale, and depend on at the very least replication of infrastructure in different data centres around the world. Since I am explicitly talking about AWS, what we are talking about is deploying a solution in multiple regions. Any changes need to be propagated across the regions, and any questions about security accesses need to handle users and objects which may be stored in different regions. I'm not planning to consider those issues here, but I want to be clear that they are in my head and may well come out in some of the things I talk about and do. If you are not thinking that way, my actions may seem bizarre.

The Two Database Solution

Realistically, it is not possible to do everything I want to do in a single database. If the previous paragraph didn't already convince you, let me put it simply: we need to use modern, "shared-nothing" databases to get the kind of "document" (object) performance that we want, and we need to have a database that can express relationships well to handle the relationships between them.

Slow Changes

During the time I spent (in a job long, long ago) working on data warehousing, I became a big fan of the star schema: it is amazing how much performance you can crank out of a relational database organized this way if you configure it properly (equally, it's amazing how large a machine you can bring to its knees on a trivial query if the developers on your team don't know what they're doing).

A star schema has exactly the attitude I want to the world: one in which objects can be stored in "dimension" tables and the relationships can be stored in "fact" tables (you would have to dig in to understand why they have those names). But one of the most fascinating things about the model is that the objects in the dimension tables never change. Once written, an entry in a dimension table is forever. If you want to make a change, you have to add a new entry with the new value. Subsequent entries in the fact table point to the new row, not the old row, but all the existing facts continue to point to the original dimension entry. (It's actually more subtle than that, but I'm trying to keep this short.)

Such changes are known as "slow changes" (in slowly changing dimensions) because between each such change, several hundred or thousand facts may well be created using that dimension entry. As an example, consider the field of telecommunications (where I was doing this IRL): for every time that a phone subscriber changes the name, address or plan associated with their number, they will generally make hundreds of phone calls, each of which is a "fact" pointing to the same version of their "subscriber" entry.

A Simplifying Assumption

The idea of slow changes is here by way of a metaphor, rather than something I intend to apply directly to the problem domain. In fact, there's a case to be made that I'm depending on exactly the opposite behaviour, that when an object or a user changes in the document database, it will retain the same identity. But this remains the same: I would expect to do at least an order (probably two) of magnitude more queries across the relationships than I would expect to see changes to the relationships which would be an order of magnitude more than I would expect anything in the object to change in such a way that the relationships will change.

If that didn't make sense, don't worry. It should all become clear as we play the game.

A Graph DB

It's my opinion that a relational database can at least model everything we want. It just isn't web scale. A document database really can't do the whole job. Thus we need two databases.

Having decided the only solution is to use two databases, we obviously need to pick the right database for the job. I think the best choice for the document database would be Couchbase, but that's hard (and expensive) to deploy at scale in AWS (they have Capella if you are taking things seriously, but it's still not a fully-managed pay-for-what-you-actually-use service), so I'm going to use DynamoDB which I hate (more, probably much more, on that later, but you can find it in other blogs if you look) but I am at least familiar with it and it is cloud-native . For modelling the relations, I could use a relational database, but in doing so I would put myself back with the set of scalability problems I already had. They wouldn't be as bad, and I could cope, but let's try something different.

Graph databases have been around for a while, but they have become popular in the past decade for modeling the kind of problems that social networks have, the Six Degrees of Kevin Bacon problem if you will. And that's not dissimilar to what we have here. You could, for example, phrase the security problem as "how many degrees is this user from this object through this many security groups and this many access objects"? And if the answer comes back as any finite number, they have access. If the answer comes back "this object has no connection to Kevin Bacon", they do not have access.

The Sample Application

At the moment, all I'm interested in doing is experimenting with the technologies. So I want to have two Dynamo tables, with the keys replicated in Neptune and links between them.

To do this, I'm going to create a simple web application that tracks (fake) stock values on a (fake) stock market. There will be users who are interested in "watching" the current values of certain stocks and the system is expected to "notify" those users with the relevant prices (and no others) when the prices change.

In my head, I have a JavaScript application connecting to an APIGateway over a websocket using with the hard code deployed using AWS Lambda. But to start with, I'm just going to write the back end of that code.

Conclusion

I'm not going to try doing anything specific in this episode. This is just setting the scene. Next time we'll come back and try and deploy some databases in AWS.

No comments:

Post a Comment