Why relations are better than objects

The title of this chapter is a bit misleading, as the first thing the word "object" brings to mind (in this context) is OOP. Here instead we'll be using said word in a looser sense, one that includes, but is not specific to, OOP. For our purposes, an object is any mutable record that can be referenced and manipulated through a pointer or object id. So everything we will be saying applies not only to object-oriented languages, but also to procedural languages like C or Pascal, and even some impure functional ones.

In what follows, we'll be talking specifically about the particular flavor of the relational model that is used in Cell (which is very different from the one used in relational databases) and we'll assume the reader has a basic familiarity with it, so you'll need to read the introductory example first. It would also be useful, but not strictly necessary, to take a look at the chapter on relational automata.

Modelling the application state

Most developers are familiar with the process of database design, where you model your persistent data using the relational model. The main idea behind functional relational programming is to apply the same kind of thinking and methodology to software development, by modelling your entire application state (not just your persistent data) using a very high-level data model based on relations.

But what does that mean exactly? How does modelling the application state differ from what we do with conventional languages? Don't we "model" the application state in any language, when we define all the data structures that are used in our programs?

One answer is that the difference between the kind of data modelling we do in functional relational programming and what we do in conventional languages is the same difference there is between a specification and an implementation. The purpose of a specification is to unambiguously describe the structure and behaviour of a system in an abstract, high-level way, so that it is, among other things, (relatively) easy for humans to understand and reason about, independent of how the system is implemented, and amenable to being verified and analyzed via software. An implementation, on the other hand, has to deal with all the low-level concerns, and is typically messy and hard to understand, not to mention buggy.

An example

Now, this may sound a bit too abstract, so we need an example to see what all that means in practice. Take the following Cell schema:

schema WorkDB {
  // company(..) is a unary relation (set) that stores the ids of all
  // companies in the database, while company_name(..) is a binary
  // relation that stores the name of each company in company(..)
  company(CompanyId)
    company_name : String;

  // employee(..) stores the ids of all employees in the database and
  // name(..) and date_of_birth(..) the corresponding attributes
  employee(EmployeeId)
    name          : String,
    date_of_birth : Date;

  // works_for(..) is an associative binary relation that keeps track
  // of who works for whom. Note that it allows for an employee to have
  // more than one job. start_date(..) and salary(..) are ternary relations
  // that store attributes of the works_for(..) relation/relationship.
  works_for(EmployeeId, CompanyId)
    start_date : Date,
    salary     : Money;

  // Foreign key declaration
  works_for(e, c) -> employee(e), company(c);
}

The above code just models the data. Its job is to describe the abstract structure of the data, and it says nothing about how that data will be stored, represented or used in a computer program. It's just a specification, and, in and of itself, it's not about computers at all.

Let's now implement that specification, that is, let's design a set of objects that can be used to store the same information:

class Company {
  int id;
  String name;
}

class Employee {
  int id;
  String name;
  Date date_of_birth;
  Job[] jobs;
}

class Job {
  Company company;
  Date start_date;
  Money salary;
}

So far, so good. Kind of, at least. But even here we can start noticing a few seemingly trivial problems that aren't present in the Cell schema.

For starters, both company and employee ids should of course be unique, but there's nothing in there to enforce that constraint. You have to be careful when writing code not to accidentally create two objects with the same id. That's trivial and not at all likely to be a problem in practice, you may say. And in this specific case I would agree, that's probably not where bugs in the code are going to come from. But it still exposes a weakness of the implementation, in that it doesn't describe the data with the same precision as the Cell schema.

Another problem is that the instances of both Company and Employee have two identifiers: their object id and the integer stored in their id member variable. You need the latter because object ids cannot be stored when persisting the data. Again, no big deal, you might say. And that's mostly true, although sometimes having to deal with this system of redundant identifiers can be a bit annoying. But my point here is that, even at this very, very early stage, you can see the implementation concerns creep into and pollute your data structures.

A third imperfection is related to the jobs member variable. As it is, there's nothing to prevent the same company from appearing more than once in the set of jobs of a particular employee, possibly with different start dates and salaries. That would be an inconsistency in the data, and inconsistencies lead to bugs. That could be fixed, though, by changing the code as follows:

class Employee {
  ...
  Dict<Company, JobInfo> jobs;
}

class JobInfo {
  Date start_date;
  Money salary;
}

Let's now proceed with the implementation. One thing we need is the ability, given a Company object, to retrieve the list of its employees, which in practice means changing the code like this:

class Company {
  ...
  Dict<Employee, JobInfo> employees;
}

The new employees member variable is completely redundant, in the sense that it doesn't contain any information that wasn't already present before. And when you introduce redundancy in your data structures you're opening a Pandora's box.

Redundancy

In the database community it has been conventional wisdom for nearly half a century now (basically since the invention of the relational model) that in designing your database schema you should be careful to avoid any kind of redundancy. That's what database normalization theory is all about. For some unfathomable reason, the same kind of thinking is never (or almost never) applied to software construction, even though it would be as beneficial there (possibly even more so) as it is for databases. So, before we continue our discussion, it's a good idea to talk a bit about redundancy, and to explain what's so harmful about it.

The simplest form of redundancy is when a given piece of information is stored in more than one place. In a relational database, it may for example appear in multiple rows of the same table if the database is not normalized. In an object model, it may be stored inside different objects, or it may appear in slightly subtler forms, as in the above example.

This type of redundancy causes two main problems. The first one is that if a piece of information is stored in multiple locations, those locations have to be kept in sync: every time you update one of them, you have to update all of them, or you'll put your dataset in an inconsistent state. While consistency is no guarantee of correctness, the former is certainly a prerequisite for the latter. So getting rid of redundancy increases the reliability of your code by avoiding an entire class of bugs.

The second main reason that makes redundancy harmful, one that is more relevant to software construction than database design, is that having to update a piece of information that is stored in multiple locations requires more work, which makes your code more complex than it should be. This is especially true for object systems, where these redundant pieces of data may be scattered all over a difficult-to-navigate object graph, and where an artificial barrier like encapsulation often stands in the way.

In our example this means that we have no guarantee that the information stored in Employee.jobs agrees with that stored in Company.employees (an Employee may "think" it's working for a certain Company, but the latter may not have them in their list of employees and vice versa, or they may disagree on start date and salary), and that every time we update one of them we must also make sure to update the others. Failure to do so is a bug, a bug that cannot occur at all with a better data model.

Back to our example

Now that we've created our basic domain objects, we also need a place to store them, a sort of directory object whose purpose is to retain and provide access to those objects. Let's start with something simple (and inadequate):

class WorkDB {
  List<Company> companies;
  List<Employee> employees;
}

Let's examine this new data structure through the lens of data integrity and correctness. What could possibly go wrong here? One obvious issue is that there's no guarantee that companies and employees contain all the "live" objects of the corresponding type. An Employee may reference a Company that is not listed in WorkDB.companies, and vice versa. That may be caused, for example, by a bug in the code that removes an Employee from its WorkDB, but forgets to update the corresponding Company objects. In the relational schema, that's enforced using a foreign key, but there's no equivalent to that in the object world.

Another thing we might need is the ability to quickly look up Company and Employee objects by their persistent id. That may be needed, for example, when reading data from persistent storage. We could change WorkDB as follows:

class WorkDB {
  Dict<int, Company> companies;
  Dict<int, Employee> employees;
}

This is another form of redundancy: the persistent ids of companies and employees are stored twice, inside the objects themselves and as keys in the WorkDB.companies and WorkDB.employees dictionaries, and there's nothing in the data structures themselves that guarantees that the two will match. There are ways to fix this, by using "smarter" data structures instead of plain Dicts, but as far as I can tell, they involve making our data structures increasingly complex.
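As an example of what one of those "smarter" data structures might look like, here's a minimal sketch (in plain Java; IdIndexedSet is a made-up name, not a standard library class) of a container that derives the key from the object itself, so that the persistent id lives in a single place:

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Indexes objects by a key computed from the objects themselves, so the
// id is never stored in a second place that could go out of sync
class IdIndexedSet<T> {
  private final Map<Integer, T> byId = new HashMap<>();
  private final Function<T, Integer> idOf;

  IdIndexedSet(Function<T, Integer> idOf) { this.idOf = idOf; }

  void add(T obj)    { byId.put(idOf.apply(obj), obj); }
  T lookup(int id)   { return byId.get(id); }
  void remove(T obj) { byId.remove(idOf.apply(obj)); }
}

A WorkDB could then hold, say, an IdIndexedSet<Company> constructed with c -> c.id. It works, but the complexity hasn't gone away: it has just been moved into yet another ad-hoc data structure, and it still breaks down if an object's id field is mutated after insertion.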

Now say you need to implement an Employee.remove() method, with the obvious semantics. Employee now needs a reference to WorkDB, because it needs to remove itself from WorkDB.employees:

class Employee {
  ...

  WorkDB work_db;
}

This new member variable (Employee.work_db) is not only, again, redundant, but it also has nothing to do with the real-world entities and concepts we're trying to model; it's there only to solve an implementation problem. That's yet another example of a low-level concern "polluting" our data structures and making them more complex.
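To make the extra bookkeeping concrete, here's a rough sketch of what Employee.remove() might end up looking like with the design we've built so far (using standard Java Maps in place of the Dict pseudo-type, and assuming the Company, JobInfo and WorkDB classes sketched above):

import java.util.Map;

class Employee {
  int id;
  Map<Company, JobInfo> jobs;
  WorkDB work_db;
  // ... other fields as before ...

  void remove() {
    // Remove the redundant back-reference held by every employer
    for (Company c : jobs.keySet())
      c.employees.remove(this);
    jobs.clear();

    // Remove the entry in the global directory, which is keyed by the
    // (also redundant) persistent id
    work_db.employees.remove(id);
    work_db = null;
  }
}

Every line of that method exists only to keep redundant pieces of data in sync: forget any one of them and the dataset is silently corrupted.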

To be continued...

Essential vs accidental state, and making invalid states impossible

Even in a trivial example like the one we've been considering, I hope it has become clear that when using objects it's not feasible in practice to stop low-level implementation concerns from creeping into your data structures. That creates the problems we've been talking about: a bug in the code can put your application's data structures in an invalid state, and updating them takes more work, because when a piece of information is duplicated or, more generally, when several pieces of information have to be kept in sync, updating one of them means having to update all the others.

The other thing I hope you've noticed is that these low-level object-based data structures tend to be very unstable: as requirements change you need to keep tweaking them even when the real-world information that you're dealing with stays the same. And changes in data structures often create a cascade of changes throughout the code base, simply because there's more information to keep track of and pass around.

Contrast this with what happens with the Cell schema we've seen at the beginning of this page. That schema simply cannot be put in an invalid state, no matter how hard you try. In more complex schemas that's usually still possible, but it's much, much harder than in an object graph representation. And the fact that there's no redundancy means that updating a (well-designed) relational schema becomes easier, because there are fewer places in the data structure that need updating. In large applications, the problems with the object-based approach become of course much worse, and are the source of both complexity and bugs. And finally, relational schemas tend to be much more stable, as they don't contain low-level implementation data (or at least not as much as object systems do).

So, to answer again the question we asked at the beginning of this chapter, what we termed "modelling the application state" is guided by two basic principles:

  • You should squeeze out of your data structures everything that is not essential. A piece of data is considered non-essential if it can be removed from the dataset without any loss of information or, in other words, if it can somehow be reconstructed from the information we're left with after it has been deleted.
  • Once all non-essential state has been thrown out, you should design the remaining data structures so that it is (ideally) impossible for them to be in an inconsistent/invalid state. In practice, even with a high-level data model, in many complex cases we may need to settle for less than that, but usually not much less.

In order to do so, we need a high-level data model. With those goals in mind, let's review some of the advantages offered by a relational representation of data.

Relations can be navigated in any direction

Using the works_for(..) relation defined above, you can easily obtain, given the id of a particular company, the list of all its employees with works_for(?, company_id); or, given the id of an employee, the list of companies they work for with works_for(employee_id, ?), or just works_for(employee_id) if they have a single employer.

With pointers, you have to set up two of them, one for each direction, because pointers are a low-level construct that can only be navigated in one direction. As we said before, one of them will be redundant, with the consequences that we've already talked about.

Relations can be searched efficiently on any combination of their arguments

Using the domain model described in the introductory example, you could, for example, easily retrieve:

// All users who signed up on day d
signup_date(?, d)

// All users whose last name is n
last_name(?, n)

// All users that joined a chat group g on a day d
joined_on(?, g, d)

The first result of those searches would be retrieved in constant time (on average: it's basically a hashtable lookup), and retrieving subsequent ones would be approximately equivalent in terms of performance to iterating through the elements of a linked list. With objects, you would need to create and maintain ad-hoc data structures for each of those searches, if you wanted them to be efficient. That means more code, more redundancy and more opportunities for bugs to slip in.
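To get a feel for what that means, here's roughly the kind of code you'd have to write and maintain by hand just to make the first of those three searches fast in an object-based design (the User class and field names here are made up for illustration; a similar index is needed for every additional access path):

import java.time.LocalDate;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class User {
  int id;
  LocalDate signupDate;
}

class UserDirectory {
  Map<Integer, User> usersById = new HashMap<>();
  // Ad-hoc, redundant index needed to answer signup_date(?, d) efficiently
  Map<LocalDate, Set<User>> usersBySignupDate = new HashMap<>();

  void add(User u) {
    usersById.put(u.id, u);
    usersBySignupDate.computeIfAbsent(u.signupDate, d -> new HashSet<>()).add(u);
  }

  void remove(User u) {
    usersById.remove(u.id);
    Set<User> sameDay = usersBySignupDate.get(u.signupDate);
    if (sameDay != null) {
      sameDay.remove(u);
      if (sameDay.isEmpty())
        usersBySignupDate.remove(u.signupDate);
    }
  }

  // Equivalent of signup_date(?, d)
  Set<User> signedUpOn(LocalDate d) {
    return usersBySignupDate.getOrDefault(d, Set.of());
  }
}

And of course that index has to be updated in every place where users are added, removed, or have their signup date changed.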

No need to wire an object graph

When designing your object model, you have to make sure that every object is wired to all the other objects it needs access to in order to perform the tasks that have been assigned to it. This wiring can easily get complex, and it tends to be rather unstable, in the sense that as the application grows, you'll often find yourself spending a not insignificant amount of time adding extra wiring to it. Your mileage may vary here of course, but in my experience it's pretty common for a new feature to be difficult to implement not so much because it involves complex algorithms, but because it requires several objects/classes to cooperate in a way that requires them all to be wired to one another; if that wiring is not already in place, adding it may involve making significant changes all over your code base.

With the relational model, on the other hand, the wiring problem simply disappears. That's in part a consequence of the fact that relations can be navigated in any direction, and in part stems from the fact that every piece of data is always accessible from the automaton that contains it. The end result, anyway, is that, just like with searches, you get navigation for free.

Relations can elegantly encode complex facts that are awkward to represent with objects

Take, for example, the following statements:

  • Supplier S sells part P at price C (with every supplier charging a different price for the same part)
  • Supplier S sells part P at price C for orders of at least N items
  • Person A was introduced to person B by person C
  • User U joined chat group G on date D

In the relational world, they can be modelled by the following relations:

sells_for(SupplierId, PartId, Money)        [key: 0:1];
sells_for(SupplierId, PartId, Money, Int)   [key: 0:1:3];
introduced_by(PersonId, PersonId, PersonId) [key: 0:1];
joined_on(UserId, ChatGroupId, Date)        [key: 0:1];

where each entry in a relation represents an instance of the corresponding informal statement. I can't imagine a more natural way of encoding that information.

With objects, what you can model naturally is attributes of an entity. If, for example, every supplier charged the same price for the same part, you could model price as an attribute of parts, which could be represented as a member variable price of class Part. But in a more complex case, where every supplier charges a different price, you'll have to bend over backwards to do it.

If you're not convinced, just try to write the equivalent data structures using objects for one of the above facts (let's say it's the first one), remembering that you'll need to be able to, among other things:

  • Given a supplier s and a part p, to check if the supplier sells that part, and if so to retrieve its price (sells_for(s, p, _) and sells_for(s, p) in Cell, respectively)
  • Given a supplier s, to retrieve all the parts it sells, and their prices (sells_for(s, ?, ?) in Cell)
  • Given a part p, to retrieve all the suppliers that sell it, and the price they charge (sells_for(?, p, ?) in Cell)
  • To make sure that there cannot be duplicate entries for the same supplier/part combination (that's what the [key: 0:1] is for)

Of course you'll need to do all those things efficiently, since you might have a lot of parts and suppliers, so linear searches are out of the question. In the best-case scenario, you'll end up building an ad-hoc, informally-specified, bug-ridden, slow implementation of a relation.
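To illustrate the point, here's a minimal sketch of what such an implementation typically ends up looking like for the first of those facts (plain Java; SupplierId, PartId and Money are hypothetical placeholder types, and removal and error handling are omitted):

import java.util.HashMap;
import java.util.Map;

// An object-based stand-in for sells_for(SupplierId, PartId, Money) [key: 0:1].
// The same information is stored twice, once per access direction, and only
// programmer discipline keeps the two maps consistent.
class PriceCatalog {
  private final Map<SupplierId, Map<PartId, Money>> bySupplier = new HashMap<>();
  private final Map<PartId, Map<SupplierId, Money>> byPart = new HashMap<>();

  // sells_for(s, p, _) and sells_for(s, p): does s sell p, and at what price?
  Money priceFor(SupplierId s, PartId p) {
    Map<PartId, Money> prices = bySupplier.get(s);
    return prices == null ? null : prices.get(p);
  }

  // sells_for(s, ?, ?): all the parts sold by s, with their prices
  Map<PartId, Money> partsSoldBy(SupplierId s) {
    return bySupplier.getOrDefault(s, Map.of());
  }

  // sells_for(?, p, ?): all the suppliers of p, with the prices they charge
  Map<SupplierId, Money> suppliersOf(PartId p) {
    return byPart.getOrDefault(p, Map.of());
  }

  // The [key: 0:1] constraint is enforced only implicitly, by overwriting any
  // previous entry; both maps have to be updated together, every time
  void setPrice(SupplierId s, PartId p, Money price) {
    bySupplier.computeIfAbsent(s, k -> new HashMap<>()).put(p, price);
    byPart.computeIfAbsent(p, k -> new HashMap<>()).put(s, price);
  }
}

That's a hand-rolled, partial reimplementation of a single relation, and it covers just one of the four statements listed above.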

Integrity constraints

With the relational model you can express declarative integrity constraints on your data: using the domain model described in the introductory example, you can for instance enforce the facts that no two users can have the same username, or that for every combination of user and chat group there can be only one join date and karma. That's in addition to the fact that the relational model allows you to squeeze most redundancy out of your data structures, thereby reducing the likelihood that they'll end up in an inconsistent state.

With objects you don't have such a simple, declarative and reliable way to enforce the consistency of your data, which deprives you of a powerful mechanism to prevent, and to defend yourself against, programming errors.

Attributes can be modeled in a uniform way independently of their cardinality

Let's say in your data model you want to store the telephone numbers of your users. If you want to allow multiple phone numbers for each user, you can declare a binary relation like the following one:

phone(UserId, String);

If on the other hand you want each user to have at most one phone number, you'll declare the relation as follows:

phone(UserId, String) [key: 0];

In both cases the data representation is exactly the same, and the only thing that changes is that in the latter case you'll have to declare a key for the relation, that is, an integrity constraint. The same goes for relationships: the works_for relation we saw before, for instance, can encode either a many-to-many or a many-to-one relationship (it could also become a one-to-one relationship of course, if it weren't for the fact that that doesn't make any sense in this context) depending on the integrity constraints that are applied to it, and the same happens with mandatory and optional attributes.

There are at least a couple of benefits to having such a uniform representation for attributes or relationships with different cardinalities. The first one is that code written in a declarative language like Datalog will keep working without changes even as those integrity constraints change over time. The second one is that this uniformity will make it easier for new code to work with old data: if an attribute that used to have only one value becomes multivalued, you'll be able to reuse the old data without having to do any sort of conversion (of course that's not going to work in the opposite direction, since the data model in that case becomes more restrictive). The same happens when a mandatory attribute suddenly becomes optional.

This is, by the way, one of the reasons why in Cell every attribute or relationship is stored in a separate relation. In SQL databases, all single-valued attributes are stored in the same table, while multivalued ones require a separate table, and optional attributes require the use of a problematic feature like NULLs.

Declarative query languages

With relations you can have declarative query languages (think Datalog, not SQL). I've never seen anything similar for any other data model. A language like Datalog is not Turing complete, of course, but for the kind of computation it allows, its simplicity and expressiveness are simply unrivaled.

I can't quite put my finger on why there's nothing remotely equivalent for any other data model, but I think that the fact that relations can be easily searched, and that their tuples/entries are not ordered (unlike what happens, for example, with sequences, where some of the information is conveyed by the order in which elements are arranged, not just by the elements themselves) are both crucial to that.

The availability of a declarative query language is not important only because it simplifies some types of computation. Another reason is that the result of certain classes of declarative expressions can be recalculated incrementally and automatically whenever the input data changes. That's especially important in a programming language, as opposed to a database system.

Let's illustrate this with an (admittedly contrived) example. Let's say you're building a massively multiplayer online game, where each player gets to build their own base/town, and where they need to extract resources of some sort (oil/timber/ore/...) to do so, and to build their armies. Let's also say that players can sign bilateral trade agreements, that allow them to buy the resources they need and sell those they have an excess of. Trade can be done through intermediaries: if Alice has a trade agreement with Bob, and Bob has one with Charlie, then Alice and Charlie can trade even if they don't have a direct trade agreement, although in this case Bob will be charging a fee for his services. For simplicity, every trade will be allowed to go through only one or maybe two intermediaries at most (things would get a lot more complicated if we allowed intermediary chains of arbitrary length).

In the game, we want every player to be able to see in real time if any given resource can be purchased, what's the best price for it, what's the second best one and so on, and how much of it can be bought at each price.

Expressing this logic in a Datalog-like language is straightforward: it's almost as easy as stating it in natural language. Computing the same information from scratch in either an imperative or functional language is significantly more complex than that, although still not excessively so.

But what we would probably want to do in a case like this is not to recompute everything from scratch every time something in the (massively multiplayer) game changes, but to update the results incrementally as the game progresses. In a declarative query language, all the queries involved in this particular case can be updated incrementally and automatically by a reasonably sophisticated query engine (or by code generated under the hood by a reasonably sophisticated compiler) without any extra effort on the part of the developer. But in an imperative (or functional) one, incrementally recomputing the output map is significantly more complex than computing it from scratch, and a lot more error-prone. That's because there are a lot of factors that can change the results, all of which have to be accounted for. The stocks of resources for sale can be replenished, or they can decrease and run out. Prices can change. So can the fees that players charge for acting as intermediaries in a trade. New trade agreements can be signed, and existing ones can expire or be canceled. All of these changes have to be propagated to all the pieces of information that depend on them, and the exact propagation paths are different in each case.

Values vs objects

Some of the benefits conferred by the relational model are just a consequence of the fact that the relational model is value-based, that is, it has no notion of pointers. For a good overview of why it's better to avoid pointers in a high-level language, I recommend that you watch, if you haven't already, this talk by Rich Hickey (the creator of Clojure and Datomic): The Value of Values. He explains it much better than I ever could. We'll take all that for granted, and we'll be focusing here instead on several aspects that are either not discussed in that talk, or that are specific to the relational model.

Of course, just removing pointers from your data model is going to cripple its expressiveness. As an example, try to implement in a value-only language (like, for example, Haskell) a set of data structures that are more or less equivalent to the OnlineForum automaton shown in the introductory example. There are several ways to do it, but in all cases the resulting data structures will be, compared to their OOP equivalents, both more difficult to navigate and less reliable, since they will contain more redundancy. So you need to replace pointers with some other type of construct that provides the same functionality, but without the disadvantages. The most obvious candidate for that role is of course relations.

Controlling the scope of side effects

Since the relational model is value-based, you can have mutability and control over the scope of side effects at the same time: when you update the state of an automaton instance in Cell you have the guarantee that nothing else is affected, because automata don't share mutable state (and also because state mutation and I/O cannot be mixed, of course). That combines the advantages of functional and imperative programming, while mostly avoiding the problems of both: when you can control the scope of side effects it becomes much easier to reason about your code, and you can safely update different automata concurrently, but with most of the convenience and performance that comes from being able to update the application's state in place.

But once you introduce pointers in your data model, it becomes very difficult to retain any amount of control over the scope of the side effects (if there's an effective way to do it, I'm not aware of it). By mutating an object, you affect any part of your program from which such object can be reached, either directly or indirectly.

Pointer aliasing is incompatible with reactive programming

Let's illustrate that with an example. Say you have two variables, x and y, and you want y = f(x) to hold in your program at any time. Moreover, let's say that evaluating f(x) is computationally expensive, so that you cannot afford to do it every time you need to read the value of y. That's the kind of task reactive programming is for. In a reactive system, every time x changes, f(x) is automatically reevaluated and the result stored in y. But in order for the compiler to insert the under-the-hood code that does so, it must be able to detect when x changes. With a value-based data model that's easy to do: a variable can change only if a new value is explicitly assigned to it. But in the presence of pointer aliasing that's not possible anymore, at least in the general case. That's because there can be other references to whatever object x is pointing to, and any of them can be used to mutate it. Moreover, any change of state of any object that is reachable from x, even indirectly, can be enough to change the result of the evaluation of f(x).
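Here's a tiny Java sketch of the problem (the names and the function f are made up for illustration): no assignment to x ever takes place, so a framework watching that variable sees nothing to react to, yet the value of f(x) changes anyway through an alias.

import java.util.ArrayList;
import java.util.List;

class StaleValue {
  // Stand-in for the expensive computation mentioned above
  static int f(List<Integer> xs) {
    return xs.stream().mapToInt(Integer::intValue).sum();
  }

  public static void main(String[] args) {
    List<Integer> x = new ArrayList<>(List.of(1, 2, 3));
    int y = f(x);              // y is supposed to always equal f(x)

    List<Integer> alias = x;   // another reference to the same object
    alias.add(4);              // mutates the object without touching x

    System.out.println(y == f(x));  // prints false: y has silently gone stale
  }
}

With immutable values this cannot happen: the only way f(x) can change is through an explicit assignment to x, which is trivially detectable.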

There are, of course, reactive frameworks for object-oriented languages. For the most part, they just pretend that the above problem can't happen, and typically place the burden of detecting when a given data structure has changed on the programmer. The end result is that the code ends up being a lot more complex than it needs to be, thereby obfuscating the much simpler underlying logic. There are also other, more subtle (and insidious) problems: for example, when dealing with non-trivial reactive systems, where changes are propagated through several layers of state, all recalculations have to be done in a very specific order to ensure that they never make use of "stale" values (check the Wikipedia article on reactive programming for more details). In a value-based language, the compiler can easily figure out the correct order, but with objects that's not possible anymore and those recalculations end up being done in what is basically a random order.

Other frameworks, like ReactiveX, eschew the notion of state altogether and go fully functional (even though they are designed for imperative languages), with computation being expressed as the composition of functions that transform streams of events. That's a sound and interesting approach, but it takes a rather restrictive view of reactive programming, whose underlying ideas are useful in many contexts where that specific approach cannot be employed.