NoSQL is fixing the wrong problem

Imagine if someone told you “Cars are bad. They make a lot of pollution that warms the planet. So we’re going to fix this by engineering a really great electric motor and we’re going to put it in a hovercraft so you don’t need to build all those roads.”

What you’ve got there is a good idea (a really good electric motor) but in throwing out a bunch of other good ideas (like roads), the result is worse.

SQL sucks like a really, really sucky thing

SQL is, in an awful lot of ways, well, awful.

The language looks like it was the second job someone did after designing Cobol. It’s verbose, hard to read, hard to write, hard to parse (for humans and computers) and hard to generate.

It’s full of inconsistencies.

The semantics aren’t much better. Everything is heavyweight. Namespacing and data hiding are primitive at best (Why can’t I have a schema as the value of a field? Why isn’t a polymorphic join part of the standard?).

Modularity is virtually entirely absent (want to do one query almost exactly like another one? Copy-paste).
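To make that concrete, here is a hedged sketch (the orders table is purely hypothetical): two reports that differ only in their date range have to repeat the entire query, because the language offers no lightweight way to abstract over the shared part short of defining a separate view or stored procedure.

    -- Hypothetical "orders" table. Two almost-identical monthly reports:
    -- only the date range differs, yet the whole query must be repeated,
    -- since SQL has no lightweight way to parameterise the shared part.
    SELECT customer_id, sum(total) AS spend
    FROM orders
    WHERE ordered_at >= '2015-01-01' AND ordered_at < '2015-02-01'
    GROUP BY customer_id;

    SELECT customer_id, sum(total) AS spend
    FROM orders
    WHERE ordered_at >= '2015-02-01' AND ordered_at < '2015-03-01'
    GROUP BY customer_id;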

Recursion is an ugly after-thought or absent entirely (I’m looking at you, MySQL).
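For the record, the standard’s answer is WITH RECURSIVE (bolted on in SQL:1999), sketched below against a hypothetical categories table; it works, but it reads like the afterthought it is, and MySQL simply didn’t have it.

    -- Walking a tree with SQL's bolted-on answer to recursion, WITH RECURSIVE.
    -- The "categories" table (id, parent_id, name) is hypothetical.
    WITH RECURSIVE subtree AS (
        SELECT id, parent_id, name
        FROM categories
        WHERE id = 42                 -- some hypothetical starting node
      UNION ALL
        SELECT c.id, c.parent_id, c.name
        FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    )
    SELECT * FROM subtree;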

It’s awful, and when you dig into what ought to have been, it gets a lot worse…

Before the tragedy

Rewind to the 70s. After the world turned to color but before Microsoft blessed the world with MS-DOS.

In those halcyon days of yore, if you wanted to store data, you used puts and gets to read and write flat files. This was a brittle, tedious business that clearly wasn’t going to scale. Folks started looking for a better solution.

IBM had developed IMS (Information Management System) in the 60s. This was a hierarchical data storage model (sound familiar?) where data were nodes in a tree. And this started to become popular as a more flexible alternative to flat files.

But it still had problems: applications couldn’t just deal with the data being stored; they had to spend a great deal of time fussing over how it was stored (sound familiar?). If you had multiple applications accessing the same data (kind of a major point of a database system), each of them had to be separately aware of this pointless cruft. Changing the path to the data was difficult to the point of being virtually impossible. Sound familiar?

But then folks started to take note of the relational model, first described by E. F. Codd in 1969.

Simplifying things a good deal, you could say that the relational model gives data a very simple representation (crucially, all data is equal, because everything is “top level”), which is nonetheless provably very expressive (it is essentially equivalent to first-order logic, which is about the most expressive computable representation available).
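Here is a small sketch of what “top level” means in practice (the tables are hypothetical): a parent/child relationship isn’t nesting, it’s just another flat table plus a key, so every row is equally easy to reach.

    -- Hypothetical schema. Nothing is nested inside anything else: a
    -- "has many" relationship is just a flat table with a foreign key,
    -- so authors and books are equally addressable and queryable.
    CREATE TABLE authors (
        id   integer PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE books (
        id        integer PRIMARY KEY,
        author_id integer NOT NULL REFERENCES authors(id),
        title     text NOT NULL
    );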

Once IBM developed SEQUEL (renamed to SQL because of a trademark issue) and particularly once Relational Software (now Oracle) released their database, the writing was on the wall for competing database models.

Cue the tragedy

SQL was much, much better than what came before it, but it was still far from perfect. Apart from the aforementioned issues, it wasn’t a proper implementation of Codd’s relational algebra. It was sufficiently close to be very useful, but had issues which meant that the elegant simplicity of the relational model didn’t apply.

A properly relational database would allow for simple automated query optimization much as it is possible to optimize the calculation of mathematical expressions. But SQL databases are much, much more complex to write because of all the corner cases and general weirdness of the language. This is why there are only a handful of good SQL implementations.
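To make the optimization point concrete (again, with hypothetical tables): the two queries below are relationally equivalent, so a planner is free to pick whichever plan is cheaper, much as you would rearrange an arithmetic expression; SQL’s corner cases (NULLs, duplicate rows, and the rest) just make proving such equivalences far harder than it ought to be.

    -- Hypothetical orders/customers tables. These two queries are relationally
    -- equivalent; a planner may rewrite one into the other (e.g. pushing the
    -- country filter below the join), the way you'd simplify an expression.
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.country = 'NZ';

    SELECT o.id, c.name
    FROM orders o
    JOIN (SELECT id, name FROM customers WHERE country = 'NZ') c
      ON c.id = o.customer_id;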

The SQL language also pretty much requires the monolithic, heavyweight storage model and all the rest.

To be clear: none of this is inherent in the relational model, just in this particular language that only sort-of implements it.

Which leads us now to the tragedy of the NoSQL movement. A new generation of developers, keen to make their mark and do something cool and new, has come along. They are too young to remember what went on in the 70s, and computer science departments don’t teach history. They noted the many issues with SQL and wanted to do something new. Great, well done, brilliant idea: let’s get rid of SQL.

But not remembering the past, the NoSQL folks are tragically doomed to repeat it. Having found the bathwater a bit brackish, they’ve thrown out the baby along with it and reinvented the hierarchical and network data models.

It makes me want to cry

The NoSQL developers are doing some great work, developing data storage that is easier to scale, where the schema is lightweight and changeable, and other great stuff. But by throwing out the relational model along the way, they’ve rendered most of this work useless.

There is no such thing as a “schemaless” database; you can have the schema as part of the database, or you can have it in your application. It belongs in the database.
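One hedged illustration of the difference (the users table is hypothetical): when the schema lives in the database, the rules below hold for every application that touches the data; in a “schemaless” store, each application has to re-implement them and hope the others agree.

    -- Hypothetical users table. These constraints *are* the schema; declaring
    -- them here gives every application the same guarantees, instead of each
    -- one re-implementing the checks in its own code.
    CREATE TABLE users (
        id        integer PRIMARY KEY,
        email     text NOT NULL UNIQUE,
        signed_up date NOT NULL,
        age       integer CHECK (age >= 0)
    );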

You can store data buried in a JSON object or some such thing, but why? There is, in general, no significant advantage to this complexity, and I’ve heard many stories of folks who started with a NoSQL database and wound up having to (painfully) convert to a SQL database after they (re)discovered the pain of hunting their data down inside some complex, arbitrary data structure, only to find it was the wrong structure when the problem changed.

Since PostgreSQL 9.4 added robust JSON support, any situation where JSON storage of data is useful is covered by about the best SQL database around.
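A minimal sketch of what that looks like (the events table is hypothetical): a jsonb column sits alongside ordinary relational columns, can be indexed, and can be queried for containment.

    -- PostgreSQL 9.4+: a jsonb payload alongside ordinary relational columns.
    -- The "events" table here is hypothetical.
    CREATE TABLE events (
        id       bigserial PRIMARY KEY,
        occurred timestamptz NOT NULL DEFAULT now(),
        payload  jsonb NOT NULL
    );

    -- A GIN index keeps containment queries fast.
    CREATE INDEX events_payload_idx ON events USING gin (payload);

    -- Find events whose payload contains {"type": "signup"}.
    SELECT id, occurred
    FROM events
    WHERE payload @> '{"type": "signup"}';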

If only

If all this engineering effort had started from the lessons of the past, we could now have lightweight, eventually-consistent, easily scalable relational databases, and the world would actually be a better place. But until that happens, consider NoSQL databases with a skeptical eye, and stick with Postgres.

Sigh.