Joins in NoSQL database

Joins are hard to do in any distributed database. Even on distributed Sql databases. It is not that joins cannot be supported in a NoSql database. It is just that the database is not designed to be efficient for joins (I’ll explain why), and hence there is no point in supporting joins. Because if you want to do joins on a distributed database, you are incurring a lot of computational and operational overhead, and current thought is that it is usually easier to be clever and avoid having to do such joins. This can be achieved by either storing data in a denormalized way, or doing expensive joins as batch operations in a distributed way on a separate system (like Hadoop for example).

Before we go into why joins are hard on distributed databases, we should also see why people use NoSql databases in the first place. “NoSql” itself only means that the database does not support Sql queries. So they can be optimized for only a small subset of use cases. Generally, this use case is lookups by key or by a range of keys. This allows easy scaling by splitting the key-space into shards, and having a routing layer which redirects the query to the appropriate shard. So NoSql allows for a particular kind of query to be fast and the database to be scalable, but this comes at the cost of other functionality (ACID guarantees).

So why are joins hard on distributed database?

Say you want to join on field foo over tables A and B, you need to first support an index so that you can look up values in A and B by foo. The index creates a problem because the database only supports lookup by key, so the index must be another database that has key as the values in field foo and the value as the primary key of the database. This index might now be located on some other shard. Even if we can somehow guarantee that the index will reside on the same shard, this is still a complex operation.
The data for field foo in tables A and B might reside on different shards. So the join will then need to happen on a layer that sits above the distributed shards (the “broker”). This again makes the broker complex which defeats the purpose of simplicity and easy scalability of the database.
If it is not a distributed database, you could easily implement joins in the application layer on a key-value store if the key to be joined is the key for both tables. But if it is another field, then again you would need to maintain a separate index at which point you should think if you are just better off using Sql.



Person.all('$where' => "function() {

return this.diaspora_handle.match(/^#{query}/i) ||

this.profile.first_name.match(/^#{query}/i) ||

this.profile.last_name.match(/^#{query}/i); }")

Leave a Reply

Your email address will not be published. Required fields are marked *