Saturday, April 2, 2011

Distributed Hash tables

Hash tables
Hash tables are efficient data structures that trade memory for speed: all of their basic operations run in average-case constant time. However, as everyone knows, how well a hash table performs depends on its internal implementation. A poorly designed hash function will cause more collisions, raising the cost of both inserts and look-ups. A wrong hash function can also waste a lot of memory by not distributing the keys evenly across the slots. For a detailed study of how to choose hash functions, refer to "Introduction to Algorithms".
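To make this concrete, here is a minimal sketch (in Python) of a chained hash table. The class and names here are illustrative, not from any particular library, and Python's built-in hash() stands in for a well-designed hash function.

    # A minimal chained hash table: keys are spread across a fixed number
    # of buckets, so a bad spread means long chains and slow look-ups.
    class SimpleHashTable:
        def __init__(self, num_buckets=16):
            self.buckets = [[] for _ in range(num_buckets)]

        def _index(self, key):
            # hash() stands in for a well-designed hash function.
            return hash(key) % len(self.buckets)

        def put(self, key, value):
            bucket = self.buckets[self._index(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)   # overwrite the existing key
                    return
            bucket.append((key, value))

        def get(self, key):
            for k, v in self.buckets[self._index(key)]:
                if k == key:
                    return v
            return None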

Problems with Hash table
I don't know of a data structure that works for all problems in all situations. Every data structure has its own set of limitations, and that is true of hash tables too. Unlike other data structures such as lists, stacks, queues, and trees, the size of a hash table is not determined by the amount of data being stored. And if you want to increase the size of the hash table, you have no option other than rehashing all the keys into the new, larger table. Also, a machine's RAM has a physical limit based on its addressing capabilities. So when the hash table is maintained in memory (as in most cases), the size it can occupy has a fixed upper limit. What if you want to break that hard limit and go beyond that size?
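The resize cost mentioned above can be shown in a few lines. This is a hedged sketch written against the hypothetical SimpleHashTable from the previous section: growing the table means rehashing every stored key into a new bucket array, an O(n) operation.

    # Growing the table: every key must be rehashed, since its slot
    # depends on the (now changed) number of buckets.
    def resize(table, new_num_buckets):
        old_buckets = table.buckets
        table.buckets = [[] for _ in range(new_num_buckets)]
        for bucket in old_buckets:
            for key, value in bucket:
                # Each key usually lands in a different slot under the new size.
                table.put(key, value)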

DHT - Solution for a larger Hash table
There are many reasons why you might want a large hash table. As you might know, hash tables are very good data structures for caching results, since they enable very fast look-ups. But there is no way to go beyond the size limitations of a single machine. So if you want a larger hash table, the only option is a distributed hash table, which means you distribute your data across multiple hash tables on multiple machines. Often, scaling up is not a good solution. Scaling up means replacing the machine with a higher-end machine that has a more powerful CPU, more memory, and so on. Scaling up is always costly and leaves you with a single point of failure. Scaling out is a better choice: instead of one machine, you maintain a bunch of commodity machines. They are cost-efficient, but since you are using cheap commodity machines, you must always accommodate fault tolerance in your architecture.

Architecture
Given that you go for a distributed hash table, how would you hash a key? The first step is to identify where the key should go, i.e. the machine the key should be hashed to. The second step, on that machine, is to store the key using the normal hashing technique. So instead of applying a single hash, you apply two levels of hashes: the first to identify the machine, and the second to save or look up the key on the machine chosen in the first step. The architecture follows naturally. The duty of the first component is to identify which machine to use. The second component is a list of independent machines, each a simple hash table implementation, ignorant of the fact that outside itself there are multiple cousins doing the same job.
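Here is a sketch of the two-level scheme just described. The node list and the modulo-based placement are illustrative assumptions, not a production design; real systems often use consistent hashing instead (see the comments below).

    # Level 1 picks the machine; level 2 is that machine's own hash table.
    # A plain dict stands in for each machine's local table to keep the
    # sketch self-contained.
    class NaiveDHTClient:
        def __init__(self, nodes):
            self.nodes = [dict() for _ in nodes]

        def _node_for(self, key):
            # Level 1: hash the key to choose a machine.
            return self.nodes[hash(key) % len(self.nodes)]

        def put(self, key, value):
            # Level 2: the chosen machine stores the key in its own table.
            self._node_for(key)[key] = value

        def get(self, key):
            return self._node_for(key).get(key)

    client = NaiveDHTClient(["machine-1", "machine-2", "machine-3"])
    client.put("user:42", "alice")
    print(client.get("user:42"))   # -> alice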

Practical application
Memcached is a well-known implementation of the technique described above. It is a cache layer in front of the database and is expected to reduce the number of database look-ups. It is used by a number of high-traffic websites; Facebook is one such site, and it reports a Memcached hit rate of around 99%.
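For a feel of how this looks in practice, here is a sketch using the third-party python-memcached library (pip install python-memcached). The server addresses are placeholders; the client hashes each key to choose which server stores it, which is exactly the two-level scheme above.

    import memcache

    # Two memcached servers; the client distributes keys across them.
    mc = memcache.Client(["10.0.0.1:11211", "10.0.0.2:11211"])

    # Cache a database result under a key; look it up before hitting the DB.
    mc.set("user:42:name", "alice", time=300)   # expire after 300 seconds
    name = mc.get("user:42:name")               # served from cache on a hit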

4 comments:

  1. Good & informative article! Giving a real-world solution is really a nice job.

    Will it be applicable for all kinds of scenarios? Can we use DHT in all places?

  2. Also read about Consistent Hashing to hash the nodes to avoid data re-distribution on node failure.

  3. Thanks guys. Subha, when you want a large hash table, this is the only way as far as I know. As Puneet has said, there are many live implementations of distributed hash tables that are working successfully.
