- Describe several uses for a hashmap
- Implement a hashmap in the language of your choice
- Describe the advantages of using a hashmap vs. other data structures
In JavaScript, nearly everything is a hashmap. MongoDB is essentially a distributed hashmap. Hashmaps are used constantly in computer science because they provide a flexible and fast way of storing and retrieving data. Why is that? To find out, we have to look at how a hashmap is implemented.
Data structures are just collections of information, and there are four operations that you'll need to do with any data structure (sketched with JavaScript's built-in Map after this list):
- Insertion - The data structure would be useless if you couldn't add things to it.
- Search - Finding an item when you don't initially know where it is
- Access - Accessing an item when you know its location
- Deletion - You should be able to remove things from a data structure when you need to
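As a quick sketch of those four operations, here they are with JavaScript's built-in Map, which engines typically implement as a hash table. Note how, with a hashmap, search and access collapse into the same key lookup:

```javascript
const authors = new Map();

// Insertion: add key/value pairs
authors.set("tolkien", "The Lord of the Rings");
authors.set("austen", "Pride and Prejudice");

// Search and access: both are just key lookups
authors.has("austen");  // true
authors.get("tolkien"); // "The Lord of the Rings"

// Deletion: remove a key/value pair
authors.delete("austen"); // true
authors.has("austen");    // false
```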
In the most basic possible data structure, an unorganized array, insertion is easy as long as you store the length of the array: just add the new item to the end. Search is hard. In the worst case, you go through the entire array and only find what you're looking for at the end. Deletion is similarly slow. You might have to go through the entire array to find the item, and once you remove it, you either leave a gap in your array, which means degradation in performance over time, or you double your workload by resizing the array after the deletion.
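To make that concrete, here's a rough sketch of those linear-time operations on a plain unsorted array (the data is made up for illustration):

```javascript
const items = ["pear", "apple", "plum", "fig"];

// Search: in the worst case we scan every element
function find(arr, target) {
  for (let i = 0; i < arr.length; i++) {
    if (arr[i] === target) return i;
  }
  return -1;
}

// Deletion: scan to find the item, then shift everything after it
// down one slot so no gap is left behind
function remove(arr, target) {
  const i = find(arr, target);
  if (i !== -1) arr.splice(i, 1); // splice does the shifting for us
}

remove(items, "plum"); // items is now ["pear", "apple", "fig"]
```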
Things are somewhat better in an ordered array: binary search makes lookup O(log n). But you have to pay for the initial sort, and keeping the array sorted means every insertion has to shift elements to make room, a cost that grows with the size of the array. Hashmaps solve many of these problems.
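A minimal binary search sketch, to show what sorted order buys you:

```javascript
// Binary search: O(log n) lookup, but only on a sorted array
function binarySearch(sorted, target) {
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (sorted[mid] === target) return mid;
    if (sorted[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1;
}

binarySearch(["austen", "dickens", "tolkien"], "dickens"); // 1
```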
Think of the process of putting books into a neatly organized bookshelf. The first thing you do is look at the first letter of the author's last name. Then you go directly to the section of the bookshelf that contains authors whose last name begins with that letter. You then figure out where the book belongs within that section, and insert the book.
We can see that this example has a few main components:
- Finding the first letter of the author's last name - In the context of a hashmap, this is known as our hashing function
- Having a section for each letter of the alphabet, in which you'll put authors - These are our hash buckets. Finding the right hash bucket can be done in constant time
- Figuring out where the book belongs within that section - This is collision handling. Whenever two authors' last names begin with the same letter, they map to the same bucket, which is known as a hash collision
Insertion, deletion, and lookup in a hashmap can all be very fast: close to constant time, depending on how you handle hash collisions. Moreover, they stay fast even as the hashmap scales.
If we ignore case, we can assume that our bookshelf has 26 hash buckets. Our hashing function just takes the first letter of the author's last name, figures out which letter of the alphabet it is, and goes directly to that bucket.
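Here's a minimal sketch of that hashing function, assuming single-word, ASCII last names:

```javascript
// 26 hash buckets, one per letter of the alphabet
const NUM_BUCKETS = 26;

// Hashing function: first letter of the last name -> bucket index 0-25
function bookshelfHash(lastName) {
  return lastName.toLowerCase().charCodeAt(0) - "a".charCodeAt(0);
}

bookshelfHash("Tolkien"); // 19, the "T" section
bookshelfHash("Austen");  // 0, the "A" section
```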
If we had to use fewer hash buckets, we could take that letter's position in the alphabet modulo the number of buckets we have, and use that.
If our hashing function returned an arbitrarily high number, we'd still just mod it by the number of buckets.
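For example, with an arbitrarily chosen 10 buckets:

```javascript
// Only 10 buckets this time; bookshelfHash is the sketch from above
const FEWER_BUCKETS = 10;

function bucketIndex(lastName) {
  return bookshelfHash(lastName) % FEWER_BUCKETS;
}

bucketIndex("Tolkien"); // 19 % 10 = 9
bucketIndex("Austen");  // 0 % 10 = 0
// An arbitrarily large hash would work the same way: hash % FEWER_BUCKETS
```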
The hash function can make the entire hashmap perform better or worse. Luckily, most languages have hash function implementations that are pretty good, so you don't have to worry about it.
Abstractly, what we're doing here is converting a key into a number. We have algorithms that can do this with arbitrary strings, meaning we can take any string and deterministically turn it into a number, which we can then use to find an appropriate hash bucket. With arbitrary objects, we could turn them into numbers in many ways, from the memory address they're stored at, to how many bytes of storage they take up, to any combination of things. You're limited only by your creativity.
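One well-known example is the djb2 string hash; here's a rough JavaScript sketch of it (the starting constant 5381 and the multiplier 33 are djb2's conventional values):

```javascript
// djb2-style string hash: fold each character code into a running number
function stringHash(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    // hash * 33 + charCode, truncated to an unsigned 32-bit integer
    hash = (hash * 33 + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

stringHash("tolkien");      // a large 32-bit number, the same every time
stringHash("tolkien") % 26; // ...which mods down to a bucket index
```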
We could use many different data structures for our hash buckets. We could even use another hashmap, as long as it used a different hashing function and eventually bottomed out in some other data structure. Linked lists have constant insertion times, and linear lookup and deletion times. Ordered arrays let you use binary search, which gives O(log n) lookup, though insertion and deletion still take O(n) because elements have to shift.
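Putting the pieces together, here's a minimal hashmap that uses plain arrays as its buckets (a strategy known as separate chaining), assuming string keys. It's an illustrative sketch, not a production implementation:

```javascript
class HashMap {
  constructor(numBuckets = 26) {
    this.numBuckets = numBuckets;
    // Each bucket is an array of [key, value] pairs
    this.buckets = Array.from({ length: numBuckets }, () => []);
  }

  // djb2-style hash, modded down to pick a bucket in constant time
  _bucketFor(key) {
    let hash = 5381;
    for (let i = 0; i < key.length; i++) {
      hash = (hash * 33 + key.charCodeAt(i)) >>> 0;
    }
    return this.buckets[hash % this.numBuckets];
  }

  // Insertion: overwrite the value if the key already exists
  set(key, value) {
    const bucket = this._bucketFor(key);
    for (const pair of bucket) {
      if (pair[0] === key) {
        pair[1] = value;
        return;
      }
    }
    bucket.push([key, value]);
  }

  // Lookup: the scan inside the bucket is our collision handling
  get(key) {
    const bucket = this._bucketFor(key);
    for (const [k, v] of bucket) {
      if (k === key) return v;
    }
    return undefined;
  }

  // Deletion: find the pair within its bucket and splice it out
  delete(key) {
    const bucket = this._bucketFor(key);
    const i = bucket.findIndex(([k]) => k === key);
    if (i === -1) return false;
    bucket.splice(i, 1);
    return true;
  }
}

const shelf = new HashMap();
shelf.set("tolkien", "The Lord of the Rings");
shelf.get("tolkien");    // "The Lord of the Rings"
shelf.delete("tolkien"); // true
```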
The number of hash buckets we use also has an effect on the performance of our hashmap. Too few buckets and we get a lot of hash collisions, which degrades performance; too many and we use more memory than we need to. Most real implementations track the ratio of items to buckets, called the load factor, and grow the bucket array when it climbs too high.
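Here's a rough sketch of that resizing step, building on the HashMap sketch above (the doubling strategy and the exact threshold are conventions, not requirements):

```javascript
// Sketch: double the bucket count and re-hash every entry.
// Assumes the HashMap sketch from the previous example.
function resize(map) {
  const oldBuckets = map.buckets;
  map.numBuckets *= 2;
  map.buckets = Array.from({ length: map.numBuckets }, () => []);
  for (const bucket of oldBuckets) {
    for (const [key, value] of bucket) {
      map.set(key, value); // re-inserts into the new, larger bucket array
    }
  }
}

// A common trigger: resize when items-per-bucket (the load factor)
// climbs past some threshold, such as 0.75
```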
As mentioned earlier, MongoDB is essentially a distributed hashmap. Notice how every item you insert gets a big, ugly string as an ID? That's basically a big number that's generated internally and can play the role of a hash. The advantage is that we can split our hash buckets across a number of servers to distribute load more evenly and make our system more fault-tolerant.
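The routing idea is the same as picking a bucket, just with servers in place of buckets. The server names below are made up for illustration, and real systems such as MongoDB's hashed sharding are considerably more sophisticated:

```javascript
// Hypothetical server pool; stringHash is the sketch from above
const servers = ["db-0.example.com", "db-1.example.com", "db-2.example.com"];

// Pick a server exactly the way we'd pick a bucket
function serverFor(key) {
  return servers[stringHash(key) % servers.length];
}

serverFor("some-document-id"); // the same key always routes to the same server
```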
Everything being a hashmap in JavaScript is exactly what makes it so easy to add arbitrary keys and values to objects without worrying about organization.
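For example:

```javascript
// Plain objects accept arbitrary keys, hashmap-style
const book = {};
book.title = "The Hobbit";
book["publication year"] = 1937; // any string works as a key
delete book["publication year"]; // ...and keys can be removed at will
```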