
Use concurrent (aka "lockless") hash map in the AtomTable #2553

Open
linas opened this issue Apr 26, 2020 · 9 comments


linas commented Apr 26, 2020

The AtomTable uses a mutex to guard access to the TypeIndex. This mutex could be mostly avoided by using a concurrent hash map.

AtomTable Status: After exploring this for a while, there's a problem. What the AtomTable/TypeIndex really requires is a concurrent unordered multimap with thread-safe erasure. There are five or six packages that provide concurrent hash maps, but only one provides a concurrent multimap: Intel TBB. Unfortunately, it does NOT provide a thread-safe erase(), so this seems like a dead end! There's still one possibility: use a concurrent hash map, and store concurrent linked lists as the values. Yikes! Fixed in #2907

Atom Status: The Atom (Atom.cc) stores Values in a std::map. It does NOT use a hash table, in order to keep Atoms as small as possible. Although the std::map could be replaced by a lockless hash table, that risks blowing up RAM usage. So: how bad would this be? How much would it change the size of an Atom? Atom also uses an ordinary std::set for the incoming set; same reasoning as before: we want a tree here, not a hash, to keep the Atom as small as possible.

Overall status: This issue primarily affects users with highly threaded workloads on modern multi-core (more than 8-core) machines. It appears that the mutexes are NOT the primary bottleneck; instead, it's the CPU implementation of atomic increment. See issue #2758 for discussion. Conclusion: a lock-free find() in a hash set could speed things up for truly demanding users, but we don't have users like that.

Lockfree comments: The notes below review the available options. We need two things: a lock-free hash set (for the TypeIndex), and a lock-free tree (i.e. something smaller and less RAM-intensive than a hash set, for the incoming set). The current options are thin, approaching zero: most of them are hard to use or fail to meet the requirements. Lock-free theory looks strong, but drop-in replacements for std::set and std::unordered_set are missing (fb folly comes close). Documentation and benchmarks are lacking or inadequate. This is still the bleeding edge.

linas added the enhancement, good first issue (Good for newcomers), atomspace, and performance (Performance issues) labels Apr 26, 2020

linas commented Apr 26, 2020

Implementing this requires

  • Copying the concurrent hash map source to cog-utils.
  • Modifying TypeIndex to use it.
  • Removing the mutex lock in TypeIndex.

Do the same for Atom.cc
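The pattern being removed can be sketched as follows — a minimal, hypothetical TypeIndex-like set guarded by a reader/writer lock (all names invented here, not the actual AtomSpace code). The steps above amount to swapping the std::unordered_set for a concurrent hash set and deleting the lock lines.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <unordered_set>

// Hypothetical sketch of the pattern slated for replacement: a set of
// atom hashes guarded by a reader-writer lock. With a concurrent hash
// set, the three method bodies stay, and the lock lines disappear.
class LockedIndex
{
    mutable std::shared_mutex _mtx;   // the mutex this issue wants gone
    std::unordered_set<long> _set;    // stand-in for the atom set

public:
    void insert(long h)
    {
        std::unique_lock<std::shared_mutex> lk(_mtx);  // writer lock
        _set.insert(h);
    }
    bool find(long h) const
    {
        std::shared_lock<std::shared_mutex> lk(_mtx);  // reader lock
        return _set.count(h) != 0;
    }
    void erase(long h)
    {
        std::unique_lock<std::shared_mutex> lk(_mtx);  // writer lock
        _set.erase(h);
    }
};
```

Even in this locked form, find() takes only a shared (reader) lock; the win from a concurrent map is eliminating the cache-line traffic on the mutex itself.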


linas commented Apr 26, 2020

Possibilities:

Each seems to have limitations and usability issues... See below for some commentary.

Starting with the simplest one, even if it is not the fastest, is probably the best/safest idea.

linas changed the title Use concurrent hash map in the AtomTable → Use concurrent (aka "locklless") hash map in the AtomTable Jul 16, 2021
linas pinned this issue Jul 16, 2021

linas commented Jul 16, 2021

This is important work: the current scheme bindings have a great amount of trouble parallelizing, and I think that this is due to lock collisions in the AtomTable.

No. See issue #2758: almost all of the parallelism problems are due to the CPU implementation of atomic increment. There could be some speedup from a lock-free TypeIndex (at least if the set find() is made lock-free), but this now seems less high-priority than I first imagined.

linas changed the title Use concurrent (aka "locklless") hash map in the AtomTable → Use concurrent (aka "lockless") hash map in the AtomTable Jul 16, 2021

linas commented Jul 16, 2021

What is actually needed is a drop-in replacement for std::unordered_multimap, and none of the solutions above provide that. Intel TBB does seem to provide one; however, the TBB version does not provide a thread-safe erase(), and we need that.

Comments on the concurrent hash maps mentioned above:

  • libcds -- has Debian packages. Multiple choices of containers. All of them have difficult-to-use APIs, requiring all sorts of complex compare functions (aka traits) to be provided. None are just plain sets. Under-documented; no examples.
    • michael_set -- fixed-size, not dynamically expandable.
    • skip_list_set -- O(log N) performance.
    • striped_set -- lock striping.
    • feldman_hashset -- requires a perfect hash, and the hash is fixed-size. So no strings as keys.
    • ellen_bintree_set -- unbalanced binary tree; O(log N) performance for a uniformly distributed hash.
    • cuckoo_set -- seems to use unique_ptr instead of GC.
  • junction -- Docs say: "To make QSBR work, each thread must periodically call junction::DefaultQSBR.update at a moment when that thread is quiescent" -- Each thread? Not sure how to do that, because the AtomSpace doesn't know about threads, and doesn't know when things are quiescent. So this seems like a problem with using junction. Also: "The maps don’t currently support smart pointers." But otherwise very impressive.
  • Taymandis/hastable -- unclear documentation; unmaintained.
  • folly -- there's no Debian package for it. Vast number of users; maintained. Several candidates:
    • AtomicHashMap -- Lockless find(). Drop-in replacement for std::unordered_map. Erased elements are not reclaimed, but are tombstoned forever. A bad idea for the generic AtomSpace.
    • F14ValueSet -- this is NOT concurrent; locks are still needed. It's just "faster". Yes, it does appear to work. Performance improvement, if any, is hard to discern.
    • ConcurrentHashMap -- Lockless find(). Drop-in replacement for std::unordered_map. Works. No discernible performance impact on the bio-loop3.scm benchmark.
  • divfor/atomich_hash -- simple API, but lacks a C++ container API. Appears to be a fixed-size hash table; does not auto-resize. The AtomSpace needs a resizable table.
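As an illustration of the lock-striping idea behind containers like striped_set, here is a minimal sketch (invented names, not the libcds API): one big lock is split into N stripes, so operations on keys that hash to different stripes never contend with each other.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_set>

// Hypothetical lock-striped set: each key hashes to one of NStripes
// buckets, each guarded by its own mutex. Not lock-free, but contention
// drops roughly by a factor of NStripes for uniformly hashed keys.
template <typename Key, std::size_t NStripes = 16>
class StripedSet
{
    std::array<std::mutex, NStripes> _locks;
    std::array<std::unordered_set<Key>, NStripes> _buckets;

    std::size_t stripe(const Key& k) const
    { return std::hash<Key>{}(k) % NStripes; }

public:
    void insert(const Key& k)
    {
        std::size_t s = stripe(k);
        std::lock_guard<std::mutex> lk(_locks[s]);  // lock one stripe only
        _buckets[s].insert(k);
    }
    bool find(const Key& k)
    {
        std::size_t s = stripe(k);
        std::lock_guard<std::mutex> lk(_locks[s]);
        return _buckets[s].count(k) != 0;
    }
    void erase(const Key& k)
    {
        std::size_t s = stripe(k);
        std::lock_guard<std::mutex> lk(_locks[s]);
        _buckets[s].erase(k);
    }
};
```

Unlike the truly lock-free designs above, this supports a safe erase() trivially, which is exactly the operation TBB was missing.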


linas commented Nov 21, 2021


linas commented Nov 21, 2021

Idea: in TypeIndex.h, instead of declaring

typedef std::unordered_multimap<ContentHash, Handle> AtomSet;

do NOT use ContentHash, but use the string s-exp of the Atom instead! This then becomes a plain map instead of a multimap... except that the string version cannot deal with alpha-equivalent expressions. Bummer.

This could be solved by having a flag on each atom: "contains alpha terms". If set to true, fall back to a lock plus content-hash; else use the lock-free lookup! That might work!

It would require a fast atom-to-short-string function, however. (This could be made fast by caching substrings in RAM, but that blows up RAM usage even more.)
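A minimal sketch of the flag-plus-fallback idea, with invented names: atoms without alpha-convertible terms go into a plain map keyed by their s-expression string; atoms with them signal the caller to take the locked content-hash path instead.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for an Atom, carrying the proposed flag.
struct FakeAtom
{
    std::string sexp;      // e.g. "(Concept \"foo\")"
    bool has_alpha_terms;  // true => string key is not canonical
};

// Hypothetical index keyed by s-expression string. A plain map works
// because distinct (non-alpha) atoms have distinct s-expressions, so
// no multimap is needed.
class SexpIndex
{
    std::unordered_map<std::string, FakeAtom*> _by_sexp;

public:
    // Returns false if the caller must use the locked content-hash path.
    bool add(FakeAtom* a)
    {
        if (a->has_alpha_terms) return false;
        _by_sexp[a->sexp] = a;
        return true;
    }
    FakeAtom* find(const std::string& sexp) const
    {
        auto it = _by_sexp.find(sexp);
        return it == _by_sexp.end() ? nullptr : it->second;
    }
};
```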

FIXED in #2902; the multimap design was insane.

linas added a commit that referenced this issue Nov 22, 2021

linas commented Nov 24, 2021

In TypeIndex.h, there was no reason for having an unordered multimap. I removed it in pull req #2902, replacing it with an unordered set. This now opens the door for concurrent data structures. The TypeIndex set replacement needs to support:

  • insertion
  • deletion
  • find

After the cleanup of #2907 (and the four prior pull reqs), experimenting with alternatives is now possible.
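Those three operations are the whole contract. A tiny generic exerciser (hypothetical, not AtomSpace code) can sanity-check any candidate container that exposes the usual insert/find/erase members, which makes the experimenting mentioned above mechanical:

```cpp
#include <cassert>
#include <set>
#include <unordered_set>

// Generic smoke test for a candidate TypeIndex set replacement:
// exercises insertion, find, and deletion through the standard
// container member functions.
template <typename SetT>
bool exercises_contract(SetT& s)
{
    s.insert(1);                              // insertion
    if (s.find(1) == s.end()) return false;   // find
    s.erase(1);                               // deletion
    return s.find(1) == s.end();
}
```

Both std::set and std::unordered_set pass; any concurrent candidate with the same member names can be dropped in and compared on the same benchmark.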


linas commented Nov 27, 2021

Confusion reigns: I recompiled both Atom.h and TypeIndex.h with the locks completely disabled, then ran the opencog/benchmark bio-loop3.scm benchmark, and saw only minor (almost unnoticeable) parallelism improvements. So whatever is causing the lock contention, it's not these two.

Double-checked these results: using folly::ConcurrentHashMap in TypeIndex (and removing the locks) also has no effect on parallelism.


linas commented Nov 28, 2021

Some mutrace experimental results.

For the locked version: the lock in TypeIndex::findAtom() is hit 19 million times, and this accounts for almost all of the unserializable time in the parallel bio-loop3.scm benchmark. Most of the calls come from PatternMatchEngine.cc:1337, this snippet of code:

         // Yuck. What we really want to do here is to find out
         // if `Link(t, oset)` is in the incoming set of `hg`. But
         // there isn't any direct way of doing this (at this time).
         Handle hup(_pmc.get_link(hg, t, std::move(oset)));

and _pmc.get_link() just calls as->get_link(), which calls TypeIndex::findAtom().

The unlocked versions show no lock contention. Conclusion: it is not lock contention that is slowing us down.

Latest theory: the smart pointers and other assorted atomics in the code are used so frequently that they cause cache-line contention, just as the locks do. (CPUs implement atomics inside of cache lines.) Bingo! That's exactly it! Things parallelize nicely on modern CPUs once this is accounted for. See issue #2758 for a general discussion.
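The diagnosis can be illustrated with a toy that reproduces the access pattern (a sketch, not a benchmark): every std::shared_ptr copy atomically increments its refcount, and when many threads hammer one atomic, they all fight over the single cache line holding it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// All threads increment one shared atomic counter -- the same access
// pattern as copying a hot shared_ptr from many threads. The result is
// always correct, but every increment bounces one cache line between
// cores, serializing the threads. Timing this against per-thread
// counters exposes the contention.
long shared_count(int nthreads, int iters)
{
    std::atomic<long> counter{0};   // one cache line, all threads
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; t++)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; i++)
                counter++;          // atomic RMW: the contended op
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```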
