suggestions #1
I wonder whether this is really necessary: using this library does not make sense with such small datasets. See the next point for more explanation.
The main idea of distogram was to keep its implementation very simple, and eventually use it as a foundation for other higher-level libraries. For example, the equivalent of the pandas describe function can be implemented on top of the current functions. There is an example here:
Performance-wise, I consider PyPy the only realistic option today for a real application. I adapted the distogram bench to streamhist to compare the update function: on CPython distogram is 25% faster, and on PyPy distogram is 13 times faster.
Thank you for responding.
Some applications are general, and the dataset size is not known in advance. For example, I use streaming histograms in performance timers. As a general utility, it is not known whether the code under test will be invoked 10 times or 1 million times during a session. It is an advantage to use the available bin capacity and report exact quantiles when possible, yet have the implementation transition gracefully to an approximate histogram as the data points grow. This only requires minor tweaks to the Ben-Haim algorithm.
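The idea above can be sketched roughly as follows. This is a minimal illustration (not distogram's or streamhist's actual API): each distinct value gets its own bin, so quantiles are exact while under capacity, and once capacity is exceeded the two closest bins are merged as in the Ben-Haim & Tom-Tov algorithm.

```python
# Hypothetical sketch of an exact-then-approximate streaming histogram.
# Names (StreamHist, update) are illustrative, not any library's API.
import bisect

class StreamHist:
    def __init__(self, max_bins=64):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [center, count]

    def update(self, value):
        # Exact phase: every distinct value keeps its own bin.
        i = bisect.bisect_left([c for c, _ in self.bins], value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1
        else:
            self.bins.insert(i, [value, 1])
        # Approximate phase: over capacity, merge the two closest bins
        # (weighted-average center, summed count), per Ben-Haim & Tom-Tov.
        if len(self.bins) > self.max_bins:
            gaps = [self.bins[j + 1][0] - self.bins[j][0]
                    for j in range(len(self.bins) - 1)]
            j = gaps.index(min(gaps))
            (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
            self.bins[j:j + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2),
                                   n1 + n2]]
```

While the bin count stays at or below `max_bins`, the stored bins are the exact empirical distribution, so any quantile derived from them can match a reference implementation like numpy.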
I'm not sure what this means. PyPy has limitations (it is not a 1:1 replacement for CPython), and there are legacy applications which cannot transition to PyPy easily, or which are heavily dependent on numpy. For such applications, a numpy + numba implementation is useful. Having such an implementation does not preclude a pure-Python implementation that supports PyPy; they can exist alongside each other.
A numba-only implementation will not perform well, because numba's fast (nopython) mode does not support plain Python arrays. The only way to get performance on this algorithm with numba is via numpy arrays.
I measured my pure-Python implementation (no numba or numpy) against distogram. It is 20% faster (and less code, though I didn't compare closely). The implementation uses a "maintain a cost-function array" approach just as distogram does, so distogram appears to have some room for improvement. My numba + numpy implementation is 20x faster than streamhist (with 64 bins).
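For context, a merge step in the numpy-array style that numba's nopython mode can compile might look like the sketch below. This is an illustration, not either library's code; the function name and the `@njit` decoration are assumptions, and numba is kept optional.

```python
# Hypothetical sketch: bins stored in preallocated NumPy arrays so that
# numba's nopython mode could compile the hot loop. numba is optional.
import numpy as np

# from numba import njit  # optional: uncomment to JIT-compile
# @njit(cache=True)
def merge_closest(centers, counts, n):
    """Merge the two closest of the first n bins in place; return new n."""
    # Scan the "cost function": gaps between adjacent bin centers.
    best, j = centers[1] - centers[0], 0
    for k in range(1, n - 1):
        gap = centers[k + 1] - centers[k]
        if gap < best:
            best, j = gap, k
    # Weighted-average merge of bins j and j+1.
    total = counts[j] + counts[j + 1]
    centers[j] = (centers[j] * counts[j]
                  + centers[j + 1] * counts[j + 1]) / total
    counts[j] = total
    # Shift the tail left to close the hole (explicit loop is numba-safe).
    for k in range(j + 1, n - 1):
        centers[k] = centers[k + 1]
        counts[k] = counts[k + 1]
    return n - 1
```

Keeping the gap (cost) array itself incrementally updated, instead of rescanning, is the further optimization the "maintain a cost-function array" approach refers to.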
OK, this makes sense. Since this is a minor evolution we can add it. I created issue #2 for a dedicated numba/numpy discussion.
Hello, I noticed the link to this project from carsonfarmer/streamhist. I've been tinkering with optimization and correctness of this algorithm for about a year, and contributed some correctness changes to streamhist. Most of the performance work I did hasn't been published yet.
Suggestions after browsing this project:
- consider a few small changes to the algorithm to make the implementation "exact" while the histogram is below capacity; ideally it should match the output of a well-tested library like numpy in this mode. I contributed such support to streamhist: carsonfarmer/streamhist#11
- support reporting multiple quantile points efficiently. This is a common use case; the numpy and streamhist APIs support it, and streamhist is an example of computing the cumulative sums once in advance and reusing them for each quantile.
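The batched-quantile idea in the second suggestion can be sketched like this. The function name and midpoint interpolation convention are illustrative assumptions, not distogram's or streamhist's API; the point is that the cumulative sums are computed once and reused for every requested quantile.

```python
# Hypothetical sketch: answer many quantile queries from one pass over
# the (center, count) bins of a streaming histogram.
import numpy as np

def quantiles(centers, counts, qs):
    """Approximate several quantiles from sorted (center, count) bins."""
    cum = np.cumsum(counts)            # one pass over the bins
    total = cum[-1]
    # Fraction of the data at or below each bin center
    # (midpoint convention: half of a bin's count sits below its center).
    frac = (cum - counts / 2.0) / total
    return np.interp(qs, frac, centers)
```

Computing `cum` and `frac` once makes each additional quantile an O(log n) lookup (inside `np.interp`) instead of a fresh O(n) scan per query.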
Beyond that, one of the things I've developed that is missing here is a fast implementation based on numba + numpy, for projects which cannot use PyPy. The implementation was about 10-15x faster than streamhist the last time I measured. The dependencies could be optional. Would an enhancement like that fit into distogram?