-
Hi @dgrigonis! Great question. There is a publication about Scipp, but it is from the early days. The architecture is still nearly identical, but it does not go into many details. The main features in your figure are explained by an interplay of two aspects:
It should also be noted that if you search for NumPy and multi-threading, you will find that NumPy can be multi-threaded if you install the right extra packages... however, not for normal array operations, only for some linear algebra. This multi-threading may therefore be irrelevant for many applications.

We have recently also added more information to our docs in Improving performance: Allocators and HugePages. HugePages will mainly affect large arrays. I do not know whether the allocator makes a difference for small arrays; if you try it out, let me know the results, I'd be interested!

Finally, partially unrelated to your results above, remember that memory allocation/initialization is often the most expensive part; that is, make sure to also compare, e.g., I'd be happy to provide more info. If you run more benchmarks and find some unusual behavior, we can also look into particular reasons; there may be performance bugs that still need fixing (an interesting and quite recent example was datetime64 array initialization in #3123).

Cheers,
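One way to see the allocation cost mentioned above is to time the same NumPy operation with and without a preallocated output buffer. This is a minimal sketch (not from the Scipp docs; the array size and repeat count are arbitrary choices for illustration):

```python
import timeit
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
out = np.empty_like(a)

# Out-of-place: a fresh result array is allocated on every call.
t_alloc = timeit.timeit(lambda: a + b, number=200)

# In-place: the preallocated buffer is reused, so no per-call allocation.
t_reuse = timeit.timeit(lambda: np.add(a, b, out=out), number=200)

print(f"with allocation: {t_alloc:.4f} s, preallocated: {t_reuse:.4f} s")
```

The gap between the two timings is roughly the allocation/initialization overhead, which is why benchmarks that only measure the out-of-place form can be misleading.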
-
Thank you for the reply. I will investigate further in due time. For now, what I can also share is a summation benchmark. Currently, summation is the bottleneck in one of my applications. NumPy does it in parallel, but seemingly only with MKL (this benchmark is with OpenBLAS, which doesn't seem to provide a parallel sum). The blue line is a function I wrote, which at a certain size switches to a dot product, which is parallelised well. I suspect the initialisation cost of the parallel functions prevents them from outperforming at smaller array sizes. The application at hand is not big data; I am looking at summation of arrays of around size 10K, and I couldn't find anything that outperforms NumPy, except implementations that use MKL. PyTorch and NumPy with MKL did best, if I remember correctly. So I am wondering if summation is parallelised in
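For reference, the switch-to-dot-product trick described above can be sketched as follows. This is my own minimal reconstruction, not the actual benchmarked function; the `threshold` value is hypothetical, since the real crossover point depends on the BLAS build and hardware:

```python
import numpy as np

def hybrid_sum(a, threshold=50_000):
    """Sum a 1-D float array, switching to a BLAS dot product
    (which multi-threaded BLAS builds can parallelise) above `threshold`."""
    if a.size < threshold:
        return a.sum()
    # a @ ones is mathematically a sum, but is dispatched to BLAS,
    # so it can run multi-threaded where np.sum does not.
    return a @ np.ones(a.size, dtype=a.dtype)

x = np.random.rand(100_000)
print(np.isclose(hybrid_sum(x), x.sum()))  # prints True (results agree closely)
```

Note that the dot-product path pays for allocating the ones vector and may accumulate floating-point error differently from NumPy's pairwise summation, which is part of why it only wins for larger arrays.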
-
Hello all,
I have a question or a request for references.
I would like a low- to mid-level understanding of the library. The aspect I am most interested in is `scipp`'s relation to NumPy and its performance. `scipp.array` seems almost as barebones as NumPy, so why is its performance so much worse for arrays of size less than 50K?

If anyone could shed some light on this, it would be much appreciated.