
Call overhead compared to C bindings #126

Open
dannywillems opened this issue Apr 26, 2023 · 8 comments

Comments

@dannywillems

I have been doing some experimentation to see the actual cost of using ocaml-rs.
When writing C bindings, I am used to seeing a 4-6ns overhead due to root registration, independent of the number of parameters.
With ocaml-rs, it seems different. See https://github.com/dannywillems/ocaml-rust-experimentation.

Estimated testing time 16s (8 benchmarks x 2s). Change using '-quota'.
┌────────────────────────────────┬──────────┬────────────┐
│ Name                           │ Time/Run │ Percentage │
├────────────────────────────────┼──────────┼────────────┤
│ Bench C unit                   │   4.69ns │     11.17% │
│ Bench c add integers           │   4.37ns │     10.40% │
│ Bench C two integers unit      │   4.39ns │     10.46% │
│ Bench C three integers unit    │   4.71ns │     11.22% │
│ Bench Rust unit                │  13.78ns │     32.81% │
│ Bench Rust add integers        │  34.36ns │     81.81% │
│ Bench Rust two integers unit   │  32.29ns │     76.89% │
│ Bench Rust three integers unit │  41.99ns │    100.00% │
└────────────────────────────────┴──────────┴────────────┘

It seems the number of arguments plays a role in the overhead. Could you explain why?

@tizoc (Contributor) commented Apr 26, 2023

Wild guess, but the overhead (or part of it) could be coming from the automatic conversion of the raw arguments into ocaml::Value, give me a few minutes to try something.

@zshipko (Owner) commented Apr 26, 2023

Interesting! I don't have an immediate answer, but I am not surprised that ocaml-rs has additional overhead. I would guess it's probably related to the fact that ocaml::Value is rooted using the boxroot library, and other values are converted to a Rust type using ocaml::FromValue. These aren't zero-cost abstractions.

Edit: Didn't see @tizoc's response, but that seems most likely. Curious to see what you're testing!
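For integer arguments specifically, the tagging arithmetic itself is trivial; per the explanation above, the measured cost is likely to come from rooting and the conversion machinery around it rather than the arithmetic. A minimal plain-Rust sketch of OCaml's immediate-integer encoding (the helper names val_int/int_val mirror OCaml's C macros Val_int/Int_val and are illustrative, not the ocaml-rs API):

```rust
// OCaml immediate integers are tagged: raw = (n << 1) | 1,
// so the low bit distinguishes an int from a heap pointer.
fn val_int(n: isize) -> isize {
    (n << 1) | 1
}

fn int_val(raw: isize) -> isize {
    raw >> 1 // arithmetic shift, so negative values round-trip too
}

fn main() {
    let raw = val_int(42);
    assert_eq!(raw, 85);
    assert_eq!(int_val(raw), 42);
    // The unit value () is the immediate integer 0, i.e. raw word 1.
    assert_eq!(val_int(0), 1);
    println!("ok");
}
```

Since untagging is a single shift, the per-argument cost seen in the benchmark must come from the surrounding machinery (rooting each argument as an ocaml::Value), not from the integer decoding itself.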

@tizoc (Contributor) commented Apr 26, 2023

Here are the results that confirm my guess:

Estimated testing time 18s (9 benchmarks x 2s). Change using '-quota'.
┌────────────────────────────────────────────────┬──────────┬────────────┐
│ Name                                           │ Time/Run │ Percentage │
├────────────────────────────────────────────────┼──────────┼────────────┤
│ Bench C unit                                   │   2.69ns │      9.04% │
│ Bench c add integers                           │   3.32ns │     11.15% │
│ Bench C two integers unit                      │   3.13ns │     10.52% │
│ Bench C three integers unit                    │   3.13ns │     10.52% │
│ Bench Rust unit                                │   9.45ns │     31.71% │
│ Bench Rust add integers                        │  24.35ns │     81.76% │
│ Bench Rust two integers unit                   │  23.49ns │     78.85% │
│ Bench Rust three integers unit                 │  29.79ns │    100.00% │
│ Bench Rust three integers unit (no conversion) │   4.36ns │     14.63% │
└────────────────────────────────────────────────┴──────────┴────────────┘

Here is the version without conversions:

#[no_mangle]
pub extern "C" fn caml_rust_three_int_unit_no_conversion_stubs(
    x: ocaml::Raw,
    y: ocaml::Raw,
    z: ocaml::Raw,
) -> ocaml::Raw {
    #[inline(always)]
    fn inner(gc: &mut ocaml::Runtime, _x: isize, _y: isize, _z: isize) -> isize {
        let _ = &gc;
        ocaml::sys::UNIT
    }
    let gc = unsafe { ::ocaml::Runtime::recover_handle() };
    #[cfg(not(feature = "no-std"))]
    ::ocaml::initial_setup();
    let res = inner(gc, x.0, y.0, z.0);
    #[allow(unused_unsafe)]
    let mut _gc_ = unsafe { ocaml::Runtime::recover_handle() };
    ocaml::Raw(res)
}

There is still a bit of overhead but it looks like most of it comes from the conversion.

@tizoc (Contributor) commented Apr 26, 2023

One more test: if I remove the call to ::ocaml::initial_setup(), the remaining overhead goes away:

Estimated testing time 18s (9 benchmarks x 2s). Change using '-quota'.
┌────────────────────────────────────────────────┬──────────┬────────────┐
│ Name                                           │ Time/Run │ Percentage │
├────────────────────────────────────────────────┼──────────┼────────────┤
│ Bench C unit                                   │   2.68ns │      9.01% │
│ Bench c add integers                           │   3.31ns │     11.13% │
│ Bench C two integers unit                      │   3.14ns │     10.54% │
│ Bench C three integers unit                    │   3.13ns │     10.53% │
│ Bench Rust unit                                │   9.44ns │     31.72% │
│ Bench Rust add integers                        │  24.36ns │     81.88% │
│ Bench Rust two integers unit                   │  23.28ns │     78.23% │
│ Bench Rust three integers unit                 │  29.75ns │    100.00% │
│ Bench Rust three integers unit (no conversion) │   3.09ns │     10.40% │
└────────────────────────────────────────────────┴──────────┴────────────┘

@zshipko for the conversion overhead, I guess there isn't much to do other than giving users the option to skip that conversion?

For the initialization part, the overhead is much smaller, but I guess it could be made configurable (so that the user has the option to perform it once when launching the program).
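That configurable initialization could plausibly be a one-time guard that the user triggers at startup, with repeat calls from stubs being nearly free. A plain-Rust sketch using std::sync::Once, where runtime_setup is a hypothetical placeholder for the real setup work (not the ocaml-rs API):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Once;

static INIT: Once = Once::new();
// Counter only exists to demonstrate that setup runs exactly once.
static SETUP_CALLS: AtomicU32 = AtomicU32::new(0);

// Hypothetical placeholder for the real one-time runtime setup.
fn runtime_setup() {
    SETUP_CALLS.fetch_add(1, Ordering::SeqCst);
}

// The user can call this once at program startup; calling it from every
// stub is also cheap, since Once reduces repeat calls to an atomic load.
pub fn ensure_initialized() {
    INIT.call_once(runtime_setup);
}

fn main() {
    for _ in 0..1000 {
        ensure_initialized();
    }
    // Setup ran exactly once despite 1000 calls.
    assert_eq!(SETUP_CALLS.load(Ordering::SeqCst), 1);
    println!("initialized once");
}
```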

@zshipko (Owner) commented Apr 26, 2023

Thanks for digging into that!

It seems like allowing pre-initialization should be an easy feature to add; I will take a look at that when I get a moment.

The conversion could likely be optimized in some places, but when performance is the most important concern, I would suggest using the raw types from ocaml-sys and performing conversion as needed.
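As a sketch of that suggestion, assuming immediate integer arguments: a stub can untag the raw words directly and never construct an ocaml::Value, so nothing is rooted and no conversion machinery runs. Here Raw is a local stand-in for ocaml::Raw so the snippet compiles on its own; real bindings would use ocaml::Raw / ocaml-sys and would also need the usual runtime-initialization and GC-safety care:

```rust
// Local stand-in for ocaml::Raw (a newtype around the OCaml `value` word).
#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct Raw(pub isize);

// Adds two OCaml immediate integers directly on the raw words.
// This is safe only because immediate ints live outside the OCaml heap,
// so no rooting is required while we hold them.
#[no_mangle]
pub extern "C" fn rust_add_ints_raw(x: Raw, y: Raw) -> Raw {
    let a = x.0 >> 1; // Int_val
    let b = y.0 >> 1;
    Raw(((a + b) << 1) | 1) // Val_int
}

fn main() {
    // Simulate OCaml passing Val_int(2) and Val_int(3).
    let r = rust_add_ints_raw(Raw((2 << 1) | 1), Raw((3 << 1) | 1));
    assert_eq!(r.0 >> 1, 5);
    println!("2 + 3 = {}", r.0 >> 1);
}
```

The trade-off is that the stub author becomes responsible for the encoding invariants that ocaml::Value normally upholds, which is why this is an opt-in for hot paths rather than a good default.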

@tizoc (Contributor) commented Apr 26, 2023

> when performance is the most important concern I would suggest using the raw types from ocaml-sys and performing conversion as needed.

Yes, accessing the OCaml values directly (without conversion) is exactly what I had to do in the past to keep overhead to a minimum (for example, accessing strings, arrays, etc. directly to avoid allocations), so my guess was informed by that. In ocaml-interop the work of performing the conversions is always left to the user, but that is not the best default, since for most functions the overhead is unlikely to matter.

@dannywillems (Author)

Thanks to both of you for taking a look and answering so quickly!
It is still a bit obscure to me at the moment. Could you explain the memory layout of each type (ocaml::Raw, ocaml::Value, etc.)? I will try to dig into boxroot later.

I would like to understand how ocaml-rs works under the hood. I am trying to understand why I get this benchmark and this benchmark when using custom blocks. It should be around 10ns-20ns for an addition. We already have the argument overhead we saw above, and there is certainly a better way to represent the value.

When writing C, I have been doing something like this, i.e. saving the C value directly in the custom block. No heap allocation, no indirection when accessing the value in C (which gains some ns).

The problem can be summarized as follows:
When we have a stack-only Rust value (like a fixed-size array or a structure without heap allocation), what is the best layout to use?
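One way to frame that question: for a fixed-size, heap-free value, the two candidate layouts are storing its bytes inline in the custom block's data area (one allocation, direct access) versus storing a pointer such as a Box in the block (a second allocation plus an extra dereference on every access). A self-contained plain-Rust sketch of the two layouts, with a byte buffer standing in for the data area that caml_alloc_custom would return (no OCaml runtime involved):

```rust
use std::ptr;

// A stack-only value with no internal heap allocation,
// e.g. a fixed-size field element.
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Fq([u64; 4]);

fn main() {
    let v = Fq([1, 2, 3, 4]);

    // Inline layout: the value's bytes live directly in the block's
    // data area (here a byte buffer standing in for the custom block).
    // One allocation total, no extra indirection on access.
    let mut block = vec![0u8; std::mem::size_of::<Fq>()];
    unsafe {
        ptr::write(block.as_mut_ptr() as *mut Fq, v);
        let read_back = ptr::read(block.as_ptr() as *const Fq);
        assert_eq!(read_back, v);
    }

    // Boxed layout: the block would store only a pointer, so every
    // access pays a second allocation plus an extra dereference.
    let boxed: Box<Fq> = Box::new(v);
    assert_eq!(*boxed, v);

    println!("inline ok");
}
```

This mirrors the C trick described above: since Fq is Copy and owns no heap data, copying it into the block is sound, and reading it back is a plain memory load rather than a pointer chase.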

@crackcomm (Contributor) commented Apr 27, 2023

> I would like to understand how ocaml-rs works under the hood.

I recommend using cargo-expand to expand the macros into code; you can go from there.

> It should be around 10ns-20ns for addition.

It shouldn't be more. Floats can be unboxed and passed by value, so possibly even less.
