
[WIP] Adds boost root finders with reverse mode specializations #2720

Open
wants to merge 25 commits into develop

Conversation

SteveBronder (Collaborator)

Summary

Adds a wrapper for Boost's Halley root-finding algorithm along with a reverse mode specialization similar to our algebra solver. Note that these functions are only meant to be used at the Stan Math level, specifically for the quantile functions in stan-dev/design-docs#42.

This also works for forward mode, but there it just iterates through the Boost functions, so it is probably very slow.

Tests

Tests for forward and reverse mode use the cube root and fifth root examples from Boost's docs. Tests can be run with

./runTests.py -j4 ./test/unit/math/ -f root_finder

Side Effects

Note that I had to write specializations of frexp and sign for fvar and var types because Boost would throw compiler errors if they did not exist.

The WIP here is because of some questions for @bgoodri:

  1. Would we ever want to use the Newton-Raphson or Schröder root finders? I wrote this with those in mind, and it would be very simple to support them.
  2. Right now the API takes a tuple of functors to calculate the value and the derivatives. As you can see in the fifth root example, this is somewhat inefficient, since we are just computing powers of x and could share that computation between the calls (see the sketch after this list). For the reverse mode specialization we do need a functor that calculates just the function we want to solve, but we could also allow the user to pass two functors: one that returns a tuple with all the values the root finder needs, and another that gives back only the value of the function we want to solve.
  3. Do we need functions that work over vectors elementwise, or should we add that later?
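As a minimal illustration of the point in (2), this is hypothetical code, not the PR's test code: with a tuple of separate functors each call recomputes the powers of x, whereas a single functor returning a tuple can compute them once and share them.

#include <tuple>

// Hypothetical illustration of question (2) for f(x) = x^5 - a.
// With a tuple of separate functors, each one recomputes powers of x:
double a = 32.0;
auto f_tuple = std::make_tuple(
    [a](auto x) { return x * x * x * x * x - a; },  // f(x)   = x^5 - a
    [](auto x) { return 5 * x * x * x * x; },       // f'(x)  = 5 x^4
    [](auto x) { return 20 * x * x * x; });         // f''(x) = 20 x^3

// With a single functor returning a tuple, the intermediate powers are shared:
auto f_all = [a](auto x) {
  auto x2 = x * x;
  auto x3 = x2 * x;
  auto x4 = x2 * x2;
  return std::make_tuple(x4 * x - a, 5 * x4, 20 * x3);
};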

Release notes

Adds Boost root finders with reverse mode specializations.

Checklist

  • Math issue #

  • Copyright holder: Steve Bronder

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@bgoodri (Contributor) commented Apr 29, 2022

(1) I imagine it will end up like the ODE solvers where we have slightly different versions with a mostly common set of arguments. But Halley is probably the one that would get used the most inside Stan Math.

@bgoodri (Contributor) commented Apr 29, 2022

(2) It seems like it would be good to permit (or require) one functor that returns a tuple of: level, first derivative, second derivative.

@bgoodri (Contributor) commented Apr 29, 2022

(3) If a root needs to be found for multiple parameter values, I would say that the caller is responsible for invoking root_finder multiple times.

@bgoodri (Contributor) commented Apr 29, 2022

I think we should default the digits argument according to the recommendation in the Boost docs.

The value of digits is crucial to good performance of these functions, if it is set too high then at best you will get one extra (unnecessary) iteration, and at worst the last few steps will proceed by bisection. Remember that the returned value can never be more accurate than f(x) can be evaluated, and that if f(x) suffers from cancellation errors as it tends to zero then the computed steps will be effectively random. The value of digits should be set so that iteration terminates before this point: remember that for second and third order methods the number of correct digits in the result is increasing quite substantially with each iteration, digits should be set by experiment so that the final iteration just takes the next value into the zone where f(x) becomes inaccurate. A good starting point for digits would be 0.6D for Newton and 0.4D for Halley or Schröder iteration, where D is std::numeric_limits<T>::digits.
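For instance, a default along those lines might look something like this (a sketch assuming double precision; not necessarily what the PR will use):

#include <limits>

// Boost's suggested starting point for Halley / Schröder iteration is 0.4 * D,
// where D = std::numeric_limits<double>::digits (53 significand bits), i.e. ~21.
const int halley_digits
    = static_cast<int>(0.4 * std::numeric_limits<double>::digits);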

@bgoodri (Contributor) commented Apr 29, 2022

You could also easily test a * x^2 + b * x + c = 0 where a, b, and/or c are var, fvar<var>, etc., to make sure our implicit dx/da, dx/db, and dx/dc match those obtained by differentiating x = (-b + sqrt(b^2 - 4 * a * c)) / (2 * a) explicitly.
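A sketch of the closed-form side of that check (hypothetical names; the other side would call the new root-finder wrapper with the same coefficients and the gradients would be compared, e.g. via stan::test::expect_ad):

// Closed-form root of a * x^2 + b * x + c = 0 (taking the "+" branch);
// autodiffing this gives the reference dx/da, dx/db, dx/dc to compare against.
auto quadratic_root = [](auto a, auto b, auto c) {
  using std::sqrt;  // ADL picks up stan::math::sqrt for var / fvar arguments
  return (-b + sqrt(b * b - 4 * a * c)) / (2 * a);
};
// e.g. stan::test::expect_ad(quadratic_root, 1.0, -3.0, 2.0);  // root x = 2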

@bgoodri (Contributor) commented Apr 29, 2022

From your email: guess, min, and max should definitely be templated, but we just need to pull out the underlying numbers, since we wouldn't need to differentiate with respect to them.
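A small sketch of what that could look like (pass_bounds is a hypothetical helper name, not part of the PR):

#include <stan/math.hpp>

// Sketch: guess / min / max can be templated, but only their double values are
// needed, since we never differentiate with respect to them.
template <typename GuessScalar, typename MinScalar, typename MaxScalar>
void pass_bounds(GuessScalar&& guess, MinScalar&& min, MaxScalar&& max) {
  double guess_d = stan::math::value_of_rec(guess);  // strips var / fvar to double
  double min_d = stan::math::value_of_rec(min);
  double max_d = stan::math::value_of_rec(max);
  // ... hand guess_d, min_d, max_d on to the Boost iteration ...
}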

@bgoodri (Contributor) commented Apr 29, 2022

Also, @charlesm93 might have some thoughts. As I mentioned yesterday, the motivation for having a one-dimensional specialization of algebra_solver in Stan Math (we are not worried about exposing it to the Stan language yet) is

  1. Everything is in terms of scalars, so there is no Eigen overhead from making vectors of size 1, 1x1 Jacobians, etc.
  2. We can use bounds to ensure the correct root is found.
  3. We can utilize second derivatives to obtain a tripling of the number of correct digits when close to the root.

I guess the main question is: what is the fastest way of implicitly differentiating the root with respect to unknown parameters? A concrete example would be implementing the inverse CDF of the Beta distribution in C++ with possibly unknown shape parameters. Given a cumulative probability p, we would need to find the value of x between 0 and 1 such that beta_cdf(x, alpha, beta) - p = 0. We know the first derivative wrt x is the beta PDF and the second derivative is the first derivative of the beta PDF, and we can presume within Stan Math that the caller has specified those two derivatives explicitly. Once x is found, its sensitivity to p is the reciprocal of the beta PDF, but the sensitivity of x to alpha and beta requires implicit differentiation.
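Spelled out, the implicit differentiation being described works like this: with g(x, alpha, beta, p) = beta_cdf(x, alpha, beta) - p and a root x* satisfying g(x*, alpha, beta, p) = 0, the implicit function theorem gives

$$\frac{dx^*}{d\theta}=-\left.\frac{\partial g/\partial\theta}{\partial g/\partial x}\right|_{x=x^*},\qquad \theta\in\{p,\alpha,\beta\}$$

so dx*/dp is the reciprocal of the beta PDF (since dg/dp = -1), while dx*/dalpha and dx*/dbeta are minus the partials of the CDF with respect to the shape parameters divided by that same PDF.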

@bgoodri (Contributor) commented Apr 30, 2022

We should probably also do the beta inverse CDF as a test case. Boost has a more advanced way of obtaining an initial guess in this case, but they still usually do Halley iteration

var p, alpha, beta;  // give these values
using namespace boost::math;
beta_distribution<var> my_beta(alpha, beta);
auto x = quantile(my_beta, p);
// check our version against this

@SteveBronder (Collaborator, Author)

So I got the tests for beta up and running. For the beta lcdf it seems to work fine for values where lgamma(alpha + beta) > 1. Maybe I wrote the derivative of the pdf wrt x wrong? Or there could be a numerics trick I'm missing.

For the test against Boost's quantile function, var does not work with the Boost distributions because of some functions they use that we do not have defined for var. But using finite differences I get answers that are within 4e-4. Perhaps I'm not doing something quite right there as well? Or it could just be finite diff's own error. There are also some places where finite diff gives -nan values for alpha and beta but autodiff returns the correct values.

# from running 
# ./runTests.py -j2 test/unit/math/rev/functor/root_finder_test.cpp
--- Finit Diff----
fx: 0.288633
grads: 
p: 0.6987
alpha: 0.245729
beta: -0.121384
--- Auto Diff----
fx: 0.288714
grads: 
p: 0.700012
alpha: 0.248911
beta: -0.12257

grad diffs: 
p: -0.00131203
alpha: -0.00318197
beta: 0.00118579

@SteveBronder (Collaborator, Author)

Also, do we want to test the lcdf or the cdf functions?

@bgoodri (Contributor) commented May 10, 2022

There are analytical derivatives for the beta quantile case at

https://functions.wolfram.com/GammaBetaErf/InverseBetaRegularized/20/01/

charlesm93 self-assigned this May 11, 2022
@SteveBronder (Collaborator, Author)

@bgoodri so I got the rev test up and running with the explicit gradients for the beta cdf. I also wrote a little function that checks the gradients against our function to make sure they are reasonably close to each other. The test is failing for some values where I wouldn't expect it to, though. The failing cases all print their values after a failure, so you can take a look if you like (note that after one failure it will keep printing all the sets of values it is testing; this is just because I have not figured out how to stop that from happening in gtest). I wonder if it could be some numerical inaccuracy in our beta_cdf function?

@charlesm93 (Contributor)

(2): If the function is going to be used internally, we should try to optimize as much as possible. I see two options:

  • having one functor argument which returns the value and one functor argument which returns a tuple with the value and the derivatives.
  • passing a single functor which admits an argument indicating the order of the derivatives we need to compute. Internally, only do the calculations for the specified order, and return a tuple of value and derivatives (with the latter potentially just containing nans).

If this is exposed to the user, we may want to trade some efficiency for a simpler API.
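For concreteness, a hypothetical shape for the second option above (illustrative only, not a proposed API): a single functor that takes the derivative order as a compile-time tag and computes only what was asked for. The derivative expressions just restate what was discussed earlier in the thread (g' is the beta PDF, g'' is its derivative in x).

#include <stan/math.hpp>
#include <tuple>
#include <type_traits>

// Hypothetical single functor for the beta-CDF root problem.
auto beta_root_fn = [](auto order, auto x, auto alpha, auto beta, auto p) {
  using std::exp;  // ADL picks up stan::math::exp for autodiff types
  auto val = stan::math::beta_cdf(x, alpha, beta) - p;  // g(x) = F(x) - p
  if constexpr (decltype(order)::value == 0) {
    return std::make_tuple(val);
  } else {
    auto d1 = exp(stan::math::beta_lpdf(x, alpha, beta));        // g'(x): beta PDF
    auto d2 = d1 * ((alpha - 1) / x - (beta - 1) / (1 - x));      // g''(x): d/dx of the PDF
    return std::make_tuple(val, d1, d2);
  }
};
// e.g. beta_root_fn(std::integral_constant<int, 2>{}, 0.3, 0.5, 0.4, 0.45)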

@charlesm93 (Contributor)

Why is there a specialized rev method? The implicit function theorem is valid for both forward and reverse mode autodiff. Since the solution is a scalar, the full Jacobian of derivatives reduces to a vector, so we don't gain much from doing contractions with either an initial tangent or a cotangent.

I think it would make sense to explicitly compute the full Jacobian and then do the vector-vector product for either forward or reverse mode.
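In symbols (just restating the point): once the gradient of the scalar root x* with respect to the parameters theta is available, both modes reduce to a dot product,

$$\dot{x}^* = \sum_{i} \frac{\partial x^*}{\partial \theta_i}\,\dot{\theta}_i \quad\text{(forward)},\qquad \bar{\theta}_i \mathrel{+}= \bar{x}^*\,\frac{\partial x^*}{\partial \theta_i} \quad\text{(reverse)}.$$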

@charlesm93 (Contributor) left a review

Overall this looks good to me but it still requires some additional work. Some of the C++ looks arcane to me, so I've asked for clarifications and I'm relying on the quality of the unit tests.

Here are the main requests:

  • The unit tests need to check that the correct root is returned. You can work out the solution analytically; plug the root into the function being evaluated and check that it is close to 0 within the specified tolerance; or use algebra_solver as a benchmark.
  • When setting tuning parameters for the root finder, leave a comment to justify those choices. Are these arbitrary tuning parameters or are they required to make the root finder work?
  • We need additional tests that trigger the error messages.
  • Address comments on the file. A handful of them are about doxygen doc.

Some suggestions:

  • Consider using the implicit function theorem for both forward and reverse mode autodiff. I don't think using a direct method for forward diff is justified.
  • As mentioned in the discussion, consider passing only one functor rather than a tuple of functors.

namespace math {
namespace internal {
template <typename Tuple, typename... Args>
inline auto func_with_derivs(Tuple&& f_tuple, Args&&... args) {
Contributor:

This needs some doxygen doc. It's not clear what this function does.

* initial lower bracket
* @param max The maximum possible value for the result, this is used as an
* initial upper bracket
* @param digits The desired number of binary digits precision
Contributor:

Indicate that digits cannot exceed the precision of f_tuple.

return boost::math::tools::halley_iterate(args...);
},
std::forward<FTuple>(f_tuple), guess, min, max, digits, max_iter,
std::forward<Types>(args)...);
Contributor:

This is my lacking C++ skills speaking, but could you describe how the call to root_finder_tol works here? How does the [](auto&&... args) argument work?

Collaborator (Author):

  return root_finder_tol(
      [](auto&&... args) {
        return boost::math::tools::halley_iterate(args...);
      },
      std::forward<FTuple>(f_tuple), guess, min, max, digits, max_iter,
      std::forward<Types>(args)...);

The lambda in this call tells root_finder_tol which root finder it is using. So in this case we pass a lambda expecting a parameter pack that calls the Halley root finder. If we used a struct here, it would look like:

struct halley_iterator {
  template <typename... Types>
  auto operator()(Types&&... args) const {
    return boost::math::tools::halley_iterate(std::forward<Types>(args)...);
  }
};

auto root_finder(SolverFun&& f_solver, FTuple&& f_tuple,
const GuessScalar guess, const MinScalar min,
const MaxScalar max, Types&&... args) {
constexpr int digits = 16;
Contributor:

How was this default chosen? Maybe add one sentence about this choice in the doxygen doc.

Collaborator (Author):

I need to update this to be in line with what Boost's docs say (the passage about choosing digits quoted earlier in the thread):

https://www.boost.org/doc/libs/1_62_0/libs/math/doc/html/math_toolkit/roots/roots_deriv.html

[&x_var, &f](auto&&... args) { return f(x_var, std::move(args)...); },
std::move(args_vals_tuple));
fx_var.grad();
Jf_x = x_var.adj();
Contributor:

This is a scalar. You might want to call it df_dx, since J usually suggests a Jacobian matrix (but this is a minor detail). Does nested_rev_autodiff mean you're using reverse mode? For computing a scalar, I recommend using forward mode.

Collaborator (Author):

Does nested_rev_autodiff mean you're using reverse mode?

Yes

For computing a scalar, I recommend using forward mode.

What is the intuition for forward mode being faster in this case? I thought if there are many inputs and a single output, then reverse mode is best?

return beta_lpdf_deriv(x, alpha, beta, p);
};
double guess = .3;
double min = 0.0000000001;
Contributor:

Would min = 0 not work?

Contributor:

If not, this should be indicated in a comment.

Collaborator (Author):

Yes, I can update this.

double p = 0.45;
double alpha = .5;
double beta = .4;
stan::test::expect_ad(full_f, alpha, beta, p);
Contributor:

Reminder for me: this checks both reverse and forward mode AD, and higher-order AD as well, correct?

Contributor:

Does expect_ad allow a tolerance parameter -- as a way to keep track of how closely finite-diff and autodiff agree?

Contributor:

In addition to testing derivatives, are you also checking that the root finder returns the correct value?

Collaborator (Author):

Reminder for me: this checks both reverse and forward mode AD, and higher-order AD as well, correct?

Yep!

Does expect_ad allow a tolerance parameter -- as a way to keep track of how closely finite-diff and autodiff agree?

Yes, though I'm not sure yet which level of tolerance we can / should expect

In addition to testing derivatives, are you also checking that the root finder returns the correct value?

It just checks finite diff vs. autodiff of the function for correct gradients; if the primitive and reverse mode specializations gave back different answers, it would throw an error. But I do need a test against something like our current algebra solver to make sure things are as we expect.

double alpha = .4;
double beta = .5;
stan::test::expect_ad(full_f, alpha, beta, p);
}
Contributor:

Same comment about this unit test that I made for the previous test.

<< finit_grad_fx(1)
<< "\n"
"beta: "
<< finit_grad_fx(2) << "\n";
Contributor:

I understand this is here for the discussion in the PR, but in time, we'll remove the std::cout.

double known_p_grad = deriv_p(p, a, b);
double known_alpha_grad = deriv_a(p, a, b);
double known_beta_grad = deriv_b(p, a, b);
std::cout << "--- Mathematica Calculate Grad----\n";
Contributor:

If using other software as a benchmark, it would be good to have the code or a link to the code (or, in the case of Mathematica, the input that you used).

Collaborator (Author):

Yes, definitely agree it would be good to copy/paste those in here.

@bgoodri (Contributor) commented May 18, 2022

Thanks @charlesm93

@SteveBronder (Collaborator, Author)

Thanks @charlesm93! I'm going to rework the API right now. I'd like to make it a bit more efficient for reverse mode, which would mean supplying two separate functors: one that calculates the function and its derivatives all at the same time for efficiency, and another that calculates just the function.

@SteveBronder (Collaborator, Author)

@bgoodri alright, so I got the reverse mode adjoints to be correct to within 1e-9. It turns out that we need more precision when doing the forward pass during reverse mode, which I found kind of weird and interesting. For the beta cdf example it seems like we need about 23 (???) digits of precision, which is very odd to me. We only work in doubles, so I'm not sure why anything over 16 would matter. But tests now pass for alpha, beta, and p from .1 to .9.

@bob-carpenter (Contributor) commented Jun 9, 2022

It turns out that we need more precision when doing the forward pass during reverse mode, which I found kind of weird and interesting

The forward-mode values are often used as part of the derivative calculations, so I suspect that may be the problem.

For the beta cdf example it seems like we need about 23 (???) digits of precision, which is very odd to me

As you point out, we only have about 16 digits of relative precision in double precision, so we can't return results with more precision than that. CPUs often do higher-precision internal arithmetic and then truncate the result, but there's simply no way to get 23 digits of relative accuracy in a floating-point result at double precision. We can get to within 1e-23 of an answer in absolute terms, though.

@dmi3kno commented Jan 4, 2023

May I suggest that we adopt existing (closed-form) QF/CDF pairs for tests? Here are suggested QFs and CDFs (where missing in Stan).

  • Unbounded:
    • logistic distribution
      $$Q(u)=\mu+s[\ln(u)-\ln(1-u)]$$
  • Semi-bounded:
    • Weibull
      $$Q(u)=\lambda[-\ln(1-u)]^{1/k}$$
    • Gumbel distribution
      $$Q(u)=\mu-\beta\ln(-\ln(u))$$
    • Exponential (simplest QF, but perhaps most difficult to invert)
      $$Q(u)=-(1/\lambda)\ln(1-u)$$
  • Bounded:
    • Kumaraswamy distribution
      $$Q(u)=(1-(1-u)^{1/b})^{1/a}$$
      $$F(x)=1-(1-x^a)^b$$

All of these distributions have closed-form CDFs and QFs, so we always have an analytical solution to check against.
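For example, a self-contained sanity check for the bounded case might look like this (a sketch using only the closed-form expressions above; the root-finder version would replace the closed-form quantile):

#include <cassert>
#include <cmath>

int main() {
  // Kumaraswamy with arbitrary shapes a, b: F(x) = 1 - (1 - x^a)^b and
  // Q(u) = (1 - (1 - u)^(1/b))^(1/a); check that F(Q(u)) recovers u.
  double a = 2.0, b = 3.0, u = 0.37;
  double q = std::pow(1.0 - std::pow(1.0 - u, 1.0 / b), 1.0 / a);
  double cdf_at_q = 1.0 - std::pow(1.0 - std::pow(q, a), b);
  assert(std::abs(cdf_at_q - u) < 1e-12);
  return 0;
}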
