Categorical distribution #148

gdalle · 2021-08-23T11:43:32Z

Partially solves #145. I did not add alternative (log-p) parametrizations yet.
I think this one is a typical example of the limits of our approach, which was to default on Distributions.jl for sampling and other things whenever possible. Indeed, their implementation of Categorical relies on DiscreteNonParametric, which is a nightmare in terms of efficiency (see here for the worst part).
In my opinion, this boils down to their library being "safe" whereas we over at MeasureTheory.jl choose to trust that the user will avoid nonsensical inputs.

codecov · 2021-08-23T11:46:49Z

Codecov Report

Merging #148 (2f7e03c) into master (419d2f2) will decrease coverage by 8.79%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #148      +/-   ##
==========================================
- Coverage   42.06%   33.27%   -8.80%     
==========================================
  Files          50       30      -20     
  Lines        1034      568     -466     
==========================================
- Hits          435      189     -246     
+ Misses        599      379     -220

Impacted Files	Coverage Δ
src/parameterized/categorical.jl	`0.00% <0.00%> (ø)`
src/distproxy.jl	`50.00% <0.00%> (-50.00%)`	⬇️
src/parameterized/normal.jl	`17.24% <0.00%> (-40.46%)`	⬇️
src/parameterized/cauchy.jl	`14.28% <0.00%> (-28.58%)`	⬇️
src/parameterized/studentt.jl	`33.33% <0.00%> (-26.67%)`	⬇️
src/parameterized/laplace.jl	`9.09% <0.00%> (-24.25%)`	⬇️
src/combinators/product.jl	`0.00% <0.00%> (-22.48%)`	⬇️
src/combinators/weighted.jl	`0.00% <0.00%> (-20.69%)`	⬇️
src/parameterized/gumbel.jl	`12.50% <0.00%> (-16.08%)`	⬇️
src/const.jl	`26.66% <0.00%> (-10.84%)`	⬇️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 419d2f2...2f7e03c. Read the comment docs.

github-actions · 2021-08-23T11:52:58Z

Package name	latest	stable
Mitosis.jl
Soss.jl

cscherrer · 2021-08-23T15:46:54Z

Thanks @gdalle for pushing on this, it's great to have it moving forward. Yes, I think we can get things much more efficient, in a few ways. First, in the logp parameterization, ...

the logdensity will just be getindex
we can use the Gumbel-max trick for sampling
We can avoid normalization, leaving it to depend on logsumexp(logp)

Then, in the case where the user passes something already normalized, it will turn out independent of the measure.

Clearly we also need the unlogged parameterization, and I think we can do much better in that case as well.

It's very strange to me that they sortperm(vs). If anything, I would expect sorting the probabilities in reverse order, so when you walk down them to sample you can stop earlier. Anyway, I think we can avoid this, or when we do have to sort we can lean on PermutedArrays:
https://github.com/cscherrer/PermutedArrays.jl

I have a whole plan for using this to build a better DiscreteNonparametric based around a PriorityQueue, but implemented more efficiently than that. It will use Dictionaries.jl. That's a whole separate issue, so we can discuss separately if you like :)

cscherrer · 2021-08-23T15:49:09Z

Also I should maybe point out that the whole distproxy thing isn't core to our approach. It's a stand-in, and when it works well we can keep it. But the great thing is, we can break away as needed to get better flexibility and/or performance

gdalle · 2021-08-24T05:50:22Z

I think the sortperm is useful for quick cdf computations, because then you only need a searchsortedfirst(vs, x) to know at which step of the discrete distribution you are.

cscherrer · 2021-08-25T21:40:34Z

Hi @gdalle , how's this going? Let me know if you get stuck or have questions, or when it's ready for review :)

gdalle · 2021-08-26T07:45:41Z

Hey @cscherrer,
Just added the logp parametrization, and realized the sortperm may not be a worry since the values are a range type: judging by a little @benchmark, sorting p doesn't seem to reallocate.
However, in the log parametrization, the sampling is very inefficient since it requires computing the exponential of logp. Is that what you wanted me to use the Gumbel trick for? I don't know what it is just yet, but I will look it up.

github-actions · 2021-08-26T07:52:57Z

Package name	latest	stable
Mitosis.jl
Soss.jl

cscherrer · 2021-08-26T12:35:46Z

If you have a vector logp of log-probabilities, it's just (pseudo-Julia)

draw_categorical(logp) = argmax(logp .- log(-log.(rand(length(logp))))

The second term is a vector of Gumbel samples.

The trade-offs are

you pay two logs instead of one exp
but after that the only dependency is the argmax, so it's a reduce and should scale well
staying in log space avoids underflow
maybe the biggest thing, there's no need to normalize anything

jeremiahpslewis · 2022-11-09T20:40:26Z

Hey! Is this PR blocked or is there something manageable which still needs to be tackled?

gdalle · 2022-11-10T15:25:44Z

Honestly I haven't touched this in years so I better let @cscherrer answer

irisallevi · 2024-03-20T09:03:23Z

Hi! Any news on this?

gdalle · 2024-03-20T22:27:19Z

Not on my end, my work no longer requires MeasureTheory so I don't think I'll pursue this further

gdalle · 2024-03-20T22:27:30Z

Feel free to take it over though

Categorical distribution

28b5539

Add logp parametrization for Categorical

2f7e03c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Categorical distribution #148

Categorical distribution #148

gdalle commented Aug 23, 2021

codecov bot commented Aug 23, 2021 •

edited

Loading

github-actions bot commented Aug 23, 2021

cscherrer commented Aug 23, 2021

cscherrer commented Aug 23, 2021

gdalle commented Aug 24, 2021

cscherrer commented Aug 25, 2021

gdalle commented Aug 26, 2021

github-actions bot commented Aug 26, 2021

cscherrer commented Aug 26, 2021

jeremiahpslewis commented Nov 9, 2022

gdalle commented Nov 10, 2022

irisallevi commented Mar 20, 2024

gdalle commented Mar 20, 2024

gdalle commented Mar 20, 2024

Categorical distribution #148

Are you sure you want to change the base?

Categorical distribution #148

Conversation

gdalle commented Aug 23, 2021

codecov bot commented Aug 23, 2021 • edited Loading

Codecov Report

github-actions bot commented Aug 23, 2021

cscherrer commented Aug 23, 2021

cscherrer commented Aug 23, 2021

gdalle commented Aug 24, 2021

cscherrer commented Aug 25, 2021

gdalle commented Aug 26, 2021

github-actions bot commented Aug 26, 2021

cscherrer commented Aug 26, 2021

jeremiahpslewis commented Nov 9, 2022

gdalle commented Nov 10, 2022

irisallevi commented Mar 20, 2024

gdalle commented Mar 20, 2024

gdalle commented Mar 20, 2024

codecov bot commented Aug 23, 2021 •

edited

Loading