-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comparingfit!
and evaluate!
for LogisticClassifier
with scikit-learn
#98
Comments
fit!
and evaluate!
for LogisticClassifier
in comparison with scikit-learn in Python
A first note is that you're comparing MLJ to ScikitLearn, not just MLJLinearModels to ScikitLearn, this will incur additional overheads to do with data handling, MLJ has a more sophisticated treatment of your data than SKL and that takes a bit of time, mostly negligible when you move to larger datasets. If you want to compare "bare" LogisticClassifier, then you have to directly call MLJLinearModels on the raw floating-point data as a matrix, and do the same on the sklearn side. In doing so, you'll get the results mentioned in #14. Note that while in the experiments we ran we did observe performance gains, this is not very important. If one were to write perfect code in Python for LC, the core of the computation it for non-trivial datasets would boil down to calling NumPy primitives which are not written in python and so are comparable in speed to what you get on the Julia side. In terms of parameters, the parameters from the classifiers can be adjusted in PS: for the python code you should be using LogisticCV but that's a side note. |
fit!
and evaluate!
for LogisticClassifier
in comparison with scikit-learn in Pythonfit!
and evaluate!
for LogisticClassifier
with scikit-learn
@tlienart Thank you very much for your detailed answer. But I'm afraid, some points are missed. In general, I'm not trying to blame MLJ or MLJLinearModels, my point is just to compare performance. And I like MLJ very much and I'm eager to obtain much much better performance for MLJ models than SKL. Please do not think I'm critical of MLJ.
Could you please say what points are missed by me in https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc (except the parameters for the solver) to make my comparison 1-1? If nothing is missed, what in this case are your results exactly for the same code? Once again, thank you very much for you kind help. |
Thanks, no worries I didn't take your first message badly, just pointing out things that one would need to be careful of when doing comparisons. For the solver configuration, you can specify it in If you wanted to pass non-default parameters to the LBFGS solver, then you'd have to extend the definition of LBFGS above, and pass the parameters to the My point about CV can probably be ignored in this conversation, CV just confuses the comparison and should be dropped. What I was saying is that there's a custom algorithm when you want to do CV with Ridge or LR (which is not implemented here but is implemented in sklearn) but it's a separate point not important here. |
Folks, I appreciate there may be performance issues and I'm willing to investigate or help people investigating but in order to do this the reports to be precise... As a result, could I please ask the following:
and run this a few times. This will help in isolating the issue. Thanks! |
Hi Thibaut, I'm currently using MLJLinearModels for a personnal project where I need to run millions of multinomial regressions. Unfortunately, I also encounter a performance bottleneck as noted above. I have tried to provide here a reproducible benchmark comparing with sklearn called from Julia. I think I have followed your guidelines except for the data generation process as I don't think it matters but happy to change if deemed necessary. I use MLJ but it is only to have access to the models, eg I don't use machines. I have explored 3 dimensions: Number of samples, number of features and number of categories. It seems there is overall a non negligible difference in performance which unfortunately keeps increasing for large samples and large categories. As you say, I don't think we should try to make better than sklearn as it is C behind the hood but getting the same order of magnitute seems reasonable. I hope this will help you figure out potential improvements leads if you have time to dedicate to this issue. Also, let me know if I can provide more relevant figures or if I missed something. # Fit the underlying model
using MLJ
using Random
using StableRNGs
using BenchmarkTools
using DataFrames
using Plots
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
SKLogisticClassifier = @load LogisticClassifier pkg=ScikitLearn verbosity = 0
mljmodel = LogisticClassifier(lambda=0, fit_intercept=true)
skmodel = SKLogisticClassifier(fit_intercept=true, penalty="none")
results = DataFrame(Id=Int[], P=Int[], N=Int[], NCat=Int[], MLJLTime=Float64[], SKTime=Float64[])
# Experiment 1: Increasing sample size
X = nothing
y = nothing
rng = StableRNG(1234)
p = 4
ncat = 3
for n_samples in [100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]
X = rand(rng, n_samples, p)
y = categorical(rand(rng, 1:ncat, n_samples))
time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
push!(results, (Id=1, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end
# Experiment 2: Increasing feature dimension
ncat = 3
n_samples = 500_000
for p in [5, 10, 30, 50, 100]
X = rand(rng, n_samples, p)
y = categorical(rand(rng, 1:ncat, n_samples))
time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
push!(results, (Id=2, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end
# Experiment 3: Increasing number of categories
p = 4
n_samples = 500_000
for ncat in [2, 3, 4, 5, 6, 7, 8, 9, 10]
X = rand(rng, n_samples, p)
y = categorical(rand(rng, 1:ncat, n_samples))
time_mljl = @belapsed MLJ.fit(mljmodel, 1, X, y)
time_skl = @belapsed MLJ.fit(skmodel, 1, X, y)
push!(results, (Id=3, P=p, N=n_samples, NCat=ncat, MLJLTime=time_mljl, SKTime=time_skl))
end
# Plotting
first_exp = filter(:Id => ==(1), results)
plot(first_exp.N, Matrix(first_exp[!, [:MLJLTime, :SKTime]]),
yaxis="Time(s)",
xaxis="N samples",
title="Logistic regression performance (p=4, ncat=3)",
label=["MLJLinearModels" "SKLearn"])
savefig("nsamples.png")
second_exp = filter(:Id => ==(2), results)
plot(second_exp.P, Matrix(second_exp[!, [:MLJLTime, :SKTime]]),
yaxis="Time(s)",
xaxis="N features",
title="Logistic regression performance (n=500 000, ncat=3)",
label=["MLJLinearModels" "SKLearn"])
savefig("nfeatures.png")
third_exp = filter(:Id => ==(3), results)
plot(third_exp.NCat, Matrix(third_exp[!, [:MLJLTime, :SKTime]]),
yaxis="Time(s)",
xaxis="N categorties",
title="Logistic regression performance (n=500 000, p=4)",
label=["MLJLinearModels" "SKLearn"])
savefig("ncategories.png")
|
Thanks for doing this analysis carefully, I've done something very similar to you except removing confounders as much as possible (removing MLJ, removing CategoricalArrays, ensuring they use the same loss functions etc.). Reproducible notebook attached is, ran with Julia 1.6.2. One thing that should be noted is that the way the algorithms are stopped are different which can affect the timings (a bit); but there shouldn't be huge discrepancies. Since MLJLM calls Optim.jl in the background; we could try to expose more of the parameters of Optim.jl. Note: I ran this on an Intel i7; some of the numbers you show in your graph seem to suggest you have a machine that is up to 3x faster than mine which I find pretty impressive but that's for another discussion. Exp 1: N increasesp fixed at 4, n_cat at 3 Comments:
Exp 2: p increasesn fixed at 100_000, n_cat at 3. Comments:
Exp 3: c increasep fixed at 4 x axis is c Comments:
What I'm taking from this is that it might be worth investigating along the following lines:
|
Witnessing another pretty serious performance issue on a dataset I'm working with, using multinomial regression. I think both scikit-learn and MLJLinearModels use LBFGS:
We are talking 30 folds faster while the final losses are pretty similar. This seems embarassingly huge so please let me know if there is anything wrong with the experiment below. The data: data_multinomial_perf.csv.zip I'd be happy to help investigate sensitivity to major hyperparameters and how they might affect loss but there's not much exposed in the API. Also when looking at scikit-learn / Optim I can see that the APIs differ quite a bit so I don't really know if both algorithms are on the same footing. Since this issue has been open for quite some time and is becoming critical for my project, can I ask whether anyone will have bandwidth/skills to look into it ? Code to reproduce: using CSV
using DataFrames
using MLJScikitLearnInterface
using MLJLinearModels
using MLJBase
using CategoricalArrays
data = CSV.read("/Users/olivierlabayle/data_multinomial_perf.csv", DataFrame)
X = data[!, ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]]
y = categorical(data.y)
model_mlj = MLJLinearModels.LogisticClassifier(penalty=:none)
fr_mlj = MLJBase.fit(model_mlj, 1, X, y)
ŷ_mlj = MLJBase.predict(model_mlj, fr_mlj[1], X)
mean_ll_mlj = mean(log_loss(ŷ_mlj, y))
@time MLJBase.fit(model_mlj, 1, X, y);
model_skl = MLJScikitLearnInterface.LogisticClassifier(penalty="none", tol=1e-6)
fr_skl = MLJBase.fit(model_skl, 1, X, y)
ŷ_skl = MLJBase.predict(model_skl, fr_skl[1], X)
mean_ll_skl = mean(log_loss(ŷ_skl, y))
@time MLJBase.fit(model_skl, 1, X, y); |
Thanks for providing the data; your example is nice and it would be great to open a PR against the docs to discuss how to pass optimizer options. The code below discards some MLJ stuff to keep things simple but I show further down how to adjust your code. Take aways are:
As explained in the discussion above the point is not to show that MLJ gets to a "better" value (meaningless in this context) but rather that in the same time we get to a comparable objective. using CSV
using DataFrames
using MLJScikitLearnInterface
using MLJLinearModels
using MLJBase
using CategoricalArrays
using Optim
data = CSV.read("data_multinomial_perf.csv", DataFrame)
X = data[!, ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]]
y = categorical(data.y)
enc(e) = e == "TC" ? 1 : e == "CC" ? 2 : 3
y_enc = [enc(e) for e in y]
X_enc = Matrix(X)
mc = MLJLinearModels.MultinomialRegression(penalty=:none, nclasses=3)
@time theta_1 = MLJLinearModels.fit(mc, X_enc, y_enc)
@time theta_2 = MLJLinearModels.fit(mc, X_enc, y_enc; solver=MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6)))
J = MLJLinearModels.objective(mc, X_enc, y_enc, c=3)
J(theta_1) # got 458 663
J(theta_2) # got 458 840
# model_mlj = MLJLinearModels.LogisticClassifier(penalty=:none)
# fr_mlj = MLJBase.fit(model_mlj, 1, X, y)
# ŷ_mlj = MLJBase.predict(model_mlj, fr_mlj[1], X)
# mean_ll_mlj = mean(log_loss(ŷ_mlj, y))
# @time MLJBase.fit(model_mlj, 1, X, y);
model_skl = MLJScikitLearnInterface.LogisticClassifier(penalty="none", tol=1e-6)
fr_skl = MLJBase.fit(model_skl, 1, X, y)
ŷ_skl = MLJBase.predict(model_skl, fr_skl[1], X)
mean_ll_skl = mean(log_loss(ŷ_skl, y))
@time MLJBase.fit(model_skl, 1, X, y);
(fm, _), _ = fr_skl
J(vec(vcat(fm.coef_', fm.intercept_'))) # got 615 536 To adjust your code: import Optim
model_mlj = MLJLinearModels.MultinomialClassifier(
penalty=:none,
nclasses=3,
solver=MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6))
) It would be great if you extended the docs a bit to indicate how to do this. |
Thanks for the explanation on how to use the Optim package with MLJLinearModels and happy to see that there's a way forward. I've got two further questions:
using MLJLinearModels
using Optim
MLJLinearModels.LBFGS(optim_options=Optim.Options(f_tol=1e-6))
ERROR: MethodError: no method matching MLJLinearModels.LBFGS(; optim_options=Optim.Options(x_abstol = 0.0, x_reltol = 0.0, f_abstol = 0.0, f_reltol = 1.0e-6, g_abstol = 1.0e-8, g_reltol = 1.0e-8, outer_x_abstol = 0.0, outer_x_reltol = 0.0, outer_f_abstol = 0.0, outer_f_reltol = 0.0, outer_g_abstol = 1.0e-8, outer_g_reltol = 1.0e-8, f_calls_limit = 0, g_calls_limit = 0, h_calls_limit = 0, allow_f_increases = true, allow_outer_f_increases = true, successive_f_tol = 1, iterations = 1000, outer_iterations = 1000, store_trace = false, trace_simplex = false, show_trace = false, extended_trace = false, show_every = 1, time_limit = NaN, )
)
Closest candidates are:
MLJLinearModels.LBFGS() at ~/.julia/packages/MLJLinearModels/Goxow/src/fit/solvers.jl:62 got unsupported keyword argument "optim_options"
Stacktrace:
[1] top-level scope
|
|
Would you mind giving me access to the repo? I will try to open a PR in the coming week. |
This issue is closely related to #14
I'm trying to compare performance of
fit!
andevaluate!
in Julia and analogous methods in scikit-learn in Python. The code and data are available by the link: https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efcWhat concerns Julia code, I run it in Windows 10, Julia version 1.6.1 (2021-04-23), MLJ v0.16.4, MLJBase v0.18.6, MLJLinearModels v0.5.4 (and the Python code was run in Python 3.8.6, scikit-learn 0.23.2, numpy 1.19.2).
And the results are as follows:
fit!
/fit
[ Info: Training Machine{LogisticClassifier,…} @116.
0.211018 seconds (6.75 k allocations: 23.712 MiB)
Training took 0.07000088691711426 seconds
evaluate!
/cross_val_score
Evaluating over 10 folds: 100%[=========================] Time: 0:00:01
1.343403 seconds (94.58 k allocations: 232.942 MiB, 0.60% gc time)
Scoring took 0.5599837303161621 seconds
It should be noted that the corresponding Julia code marked with
@time
(see https://gist.github.com/irublev/6e8f928ef78993922aa4b5d735bf3efc) was executed several times to exclude compilation time.And the question I'd like to answer is why MLJ is 2.5-3 times slower than scikit-learn. I tried to make parameters of models in Julia and in Python as close as possible.
But I have not found the way to configure solvers. In Python we have the following list of parameters:
Thus, we have here maximal number of iterations, tolerance, etc. And I'm not sure the comparison above is correct, because I do not know what are the values of the respective parameters for LFBGS solver in Julia.
Could you please say how to make sure that all is configured in Julia exactly as it is in Python? And what may be the reason for such performance of Julia? May it be improved somehow? It does not looke as
expect on par or better
as pointed in #14The text was updated successfully, but these errors were encountered: