Out of memory on CUDA GPU #207
Could you try to reproduce this with XGBoost.jl only? It's hard for me to imagine how the wrapper can cause this: during training it's really not doing anything other than calling the update function from the library. Though it's been a while since I've worked on this package, so my memory might be hazy.
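For concreteness, a minimal MLJ-free repro might look like the sketch below. This is an assumption-laden sketch, not code from the thread: `X` and `y` stand for the feature matrix and labels, and the exact keyword API can differ between XGBoost.jl versions.

```julia
using XGBoost

# Hypothetical repro sketch: train repeatedly on the same data while watching
# GPU memory (e.g. with nvidia-smi). If memory grows across iterations even
# though each Booster goes out of scope, the leak is below the MLJ layer.
for i in 1:20
    bst = xgboost((X, y); num_round = 90,
                  tree_method = "hist",   # or "gpu_hist" to force the GPU path
                  eta = 0.08, max_depth = 7)
end
```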
I guess my question is: what does XGBoost.jl do, fundamentally, when training or running inference on large data? Is there a way to manually tell it to partition the data more finely?
```julia
function cross_train(df_all;
                     model = XGBoostClassifier(; tree_method = "hist", eta = 0.08,
                                               max_depth = 7, num_round = 90),
                     Nfolds = 5)
    y, eventNumber, X = unpack(df_all, ==(:proc), ==(:eventNumber), ∈(BDT_input_names))
    eventNumber_modN = mod1.(eventNumber, Nfolds)
    machines = []
    @views for ith_fold in 1:Nfolds
        fold_mask = eventNumber_modN .!= ith_fold
        fold_y = y[fold_mask]
        fold_X = X[fold_mask, :]
        mach_bdt = machine(model, fold_X, fold_y)
        fit!(mach_bdt)

        push!(machines, mach_bdt)
    end
    machines
end
```

I think my problem is that each MLJ machine holds a reference to its training data, and the data probably got moved to CUDA, so it's now effectively a memory leak.
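If dangling references are indeed the problem, one generic thing to try between folds is forcing a GC pass and asking CUDA.jl to release its cached memory pool. This isn't from the thread, just a housekeeping sketch assuming CUDA.jl is loaded; it can only help once nothing still references the device arrays.

```julia
using CUDA

# Hypothetical helper: collect unreachable objects, then hand CUDA.jl's
# cached device memory back to the driver. Harmless to call between folds,
# but ineffective while a live machine still references the GPU data.
function free_gpu_memory()
    GC.gc()
    CUDA.reclaim()
end
```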
The library itself doesn't give us a huge amount of leeway: basically you have to put your entire dataset into memory and it runs on that. There is a little bit of utility for larger datasets, but last I looked it was pretty half-baked, at least within the C API. I wonder if there is memory leaking when creating lots of …
Uh. You could try:

```julia
mach = restore!(serializable(mach))
```

Methods exported by MLJ and MLJBase, I think.
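Applied to `cross_train` above, that suggestion might look like the following sketch. `fit_and_strip` is a hypothetical helper, not an MLJ API; the assumption is that the data-free copy returned by `serializable` lets the fold's (possibly GPU-resident) training arrays be collected.

```julia
using MLJ  # serializable and restore! come from MLJBase, re-exported by MLJ

# Hypothetical helper: fit a machine, then keep only a data-free copy of it.
function fit_and_strip(model, X, y)
    mach = machine(model, X, y)
    fit!(mach)
    restore!(serializable(mach))  # still usable for predict, minus the data
end
```

In the loop, `push!(machines, fit_and_strip(model, fold_X, fold_y))` would then replace the `machine`/`fit!`/`push!` lines.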
According to JuliaAI/MLJBase.jl#750 it doesn't fully remove the reference to the data somehow? But it looks like it helps quite a lot at least.

With the "fix": *(memory usage screenshot)*

Without the "fix": *(memory usage screenshot)*
I'm not 100% sure, but I doubt the cited issue applies here.
Could you please share what's half-baked? I can help take a look.
This happens with less and less free memory:

Free memory: *(value elided)*

So something is leaking memory... I'm using XGBoost via MLJ, any clue?