
Out of memory on CUDA GPU #207

Closed

Moelf opened this issue Nov 22, 2024 · 9 comments

@Moelf
Contributor

Moelf commented Nov 22, 2024

[ Info: Training machine(XGBoostClassifier(test = 1, …), …).
[ Info: XGBoost: starting training.
┌ Warning: [04:42:46] WARNING: /workspace/srcdir/xgboost/src/common/error_msg.cc:27: The tree method `gpu_hist` is deprecated since 2.0.0. To use GPU training, set the `device` parameter to CUDA instead.
│
│     E.g. tree_method = "hist", device = "cuda"
└ @ XGBoost ~/.julia/packages/XGBoost/nqMqQ/src/XGBoost.jl:34
┌ Error: Problem fitting the machine machine(XGBoostClassifier(test = 1, …), …).
└ @ MLJBase ~/.julia/packages/MLJBase/7nGJF/src/machines.jl:694
[ Info: Running type checks...
[ Info: Type checks okay.

XGBoostError: (caller: XGBoosterUpdateOneIter)
[04:42:46] /workspace/srcdir/xgboost/src/tree/updater_gpu_hist.cu:781: Exception in gpu_hist: [04:42:46] /workspace/srcdir/xgboost/src/c_api/../data/../common/device_helpers.cuh:431: Memory allocation error on worker 0: Caching allocator
- Free memory: 12582912
- Requested memory: 9232383

This happens with less and less "Free memory" reported each time, so something is leaking memory. I'm using XGBoost via MLJ, any clue?

@ExpandingMan
Collaborator

Could you try to reproduce with XGBoost.jl only? It's hard for me to imagine how the wrapper can cause this: it's really not doing anything during training other than calling the update function from the library; though it's been a while since I've worked on this package so my memory might be hazy.
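A minimal pure-XGBoost.jl loop along the following lines could help isolate whether the leak is in the wrapper or in MLJ. This is only a sketch: DMatrix and xgboost are the XGBoost.jl entry points, but the data shapes, parameter values, and the CUDA.jl memory checks are placeholder assumptions, not code from this thread.

    # Hypothetical minimal reproduction: train repeatedly with XGBoost.jl alone
    # and check whether device memory keeps growing across iterations.
    using XGBoost, CUDA

    X = rand(Float32, 100_000, 20)   # placeholder data
    y = rand(Float32, 100_000)

    for i in 1:10
        dm  = DMatrix(X, y)
        bst = xgboost(dm; num_round=90, max_depth=7, eta=0.08,
                      tree_method="hist", device="cuda")
        dm = nothing; bst = nothing  # drop references so the GC can finalize them
        GC.gc()
        CUDA.memory_status()         # does free memory recover after each iteration?
    end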

@Moelf
Contributor Author

Moelf commented Nov 22, 2024

I guess my question is: what does XGBoost.jl fundamentally do when training or running inference on large data? Is there a way to manually tell it to partition more finely?

@Moelf
Contributor Author

Moelf commented Nov 22, 2024

function cross_train(df_all; model = XGBoostClassifier(; tree_method="hist", eta=0.08, max_depth=7, num_round=90), Nfolds=5)
    y, eventNumber, X = unpack(df_all, ==(:proc), ==(:eventNumber), BDT_input_names)
    eventNumber_modN = mod1.(eventNumber, Nfolds)
    machines = []
    @views for ith_fold in 1:Nfolds
        fold_mask = eventNumber_modN .!= ith_fold
        fold_y = y[fold_mask]
        fold_X = X[fold_mask, :]
        mach_bdt = machine(model, fold_X, fold_y)
        fit!(mach_bdt)
        push!(machines, mach_bdt)
    end
    machines
end

I think my problem is that each MLJ machine holds a reference to the data, and the data probably got moved to CUDA, so it now effectively becomes a memory leak.
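One way to test that hypothesis (a sketch, not something run in this thread): drop every reference to the trained machines, force a GC so finalizers run, and ask CUDA.jl to return cached blocks before re-checking memory.

    using CUDA

    machines = cross_train(df_all)
    # ... use the machines ...
    machines = nothing    # drop the only reference to the machines (and the data they hold)
    GC.gc()               # run finalizers on the now-unreachable objects
    CUDA.reclaim()        # ask the caching allocator to return freed blocks to the driver
    CUDA.memory_status()  # compare with what nvidia-smi reports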

@Moelf
Contributor Author

Moelf commented Nov 22, 2024

3292MiB /  16384MiB

5252MiB /  16384MiB

7468MiB /  16384MiB

...

Setting mach_bdt = machine(model, fold_X, fold_y; cache=false) doesn't help. @ablaom, do you have any suggestions? How can I strip away references to the CUDA data after the machine is trained?

@ExpandingMan
Collaborator

I guess my question is: what does XGBoost.jl fundamentally do when training or running inference on large data? Is there a way to manually tell it to partition more finely?

The library itself doesn't exactly give us a huge amount of leeway. Basically you have to put your entire dataset into memory and it runs on it. There is a little bit of utility for larger datasets, but last I looked it was pretty half-baked, at least within the C API. I wonder if there is memory leaking when creating lots of Booster objects? The wrapper is of course supposed to ensure this doesn't happen by letting the Julia GC know about them, but there could be a bug. That would actually be a bug in the wrapper and therefore solvable, whereas whatever happens to the data and how it's trained once it gets into the C lib is totally mysterious.
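If the suspicion is specifically about Booster lifetime, one hypothetical check (assuming, as described above, that the wrapper registers a finalizer on each Booster) is to train many small boosters and finalize each one explicitly; the data and parameters here are placeholders.

    using XGBoost, CUDA

    X, y = rand(Float32, 10_000, 10), rand(Float32, 10_000)  # placeholder data
    for i in 1:100
        bst = xgboost(DMatrix(X, y); num_round=10, tree_method="hist", device="cuda")
        finalize(bst)   # run the registered finalizer now instead of waiting for the GC
    end
    GC.gc()
    CUDA.memory_status()  # if usage still grew, Booster lifetime is probably not the leak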

@ablaom
Contributor

ablaom commented Nov 23, 2024

@ablaom, do you have any suggestions? How can I strip away references to the CUDA data after the machine is trained?

Uh. You could try:

mach = restore!(serializable(mach))

Methods exported by MLJ and MLJBase, I think.
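A sketch of how that suggestion might slot into the cross_train function above; the function name here is just for illustration, serializable/restore! are the methods mentioned, and the per-fold GC.gc()/CUDA.reclaim() calls are an extra assumption rather than part of the suggestion.

    using MLJ, CUDA   # restore! and serializable come from MLJBase, re-exported by MLJ

    function cross_train_stripped(df_all; model=XGBoostClassifier(; tree_method="hist", eta=0.08, max_depth=7, num_round=90), Nfolds=5)
        y, eventNumber, X = unpack(df_all, ==(:proc), ==(:eventNumber), BDT_input_names)
        eventNumber_modN = mod1.(eventNumber, Nfolds)
        machines = []
        @views for ith_fold in 1:Nfolds
            fold_mask = eventNumber_modN .!= ith_fold
            mach_bdt = machine(model, X[fold_mask, :], y[fold_mask])
            fit!(mach_bdt)
            # strip data references but keep the fitted model usable for prediction
            mach_bdt = restore!(serializable(mach_bdt))
            push!(machines, mach_bdt)
            GC.gc(); CUDA.reclaim()   # encourage the GPU pool to shrink between folds
        end
        machines
    end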

@Moelf
Contributor Author

Moelf commented Nov 23, 2024

According to JuliaAI/MLJBase.jl#750 it doesn't fully remove the reference to the data somehow? But it looks like it helps quite a lot at least.

w/ the "fix":

2366MiB /  16384M
# GC.gc()
882MiB /  16384M

w/o the "fix":

2818MiB /  16384M
# GC.gc()
2624MiB /  16384M

Moelf closed this as completed Nov 23, 2024
@ablaom
Contributor

ablaom commented Nov 24, 2024

According to JuliaAI/MLJBase.jl#750 it doesn't fully remove the reference to the data somehow? But it looks like it helps quite a lot at least.

I'm not 100% sure, but I doubt the cited issue applies here.

@trivialfis
Member

There is a little bit of utility for larger datasets but last I looked

Could you please share what's half-baked? I can help take a look.
