
Memory issues #97

Open
TimCraigCGPS opened this issue Nov 11, 2024 · 7 comments
Labels: help wanted (Extra attention is needed)

@TimCraigCGPS

Hi,
I am finding that when I run a job, it seems to run out of memory after 1-2 days. I don't think this is a problem with the machine memory being insufficient to run at all, since it's able to get through ~21-23 trajectories in the trajectory_stats.csv file. I'm also able to run the PDL1 example, but I get a similar error after some time if I try to get 200 binders (66 trajectories). Given a previous comment that we might need to run 200-300 trajectories for "easy" targets and 2000-3500 for harder ones, I'm wondering what might be causing this.

I'm getting an error that looks like this:
<Signals.SIGKILL: 9>.; 2341581)

I'm wondering if the best practice might be to do something like shut down BindCraft and then restart it every 12 hours or so?
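
(For illustration only, a minimal sketch of that kind of periodic restart, assuming the run is launched as `python bindcraft.py --settings <target json>` and can resume from its existing output directory; the command and settings path below are placeholders, not a confirmed interface.)

```python
import subprocess

# Hypothetical wrapper: relaunch the BindCraft run every 12 hours so that any
# slowly accumulating host memory is released between restarts.  Assumes the
# run resumes from its existing output directory; paths are placeholders.
RESTART_INTERVAL_S = 12 * 60 * 60

while True:
    proc = subprocess.Popen(
        ["python", "bindcraft.py", "--settings", "settings_target/my_target.json"]
    )
    try:
        if proc.wait(timeout=RESTART_INTERVAL_S) == 0:
            break  # run finished on its own, nothing left to restart
    except subprocess.TimeoutExpired:
        proc.terminate()  # 12 hours elapsed: stop the process and relaunch
        proc.wait()
```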

@martinpacesa (Owner)

This is interesting; how much RAM are you allocating?

@TimCraigCGPS (Author) commented Nov 12, 2024

I'm using an A100, so 80GB (@task(queue='gpu', executor_config={'--mem-per-gpu': '80G', '-G': '1'})). I'm not sure that we are specifying an amount of RAM.

Should we be setting os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] to 2 or higher? (I found this in another thread.)
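
(As a side note, and only as a minimal sketch rather than BindCraft's documented setup: XLA_PYTHON_CLIENT_MEM_FRACTION controls the fraction of GPU memory that JAX preallocates, its value is a string, and it has to be set before JAX is first imported; it does not change how much host RAM the job may use.)

```python
import os

# Must be set before JAX is imported; the value is a string fraction of GPU
# memory to preallocate (JAX's default is 0.75).  This tunes GPU memory only
# and does not raise or lower the host RAM available to the job.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"

import jax  # imported after the variable is set so XLA picks it up
```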

@martinpacesa (Owner)

I meant computer RAM; the code compilation and PyRosetta still require a decent amount of memory to run. We normally use 32 GB.

@agitter commented Nov 16, 2024

I had one run with untrimmed EGFR 6aru as the target and an 80GB A100 that was held in our shared computing system because it exceeded the 100 GB of RAM I requested. It completed when I reran it and requested 200 GB of RAM.

@martinpacesa (Owner)

How big was that? I have so far never needed that much RAM.

@agitter commented Nov 18, 2024

Chain A of 6aru is 622 residues. I tried lengths of 50-250 with hotspots and default settings.

Our system uses cgroups to monitor and enforce resource sharing, and this is part of the error message I got:

Job has gone over cgroup memory limit of 102400 megabytes. Last measured usage: 261 megabytes.  Consider resubmitting with a higher request_memory.

I can't confirm whether that was the actual memory usage, because it was running in a batch setting that I wasn't monitoring.
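
(If it helps to cross-check the scheduler's accounting, here is a minimal sketch of logging host RSS alongside a run; psutil, the interval, and the log path are assumptions, not part of BindCraft.)

```python
import threading
import time

import psutil  # assumption: psutil is installed in the environment


def log_memory(interval_s=300, logfile="memory_log.tsv"):
    """Append the resident memory of this process and its children every
    interval_s seconds, so peak host-RAM usage can be checked after the run."""
    proc = psutil.Process()
    with open(logfile, "a") as fh:
        while True:
            rss = proc.memory_info().rss
            for child in proc.children(recursive=True):
                rss += child.memory_info().rss
            fh.write(f"{time.time():.0f}\t{rss / 1e9:.2f}\n")  # seconds, GB
            fh.flush()
            time.sleep(interval_s)


# Run in a daemon thread so it stops automatically when the main job exits.
threading.Thread(target=log_memory, daemon=True).start()
```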

@martinpacesa (Owner)

Okay, thanks a lot for the report. I will keep an eye on the memory usage settings over the coming weeks, but this should not happen.

@martinpacesa added the "help wanted (Extra attention is needed)" label on Nov 18, 2024