DBSCAN do not free memory #18

ShJacub · 2023-12-06T20:30:16Z

Thanks for your fast DBSCAN implementation. I have a problem: every call to dbscan.DBSCAN(x) consumes additional memory that is never freed. If I call dbscan.DBSCAN(x) n times, it consumes n*V memory, where V is the memory used by a single dbscan.DBSCAN(x) call.
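
One way to check a claim like n*V is to measure how much memory stays allocated after repeated calls. The sketch below is only an illustration, not the library's own test: leaky_call and clean_call are hypothetical stand-ins for dbscan.DBSCAN(x), and tracemalloc only sees allocations made through Python's allocators, so memory taken by an extension with plain new/malloc would have to be tracked through the process RSS instead.

```python
import gc
import tracemalloc

_retained = []  # simulates references that are never released


def leaky_call(size=100_000):
    """Hypothetical stand-in for a call whose result is never freed."""
    buf = bytearray(size)
    _retained.append(buf)  # the extra reference keeps the buffer alive
    return buf


def clean_call(size=100_000):
    """Hypothetical stand-in for a call whose result is freed normally."""
    return bytearray(size)


def retained_bytes(func, n_calls):
    """Return how many traced bytes remain allocated after n_calls calls."""
    gc.collect()
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = None
    for _ in range(n_calls):
        result = func()
    del result  # drop our own reference to the last result
    gc.collect()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


leak_growth = retained_bytes(leaky_call, 20)
clean_growth = retained_bytes(clean_call, 20)
print(f"leaky: {leak_growth} bytes retained, clean: {clean_growth} bytes retained")
```

If the retained amount grows roughly linearly with the number of calls, the results are being kept alive somewhere.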

anivegesana · 2023-12-07T05:54:14Z

It is possible that I didn't handle the reference counts in the Python wrapper correctly. Can you share a small example that I can use to reproduce the issue?

This seems to reproduce it:

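import numpy as np
import dbscan
import sys

f = np.random.rand(5000000, 5)
# Each iteration should release the previous result; memory grows if it does not.
for i in range(100):
  q = dbscan.DBSCAN(f)

sys.getrefcount(q)
sys.getrefcount(q[0])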

If the object passed in is a NULL pointer, it is assumed that this was caused because the call producing the argument found an error and set an exception.
https://docs.python.org/3/c-api/arg.html#c.Py_BuildValue

Does the following build fix the issue for you?

pip install git+https://github.com/anivegesana/dbscan-python@memleak

ShJacub · 2023-12-08T08:41:00Z

Good day, I'm sorry for the late response. I ran your code and I see the same memory leak. The outputs of sys.getrefcount(q) and sys.getrefcount(q[0]) are 2 and 3. I also tried pip install git+https://github.com/anivegesana/dbscan-python@memleak and still see the same memory leak.

anivegesana · 2023-12-10T22:30:49Z

Hey @ShJacub,

I think I need some help reproducing the memory leak on my new branch. This is the code snippet that I am currently running, along with its output.

>>> import numpy as np
>>> import dbscan
>>> dbscan.__version__
'0.0.12.dev1+gc993316.d20230420'
>>> x = np.random.rand(5000000, 5)
>>> r = dbscan.DBSCAN(x)
>>> sys.getrefcount(r)
2
>>> sys.getrefcount(r[0])
2
>>> sys.getrefcount(r[1])
2
>>> import weakref
>>> del r
>>> g = weakref.ref(r[0])
>>> g()
array([0, 0, 0, ..., 0, 0, 0], dtype=int32)
>>> gc.collect()
480
>>> g()

Thank you for pointing out the memory leak. I need to be a little bit more careful when reading Python C API documentation since information about borrowed and owned references isn't always in an obvious place. 😅
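
Note that in the session above del r appears before weakref.ref(r[0]), which raises a NameError when run in that order; the weak reference has to be created first. A minimal reordered sketch of the same check follows. It uses a hypothetical Result class and make_result function in place of the arrays returned by dbscan.DBSCAN(x), so that it runs without the library.

```python
import gc
import weakref


class Result:
    """Hypothetical stand-in for an array returned by the clustering call."""


_leaked = []  # simulates a reference the wrapper forgot to release


def make_result(leak=False):
    """Return a (labels, core_mask) pair, optionally leaking the labels."""
    labels, core_mask = Result(), Result()
    if leak:
        _leaked.append(labels)
    return labels, core_mask


def is_released(leak):
    """Check whether the first element is freed once the tuple is deleted."""
    r = make_result(leak)
    g = weakref.ref(r[0])  # create the weak reference BEFORE deleting r
    del r
    gc.collect()
    return g() is None  # None means the object was actually freed


print(is_released(leak=False))  # True
print(is_released(leak=True))   # False
```

With the real library, the same pattern applies to r = dbscan.DBSCAN(x): if g() still returns the array after del r and gc.collect(), something else is holding a reference.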

ShJacub · 2023-12-11T10:32:39Z

Is it possible to solve this problem?

anivegesana · 2023-12-11T15:57:55Z

Yes, it is. Just need some more information. Can you run the code that I shared in the previous comment and share the output? Also, can you share the OS, Python version, and NumPy version that you are using?

ShJacub · 2023-12-12T09:27:24Z

Ubuntu 20.04.6 LTS
python 3.8.10
numpy 1.24.4

I ran the code above. These are the outputs:

>>> import numpy as np
>>> import dbscan
>>> dbscan.__version__
'0.0.12'
>>> x = np.random.rand(5000000, 5)
>>> r = dbscan.DBSCAN(x)
>>> sys.getrefcount(r)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sys' is not defined
>>> import sys
>>> sys.getrefcount(r)
2
>>> sys.getrefcount(r[0])
3
>>> sys.getrefcount(r[1])
3
>>> import weakref
>>> del r
>>> g = weakref.ref(r[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'r' is not defined
>>> g()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'g' is not defined
>>> gc.collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'gc' is not defined
>>> g()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'g' is not defined

anivegesana · 2023-12-14T00:58:58Z

Sorry about the delay. I was a little busy at work for a couple of days and didn't have a chance to take a look at this. It seems like our versions of the dbscan-python library don't match: you have the production version (0.0.12) and I have the version with the fix (0.0.12.dev1+gc993316.d20230420). Perhaps it failed to compile on your machine? I will build a wheel tonight for you to try out.

For some reason, the version name on the wheel is messed up, but it should mostly work fine.

curl 'https://drive.google.com/uc?export=download&id=1Wrglr9Xo9dyDiD9ngPLcPjI9sFCiYLIJ' -o "dbscan-0.1.dev90+g6a8f3e3-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
pip install "dbscan-0.1.dev90+g6a8f3e3-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl" --force-reinstall --no-deps
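
After installing, it may help to confirm which build Python actually picks up before re-running the leak check. A small sketch, assuming the distribution is registered under the name dbscan (taken from the wheel filename above); installed_version is a helper name introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version of the installed build, or None if nothing is installed.
print(installed_version("dbscan"))
```

If this still prints 0.0.12, the released package is the one being imported and the fix has not been installed.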

alwansm · 2024-01-22T09:38:31Z

Hello,

I have encountered a similar memory problem, but in C++. When I call the DBSCAN function inside a for loop, memory usage keeps increasing until the program crashes:

#include <iostream>

#include "dbscan/capi.h"
#include "dbscan/point.h"
#include "dbscan/geometryIO.h"
#include "dbscan/pbbs/parallel.h"
#include "dbscan/pbbs/parseCommandLine.h"

int main(int argc, char *argv[])
{

  commandLine P(argc, argv, "[-o <outFile>] [-eps <p_epsilon>] [-minpts <p_minpts>] <inFile>");
  char *iFile = P.getArgument(0);
  char *oFile = P.getOptionValue("-o");
  size_t rounds = P.getOptionIntValue("-r", 1);
  double p_epsilon = P.getOptionDoubleValue("-eps", 1);
  size_t p_minpts = P.getOptionIntValue("-minpts", 1);
  double p_rho = P.getOptionDoubleValue("-rho", -1);

  for (int i = 0; i < 10000; i++)
  {
    parlay::internal::start_scheduler();
    int dim = readHeader(iFile);
    _seq<double> PIn = readDoubleFromFile(iFile, dim);

    bool *coreFlag = new bool[PIn.n / dim];
    int *cluster = new int[PIn.n / dim];
    double *data = PIn.A;

    if (DBSCAN(dim, PIn.n / dim, data, p_epsilon, p_minpts, coreFlag, cluster))
      cout << "Error: dimension >20 is not supported." << endl;

    // if (oFile != NULL)
    // {
    //   writeArrayToFile("cluster-id", cluster, PIn.n / dim, oFile);
    // }

    PIn.del();
    delete[] coreFlag;
    delete[] cluster;
    parlay::internal::stop_scheduler();
  }
  return 0;
}

william-silversmith · 2024-03-05T02:32:09Z

I ran into a similar problem in some code I wrote a while back (not in DBSCAN) and here was its solution: seung-lab/mapbuffer@3746bd8

I suspect the issue has something to do with not releasing Xobj or X. You may want to call: Py_DECREF(X); to decrement the reference count on the python object before returning stuff.

anivegesana · 2024-03-05T02:52:28Z

Oh yes. You are absolutely right! PyArg_ParseTupleAndKeywords creates a strong reference to objects. The C API has a lot of sharp corners. Will try it out. I don't think this solves @alwansm 's problem, but Valgrind should help find that.

GunjanKholapure · 2024-05-09T01:48:07Z

Any updates on this? Memory usage in Python keeps growing when DBSCAN is run repeatedly.

GunjanKholapure · 2024-05-17T11:03:45Z

Update: This is what worked for me:

Since I only needed the labels and not the core mask, I changed the return statement, and I haven't seen any memory leak since then for my use case.

Instead of returning:
return PyTuple_Pack(2, labels, core_samples);

I am returning:
Py_DECREF(X);
Py_DECREF(core_samples);
return (PyObject*) labels;

After making these changes, I ran pip install -e . in the repo.
I had to comment out these lines for it to work:
py_limited_api=True
('Py_LIMITED_API', '0x03020000')

GunjanKholapure · 2024-06-20T03:31:14Z

Hey Yuki, thanks a lot for this. Is it possible to publish a new pip package with the fixes? That would be very useful.

Gunjan

On Wed, 19 Jun 2024 at 10:36, yuki-inaho wrote:

JFYI, I have addressed the memory leak issue by implementing the following changes. You can see the commits that were made to resolve the issue in the forked repository:

Commit 1: yuki-inaho@1bb13f3
Commit 2: yuki-inaho@748040c

We confirmed the improvement by running tests outlined in the following notebook:

Memory Leak Check Notebook: https://github.com/yuki-inaho/dbscan_comparison/blob/main/check_memory_leaks.ipynb

ShJacub changed the title ~~Do not free memory~~ DBSCAN do not free memory Dec 6, 2023
