Error in database downloading #52

Open
uzun-masha opened this issue Mar 13, 2023 · 15 comments

Comments

@uzun-masha

Hi!
I'm trying to download the database using mdmcleaner makedb, and I get the following error:
01a: download GTDB data--

    Now downloading from gtdb: "gtdb_taxfiles" (attempt 1)...

    Now downloading from gtdb: "gtdb_fastas" (attempt 1)...

    Now downloading from gtdb: "gtdb_vs_ncbi_lookup" (attempt 1)...

Traceback (most recent call last):
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/bin/mdmcleaner", line 10, in <module>
sys.exit(main())
^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/mdmcleaner.py", line 231, in main
read_gtdb_taxonomy.main(args, configs)
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1142, in main
getNprepare_dbdata_nonncbi(args.outdir, verbose=args.verbose, settings=configs.settings)
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1036, in getNprepare_dbdata_nonncbi
progressdump = _download_dbdata_nonncbi(targetdir, progressdump, verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 714, in _download_dbdata_nonncbi
progressdump["gtdb_download_dict"], progressdump["gtdb_version"] = download_gtdb_stuff(gtdb_source_dict, targetdir, verbose=verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 316, in download_gtdb_stuff
download_dict = get_download_dict(sourcedict, targetfolder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 300, in get_download_dict
okdownloadfilelist, allisfine = check_gtdbmd5file(which_md5filename(targetfolder), targetfolder, sourcedict[x]["pattern"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/storage/lab4/progs/miniconda3/envs/mdmcleaner/lib/python3.11/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 243, in which_md5filename
return glob.glob(targetdir + "/" + MD5FILEPATTERN_GTDB)[0] # --> assumes there is only one hit, therefore takes only the first of the list returned by glob.glob(); todo: make sure md5sum file is always deleted after db-setup! otherwise there may be problems if preexisting dbs are updated
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Can you please help me to solve it?

Regards,
Maria
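
For reference, the IndexError in the last frame comes from indexing an empty glob.glob() result: no MD5 file was found in the target directory, which would be consistent with the download not having succeeded. Below is a minimal sketch of a more defensive lookup. The function name find_md5_file and the default pattern "MD5SUM*" are assumptions for illustration, not mdmcleaner's actual code or the real value of MD5FILEPATTERN_GTDB.

```python
import glob
import os


def find_md5_file(targetdir, pattern="MD5SUM*"):
    """Return the single file matching pattern in targetdir, or raise a clear error.

    Hypothetical helper: the default pattern is an assumption, not the real
    value of MD5FILEPATTERN_GTDB.
    """
    hits = sorted(glob.glob(os.path.join(targetdir, pattern)))
    if not hits:
        # An empty list is what makes glob.glob(...)[0] raise IndexError.
        raise FileNotFoundError(
            "no file matching {!r} in {!r}; the download may have failed".format(
                pattern, targetdir
            )
        )
    if len(hits) > 1:
        raise RuntimeError(
            "expected one MD5 file, found {}: {}".format(len(hits), hits)
        )
    return hits[0]
```

Raising a descriptive error here would not fix the underlying download problem, but it would point at the cause instead of a bare "list index out of range".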

@jianshu93

I have the same error here. Is this resolved?

Best,
Jianshu

@leonmhartman

@jianshu93 I believe I encountered a similar problem and fixed it by editing read_gtdb_taxonomy.py, which for me was located at:
~/conda_environments/mdmcleaner/lib/python3.12/site-packages/mdmcleaner/read_gtdb_taxonomy.py

I changed the server address at line 18 from:
gtdb_server = "https://data.ace.uq.edu.au/public/gtdb/data/releases/latest"

to:
gtdb_server = "https://data.gtdb.ecogenomic.org/releases/latest"
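
The same edit can be scripted. The sketch below rewrites the gtdb_server assignment in a file; the helper name set_gtdb_server is hypothetical, and it assumes the assignment sits on a single line of the form gtdb_server = "...", as quoted above. Back up the file before trying it.

```python
import re


def set_gtdb_server(path, new_url):
    """Replace the URL in the first line of the form `gtdb_server = "..."`.

    Hypothetical helper; returns True if a line was changed, else False.
    """
    pattern = re.compile(
        r'^([ \t]*gtdb_server[ \t]*=[ \t]*)(["\']).*?\2', re.MULTILINE
    )
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    # A function replacement avoids any backslash interpretation in new_url.
    new_text, count = pattern.subn(
        lambda m: '{}"{}"'.format(m.group(1), new_url), text, count=1
    )
    if count:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(new_text)
    return bool(count)
```

It would be called with the path to read_gtdb_taxonomy.py in your own environment and the new server address shown above.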

@jianshu93

Hi @leonmhartman,

Thanks! It actually works. I am downloading it right now. I will be using it to remove host contigs from the American Gut Project. Perhaps I will have more questions later on, when I actually run the program to evaluate the quality of my genomes.

Thanks,

Jianshu

@leonmhartman

Hi @jianshu93

Unfortunately, it seems that this only partially worked. The main DB files downloaded successfully overnight (~9 hrs), but the makedb process failed to complete due to the following error (note that I have edited the file paths from the original for brevity):

WARNING: something went wrong during download: 0 expected files are missing, and 1 files have mismatching MD5-checksums!
  --> Missing files: 
  --> Corrupted files: MD5SUM.txt
Traceback (most recent call last):
  File "~/conda_environments/mdmcleaner/bin/mdmcleaner", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "~/conda_environments/mdmcleaner/lib/python3.12/site-packages/mdmcleaner/mdmcleaner.py", line 231, in main
    read_gtdb_taxonomy.main(args, configs)
  File "~/conda_environments/mdmcleaner/lib/python3.12/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1143, in main
    getNprepare_dbdata_nonncbi(args.outdir, verbose=args.verbose, settings=configs.settings)
  File "~/conda_environments/mdmcleaner/lib/python3.12/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1037, in getNprepare_dbdata_nonncbi
    progressdump = _download_dbdata_nonncbi(targetdir, progressdump, verbose=verbose)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/conda_environments/mdmcleaner/lib/python3.12/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 715, in _download_dbdata_nonncbi
    progressdump["gtdb_download_dict"], progressdump["gtdb_version"] = download_gtdb_stuff(gtdb_source_dict, targetdir, verbose=verbose)
                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/conda_environments/mdmcleaner/lib/python3.12/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 319, in download_gtdb_stuff
    assert download_dict, "\nERROR: Still incomplete download or mismatching MD5sums after {} download-attempts. Please check connection and try again later...\n".format(trycounter +1)
           ^^^^^^^^^^^^^
AssertionError: 
ERROR: Still incomplete download or mismatching MD5sums after 4 download-attempts. Please check connection and try again later...

You're probably better equipped than me to troubleshoot this (I'm not a bioinformatician), but it seems odd that the MD5SUM file is corrupt; if it is, surely other people would have had issues. Anyway, my next strategy is to change the URL again, download the previous DB (release 214.1), and hope that the MD5SUM file for that dataset is okay.

Cheers,
Leon
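
To check independently which downloaded files disagree with the checksum file, something like the following can help. It is a minimal sketch that assumes the checksum file uses the usual md5sum layout (hex digest, whitespace, file name on each line) with paths relative to the download folder. The names md5_of and check_md5_file are hypothetical helpers, not part of mdmcleaner.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_md5_file(md5_path, folder):
    """Return (missing, mismatching) file names listed in an md5sum-style file."""
    missing, mismatching = [], []
    with open(md5_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            expected = parts[0].lower()
            name = parts[1].strip().lstrip("*")  # "*" marks binary mode in md5sum output
            target = os.path.join(folder, name)
            if not os.path.isfile(target):
                missing.append(name)
            elif md5_of(target) != expected:
                mismatching.append(name)
    return missing, mismatching
```

If the data files all verify and only the checksum file is reported as corrupted, that suggests the problem lies on the server side rather than with the local download.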

@leonmhartman

@jianshu93 It turns out that problems with the MD5SUM file were flagged on the GTDB forum several weeks ago (see here). I have added a post on the forum asking the GTDB admins to update the corrupt file.

@jianshu93

Hi @leonmhartman, Thank you so much for it! Let me know when you get a response from the authors. I will use the old version first (v207 or something). Thanks! Jianshu

@jianshu93

Hi @leonmhartman, I mentioned the problem to the GTDB-Tk team and they solved it today. You can try again now. I am also trying to download the newest database.

@jianshu93

Hi @leonmhartman,

The GTDB problem is solved; however, there is now a new error:

--03a: downlad SILVA data--

Now trying to get current silva release version...

Traceback (most recent call last):
File "/home/jiz322/miniconda3/envs/mdmcleaner/bin/mdmcleaner", line 10, in <module>
sys.exit(main())
File "/home/jiz322/miniconda3/envs/mdmcleaner/lib/python3.8/site-packages/mdmcleaner/mdmcleaner.py", line 231, in main
read_gtdb_taxonomy.main(args, configs)
File "/home/jiz322/miniconda3/envs/mdmcleaner/lib/python3.8/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1143, in main
getNprepare_dbdata_nonncbi(args.outdir, verbose=args.verbose, settings=configs.settings)
File "/home/jiz322/miniconda3/envs/mdmcleaner/lib/python3.8/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 1037, in getNprepare_dbdata_nonncbi
progressdump = _download_dbdata_nonncbi(targetdir, progressdump, verbose=verbose)
File "/home/jiz322/miniconda3/envs/mdmcleaner/lib/python3.8/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 748, in _download_dbdata_nonncbi
progressdump["silva_download_dict"], progressdump["silva_version"] = download_silva_stuff(silva_source_dict, targetdir, verbose=verbose)
File "/home/jiz322/miniconda3/envs/mdmcleaner/lib/python3.8/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 378, in download_silva_stuff
version , url = getsilvaversion(sourcedict["silva_version"], targetfolder)
File "/home/jiz322/miniconda3/envs/mdmcleaner/lib/python3.8/site-packages/mdmcleaner/read_gtdb_taxonomy.py", line 365, in getsilvaversion
version = versionfile.read().strip()
File "/home/jiz322/miniconda3/envs/mdmcleaner/lib/python3.8/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 977: ordinal not in range(128)

This is related to downloading the SILVA database. Any idea?

Thanks,
Jianshu
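
The final frames show the version file being decoded with the ASCII codec, which fails on byte 0xe2 (the lead byte of a three-byte UTF-8 character). Below is a hedged sketch of reading such a file with an explicit encoding. The name read_version_file is hypothetical, and nothing in this thread confirms that mdmcleaner reads the file this way.

```python
def read_version_file(path):
    """Read a small text file as UTF-8 regardless of the system locale.

    Undecodable bytes are replaced rather than raising UnicodeDecodeError.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().strip()
```

Without an explicit encoding, open() falls back to the locale's preferred encoding, which may be ASCII on some systems. An explicit encoding avoids the crash, but it does not make an unexpected file valid, so the returned text is still worth checking.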

@leonmhartman

Hi @jianshu93

Thanks for contacting the GTDB-Tk team.

Like you, the file update allowed me to continue the makedb process, but I have now encountered the same error that you posted :(

I'll have a closer look at it tomorrow. In the meantime I have run MAGpurify on my data. It's not my preferred option, but it runs and removes some discordant contigs from my test MAGs.

Cheers,
Leon

@jianshu93

I think it is because the website https://www.arb-silva.de/fileadmin/silva_databases/current is not working; it was under maintenance. Any idea?

Jianshu

@leonmhartman

Hi @jianshu93

Good pick-up! Wow, we are really having some bad luck!

The SILVA archive is back online now. However, restarting the makedb process still failed for me until I deleted the VERSION.txt file (it's actually an HTML file), which contained info about the status of the SILVA website.

After deleting the file, the makedb process was able to continue and it will be interesting to see how far I get this time.

Cheers,
Leon
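
A stale VERSION.txt like this could be detected before it is reused. The sketch below accepts the content only if all of it looks like a short version string. Both the regular expression and the helper name looks_like_version are assumptions, since the exact format of the SILVA version file is not documented in this thread.

```python
import re

# Assumed format: digits separated by "." or "_", e.g. "138.1".
VERSION_RE = re.compile(r"\d+(?:[._]\d+)*")


def looks_like_version(text):
    """Return True if text is a plausible version string rather than, say, an HTML page."""
    return bool(VERSION_RE.fullmatch(text.strip()))
```

If the check fails, deleting the file and downloading it again mirrors the manual workaround described above.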

@jianshu93

jianshu93 commented Oct 29, 2024

Let me know what you get. The website link is not available in a browser, though; no idea why.

Line 23: silva_server = "https://www.arb-silva.de/fileadmin/silva_databases/current"

Thanks,
Jianshu

@leonmhartman

It seems that accessing SILVA with a web-browser via that link is forbidden, but other actions are ok (for example, see below). My makedb process is also still running and no new errors have been reported.

$ wget  https://www.arb-silva.de/fileadmin/silva_databases/current/VERSION.txt

--2024-10-29 12:15:59--  https://www.arb-silva.de/fileadmin/silva_databases/current/VERSION.txt
Resolving www.arb-silva.de (www.arb-silva.de)... 194.94.219.5
Connecting to www.arb-silva.de (www.arb-silva.de)|194.94.219.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6 [text/plain]
Saving to: ‘VERSION.txt’

VERSION.txt   100%[==================================================>]   6  --.-KB/s    in 0s      

2024-10-29 12:16:01 (7.81 MB/s) - ‘VERSION.txt’ saved [6/6]

@jianshu93

Thank you so much. Let's wait and see. I started from scratch, so it will take a couple more hours. Additionally, see my pull request to use Unicode parsing of the version number, in case it contains some non-ASCII characters, e.g., version 2.1_1.

#58

Thanks,
Jianshu

@fra-vh

fra-vh commented Nov 26, 2024

@jianshu93 sorry to tag you, but did you manage to create the database successfully and run mdmcleaner with it?
