Missing part in matrix table? #3664
-
Note
The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.
(Dec 02, 2021 at 14:30) tpoterba said:
This might indicate a failure to copy the file over correctly. I think it was copied over from a file on Google Cloud. I can see that part file in Google:
$ gsutil ls gs://hail-datasets-us/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
gs://hail-datasets-us/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
cc danking any ideas?
-
(Dec 02, 2021 at 16:14) danking said:
Hey CreRecombinase! I’m really sorry you’re having trouble with the Hail datasets. It appears that, due to a not yet understood error, nine files failed to copy from GCS to S3. The nine files are:
part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
part-00319-7-319-0-9b13bfeb-a79d-03e9-0625-872135ff2cd5
part-00323-7-323-0-f10b20b8-e33e-c1ae-62f0-2b09276483a5
part-00330-7-330-0-9a26e1c3-d1a4-a42a-a05c-119d81320d57
part-00331-7-331-0-dde65b4b-744e-aecf-15e9-61930f92a7eb
part-00334-7-334-0-cf54a5dc-de73-5375-a518-033129181b9d
part-06648-7-6648-0-b254866b-09f8-9a9d-6d44-ff2810d20080
part-06659-7-6659-0-a046b619-4719-cedc-9741-69cc9c2473ea
part-06667-7-6667-0-4b850f2b-be02-4213-d47a-abf1636839d9
These are all entries data files. There are no missing row, column, or global data files. I have restored these nine files from GCS to S3. You can fix your copy of the dataset by executing this script, substituting in the name of your bucket:
YOUR_BUCKET=your-bucket
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620 \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00319-7-319-0-9b13bfeb-a79d-03e9-0625-872135ff2cd5 \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00319-7-319-0-9b13bfeb-a79d-03e9-0625-872135ff2cd5
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00323-7-323-0-f10b20b8-e33e-c1ae-62f0-2b09276483a5 \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00323-7-323-0-f10b20b8-e33e-c1ae-62f0-2b09276483a5
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00330-7-330-0-9a26e1c3-d1a4-a42a-a05c-119d81320d57 \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00330-7-330-0-9a26e1c3-d1a4-a42a-a05c-119d81320d57
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00331-7-331-0-dde65b4b-744e-aecf-15e9-61930f92a7eb \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00331-7-331-0-dde65b4b-744e-aecf-15e9-61930f92a7eb
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00334-7-334-0-cf54a5dc-de73-5375-a518-033129181b9d \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00334-7-334-0-cf54a5dc-de73-5375-a518-033129181b9d
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06648-7-6648-0-b254866b-09f8-9a9d-6d44-ff2810d20080 \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06648-7-6648-0-b254866b-09f8-9a9d-6d44-ff2810d20080
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06659-7-6659-0-a046b619-4719-cedc-9741-69cc9c2473ea \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06659-7-6659-0-a046b619-4719-cedc-9741-69cc9c2473ea
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06667-7-6667-0-4b850f2b-be02-4213-d47a-abf1636839d9 \
s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06667-7-6667-0-4b850f2b-be02-4213-d47a-abf1636839d9
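The nine copy commands differ only in the part file name, so they can also be generated from the list rather than written out by hand. The following is a minimal Python sketch of that idea. The names `SOURCE_PREFIX`, `MISSING_PARTS`, and `build_copy_commands` are illustrative and not part of Hail or the AWS CLI, and the sketch assumes your copy keeps the same key layout as the source bucket. It only builds the command strings; it does not run them.

```python
import shlex

# Source location of the entries part files, taken from the script above.
SOURCE_BUCKET = "hail-datasets-us-east-1"
PARTS_PATH = "1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts"

# The nine part files listed in this post.
MISSING_PARTS = [
    "part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620",
    "part-00319-7-319-0-9b13bfeb-a79d-03e9-0625-872135ff2cd5",
    "part-00323-7-323-0-f10b20b8-e33e-c1ae-62f0-2b09276483a5",
    "part-00330-7-330-0-9a26e1c3-d1a4-a42a-a05c-119d81320d57",
    "part-00331-7-331-0-dde65b4b-744e-aecf-15e9-61930f92a7eb",
    "part-00334-7-334-0-cf54a5dc-de73-5375-a518-033129181b9d",
    "part-06648-7-6648-0-b254866b-09f8-9a9d-6d44-ff2810d20080",
    "part-06659-7-6659-0-a046b619-4719-cedc-9741-69cc9c2473ea",
    "part-06667-7-6667-0-4b850f2b-be02-4213-d47a-abf1636839d9",
]


def build_copy_commands(your_bucket, parts=MISSING_PARTS):
    """Return one `aws s3 cp` command string per part file.

    Assumes the destination bucket uses the same key layout as the source.
    """
    commands = []
    for part in parts:
        source = f"s3://{SOURCE_BUCKET}/{PARTS_PATH}/{part}"
        destination = f"s3://{your_bucket}/{PARTS_PATH}/{part}"
        # shlex.quote guards against shell-special characters in the names.
        commands.append(f"aws s3 cp {shlex.quote(source)} {shlex.quote(destination)}")
    return commands
```

Printing each element of `build_copy_commands("your-bucket")` gives a script you can review before running it in a shell.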
-
(Dec 07, 2021 at 21:00) CreRecombinase said:
Looks like that worked, thanks for the help!
-
(Nov 10, 2022 at 22:17) PedroSebe said:
Hey there! This issue is happening to the CADD dataset as well. Here’s a sample from my code:
db = hl.experimental.DB(region='us', cloud='aws')
mt = db.annotate_rows_db(mt, "CADD")
And the error message:
Could you please restore these files on S3? Thanks!!
-
(Nov 11, 2022 at 13:16) tpoterba said:
Must have been an issue copying from the primary repository on Google Cloud; we’ll fix this.
-
(Nov 29, 2021 at 23:17) CreRecombinase said:
I have some data in a matrix table stored in s3 in the US-west region. I’d like to merge this data with the 1000 Genomes HighCov autosomes data. Anticipating (correctly) that this would not be a straightforward, one-time thing, I made a US-west copy of all the objects that make up the matrix table in s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt.
My attempt to merge these datasets looks more or less like this (I’m using version 0.2.72-cfce5e858cab)
What ends up happening is I will get
FileNotFoundException:
No such file or directory: s3a://my_bucket/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
If I do a
aws s3 ls --no-sign-request s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/
I can confirm that part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620 is not there. When I compare the metadata.json.gz's _partFile entries I do find the missing part. It appears that there are both s3 objects not listed in the metadata.json.gz as well as objects in the metadata.json.gz file that do not appear in s3.
I guess my question is: what’s going on here? Is this the right way to merge two cohorts from two matrix tables? Are there actually parts missing from the GRCh38 30x 1000 genomes matrix table?
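For reference, a comparison like the one described above can be sketched in Python. This is a minimal sketch under assumptions: it treats metadata.json.gz as a gzip-compressed JSON object that holds a list of part file names under some key, and it takes that key as a parameter because the exact name is not confirmed here. `load_part_names` and `compare_parts` are hypothetical helper names, not Hail functions, and the listing is assumed to be a plain list of object keys obtained separately (for example from `aws s3 ls`).

```python
import gzip
import json


def load_part_names(metadata_path, key):
    """Read part file names from a local gzip-compressed JSON metadata file.

    `key` is the name of the JSON field holding the list of part files;
    the caller supplies it because the field name is an assumption.
    """
    with gzip.open(metadata_path, "rt", encoding="utf-8") as handle:
        metadata = json.load(handle)
    return list(metadata[key])


def compare_parts(expected_parts, listed_keys):
    """Compare expected part names with object keys from a bucket listing.

    Returns (missing, unexpected): names expected but not listed, and
    names listed but not expected. Both are sorted lists.
    """
    expected = set(expected_parts)
    # Keep only the final path component; skip directory-like keys.
    present = {
        key.rsplit("/", 1)[-1] for key in listed_keys if not key.endswith("/")
    }
    missing = sorted(expected - present)
    unexpected = sorted(present - expected)
    return missing, unexpected
```

A non-empty `missing` list corresponds to the FileNotFoundException case above, while `unexpected` covers objects present in the bucket but absent from the metadata.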
The full backtrace is: