KeyError: exon in xpore dataprep #217

mumdark · 2024-05-02T14:54:25Z

Hi, thank you for developing such an excellent tool!

I encountered an error while running the dataprep function in the xpore software as follows:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_9791/dataprep.py in <module>
      2     for id in dict:
      3         tx_pos,tx_start=[],0
----> 4         for pair in dict[id]["exon"]:                    # line 242
      5             tx_end=pair[1]-pair[0]+tx_start
      6             tx_pos.append((tx_start,tx_end))

KeyError: 'exon'

I carefully checked line 242 of the dataprep.py and found that some transcripts in dict do not have the corresponding exon start and end points annotated, as shown in the screenshot for ENSMUST00000193812.

Subsequently, I tested the readAnnotation function, adding the following check code to inspect the attrDict, type, start, and end variables for ENSMUST00000193812:

The output is as follows:

attrDict: {'gene_id': 'ENSMUSG00000102693', 'gene_version': '1', 'transcript_id': 'ENSMUST00000193812', 'transcript_version': '1', 'exon_number': '1', 'gene_name': '4933401J01Rik', 'gene_source': 'havana', 'gene_biotype': 'TEC', 'havana_gene': 'OTTMUSG00000049935', 'havana_gene_version': '1', 'transcript_name': '4933401J01Rik-201', 'transcript_source': 'havana', 'transcript_biotype': 'TEC', 'havana_transcript': 'OTTMUST00000127109', 'havana_transcript_version': '1', 'exon_id': 'ENSMUSE00001343744', 'exon_version': '1', 'tag': 'basic', 'transcript_support_level': 'NA'} 
 type: exon 
 start: 3073253 
end: 3074322

attrDict: {'gene_id': 'ENSMUSG00000102693', 'gene_version': '1', 'transcript_id': 'ENSMUST00000193812', 'transcript_version': '1', 'gene_name': '4933401J01Rik', 'gene_source': 'havana', 'gene_biotype': 'TEC', 'havana_gene': 'OTTMUSG00000049935', 'havana_gene_version': '1', 'transcript_name': '4933401J01Rik-201', 'transcript_source': 'havana', 'transcript_biotype': 'TEC', 'havana_transcript': 'OTTMUST00000127109', 'havana_transcript_version': '1', 'tag': 'basic', 'transcript_support_level': 'NA'} 
 type: transcript 
 start: 3073253 
end: 3074322

This indicates that the exon line for ENSMUST00000193812 is above the transcript line, leading to this single-exon transcript, ENSMUST00000193812, not generating the expected information during the following code condition.

if tx_id not in dict:
    dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
    if type not in dict[tx_id]:
        if type == "transcript":
            dict[tx_id][type]=(start,end)
else:
    if type == "exon":
        if type not in dict[tx_id]:
            dict[tx_id][type]=[(start,end)]
        else:
            dict[tx_id][type].append((start,end))

Then, I added the following code to prevent this sequencing error:

if tx_id not in dict:
    dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
    if type not in dict[tx_id]:
        if type == "transcript":
            dict[tx_id][type]=(start,end)
        if type == "exon":                            # add
            dict[tx_id][type]=[(start,end)]      # add
else:
    if type == "exon":
        if type not in dict[tx_id]:
            dict[tx_id][type]=[(start,end)]
        else:
            dict[tx_id][type].append((start,end))

Although this resolved the issue, I encountered an error in the next loop due to not all genes having multiple exons.

if is_gff < 0:
    for id in dict:
        tx_pos,tx_start=[],0
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos
else:
    for id in dict:
        tx_pos,tx_start=[],0
        if dict[id]["strand"] == "-":
            dict[id]["exon"].sort(key=lambda tup: tup[0], reverse=True)
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos

#TypeError                                 Traceback (most recent call last)
#/tmp/ipykernel_13771/2116236259.py in <module>
#      3         tx_pos,tx_start=[],0
#      4         for pair in dict[id]["exon"]:
#----> 5             tx_end=pair[1]-pair[0]+tx_start
#      6             tx_pos.append((tx_start,tx_end))
#      7             tx_start=tx_end+1

TypeError: 'int' object is not subscriptable

For example, it does not produce an error for this key：

'ENSMUST00000187528': {'chr': '1',
  'g_id': 'ENSMUSG00000101714',
  'strand': '+',
  'exon': [(35806974, 35807035), (35810073, 35810462)]}

But it does produce an error for this keys with only one pair of exon loci:

{'ENSMUST00000193812': {'chr': '1',
  'g_id': 'ENSMUSG00000102693',
  'strand': '+',
  'exon': (3073253, 3074322)}

I am unsure if this is a bug or an error in the order of the GTF file.

Thanks!

The text was updated successfully, but these errors were encountered:

yuukiiwa · 2024-05-06T00:01:44Z

Hi @mumdark,

Can you send me your gtf file, please? I will see whether I can reproduce the problem.

Thanks!

Best wishes,
Yuk Kei

mumdark · 2024-05-06T01:23:32Z

Dear @yuukiiwa,

Thanks! I've uploaded the first 100 lines of the GTF file I’m using.

Best regard,
mumdark

example_line100.txt

mumdark · 2024-05-07T01:26:17Z

I modified the if statement structure in line 238, removing transcripts that don't have the exon key. Now, the code works properly. The changes are as follows:

#convert genomic positions to tx positions（raw）
#if is_gff < 0:
#    for id in dict:
#        tx_pos,tx_start=[],0
#        for pair in dict[id]["exon"]:
#            tx_end=pair[1]-pair[0]+tx_start
#            tx_pos.append((tx_start,tx_end))
#            tx_start=tx_end+1
#        dict[id]['tx_exon']=tx_pos

# The code modification above is as follows, removing transcripts that don't contain the exon key:
if is_gff < 0:
    id_names=list(dict.keys())
    for id in id_names:
        tx_pos,tx_start=[],0
        if 'exon' not in dict[id].keys():
            removed_value=dict.pop(id, None)
            print("Remove: ", id, "-", removed_value)
            continue
        for pair in dict[id]["exon"]:
            tx_end=pair[1]-pair[0]+tx_start
            tx_pos.append((tx_start,tx_end))
            tx_start=tx_end+1
        dict[id]['tx_exon']=tx_pos

However, is it correct to remove the transcripts that lack the exon key? Or should they be retained, with the transcript key renamed to exon? What is the subsequent role of the extracted exon locus?

Additionally, The set p-value is 0.5, but the diffmod.table only has around 300 rows. Is this normal?

Thanks!

Best regard,
mumdark

yuukiiwa · 2024-06-04T20:35:45Z

Hi @mumdark,

Sorry for the delayed reply!

If the flags --genome --gtf_or_gff --transcript_fasta are giving you problems, you can avoid passing those flags to xpore dataprep, which should give you transcriptome cordinates.

You can always remove the following from the xpore diffmod config.yml file to get more sites:

method:
    prefiltering:
        method: t-test
        threshold: <P_VALUE_THRESHOLD>

Thanks!

Best wishes,
Yuk Kei

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KeyError: exon in xpore dataprep #217

KeyError: exon in xpore dataprep #217

mumdark commented May 2, 2024 •

edited

Loading

yuukiiwa commented May 6, 2024

mumdark commented May 6, 2024

mumdark commented May 7, 2024 •

edited

Loading

yuukiiwa commented Jun 4, 2024

KeyError: exon in xpore dataprep #217

KeyError: exon in xpore dataprep #217

Comments

mumdark commented May 2, 2024 • edited Loading

yuukiiwa commented May 6, 2024

mumdark commented May 6, 2024

mumdark commented May 7, 2024 • edited Loading

yuukiiwa commented Jun 4, 2024

mumdark commented May 2, 2024 •

edited

Loading

mumdark commented May 7, 2024 •

edited

Loading