Evaluation and Dataset

jerrygao edited this page Jul 26, 2017 · 14 revisions

GENIA Corpus

GENIA Corpus is a popular corpus that has been used to evaluate various ATE algorithms over the last decade. In JATE2, instead of using the annotation file "GENIAcorpus302.xml", the file 'concept.txt', which contains a breakdown list of GENIA concepts and relations (more like an ontology), is used as the "Gold Standard" (GS) list. However, various issues need to be taken into consideration for term matching, as listed below.

  • To match the GS terms, the term should be stemmed, e.g., "LPS-stimulated CMV transfected cell" (the surface form was "LPS-stimulated CMV transfected cells")

  • Remove GS terms containing "*" (these seem to be only the hypernym or descriptor of a term), e.g., "LPS-stimulated", "LT-beta" (there was no '*' in the raw text)

    "*" in concept.txt provides a pattern of terms indicating the head words or tail words of a concept/entity/term. GS concepts in predicate form, starting with "(NOT ONLY BUT ALSO", "(OR", "(TO", "(THAN", "(VERSUS", "(AND", "(BUT NOT", etc., also need to be removed (e.g., "(OR (AND GATA-1 gene product GATA-2 gene product) (AND GATA-1 gene product GATA-3 gene product))"). Possibly all concepts containing "(" and ")" should be removed for the evaluation.

  • Need to do stemming or lemmatisation

    It seems that 'concept.txt' contains various inflectional forms of concepts that may or may not appear in the raw text. For example, "primary myelodysplastic syndrome" cannot be found in the raw corpus and only appears in the form "primary myelodysplastic syndromes". Another example is "primary mouse T cell".

    Stemming also needs to be applied to all the term units. For example, the surface form of "il 2 treated cell il 7 treated cel" is "IL-2-treated cells" in the XML file and "IL-2-treated cells" in "GENIAcorpus302.xml.txt-all_in_one.txt". The surface form "amino acids 1-421" is normalised to "amino acid 1-421" in "concept.txt", along with many other terms starting with "amino acids". Stemming should also take into account surface forms such as "monocytes/macrophages" and "highly purified monocytes/macrophages", which are not normally segmented by a word tokeniser, so their term units cannot be normalised. This causes mismatches with the gold standard.

    Our experiment shows that applying lemmatization can produce better results for both ATR and evaluation.

  • Punctuation needs to be removed

    e.g., "signal transduction event" does not appear in the raw corpus.

    Tokenisation may cause mismatches. For instance, "'lysis genes'" in MEDLINE:95163578 is tokenised into ["'lysis", 'genes', "'"] by NLTK's recommended word tokenizer. In such cases, a punctuation filter or pre-processed data should be used.

  • Abbreviations are not stemmed directly (e.g., 'VSMCs')

    In this case, SnowballStemmer is not able to remove the 's' from 'VSMCs', while only "VSMC" is listed in the reference list. This causes a mismatch when computing recall. PorterStemmer seems to work well in this situation, although it may cause overstemming in other cases.

  • Infrequent terms

    A few GENIA terms appear only once, e.g., "TPA response elements", "TPA responsiveness".

    Note that infrequent terms can easily be filtered out with a minimum frequency threshold.

  • Different term forms (i.e., term variants, particularly coordination variants)

    e.g., "Xi translocations" appears in "Autosome/Xi translocations" in the raw text.

    A few examples of coordination variants in the GENIA corpus (more than 30 examples can be found in the corpus):

     1) "red cell antibody" appears in "sensitized target cells has been used to detect red cell, platelet and granulocyte antibodies." in doc 'MEDLINE:93121695'
    
     2) raw text "viral gene expression and replication" is associated with the term "viral gene replication".
    
     3) "CD45RA+ (naive) and CD45RO+ (memory) T lymphocytes" is associated with "CD45RA+ (naive) T lymphocyte" and "CD45RO+ (memory) T lymphocyte".
    
     4) "C-terminal 13 or 141 aa" is associated with "C-terminal 13 aa" and "C-terminal 141 aa"
    
     ...
    

    Examples of inflectional forms:

    e.g., 'virus' vs 'viruses', 'woman' vs 'women'

    PorterStemmer converts 'virus' to 'viru' and 'viruses' to 'virus', which may cause mismatches in evaluation. For instance, the surface form 'mammalian oncogenic viruses' is associated with the concept 'mammalian oncogenic virus'. "healthy women" is associated with two concepts, "healthy woman" and "healthy women". "X-linked spinal and bulbar muscular atrophy" is associated with "X-linked spinal atrophy" and "bulbar muscular atrophy".

  • Inclusion relationship between two terms

    For instance, "src homology (SH)2 domain" is a substring of the concept "src homology (SH)2 domain-containing intracellular molecules", joined to it as a hyphenated compound, which is difficult for a generic/domain-independent term extraction method to pick up. Likewise, "12-O-tetradecanoyl 12-phorbol 13-acetate" is a substring of "12-O-tetradecanoyl 12-phorbol 13-acetate-stimulated T cells", again joined via a hyphen.

  • Possible annotation errors and synonym replacement causing mismatches

    For instance, "conventional responsivenessp" does not exist in the raw text and the likely intended term is "conventional responsiveness". Also, "activationp" does not exist and the likely intended term is 'activation'.

    Examples of synonym replacement (which needs to be normalised in evaluation) are:

    1. 'mouse' vs 'mice', 'analysis' vs. 'analyses'

    2. 'PU.1- /- mouse' -> "PU.1- /- mice"

    3. 'TCR-beta/TCR-delta double knockout mouse' -> 'TCR-beta/TCR-delta double knockout mice'

    4. 'EMS analysis' -> 'EMS analyses'

  • Domain-specific annotation rule

    e.g., both "human proximal sequence element" and "human proximal sequence element-binding transcription factor" are annotated as terms from the raw text "human proximal sequence element-binding transcription factor"

  • False positive sentence segmentation

    If you are pre-processing the corpus from raw text instead of extracting sentences from the provided annotated XML file, potential false positive sentence segmentation should be taken into consideration. False positive sentence splits may prevent some irregular term candidates from being extracted, e.g., the term "ca. bp -850" in doc 'MEDLINE:96323085' will be split across two sentences. The common pre-processing practice (sentence splitting -> tokenisation) won't work well here. Another example is "Sch. pombe".

  • The n-gram range should take punctuation into account

    For example, "nuclear factor of activated T cells (NFAT) group" is an annotated term with an n-gram length of 10, although the punctuation will be normalised away for evaluation (normalised form: 'nuclear factor activ cell nfat group'). "Jak/Stat (signal transducer and activator of transcription) pathway" is also a 10-gram, even though it can be normalised to "jak/stat signal transduc activ transcript pathway".

  • Terms with punctuation

    Some terms with punctuation (e.g., "Ph'" in doc MEDLINE:96226351, "5'-TGCGTCA-3'" in doc MEDLINE:97364688, "5'-GTTAAGGTTCGTAGGTCATGGA-3'" in doc MEDLINE:95197545) can be neglected or normalised away by preprocessing (tokenisation). Common normalisation practice will skew the results. For example, "Ph'" and "pH" are different terms and concepts. A simple solution for proper evaluation might be to tokenise the GS terms and concatenate the term units in the same way as when pre-processing the raw corpus. For the example above, tokenisation can remove the last punctuation mark, e.g., "5'-TGCGTCA-3'" -> "5'-TGCGTCA-3"

  • Unmatched concepts in raw text

    Some concepts still cannot be found if you are matching surface forms of terms in the corpus. To name a few: "transcription-independent surface transcription" and "ectopic endrometrium" are not found. The latter may be a spelling error in the annotation, as there is no word 'endrometrium'; the surface form of the concept should be "ectopic endometrium". "primary hyperthyroidism" is not found in the corpus; it is related to "primary hypothyroidism" and "hyper- and hypothyroidism". "bonep" is not a surface form of any term in the corpus; it is related to "bone".

  • Many concepts are very long in GENIA

    For example, "human T cell leukemia virus type I-transformed human T cell line" is an 11-gram term. "human T-cell leukemia virus type 1 (HTLV-1) transcriptional trans-activator" and "human T cell leukemia virus type I (HTLV-I) trans-activator (tax1) antigen" are 15-gram terms.
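Several of the matching issues listed above (GS entries with "*" or predicate forms, punctuation, inflection of every term unit, abbreviations such as 'VSMCs') can be sketched together in a few lines of Python. This is a minimal illustration and not JATE2's implementation: the function names `keep_gs_term`, `normalise`, `toy_stem` and `recall` are hypothetical, and `toy_stem` is a deliberately crude plural stripper standing in for a real stemmer or lemmatiser.

```python
import re

# Predicate-style prefixes quoted in the list above (the list there ends with "etc.").
PREDICATE_PREFIXES = ("(NOT ONLY BUT ALSO ", "(OR ", "(TO ", "(THAN ",
                      "(VERSUS ", "(AND ", "(BUT NOT ")

def keep_gs_term(term):
    """Return False for GS entries that should be excluded from evaluation."""
    term = term.strip()
    if not term or "*" in term:
        return False
    return not term.startswith(PREDICATE_PREFIXES)

def toy_stem(token):
    """Crude plural stripper, for illustration only (assumption, not a real stemmer)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us")):
        return token[:-1]
    return token

def normalise(term, stem=toy_stem):
    """Lowercase, drop punctuation, and stem every term unit."""
    units = re.findall(r"\w+", term.lower())
    return " ".join(stem(unit) for unit in units)

def recall(candidates, gold_terms, stem=toy_stem):
    """Share of retained GS terms matched by at least one normalised candidate."""
    gold = {normalise(t, stem) for t in gold_terms if keep_gs_term(t)}
    gold.discard("")
    found = {normalise(c, stem) for c in candidates}
    return len(gold & found) / len(gold) if gold else 0.0

gold_terms = [
    "LPS-stimulated CMV transfected cell",
    "VSMC",
    "*LPS-stimulated",
    "(OR (AND GATA-1 gene product GATA-2 gene product) "
    "(AND GATA-1 gene product GATA-3 gene product))",
    "primary myelodysplastic syndrome",
]
candidates = ["LPS-stimulated CMV transfected cells", "VSMCs", "T cells"]
print(recall(candidates, gold_terms))
```

Because `normalise` splits on every non-word character, "monocytes/macrophages" is normalised unit by unit. The same rule also collapses "Ph'" and "pH" to the same string, which is exactly the kind of skew described in the "Terms with punctuation" item, so a stricter rule may be needed for such terms.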

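The n-gram lengths quoted in the list above can be reproduced with a small token counter, which is useful for choosing the upper bound of the candidate n-gram range. The tokenisation rule here is an assumption picked to match those counts (whitespace-separated tokens, with brackets split off as separate tokens and hyphens and slashes kept inside a token); the tokeniser actually used in a JATE2 pipeline may count differently. The names `ngram_length` and `max_ngram_length` are hypothetical.

```python
import re

# Assumption: brackets are tokens of their own; everything else splits on whitespace.
_TOKEN = re.compile(r"[()\[\]]|[^\s()\[\]]+")

def ngram_length(term):
    """Number of tokens in a term, counting brackets as tokens."""
    return len(_TOKEN.findall(term))

def max_ngram_length(terms):
    """Largest n needed to cover every term in the given list."""
    return max((ngram_length(t) for t in terms), default=0)

terms = [
    "nuclear factor of activated T cells (NFAT) group",
    "Jak/Stat (signal transducer and activator of transcription) pathway",
    "human T cell leukemia virus type I-transformed human T cell line",
]
for term in terms:
    print(ngram_length(term), term)
print("max n:", max_ngram_length(terms))
```
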
Chinese/Multilingual Reference Corpus

JATE2 can be applied to multiple languages with language-specific settings. For contrastive measures of domain specificity (e.g., Weirdness), you can use a Chinese generic reference corpus to replace the default BNC corpus compiled by Adam Kilgarriff (essentially a frequency distribution of tokens). We list the following options:

  • Mandarin Chinese News Text

This reference corpus is available via the Linguistic Data Consortium at a cost of $500 for non-members, accessible via https://catalog.ldc.upenn.edu/LDC95T13. The corpus consists of news texts from Renmin Ribao (People's Daily), China Radio International and the Xinhua newswire service.

size: 250 million GB-encoded text characters.

  • The Lancaster Corpus of Mandarin Chinese

Available via University of Oxford Text Archive (http://ota.ox.ac.uk/desc/2474)

size: 30 files : ca. 42.8 MB

  • A collection of Chinese corpora and frequency lists for Leeds

This collection of three corpora was compiled by Serge Sharoff and Richard Xiao. The frequency lists can be used directly in JATE2 and are available via http://corpus.leeds.ac.uk/query-zh.html

  1. Chinese Internet Corpus, 280 million words (tokens).

  2. The Lancaster Corpus of Mandarin Chinese,

  3. Chinese Business Corpus, 30 million words (tokens).

For more multilingual reference corpora, see http://corpus.leeds.ac.uk/list.html.
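To show why the reference frequency distribution matters, here is a minimal sketch of a Weirdness-style score: the ratio of a word's relative frequency in the target corpus to its relative frequency in the generic reference corpus. The add-one smoothing for words missing from the reference list, the function name `weirdness`, and the toy counts are all assumptions made for illustration; JATE2's own implementation and its expected frequency-file format may differ.

```python
def weirdness(target_freq, reference_freq):
    """Score each target word by relative frequency in target vs. reference.

    Both arguments map token -> raw count. Add-one smoothing (an assumption)
    keeps words unseen in the reference corpus from dividing by zero.
    """
    n_target = sum(target_freq.values())
    n_reference = sum(reference_freq.values())
    scores = {}
    for word, count in target_freq.items():
        rel_target = count / n_target
        rel_reference = (reference_freq.get(word, 0) + 1) / (n_reference + 1)
        scores[word] = rel_target / rel_reference
    return scores

# Toy counts: a domain word that is rare in the generic corpus scores high,
# while a function word that is common everywhere scores close to 1.
target = {"lymphocyte": 30, "the": 70}
reference = {"lymphocyte": 1, "the": 997}
for word, score in sorted(weirdness(target, reference).items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.2f}")
```

The same computation applies to any tokenised language, so swapping the BNC for one of the Chinese frequency lists above only changes the `reference_freq` input.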

Biomedical corpus for abbreviation and acronym detection

The Medstract text collection is one of the most popular corpora used for benchmarking acronym detection and extraction.

More resources can be downloaded via http://bioc.sourceforge.net/BioCresources.html

Wren, J. D., Chang, J. T., Pustejovsky, J., Adar, E., Garner, H. R., & Altman, R. B. (2005). Biomedical term mapping databases. Nucleic acids research, 33(suppl_1), D289-D293. Access via https://www.ncbi.nlm.nih.gov/pubmed/15608198

TO_BE_CONTINUED