Term candidate extraction

jerrygao edited this page Sep 23, 2017 · 4 revisions

What is a candidate term?

Some assumptions are needed before multi-word terms can be identified. Although no single definition holds across all domains, the following commonly accepted assumptions [HuiZhong 1986] can serve as a starting point:

  1. "Multi-word terms are mainly nominals";
  2. "Multi-word terms cannot go across punctuation marks";
  3. "Verbs may be terms by themselves but not part of a multi-word term because of 1.";
  4. "Function words should be excluded with the exception of prepositions, because prepositions may be part of multi-word terms (e.g. 'speed OF propagation', 'revolution PER minute', etc.)";
  5. "Adverbs may be part of a multi-word term (e.g. 'VERY high frequency', 'POSITIVELY charged ions', etc.), but adverbs for text cohesion (e.g. 'subsequently', 'naturally', 'usually', etc.) should be excluded";
  6. "No multi-word terms can end up with an adjective or adverb";
  7. "S-endings should be removed for the purpose of frequency counting";
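
The assumptions above can be read as a filter over part-of-speech tagged tokens. The following is a minimal Python sketch of that reading, not JATE's implementation: the tag set (N, ADJ, ADV, PREP, V, FUNC, PUNCT), the function names, and the cohesion-adverb list are all hypothetical and chosen only for illustration. It also makes two simplifying assumptions that the list does not state: a candidate must end in a noun and may not start with a preposition, and only the longest run between boundaries is returned.

```python
# Hypothetical, simplified tag set (not JATE's): N noun, ADJ adjective,
# ADV adverb, PREP preposition, V verb, FUNC other function word, PUNCT.
ALLOWED_INSIDE = {"N", "ADJ", "ADV", "PREP"}               # rules 1, 3, 4, 5
COHESION_ADVERBS = {"subsequently", "naturally", "usually"}  # rule 5 (sample list)


def strip_s(word):
    """Rule 7: naive removal of a plural s-ending for frequency counting."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def _flush(run, out):
    """Trim a run of allowed tokens and record it if it is still multi-word."""
    # Rule 6 (plus the assumption that a term ends in a noun):
    # drop trailing adjectives, adverbs and prepositions.
    while run and run[-1][1] != "N":
        run.pop()
    # Assumption: a term does not start with a preposition.
    while run and run[0][1] == "PREP":
        run.pop(0)
    if len(run) >= 2:
        words = [word for word, _ in run]
        words[-1] = strip_s(words[-1])
        out.append(" ".join(words))


def candidate_terms(tagged):
    """Return multi-word candidates from a list of (token, tag) pairs."""
    out, run = [], []
    # The sentinel PUNCT token flushes the final run.
    for token, tag in list(tagged) + [("", "PUNCT")]:
        word = token.lower()
        allowed = tag in ALLOWED_INSIDE and not (
            tag == "ADV" and word in COHESION_ADVERBS
        )
        if allowed:
            run.append((word, tag))
        else:
            # Rule 2 (punctuation), rule 3 (verbs) and rule 4 (function
            # words) all act as boundaries between candidates.
            _flush(run, out)
            run = []
    return out


SAMPLE = [
    ("The", "FUNC"), ("speed", "N"), ("of", "PREP"), ("propagation", "N"),
    ("usually", "ADV"), ("depends", "V"), ("on", "PREP"), ("very", "ADV"),
    ("high", "ADJ"), ("frequency", "N"), ("signals", "N"), (".", "PUNCT"),
]

print(candidate_terms(SAMPLE))
# ['speed of propagation', 'very high frequency signal']
```

A real extractor would typically also enumerate shorter sub-sequences of each run and rely on a trained tagger rather than hand-assigned tags; the sketch keeps only the longest run so that each rule stays visible.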

How to configure JATE with Apache Solr

  • input reader

  • term candidate extraction, and pre-filters

<TO_BE_CONTINUED>

Reference

[HuiZhong 1986] Y. Huizhong, “A New Technique for Identifying Scientific / Technical Terms and Describing Science Texts,” 1986.