This repository contains the files and information for step 2 of the Kaphta Architecture: Information Extraction. In this step, the PubMed abstracts classified as positive in the previous step (Text Classification) were used as the source for extraction. Information was extracted from abstract sentences that contain associations between recognized entities. For more information about this and the other steps of the Kaphta Architecture, see the sections of the Kaphta Web Tool, available at https://portal.ifsuldeminas.edu.br/kaphtawebtool/.
The following files are used in the NER (named entity recognition) and AR (association recognition) tasks, together with their respective results:
- ner-pubmed-abstracts-gh.R: R script for named entity recognition (NER) in the PubMed abstracts classified as positive in the previous step (Text Classification), using the PubTator API.
- functions.R: R script with auxiliary functions. Save this file in the same folder as the ner-pubmed-abstracts-gh.R and association-recognition-pubmed-abstracts-gh.R scripts, because both scripts need it to run.
- db_total_project.db: SQLite database needed to run all R scripts of the Kaphta Architecture steps. It contains tables with the entity dictionary, the total PubMed abstracts textual corpus, and the PubMed abstracts classified as positive in text classification. Save this file in the same folder as the ner-pubmed-abstracts-gh.R script, because the script needs it to run.
- association-recognition-pubmed-abstracts-gh.R: R script for association recognition (AR) in the PubMed abstracts classified as positive in the previous step (Text Classification), using regular expressions from the rules dictionary (see the sequential-pattern-mining-in-pubmed-abstracts-sentences repository). To run this script, first download the entities-associations-sentences-recognized and entities-recognized folders.
- entities-recognized: folder with the files resulting from the NER task, containing the named entities (polyphenols, cancers, and genes) recognized in the PubMed abstracts classified as positive in the previous step (Text Classification). Save this folder and its files in the same folder as the association-recognition-pubmed-abstracts-gh.R script, because the script needs it for the association recognition task.
- entities-associations-sentences-recognized: folder with the files resulting from the NER task, containing the sentences in which associations between entities (polyphenols, cancers, and genes) were recognized in the PubMed abstracts classified as positive in the previous step (Text Classification). Save this folder and its files in the same folder as the association-recognition-pubmed-abstracts-gh.R script, because the script needs it for the association recognition task.
- ner-frequency: folder with files listing the frequency of the polyphenol, cancer, and/or gene entities recognized in the PubMed abstracts classified as positive in the previous step (Text Classification).
- Rule_associations_recognized.rar: compressed archive resulting from the AR task, containing the PubMed abstract sentences in which at least one rule from the rules dictionary was recognized.
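
Because several of the items above must sit in the same folder as the scripts, it can help to check the layout before running anything. The following is only a minimal sketch in base R: the helper name `missing_inputs` is hypothetical and not part of this repository, and it assumes that both scripts, functions.R, the database, and the two input folders are all kept together in one folder.

```r
# Hypothetical helper (not part of this repository): returns the names of the
# files and folders listed above that are missing from `dir`.
# Assumption: everything is stored together in a single folder.
missing_inputs <- function(dir = ".") {
  required_files <- c(
    "functions.R",
    "db_total_project.db",
    "ner-pubmed-abstracts-gh.R",
    "association-recognition-pubmed-abstracts-gh.R"
  )
  required_dirs <- c(
    "entities-recognized",
    "entities-associations-sentences-recognized"
  )

  # file.exists() and dir.exists() are vectorised base R functions
  files_ok <- file.exists(file.path(dir, required_files))
  dirs_ok <- dir.exists(file.path(dir, required_dirs))

  c(required_files[!files_ok], required_dirs[!dirs_ok])
}

# Check the current working directory and report the outcome
missing <- missing_inputs(".")
if (length(missing) > 0) {
  message("Missing from the working directory: ",
          paste(missing, collapse = ", "))
} else {
  message("All required files and folders are present.")
}
```

Run it from the folder that holds the scripts (for example after `setwd()`); any names it prints still need to be downloaded or moved into place.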
The table below presents the results of the Association Recognition task, broken down by category, rule, and sentence type (PC, PG, and P).