✨ Add GLiNER #78

Merged · 6 commits · Aug 15, 2024
13 changes: 7 additions & 6 deletions README.md
@@ -47,9 +47,9 @@ Can be used to perform:

### Optional Dependencies

* <a href="https://github.com/flairNLP/flair" target="_blank"><code>flair</code></a> - Required if you want to use Flair mentions extractor and for TARS linker.
* <a href="https://github.com/flairNLP/flair" target="_blank"><code>flair</code></a> - Required if you want to use the Flair mentions extractor, the TARS linker, or the TARS mentions extractor.
* <a href="https://github.com/facebookresearch/BLINK" target="_blank"><code>blink</code></a> - Required if you want to use Blink for linking to Wikipedia pages.

* <a href="https://github.com/urchade/GLiNER" target="_blank"><code>gliner</code></a> - Required if you want to use GLiNER Linker or GLiNER Mentions Extractor.

## Installation

@@ -81,7 +81,7 @@ ZShot contains two different components, the **mentions extractor** and the **linker**.
### Mentions Extractor
The **mentions extractor** will detect the possible entities (a.k.a. mentions), which will then be linked to a data source (e.g. Wikidata) by the **linker**.

Currently, there are 6 different **mentions extractors** supported, SMXM, TARS, 2 based on *SpaCy*, and 2 that are based on *Flair*. The two different versions for *SpaCy* and *Flair* are similar, one is based on Named Entity Recognition and Classification (NERC) and the other one is based on the linguistics (i.e.: using Part Of the Speech tagging (PoS) and Dependency Parsing(DP)).
Currently, there are 7 different **mentions extractors** supported: SMXM, TARS, GLiNER, 2 based on *SpaCy*, and 2 based on *Flair*. The *SpaCy* and *Flair* versions come in two similar variants each: one based on Named Entity Recognition and Classification (NERC) and the other based on linguistics (i.e. Part-of-Speech tagging (PoS) and Dependency Parsing (DP)).

The NERC approach will use NERC models to detect all the entities that have to be linked. This approach depends on the model that is being used and the entities the model has been trained on, so depending on the use case and the target entities it may not be the best approach, as the entities may not be recognized by the NERC model and thus won't be linked.

@@ -90,14 +90,15 @@ The linguistic approach relies on the idea that mentions will usually be a synta
### Linker
The **linker** will link the detected entities to an existing set of labels. Some of the **linkers**, however, are *end-to-end*, i.e. they don't need the **mentions extractor**, as they detect and link the entities at the same time.

Again, there are 4 **linkers** available currently, 2 of them are *end-to-end* and 2 are not. Let's start with those that are not *end-to-end*:
Again, there are 5 **linkers** available currently, 3 of them are *end-to-end* and 2 are not.

| Linker Name | end-to-end | Source Code | Paper |
|:-----------:|:----------:|----------------------------------------------------------|--------------------------------------------------------------------|
| Blink | X | [Source Code](https://github.com/facebookresearch/BLINK) | [Paper](https://arxiv.org/pdf/1911.03814.pdf) |
| GENRE | X | [Source Code](https://github.com/facebookresearch/GENRE) | [Paper](https://arxiv.org/pdf/2010.00904.pdf) |
| SMXM | &check; | [Source Code](https://github.com/Raldir/Zero-shot-NERC) | [Paper](https://aclanthology.org/2021.acl-long.120/) |
| TARS | &check; | [Source Code](https://github.com/flairNLP/flair) | [Paper](https://kishaloyhalder.github.io/pdfs/tars_coling2020.pdf) |
| SMXM | &check; | [Source Code](https://github.com/Raldir/Zero-shot-NERC) | [Paper](https://aclanthology.org/2021.acl-long.120/) |
| TARS | &check; | [Source Code](https://github.com/flairNLP/flair) | [Paper](https://kishaloyhalder.github.io/pdfs/tars_coling2020.pdf) |
| GLiNER | &check; | [Source Code](https://github.com/urchade/GLiNER) | [Paper](https://arxiv.org/abs/2311.08526) |

### Relations Extractor
The **relations extractor** will extract relations among different entities *previously* extracted by a **linker**.
3 changes: 1 addition & 2 deletions docs/entity_linking.md
@@ -2,7 +2,6 @@

The **linker** will link the detected entities to an existing set of labels. Some of the **linkers**, however, are *end-to-end*, i.e. they don't need the **mentions extractor**, as they detect and link the entities at the same time.

There are 4 **linkers** available currently, 2 of them are *end-to-end* and 2 are not. Let's start with those that are not *end-to-end*.

There are 5 **linkers** available currently, 3 of them are *end-to-end* and 2 are not.

::: zshot.Linker
11 changes: 11 additions & 0 deletions docs/gliner_linker.md
@@ -0,0 +1,11 @@
# GLiNER Linker

GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are costly and too large for resource-constrained scenarios.

The GLiNER **linker** will use the **entities** specified in the `zshot.PipelineConfig`. It uses only the names of the entities, not their descriptions.


- [Paper](https://arxiv.org/abs/2311.08526)
- [Original Source Code](https://github.com/urchade/GLiNER)

::: zshot.linker.LinkerGLINER
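As a rough sketch of what happens at prediction time: the linker calls GLiNER once per sentence and wraps each returned dict into a zshot `Span`. The `Span` class below is an illustrative stand-in for `zshot.utils.data_models.Span`, and the sample dicts assume GLiNER's usual `start`/`end`/`text`/`label`/`score` output keys:

```python
from dataclasses import dataclass


@dataclass
class Span:
    # minimal stand-in for zshot.utils.data_models.Span
    label: str
    start: int
    end: int
    score: float = 1.0

    @classmethod
    def from_dict(cls, d: dict) -> "Span":
        # keep only the fields the span representation needs
        return cls(label=d["label"], start=d["start"], end=d["end"],
                   score=d.get("score", 1.0))


# stand-in for model.predict_entities(sentence, labels, threshold=0.5)
raw_entities = [
    {"start": 0, "end": 5, "text": "Paris", "label": "city", "score": 0.91},
    {"start": 20, "end": 26, "text": "France", "label": "country", "score": 0.88},
]
spans = [Span.from_dict(ent) for ent in raw_entities]
```

The real linker collects one such list per document and returns them from `predict`.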
11 changes: 11 additions & 0 deletions docs/gliner_mentions_extractor.md
@@ -0,0 +1,11 @@
# GLiNER Mentions Extractor

GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are costly and too large for resource-constrained scenarios.

The GLiNER **mentions extractor** will use the **mentions** specified in the `zshot.PipelineConfig`. It uses only the names of the mentions, not their descriptions.


- [Paper](https://arxiv.org/abs/2311.08526)
- [Original Source Code](https://github.com/urchade/GLiNER)

::: zshot.mentions_extractor.MentionsExtractorGLINER
5 changes: 4 additions & 1 deletion docs/mentions_extractor.md
@@ -1,7 +1,7 @@
# MentionsExtractor
The **mentions extractor** will detect the possible entities (a.k.a. mentions), which will then be linked to a data source (e.g. Wikidata) by the **linker**.

Currently, there are 6 different **mentions extractors** supported, 2 of them are based on *SpaCy*, 2 of them are based on *Flair*, TARS and SMXM. The two different versions for *SpaCy* and *Flair* are similar, one is based on NERC and the other one is based on the linguistics (i.e.: using PoS and DP). The TARS and SMXM models can be used when the user wants to specify the mentions wanted to be extracted.
Currently, there are 7 different **mentions extractors** supported: 2 based on *SpaCy*, 2 based on *Flair*, plus TARS, SMXM and GLiNER. The *SpaCy* and *Flair* versions come in two similar variants each: one based on NERC and the other based on linguistics (i.e. PoS and DP). The TARS and SMXM models can be used when the user wants to specify the mentions to be extracted.

The NERC approach will use NERC models to detect all the entities that have to be linked. This approach depends on the model that is being used and the entities the model has been trained on, so depending on the use case and the target entities it may not be the best approach, as the entities may not be recognized by the NERC model and thus won't be linked.

@@ -10,4 +10,7 @@ The linguistic approach relies on the idea that mentions will usually be a synta
The SMXM model uses the description of the mentions to give the model information about them.

TARS model will use the labels of the mentions to detect them.

The GLiNER model will use the labels of the mentions to detect them.

::: zshot.MentionsExtractor
2 changes: 1 addition & 1 deletion docs/tars_mentions_extractor.md
@@ -7,4 +7,4 @@ The TARS **mentions extractor** will use the **mentions** specified in the `zshot.PipelineConfig`
- [Paper](https://kishaloyhalder.github.io/pdfs/tars_coling2020.pdf)
- [Original Source Code](https://github.com/flairNLP/flair)

::: zshot.linker.LinkerTARS
::: zshot.mentions_extractor.MentionsExtractorTARS
1 change: 1 addition & 0 deletions requirements/test.txt
@@ -3,6 +3,7 @@ pytest-cov>=3.0.0
setuptools>=65.5.1
scipy<1.13.0
flair>=0.13
gliner>=0.2.9
flake8>=4.0.1
coverage>=6.4.1
pydantic==1.9.2
8 changes: 8 additions & 0 deletions zshot/evaluation/dataset/med_mentions/entities.py
@@ -10,6 +10,14 @@
"NEG": "NEG"
}

MEDMENTIONS_EXPLANATORY_MAPPING = {
k: v.replace("_", " ") for k, v in MEDMENTIONS_TYPE_INV.items()
}

MEDMENTIONS_EXPLANATORY_INVERSE_MAPPING = {
v: k for k, v in MEDMENTIONS_EXPLANATORY_MAPPING.items()
}

MEDMENTIONS_SPLITS = {
"train": ['Biologic_Function', 'Chemical', 'Health_Care_Activity', 'Anotomical_Structure', "Finding",
"Spatial_Concept", "Intellectual_Product", "Research_Activity", 'Medical_Device', 'Eukaryote',
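The two derived dictionaries are mechanical: the explanatory mapping replaces underscores with spaces to get readable labels, and the inverse mapping goes back. A minimal sketch, using a made-up subset in place of the real `MEDMENTIONS_TYPE_INV` keys:

```python
# illustrative subset; the real mapping comes from MEDMENTIONS_TYPE_INV
type_inv = {
    "T058": "Health_Care_Activity",   # keys here are stand-ins
    "T017": "Anatomical_Structure",
    "NEG": "NEG",
}

# readable labels: underscores become spaces
explanatory = {k: v.replace("_", " ") for k, v in type_inv.items()}

# inverse mapping: from readable label back to the original key
inverse = {v: k for k, v in explanatory.items()}
```

During evaluation, predictions made against the readable labels can be mapped back through the inverse dict.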
13 changes: 13 additions & 0 deletions zshot/evaluation/dataset/ontonotes/entities.py
@@ -1,5 +1,18 @@
from zshot.utils.data_models import Entity

ONTONOTES_EXPLANATORY_MAPPING = {
'PERSON': "Person", "NORP": "Affiliation", "FAC": "Building name",
"ORG": "Organization", "GPE": "Geopolitical Entity", "LOC": "Location", "PRODUCT": "Product",
"DATE": "Date", "TIME": "Time", "PERCENT": "Percentage", "MONEY": "Money",
"QUANTITY": "Quantity", "ORDINAL": "Ordinal", "CARDINAL": "Cardinal", "EVENT": "Event",
"WORK_OF_ART": "Work of Art", "LAW": "Law", "LANGUAGE": "Language",
"NEG": "NEG"
}

ONTONOTES_EXPLANATORY_INVERSE_MAPPING = {
v: k for k, v in ONTONOTES_EXPLANATORY_MAPPING.items()
}

ONTONOTES_ENTITIES = [Entity(name='NEG',
description="Coal, water, oil, etc. are normally used for traditional electricity "
"generation. However using liquefied natural gas as fuel for joint "
16 changes: 14 additions & 2 deletions zshot/evaluation/evaluator.py
@@ -10,12 +10,19 @@
class ZeroShotTokenClassificationEvaluator(TokenClassificationEvaluator):

def __init__(self, task="token-classification", default_metric_name=None,
mode: Optional[str] = 'span', alignment_mode=AlignmentMode.expand):
mode: Optional[str] = 'span', alignment_mode=AlignmentMode.expand,
entity_mapper: Optional[Dict[str, str]] = None):
super().__init__(task, default_metric_name)
self.alignment_mode = alignment_mode
self.mode = mode
self.entity_mapper = entity_mapper

def process_label(self, label):
if label != "O":
if self.entity_mapper is not None:
label_prefix = label[:2]
label = label_prefix + self.entity_mapper[label[2:]]

return f"B-{label[2:]}" if label.startswith("I-") and self.mode == 'token' else label

def prepare_data(self, data: Union[str, Dataset], input_column: str, label_column: str, join_by: str):
@@ -55,12 +62,17 @@ def prepare_pipeline(

class MentionsExtractorEvaluator(ZeroShotTokenClassificationEvaluator):
def __init__(self, task="token-classification", default_metric_name=None,
mode: Optional[str] = 'span', alignment_mode=AlignmentMode.expand):
mode: Optional[str] = 'span', alignment_mode=AlignmentMode.expand,
entity_mapper: Optional[Dict[str, str]] = None):
super().__init__(task, default_metric_name, alignment_mode=alignment_mode)
self.mode = mode
self.entity_mapper = entity_mapper

def process_label(self, label):
if label != "O":
if self.entity_mapper is not None:
label_prefix = label[:2]
label = label_prefix + self.entity_mapper[label[2:]]
if (label.startswith("B-") or label.startswith("I-")) and self.mode == 'span':
label = label[:2] + "MENTION"
else:
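Stripped of the evaluator classes, the label handling added in this file can be sketched as a standalone function (the sample mapper below is illustrative):

```python
from typing import Optional


def process_label(label: str, entity_mapper: Optional[dict] = None,
                  mode: str = "span") -> str:
    # rewrite the entity name after the B-/I- prefix through the mapper
    if label != "O" and entity_mapper is not None:
        prefix = label[:2]                     # "B-" or "I-"
        label = prefix + entity_mapper[label[2:]]
    # in token mode, every I- tag is scored as if it opened a new span
    if mode == "token" and label.startswith("I-"):
        label = "B-" + label[2:]
    return label
```

For example, with `{"PERSON": "Person"}` as the mapper, `"B-PERSON"` becomes `"B-Person"`, and in token mode `"I-PERSON"` becomes `"B-Person"`.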
51 changes: 42 additions & 9 deletions zshot/evaluation/run_evaluation.py
@@ -1,13 +1,19 @@
import argparse

import spacy

from zshot import PipelineConfig
from zshot.evaluation import load_medmentions_zs, load_ontonotes_zs
from zshot.evaluation.dataset.dataset import DatasetWithEntities
from zshot.evaluation.dataset.med_mentions.entities import MEDMENTIONS_EXPLANATORY_MAPPING, \
MEDMENTIONS_EXPLANATORY_INVERSE_MAPPING
from zshot.evaluation.dataset.ontonotes.entities import ONTONOTES_EXPLANATORY_INVERSE_MAPPING, \
ONTONOTES_EXPLANATORY_MAPPING
from zshot.evaluation.metrics.seqeval.seqeval import Seqeval
from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
from zshot.linker import LinkerTARS, LinkerSMXM, LinkerRegen
from zshot.linker import LinkerTARS, LinkerSMXM, LinkerRegen, LinkerGLINER
from zshot.mentions_extractor import MentionsExtractorSpacy, MentionsExtractorFlair, \
MentionsExtractorSMXM, MentionsExtractorTARS
MentionsExtractorSMXM, MentionsExtractorTARS, MentionsExtractorGLINER
from zshot.mentions_extractor.utils import ExtractorType

MENTION_EXTRACTORS = {
@@ -16,28 +22,49 @@
"flair_pos": lambda: MentionsExtractorFlair(ExtractorType.POS),
"flair_ner": lambda: MentionsExtractorFlair(ExtractorType.NER),
"smxm": lambda: MentionsExtractorSMXM,
"tars": lambda: MentionsExtractorTARS
"tars": lambda: MentionsExtractorTARS,
"gliner": lambda: MentionsExtractorGLINER
}
LINKERS = {
"regen": LinkerRegen,
"tars": LinkerTARS,
"smxm": LinkerSMXM
"smxm": LinkerSMXM,
"gliner": LinkerGLINER
}
END2END = ['tars', 'smxm', 'gliner']
ENTITIES_MAPPERS = {
"medmentions": MEDMENTIONS_EXPLANATORY_MAPPING,
"ontonotes": ONTONOTES_EXPLANATORY_MAPPING
}
ENTITIES_INVERSE_MAPPERS = {
"medmentions": MEDMENTIONS_EXPLANATORY_INVERSE_MAPPING,
"ontonotes": ONTONOTES_EXPLANATORY_INVERSE_MAPPING
}
END2END = ['tars', 'smxm']


def convert_entities(dataset: DatasetWithEntities, dataset_name: str) -> DatasetWithEntities:
mapping = ENTITIES_MAPPERS[dataset_name]
for entity in dataset.entities:
entity.name = mapping[entity.name]

return dataset


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="ontonotes", type=str,
help="Name or path to the validation data. Comma separated")
help="Name or names of the datasets. One of: ontonotes; medmentions. Comma separated")
parser.add_argument("--splits", required=False, default="test", type=str,
help="Splits to evaluate. Comma separated")
parser.add_argument("--mode", required=False, default="full", type=str,
help="Evaluation mode. One of: full; mentions_extractor; linker")
parser.add_argument("--entity_name", default="original", type=str,
help="Type of entity name. One of: original; explanatory. Original by default.")
parser.add_argument("--mentions_extractor", required=False, default="all", type=str,
help="Mentions extractor to evaluate. "
"One of: all; spacy_pos; spacy_ner; flair_pos; flair_ner; smxm; tars")
"One of: all; spacy_pos; spacy_ner; flair_pos; flair_ner; smxm; tars; gliner")
parser.add_argument("--linker", required=False, default="all", type=str,
help="Linker to evaluate. One of: all; regen; smxm; tars")
help="Linker to evaluate. One of: all; regen; smxm; tars; gliner")
parser.add_argument("--show_full_report", action="store_false",
help="Show evaluation report for each label. True by default")

@@ -99,8 +126,14 @@
dataset = load_ontonotes_zs(split)
else:
raise ValueError(f"{dataset_name} not supported")

if args.entity_name == "explanatory":
convert_entities(dataset, dataset_name)

nlp.get_pipe("zshot").mentions = dataset.entities
nlp.get_pipe("zshot").entities = dataset.entities

evaluation = evaluate(nlp, dataset, metric=Seqeval(), mode=mode)
evaluation = evaluate(nlp, dataset, metric=Seqeval(), mode=mode,
entity_mapper=ENTITIES_MAPPERS[dataset_name] if args.entity_name != "original"
else None)
print("\n".join(prettify_evaluate_report(evaluation, name=f"{dataset_name}-{split}")))
8 changes: 5 additions & 3 deletions zshot/evaluation/zshot_evaluate.py
@@ -14,7 +14,8 @@ def evaluate(nlp: spacy.language.Language,
dataset: Dataset,
metric: Optional[Union[str, EvaluationModule]] = Seqeval(),
mode: Optional[str] = 'span',
batch_size: Optional[int] = 16) -> dict:
batch_size: Optional[int] = 16,
entity_mapper: Optional[Dict[str, str]] = None) -> dict:
""" Evaluate a spacy zshot model

:param nlp: Spacy Language pipeline with ZShot components
@@ -25,10 +26,11 @@
- token: The evaluation is done at token level,
so if any of the tokens of the entity is missing the other are still valid
:param batch_size: the batch size
:param entity_mapper: Mapper for entity names
:return: Result of the evaluation. Dict with metrics results for each component
"""
linker_evaluator = ZeroShotTokenClassificationEvaluator(mode=mode)
mentions_extractor_evaluator = MentionsExtractorEvaluator(mode=mode)
linker_evaluator = ZeroShotTokenClassificationEvaluator(mode=mode, entity_mapper=entity_mapper)
mentions_extractor_evaluator = MentionsExtractorEvaluator(mode=mode, entity_mapper=entity_mapper)

results = {'evaluation_mode': mode}
if nlp.get_pipe("zshot").linker:
1 change: 1 addition & 0 deletions zshot/linker/__init__.py
@@ -4,3 +4,4 @@
from zshot.linker.linker_smxm import LinkerSMXM # noqa: F401
from zshot.linker.linker_tars import LinkerTARS # noqa: F401
from zshot.linker.linker_ensemble import LinkerEnsemble # noqa: F401
from zshot.linker.linker_gliner import LinkerGLINER # noqa: F401
58 changes: 58 additions & 0 deletions zshot/linker/linker_gliner.py
@@ -0,0 +1,58 @@
from typing import Iterator, List, Optional, Union

import pkgutil

from spacy.tokens import Doc
from gliner import GLiNER

from zshot.config import MODELS_CACHE_PATH
from zshot.linker.linker import Linker
from zshot.utils.data_models import Span


MODEL_NAME = "urchade/gliner_mediumv2.1"


class LinkerGLINER(Linker):
""" GLINER linker """

def __init__(self, model_name=MODEL_NAME):
super().__init__()

if not pkgutil.find_loader("gliner"):
raise Exception("GLINER module not installed. You need to install gliner in order to use the GLINER Linker. "
"Install it with: pip install gliner")

self.model_name = model_name
self.model = None

@property
def is_end2end(self) -> bool:
""" GLINER is end2end model"""
return True

def load_models(self):
""" Load GLINER model """
if self.model is None:
self.model = GLiNER.from_pretrained(self.model_name, cache_dir=MODELS_CACHE_PATH).to(self.device)

def predict(self, docs: Iterator[Doc], batch_size: Optional[Union[int, None]] = None) -> List[List[Span]]:
"""
Perform the entity prediction
:param docs: A list of spacy Document
:param batch_size: The batch size
:return: List Spans for each Document in docs
"""
if not self._entities:
return []

labels = [ent.name for ent in self._entities]
sentences = [doc.text for doc in docs]

self.load_models()
span_annotations = []
for sent in sentences:
entities = self.model.predict_entities(sent, labels, threshold=0.5)
span_annotations.append([Span.from_dict(ent) for ent in entities])

return span_annotations
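One small note on the import guard above: `pkgutil.find_loader` is deprecated as of Python 3.12, and `importlib.util.find_spec` is the usual replacement. A sketch of the same optional-dependency check (the function name and message wording are illustrative):

```python
import importlib.util


def require_module(module: str, install_hint: str) -> None:
    # raise a helpful error when an optional dependency is missing
    if importlib.util.find_spec(module) is None:
        raise ImportError(f"{module} not installed. Install it with: {install_hint}")


require_module("json", "pip install <package>")  # stdlib module, so this passes
```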