Pipelines: Text Opinion Annotation
In this post we focus on extracting relations between pairs of named entities mentioned in text. More precisely, we consider sentiment connections between the mentioned named entities, i.e. positive or negative ones (and additionally neutral).
In order to automatically extract relations of this type, it is first necessary to provide their markup in texts, i.e. an algorithm which allows us to extract text parts with potential connections between the mentioned named entities. The resulting markup could then be used for samples generation -- data which is required for Machine Learning model training, inferring, testing, and so on.
The snippet below illustrates the core function text_opinion_extraction_pipeline, a pipeline which allows us to extract text opinions from the documents:
pipeline = text_opinion_extraction_pipeline(
annotators=[
        # LIST OF YOUR ANNOTATORS
],
text_opinion_filters=[
# LIST OF YOUR FILTERS
],
get_doc_func=lambda doc_id: doc_ops.get_doc(doc_id),
text_parser=text_parser)
According to the snippet above, first of all we deal with annotators -- implementations which provide iterators over text opinions. To accomplish this task, AREkit-0.22.1 provides the following annotators:
1. PredefinedTextOpinionAnnotator -- converts BRAT-based relations into text opinions;
2. AlgorithmBasedOpinionAnnotator -- adopts algorithms for opinion annotation at the document level;
3. AlgorithmBasedTextOpinionAnnotator -- the same as (2), but with a conversion to the text level; in AREkit-0.22.1 it is inherited from (2).
Another parameter in the snippet above is a set of text-opinion filters, which allow us to declare boolean rules in order to reject some of the text opinions depending on certain limitations, our needs, and so on. By default, AREkit-0.22.1 provides the following filters for text opinions:
- EntityBasedTextOpinionFilter -- filters based on the ends of a text opinion, i.e. its participants (subject and object named entities);
- DistanceLimitedTextOpinionFilter -- filters based on the distance (in terms) between the mentioned named entities;
- FrameworkLimitationsTextOpinionFilter -- covers limitations that should be considered due to the specifics of the internal functionality implementation.
NOTE: FrameworkLimitationsTextOpinionFilter represents internal limitations which are applied by default, so there is no need to declare it manually; however, it might be useful to keep these known limitations in mind for your personal needs.
In order to deal with data and provide the related documents, it is necessary to declare the DocumentOperations interface. This interface provides a single get_doc method for a known document identifier:
class DocumentOperations(object):
def get_doc(self, doc_id):
raise NotImplementedError()
This wiki already covers the question of Binding a custom annotated collection for Relation Extraction. You may refer to that page for greater detail on how a reader might be implemented for a foo example collection.
The snippet below illustrates how the related reader might be wrapped into DocumentOperations:
class FooDocumentOperations(DocumentOperations):
def get_doc(self, doc_id):
return FooDocReader.read_document(str(doc_id), doc_id=doc_id)
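The doc_ops instance, referenced across the remaining snippets of this post, might then be declared as follows:
doc_ops = FooDocumentOperations()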
At last, as for the text_parser parameter, we also have a post which covers it in greater detail. Please refer to AREkit Tutorial: Compose your text-processing pipeline! for the related details.
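For completeness, below is a minimal sketch of such a parser; it assumes the BaseTextParser, TermsSplitterParser, and EntitiesGroupingPipelineItem components of AREkit-0.22.1 as covered in the mentioned tutorial, and relies on the synonyms collection declared later in this post, so please treat it as an illustration rather than the reference setup:
text_parser = BaseTextParser(pipeline=[
    # Split the raw text into terms.
    TermsSplitterParser(),
    # Group the mentioned entities by their synonymous values.
    EntitiesGroupingPipelineItem(lambda value:
        SynonymsCollectionValuesGroupingProviders.provide_existed_or_register_missed_value(
            synonyms, value))
])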
Let's take a closer look at how each of these components might be, first, declared and, second, adopted in the final, common text opinion extraction pipeline. In this section we cover all the annotators that might be crafted out-of-the-box.
Before we start with annotators, there is one more common component we require: a labels formatter.
To implement your own formatter, it is important to inherit the StringLabelsFormatter base class of AREkit-0.22.1.
The labels formatter allows us to perform a transformation from str to the Label type and vice versa.
The Label type is a core type for every label utilized in the project.
The snippet below illustrates an example of a custom labels formatter for a couple of labels, such as positive and negative:
class CustomLabelsFormatter(StringLabelsFormatter):
    def __init__(self, pos_label_type, neg_label_type):
        # Mapping from string values onto the related label types.
        stol = {"neg": neg_label_type, "pos": pos_label_type}
        super(CustomLabelsFormatter, self).__init__(stol=stol)
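As a quick usage sketch (an assumption here: the base class exposes a str_to_label method; PositiveLabel and NegativeLabel are the label types declared in the next snippet):
fmt = CustomLabelsFormatter(pos_label_type=PositiveLabel,
                            neg_label_type=NegativeLabel)
# "pos" is a key of the stol mapping declared above.
assert isinstance(fmt.str_to_label("pos"), PositiveLabel)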
PredefinedTextOpinionAnnotator is a simple annotator for the case when we already have annotated text opinions (like in BRAT-based collections). For texts in Russian, this might be the NEREL collection. The snippet below illustrates how the annotator of the predefined relations might be declared:
class PositiveLabel(Label):
pass
class NegativeLabel(Label):
pass
predefined_annotator = PredefinedTextOpinionAnnotator(
doc_ops=doc_ops,
label_formatter=CustomLabelsFormatter(pos_label_type=PositiveLabel,
neg_label_type=NegativeLabel))
Sometimes the annotation might be provided at the document level, like in RuSentRel.
The snippet below illustrates how document-level opinions, provided separately for every document of the RuSentRel collection, might be adopted and then converted into text opinions.
This conversion is performed via RuSentRelOpinionCollection.iter_opinions_from_doc, which we leave outside the scope of this post.
doclevel_annotator = AlgorithmBasedTextOpinionAnnotator(
annot_algo=PredefinedOpinionAnnotationAlgorithm(
lambda doc_id: __get_document_opinions(doc_id, synonyms, labels_fmt)),
create_empty_collection_func=lambda: OpinionCollection(
opinions=[], synonyms=synonyms,
error_on_duplicates=True,
error_on_synonym_end_missed=False),
get_doc_existed_opinions_func=lambda _: None,
value_to_group_id_func=lambda value:
SynonymsCollectionValuesGroupingProviders.provide_existed_value(synonyms, value))
def __get_document_opinions(doc_id, synonyms, labels_fmt):
return OpinionCollection(
opinions=RuSentRelOpinionCollection.iter_opinions_from_doc(doc_id, labels_fmt),
synonyms=synonyms,
error_on_synonym_end_missed=True,
error_on_duplicates=True)
Finally, we consider a no-label algorithm-based opinion annotator.
As in the snippet above, a synonyms collection is also required; in our case it is, by default, a StemmerBasedSynonymCollection based on the Yandex Mystem stemmer.
For the unlabeled annotation case, AREkit-0.22.1 provides the NoLabel instance.
synonyms = StemmerBasedSynonymCollection(
iter_group_values_lists=[], stemmer=MystemWrapper(), is_read_only=False, debug=False)
nolabel_annotator = AlgorithmBasedTextOpinionAnnotator(
annot_algo=PairBasedOpinionAnnotationAlgorithm(
dist_in_sents=0,
dist_in_terms_bound=50,
label_provider=ConstantLabelProvider(NoLabel())),
create_empty_collection_func=lambda: OpinionCollection(
opinions=[], synonyms=synonyms, error_on_duplicates=True, error_on_synonym_end_missed=False),
get_doc_existed_opinions_func=lambda _: None,
value_to_group_id_func=lambda value:
SynonymsCollectionValuesGroupingProviders.provide_existed_value(synonyms, value))
In this section we provide the details on which filters might be declared.
The base class for the opinion filter declaration has the following API:
NOTE: We consider the entity service provider by default in order to accelerate the filtering process, since this service is the most demanded one across the predefined filters.
class TextOpinionFilter(object):
def filter(self, text_opinion, parsed_news, entity_service_provider):
raise NotImplementedError()
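For instance, here is a minimal sketch of a custom filter which rejects self-referencing pairs (an assumption here: TextOpinion exposes the SourceId and TargetId identifiers of its participants, and a True result keeps the text opinion):
class SelfReferenceTextOpinionFilter(TextOpinionFilter):

    def filter(self, text_opinion, parsed_news, entity_service_provider):
        # Keep only those opinions in which subject and object
        # correspond to different mentioned entities.
        return text_opinion.SourceId != text_opinion.TargetId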
Meanwhile, filtering by entity types is accomplished via a separate EntityFilter abstraction, whose custom implementation might be as follows:
class CustomEntityFilter(EntityFilter):
supported = ["GPE", "PERSON", "LOCAL", "GEO", "ORG"]
def is_ignored(self, entity, e_type):
if e_type == OpinionEntityType.Subject or e_type == OpinionEntityType.Object:
return entity.Type not in CustomEntityFilter.supported
return True
Here is how the text opinion filter, based on the entity filter declared above, might be composed:
custom_filter = EntityBasedTextOpinionFilter(entity_filter=CustomEntityFilter())
In the snippet below, we consider only those text opinions in which the distance in terms between the participants does not exceed 50:
distance_filter = DistanceLimitedTextOpinionFilter(terms_per_context=50)
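Finally, to complete the picture, the components declared throughout this post might be gathered into the pipeline from the very first snippet; for the sake of illustration, we pass both of the declared annotators and both filters:
pipeline = text_opinion_extraction_pipeline(
    annotators=[
        predefined_annotator,
        nolabel_annotator
    ],
    text_opinion_filters=[
        custom_filter,
        distance_filter
    ],
    get_doc_func=lambda doc_id: doc_ops.get_doc(doc_id),
    text_parser=text_parser)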
© Nicolay Rusnachenko 2016-Present. Released under the MIT license.