MultiNER is a webservice that combines the output from four different named-entity recognition packages into one answer. This impementation is based on the multiNER software implemented by the KB, where it is part of the DAC project (Entity linker for the Dutch historical newspaper collection of the National Library of the Netherlands).
The following packages are used:
MultiNER is a Flask application that exposes one method (collect_from_text
) that returns the entities found in a text based on the Named Entities suggested by the various NER packages. These packages are its main dependencies. Stanford and DbPedia Spotlight are Java applications that run as webservices, whereas Spacy and Polyglot are Python packages.
MultiNER returns entities of four types: LOCATION, PERSON, ORGANIZATION and OTHER. All entities suggested by the various ner packages are translated into these four.
One of the nice features of multiNER is that the user can configure the weight of the different packages for each API call. See below for details.
More information and download link can be found in this article.
Stanford by default includes some English models (in the folder classifiers
). Models in other languages can be found as archives in tar format, e.g., dutch.tar.gz
. Untar such a model and supply it as <themodelyouwanttouse>
.
java -mx400m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -outputFormat inlineXML -encoding "utf-8" -loadClassifier classifiers/<themodelyouwanttouse>.gz -port <whateverportyoulike>
For example, with the complete English model loaded and running at port 9899:
java -mx400m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -outputFormat inlineXML -encoding "utf-8" -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9899
Accessing the Stanford webservice directly is done via the Telnet protocol. For example:
import telnetlib
done = False
retry = 0
max_retry = 10
while not done and retry < max_retry:
try:
conn = telnetlib.Telnet(host='<hostname>', port=<port>, timeout=1000)
done = True
except Exception as e:
print(e)
retry += 1
text = 'a text with lots of nice entities'
text = text.encode('utf-8') + b'\n'
conn.write(text)
result = conn.read_all().decode('utf-8')
conn.close()
More info on the DbPedia API can be found here. Note that it returns a plurality of types for each entity (e.g. 'DbPlace', 'DbLocation', 'Administrative Region', 'Governmental Region', etc)
Downloads are available from here
Models: Models for DbPedia are available here
Download the Dutch model:
wget https://sourceforge.net/projects/dbpedia-spotlight/files/2016-10/nl/model/nl.tar.gz/download
Java above JDK version 8 (i.e. 9 and higher) do not include all required Java modules. (see this SO answer). Therefore add --add-modules java.se.ee
when you start the webservice:
java --add-modules java.se.ee -jar dbpedia-spotlight-1.0.0.jar models/<whatevermodelyouwanttouse> http://<hostname>:<portnumber>/rest
For example, run with the Italian model on localhost, port 2222:
java --add-modules java.se.ee -jar dbpedia-spotlight-1.0.0.jar models/it http://localhost:2222/rest
You can access the DbPedia webservice via HTTP, it even works in your browser:
http://localhost:2222/rest/annotate/?text='Another pretty text with some incredibly relevant entities'
Of course you can always use some other tool to make the request for you, like curl or Postman, or whatever you like.
Spacy is a Python package and can be installed via Pip. Setting up the multiNER application will automatically install Spacy for you. Note that installing spacy takes a looooooong time.
Spacy models are Python packages as well. These are installed automatically as well, but there is something special about them.
As the Spacy documentation specifies, they are loaded directly from Github (i.e. they are in requirements.txt
as (for example) https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm
)
Spacy is run and accessed from the multiNER code. See the documentation if you need to consume it yourself.
Polyglot, like Spacy, is a Python package and can be installed via Pip. Setting up the multiNER application will automatically install Spacy for you.
It is somewhat special in that it tries to guess the language of the text it is being supplied (although you can suggest a language). This potentially leads to errors if models are not installed for the language Polyglot thinks it is NER-ing.
Installing Polyglot is dependent on the presence of the package libicu-dev
(sudo apt install libicu-dev
). If you do not have this package installed, you will get errors like Command "python setup.py egg_info" failed with error code 1 in /tmp/tmp3wehjwuobuild/pyicu/
when installing via pip (or compiling with pip-tools). This requirement that this error comes from is pyicu. More solutions to this problem (i.e. for other OSs) in this SO post
Polyglot models are Python packages, but these are not automatically installed. You need to manually download them. For example, Dutch:
polyglot download embeddings2.nl
polyglot download ner2.nl
Find all available models here
Polyglot is run and accessed from the multiNER code. See the documentation if you need to consume it yourself.
Once you have the Stanford and DbPedia Spotlight webservices up and running, you can set up multiNER itself.
Setup a virtualenv with Python 3.4
virtualenv .env -p python3.4 --prompt "(multiner) "
Enter the virtualenv: source .env/bin/activate
Install requirements: pip install -r requirements.txt
Set Flask variables and run:
export FLASK_APP=app.py
export FLASK_DEBUG=1
flask run
The multiNER application includes a bunch of unit tests. To run these, install pytest into your virtualenv (pip install pytest
) and run: pytest
.
You should now have a webservice running at http://localhost:5000/ner/collect_from_text
. It takes three parameters, of which one is required:
Name | Required | Info |
---|---|---|
title | False | If provided this will also be NER-ed |
text | True | The main text to be NER-ed |
configuration | False | If False, use default configuration |
Example request:
{
"title": "Mooi tekstje",
"text": "Utrecht ligt ver weg van Rotterdam",
"configuration": {
"language": "nl",
"context_length": 3,
"leading_ner_packages": [
"stanford",
"spotlight"
],
"other_packages_min": 2
}
}
The configuration parameter gives the user the ability to configure how multiNER creates one Named Entity from the sudggestions made by the various NER packages. It allows for several settings to be set:
Setting | Info | Allowed values | Default |
---|---|---|---|
language | The language of the text | 'nl', 'en', 'it' | 'en' |
context_length | Number of words to collect left and write of each entity | 0 to 15 | 5 |
leading_ner_packages | List of packages whose suggestions should always be included | 'stanford', 'spotlight', 'spacy', 'polyglot' | ['stanford', 'spotlight' ] |
other_packages_min | The minimum number of non-leading-packages required to suggest a Named Entity before it is included. E.g. if only Spacy suggest that a particular piece of text is a LOCATION and this setting's value is 2, the suggestion will be ignored. | 1 to 4 | 2 |
type_preference | A dictionary with format {<number>:<type>} specifying which type is preferred when more than one type is suggested. Note that this determines which type an entity is if there is no agreement (i.e. all packages suggest a different type). Currently, 'type_preference' is not integrated with 'leading_ner_packages', i.e. it just considers the different types, not which package suggested it. |
Keys: [1, 2, 3, 4], Values: ['LOCATION', 'PERSON', 'ORGANIZATION', 'OTHER'] | { 1: 'LOCATION', 2: 'PERSON', 3: 'ORGANIZATION', 4: 'OTHER' } |
MultiNER returns a JSON object with the following basic structure:
{
"title": {
"text": "The title",
"entities": [
// A list of the entities found
]
}
},
{
"text": {
"text": "The title",
"entities": [
// A list of the entities found
]
}
}
That is to say it returns an embedded JSON object for each part you send: the title and the text. Each of these contains the original text and the entities found in that text.
MultiNER returns Named Entities in which the suggestions from the various NER packages are integrated. Therefore, each entity has some special fields pertaining to the details of the integration, such as 'type_certainty', and 'alt_nes'. In addition, a list of all types suggested is also present.
Attribute | Info |
---|---|
pos | The start index of the entity in the text |
ne | The actual named entity, i.e. the text for which a type is suggested |
alt_nes | Alternative entities suggested for the same position in the text. For example, Stanford might suggest 'Angelina Jolie', whereas 'spacy might suggest 'Angelina'. Such double data is stored here. Note that it is also possible that an alt_ne does not have the same starting index. For example, Polyglot may find 'Universiteit Utrecht' (an ORGANIZATION), and Spotlight suggests 'Utrecht' (a LOCATION). As long as Spotlight is referring to the same positions in the text, this is also considered an 'alt_ne'. |
right_context | A configurable number of words to the right of the entity |
left_context | A configurable number of words to the left of the entity |
count | The number of NER packages that suggest this entity |
type | The type of the entity found. Based on 'type_preference' if there is no agreement between the various NER packages. |
type_certainty | The number of packages that suggested the entity's type. |
ner_src | The packages that suggested this entity |
types | A list of the types suggested |
Example response:
"count": 3,
"type_certainty": 2,
"type": "PERSON",
"right_context": "zich mij te vragen, of",
"left_context": "op zijn allerlaatst verwaardigde mevrouw",
"pos": 1324,
"ne_context": "Manchon",
"ne": "Manchon",
"ner_src": [
"stanford",
"spacy",
"polyglot"
],
"types": [
"PERSON",
"LOCATION"
]