using cld2-cffi instead of pycld2 #8

Open: wants to merge 2 commits into base: master
1 change: 1 addition & 0 deletions .gitignore
@@ -99,3 +99,4 @@ ENV/

# mypy
.mypy_cache/
.idea/
27 changes: 22 additions & 5 deletions README.md
@@ -1,10 +1,17 @@
# spaCy-CLD: Bringing simple language detection to spaCy

This package is a [spaCy 2.0 extension](https://spacy.io/usage/processing-pipelines#section-extensions) that adds language detection to spaCy's text processing pipeline. Inspired from a discussion [here](https://github.com/explosion/spaCy/issues/1172).
This package is a
[spaCy 2.0 extension](https://spacy.io/usage/processing-pipelines#section-extensions)
that adds language detection to spaCy's text processing pipeline.
Inspired from a discussion [here](https://github.com/explosion/spaCy/issues/1172).

## Installation

`pip install spacy_cld`
`python setup.py install`

If you can't compile it, retry after

`export CFLAGS="-Wno-narrowing"`

## Usage

@@ -23,10 +30,20 @@ doc._.languages # ['en']
doc._.language_scores['en'] # 0.96
```

spaCy-CLD operates on `Doc` and `Span` spaCy objects. When called on a `Doc` or `Span`, the object is given two attributes: `languages` (a list of up to 3 language codes) and `language_scores` (a dictionary mapping language codes to confidence scores between 0 and 1).
spaCy-CLD operates on `Doc` and `Span` spaCy objects. When called on a `Doc` or `Span`,
the object is given two attributes: `languages` (a list of up to 3 language codes)
and `language_scores` (a dictionary mapping language codes to confidence scores between
0 and 1).
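
To make the shape of those two attributes concrete, here is a minimal sketch in plain Python. It is not the package's actual code: the helper `summarize`, the sample rows, and the assumption that each detection row looks like `(name, code, percent, extra)` with `percent` on a 0-100 scale are all hypothetical.

```python
def summarize(rows, limit=3):
    """Build (languages, language_scores) from hypothetical detection rows.

    Each row is assumed to be (name, code, percent, extra), percent in 0-100.
    """
    languages = []
    scores = {}
    for _name, code, percent, _extra in rows[:limit]:
        languages.append(code)
        # Assumption: scale the percentage down to the 0-1 range the README describes.
        scores[code] = percent / 100.0
    return languages, scores


# Made-up sample rows, for illustration only.
rows = [("ENGLISH", "en", 96, 1024), ("FRENCH", "fr", 3, 32)]
languages, language_scores = summarize(rows)
print(languages)
print(language_scores)
```

The list keeps the detector's ordering, while the dictionary allows lookup by code, which matches how the usage example reads `doc._.language_scores['en']`.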

## Under the hood

spacy-cld is a little extension that wraps the [PYCLD2](https://github.com/aboSamoor/pycld2) Python library, which in turn wraps the [Compact Language Detector 2](https://github.com/CLD2Owners/cld2) C library originally built at Google for the Chromium project. CLD2 uses character n-grams as features and a Naive Bayes classifier to identify 80+ languages from Unicode text strings (or XML/HTML). It can detect up to 3 different languages in a given document, and reports a confidence score (reported in with each language.
spacy-cld is a little extension that wraps the
[CLD2-CFFI](https://github.com/GregBowyer/cld2-cffi) Python library, which in turn
wraps the [Compact Language Detector 2](https://github.com/CLD2Owners/cld2)
C++ library originally built at Google for the Chromium project.
CLD2 uses character n-grams as features and a Naive Bayes classifier to identify
80+ languages from Unicode text strings (or XML/HTML).
It can detect up to 3 different languages in a given document, and reports
a confidence score with each language.

For additional details, see the linked project pages for PYCLD2 and CLD2.
For additional details, see the linked project pages for CLD2-CFFI and CLD2.
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,2 +1,2 @@
cld2-cffi>=0.1.4
spacy>=2.0.0
pycld2==0.31
2 changes: 1 addition & 1 deletion setup.py
@@ -29,7 +29,7 @@ def setup_package():
packages=find_packages(),
install_requires=[
'spacy>=2.0.0,<3.0.0',
'pycld2>=0.31'],
'cld2-cffi>=0.1.4'],
zip_safe=False,
)

4 changes: 2 additions & 2 deletions spacy_cld/spacy_cld.py
@@ -1,4 +1,4 @@
from pycld2 import detect, error as pycld_error
from cld2 import detect
from spacy.tokens import Doc, Span


@@ -19,7 +19,7 @@ def get_scores(text, cld_results=None):
def detect_languages(text):
try:
_, _, results = detect(text.text)
except pycld_error as err:
except (ValueError, MemoryError):
results = [[None, "error", 0.0, None]]
return results
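
The changed exception handling can be exercised without CLD2 installed. The sketch below is only an illustration of the fallback pattern in this hunk: `fake_detect` and `detect_with_fallback` are hypothetical names, and the idea that detection failures surface as `ValueError` or `MemoryError` is taken from the diff, not from verified library documentation.

```python
def placeholder():
    """Return a fresh copy of the error row used in the diff."""
    return [[None, "error", 0.0, None]]


def detect_with_fallback(detector, text):
    """Call detector(text) and fall back to the placeholder row on failure."""
    try:
        # The detector is assumed to return a 3-item result whose last
        # element is the list of detection rows.
        _, _, results = detector(text)
    except (ValueError, MemoryError):
        results = placeholder()
    return results


def fake_detect(text):
    """Hypothetical stand-in for cld2's detect; rejects empty input."""
    if not text:
        raise ValueError("empty input")
    return True, len(text), [("ENGLISH", "en", 99, 1000.0)]


ok = detect_with_fallback(fake_detect, "This is English text.")
failed = detect_with_fallback(fake_detect, "")
print(ok)
print(failed)
```

Passing the detector in as an argument keeps the sketch testable; in the PR itself the function calls `detect` directly on `text.text`.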
