Releases: GlobalMaksimum/sadedegel
More Datasets & Prebuilt Models
New Features
- Special handling of emoji, hashtag and mention tokens by word tokenizers. Refer to sadedegel config for details.
  - The same options are available in the Text2Doc text to sadedegel Document converter
- [Incomplete] HashVectorizer (works far better than TfIdf or BM25 vectorization for the majority of the prebuilt models)
- unary option for idf
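As a rough illustration of what special handling of emoji, hashtag and mention tokens can mean for a word tokenizer, here is a minimal regular expression sketch. The pattern, the emoji code point range and the `tokenize` function are all assumptions made for this example; they are not sadedegel's actual tokenizer rules or options.

```python
import re

# Hypothetical pattern, not sadedegel's actual rules: hashtags and mentions
# are tried before plain words so "#nlp" stays one token instead of "#" + "nlp".
TOKEN_PATTERN = re.compile(
    r"#\w+"                       # hashtag
    r"|@\w+"                      # mention
    r"|[\U0001F300-\U0001FAFF]"   # a rough emoji range (assumption)
    r"|\w+"                       # ordinary word
    r"|[^\w\s]"                   # any other single symbol
)


def tokenize(text):
    """Split text into tokens, keeping hashtags, mentions and emoji whole."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("@sadedegel harika bir kütüphane 😀 #nlp!"))
```

A real tokenizer would likely make each of the three cases switchable through configuration, which is what the release note hints at.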
Datasets
We keep adding new datasets with this release. Refer to Dataset ReadMe for details.
- Customer Review dataset
- Telco (Turkcell) Sentiment dataset
- Movie Sentiment dataset
- Hotel Sentiment dataset
- Categorized Product Sentiment dataset
Prebuilt Models
We keep adding new prebuilt models with this release. Refer to Prebuilt Model ReadMe for details.
- Turkish Movie Review Sentiment Classification
- Telco Brand Tweet Sentiment Classification
- Turkish Customer Reviews Classification
Others
- Lazy evaluation of word shape property
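Lazy evaluation of a property such as the word shape can be sketched with `functools.cached_property`: the value is computed on first access and then reused. The `Word` class and its shape alphabet (X, x, d) are hypothetical and only illustrate the technique; sadedegel's own implementation may differ.

```python
from functools import cached_property


class Word:
    """Hypothetical stand-in for a token; not sadedegel's Token class."""

    def __init__(self, text):
        self.text = text
        self.shape_calls = 0  # counts how often the shape is really computed

    @cached_property
    def shape(self):
        # Computed on first access only, then stored on the instance.
        self.shape_calls += 1
        return "".join(
            "X" if ch.isupper() else "x" if ch.islower() else "d" if ch.isdigit() else ch
            for ch in self.text
        )


w = Word("Ankara06")
print(w.shape, w.shape, w.shape_calls)
```

Words whose shape is never requested pay no cost at all, which matters when a vocabulary holds many tokens.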
Behavioural Changes
- Significant behavior change on tokens property. Previously the property returned List[str]; it now returns List[Token]
- Sentence Tokenizer is renamed to Sentence Boundary Detector to prevent confusion with Word Tokenizer
Contribution
- Welcome our new contributor @ertugruldemir
ICU Tokenizer & Better Vocabulary Structure
Sadedegel is no longer only "An extraction based Turkish news summarizer" but "A General Purpose NLP library for Turkish"
News
We have added icu tokenizer as the default tokenizer (word tokenizer) which is very fast and accurate.
- We have moved BERT to an optional dependency which can be installed using pip install sadedegel[bert]
- Word embeddings are introduced. To retrain use pip install sadedegel[w2v]
- By making those dependencies optional, sadedegel installation is now much faster than before
  - pip install sadedegel takes 3 minutes @40Mbps for version 0.18
  - pip install sadedegel takes 40 sec @40Mbps for version 0.19
- Vocabulary files are now in hdf5 format. bert, icu and simple have their own vocabulary files.
  - Only icu vocabulary file includes word embeddings.
- Relax dependencies (less strict module version coupling)
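Making heavy packages optional usually goes together with a guarded import that tells the user which extra to install. The sketch below shows that general pattern; the `require` helper is hypothetical and is not part of sadedegel, and the mapping of a module name to the bert extra is an assumption.

```python
import importlib


def require(module_name, extra):
    """Import an optional dependency or explain how to install it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"'{module_name}' is needed for this feature. "
            f"Install it with: pip install sadedegel[{extra}]"
        ) from err


# Usage (assumed mapping): only code paths that need BERT would call
#     transformers = require("transformers", "bert")
# so a plain `pip install sadedegel` never pulls the heavy packages in.
```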
Feature Drop & Deprecation
Others
- Pre-trained models under prebuilt are refreshed
  - They now use icu tokenizer
  - They now return class probabilities for predictions
More Prebuilt Models
0.18 adds more prebuilt models to the sadedegel library
News
- Our main contributor @dafajon has implemented a new BM25Summarizer, similar to the TfIdf summarizer. BM25Summarizer performs slightly better on short summaries.
- We have packaged two new prebuilt models (refer to README for model accuracies)
  - Tweet profanity classification (sadedegel.prebuilt.tweet_profanity)
  - Tweet sentiment classification (sadedegel.prebuilt.tweet_sentiment)
- We have changed the way we report summarizer performance. Instead of a grid search over summarizer options, we now use a RandomSearch to decide the optimal summarizer and parameters. Refer to README for details.
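The move from grid search to random search can be pictured with a small sketch: instead of scoring every combination, a fixed number of configurations is sampled and the best one is kept. The search space, parameter names and `toy_score` objective are invented for illustration; a real run would score generated summaries against the annotated corpus.

```python
import random


def random_search(score_fn, space, n_trials=20, seed=0):
    """Sample n_trials configurations from space and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space and toy objective, for illustration only.
space = {"tf_method": ["binary", "raw", "log_norm"], "k1": [1.2, 1.6, 2.0], "b": [0.5, 0.75]}


def toy_score(tf_method, k1, b):
    return (tf_method == "log_norm") + (k1 == 1.6) * 0.5 + (b == 0.75) * 0.25


print(random_search(toy_score, space, n_trials=10))
```

The number of trials is chosen up front, so the cost no longer grows with the product of all option counts as it does for a full grid.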
Feature Drop & Deprecation
- sents property on Doc is dropped. Use __iter__(Doc) instead.
- tf property on Doc is deprecated (will be dropped by 0.18) in favor of get_tf function which gives a more flexible way to access document level tf vectors.
- tfidf function on Doc is deprecated (will be dropped by 0.18) in favor of get_tfidf function which gives a more flexible way to access document level tf-idf vectors.
- lexrank external dependency is dropped and LexRankPureSummarizer is renamed to LexRankSummarizer
- set_config, get_config, describe_config and get_all_configs are dropped in favor of new configuration implementation.
Others
- tf property is now a part of TfImpl class using default configuration settings to yield a tf vector for a Doc or Sentence
- We've updated documentation for our datasets.
- idf property is now a part of IdfImp class using default configuration settings to yield an idf vector for a Doc or Sentence
- More default parameters in default.ini based on our summarizer performance.
Direction to General Purpose NLP Library for Turkish
0.17 release introduces several non-summarisation NLP capabilities in Sadedegel
News
- Starting with this release, sadedegel now ships prebuilt models for various basic NLP tasks. The purpose is to allow developers to load & use those models with minimal configuration.
- Our first model is a news classifier (Thanks Taner Sezer for his corpus support)
- We report accuracy of our tokenizers (word) for potential enhancement points in future releases (Thanks Taner Sezer for his corpus support)
- To support the development of prebuilt models, sklearn compatible extension.sklearn module is introduced for feature engineering
- Token.is_stopword is added to flag stopword token types.
- LexRankSummarizer (based on lexrank external module, to be deprecated in future releases) and LexRankPureSummarizer (pure sadedegel version of the same method) are added to the set of extractive summarizers.
Feature Drop & Deprecation
- sents property on Doc is dropped. Use __iter__(Doc) instead.
- tf property on Doc is deprecated (will be dropped by 0.18) in favor of get_tf function which gives a more flexible way to access document level tf vectors.
- tfidf function on Doc is deprecated (will be dropped by 0.18) in favor of get_tfidf function which gives a more flexible way to access document level tf-idf vectors.
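A common way to deprecate a property in favor of a more flexible function is to keep the property as a thin wrapper that emits a `DeprecationWarning`. The miniature `Doc` below is hypothetical, and the `method` parameter of `get_tf` is an assumption made for the example rather than sadedegel's documented signature.

```python
import warnings


class Doc:
    """Hypothetical miniature Doc; only the deprecation pattern matters here."""

    def __init__(self, words):
        self.words = words

    def get_tf(self, method="raw"):
        # Flexible replacement: the caller picks the tf variant explicitly.
        counts = {}
        for w in self.words:
            counts[w] = counts.get(w, 0) + 1
        if method == "binary":
            return {w: 1 for w in counts}
        return counts

    @property
    def tf(self):
        # Old access path still works but points users to the new one.
        warnings.warn("Doc.tf is deprecated, use Doc.get_tf() instead",
                      DeprecationWarning, stacklevel=2)
        return self.get_tf()
```

Iterating with `for sentence in doc` plays the same role for the dropped sents property: the object itself becomes the access path.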
Others
- We have pushed up TF and IDF implementations from Sentence and Doc to separate classes using python multiple inheritance support to reduce code duplication.
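The refactoring described here can be sketched with mixin classes: the tf and idf logic lives in one place and both Sentence and Doc inherit it. Class names, method signatures and the idf formula are assumptions made for the sketch, not sadedegel's actual classes.

```python
import math
from collections import Counter


class TFMixin:
    """Term frequencies for any object exposing a `words` list."""

    def tf(self):
        return Counter(self.words)


class IDFMixin:
    """Inverse document frequencies, given a document-frequency table."""

    def idf(self, doc_freq, n_docs):
        # Smoothed idf variant chosen for illustration only.
        return {w: math.log(n_docs / (1 + doc_freq.get(w, 0))) for w in set(self.words)}


class Sentence(TFMixin, IDFMixin):
    def __init__(self, text):
        self.words = text.split()


class Doc(TFMixin, IDFMixin):
    def __init__(self, sentences):
        self.sentences = [Sentence(s) for s in sentences]
        self.words = [w for s in self.sentences for w in s.words]
```

Both classes only need to provide `words`; any fix to the shared tf or idf code then applies to sentences and documents alike.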
Minor Performance Enhancements & Tidy Up
In one month's time we have added a lot to the sadedegel library.
News
- We have resolved an old and major issue caused by improper from transformers import AutoTokenizer calls scattered around the code base, and the sentence boundary detector (sbd) is now lazy loaded. Just to give an idea:
  - sadedegel config CLI call to show sadedegel configuration took 11 sec in 0.16.1.1 release whereas it takes 2 sec in 0.16.2.1+
  - from sadedegel import Doc call (which is usually the first one to start working with sadedegel) took 9.5 sec in 0.16.1.1 release whereas it takes 1 sec in 0.16.2.1+
Feature Drop & Deprecation
- Old configuration capabilities are deprecated (this time unfortunately without prior warnings in earlier releases)
  - A DeprecationWarning indicates that you are accessing one of those APIs, which will be completely removed by 0.18
  - Please use the new API config_context (tf_context and idf_context are just simplified wrappers)
Documentation
- CONFIG.md details the configuration of sadedegel.
Others
- __getitem__ function to access any token of a Sentence
- Iterator on Sentence yields all Tokens in order.
- default tf method is now log_norm instead of binary thanks to @dafajon's most recent summarizer experiments.
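The three items above can be pictured together in one small class: index access, in-order iteration, and a tf method that defaults to log normalization. The `Sentence` class here is a hypothetical stand-in with string tokens, and the log_norm formula log(1 + count) is the common textbook variant, assumed rather than taken from sadedegel's source.

```python
import math
from collections import Counter


class Sentence:
    """Hypothetical container; tokens are plain strings here for brevity."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def __getitem__(self, i):
        return self.tokens[i]      # sentence[2] gives the third token

    def __iter__(self):
        return iter(self.tokens)   # tokens come out in their original order

    def __len__(self):
        return len(self.tokens)

    def tf(self, method="log_norm"):
        counts = Counter(self.tokens)
        if method == "binary":
            return {t: 1 for t in counts}
        # Assumed log normalization variant: log(1 + raw count)
        return {t: math.log1p(c) for t, c in counts.items()}
```

Unlike binary tf, the log normalized variant still rewards repeated terms, but with diminishing returns.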
Config, Configuration, Configurable
This release is mainly devoted to centralized configuration. A lot has changed and, although we hope not, something may be broken (always feel free to open an issue)
New Capabilities
- New command for sadedegel CLI, sadedegel config, to retrieve all possible configurations.
  - default values (sadedegel/default.ini) shipped with sadedegel can be overwritten by creating a user defined config file in ~/.sadedegel/user.ini (overwritten values are indicated on sadedegel config output.)
- Configurable tf and idf vectors per Sentence are ready with the new configuration model.
- We have finally implemented the forward version of BandSummarizer explained in sadedegel Presentation
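Layering a user file over shipped defaults can be sketched with the standard library `configparser`. The section and key names are hypothetical placeholders for whatever default.ini actually contains, and `load_config` is an illustrative helper, not a sadedegel function.

```python
from configparser import ConfigParser

DEFAULT_INI = """
[tf]
method = log_norm
[general]
tokenizer = icu
"""  # hypothetical content standing in for sadedegel/default.ini

USER_INI = """
[tf]
method = binary
"""  # hypothetical content standing in for ~/.sadedegel/user.ini


def load_config(default_text, user_text=""):
    """Return merged settings plus the (section, key) pairs the user overrode."""
    defaults = ConfigParser()
    defaults.read_string(default_text)
    merged = ConfigParser()
    merged.read_string(default_text)
    merged.read_string(user_text)   # later reads win over earlier ones
    overridden = {
        (section, key)
        for section in merged.sections()
        for key in merged[section]
        if not defaults.has_option(section, key)
        or defaults[section][key] != merged[section][key]
    }
    return merged, overridden
```

Tracking the overridden pairs is what allows a command like sadedegel config to mark which values differ from the shipped defaults.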
Internal Update
- Previously sadedegel.Doc was a Python class initialized with a document (string). We have seen some caveats in this approach, so sadedegel.Doc is now an instance of sadedegel.DocumentBuilder. Without changing (hopefully !!!) the end user experience, calling it triggers the __call__ function, which returns a sadedegel.Document instance.
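The builder arrangement described here can be sketched as a callable object bound to the old public name. The attributes and the `tokenizer` parameter are assumptions for illustration; only the names Doc, DocumentBuilder and Document come from the release note.

```python
class Document:
    """Hypothetical result type holding raw text and builder settings."""

    def __init__(self, raw, tokenizer):
        self.raw = raw
        self.tokenizer = tokenizer


class DocumentBuilder:
    """Callable factory: configuration lives here, not in Document."""

    def __init__(self, tokenizer="simple"):
        self.tokenizer = tokenizer

    def __call__(self, raw):
        return Document(raw, self.tokenizer)


# The public name is a builder instance, so existing Doc("...") calls keep working.
Doc = DocumentBuilder()

d = Doc("Merhaba dünya.")
print(type(d).__name__, d.tokenizer)
```

One consequence worth knowing: `isinstance(x, Doc)` no longer makes sense after such a change, because Doc is an object rather than a class.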
One Big Release
In one month's time we have added a lot to the sadedegel library.
News
- We have @doruktiktiklar as the first code contributor out of Global Maksimum AI team.
New Capabilities
- ADD: Addition of Vocabulary and Token concepts into library
  - Token: singleton per word (case sensitive) to store unique token features (lower form, shape, document frequency, etc.)
  - New sadedegel-build-vocabulary to manage sadedegel vocabularies.
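A singleton per word can be sketched by caching instances in `__new__`, so every request for the same (case sensitive) word returns the same object and its features are computed once. This is an illustrative sketch; attribute names such as `lower_` and `is_digit` are assumptions, not sadedegel's Token attributes.

```python
class Token:
    """Hypothetical sketch: one shared instance per distinct (case sensitive) word."""

    _cache = {}

    def __new__(cls, word):
        token = cls._cache.get(word)
        if token is None:
            # First time this word is seen: build it and compute its features once.
            token = super().__new__(cls)
            token.word = word
            token.lower_ = word.lower()
            token.is_digit = word.isdigit()
            cls._cache[word] = token
        return token


print(Token("Ankara") is Token("Ankara"))   # same object reused
print(Token("Ankara") is Token("ankara"))   # different case, different token
```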
New Summarizers
- ADD: TextRank Summarizer
  TextRank summarizer uses Google's PageRank algorithm based on distance/similarity defined by BERT embedding cosine distance/similarity (as of this release and more to come)
- ADD: TFIDF Summarizer
  TFIDF Summarizer uses element sum of tfidf vector of a sentence as the relevance score of a sentence in a document.
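The scoring rule of the TFIDF summarizer (sum of the elements of a sentence's tf-idf vector) can be sketched in a few lines. This toy version makes several assumptions that differ from the library: whitespace tokenization, raw counts as tf, and idf estimated from the sentences of the document itself rather than from a corpus vocabulary.

```python
import math
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of its tf-idf vector elements."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency here counts sentences containing the word (assumption).
    df = Counter(w for words in tokenized for w in set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append(sum(c * math.log(n / df[w]) for w, c in tf.items()))
    return scores


def summarize(sentences, k=1):
    """Return the k highest scoring sentences in their original order."""
    scores = tfidf_sentence_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


print(summarize(["kedi uyudu", "kedi uyudu", "köpek parkta koştu"], k=1))
```

Sentences made of rare words get high scores, so repeated boilerplate sentences tend to fall out of the summary.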
Others
- UPDATE: Some annotator consensus issues on summary corpus.
- UPDATE: A better command-line for summarizer evaluation. Check sadedegel-summarize evaluate for more
- ADD: Sentences level tf, idf and tfidf embeddings
- ADD: Doc has tfidf_embeddings property similar to bert_embeddings property.
Documentation
- ADD: Youtube webinar videos (in Turkish) on sadedeGel YouTube Channel
Contribution Guidelines
- ADD: Commit Guidelines
- ADD: New Feature checklist
Feature Drop & Deprecation
- DROP: Code quality guidelines are removed since Code Inspector limits the number of lines per open source project. We might continue with other providers in the future.
- DEPRECATED: Doc.sents will be removed by version 0.17
  - Use [i] to access the ith sentence of a document
  - Doc object now implements __iter__ to let you iterate over all sentences of a document.
Bugfix
- Properly handle empty documents, e.g. Doc("") or Doc('')
Jekyll based sadedegel Github pages
- ADD: We have initialized the sadedegel web page on Jekyll (SadedeGel WebSite)
- ADD: Add hotfix contribution process for sadedegel into CONTRIBUTING.md
- ADD: Sadedegel Slack channel.
- ADD: Evaluation scores of new experimental tokenizer (Simple Tokenizer)
Regular Expression based Simple Word Tokenizer & Code Quality
- ADD: Major change of this release is the Simple word tokenizer implementation by @dafajon after seeing the issues with BERT Tokenizer. Note that simple tokenizer is still experimental and not compatible with all summarizers (Cluster based summarizers automatically switch to BERT Tokenizer in order to be able to utilize BERT embeddings)
- ADD: Introduction of sadedegel.set_config to modify some sadedegel configurations, such as the word tokenizer.
- ADD: tags are added to ExtractiveSummarizer in order to filter them out (in evaluation etc.) easily.
- ADD: Thanks to Code Inspector, sadedeGel is under constant code quality monitoring with an initial grade of A (Score 94). We will keep it as high as we can as the capabilities of the library grow.
- CHANGE: Downgrade sklearn dependency back to 0.23.1 to prevent serialization compatibility warnings.
- CHANGE: Score normalization of summarizers is pushed up to the parent abstract class ExtractiveSummarizer, improving code quality by reducing repetitive code blocks.
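Pushing score normalization up to the parent class follows the template method pattern: subclasses return raw scores and the base class normalizes them in one place. The sketch below is hypothetical; the `_predict` hook, the `normalize` flag, the sum-to-one normalization and the `LengthSummarizer` subclass are assumptions, and only the ExtractiveSummarizer name and the tags idea come from the release note.

```python
from abc import ABC, abstractmethod


class ExtractiveSummarizer(ABC):
    """Hypothetical base class; subclasses only supply raw scores."""

    tags = ["extractive"]

    def __init__(self, normalize=True):
        self.normalize = normalize

    @abstractmethod
    def _predict(self, sentences):
        """Return one raw relevance score per sentence."""

    def predict(self, sentences):
        scores = self._predict(sentences)
        total = sum(scores)
        if self.normalize and total > 0:
            return [s / total for s in scores]   # shared normalization lives here
        return scores


class LengthSummarizer(ExtractiveSummarizer):
    """Toy baseline: longer sentences score higher."""

    tags = ExtractiveSummarizer.tags + ["baseline"]

    def _predict(self, sentences):
        return [float(len(s.split())) for s in sentences]
```

Class level tags like these make it easy to select or skip groups of summarizers during evaluation, for example everything tagged baseline.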
Improving APIs & Add commandline entrypoints
- ⚠️ CHANGE: We have changed Doc constructor. Use new from_sentences class method to construct a new Doc object using list of strings (representing sentences) resolves: #47
- CHANGE: Sentences object now holds a reference to originating Doc object (previously reference to Doc.sents) for more flexibility.
- CHANGE: We have significantly standardized our summarizers (specifically cluster based summarizers) resolves: #59 Summarizers now allow following parameter types on predict and __call__ functions:
  - Doc
  - List[Sentences]
  - List[str] (each element is taken as a sentence)
- ADD: We have completed documentation of sadedegel* commandlines' entrypoints
  - sadedegel
  - sadedegel-dataset
  - sadedegel-dataset-extended
  - sadedegel-summarize
  - sadedegel-sbd
  - sadedegel-server
- FIX: sadedegel info returns Heroku Application address properly.
- FIX: Fix memoization bug on Sentences.tokens_with_special_symbols providing 10% faster Sentences.tokens calls.
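The API changes in this release (an alternative constructor, a back reference from each sentence to its document, and summarizers accepting several input types) can be sketched together. Everything below is a simplified stand-in: the naive sentence split and the `as_sentence_texts` helper are assumptions for illustration, not sadedegel code.

```python
class Sentences:
    def __init__(self, text, doc=None):
        self.text = text
        self.doc = doc          # reference back to the originating Doc


class Doc:
    def __init__(self, raw):
        self.raw = raw
        # Naive split on ". " purely for illustration
        self.sentences = [Sentences(s, self) for s in raw.split(". ") if s]

    @classmethod
    def from_sentences(cls, sentences):
        """Alternative constructor for text that is already split."""
        doc = cls.__new__(cls)
        doc.raw = " ".join(sentences)
        doc.sentences = [Sentences(s, doc) for s in sentences]
        return doc


def as_sentence_texts(inp):
    """Normalize Doc, List[Sentences] or List[str] into a list of strings."""
    if isinstance(inp, Doc):
        return [s.text for s in inp.sentences]
    return [s.text if isinstance(s, Sentences) else s for s in inp]
```

A summarizer that calls a helper like `as_sentence_texts` first can then treat all three accepted input types identically in its predict and `__call__` functions.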