Releases · GlobalMaksimum/sadedegel

22 Apr 00:57

husnusensoy

0.20.1

91564f5

More Datasets & Prebuilt Models

Latest

New Features

Exceptional handling of emoji, hashtag and mention tokens by word tokenizers. Refer to sadedegel config for details.
- The same options are also available in Text2Doc, the text to sadedegel Document converter
[Incomplete] HashVectorizer (works far better than TfIdf or BM25 vectorization for the majority of the prebuilt models)
unary option for idf
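
For readers unfamiliar with hash-based vectorization, the general hashing trick can be sketched as follows. This is a minimal illustration under stated assumptions, not sadedegel's HashVectorizer: the function name `hash_vectorize` and the `n_features` parameter are hypothetical.

```python
import hashlib
from typing import List


def hash_vectorize(tokens: List[str], n_features: int = 16) -> List[int]:
    """Map tokens to a fixed-size count vector using the hashing trick.

    Hypothetical helper for illustration; not part of sadedegel.
    """
    vector = [0] * n_features
    for token in tokens:
        # md5 is stable across runs, unlike Python's built-in hash() for str.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector


vec = hash_vectorize(["sade", "degil", "sade"])
```

No vocabulary has to be stored: the vector size is fixed up front, at the cost of occasional collisions between different tokens.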

Datasets

We keep adding new datasets with this release. Refer to Dataset ReadMe for details.

Customer Review dataset
Telco (Turkcell) Sentiment dataset
Movie Sentiment dataset
Hotel Sentiment dataset
Categorized Product Sentiment dataset

Prebuilt Models

We keep adding new prebuilt models with this release. Refer to Prebuilt Model ReadMe for details.

Turkish Movie Review Sentiment Classification
Telco Brand Tweet Sentiment Classification
Turkish Customer Reviews Classification

Others

Lazy evaluation of word shape property
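
Lazy evaluation of a property usually means the value is computed on first access and cached afterwards. A minimal sketch of that idea, assuming a hypothetical `Word` class and a simple shape encoding (uppercase as X, lowercase as x, digit as d); sadedegel's actual implementation may differ:

```python
from functools import cached_property


class Word:
    """Hypothetical stand-in used only to illustrate lazy evaluation."""

    def __init__(self, text: str):
        self.text = text
        self.shape_computations = 0  # counts how often shape is really computed

    @cached_property
    def shape(self) -> str:
        # Runs only on first access; the result is then stored on the instance.
        self.shape_computations += 1
        out = []
        for ch in self.text:
            if ch.isdigit():
                out.append("d")
            elif ch.isalpha():
                out.append("X" if ch.isupper() else "x")
            else:
                out.append(ch)
        return "".join(out)


w = Word("Ankara06")
```

Words whose shape is never requested pay no cost, and repeated accesses reuse the cached value.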

Behavioural Changes

Significant behavior change on tokens property. Previously the property returned List[str]; it now returns List[Token].
Sentence Tokenizer is renamed to Sentence Boundary Detector to prevent confusion with Word Tokenizer
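
Code that relied on tokens being plain strings needs a small adaptation. The sketch below shows one way to recover strings from token objects. The `Token` dataclass and its `word` attribute are hypothetical stand-ins, so check the library for the real attribute names:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a token object; not sadedegel's class."""

    word: str

    def __str__(self) -> str:
        return self.word


def as_strings(tokens: List[Token]) -> List[str]:
    """Convert token objects back to the plain strings older code expected."""
    return [str(token) for token in tokens]


tokens = [Token("Merhaba"), Token("dünya")]
words = as_strings(tokens)
```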

Contribution

Welcome our new contributor @ertugruldemir

02 Apr 17:37

husnusensoy

0.19.1

8f03d59

ICU Tokenizer & Better Vocabulary Structure

sadedegel is no longer just "An extraction based Turkish news summarizer" but "A General Purpose NLP library for Turkish".

News

We have added icu tokenizer as the default tokenizer (word tokenizer) which is very fast and accurate.

BERT is now an optional dependency, which can be installed using pip install sadedegel[bert]
Word embeddings are introduced. To retrain, use pip install sadedegel[w2v]
By making those dependencies optional, sadedegel installation is now much faster than before
- pip install sadedegel takes 3 minutes @40Mbps for version 0.18
- pip install sadedegel takes 40 sec @40Mbps for version 0.19
Vocabulary files are now in hdf5 format. bert, icu and simple have their own vocabulary files.
- Only icu vocabulary file includes word embeddings.
Relax dependencies (less strict module version coupling)

Feature Drop & Deprecation

Others

Pre-trained models under prebuilt are refreshed
- They now use icu tokenizer
- They now return class probabilities for predictions

17 Mar 09:59

husnusensoy

0.18.2

4582022

More Prebuilt Models

0.18 adds more prebuilt models into sadedegel library

News

Our main contributor @dafajon has implemented a new BM25Summarizer, similar to the TfIdf summarizer. BM25Summarizer performs slightly better on short summaries.
We have packaged two new prebuilt models (refer to README for model accuracies)
1. tweet profanity classification (sadedegel.prebuilt.tweet_profanity)
2. tweet sentiment classification (sadedegel.prebuilt.tweet_sentiment)
We changed the way we report summarizer performance. Instead of a grid search over summarizer options, we now use a RandomSearch to decide the optimal summarizer and parameters. Refer to README for details.
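
The difference from a grid search is that random search samples a fixed number of parameter combinations instead of enumerating all of them. A minimal sketch, assuming a hypothetical search space and a toy scoring function (the parameter names `tf_method` and `k1` are illustrative, not the project's evaluation setup):

```python
import random
from typing import Callable, Dict, List, Tuple


def random_search(space: Dict[str, List], score: Callable[[Dict], float],
                  n_trials: int, seed: int = 0) -> Tuple[Dict, float]:
    """Sample n_trials random parameter combinations and keep the best one."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical search space and toy objective, for illustration only.
space = {"tf_method": ["binary", "raw", "log_norm"], "k1": [1.2, 1.5, 2.0]}


def toy_score(params: Dict) -> float:
    return float(params["tf_method"] == "log_norm") + params["k1"] / 10


best, value = random_search(space, toy_score, n_trials=20)
```

The cost is bounded by `n_trials` regardless of how many options are added to the space, which is the usual reason to prefer it over an exhaustive grid.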

Feature Drop & Deprecation

sents property on Doc is dropped. Iterate over the Doc itself instead (Doc implements __iter__).
tf property on Doc is deprecated (will be dropped by 0.18) in favor of get_tf function which gives a more flexible way to access document level tf vectors.
tfidf function on Doc is deprecated (will be dropped by 0.18) in favor of get_tfidf function which gives a more flexible way to access document level tf-idf vectors.
lexrank external dependency is dropped and LexRankPureSummarizer is renamed to LexRankSummarizer
set_config, get_config, describe_config and get_all_configs are dropped in favor of new configuration implementation.

Others

tf property is now a part of TfImpl class using default configuration settings to yield a tf vector for a Doc or Sentence
We've updated documentation for our datasets.
idf property is now a part of IdfImp class using default configuration settings to yield an idf vector for a Doc or Sentence
More default parameters in default.ini based on our summarizer performance.

17 Mar 09:40

husnusensoy

0.17.1.1

fdbdf3c

Direction to General Purpose NLP Library for Turkish

The 0.17 release introduces several NLP capabilities in sadedegel that are not related to summarization

News

Starting with this release, sadedegel now ships prebuilt models for various basic NLP tasks. The purpose is to allow developers to load & use those models with minimal configuration.
- Our first model is a news classifier (Thanks Taner Sezer for his corpus support)
We report accuracy of our tokenizers (word) for potential enhancement points in future releases (Thanks Taner Sezer for his corpus support)
To support the development of prebuilt models, the sklearn compatible extension.sklearn module is introduced for feature engineering
Token.is_stopword is added to flag stopword token types.
LexRankSummarizer (based on the lexrank external module, to be deprecated in future releases) and LexRankPureSummarizer (pure sadedegel version of the same method) are added to the set of extractive summarizers.

Feature Drop & Deprecation

sents property on Doc is dropped. Iterate over the Doc itself instead (Doc implements __iter__).
tf property on Doc is deprecated (will be dropped by 0.18) in favor of get_tf function which gives a more flexible way to access document level tf vectors.
tfidf function on Doc is deprecated (will be dropped by 0.18) in favor of get_tfidf function which gives a more flexible way to access document level tf-idf vectors.

Others

We have pushed up TF and IDF implementations from Sentence and Doc to separate classes using python multiple inheritance support to reduce code duplication.
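
The refactoring pattern described here can be sketched with two mixin classes shared by sentence and document types. All class names, the toy corpus statistics and the idf formula below are assumptions for illustration, not sadedegel's code:

```python
import math
from collections import Counter
from typing import Dict, List

# Toy corpus statistics; purely illustrative values.
N_DOCS = 10
DOC_FREQ = {"kitap": 4, "okudum": 1}


class TfMixin:
    """Provides tf for any class that exposes a `tokens` list."""

    @property
    def tf(self) -> Dict[str, int]:
        return dict(Counter(self.tokens))


class IdfMixin:
    """Provides idf for any class that exposes a `tokens` list."""

    @property
    def idf(self) -> Dict[str, float]:
        return {t: math.log(N_DOCS / DOC_FREQ.get(t, 1)) for t in set(self.tokens)}


class SentenceSketch(TfMixin, IdfMixin):
    def __init__(self, tokens: List[str]):
        self.tokens = list(tokens)


class DocSketch(TfMixin, IdfMixin):
    def __init__(self, sentences: List[SentenceSketch]):
        self.sentences = list(sentences)

    @property
    def tokens(self) -> List[str]:
        # A document's tokens are the concatenation of its sentences' tokens.
        return [t for s in self.sentences for t in s.tokens]


sentence = SentenceSketch(["kitap", "okudum", "kitap"])
doc = DocSketch([sentence, SentenceSketch(["kitap"])])
```

Both classes get `tf` and `idf` from a single implementation each, which is the duplication the release note refers to removing.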

07 Jan 23:13

husnusensoy

0.16.2.1

cd9bc60

Minor Performance Enhancements & Tidy Up

Pre-release

In one month we have added a lot to the sadedegel library.

News

We have resolved an old, major issue caused by improper from transformers import AutoTokenizer calls scattered across the code base, and the sentence boundary detector (sbd) is now loaded lazily. Just to give an idea:
- sadedegel config CLI call to show sadedegel configuration took 11 sec in the 0.16.1.1 release versus 2 sec in 0.16.2.1+
- from sadedegel import Doc call (which is usually the first one to start working with sadedegel) took 9.5 sec in the 0.16.1.1 release versus 1 sec in 0.16.2.1+
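
Import-time savings of this kind typically come from deferring an expensive import until the first call that needs it. A minimal sketch of the pattern, using the standard library `json` module as a stand-in for a heavy dependency (the function name is hypothetical):

```python
import importlib

_heavy_module = None  # filled in on first use


def get_heavy_module():
    """Import the expensive dependency only when it is first needed."""
    global _heavy_module
    if _heavy_module is None:
        # "json" is a lightweight stand-in for a heavy package here.
        _heavy_module = importlib.import_module("json")
    return _heavy_module


first = get_heavy_module()
second = get_heavy_module()
```

Importing the module that defines `get_heavy_module` stays cheap; only callers that actually need the dependency pay for loading it, and they pay once.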

Feature Drop & Deprecation

Old configuration capabilities are deprecated (this time unfortunately without prior warnings in earlier releases)
- A DeprecationWarning indicates that you are accessing one of these APIs, which will be completely removed by 0.18
- Please use the new API config_context (tf_context and idf_context are just simplified wrappers)
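
A context-based configuration API generally overrides settings inside a with block and restores them on exit. The sketch below illustrates that general pattern only. `temporary_config`, `CONFIG` and the key names are hypothetical and do not reproduce the signature of sadedegel's config_context:

```python
from contextlib import contextmanager

# Hypothetical in-memory configuration, for illustration only.
CONFIG = {"tf": {"method": "log_norm"}}


@contextmanager
def temporary_config(section: str, **overrides):
    """Apply overrides to one config section and restore it afterwards."""
    saved = dict(CONFIG[section])
    CONFIG[section].update(overrides)
    try:
        yield CONFIG
    finally:
        # Restore the previous values even if the body raised an exception.
        CONFIG[section] = saved


with temporary_config("tf", method="binary") as cfg:
    inside = cfg["tf"]["method"]
outside = CONFIG["tf"]["method"]
```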

Documentation

CONFIG.md details the configuration of sadedegel.

Others

__getitem__ function to access any token of a Sentence
Iterator on Sentence yields all Tokens in order.
default tf method is now log_norm instead of binary thanks to @dafajon's most recent summarizer experiments.
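
To make the difference between the two tf methods concrete, here is a small sketch. It assumes the common textbook definitions (binary as presence or absence, log normalization as log(1 + count)); the exact formulas used by sadedegel are not stated in these notes:

```python
import math
from collections import Counter
from typing import Dict, List


def tf_binary(tokens: List[str]) -> Dict[str, int]:
    """1 for every term that occurs, regardless of how often."""
    return {term: 1 for term in set(tokens)}


def tf_log_norm(tokens: List[str]) -> Dict[str, float]:
    """log(1 + count): repeated terms weigh more, with diminishing returns."""
    return {term: math.log(1 + count) for term, count in Counter(tokens).items()}


tokens = ["haber", "haber", "spor"]
binary = tf_binary(tokens)
log_norm = tf_log_norm(tokens)
```

Under these definitions the binary method cannot distinguish a term used once from a term used many times, while log normalization can.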

07 Jan 22:38

husnusensoy

0.16.0.1

12cf928

Config, Configuration, Configurable

Pre-release

This release is mainly devoted to centralized configuration. A lot has changed; hopefully nothing is broken, but something may be (always feel free to open an issue)

New Capabilities

New command for sadedegel CLI, sadedegel config, to retrieve all possible configurations.
- Default values (sadedegel/default.ini) ship with sadedegel and can be overridden by creating a user defined config file in ~/.sadedegel/user.ini (overridden values are indicated in the sadedegel config output.)
Configurable tf and idf vectors per Sentence are ready with the new configuration model.
We have finally implemented the forward version of BandSummarizer explained in the sadedegel Presentation
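
The default-plus-user-override layering described above can be sketched with the standard library's configparser, where later reads override earlier ones. The section and key names below are hypothetical, and in-memory strings stand in for default.ini and user.ini:

```python
import configparser

# Stand-ins for sadedegel/default.ini and ~/.sadedegel/user.ini.
# Section and key names are hypothetical.
DEFAULT_INI = """
[tf]
method = log_norm

[idf]
method = smooth
"""

USER_INI = """
[tf]
method = binary
"""


def load_config(default_text: str, user_text: str = "") -> configparser.ConfigParser:
    """Read defaults first, then user settings, so user values win."""
    parser = configparser.ConfigParser()
    parser.read_string(default_text)
    parser.read_string(user_text)
    return parser


cfg = load_config(DEFAULT_INI, USER_INI)
```

Keys the user file does not mention keep their shipped defaults.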

Internal Update

Previously sadedegel.Doc was a Python class initialized with a document (string). We have seen some caveats in this approach, so sadedegel.Doc is now an instance of sadedegel.DocumentBuilder. Without changing the end user experience (hopefully!), calling Doc(...) now triggers the builder's __call__ function, which returns a sadedegel.Document instance.
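
The builder pattern described here can be sketched as a callable object that produces document instances. The classes and the naive regex splitter below are simplified stand-ins, not sadedegel's implementation:

```python
import re
from typing import Callable, List


def naive_split(raw: str) -> List[str]:
    """Toy sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]


class Document:
    def __init__(self, raw: str, sentences: List[str]):
        self.raw = raw
        self.sentences = sentences


class DocumentBuilder:
    """Callable factory: holds shared setup and builds Document objects."""

    def __init__(self, splitter: Callable[[str], List[str]] = naive_split):
        self.splitter = splitter

    def __call__(self, raw: str) -> Document:
        return Document(raw, self.splitter(raw))


# Doc looks like a class to callers but is a builder instance.
Doc = DocumentBuilder()
d = Doc("Bir cümle. İkinci cümle.")
```

Existing call sites such as `Doc("...")` keep working, while shared state (here the splitter) lives on the builder rather than being set up for every document.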

09 Oct 19:57

husnusensoy

0.15.2.4

faa556a

One Big Release

Pre-release

In one month we have added a lot to the sadedegel library.

News

We have @doruktiktiklar as the first code contributor from outside the Global Maksimum AI team.

New Capabilities

ADD: Addition of Vocabulary and Token concepts into library
- Token: singleton per word (case sensitive) to store unique token features (lower form, shape, document frequency, etc.)
- New sadedegel-build-vocabulary to manage sadedegel vocabularies.
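
A "singleton per word" token can be sketched with a per-class cache consulted in __new__, so that the same word always yields the same object. The class below is a hypothetical illustration (attribute names included), not sadedegel's Token:

```python
class TokenSketch:
    """One shared instance per distinct (case sensitive) word."""

    _cache = {}

    def __new__(cls, word: str):
        token = cls._cache.get(word)
        if token is None:
            token = super().__new__(cls)
            # Features are computed once and shared by every occurrence.
            token.word = word
            token.lower_form = word.lower()
            cls._cache[word] = token
        return token


a = TokenSketch("Ankara")
b = TokenSketch("Ankara")
c = TokenSketch("ankara")
```

Because lookups are case sensitive, "Ankara" and "ankara" are distinct tokens that happen to share the same lower form.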

New Summarizers

ADD: TextRank Summarizer
TextRank summarizer uses Google's PageRank algorithm based on distance/similarity defined by BERT embedding cosine distance/similarity (as of this release and more to come)
ADD: TFIDF Summarizer
TFIDF Summarizer uses the sum of the elements of a sentence's tfidf vector as the relevance score of that sentence in a document.
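
The TFIDF scoring rule can be sketched in a few lines. This illustration makes two assumptions that may not match the library: document frequencies are computed over the sentences of the document itself, and a smoothed idf is used so that every term contributes a positive weight.

```python
import math
from collections import Counter
from typing import List


def tfidf_sentence_scores(sentences: List[List[str]]) -> List[float]:
    """Score each tokenized sentence by the sum of its tf-idf weights."""
    n = len(sentences)
    # Number of sentences each term appears in (assumed stand-in for corpus df).
    df = Counter(term for sentence in sentences for term in set(sentence))
    scores = []
    for sentence in sentences:
        tf = Counter(sentence)
        score = sum(
            count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        )
        scores.append(score)
    return scores


sentences = [["kedi", "uyudu"], ["kedi", "kedi", "koştu", "hızlı"], ["kedi"]]
scores = tfidf_sentence_scores(sentences)
```

A summarizer would then keep the highest scoring sentences. Note that a plain sum favors longer sentences, which is one reason score normalization matters.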

Others

UPDATE: Some annotator consensus issues on summary corpus.
UPDATE: A better command-line for summarizer evaluation. Check sadedegel-summarize evaluate for more
ADD: Sentence-level tf, idf and tfidf embeddings
ADD: Doc has tfidf_embeddings property similar to bert_embeddings property.

Documentation

ADD: Youtube webinar videos (in Turkish) on sadedeGel YouTube Channel

Contribution Guidelines

ADD: Commit Guidelines
ADD: New Feature checklist

Feature Drop & Deprecation

DROP: Code quality guidelines are removed since Code Inspector limits the number of lines per open source project. We might continue with other providers in the future.
DEPRECATED: Doc.sents will be removed by version 0.17
- Use [i] to access the i-th sentence of a document
- Doc object now implements __iter__ so you can iterate over all sentences of a document.
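
The indexing and iteration protocol that replaces Doc.sents can be sketched with a small stand-in class. `DocSketch` is hypothetical and takes a ready-made list of sentences, unlike the real Doc:

```python
from typing import Iterator, List


class DocSketch:
    """Stand-in showing the sequence protocol on a document object."""

    def __init__(self, sentences: List[str]):
        self._sentences = list(sentences)

    def __getitem__(self, index: int) -> str:
        return self._sentences[index]

    def __iter__(self) -> Iterator[str]:
        return iter(self._sentences)

    def __len__(self) -> int:
        return len(self._sentences)


doc = DocSketch(["İlk cümle.", "İkinci cümle."])
second = doc[1]                    # instead of doc.sents[1]
all_sentences = [s for s in doc]   # instead of looping over doc.sents
```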

Bugfix

Properly handle empty documents, e.g. Doc("") or Doc('')

13 Sep 20:11

husnusensoy

0.14.1

3434128

Jekyll based sadedegel Github pages

Pre-release

ADD: We have launched the sadedegel web page on Jekyll (SadedeGel WebSite)
ADD: Add hotfix contribution process for sadedegel into CONTRIBUTING.md
ADD: Sadedegel Slack channel.
ADD: Evaluation scores of new experimental tokenizer (Simple Tokenizer)

05 Sep 21:38

husnusensoy

0.13.5

ad41663

Regular Expression based Simple Word Tokenizer & Code Quality

Pre-release

ADD: The major change in this release is the Simple word tokenizer implementation by @dafajon, added after seeing the issues with the BERT Tokenizer. Note that the simple tokenizer is still experimental and not compatible with all summarizers (cluster based summarizers automatically switch to the BERT Tokenizer in order to be able to utilize BERT embeddings)
ADD: Introduction of sadedegel.set_config to modify some sadedegel configurations, such as the word tokenizer.
ADD: tags are added to ExtractiveSummarizer in order to filter them out (in evaluation etc.) easily.
ADD: Thanks to Code Inspector, sadedeGel is under constant code quality monitoring with an initial grade of A (Score 94). We will keep it as high as we can as the capabilities of the library grow.
CHANGE: Downgrade sklearn dependency back to 0.23.1 to prevent serialization compatibility warnings.
CHANGE: Score normalization of summarizers is pushed up to the parent abstract class ExtractiveSummarizer, improving code quality by reducing repetitive code blocks.
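
Pushing normalization into the parent class means each summarizer only has to provide raw scores. A minimal sketch of that structure follows. The class names, the `_score` hook and the sum-to-one normalization are assumptions for illustration, not the library's code:

```python
from abc import ABC, abstractmethod
from typing import List


class ExtractiveSummarizerSketch(ABC):
    """Parent class owning the shared normalization step."""

    def __init__(self, normalize: bool = True):
        self.normalize = normalize

    @abstractmethod
    def _score(self, sentences: List[str]) -> List[float]:
        """Subclasses return one raw relevance score per sentence."""

    def predict(self, sentences: List[str]) -> List[float]:
        scores = self._score(sentences)
        total = sum(scores)
        if self.normalize and total > 0:
            # Assumed scheme: scale scores so they sum to one.
            scores = [s / total for s in scores]
        return scores


class LengthSummarizerSketch(ExtractiveSummarizerSketch):
    """Toy subclass: longer sentences get higher raw scores."""

    def _score(self, sentences: List[str]) -> List[float]:
        return [float(len(s.split())) for s in sentences]


normalized = LengthSummarizerSketch().predict(["a b", "c"])
raw = LengthSummarizerSketch(normalize=False).predict(["a b", "c"])
```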

15 Aug 23:02

husnusensoy

0.12

06b73c5

Improving APIs & Add commandline entrypoints

Pre-release

⚠️ CHANGE: We have changed the Doc constructor. Use the new from_sentences class method to construct a new Doc object from a list of strings (representing sentences). resolves: #47
CHANGE: Sentences object now holds a reference to the originating Doc object (previously a reference to Doc.sents) for more flexibility.
CHANGE: We have significantly standardized our summarizers (specifically cluster based summarizers). resolves: #59 Summarizers now accept the following parameter types on predict and __call__ functions:
- Doc
- List[Sentences]
- List[str] (each element is taken as a sentence)
ADD: We have completed documentation of the sadedegel* command-line entry points
- sadedegel
- sadedegel-dataset
- sadedegel-dataset-extended
- sadedegel-summarize
- sadedegel-sbd
- sadedegel-server
FIX: sadedegel info returns Heroku Application address properly.
FIX: Fix memoization bug on Sentences.tokens_with_special_symbols providing 10% faster Sentences.tokens calls.
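
Accepting a Doc, a list of Sentences or a list of strings usually comes down to normalizing the input once at the top of predict. The sketch below illustrates that idea with hypothetical stand-in classes and a hypothetical helper, `to_sentence_texts`; only the from_sentences name comes from the notes above.

```python
from typing import List, Union


class SentenceSketch:
    def __init__(self, text: str):
        self.text = text


class DocSketch:
    def __init__(self, sentences: List[SentenceSketch]):
        self.sentences = sentences

    @classmethod
    def from_sentences(cls, texts: List[str]) -> "DocSketch":
        """Build a document from strings that are already split into sentences."""
        return cls([SentenceSketch(t) for t in texts])


def to_sentence_texts(
    inp: Union[DocSketch, List[SentenceSketch], List[str]]
) -> List[str]:
    """Normalize any of the three accepted input types to a list of strings."""
    if isinstance(inp, DocSketch):
        return [s.text for s in inp.sentences]
    if all(isinstance(item, str) for item in inp):
        return list(inp)
    if all(isinstance(item, SentenceSketch) for item in inp):
        return [s.text for s in inp]
    raise TypeError("Expected a Doc, a list of Sentences or a list of str")


texts = ["Birinci cümle.", "İkinci cümle."]
doc = DocSketch.from_sentences(texts)
```

With the normalization in one place, every summarizer can work on the same internal representation regardless of how it was called.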
