Given a collection of documents and a collection of tweets from the same time period, this tool tags each document with relevant hashtags using the Hashtagger+ algorithm adapted to an offline setting.
Unlike the original Hashtagger+, which was designed to tag a stream of documents in real time, this tool operates on static, offline collections.
The details of real-time Hashtagger+ can be found in:
- Hashtagger+: Efficient high-coverage social tagging of streaming news.
  Bichen Shi, Gevorg Poghosyan, Georgiana Ifrim, Neil Hurley.
  IEEE Transactions on Knowledge and Data Engineering 30.1 (2018): 43-58.
- Learning-to-rank for real-time high-precision hashtag recommendation for streaming news.
  Bichen Shi, Georgiana Ifrim, Neil Hurley.
  Proceedings of the 25th International Conference on World Wide Web (2016): 1191-1202.
conda create -n hashtagger_offline python=3.7
conda activate hashtagger_offline # use 'source activate hashtagger_offline' for older anaconda versions
conda install beautifulsoup4 lxml nltk numpy pytz scikit-learn
pip install --upgrade pip
pip install elasticsearch stemming sner
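Depending on the keyword-extraction settings, NLTK may also need some of its corpora available under the directory that is later configured as NLTK_DATA_PATH. Which corpora are actually required is an assumption here; 'stopwords' and 'punkt' are shown purely as common examples:

import nltk

# Download common corpora into the directory that will later be set as NLTK_DATA_PATH
# in hashtagger_config.py (the path below is a placeholder).
nltk.download("stopwords", download_dir="/path/to/nltk_data")
nltk.download("punkt", download_dir="/path/to/nltk_data")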
Download Stanford Named Entity Recognizer at
https://nlp.stanford.edu/software/CRF-NER.shtml#Download.
Download Stanford Log-linear Part-Of-Speech Tagger at
https://nlp.stanford.edu/software/tagger.html#Download.
Although the basic English Stanford Tagger should work fine,
this code has been tested only with the full Stanford Tagger.
Unpack both the downloaded archives into your desired location.
Start Stanford NER Tagger server
cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
The Stanford NER Tagger server port ("9199" in the example above) must be set in the configuration file along with the host (most likely "localhost" if you run the NER server on the same machine).
Using the Stanford tagger with NLTK is about 100 times slower!
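Once the server is up, a quick way to check that it is reachable is to query it with the sner client installed earlier (the host and port below are the ones from the example command; adjust them to your setup):

from sner import Ner

# Connect to the Stanford NER server started above.
ner_tagger = Ner(host="localhost", port=9199)
# Prints a list of (token, entity_label) tuples, e.g. [('Dublin', 'LOCATION'), ...]
print(ner_tagger.get_entities("Dublin is the capital of Ireland"))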
All the parameters are set in the hashtagger_config.py file as attributes of a Config class object.
Set STANFORD_CLASSPATH, STANFORD_MODELS, NLTK_DATA_PATH to the corresponding paths.
Set ES_HOST_ARTICLE, ES_ARTICLE_INDEX_NAME, ES_HOST_TWEETS, ES_TWEET_INDEX_NAME.
The other parameters are configured to sensible values, but feel free to change them.
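As an illustration, the relevant part of hashtagger_config.py could look roughly like the sketch below; the real Config class already defines these attributes (and many more), so only the values need to be adapted, and the exact expected value format (e.g. for the Elasticsearch hosts) should be checked against the file itself. The NER host/port attribute names shown at the end are hypothetical:

class Config:
    # Stanford tools and NLTK data locations (placeholder paths)
    STANFORD_CLASSPATH = "/opt/stanford-postagger/stanford-postagger.jar"
    STANFORD_MODELS = "/opt/stanford-postagger/models"
    NLTK_DATA_PATH = "/opt/nltk_data"
    # Elasticsearch hosts and index names for the article and tweet collections
    ES_HOST_ARTICLE = "localhost:9200"
    ES_ARTICLE_INDEX_NAME = "articles"
    ES_HOST_TWEETS = "localhost:9200"
    ES_TWEET_INDEX_NAME = "tweets"
    # Hypothetical attribute names for the Stanford NER server host/port mentioned
    # above -- check hashtagger_config.py for the actual names.
    STANFORD_NER_HOST = "localhost"
    STANFORD_NER_PORT = 9199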
The parameters that have a significant effect on the tagging consistency and speed are the following:
GLOBAL_TWEET_WINDOW_BEFORE,
GLOBAL_TWEET_WINDOW_AFTER,
GLOBAL_TWEET_SAMPLE_SIZE,
GLOBAL_TWEET_SAMPLE_RANDOM_FLAG,
GLOBAL_HASHTAG_TWEET_SAMPLE_SIZE,
GLOBAL_ARTICLE_WINDOW_BEFORE,
GLOBAL_ARTICLE_WINDOW_AFTER,
COLDSTART_FLAG,
COLDSTART_ARTICLE_WINDOW_BEFORE,
COLDSTART_ARTICLE_WINDOW_AFTER,
COLDSTART_N_TWEETS_PER_NEIGHBOUR_ARTICLE,
LOCAL_TWEET_WINDOW_BEFORE,
LOCAL_TWEET_WINDOW_AFTER,
LOCAL_TWEET_SAMPLE_SIZE,
HASHTAG_WINDOW_TWEET_SAMPLE_SIZE.
The 'default' values are not the optimal ones, but the ones used in the online version of Hashtagger+
described in DOI:10.1109/TKDE.2017.2754253.
Default values marked with a following '*' indicate that the value is not used in online Hashtagger+ as is,
but is equivalent to the one resulting from the periodic processes in Hashtagger+ (crawling tweets from Twitter
every 5 minutes, assigning tweets to articles every 15 minutes, tagging articles every 15 minutes).
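For instance, the sample-size and flag parameters could be adjusted as sketched below. The values are placeholders, not recommendations; the comments reflect the intuitive reading of the parameter names rather than documented semantics, and the units of the window parameters should be checked in hashtagger_config.py:

# In hashtagger_config.py (sketch, placeholder values):
GLOBAL_TWEET_SAMPLE_SIZE = 10000        # larger sample -> slower, but more stable global stats
GLOBAL_TWEET_SAMPLE_RANDOM_FLAG = True  # draw the global tweet sample at random
LOCAL_TWEET_SAMPLE_SIZE = 1000          # tweets sampled per article
COLDSTART_FLAG = True                   # toggle the cold-start step (see the COLDSTART_* parameters above)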
NOTE! The tagging quality is strongly dependent on the keyword extraction from articles and tweets.
Logging level, console/file output and format can be configured in the very same file.
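A sketch of the kind of logging settings meant here, using standard logging levels; the attribute names below are hypothetical and the actual ones should be looked up in hashtagger_config.py:

import logging

# Hypothetical attribute names (check hashtagger_config.py for the real ones).
LOG_LEVEL = logging.INFO      # verbosity of the logger
LOG_TO_CONSOLE = True         # print log records to the console
LOG_TO_FILE = True            # also write them to the logs/ folder
LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"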
This implementation requires a collection of tweets indexed in Elasticsearch.
See es_tweets.py for the required mapping
(although not all the fields are necessary and the current mapping also supports another tool in our lab).
es_tweets.py includes an import_web_archive_tweets() function for indexing tweets from WebArchive tweet collections.
Tip: downloading the WebArchive tweets is usually much faster with the torrent than with the direct download link.
N.B. Twitter’s language classification metadata is available in the archive beginning on March 26, 2013.
Nevertheless, the 'lang' field appears in tweets as early as December 2012.
A collection of high quality news-related hashtagged tweets is available for 15.07.2015-24.05.2017 at https://doi.org/10.6084/m9.figshare.7932422.
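A sketch of how such a collection might be indexed with the helper mentioned above; import_web_archive_tweets() does exist in es_tweets.py, but the argument shown here is hypothetical, so check the function's actual signature before use:

from es_tweets import import_web_archive_tweets

# Index a downloaded WebArchive tweet collection into the tweet index configured in
# hashtagger_config.py. The path argument is hypothetical; see es_tweets.py for the
# real signature.
import_web_archive_tweets("/path/to/webarchive_tweets")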
The expected format is a text file where each line is a JSON object of a document that needs to be tagged. For efficient execution, the documents must be ordered in time, which allows the global time window statistics to be computed less frequently.
The expected fields are "id", "headline", "subheadline", "content", "epoch", "url", "source", "type".
Note that these fields are not mandatory at the input file level, but they are required by the code. If a subset of them is missing or has different names, the corresponding mappings can be defined in a function and passed as the article_constructor_function argument to the load_articles_from_json_lines() function. The document's unix timestamp accessor can likewise be defined as a function and passed as the article_epoch_accessor argument to load_articles_from_json_lines(), so don't worry if your dataset has document timestamps in milliseconds or in a textual format. Example mapping functions can be found in the main script; a sketch is also shown below.
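For example, for a hypothetical dataset that stores titles under "title", bodies under "body" and millisecond timestamps under "published_ms", the mapping functions could look like this. The dict returned by the constructor and the exact call shape are assumptions; check the example scripts and load_articles_from_json_lines() for the expected types:

# load_articles_from_json_lines is imported/defined as in the example scripts.

def my_article_constructor(doc):
    # Map the dataset's own field names onto the fields expected by the tagger.
    return {
        "id": doc["article_id"],
        "headline": doc["title"],
        "subheadline": doc.get("subtitle", ""),
        "content": doc["body"],
        "epoch": int(doc["published_ms"]) // 1000,  # milliseconds -> unix seconds
        "url": doc["link"],
        "source": doc.get("publisher", "unknown"),
        "type": "article",
    }

def my_article_epoch_accessor(doc):
    # The document timestamp in unix seconds, converted from milliseconds.
    return int(doc["published_ms"]) // 1000

articles = load_articles_from_json_lines(
    articles_json_path,
    article_constructor_function=my_article_constructor,
    article_epoch_accessor=my_article_epoch_accessor,
)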
For now it's recommended to create a dedicated script for each input document collection.
Two example scripts, tag_irish_articles.py and
tag_wapost_articles.py, are provided as templates.
For each input document collection, please specify the input file path (articles_json_path),
document's unix timestamp accessor (article_epoch_accessor)
and article field mapping function (article_constructor_function) in the script
and run the code from the terminal, e.g.
python tag_irish_articles.py
or
python tag_wapost_articles.py
The logs are saved in the logs/ folder. Depending on the configuration, the logs written to the file and those printed to the console may differ.
It is worth repeating that the tagging quality and coverage are strongly dependent on the keyword extraction from articles and tweets. The tagging algorithm is of secondary importance...
Another aspect (related to keyword extraction) that influences the tagging performance is how tweets are matched to articles.
Tweets are matched to the n-grams in the article's "stream_keywords" field
(the name comes from the original Hashtagger+, where these n-grams were used to stream tweets from Twitter with the Streaming API).
The original Hashtagger+ filtered the tweets by requiring all of the n-gram terms to appear in a tweet in any order.
This corresponds to setting COLDSTART_TWEET_NGRAM_MATCH_MODE="must" and LOCAL_TWEET_NGRAM_MATCH_MODE="must" in the configuration file.
To relax the constraint and allow a tweet to match any of the n-gram terms, set the aforementioned parameters to "should" instead.
Currently, none of the available modes require preserving the matching terms' order in the n-grams.
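For example, in the configuration file:

# Relax the original "all n-gram terms must match" behaviour so that a tweet
# matching any term of an article n-gram is retrieved.
COLDSTART_TWEET_NGRAM_MATCH_MODE = "should"
LOCAL_TWEET_NGRAM_MATCH_MODE = "should"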
The original real-time Hashtagger+ provided multiple hashtag recommendations per article.
More precisely, up to 10 hashtags would be recommended every 15 minutes for a duration of 24 hours, resulting in 96 recommendations of up to 10 distinct hashtags each.
For each recommendation, a new set of tweets was added to the article's tweet set.
Each subsequent recommendation was therefore made on a larger tweet collection that included all the tweets from the previous recommendations.
This means that although the newer recommendations would capture the change in social discussions, this change would come with some inertia.
In the offline setting, multiple recommendations can be made either with or without inertia, by selecting overlapping or non-overlapping tweet windows respectively.
- fix the bug in the stream_keywords extraction and article profile extraction
- add examples of tagging Spinn3r documents, Signal Media One-Million News Articles, Signi-Trend articles and more
- improve the tweet assignment for cases when fewer than <SOME_THRESHOLD> tweets are assigned to an article... increase the time window and/or include more keywords to make sure all articles have (enough) tweets assigned to them
- fix the suboptimal routine in the global window article sample retrieval for cases when there are fewer articles than requested in the random sample
- add some labeled data on which the tool's performance can be evaluated quantitatively
- add a pseudocode and a diagram explaining the tagging process and the key differences from the original online (real-time) Hashtagger+
- index tweets in their native format, instead of the custom one, possibly with some additional fields like n_hashtags [breaking the backward-compatibility with legacy code]
- move to object-oriented code