Kawarith

an Arabic Twitter Corpus for Crisis Events

Data Description

The Kawarith corpus comprises Arabic tweets from 22 crisis events that occurred between October 2018 and September 2020. Kawarith focuses on high- to medium-risk events, which are the most likely to trigger substantial Twitter activity, and covers a wide range of hazard types, including floods, shootings, bombings, wildfires, pandemics, sandstorms and explosions.

Repository Structure

  1. Unlabelled Corpus: A large-scale crisis-related Arabic Twitter corpus of 1,658,795 unique tweets from 22 emergency events.
    Each folder contains tweet IDs collected during a specific crisis along with query terms and collection dates.

  2. Labelled Data: A gold-standard dataset comprising ~12k unique tweets from seven events: the Jordan floods, Kuwait floods-18, Hafr Albatin floods-19, the Cairo bombing, the Dragon storms, the Beirut explosion and Covid-19. Apart from Covid-19, which was labelled by relatedness to the event, tweets were annotated by information type using a multi-label scheme. A copy of the annotation instructions (translated into English) has been uploaded.
    There is a folder for each event containing the following:
    a) tweet IDs and their assigned labels.
    b) tweet IDs for train and test sets.

Rehydrating Tweets (for research purposes)

To comply with Twitter’s policies, only tweet IDs are published. You can retrieve the complete tweet objects using hydration tools such as Twarc or Hydrator.
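
As a rough illustration of hydration, the sketch below reads tweet IDs from a file and passes them to a Twarc client. It assumes the ID file holds one numeric tweet ID per line (check the actual layout of the files in each event folder) and that you have your own Twitter API credentials. The file name and the helper names read_tweet_ids and hydrate_file are hypothetical and not part of this repository; the only Twarc calls used are the Twarc constructor and its hydrate method from twarc version 1.

```python
from typing import Any, Dict, Iterator


def read_tweet_ids(path: str) -> Iterator[str]:
    """Yield unique tweet IDs from a text file.

    Assumption: one numeric tweet ID per line. Blank lines and
    non-numeric lines (for example a header row) are skipped.
    """
    seen = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tweet_id = line.strip()
            if tweet_id.isdigit() and tweet_id not in seen:
                seen.add(tweet_id)
                yield tweet_id


def hydrate_file(path: str, client: Any) -> Iterator[Dict[str, Any]]:
    """Yield full tweet objects for the IDs stored in `path`.

    `client` is expected to be a twarc.Twarc instance (twarc v1), whose
    hydrate() method accepts an iterable of tweet IDs and yields tweet
    dictionaries. Deleted or protected tweets are simply not returned.
    """
    yield from client.hydrate(read_tweet_ids(path))


# Example usage (requires `pip install twarc` and valid credentials):
#
#   from twarc import Twarc
#   client = Twarc(consumer_key, consumer_secret,
#                  access_token, access_token_secret)
#   for tweet in hydrate_file("tweet_ids.txt", client):  # hypothetical file
#       print(tweet["id_str"])
```

Tweets that have been deleted or made private since collection cannot be retrieved, so the hydrated set may be smaller than the published ID list.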

How to cite this resource:

Alharbi, Alaa, and Mark Lee. "Kawarith: an Arabic Twitter Corpus for Crisis Events." Proceedings of the Sixth Arabic Natural Language Processing Workshop. 2021.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.