Skip to content

QUT-Digital-Observatory/tidy_tweet

Repository files navigation

tidy_tweet

Tidies Twitter json collected with Twarc into relational tables.

The resulting SQLite database is ideal for importing into analytical tools, or for using as a datasource for a programmatic analytical workflow that is more efficient than working directly from the raw JSON. However, we always recommend retaining the raw JSON data - think of tidy_tweet and its resulting databases as the first step of data pre-processing, rather than as the original/raw data for your project.

WARNING - tidy_tweet is still released in a preliminary version, not all data fields are loaded into the database, and we can't guarantee no breaking changes either of library interface or database schema before 1.0 release. Most notably, the database schema will have a significant change to allow multiple JSON files to be loaded into the same database file.

Contents

Collecting Twitter data

If you do not have a preferred Twitter collection tool already, we recommend Twarc. tidy_tweet is designed to work directly with Twarc output. Other collection methods may work with tidy_tweet as long as they output the API result from Twitter with minimal alteration (see Input and Output), however at this time we do not have the resources to support Twitter data outputs from tools other than Twarc.

Input and Output

Input: Twitter tweet results pages

tidy_tweet takes as input a series of JSON/dict objects, each object of which is a page of Twitter API v2 search or timeline results. Typically, this will be a JSON file such as those output by twarc2 search. At present, API endpoints oriented around things other than tweets, such as the liking-users endpoint, are not properly supported, though we hope to support them in future.

JSON files with multiple pages of results are expected to be newline-delimited, with each line being a distinct results page object, and no commas between top-level objects.

Output: Sqlite database of tweets and metadata

After processing your Twitter results pages with tidy_tweet (see Usage), you will have an SQLite database file at the location you specified.

See the current database schema.

Prerequisites

  • Python 3.8+
  • A command line shell/terminal, such as bash, Mac Terminal, Git Bash, Anaconda Prompt, etc

This tool requires Python 3.8 or later, the instructions assume you already have Python installed. If you haven't installed Python before, you might find Python for Beginners helpful - note that tidy_tweet is a command line application, you don't need to write any Python code to use it (although you can if you want to), you just need to be able to run Python code!

The instructions assume sufficient familiarity with using a command line to change directories, list files and find their locations, and execute commands. If you are new to the command line or want a refresher, there are some good lessons from Software Carpentry and the Programming Historian.

The instructions assume you are working in a suitable Python virtual environment. RealPython has a relatively straightforward primer on virtual environments if you are new to the concept. If you installed Python with Anaconda/conda, you will want to manage your virtual environments through Anaconda/conda as well. If you have a virtual environment already set up for using Twarc, you can install tidy_tweet in that same environment.

Installation

tidy_tweet is a Python package and can be installed with pip.

  1. Ensure you are using an appropriate Python or Anaconda environment (see Prerequisites)

  2. Install tidy_tweet and its requirements by running:

    python -m pip install tidy_tweet
  3. Run the following to check that your environment is ready to run tidy_tweet:

    tidy_tweet --help

If you wish to install a specific version of tidy_tweet, for example to replicate past results, you can specify the desired version when installing with pip, for example to install tidy_tweet version 1.0.1 (which does not currently exist):

python -m pip install tidy-tweet==1.0.1 

Usage

tidy_tweet may be used either as a command line application or as a Python library. The command line interface (CLI) is recommended for general use and is intended to be more straightforward to use. The Python library interface is designed for use cases such as integrating tidy_tweet usage into other tools, scripts, and notebooks.

Command line interface

After installing tidy_tweet, you should be able to run tidy_tweet as a command line application:

tidy_tweet --help

Running the above will show you a summary of how to use the tidy_tweet command line interface (CLI). The tidy_tweet CLI expects you to provide specific arguments in a specific order, as follows:

tidy_tweet DATABASE JSON_FILE

DATABASE: This is the filename where you want to save the tidied data as a database. As this is an SQLite database, it is conventional for the filename to end in ".db". Example: my_dataset.db

JSON_FILE: This is the file of tweets you wish to tidy into the database. For more information, see Input and Output Example: my_search_results.json

Example:

tidy_tweet tree_search_2022-02-22.db tree_search_2022-02-22.json

Loading multiple JSON files into a database

tidy_tweet can accept more than one JSON file at a time. If you have multiple JSON files, for example resulting from different search terms or Twitter accounts, you can list them all in a single tidy_tweet command:

tidy_tweet DATABASE JSON_FILE_1 JSON_FILE_2 JSON_FILE_3

For example:

tidy_tweet tree_searches_2022-02-22.db pine_tree_2022-02-22.json eucalypt_2022-02-22.json jacaranda_2022-02-22.json

At present, there is no metadata to tell what data came from which file, but we plan to fix this soon!

Python library

Here is an example using the test data file included with tidy_tweet:

from tidy_tweet import initialise_sqlite, load_twarc_json_to_sqlite
import sqlite3

initialise_sqlite('ObservatoryTeam.db')
load_twarc_json_to_sqlite('tests/data/ObservatoryTeam.jsonl', 'ObservatoryTeam.db')

with sqlite3.connect('ObservatoryTeam.db') as connection:
    db = connection.cursor()

    db.execute("select count(*) from tweet")

    print(f"There are {db.fetchone()[0]} tweets in the database!")

Feedback and contributions

We appreciate all feedback and contributions!

Found an issue with tidy_tweet? Find out how to let us know

Interested in contributing? Find out more in our contributing.md

Acknowledgements

Some of this documentation is copied from Gab Tidy Data, and much of the structure and functionality is also modelled on gab_tidy_data, which was our initial foray into developing a tool like this.

About tidy_tweet

Tidy_tweet is created and maintained by the QUT Digital Observatory and is open-sourced under an MIT license. We welcome contributions and feedback!

A DOI and citation information will be added in future.