Tidies Twitter JSON collected with Twarc into relational tables.
The resulting SQLite database is ideal for importing into analytical tools, or for use as a data source in a programmatic analytical workflow, which is more efficient than working directly from the raw JSON. However, we always recommend retaining the raw JSON data - think of tidy_tweet and its resulting databases as the first step of data pre-processing, rather than as the original/raw data for your project.
WARNING - tidy_tweet is still at a preliminary release: not all data fields are loaded into the database, and we can't guarantee that there will be no breaking changes to either the library interface or the database schema before the 1.0 release. Most notably, the database schema will change significantly to allow multiple JSON files to be loaded into the same database file.
- Collecting Twitter Data
- Input and Output
- Prerequisites
- Installation
- Usage
- Feedback and Contributions
- About tidy_tweet
If you do not have a preferred Twitter collection tool already, we recommend Twarc. tidy_tweet is designed to work directly with Twarc output. Other collection methods may work with tidy_tweet as long as they output the API result from Twitter with minimal alteration (see Input and Output); however, at this time we do not have the resources to support Twitter data outputs from tools other than Twarc.
tidy_tweet takes as input a series of JSON/dict objects, each of which is a page of Twitter API v2 search or timeline results. Typically, this will be a JSON file such as those output by twarc2 search. At present, API endpoints oriented around things other than tweets, such as the liking-users endpoint, are not properly supported, though we hope to support them in future.
JSON files with multiple pages of results are expected to be newline-delimited, with each line being a distinct results page object, and no commas between top-level objects.
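As an illustration of the expected format, each line of such a file should parse as a complete JSON object on its own. The sketch below uses a hypothetical filename and assumes each page has the "data" array of a v2 search response; it simply checks that each line parses and counts the tweets on each page:

import json

# Hypothetical filename - substitute your own twarc2 output file
with open("my_search_results.json", encoding="utf-8") as pages_file:
    for page_number, line in enumerate(pages_file, start=1):
        if not line.strip():
            continue  # skip any blank lines
        page = json.loads(line)  # each line must parse as a complete results page
        print(f"Page {page_number}: {len(page.get('data', []))} tweets")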
After processing your Twitter results pages with tidy_tweet (see Usage), you will have an SQLite database file at the location you specified.
See the current database schema.
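If you want to inspect the schema of a database you have already generated, you can also list its tables directly using Python's built-in sqlite3 module. The exact table names depend on your tidy_tweet version; the tweet table used in the library example further below is one of them, and my_dataset.db is a hypothetical filename:

import sqlite3

# Hypothetical database name - substitute your own tidy_tweet output file
with sqlite3.connect("my_dataset.db") as connection:
    cursor = connection.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    for (table_name,) in cursor.fetchall():
        print(table_name)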
- Python 3.8+
- A command line shell/terminal, such as bash, Mac Terminal, Git Bash, Anaconda Prompt, etc
This tool requires Python 3.8 or later. The instructions assume you already have Python installed; if you haven't installed Python before, you might find Python for Beginners helpful. Note that tidy_tweet is a command line application: you don't need to write any Python code to use it (although you can if you want to), you just need to be able to run Python code!
The instructions assume sufficient familiarity with using a command line to change directories, list files and find their locations, and execute commands. If you are new to the command line or want a refresher, there are some good lessons from Software Carpentry and the Programming Historian.
The instructions assume you are working in a suitable Python virtual environment. RealPython has a relatively straightforward primer on virtual environments if you are new to the concept. If you installed Python with Anaconda/conda, you will want to manage your virtual environments through Anaconda/conda as well. If you have a virtual environment already set up for using Twarc, you can install tidy_tweet in that same environment.
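As a sketch, creating and activating a fresh virtual environment outside of conda typically looks like the following, where tidy_tweet_env is just an example name:

python -m venv tidy_tweet_env
source tidy_tweet_env/bin/activate

(On Windows, activate with tidy_tweet_env\Scripts\activate instead.)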
tidy_tweet is a Python package and can be installed with pip.
- Ensure you are using an appropriate Python or Anaconda environment (see Prerequisites)
- Install tidy_tweet and its requirements by running:
  python -m pip install tidy_tweet
- Run the following to check that your environment is ready to run tidy_tweet:
  tidy_tweet --help
If you wish to install a specific version of tidy_tweet, for example to replicate past results, you can specify the desired version when installing with pip. For example, to install tidy_tweet version 1.0.1 (which does not currently exist):
python -m pip install tidy-tweet==1.0.1
tidy_tweet may be used either as a command line application or as a Python library. The command line interface (CLI) is recommended for general use and is intended to be more straightforward to use. The Python library interface is designed for use cases such as integrating tidy_tweet usage into other tools, scripts, and notebooks.
After installing tidy_tweet, you should be able to run tidy_tweet as a command line application:
tidy_tweet --help
Running the above will show you a summary of how to use the tidy_tweet command line interface (CLI). The tidy_tweet CLI expects you to provide specific arguments in a specific order, as follows:
tidy_tweet DATABASE JSON_FILE
DATABASE: This is the filename where you want to save the tidied data as a database. As this is an SQLite database, it is conventional for the filename to end in ".db". Example: my_dataset.db
JSON_FILE: This is the file of tweets you wish to tidy into the database. For more information, see Input and Output. Example: my_search_results.json
Example:
tidy_tweet tree_search_2022-02-22.db tree_search_2022-02-22.json
tidy_tweet can accept more than one JSON file at a time. If you have multiple JSON files, for example resulting from different search terms or Twitter accounts, you can list them all in a single tidy_tweet command:
tidy_tweet DATABASE JSON_FILE_1 JSON_FILE_2 JSON_FILE_3
For example:
tidy_tweet tree_searches_2022-02-22.db pine_tree_2022-02-22.json eucalypt_2022-02-22.json jacaranda_2022-02-22.json
At present, there is no metadata to tell what data came from which file, but we plan to fix this soon!
Here is an example using the test data file included with tidy_tweet:
from tidy_tweet import initialise_sqlite, load_twarc_json_to_sqlite
import sqlite3

# Initialise the database, then load a file of tweets into it
initialise_sqlite('ObservatoryTeam.db')
load_twarc_json_to_sqlite('tests/data/ObservatoryTeam.jsonl', 'ObservatoryTeam.db')

# Query the resulting database like any other SQLite database
with sqlite3.connect('ObservatoryTeam.db') as connection:
    db = connection.cursor()
    db.execute("select count(*) from tweet")
    print(f"There are {db.fetchone()[0]} tweets in the database!")
We appreciate all feedback and contributions!
Found an issue with tidy_tweet? Find out how to let us know
Interested in contributing? Find out more in our contributing.md
Some of this documentation is copied from Gab Tidy Data, and much of the structure and functionality is also modelled on gab_tidy_data, which was our initial foray into developing a tool like this.
Tidy_tweet is created and maintained by the QUT Digital Observatory and is open-sourced under an MIT license. We welcome contributions and feedback!
A DOI and citation information will be added in future.