The set of scripts which compounds this tool can be classified into 4 different sections:
usage: get-project-list.py [-h] --input-file INPUT_FILE --output-file
OUTPUT_FILE [--log-file LOG_FILE] [-g]
Adapt projects.csv file from GHTorrent dump with preliminary filter(s)
optional arguments:
-h, --help show this help message and exit
--input-file INPUT_FILE
Input projects CSV file
--output-file OUTPUT_FILE
Output projects CSV file
--log-file LOG_FILE Path to log file
-g, --debug Enables debug mode
usage: github-api.py [-h] --github-token GITHUB_TOKEN --projects-file
PROJECTS_FILE [--log-file LOG_FILE] [-g]
Extracts git trees (list of files) from GitHub repositories
optional arguments:
-h, --help show this help message and exit
--github-token GITHUB_TOKEN
GitHub token
--projects-file PROJECTS_FILE
Projects file
--log-file LOG_FILE Path to log file
-g, --debug Enables debug mode
usage: github-tree.py [-h] --heuristics-file HEURISTICS_FILE --trees-path
TREES_PATH [--log-file LOG_FILE] [-g]
Look for patterns and heuristics into Git-trees and return a list of positive
results
optional arguments:
-h, --help show this help message and exit
--heuristics-file HEURISTICS_FILE
File with patterns and other heuristics
--trees-path TREES_PATH
Path to folder containing trees information
--log-file LOG_FILE Log file
--output-file OUT_FILE
Path to output hits file
-g, --debug Enables debug mode
Here is an example of the heuristics file for this script (YAML format):
---
# 1. Positive extensions for interesting files
level-one_exts:
- jpg
- jpeg
- png
- gif
- py
- js
# 2. Extensions marked as positive only if file-name contains a keyword (See 3.)
level-two_exts:
- svg
- txt
- pdf
- html
- json
- yml
# 3. Keywords
keywords:
- index
- views
- models
- urls
- img
- server
- client
usage: hits2urls.py [-h] --json-path JSON_PATH --projects-file PROJECTS_FILE
--hits-file HITS_FILE [--output-file OUTPUT_FILE]
[--log-file LOG_FILE] [-g]
Converts positive results into URLs pointing to its raw files in GitHub
optional arguments:
-h, --help show this help message and exit
--json-path JSON_PATH
Path where github-api JSONS are stored
--projects-file PROJECTS_FILE
Projects file which was used with github-api
--hits-file HITS_FILE
Path to the output file of github-tree
--output-file OUTPUT_FILE
Path to store the URLs output file
--log-file LOG_FILE Path to log file
-g, --debug Enables debug mode
usage: perceval-handler.py [-h] --github-token GITHUB_TOKEN --urls-file
URLS_FILE --output-path OUTPUT_PATH --perceval-path
PERCEVAL_PATH [--log-file LOG_FILE] [-c] [-g]
Calls GrimoireLab-Perceval to extract git information from the output file of
hits2urls.py script
optional arguments:
-h, --help show this help message and exit
--github-token GITHUB_TOKEN
GitHub token
--urls-file URLS_FILE
Path to URLs file (output from hits2urls.py)
--output-path OUTPUT_PATH
Path where Perceval JSONs will be saved into
--perceval-path PERCEVAL_PATH
Path where Perceval store its cache information
--log-file LOG_FILE Path to log file
-c, --keep-cache Keep Perceval cache
-g, --debug Enables debug mode
usage: projects2sql.py [-h] --db-name DB_NAME --json-path JSON_PATH
--urls-file INPUT_FILE [--output-path OUTPUT_PATH]
[--log-file LOG_FILE] [--avoid-fw] [-g]
Generate SQL files extrating data from Perceval JSON files
optional arguments:
-h, --help show this help message and exit
--db-name DB_NAME Database name
--json-path JSON_PATH
Path where Perceval JSONs are stored into
--urls-file INPUT_FILE
Path to input URLs file produced with hits2urls script
--output-path OUTPUT_PATH
Path where SQL files are stored into
--log-file LOG_FILE Path to log file
--avoid-fw Avoid `Framework`-type projects
-g, --debug Enables debug mode
usage: ghtorrent-users2sql.py [-h] --input-file INPUT_FILE --db-name DB_NAME
[--output-path OUT_PATH] [--log-file LOG_FILE]
[-g]
Converts the USERS table from GHTorrent (csv) into a SQL script
optional arguments:
-h, --help show this help message and exit
--input-file INPUT_FILE
CSV file of USERS table from GHTorrent
--db-name DB_NAME Database name
--output-path OUT_PATH
Path where users.sql script will be stored
--log-file LOG_FILE Path to log file
-g, --debug Enables debug mode
SQL script to create the structure of the MySQL database where the SQL data have to be imported. It is necessary to edit this file in order to set up the database name to match with the parameter --db-name
from last scripts (By default, it is set to my_database
).
Note: pip3
package is needed
pip3 install -r requirements.txt