WikiTeams™ GitHub dataset creator

This software is available to the public.

Copyright (C) 2013 - WikiTeams contributors

This script retrieves details of the "most popular repositories" using the GitHub API.

The program takes CSV files as input, with the columns: name, owner, forks, watchers.
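As a rough illustration of the input format, the sketch below reads such a file with Python's standard `csv` module. It assumes a comma delimiter, no header row, and the column order given above. The function name `read_repositories` and the sample rows are hypothetical and are not taken from the script itself.

```python
import csv
import io


def read_repositories(csv_file, fieldnames=("name", "owner", "forks", "watchers")):
    """Yield one dict per repository row, with forks and watchers as integers.

    Assumption: the file has no header row and uses the column order above.
    """
    reader = csv.DictReader(csv_file, fieldnames=fieldnames)
    for row in reader:
        yield {
            "name": row["name"],
            "owner": row["owner"],
            "forks": int(row["forks"]),
            "watchers": int(row["watchers"]),
        }


# Made-up sample rows, used only to show the expected shape of the data.
sample = io.StringIO("example-repo,example-owner,10,25\nother-repo,other-owner,3,7\n")
repos = list(read_repositories(sample))
print(repos[0]["name"], repos[0]["forks"])
```

With a real input file, `open(path, newline="")` can be passed in place of the `io.StringIO` object.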

The data contains around 27,000 repositories parsed from Google BigQuery (as of 15.09.2013).

The output is a set of CSV files with the following dimensions:

  1. Repository size
  2. Commits count
  3. Commit count per skill (multiple variables)
  4. Star count (indicator of team quality)
  5. Notifications count (number of people receiving notifications)
  6. Count of issues grouped by issue type
  7. Number of people in each role (Issuer -> Wiki (other content) -> Developer (commit) -> Owner)
  8. Median issue closure time, grouped by issue type
  9. Count of Unwatch events
  10. Number of Pull Requests
  11. Number of accepted Pull Requests
  12. Number of Forks
  13. Number of Branches
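
To make dimension 8 more concrete, here is a minimal sketch of a grouped median, reading that dimension as the median time needed to close an issue (an assumption). The input shape, a list of `(issue_type, opened_at, closed_at)` tuples, and the name `median_closure_hours` are hypothetical. The actual script may derive these values differently.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_closure_hours(issues):
    """Return {issue_type: median hours from opening to closing}.

    Issues that are still open (closed_at is None) are skipped.
    """
    durations = defaultdict(list)
    for issue_type, opened_at, closed_at in issues:
        if closed_at is None:
            continue
        hours = (closed_at - opened_at).total_seconds() / 3600.0
        durations[issue_type].append(hours)
    return {issue_type: median(values) for issue_type, values in durations.items()}


# Made-up issues, used only to illustrate the calculation.
issues = [
    ("bug", datetime(2013, 9, 1, 8), datetime(2013, 9, 1, 10)),
    ("bug", datetime(2013, 9, 2, 8), datetime(2013, 9, 2, 12)),
    ("bug", datetime(2013, 9, 3, 8), datetime(2013, 9, 3, 18)),
    ("enhancement", datetime(2013, 9, 1, 0), datetime(2013, 9, 2, 0)),
    ("enhancement", datetime(2013, 9, 5, 0), None),
]
medians = median_closure_hours(issues)
print(medians)
```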

Usage example:

with resume mechanism

There is an option to resume from a given repository, which is helpful because of the GitHub API quota limits. Pass the repository name and owner (comma separated) as the argument; the pair must exist in the CSV input. The program will take care of job progress.

nohup python --resume=name,owner &
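
One way such a resume step could work is sketched below: the `name,owner` value is split into a pair, and input rows are skipped until that pair is found. This is an illustration under assumptions, not the script's actual code. The function names are hypothetical, and it is assumed that processing restarts with the named repository itself rather than the one after it.

```python
def parse_resume_argument(value):
    """Split a 'name,owner' string into a (name, owner) tuple."""
    name, owner = [part.strip() for part in value.split(",", 1)]
    return name, owner


def skip_until(repositories, resume_from):
    """Yield repositories starting at the one matching resume_from.

    resume_from is a (name, owner) tuple, or None to start from scratch.
    Raises ValueError if the requested repository is not in the input.
    """
    found = resume_from is None
    for repo in repositories:
        if not found and (repo["name"], repo["owner"]) == resume_from:
            found = True
        if found:
            yield repo
    if not found:
        raise ValueError("resume point %s,%s not found in input" % resume_from)


# Made-up repository list, used only to show the skipping behaviour.
all_repos = [
    {"name": "a", "owner": "x"},
    {"name": "b", "owner": "y"},
    {"name": "c", "owner": "z"},
]
resume_from = parse_resume_argument("b,y")
remaining = [repo["name"] for repo in skip_until(all_repos, resume_from)]
print(remaining)
```

Raising an error when the pair is missing mirrors the requirement that the given repository must exist in the CSV input.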

without resume mechanism (start from scratch)

nohup python &

Configuration files

pass.txt (not provided)

File which holds the authentication credentials for the GitHub API.


login or token




mail_pass.txt (not provided)

Holds the authentication credentials for the SMTP server. The program reports quota use by email.





You can use the one provided in our repo. It holds the configuration for the logging mechanism.


