The goal of this plugin is to mirror GHTorrent.
This stage reduces all commits to stats objects keyed by commit date at hourly granularity. A stats object has the following shape:
case class Stats(date: String, commitStats: CommitStats)
case class CommitStats(totalCommits: Long, filesEdited: List[Extension])
case class Extension(name: String,
                     additions: Long,
                     deletions: Long,
                     added: Long,
                     removed: Long,
                     modified: Long)
In summary, every commit is keyed by date and hour (e.g. 2019-01-01 09), after which its stats are derived and merged. Finally, these stats are saved (and updated) in MongoDB.
Example stats object:
{
  "date" : "2013-02-06 11",
  "commitStats" : {
    "totalCommits" : 1,
    "filesEdited" : [
      {
        "name" : "js",
        "additions" : 28,
        "deletions" : 23,
        "added" : 0,
        "removed" : 0,
        "modified" : 1
      },
      {
        "name" : "css",
        "additions" : 1,
        "deletions" : 2,
        "added" : 0,
        "removed" : 0,
        "modified" : 1
      }
    ]
  }
}
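The keying and merging step described above could be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the case classes are repeated from above, while `StatsReducer`, `hourKey`, `mergeExtension`, `merge` and `reduceByHour` are hypothetical names. It also assumes that the hourly key is formatted in UTC and that merging means summing every counter per extension, neither of which is stated here.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Same shapes as the case classes shown above.
case class Extension(name: String, additions: Long, deletions: Long,
                     added: Long, removed: Long, modified: Long)
case class CommitStats(totalCommits: Long, filesEdited: List[Extension])
case class Stats(date: String, commitStats: CommitStats)

// Hypothetical helper object; names are illustrative only.
object StatsReducer {
  // Assumption: hourly keys use the pattern "yyyy-MM-dd HH" in UTC.
  private val hourFormat =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH").withZone(ZoneOffset.UTC)

  // Truncate a commit timestamp to its hourly key, e.g. "2013-02-06 11".
  def hourKey(commitDate: Instant): String = hourFormat.format(commitDate)

  // Assumption: two counters for the same extension are summed field by field.
  def mergeExtension(a: Extension, b: Extension): Extension =
    Extension(a.name,
      a.additions + b.additions, a.deletions + b.deletions,
      a.added + b.added, a.removed + b.removed, a.modified + b.modified)

  // Add the commit counts and combine the per-extension counters by name.
  def merge(a: CommitStats, b: CommitStats): CommitStats = {
    val extensions = (a.filesEdited ++ b.filesEdited)
      .groupBy(_.name)
      .values
      .map(_.reduce(mergeExtension))
      .toList
      .sortBy(_.name) // sorted only to make the result deterministic
    CommitStats(a.totalCommits + b.totalCommits, extensions)
  }

  // Reduce per-commit stats to one stats object per hourly key.
  def reduceByHour(perCommit: Seq[Stats]): List[Stats] =
    perCommit
      .groupBy(_.date)
      .map { case (date, group) =>
        Stats(date, group.map(_.commitStats).reduce(merge))
      }
      .toList
      .sortBy(_.date)
}
```

In this sketch the reduction runs in memory over a finite batch. In a streaming job, the same `merge` function would more likely be applied incrementally to the document already stored in MongoDB for that key.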
The idea of this stage is to enrich Commit data with the corresponding PushEvent (if available), i.e. to add the push_id and push_date to a Commit.
The enriched commit is represented by the following case classes:
case class Pushed(push_id: Long,
                  push_date: Date,
                  pushed_from_github: Boolean = false)
case class EnrichedCommit(push: Option[Pushed], commit: Commit)
All data is pushed into the cf_commit topic.
The flow diagram of this low-level join can be found below:
The potential outcomes are:
- A PushEvent is found and the Commit is enriched with push_id and push_date.
- No PushEvent is found, but it is derived that the Commit was pushed directly from GitHub. The Commit is then enriched with no push_id and with push_date == commit_date.
- No PushEvent is found and the Commit is forwarded without Pushed data.
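
The three outcomes can be expressed as a single pattern match. The sketch below is only an illustration under several assumptions: `Commit` and `PushEvent` are minimal hypothetical stand-ins for the real types, `CommitEnricher.enrich` and its `pushedFromGitHub` flag are invented names, and the value `-1` used to represent "no push_id" is a placeholder chosen here, since the actual encoding is not specified. How the job decides that a commit was pushed directly from GitHub is also left out.

```scala
import java.util.Date

// Hypothetical, minimal stand-ins for the real Commit and PushEvent types.
case class Commit(sha: String, commitDate: Date)
case class PushEvent(pushId: Long, createdAt: Date)

// Same shapes as the case classes shown above.
case class Pushed(push_id: Long,
                  push_date: Date,
                  pushed_from_github: Boolean = false)
case class EnrichedCommit(push: Option[Pushed], commit: Commit)

// Hypothetical helper object; names are illustrative only.
object CommitEnricher {
  // Assumption: -1 marks a missing push_id; the real encoding is not specified.
  val NoPushId: Long = -1L

  def enrich(commit: Commit,
             pushEvent: Option[PushEvent],
             pushedFromGitHub: Boolean): EnrichedCommit =
    pushEvent match {
      // Outcome 1: a PushEvent was found, so copy its id and date.
      case Some(event) =>
        EnrichedCommit(Some(Pushed(event.pushId, event.createdAt)), commit)
      // Outcome 2: no PushEvent, but the commit was made on GitHub itself,
      // so push_date is set equal to the commit date.
      case None if pushedFromGitHub =>
        EnrichedCommit(
          Some(Pushed(NoPushId, commit.commitDate, pushed_from_github = true)),
          commit)
      // Outcome 3: nothing is known about the push; forward the commit as-is.
      case None =>
        EnrichedCommit(None, commit)
    }
}
```

Modelling the missing push as `Option[Pushed]` keeps the third outcome distinguishable from the second without a sentinel on the whole record.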