Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong created_at values in commits.csv #97

Open
huichen-cs opened this issue Aug 31, 2020 · 1 comment
Open

Wrong created_at values in commits.csv #97

huichen-cs opened this issue Aug 31, 2020 · 1 comment

Comments

@huichen-cs
Copy link

huichen-cs commented Aug 31, 2020

Looks to me that the most recent dump (2019-06-01) has wrong time stamp. Below is how we may reproduce the error.

In http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2019-06-01.tar.gz, I found the following,

$ grep "2038-01-18" commits.csv
239581704,"1e954ef83c79ccccb8f570d350ea0ff934650fbb",4498938,4498938,18281400,"2038-01-18 23:00:00"
242299182,"0ebc3ea059dfcc8124135b94dfaa40ba7d49e5b2",1340592,1340592,18533076,"2038-01-18 23:00:00"
606676461,"d4cbfbc31940a514f2974405b579ff8989f9a0b6",33691743,33691743,55773517,"2038-01-18 00:00:00"

I loaded the CSV files to a PostgreSQL database, and then do a query. These projects are,

ghtorrent=# SELECT id,url FROM projects WHERE id=18281400 OR id=18533076 OR id=55773517;
    id    |                              url
----------+---------------------------------------------------------------
 18281400 | https://api.github.com/repos/cirosantilli/test-streak-puppet2
 18533076 | https://api.github.com/repos/phy1729/illacceptanything
 55773517 | https://api.github.com/repos/nhm150030/jfox
(3 rows)

Having compared the dates in the CSV file with the commit log in these projects. The dates are indeed wrong. Similarly, I also found these,

$ grep "2028-12-31 23:00:00" commits.csv
239657882,"182ef308602eabf227265eb2a5034ede567dccfa",4498938,4498938,18281400,"2028-12-31 23:00:00"
242561682,"cd952389edd54aa00cd3a98382da1193a1fc6d36",1340592,1340592,18507496,"2028-12-31 23:00:00"

In fact, if we group by the duration of commits of projects where the duration is the years between the oldest and the newest commits of a project and count the number of projects in a group, I got the following,

 duration | count
----------+-------
        0 |   492
        1 |   632
        2 |   414
        3 |   363
        4 |   325
        5 |   321
        6 |   277
        7 |   214
        8 |   202
        9 |   150
       10 |   102
       11 |   106
       12 |    89
       13 |    82
       14 |    93
       15 |    60
       16 |    63
       17 |    51
       18 |    49
       19 |    38
       20 |    33
       21 |    24
       22 |     8
       23 |    11
       24 |    14
       25 |    13
       26 |     8
       27 |     4
       28 |     1
       29 |     2
       32 |     1
       33 |     1
       34 |     3
       36 |     1
       38 |     2
       40 |     3
       42 |     1
       43 |     1
       44 |     9
       45 |     7
       46 |    15
       47 |    10
       48 |     2
       49 |     7
       58 |     1
       68 |     2
(46 rows)

where count is the number of projects whose duration is given on the left column.

@stuartbates
Copy link

From the FAQ page:

I've seen weird commit timestamps

Git records the commit timestamp on the developer's workstation. If the clock is misconfigured, timestamps will be weird.

We have seen timestamps such as 0000-01-01 00:00 or 2034-12-31 23:59. GitHub and GHTorrent do not process the timestamps in any way.

One of the projects you referenced has future dates set against the authored at timestamps e.g. 2046-09-18T00:00:00Z

This problem affects all data on GH itself - here's a test commit I just made in 2021.

Unfortunately, I don't think there's anything we can do about it. At least instances of this should be pretty rare.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants