diff --git a/LICENSE b/LICENSE
index c1304ab..3b9dab4 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2021 Sean Breckenridge
+Copyright (c) 2021 purarue
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
index d759b8b..8368295 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ Parses data out of your [Google Takeout](https://takeout.google.com/) (History,
 - [Contributing](#contributing)
   - [Testing](#testing)
 
-This doesn't handle all cases, but I have yet to find a parser that does, so here is my attempt at parsing what I see as the most useful info from it. The Google Takeout is pretty particular, and the contents of the directory depend on what you select while exporting. Unhandled files will warn, though feel free to [PR a parser](#contributing) or [create an issue](https://github.com/seanbreckenridge/google_takeout_parser/issues/new?title=add+parser+for) if this doesn't parse some part you want.
+This doesn't handle all cases, but I have yet to find a parser that does, so here is my attempt at parsing what I see as the most useful info from it. The Google Takeout is pretty particular, and the contents of the directory depend on what you select while exporting. Unhandled files will warn, though feel free to [PR a parser](#contributing) or [create an issue](https://github.com/purarue/google_takeout_parser/issues/new?title=add+parser+for) if this doesn't parse some part you want.
 
 This can take a few minutes to parse depending on what you have in your Takeout (especially while using the old HTML format), so this uses [cachew](https://github.com/karlicoss/cachew) to cache the function result for each Takeout you may have. That means this'll take a few minutes the first time parsing a takeout, but then only a few seconds every subsequent time.
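As an editorial aside on the caching paragraph in the context above: cachew itself serializes dataclasses into a sqlite database keyed on the source; the sketch below is only a simplified stand-in for that idea, and every name in it is illustrative rather than taken from this codebase.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List

def cached_parse(
    takeout_dir: Path,
    parse: Callable[[Path], List[dict]],
    cache_dir: Path,
) -> List[dict]:
    """Parse a takeout once, then serve repeat calls from a local cache."""
    # key the cache file on the takeout path, one cache file per export
    digest = hashlib.sha256(str(takeout_dir).encode()).hexdigest()
    cache_file = cache_dir / f"{digest}.json"
    if cache_file.exists():
        # cache hit: skip the slow HTML/JSON parsing entirely
        return json.loads(cache_file.read_text())
    events = parse(takeout_dir)  # slow first run
    cache_file.write_text(json.dumps(events))
    return events
```

This is why the first parse of a takeout is slow but subsequent ones are nearly instant: the expensive work happens once per export, and everything afterwards is a deserialize.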
@@ -52,7 +52,7 @@ This currently parses:
 - `Youtube/live chats/live chats.csv`
 - Likes: `YouTube and YouTube Music/playlists/likes.json`
 
-This was extracted out of [my HPI](https://github.com/seanbreckenridge/HPI/tree/4bb1f174bdbd693ab29e744413424d18b8667b1f/my/google) modules, which was in turn modified from the google files in [karlicoss/HPI](https://github.com/karlicoss/HPI/blob/4a04c09f314e10a4db8f35bf1ecc10e4d0203223/my/google/takeout/html.py)
+This was extracted out of [my HPI](https://github.com/purarue/HPI/tree/4bb1f174bdbd693ab29e744413424d18b8667b1f/my/google) modules, which was in turn modified from the google files in [karlicoss/HPI](https://github.com/karlicoss/HPI/blob/4a04c09f314e10a4db8f35bf1ecc10e4d0203223/my/google/takeout/html.py)
 
 ## Installation
 
@@ -131,8 +131,8 @@ Also contains a small utility command to help move/extract the google takeout:
 
 ```bash
 $ google_takeout_parser move --from ~/Downloads/takeout*.zip --to-dir ~/data/google_takeout --extract
-Extracting /home/sean/Downloads/takeout-20211023T070558Z-001.zip to /tmp/tmp07ua_0id
-Moving /tmp/tmp07ua_0id/Takeout to /home/sean/data/google_takeout/Takeout-1634993897
+Extracting /home/username/Downloads/takeout-20211023T070558Z-001.zip to /tmp/tmp07ua_0id
+Moving /tmp/tmp07ua_0id/Takeout to /home/username/data/google_takeout/Takeout-1634993897
 $ ls -1 ~/data/google_takeout/Takeout-1634993897
 archive_browser.html
 Chrome
@@ -252,22 +252,22 @@ On certain machines, the giant HTML files may even take so much memory that the
 
 Just to give a brief overview, to add new functionality (parsing some new folder that this doesn't currently support), you'd need to:
 
-- Add a `model` for it in [`models.py`](google_takeout_parser/models.py) subclassing `BaseEvent` and adding it to the Union at the bottom of the file. That should have a [`key` property function](https://github.com/seanbreckenridge/google_takeout_parser/blob/a8aefac76d8e1474ca2275b4a7c78bbb962c7a04/google_takeout_parser/models.py#L185-L187) which describes each event uniquely (this is used to remove duplicate items when merging takeouts)
+- Add a `model` for it in [`models.py`](google_takeout_parser/models.py) subclassing `BaseEvent` and adding it to the Union at the bottom of the file. That should have a [`key` property function](https://github.com/purarue/google_takeout_parser/blob/a8aefac76d8e1474ca2275b4a7c78bbb962c7a04/google_takeout_parser/models.py#L185-L187) which describes each event uniquely (this is used to remove duplicate items when merging takeouts)
 - Write a function which takes the `Path` to the file you're trying to parse and converts it to the model you created (See examples in [`parse_json.py`](google_takeout_parser/parse_json.py)). Ideally extract a single raw item from the takeout file add a test for it so its obvious when/if the format changes.
 - Add a regex match for the file path to the handler map in [`google_takeout_parser/locales/en.py`](google_takeout_parser/locales/en.py). Dont feel required to add support for all locales, its somewhat annoying to swap languages on google, request a takeout, wait for it to process and then swap back.
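To make the first bullet in the hunk above concrete, a new model with a `key` property might look roughly like this. The class name, fields, and the exact shape of `BaseEvent` are assumptions for illustration, not copied from `models.py`:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple

@dataclass
class ExampleActivity:
    """Hypothetical model for a newly supported takeout file."""
    title: str
    url: str
    dt: datetime

    @property
    def key(self) -> Tuple[str, int]:
        # must identify this event uniquely, so duplicate items can be
        # dropped when merging events from overlapping takeouts
        return (self.url, int(self.dt.timestamp()))
```

When two takeouts contain the same event, their `key` values collide and only one copy survives the merge.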
 
-Though, if your takeout is in some language this doesn't support, you can [create an issue](https://github.com/seanbreckenridge/google_takeout_parser/issues/new?title=support+new+locale) with the file structure (run `find Takeout` and/or `tree Takeout`), or contribute a locale file by creating a `path -> function mapping` ([see locales](https://github.com/seanbreckenridge/google_takeout_parser/tree/master/google_takeout_parser/locales)), and adding it to the global `LOCALES` variables in `locales/all.py` and `locales/main.py`
+Though, if your takeout is in some language this doesn't support, you can [create an issue](https://github.com/purarue/google_takeout_parser/issues/new?title=support+new+locale) with the file structure (run `find Takeout` and/or `tree Takeout`), or contribute a locale file by creating a `path -> function mapping` ([see locales](https://github.com/purarue/google_takeout_parser/tree/master/google_takeout_parser/locales)), and adding it to the global `LOCALES` variables in `locales/all.py` and `locales/main.py`
 
-This is a pretty difficult to maintain, as it requires a lot of manual testing from people who have access to these takeouts, and who actively use the language that the takeout is in. My google accounts main language is English, so I upkeep that locale whenever I notice changes, but its not trivial to port those changes to other locales without swapping my language, making an export, waiting, and then switching back. I keep track of mismatched changes [in this board](https://github.com/users/seanbreckenridge/projects/1/views/1)
+This is pretty difficult to maintain, as it requires a lot of manual testing from people who have access to these takeouts, and who actively use the language that the takeout is in. My Google account's main language is English, so I upkeep that locale whenever I notice changes, but it's not trivial to port those changes to other locales without swapping my language, making an export, waiting, and then switching back. I keep track of mismatched changes [in this board](https://github.com/users/purarue/projects/1/views/1)
 
-Ideally, when first creating a locale file, you would select everything when doing a takeout (not just the `My Activity`/`Chrome`/`Location History` like I suggested above), so [paths that are not parsed can be ignored properly](https://github.com/seanbreckenridge/google_takeout_parser/blob/4981c241c04b5b37265710dcc6ca00f19d1eafb4/google_takeout_parser/locales/en.py#L105C1-L113).
+Ideally, when first creating a locale file, you would select everything when doing a takeout (not just the `My Activity`/`Chrome`/`Location History` like I suggested above), so [paths that are not parsed can be ignored properly](https://github.com/purarue/google_takeout_parser/blob/4981c241c04b5b37265710dcc6ca00f19d1eafb4/google_takeout_parser/locales/en.py#L105C1-L113).
 
 ### Testing
 
 ```bash
-git clone 'https://github.com/seanbreckenridge/google_takeout_parser'
+git clone 'https://github.com/purarue/google_takeout_parser'
 cd ./google_takeout_parser
 pip install '.[testing]'
 mypy ./google_takeout_parser
diff --git a/google_takeout_parser/http_allowlist.py b/google_takeout_parser/http_allowlist.py
index cf22bb8..dfdc481 100644
--- a/google_takeout_parser/http_allowlist.py
+++ b/google_takeout_parser/http_allowlist.py
@@ -1,5 +1,5 @@
 """
-For context, see: https://github.com/seanbreckenridge/google_takeout_parser/issues/31
+For context, see: https://github.com/purarue/google_takeout_parser/issues/31
 
 This converts HTTP URLs to HTTPS, if they're from certain google domains.
 In some cases URLs in the takeout are HTTP for no reason, and converting them
@@ -223,7 +223,7 @@ def _convert_to_https(url: str, logger: Optional[logging.Logger] = None) -> str:
         return urlunsplit(("https",) + uu[1:])
     if logger:
         logger.debug(
-            "HTTP URL did not match allowlist: %s\nIf you think this should be auto-converted to HTTPS, make an issue here: https://github.com/seanbreckenridge/google_takeout_parser/issues/new",
+            "HTTP URL did not match allowlist: %s\nIf you think this should be auto-converted to HTTPS, make an issue here: https://github.com/purarue/google_takeout_parser/issues/new",
             url,
         )
     # some other scheme, just return
diff --git a/google_takeout_parser/merge.py b/google_takeout_parser/merge.py
index e45ed3c..0f47721 100644
--- a/google_takeout_parser/merge.py
+++ b/google_takeout_parser/merge.py
@@ -84,7 +84,7 @@ def _create_key(e: BaseEvent) -> Key:
 
 
 # This is so that its easier to use this logic in other
-# places, e.g. in github.com/seanbreckenridge/HPI
+# places, e.g. in github.com/purarue/HPI
 class GoogleEventSet:
     """
     Class to help manage keys for the models
diff --git a/setup.cfg b/setup.cfg
index 17bee78..6bb1b07 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -4,9 +4,8 @@ version = 0.1.12
 description = Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
 long_description = file: README.md
 long_description_content_type = text/markdown
-url = https://github.com/seanbreckenridge/google_takeout_parser
-author = Sean Breckenridge
-author_email = "seanbrecke@gmail.com"
+url = https://github.com/purarue/google_takeout_parser
+author = purarue
 license = MIT
 license_files = LICENSE
 classifiers =
diff --git a/split_html/go.mod b/split_html/go.mod
index 2c836b8..b28559f 100644
--- a/split_html/go.mod
+++ b/split_html/go.mod
@@ -1,4 +1,4 @@
-module github.com/seanbreckenridge/google_takeout_parser/scripts
+module github.com/purarue/google_takeout_parser/scripts
 
 go 1.18.0
diff --git a/tests/test_json.py b/tests/test_json.py
index 9a3f6f0..5eadb6d 100644
--- a/tests/test_json.py
+++ b/tests/test_json.py
@@ -55,7 +55,7 @@ def test_parse_activity_json(tmp_path_f: Path) -> None:
 
 
 def test_parse_likes_json(tmp_path_f: Path) -> None:
-    contents = """[{"contentDetails": {"videoId": "J1tF-DKKt7k", "videoPublishedAt": "2015-10-05T17:23:15.000Z"}, "etag": "GbLczUV2gsP6j0YQgTcYropUbdY", "id": "TExBNkR0bmJaMktKY2t5VFlmWE93UU5BLkoxdEYtREtLdDdr", "kind": "youtube#playlistItem", "snippet": {"channelId": "UCA6DtnbZ2KJckyTYfXOwQNA", "channelTitle": "Sean B", "description": "\\u30b7\\u30e5\\u30ac\\u30fc\\u30bd\\u30f3\\u30b0\\u3068\\u30d3\\u30bf\\u30fc\\u30b9\\u30c6\\u30c3\\u30d7 \\nSugar Song and Bitter Step\\n\\u7cd6\\u6b4c\\u548c\\u82e6\\u5473\\u6b65\\u9a5f\\nUNISON SQUARE GARDEN\\n\\u7530\\u6df5\\u667a\\u4e5f\\n\\u8840\\u754c\\u6226\\u7dda\\n\\u5e7b\\u754c\\u6230\\u7dda\\nBlood Blockade Battlefront ED\\nArranged by Maybe\\nScore:https://drive.google.com/open?id=0B9Jb1ks6rtrWSk1hX1U0MXlDSUE\\nThx~~", "playlistId": "LLA6DtnbZ2KJckyTYfXOwQNA", "position": 4, "publishedAt": "2020-07-05T18:27:32.000Z", "resourceId": {"kind": "youtube#video", "videoId": "J1tF-DKKt7k"}, "thumbnails": {"default": {"height": 90, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/default.jpg", "width": 120}, "high": {"height": 360, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/hqdefault.jpg", "width": 480}, "medium": {"height": 180, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/mqdefault.jpg", "width": 320}, "standard": {"height": 480, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/sddefault.jpg", "width": 640}}, "title": "[Maybe]Blood Blockade Battlefront ED \\u30b7\\u30e5\\u30ac\\u30fc\\u30bd\\u30f3\\u30b0\\u3068\\u30d3\\u30bf\\u30fc\\u30b9\\u30c6\\u30c3\\u30d7 Sugar Song and Bitter Step"}, "status": {"privacyStatus": "public"}}]"""
+    contents = """[{"contentDetails": {"videoId": "J1tF-DKKt7k", "videoPublishedAt": "2015-10-05T17:23:15.000Z"}, "etag": "GbLczUV2gsP6j0YQgTcYropUbdY", "id": "TExBNkR0bmJaMktKY2t5VFlmWE93UU5BLkoxdEYtREtLdDdr", "kind": "youtube#playlistItem", "snippet": {"channelId": "UCA6DtnbZ2KJckyTYfXOwQNA", "channelTitle": "Title", "description": "\\u30b7\\u30e5\\u30ac\\u30fc\\u30bd\\u30f3\\u30b0\\u3068\\u30d3\\u30bf\\u30fc\\u30b9\\u30c6\\u30c3\\u30d7 \\nSugar Song and Bitter Step\\n\\u7cd6\\u6b4c\\u548c\\u82e6\\u5473\\u6b65\\u9a5f\\nUNISON SQUARE GARDEN\\n\\u7530\\u6df5\\u667a\\u4e5f\\n\\u8840\\u754c\\u6226\\u7dda\\n\\u5e7b\\u754c\\u6230\\u7dda\\nBlood Blockade Battlefront ED\\nArranged by Maybe\\nScore:https://drive.google.com/open?id=0B9Jb1ks6rtrWSk1hX1U0MXlDSUE\\nThx~~", "playlistId": "LLA6DtnbZ2KJckyTYfXOwQNA", "position": 4, "publishedAt": "2020-07-05T18:27:32.000Z", "resourceId": {"kind": "youtube#video", "videoId": "J1tF-DKKt7k"}, "thumbnails": {"default": {"height": 90, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/default.jpg", "width": 120}, "high": {"height": 360, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/hqdefault.jpg", "width": 480}, "medium": {"height": 180, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/mqdefault.jpg", "width": 320}, "standard": {"height": 480, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/sddefault.jpg", "width": 640}}, "title": "[Maybe]Blood Blockade Battlefront ED \\u30b7\\u30e5\\u30ac\\u30fc\\u30bd\\u30f3\\u30b0\\u3068\\u30d3\\u30bf\\u30fc\\u30b9\\u30c6\\u30c3\\u30d7 Sugar Song and Bitter Step"}, "status": {"privacyStatus": "public"}}]"""
     fp = tmp_path_f / "file"
     fp.write_text(contents)
     res = list(prj._parse_likes(fp))
@@ -160,13 +160,13 @@ def test_location_2024(tmp_path_f: Path) -> None:
 
 
 def test_chrome_history(tmp_path_f: Path) -> None:
-    contents = '{"Browser History": [{"page_transition": "LINK", "title": "sean", "url": "https://sean.fish", "client_id": "W1vSb98l403jhPeK==", "time_usec": 1617404690134513}]}'
+    contents = '{"Browser History": [{"page_transition": "LINK", "title": "title", "url": "https://sean.fish", "client_id": "W1vSb98l403jhPeK==", "time_usec": 1617404690134513}]}'
     fp = tmp_path_f / "file"
     fp.write_text(contents)
     res = list(prj._parse_chrome_history(fp))
     assert res == [
         models.ChromeHistory(
-            title="sean",
+            title="title",
             url="https://sean.fish",
             dt=datetime.datetime(
                 2021, 4, 2, 23, 4, 50, 134513, tzinfo=datetime.timezone.utc
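A side note on the `test_chrome_history` hunk above: the expected datetime in the assertion is simply the fixture's `time_usec` field interpreted as microseconds since the Unix epoch. A minimal sketch of that conversion (the helper name here is made up; the real parsing lives inside google_takeout_parser):

```python
import datetime

def from_time_usec(time_usec: int) -> datetime.datetime:
    """Convert microseconds since the Unix epoch to an aware UTC datetime."""
    # split into whole seconds and leftover microseconds, avoiding
    # float rounding on 16-digit microsecond timestamps
    seconds, usec = divmod(time_usec, 1_000_000)
    return datetime.datetime.fromtimestamp(
        seconds, tz=datetime.timezone.utc
    ).replace(microsecond=usec)

# the value from the test fixture above
dt = from_time_usec(1617404690134513)
assert dt == datetime.datetime(
    2021, 4, 2, 23, 4, 50, 134513, tzinfo=datetime.timezone.utc
)
```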