Commit

chore: update urls
purarue committed Oct 25, 2024
1 parent b53a8b2 commit eb89fc9
Showing 7 changed files with 19 additions and 20 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

-Copyright (c) 2021 Sean Breckenridge
+Copyright (c) 2021 purarue

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
18 changes: 9 additions & 9 deletions README.md
@@ -16,7 +16,7 @@ Parses data out of your [Google Takeout](https://takeout.google.com/) (History,
- [Contributing](#contributing)
- [Testing](#testing)

-This doesn't handle all cases, but I have yet to find a parser that does, so here is my attempt at parsing what I see as the most useful info from it. The Google Takeout is pretty particular, and the contents of the directory depend on what you select while exporting. Unhandled files will warn, though feel free to [PR a parser](#contributing) or [create an issue](https://github.com/seanbreckenridge/google_takeout_parser/issues/new?title=add+parser+for) if this doesn't parse some part you want.
+This doesn't handle all cases, but I have yet to find a parser that does, so here is my attempt at parsing what I see as the most useful info from it. The Google Takeout is pretty particular, and the contents of the directory depend on what you select while exporting. Unhandled files will warn, though feel free to [PR a parser](#contributing) or [create an issue](https://github.com/purarue/google_takeout_parser/issues/new?title=add+parser+for) if this doesn't parse some part you want.

This can take a few minutes to parse depending on what you have in your Takeout (especially while using the old HTML format), so this uses [cachew](https://github.com/karlicoss/cachew) to cache the function result for each Takeout you may have. That means this'll take a few minutes the first time parsing a takeout, but then only a few seconds every subsequent time.

@@ -52,7 +52,7 @@ This currently parses:
- `Youtube/live chats/live chats.csv`
- Likes: `YouTube and YouTube Music/playlists/likes.json`

-This was extracted out of [my HPI](https://github.com/seanbreckenridge/HPI/tree/4bb1f174bdbd693ab29e744413424d18b8667b1f/my/google) modules, which was in turn modified from the google files in [karlicoss/HPI](https://github.com/karlicoss/HPI/blob/4a04c09f314e10a4db8f35bf1ecc10e4d0203223/my/google/takeout/html.py)
+This was extracted out of [my HPI](https://github.com/purarue/HPI/tree/4bb1f174bdbd693ab29e744413424d18b8667b1f/my/google) modules, which was in turn modified from the google files in [karlicoss/HPI](https://github.com/karlicoss/HPI/blob/4a04c09f314e10a4db8f35bf1ecc10e4d0203223/my/google/takeout/html.py)

## Installation

@@ -131,8 +131,8 @@ Also contains a small utility command to help move/extract the google takeout:

```bash
$ google_takeout_parser move --from ~/Downloads/takeout*.zip --to-dir ~/data/google_takeout --extract
-Extracting /home/sean/Downloads/takeout-20211023T070558Z-001.zip to /tmp/tmp07ua_0id
-Moving /tmp/tmp07ua_0id/Takeout to /home/sean/data/google_takeout/Takeout-1634993897
+Extracting /home/username/Downloads/takeout-20211023T070558Z-001.zip to /tmp/tmp07ua_0id
+Moving /tmp/tmp07ua_0id/Takeout to /home/username/data/google_takeout/Takeout-1634993897
$ ls -1 ~/data/google_takeout/Takeout-1634993897
archive_browser.html
Chrome
@@ -252,22 +252,22 @@ On certain machines, the giant HTML files may even take so much memory that the

Just to give a brief overview, to add new functionality (parsing some new folder that this doesn't currently support), you'd need to:

-- Add a `model` for it in [`models.py`](google_takeout_parser/models.py) subclassing `BaseEvent` and adding it to the Union at the bottom of the file. That should have a [`key` property function](https://github.com/seanbreckenridge/google_takeout_parser/blob/a8aefac76d8e1474ca2275b4a7c78bbb962c7a04/google_takeout_parser/models.py#L185-L187) which describes each event uniquely (this is used to remove duplicate items when merging takeouts)
+- Add a `model` for it in [`models.py`](google_takeout_parser/models.py) subclassing `BaseEvent` and adding it to the Union at the bottom of the file. That should have a [`key` property function](https://github.com/purarue/google_takeout_parser/blob/a8aefac76d8e1474ca2275b4a7c78bbb962c7a04/google_takeout_parser/models.py#L185-L187) which describes each event uniquely (this is used to remove duplicate items when merging takeouts)
- Write a function which takes the `Path` to the file you're trying to parse and converts it to the model you created (See examples in [`parse_json.py`](google_takeout_parser/parse_json.py)). Ideally extract a single raw item from the takeout file add a test for it so its obvious when/if the format changes.
- Add a regex match for the file path to the handler map in [`google_takeout_parser/locales/en.py`](google_takeout_parser/locales/en.py).

Dont feel required to add support for all locales, its somewhat annoying to swap languages on google, request a takeout, wait for it to process and then swap back.

-Though, if your takeout is in some language this doesn't support, you can [create an issue](https://github.com/seanbreckenridge/google_takeout_parser/issues/new?title=support+new+locale) with the file structure (run `find Takeout` and/or `tree Takeout`), or contribute a locale file by creating a `path -> function mapping` ([see locales](https://github.com/seanbreckenridge/google_takeout_parser/tree/master/google_takeout_parser/locales)), and adding it to the global `LOCALES` variables in `locales/all.py` and `locales/main.py`
+Though, if your takeout is in some language this doesn't support, you can [create an issue](https://github.com/purarue/google_takeout_parser/issues/new?title=support+new+locale) with the file structure (run `find Takeout` and/or `tree Takeout`), or contribute a locale file by creating a `path -> function mapping` ([see locales](https://github.com/purarue/google_takeout_parser/tree/master/google_takeout_parser/locales)), and adding it to the global `LOCALES` variables in `locales/all.py` and `locales/main.py`

-This is a pretty difficult to maintain, as it requires a lot of manual testing from people who have access to these takeouts, and who actively use the language that the takeout is in. My google accounts main language is English, so I upkeep that locale whenever I notice changes, but its not trivial to port those changes to other locales without swapping my language, making an export, waiting, and then switching back. I keep track of mismatched changes [in this board](https://github.com/users/seanbreckenridge/projects/1/views/1)
+This is a pretty difficult to maintain, as it requires a lot of manual testing from people who have access to these takeouts, and who actively use the language that the takeout is in. My google accounts main language is English, so I upkeep that locale whenever I notice changes, but its not trivial to port those changes to other locales without swapping my language, making an export, waiting, and then switching back. I keep track of mismatched changes [in this board](https://github.com/users/purarue/projects/1/views/1)

-Ideally, when first creating a locale file, you would select everything when doing a takeout (not just the `My Activity`/`Chrome`/`Location History` like I suggested above), so [paths that are not parsed can be ignored properly](https://github.com/seanbreckenridge/google_takeout_parser/blob/4981c241c04b5b37265710dcc6ca00f19d1eafb4/google_takeout_parser/locales/en.py#L105C1-L113).
+Ideally, when first creating a locale file, you would select everything when doing a takeout (not just the `My Activity`/`Chrome`/`Location History` like I suggested above), so [paths that are not parsed can be ignored properly](https://github.com/purarue/google_takeout_parser/blob/4981c241c04b5b37265710dcc6ca00f19d1eafb4/google_takeout_parser/locales/en.py#L105C1-L113).

### Testing

```bash
-git clone 'https://github.com/seanbreckenridge/google_takeout_parser'
+git clone 'https://github.com/purarue/google_takeout_parser'
cd ./google_takeout_parser
pip install '.[testing]'
mypy ./google_takeout_parser
4 changes: 2 additions & 2 deletions google_takeout_parser/http_allowlist.py
@@ -1,5 +1,5 @@
"""
-For context, see: https://github.com/seanbreckenridge/google_takeout_parser/issues/31
+For context, see: https://github.com/purarue/google_takeout_parser/issues/31
This converts HTTP URLs to HTTPS, if they're from certain google domains.
In some cases URLs in the takeout are HTTP for no reason, and converting them
@@ -223,7 +223,7 @@ def _convert_to_https(url: str, logger: Optional[logging.Logger] = None) -> str:
return urlunsplit(("https",) + uu[1:])
if logger:
logger.debug(
-"HTTP URL did not match allowlist: %s\nIf you think this should be auto-converted to HTTPS, make an issue here: https://github.com/seanbreckenridge/google_takeout_parser/issues/new",
+"HTTP URL did not match allowlist: %s\nIf you think this should be auto-converted to HTTPS, make an issue here: https://github.com/purarue/google_takeout_parser/issues/new",
url,
)
# some other scheme, just return
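Note on the hunk above: the module docstring says HTTP URLs are upgraded to HTTPS only when they come from certain Google domains, and the visible code rebuilds the URL with `urlunsplit(("https",) + uu[1:])`. A minimal sketch of that idea follows. The allowlist contents and the function name are hypothetical, chosen for illustration; the real allowlist lives in `http_allowlist.py` and is not shown in this diff.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical allowlist for illustration only; not the library's actual list.
EXAMPLE_ALLOWLIST = {"www.google.com", "www.youtube.com"}


def convert_to_https_sketch(url: str) -> str:
    """Upgrade http:// to https:// only for hosts in the example allowlist."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.netloc in EXAMPLE_ALLOWLIST:
        # Swap the scheme and keep netloc, path, query and fragment unchanged.
        return urlunsplit(("https",) + tuple(parts[1:]))
    # Hosts outside the allowlist, or any other scheme: return unchanged.
    return url
```

URLs from hosts outside the allowlist are returned as-is, which mirrors the behaviour the debug message in the hunk describes.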
2 changes: 1 addition & 1 deletion google_takeout_parser/merge.py
@@ -84,7 +84,7 @@ def _create_key(e: BaseEvent) -> Key:


# This is so that its easier to use this logic in other
-# places, e.g. in github.com/seanbreckenridge/HPI
+# places, e.g. in github.com/purarue/HPI
class GoogleEventSet:
"""
Class to help manage keys for the models
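Note on the hunk above: the README says each model's `key` property "describes each event uniquely" and is used to remove duplicates when merging takeouts. The sketch below shows key-based deduplication in general terms. `ExampleEvent` and `merge_unique` are hypothetical names, and this is not the `GoogleEventSet` implementation, which the diff does not show.

```python
from dataclasses import dataclass
from typing import Hashable, Iterable, Iterator, Set


@dataclass(frozen=True)
class ExampleEvent:
    """Hypothetical stand-in for a model that exposes a `key` property."""

    title: str
    url: str

    @property
    def key(self) -> Hashable:
        # The fields that identify this event uniquely.
        return (self.title, self.url)


def merge_unique(*sources: Iterable[ExampleEvent]) -> Iterator[ExampleEvent]:
    """Yield events from all sources, skipping any whose key was already seen."""
    seen: Set[Hashable] = set()
    for source in sources:
        for event in source:
            if event.key in seen:
                continue
            seen.add(event.key)
            yield event
```

Two overlapping takeouts merged this way keep the first occurrence of each event and drop later repeats.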
5 changes: 2 additions & 3 deletions setup.cfg
@@ -4,9 +4,8 @@ version = 0.1.12
description = Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
long_description = file: README.md
long_description_content_type = text/markdown
-url = https://github.com/seanbreckenridge/google_takeout_parser
-author = Sean Breckenridge
-author_email = "seanbrecke@gmail.com"
+url = https://github.com/purarue/google_takeout_parser
+author = purarue
license = MIT
license_files = LICENSE
classifiers =
2 changes: 1 addition & 1 deletion split_html/go.mod
@@ -1,4 +1,4 @@
-module github.com/seanbreckenridge/google_takeout_parser/scripts
+module github.com/purarue/google_takeout_parser/scripts

go 1.18.0

6 changes: 3 additions & 3 deletions tests/test_json.py
@@ -55,7 +55,7 @@ def test_parse_activity_json(tmp_path_f: Path) -> None:


def test_parse_likes_json(tmp_path_f: Path) -> None:
-contents = """[{"contentDetails": {"videoId": "J1tF-DKKt7k", "videoPublishedAt": "2015-10-05T17:23:15.000Z"}, "etag": "GbLczUV2gsP6j0YQgTcYropUbdY", "id": "TExBNkR0bmJaMktKY2t5VFlmWE93UU5BLkoxdEYtREtLdDdr", "kind": "youtube#playlistItem", "snippet": {"channelId": "UCA6DtnbZ2KJckyTYfXOwQNA", "channelTitle": "Sean B", "description": "\\u30b7\\u30e5\\u30ac\\u30fc\\u30bd\\u30f3\\u30b0\\u3068\\u30d3\\u30bf\\u30fc\\u30b9\\u30c6\\u30c3\\u30d7 \\nSugar Song and Bitter Step\\n\\u7cd6\\u6b4c\\u548c\\u82e6\\u5473\\u6b65\\u9a5f\\nUNISON SQUARE GARDEN\\n\\u7530\\u6df5\\u667a\\u4e5f\\n\\u8840\\u754c\\u6226\\u7dda\\n\\u5e7b\\u754c\\u6230\\u7dda\\nBlood Blockade Battlefront ED\\nArranged by Maybe\\nScore:https://drive.google.com/open?id=0B9Jb1ks6rtrWSk1hX1U0MXlDSUE\\nThx~~", "playlistId": "LLA6DtnbZ2KJckyTYfXOwQNA", "position": 4, "publishedAt": "2020-07-05T18:27:32.000Z", "resourceId": {"kind": "youtube#video", "videoId": "J1tF-DKKt7k"}, "thumbnails": {"default": {"height": 90, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/default.jpg", "width": 120}, "high": {"height": 360, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/hqdefault.jpg", "width": 480}, "medium": {"height": 180, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/mqdefault.jpg", "width": 320}, "standard": {"height": 480, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/sddefault.jpg", "width": 640}}, "title": "[Maybe]Blood Blockade Battlefront ED \\u30b7\\u30e5\\u30ac\\u30fc\\u30bd\\u30f3\\u30b0\\u3068\\u30d3\\u30bf\\u30fc\\u30b9\\u30c6\\u30c3\\u30d7 Sugar Song and Bitter Step"}, "status": {"privacyStatus": "public"}}]"""
+contents = """[{"contentDetails": {"videoId": "J1tF-DKKt7k", "videoPublishedAt": "2015-10-05T17:23:15.000Z"}, "etag": "GbLczUV2gsP6j0YQgTcYropUbdY", "id": "TExBNkR0bmJaMktKY2t5VFlmWE93UU5BLkoxdEYtREtLdDdr", "kind": "youtube#playlistItem", "snippet": {"channelId": "UCA6DtnbZ2KJckyTYfXOwQNA", "channelTitle": "Title", "description": "\\u30b7\\u30e5\\u30ac\\u30fc\\u30bd\\u30f3\\u30b0\\u3068\\u30d3\\u30bf\\u30fc\\u30b9\\u30c6\\u30c3\\u30d7 \\nSugar Song and Bitter Step\\n\\u7cd6\\u6b4c\\u548c\\u82e6\\u5473\\u6b65\\u9a5f\\nUNISON SQUARE GARDEN\\n\\u7530\\u6df5\\u667a\\u4e5f\\n\\u8840\\u754c\\u6226\\u7dda\\n\\u5e7b\\u754c\\u6230\\u7dda\\nBlood Blockade Battlefront ED\\nArranged by Maybe\\nScore:https://drive.google.com/open?id=0B9Jb1ks6rtrWSk1hX1U0MXlDSUE\\nThx~~", "playlistId": "LLA6DtnbZ2KJckyTYfXOwQNA", "position": 4, "publishedAt": "2020-07-05T18:27:32.000Z", "resourceId": {"kind": "youtube#video", "videoId": "J1tF-DKKt7k"}, "thumbnails": {"default": {"height": 90, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/default.jpg", "width": 120}, "high": {"height": 360, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/hqdefault.jpg", "width": 480}, "medium": {"height": 180, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/mqdefault.jpg", "width": 320}, "standard": {"height": 480, "url": "https://i.ytimg.com/vi/J1tF-DKKt7k/sddefault.jpg", "width": 640}}, "title": "[Maybe]Blood Blockade Battlefront ED \\u30b7\\u30e5\\u30ac\\u30fc\\u30bd\\u30f3\\u30b0\\u3068\\u30d3\\u30bf\\u30fc\\u30b9\\u30c6\\u30c3\\u30d7 Sugar Song and Bitter Step"}, "status": {"privacyStatus": "public"}}]"""
fp = tmp_path_f / "file"
fp.write_text(contents)
res = list(prj._parse_likes(fp))
@@ -160,13 +160,13 @@ def test_location_2024(tmp_path_f: Path) -> None:


def test_chrome_history(tmp_path_f: Path) -> None:
-contents = '{"Browser History": [{"page_transition": "LINK", "title": "sean", "url": "https://sean.fish", "client_id": "W1vSb98l403jhPeK==", "time_usec": 1617404690134513}]}'
+contents = '{"Browser History": [{"page_transition": "LINK", "title": "title", "url": "https://sean.fish", "client_id": "W1vSb98l403jhPeK==", "time_usec": 1617404690134513}]}'
fp = tmp_path_f / "file"
fp.write_text(contents)
res = list(prj._parse_chrome_history(fp))
assert res == [
models.ChromeHistory(
-title="sean",
+title="title",
url="https://sean.fish",
dt=datetime.datetime(
2021, 4, 2, 23, 4, 50, 134513, tzinfo=datetime.timezone.utc
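Note on the hunk above: the test feeds in `"time_usec": 1617404690134513` and expects a UTC `datetime` of 2021-04-02 23:04:50.134513, so the field appears to be microseconds since the Unix epoch. A minimal sketch of that conversion follows. The function name is hypothetical, and the parser's own conversion is not shown in this diff.

```python
import datetime


def usec_to_datetime(time_usec: int) -> datetime.datetime:
    """Convert microseconds since the Unix epoch to a timezone-aware UTC datetime."""
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    # Integer timedelta arithmetic avoids the float rounding that dividing by 1e6 can introduce.
    return epoch + datetime.timedelta(microseconds=time_usec)


print(usec_to_datetime(1617404690134513))  # 2021-04-02 23:04:50.134513+00:00
```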
