Cache the titles, descriptions and subtitles #200

Open
benoit74 opened this issue May 14, 2024 · 4 comments

@benoit74
Collaborator

Video titles, descriptions and subtitles are not yet cached on S3.

They are, however, not expected to change much and are rather time-consuming to fetch (especially when the video has been translated into tens of languages).

Titles and descriptions require fetching the HTML page of the video for every language and parsing it with BeautifulSoup to extract them.

Subtitles have to be converted to the proper format.

We should cache them and only refresh them when someone complains or once in a while, especially if we want to keep updating the ZIM on a very regular basis to fetch the few new videos that have been published.

@kelson42
Contributor

kelson42 commented Jun 28, 2024

> Titles and descriptions require fetching the HTML page of the video for every language and parsing it with BeautifulSoup to extract them.

How is that a problem? How measurable is that? I'm not in favour of upstream synchronisation based on time delays. ETag-based solutions should be used.

@benoit74
Collaborator Author

As mentioned, the first-order problem is that it is time-consuming to fetch (especially when the video has been translated into tens of languages).

I don't have measurements to share yet, since we are currently re-encoding all the videos, so re-encoding accounts for the main share of task duration. Once re-encoding is complete, most tasks will just download videos from the cache. I will share measurements once they are available.

ETags are indeed available. I'm not sure how well they work, but they should be OK; see https://www.ted.com/talks/oral_mcguire_how_to_live_with_fire?delay=5s&subtitle=en&trigger=30s
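
To make the ETag idea concrete, here is a minimal sketch of conditional revalidation. It assumes the TED pages honour `If-None-Match`, which has not been verified here. The names `get_with_etag`, `CachedPage`, `make_fake_server` and the injected `fetch` callable are hypothetical and not part of the scraper; the plain dict stands in for the S3 cache.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical transport: fetch(url, request_headers) -> (status, response_headers, body)
Fetch = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], str]]


@dataclass
class CachedPage:
    etag: str
    body: str


def get_with_etag(url: str, cache: Dict[str, CachedPage], fetch: Fetch) -> Tuple[str, bool]:
    """Return (body, served_from_cache) using ETag revalidation."""
    headers: Dict[str, str] = {}
    cached = cache.get(url)
    if cached is not None:
        # Ask the server to send the body only if it changed.
        headers["If-None-Match"] = cached.etag

    status, resp_headers, body = fetch(url, headers)

    if status == 304 and cached is not None:
        # Not modified: reuse the cached copy, nothing to parse again.
        return cached.body, True

    # Assumption: header keys are already normalised to "ETag" by the transport.
    etag = resp_headers.get("ETag")
    if status == 200 and etag:
        cache[url] = CachedPage(etag=etag, body=body)
    return body, False


def make_fake_server(etag: str, body: str) -> Tuple[Fetch, List[Dict[str, str]]]:
    """In-memory stand-in for the upstream site, for demonstration only."""
    calls: List[Dict[str, str]] = []

    def fetch(url: str, headers: Dict[str, str]) -> Tuple[int, Dict[str, str], str]:
        calls.append(dict(headers))
        if headers.get("If-None-Match") == etag:
            return 304, {"ETag": etag}, ""
        return 200, {"ETag": etag}, body

    return fetch, calls
```

With this shape, the first run stores the page alongside its ETag, and later runs only pay for a small 304 response per language page as long as nothing changed upstream. Whether that is actually faster than a full fetch on TED's side would still need measuring.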

@benoit74
Collaborator Author

For instance on https://farm.openzim.org/pipeline/3241d2f3-c4d9-489d-98dc-67820f39e6c0/debug, these are the stats (all images and reencoded videos are already in S3 cache):

Download video infos from TED website: 16 mins
Download images from cache: 1 min
Download videos from cache: 13 mins
Build the ZIM: few secs

So we spend more time downloading info from TED than downloading videos from cache.

@benoit74
Collaborator Author

(In the task mentioned above, we ended up with 23 videos in the ZIM.)
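
As a rough back-of-the-envelope check on the figures above, the sketch below computes each step's share of the run and the average metadata time per video. It assumes the "few secs" ZIM build is negligible and counts it as zero; the variable names are illustrative only.

```python
# Durations in minutes, taken from the stats quoted above.
# Assumption: the ZIM build ("few secs") is treated as 0 minutes.
durations_min = {
    "TED info": 16,
    "images from cache": 1,
    "videos from cache": 13,
}
videos_in_zim = 23

total_min = sum(durations_min.values())

# Percentage of total run time spent in each step.
share_pct = {step: 100 * minutes / total_min for step, minutes in durations_min.items()}

# Average time spent fetching TED metadata per video, in seconds.
info_seconds_per_video = durations_min["TED info"] * 60 / videos_in_zim

for step, pct in share_pct.items():
    print(f"{step}: {pct:.1f}% of {total_min} min")
print(f"TED info per video: {info_seconds_per_video:.1f} s")
```

Under that assumption, fetching info from TED is a bit over half of the run, at roughly 40 seconds per video.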
