Attempt to get publication year when auto-titling links #520

rleed · 2023-09-25T20:44:32Z

To resolve #51.

ekzyis

Regarding:

I would put this logic into getMetadata using a custom ruleset. Afaict, looking at the code, you can add new rules (like for date here) and not only extend existing rules. Just add a new key to the ruleset.

I see you are already using getMetadata.

Why are you extracting the date in different functions? Why not have a single call to getMetadata with all date rules? You don't have to reimplement the python code 1:1.

You also mention this:

// try to get date from various sources in order of precedence

But the rule order would already specify the precedence:

The order in which rules are defined indicate their preference, with the first rule being the most preferred.

-- https://www.npmjs.com/package/page-metadata-parser#rules
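To make the precedence point concrete, here is a minimal sketch of "first rule wins" matching. It is a simplified stand-in, not page-metadata-parser's actual implementation: `firstMatchingValue`, the `tags` object, and the rule list are hypothetical names chosen for illustration, and a plain object stands in for the parsed page.

```javascript
// Simplified stand-in for ordered rule matching; NOT page-metadata-parser's code.
// `tags` is a hypothetical plain-object view of a page's meta tags.
function firstMatchingValue (rules, tags) {
  for (const [key, extract] of rules) {
    if (key in tags) {
      const value = extract(tags[key])
      if (value) return value
    }
  }
  return undefined
}

// Rules listed from most to least preferred (tag names are illustrative).
const publishedDateRules = [
  ['article:published_time', v => v.trim()],
  ['date', v => v.trim()]
]

const tags = { date: '2020-01-01', 'article:published_time': ' 2021-06-15 ' }
console.log(firstMatchingValue(publishedDateRules, tags)) // prints 2021-06-15
```

With the order encoded in the list itself, there is no need for separate functions that are tried one after another.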

What's interesting is that there was already a PR for this, but the package has been unmaintained since February 2022: https://github.com/mozilla/page-metadata-parser/pull/122/files

I haven't found a maintained fork or similar though.

I think we should also use the name publishedDate instead of just date for the rule because it's more clear what kind of date is meant (even though I don't know what other date could be meant, lol).

Also, do you have example websites where different tags are used for the published date? The url in your comment uses script[type="application/ld+json"] but would be nice to see examples for the other tags, too.
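For the script[type="application/ld+json"] case, a rough sketch of reading the date out of the script's text could look like the following. `datePublishedFromJsonLd` is a hypothetical helper, not code from this PR; it assumes the page uses the schema.org `datePublished` property and only looks at top-level nodes or an "@graph" array.

```javascript
// Hypothetical helper: pull schema.org `datePublished` out of the text of a
// <script type="application/ld+json"> element. Returns undefined if absent.
function datePublishedFromJsonLd (text) {
  let data
  try {
    data = JSON.parse(text)
  } catch {
    return undefined // malformed JSON should not crash the scraper
  }
  // JSON-LD may be a single object, an array, or wrapped in "@graph".
  const nodes = Array.isArray(data)
    ? data
    : (data && Array.isArray(data['@graph'])) ? data['@graph'] : [data]
  for (const node of nodes) {
    if (node && typeof node.datePublished === 'string') return node.datePublished
  }
  return undefined
}

const sample = '{"@context":"https://schema.org","@type":"NewsArticle","datePublished":"2023-09-25T20:44:32Z"}'
console.log(datePublishedFromJsonLd(sample)) // prints 2023-09-25T20:44:32Z
```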

edit: also please do a rebase, there are conflicts currently

lib/timedate-scraper.js

api/resolvers/item.js

lib/timedate-scraper.js

rleed · 2023-09-30T11:02:16Z

Just a quick update that I've been working on this. I probably won't have a new push till next week though.

ekzyis · 2023-10-17T00:22:44Z

Mhh, I think we missed that this is ready for review again.

I'll try to do a review tomorrow.

ekzyis

Looks much better now!

Most important comments:

I don't understand the purpose of initDateRule
TypeError for links where no publication date was found
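One way to avoid a TypeError when no publication date is found is to guard before touching the date. The sketch below is illustrative only: `titleWithYear` and the "title (year)" output format are assumptions, not the code or format used in this PR.

```javascript
// Hypothetical helper, not the PR's code: append the publication year to a
// title only when a usable date was found.
function titleWithYear (title, publicationDate) {
  if (!publicationDate) return title // nothing was scraped
  const date = new Date(publicationDate)
  if (Number.isNaN(date.getTime())) return title // unparseable date string
  return `${title} (${date.getUTCFullYear()})`
}

console.log(titleWithYear('Some article', '2021-06-15T10:00:00Z')) // prints Some article (2021)
console.log(titleWithYear('Some article', undefined)) // prints Some article
```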

api/resolvers/item.js

lib/time.js

api/resolvers/item.js

ekzyis · 2023-10-17T23:03:48Z

By the way, it's a good practice to rebase your branches frequently on master to avoid complex conflicts.

There are currently no conflicts but this means you can rebase for free.

For example, your branch date-scraper is 49 commits behind master while date-ranges and your master branch are 42 commits behind:

$ git rev-list master --not rleed/date-scraper | wc -l
49
$ git rev-list master --not rleed/date-ranges | wc -l
42
$ git rev-list master --not rleed/master | wc -l
42

rleed · 2023-10-18T11:33:32Z

I don't understand the purpose of initDateRule

It just adds our custom rule to the ruleset, since it isn't built-in. Without it, the following line wouldn't know how to find the date in the page. If there's a better way/place to do it, I'm open, but this just seemed like the most logical/reliable place to put it.

rleed · 2023-10-18T12:00:25Z

By the way, it's a good practice to rebase your branches frequently on master to avoid complex conflicts.

There are currently no conflicts but this means you can rebase for free.

For example, your branch date-scraper is 49 commits behind master while date-ranges and your master branch are 42 commits behind:
$ git rev-list master --not rleed/date-scraper | wc -l
49
$ git rev-list master --not rleed/date-ranges | wc -l
42
$ git rev-list master --not rleed/master | wc -l
42

Thanks... I will go through them all now.

ekzyis · 2023-10-18T12:49:34Z

It just adds our custom rule to the ruleset, since it isn't built-in. Without it, the following line wouldn't know how to find the date in the page. If there's a better way/place to do it, I'm open, but this just seemed like the most logical/reliable place to put it.

Oh, I never added why I don't understand it. I had written a lengthy comment about it but somehow I must have lost it in all my tabs, lol. GitHub is confusing if you have multiple tabs of it open ...

I don't understand it because you're manipulating something you imported at lib/timedate-scraper.js:95:

metadataRuleSets.publicationDate = ruleSet

I am actually surprised this even works - I guess because javascript imports are references? So this changed value is actually reflected where you use it in api/resolvers/item.js:489:

const metadata = getMetadata(doc, url, { title: metadataRuleSets.title, publicationDate: metadataRuleSets.publicationDate })

But since initDateRule takes no arguments ... why don't you just export ruleSet? And then import that in api/resolvers/item.js:489? That's beyond me.

Anyway, you shouldn't manipulate imports imo. That probably relies on very specific ESM import mechanics and I've never seen this before, lol
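For what it's worth, the effect described above can be reproduced without any module system: every holder of a reference to the same object sees a property added to it. The sketch below uses plain objects as stand-ins (`sharedRuleSets`, `scraperView`, `resolverView`, and `publicationDateRuleSet` are hypothetical names, and no real library is involved), then shows the side-effect-free alternative of defining the ruleset separately and passing it in explicitly.

```javascript
// Stand-in for an object exported by a library (think `metadataRuleSets`).
const sharedRuleSets = { title: { rules: [] } }

// Two "modules" that both hold the same exported object.
const scraperView = sharedRuleSets
const resolverView = sharedRuleSets

// Side-effect style: mutate the shared object from one place ...
scraperView.publicationDate = { rules: [] }
// ... and the change is visible everywhere, because both names point at one object.
console.log(resolverView.publicationDate !== undefined) // prints true

// Side-effect-free alternative: define the ruleset on its own and pass it explicitly.
const publicationDateRuleSet = { rules: [] }
const ruleSetsForCall = { title: sharedRuleSets.title, publicationDate: publicationDateRuleSet }
console.log(Object.keys(ruleSetsForCall)) // prints [ 'title', 'publicationDate' ]
```

The second form keeps the imported object untouched, so the result does not depend on whether some init function happened to run first.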

rleed · 2023-10-18T14:07:05Z

Done, thanks!

rleed force-pushed the date-scraper branch from 126570e to d3efc85 Compare September 25, 2023 20:48

ekzyis reviewed Sep 27, 2023

huumn marked this pull request as draft September 30, 2023 20:41

rleed marked this pull request as ready for review October 3, 2023 22:40

ekzyis requested changes Oct 17, 2023

rleed force-pushed the date-scraper branch 2 times, most recently from 3b5a9f6 to c6a1d93 Compare October 18, 2023 12:24

rleed force-pushed the date-scraper branch from 1b81321 to 8b1edd2 Compare October 20, 2023 13:48

rleed and others added 10 commits October 20, 2023 10:48

port date scraper from python

6347e6c

bug fixes and cleanup

7706fd8

bug fixes and cleanup

5b96928

refactor

d011aa7

address comments

e166054

make it intuitive

7e98ff6

Update timedate-scraper.js - lint

070b8b0

address review comments

76304f3

cleanup

193ec48

simplfy and don't use side effects

8b1edd2

huumn merged commit 72b8b5b into stackernews:master Oct 21, 2023
1 check passed
