Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to get publication year when auto-titling links #520

Merged
merged 10 commits into from
Oct 21, 2023

Conversation

rleed
Copy link
Contributor

@rleed rleed commented Sep 25, 2023

To resolve #51.

Copy link
Member

@ekzyis ekzyis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding:

I would put this logic into getMetadata using a custom ruleset. Afaict, looking at the code, you can add new rules (like for date here) and not only extend existing rules. Just add a new key to the ruleset.

I see you are already using getMetadata.

Why are you extracting the date in different functions? Why not have a single call to getMetadata with all date rules? You don't have to reimplement the python code 1:1.

You also mention this:

// try to get date from various sources in order of precedence

But the rule order would already specify the precedence:

The order in which rules are defined indicate their preference, with the first rule being the most preferred.

-- https://www.npmjs.com/package/page-metadata-parser#rules

What's interesting is that there was already a PR for this but the package is unmaintained since February 2022: https://github.com/mozilla/page-metadata-parser/pull/122/files

I haven't found a maintained fork or similar though.

I think we should also use the name publishedDate instead of just date for the rule because it's more clear what kind of date is meant (even though I don't know what other date could be meant, lol).

Also, do you have example websites where different tags are used for the published date? The url in your comment uses script[type="application/ld+json"] but would be nice to see examples for the other tags, too.

edit: also please do a rebase, there are conflicts currently

lib/timedate-scraper.js Outdated Show resolved Hide resolved
api/resolvers/item.js Outdated Show resolved Hide resolved
lib/timedate-scraper.js Outdated Show resolved Hide resolved
lib/timedate-scraper.js Outdated Show resolved Hide resolved
lib/timedate-scraper.js Outdated Show resolved Hide resolved
@rleed
Copy link
Contributor Author

rleed commented Sep 30, 2023

Just a quick update that I've been working on this. I probably won't have new push till next week though.

@huumn huumn marked this pull request as draft September 30, 2023 20:41
@rleed rleed marked this pull request as ready for review October 3, 2023 22:40
@ekzyis
Copy link
Member

ekzyis commented Oct 17, 2023

Mhh, I think we missed that this is ready for review again.

I'll try to do a review tomorrow.

Copy link
Member

@ekzyis ekzyis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better now!

Most important comments:

  • I don't understand the purpose of initDateRule
  • TypeError for links where no publication date was found

api/resolvers/item.js Outdated Show resolved Hide resolved
lib/time.js Outdated Show resolved Hide resolved
api/resolvers/item.js Outdated Show resolved Hide resolved
@ekzyis
Copy link
Member

ekzyis commented Oct 17, 2023

By the way, it's a good practice to rebase your branches frequently on master to avoid complex conflicts.

There are currently no conflicts but this means you can rebase for free.

For example, your branch date-scraper is 49 commits behind master while date-ranges and your master branch are 42 commits behind:

$ git rev-list master --not rleed/date-scraper | wc -l
49
$ git rev-list master --not rleed/date-ranges | wc -l
42
$ git rev-list master --not rleed/master | wc -l
42

@rleed
Copy link
Contributor Author

rleed commented Oct 18, 2023

  • I don't understand the purpose of initDateRule

It just adds our custom rule to the ruleset, since it isn't built-in. Without it, the following line wouldn't know how to find the date in the page. If there's a better way/place to do it, I'm open, but this just seemed like the most logical/reliable place to put it.

@rleed
Copy link
Contributor Author

rleed commented Oct 18, 2023

By the way, it's a good practice to rebase your branches frequently on master to avoid complex conflicts.

There are currently no conflicts but this means you can rebase for free.

For example, your branch date-scraper is 49 commits behind master while date-ranges and your master branch are 42 commits behind:

$ git rev-list master --not rleed/date-scraper | wc -l
49
$ git rev-list master --not rleed/date-ranges | wc -l
42
$ git rev-list master --not rleed/master | wc -l
42

Thanks... I will go through them all now.

@rleed rleed force-pushed the date-scraper branch 2 times, most recently from 3b5a9f6 to c6a1d93 Compare October 18, 2023 12:24
@ekzyis
Copy link
Member

ekzyis commented Oct 18, 2023

It just adds our custom rule to the ruleset, since it isn't built-in. Without it, the following line wouldn't know how to find the date in the page. If there's a better way/place to do it, I'm open, but this just seemed like the most logical/reliable place to put it.

Oh, I never added why I don't understand it. I had written a lengthy comment about it but somehow I must have lost it in all my tabs, lol. Github is confusing if you have multiple tabs of it open ...

I don't understand it because you're manipulating something you imported at lib/timedate-scraper.js:95:

metadataRuleSets.publicationDate = ruleSet

I am actually surprised this even works - I guess because javascript imports are references? So this changed value is actually reflected where you use it in api/resolvers/item.js:489:

const metadata = getMetadata(doc, url, { title: metadataRuleSets.title, publicationDate: metadataRuleSets.publicationDate })

But since initDateRule takes no arguments ... why don't you just export ruleSet? And then import that in api/resolvers/item.js:489? That's beyond me.

Anyway, you shouldn't manipulate imports imo. That probably relies on very specific ESM import mechanics and I've never seen this before, lol

@rleed
Copy link
Contributor Author

rleed commented Oct 18, 2023

Done, thanks!

@huumn huumn merged commit 72b8b5b into stackernews:master Oct 21, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Attempt to get publication year when auto-titling links
3 participants