Fix Reddit content presentation #532

Merged 6 commits on Sep 2, 2023
131 changes: 66 additions & 65 deletions Gemfile.lock
@@ -3,67 +3,67 @@ GEM
   specs:
     aasm (5.5.0)
       concurrent-ruby (~> 1.0)
-    actioncable (7.0.6)
-      actionpack (= 7.0.6)
-      activesupport (= 7.0.6)
+    actioncable (7.0.7.2)
+      actionpack (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
       nio4r (~> 2.0)
       websocket-driver (>= 0.6.1)
-    actionmailbox (7.0.6)
-      actionpack (= 7.0.6)
-      activejob (= 7.0.6)
-      activerecord (= 7.0.6)
-      activestorage (= 7.0.6)
-      activesupport (= 7.0.6)
+    actionmailbox (7.0.7.2)
+      actionpack (= 7.0.7.2)
+      activejob (= 7.0.7.2)
+      activerecord (= 7.0.7.2)
+      activestorage (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
       mail (>= 2.7.1)
       net-imap
       net-pop
       net-smtp
-    actionmailer (7.0.6)
-      actionpack (= 7.0.6)
-      actionview (= 7.0.6)
-      activejob (= 7.0.6)
-      activesupport (= 7.0.6)
+    actionmailer (7.0.7.2)
+      actionpack (= 7.0.7.2)
+      actionview (= 7.0.7.2)
+      activejob (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
       mail (~> 2.5, >= 2.5.4)
       net-imap
       net-pop
       net-smtp
       rails-dom-testing (~> 2.0)
-    actionpack (7.0.6)
-      actionview (= 7.0.6)
-      activesupport (= 7.0.6)
+    actionpack (7.0.7.2)
+      actionview (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
       rack (~> 2.0, >= 2.2.4)
       rack-test (>= 0.6.3)
       rails-dom-testing (~> 2.0)
       rails-html-sanitizer (~> 1.0, >= 1.2.0)
-    actiontext (7.0.6)
-      actionpack (= 7.0.6)
-      activerecord (= 7.0.6)
-      activestorage (= 7.0.6)
-      activesupport (= 7.0.6)
+    actiontext (7.0.7.2)
+      actionpack (= 7.0.7.2)
+      activerecord (= 7.0.7.2)
+      activestorage (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
       globalid (>= 0.6.0)
       nokogiri (>= 1.8.5)
-    actionview (7.0.6)
-      activesupport (= 7.0.6)
+    actionview (7.0.7.2)
+      activesupport (= 7.0.7.2)
       builder (~> 3.1)
       erubi (~> 1.4)
       rails-dom-testing (~> 2.0)
       rails-html-sanitizer (~> 1.1, >= 1.2.0)
-    activejob (7.0.6)
-      activesupport (= 7.0.6)
+    activejob (7.0.7.2)
+      activesupport (= 7.0.7.2)
       globalid (>= 0.3.6)
-    activemodel (7.0.6)
-      activesupport (= 7.0.6)
-    activerecord (7.0.6)
-      activemodel (= 7.0.6)
-      activesupport (= 7.0.6)
-    activestorage (7.0.6)
-      actionpack (= 7.0.6)
-      activejob (= 7.0.6)
-      activerecord (= 7.0.6)
-      activesupport (= 7.0.6)
+    activemodel (7.0.7.2)
+      activesupport (= 7.0.7.2)
+    activerecord (7.0.7.2)
+      activemodel (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
+    activestorage (7.0.7.2)
+      actionpack (= 7.0.7.2)
+      activejob (= 7.0.7.2)
+      activerecord (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
       marcel (~> 1.0)
       mini_mime (>= 1.1.0)
-    activesupport (7.0.6)
+    activesupport (7.0.7.2)
       concurrent-ruby (~> 1.0, >= 1.0.2)
       i18n (>= 1.6, < 2)
       minitest (>= 5.1)
@@ -144,8 +144,8 @@ GEM
     ffi-compiler (1.0.1)
       ffi (>= 1.0.0)
       rake
-    globalid (1.1.0)
-      activesupport (>= 5.0)
+    globalid (1.2.0)
+      activesupport (>= 6.1)
     hashdiff (1.0.1)
     honeybadger (5.2.1)
     http (5.1.1)
@@ -190,11 +190,11 @@ GEM
     mimemagic (0.4.3)
       nokogiri (~> 1)
       rake
-    mini_mime (1.1.2)
-    mini_portile2 (2.8.2)
-    minitest (5.18.1)
+    mini_mime (1.1.5)
+    mini_portile2 (2.8.4)
+    minitest (5.19.0)
     msgpack (1.7.1)
-    net-imap (0.3.6)
+    net-imap (0.3.7)
       date
       net-protocol
     net-pop (0.1.2)
@@ -205,7 +205,7 @@ GEM
       net-protocol
     netrc (0.11.0)
     nio4r (2.5.9)
-    nokogiri (1.15.2)
+    nokogiri (1.15.4)
       mini_portile2 (~> 2.8.2)
       racc (~> 1.4)
     parallel (1.23.0)
@@ -225,32 +225,33 @@ GEM
     puma (6.3.1)
       nio4r (~> 2.0)
     racc (1.7.1)
-    rack (2.2.7)
+    rack (2.2.8)
     rack-test (2.1.0)
       rack (>= 1.3)
-    rails (7.0.6)
-      actioncable (= 7.0.6)
-      actionmailbox (= 7.0.6)
-      actionmailer (= 7.0.6)
-      actionpack (= 7.0.6)
-      actiontext (= 7.0.6)
-      actionview (= 7.0.6)
-      activejob (= 7.0.6)
-      activemodel (= 7.0.6)
-      activerecord (= 7.0.6)
-      activestorage (= 7.0.6)
-      activesupport (= 7.0.6)
+    rails (7.0.7.2)
+      actioncable (= 7.0.7.2)
+      actionmailbox (= 7.0.7.2)
+      actionmailer (= 7.0.7.2)
+      actionpack (= 7.0.7.2)
+      actiontext (= 7.0.7.2)
+      actionview (= 7.0.7.2)
+      activejob (= 7.0.7.2)
+      activemodel (= 7.0.7.2)
+      activerecord (= 7.0.7.2)
+      activestorage (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
       bundler (>= 1.15.0)
-      railties (= 7.0.6)
-    rails-dom-testing (2.0.3)
-      activesupport (>= 4.2.0)
+      railties (= 7.0.7.2)
+    rails-dom-testing (2.2.0)
+      activesupport (>= 5.0.0)
+      minitest
       nokogiri (>= 1.6)
     rails-html-sanitizer (1.6.0)
       loofah (~> 2.21)
       nokogiri (~> 1.14)
-    railties (7.0.6)
-      actionpack (= 7.0.6)
-      activesupport (= 7.0.6)
+    railties (7.0.7.2)
+      actionpack (= 7.0.7.2)
+      activesupport (= 7.0.7.2)
       method_source
       rake (>= 12.2)
       thor (~> 1.0)
@@ -351,11 +352,11 @@ GEM
       addressable (>= 2.8.0)
       crack (>= 0.3.2)
       hashdiff (>= 0.4.0, < 2.0.0)
-    websocket-driver (0.7.5)
+    websocket-driver (0.7.6)
       websocket-extensions (>= 0.1.0)
     websocket-extensions (0.1.5)
     yaml-lint (0.1.2)
-    zeitwerk (2.6.8)
+    zeitwerk (2.6.11)

 PLATFORMS
   ruby
40 changes: 27 additions & 13 deletions app/normalizers/reddit_normalizer.rb
@@ -1,29 +1,43 @@
-class RedditNormalizer < AtomNormalizer
+class RedditNormalizer < BaseNormalizer
   def link
-    discussion_url
+    xml.xpath("/entry/link").first.attributes["href"].value
   end

-  def text
-    [super.sub(/\.$/, ""), source_url].join(separator)
+  def published_at
+    DateTime.parse(xml.xpath("/entry/published").first.content)
   end

-  def comments
-    (source_url == discussion_url) ? [] : [discussion_url]
+  def text
+    source_url = extract_source_url
+    source_reference = source_url.present? ? "#{separator}#{source_url}" : ""
+    "#{title}#{source_reference}\nThread: #{link}"
   end

   private

-  def source_url
-    @source_url ||= Html.link_urls(extract_content)[1]
+  def thumbnail_url
+    xml.xpath("/entry/thumbnail").first.attributes["url"].value
   rescue StandardError
-    @source_url ||= discussion_url
+    nil
+  end
+
+  def extract_source_url
+    content_urls.reject { URI.parse(_1).host =~ /reddit\.com/ }.first
+  end
+
+  def content_urls
+    parsed_content_html.css("a").map { _1.attributes["href"].value }
+  end
+
+  def parsed_content_html
+    Nokogiri::HTML(xml.xpath("/entry/content").first.content)
   end

-  def discussion_url
-    entity.content.link.href
+  def title
+    xml.xpath("/entry/title").first.content
   end

-  def extract_content
-    content.content.content
+  def xml
+    @xml ||= Nokogiri::XML(content).tap { _1.remove_namespaces! }
   end
 end
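
Review note: the new `text` rule in this file (title, then an optional external source link, then a `Thread:` line) can be sketched in isolation roughly as below. This is a minimal sketch using only the standard library in place of Nokogiri, not the PR's implementation. `SEPARATOR`, `first_external_url`, and `compose_text` are hypothetical names; the `" - "` separator is inferred from the normalized fixture; and the host check is deliberately stricter than the diff's `=~ /reddit\.com/` (it anchors the match and skips hrefs that have no host), which is an assumption on my part.

```ruby
require "uri"

# Hypothetical constant; in the PR the separator comes from the base normalizer.
SEPARATOR = " - "

# Returns the first URL whose host is outside reddit.com, or nil if none.
def first_external_url(urls)
  urls.find do |url|
    host = URI.parse(url).host
    host && !host.match?(/(\A|\.)reddit\.com\z/)
  rescue URI::InvalidURIError
    false # skip malformed hrefs instead of raising
  end
end

# Builds "<title>[ - <source url>]\nThread: <link>".
def compose_text(title:, link:, urls:)
  source_url = first_external_url(urls)
  source_reference = source_url ? "#{SEPARATOR}#{source_url}" : ""
  "#{title}#{source_reference}\nThread: #{link}"
end

puts compose_text(
  title: "Example title",
  link: "https://www.reddit.com/r/example/comments/abc/",
  urls: ["https://www.reddit.com/user/someone", "https://example.com/story"]
)
```

Self posts contain only Reddit links, so the source reference collapses to an empty string and only the title and thread line remain, which matches the first entry of the normalized fixture.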
31 changes: 21 additions & 10 deletions app/processors/reddit_processor.rb
@@ -1,32 +1,43 @@
-class RedditProcessor < AtomProcessor
+class RedditProcessor < BaseProcessor
   SCORE_THRESHOLD = 2000
   POST_SCORE_CACHE_TTL = 2.hours

   def entities
-    super.select { |item| above_score_threshold?(item.uid) }
+    atom_feed_entries.select { above_score_threshold?(_1.uid) }
   end

   private

-  def above_score_threshold?(link)
-    cached_score(link) >= SCORE_THRESHOLD
+  def atom_feed_entries
+    parsed_xml.xpath("/feed/entry").map do |entry|
+      uid = entry.xpath("link").first.attributes["href"].value
+      build_entity(uid, entry.to_xml)
+    end
   end

-  def cached_score(link)
-    Rails.cache.fetch(cache_key(link), expires_in: POST_SCORE_CACHE_TTL) { score(link) }
+  def parsed_xml
+    Nokogiri::XML(content).tap { _1.remove_namespaces! }
   end

-  def score(link)
-    RedditPointsFetcher.new(link).points
+  def above_score_threshold?(url)
+    cached_score(url) >= SCORE_THRESHOLD
+  end
+
+  def cached_score(url)
+    Rails.cache.fetch(cache_key(url), expires_in: POST_SCORE_CACHE_TTL) { score(url) }
+  end
+
+  def score(url)
+    RedditPointsFetcher.new(url).points
   rescue StandardError => e
     # NOTE: Individual post score fetching should not crash the processor,
     # but they are getting reported to monitor Reddit availability
     Honeybadger.notify(e)
     0
   end

-  def cache_key(link)
-    [cache_key_prefix, link].join(":")
+  def cache_key(url)
+    [cache_key_prefix, url].join(":")
   end

   def cache_key_prefix
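
Review note: the threshold, cache, and fallback chain in this file can be exercised without Rails with a sketch along these lines. `TtlCache` is a hypothetical in-memory stand-in for `Rails.cache.fetch`, `ScoreFilter` and its `fetcher:` callable are illustrative names that replace `RedditPointsFetcher`, and the `"reddit_score:"` key prefix is made up since `cache_key_prefix` is not visible in this hunk.

```ruby
SCORE_THRESHOLD = 2000
CACHE_TTL_SECONDS = 2 * 60 * 60 # mirrors POST_SCORE_CACHE_TTL = 2.hours

# Hypothetical in-memory stand-in for Rails.cache.fetch(key, expires_in:) { ... }.
class TtlCache
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @store = {}
  end

  # Returns the cached value while it is fresh; otherwise runs the block and stores its result.
  def fetch(key, expires_in:)
    entry = @store[key]
    return entry[:value] if entry && entry[:expires_at] > @clock.call

    value = yield
    @store[key] = { value: value, expires_at: @clock.call + expires_in }
    value
  end
end

class ScoreFilter
  def initialize(fetcher:, cache: TtlCache.new)
    @fetcher = fetcher # hypothetical callable: url -> Integer score
    @cache = cache
  end

  def above_threshold?(url)
    cached_score(url) >= SCORE_THRESHOLD
  end

  private

  def cached_score(url)
    @cache.fetch("reddit_score:#{url}", expires_in: CACHE_TTL_SECONDS) { score(url) }
  end

  def score(url)
    @fetcher.call(url)
  rescue StandardError
    0 # a failed lookup counts as below the threshold
  end
end
```

One thing this makes visible: because the rescue sits inside the cached block, the fallback `0` is stored like any other score, so a transient fetch failure would keep a post filtered out until the entry expires. The same appears to hold for the diff's `Rails.cache.fetch` call; whether that is intended is worth confirming.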
4 changes: 2 additions & 2 deletions config/feeds.yml
@@ -158,8 +158,8 @@
   loader: "http"
   processor: "reddit"
   normalizer: "reddit"
-  after: "2021-01-06T12:00:00+00:00"
-  refresh_interval: 1200
+  after: "2023-09-02T00:00:00+00:00"
+  refresh_interval: 0
   import_limit: 2

 - name: "avokado-fm"
28 changes: 0 additions & 28 deletions spec/fixtures/files/feeds/reddit/expected_points.json

This file was deleted.

4 changes: 4 additions & 0 deletions spec/fixtures/files/feeds/reddit/expected_uids.json
@@ -0,0 +1,4 @@
+[
+  "https://www.reddit.com/r/worldnews/comments/167sdt8/rworldnews_live_thread_russian_invasion_of/",
+  "https://www.reddit.com/r/worldnews/comments/167ucb5/meta_and_alphabet_would_owe_at_least_4_of_annual/"
+]
42 changes: 41 additions & 1 deletion spec/fixtures/files/feeds/reddit/feed.xml

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions spec/fixtures/files/feeds/reddit/normalized.json
@@ -0,0 +1,20 @@
+[
+  {
+    "uid": "https://www.reddit.com/r/worldnews/comments/167sdt8/rworldnews_live_thread_russian_invasion_of/",
+    "link": "https://www.reddit.com/r/worldnews/comments/167sdt8/rworldnews_live_thread_russian_invasion_of/",
+    "published_at": "2023-09-02 04:02:37 +0000",
+    "text": "/r/WorldNews Live Thread\nThread: https://www.reddit.com/r/worldnews/comments/167sdt8/rworldnews_live_thread_russian_invasion_of/",
+    "attachments": [],
+    "comments": [],
+    "validation_errors": []
+  },
+  {
+    "uid": "https://www.reddit.com/r/worldnews/comments/167ucb5/meta_and_alphabet_would_owe_at_least_4_of_annual/",
+    "link": "https://www.reddit.com/r/worldnews/comments/167ucb5/meta_and_alphabet_would_owe_at_least_4_of_annual/",
+    "published_at": "2023-09-02 05:49:28 +0000",
+    "text": "Meta and Alphabet would owe at least 4% of annual revenue in Canada to news outlets under draft regulations pushed by Justin Trudeau - https://fortune.com/2023/09/01/meta-alphabet-canada-news-outlets-draft-regulations-justin-trudeau/\nThread: https://www.reddit.com/r/worldnews/comments/167ucb5/meta_and_alphabet_would_owe_at_least_4_of_annual/",
+    "attachments": [],
+    "comments": [],
+    "validation_errors": []
+  }
+]
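
Review note: both fixture records share two invariants: `uid` equals `link`, and `text` ends with a `Thread:` line pointing at that link. A quick check for them might look like the sketch below. `fixture_problems` is a hypothetical helper and the sample record uses a made-up URL, so this is not how the project's specs consume the fixture.

```ruby
require "json"

# Returns a list of human-readable problems for one normalized record.
def fixture_problems(record)
  link = record["link"]
  problems = []
  problems.push("uid and link differ") unless record["uid"] == link
  problems.push("text lacks thread line") unless record["text"].end_with?("\nThread: #{link}")
  problems
end

# Made-up sample record; the \n stays a JSON escape inside the single-quoted string.
sample_json = '[{"uid":"https://www.reddit.com/r/example/comments/abc/",' \
              '"link":"https://www.reddit.com/r/example/comments/abc/",' \
              '"text":"Example title\nThread: https://www.reddit.com/r/example/comments/abc/"}]'

JSON.parse(sample_json).each do |record|
  problems = fixture_problems(record)
  puts problems.empty? ? "ok: #{record["uid"]}" : "#{record["uid"]}: #{problems.join(", ")}"
end
```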