-
-
Notifications
You must be signed in to change notification settings - Fork 37
/
rdrview.1
158 lines (158 loc) · 4.34 KB
/
rdrview.1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
.\" rdrview.1 - manpage for rdrview
.\"
.\" Copyright (C) 2020 Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
.\"
.TH rdrview 1 "October 2024" "0.1.3"
.SH NAME
rdrview \- extract readable content from a webpage
.SH SYNOPSIS
.B rdrview
.RB [ -v ]
[\fB-u \fIbase-url\fR]
[\fB-E \fIencoding\fR]
[\fB-A \fIuser-agent\fR]
[\fB-T \fItemplate\fR]
[\fB-P\fR]
[\fB-c\fR|\fB-H\fR|\fB\-M\fR|\fB\-B \fIbrowser\fR]
[\fIpath\fR|\fIurl\fR]
.SH DESCRIPTION
.B rdrview
attempts to extract the meaningful content from a webpage,
as done by the "Reader View" feature of most modern browsers.
It's intended to be used with terminal RSS readers,
to clean up the articles for display on web browsers such as lynx.
.PP
If no
.I url
or
.I path
is provided,
the HTML will be read from standard input.
By default,
.B rdrview
will check mailcap for a way to display the content as text.
If preferred, a browser can be specified with the
.B \-B
option, or with the
.B RDRVIEW_BROWSER
environment variable.
.SH EXAMPLES
.TP 3
If you have a text mode browser, you can extract content with just:
.B rdrview 'https://en.wikipedia.org/wiki/World_wide_web'
.TP 3
To see the same article in a browser:
.B rdrview -B firefox 'https://en.wikipedia.org/wiki/World_wide_web'
.TP 3
To clean up local HTML files:
.B rdrview -H -u 'http://fakehost.com' < source.html > result.html
.TP 3
To mediate between the \
\fBnewsboat\fR(1) feed reader and \fBlynx\fR(1):
.B BROWSER='rdrview -B lynx' newsboat
.SH OPTIONS
.TP
.BR \-c ", " \-\-check
Don't extract content,
just run a quick check to see if the document appears to have any.
Exit status is 0 in that case, or 1 otherwise.
.TP
\fB-u \fIbase-url\fR, \fB--base=\fIbase-url
Specify the base to be used for all relative URLs.
This option is most useful for local files and standard input,
where the document's URL may be unknown.
.TP
.BR \-v ", " \-\-version
Print the version number of
.B rdrview
and exit.
.TP
\fB\-A \fIuser-agent\fR, \fB--agent=\fIuser-agent
Specify the user-agent string.
The default should work fine in most situations.
.TP
\fB\-B \fIbrowser\fR, \fB--browser=\fIbrowser
Specify a browser to display the result.
.TP
\fB\-E \fIencoding\fR, \fB--encoding=\fIencoding
Specify the character encoding of the source.
By default,
the meta tags will be checked.
.TP
.BR \-H ", " \-\-html
Output the raw HTML for the extracted article.
WARNING: the markup may still contain some scripts so,
if you plan to open it with a modern browser at some point,
first check how it implements the same-origin policy for local files.
.TP
.BR \-M ", " \-\-meta
Output only the metadata for the article.
.TP
.BR \-P ", " \-\-preserve-classes
Don't remove html class attributes.
.TP
\fB\-T \fItemplate\fR, \fB--template=\fItemplate
Pick the metadata to include in the extracted article.
The template is a comma-separated list of some of the following:
.IR title ,
.IR body ,
.IR byline ,
.IR excerpt ,
.IR sitename ,
.IR url .
The order matters, and metadata fields can be repeated.
By default, only the body is included.
.TP
.BR \-\-disable-sandbox
Disable the security sandbox.
This option is potentially dangerous,
so don't use it unless you know what you are doing.
.SH EXIT STATUS
The exit status is 0 on success, 1 on failure.
.SH ENVIRONMENT
Any environment understood by
.BR curl (1)
can be used here.
.B TMPDIR
is respected as well.
.TP 3
.B RDRVIEW_BROWSER
Default browser to display the extracted articles. The
.B \-B
option overrides this.
.TP 3
.B RDRVIEW_TEMPLATE
Default template for article content. The
.B \-T
option overrides this; see that option for details.
.TP 3
.B RDRVIEW_USER_AGENT
Default user-agent string, overridden by the
.B \-A
option.
.SH BUGS
The markup produced by the
.B \-H
option is a huge mess.
If you intend to work with it you may want to pipe it to something like
.BR tidy (1)
first.
.PP
If you have a version of the libraries that hasn't been tested,
the security sandbox might not allow the code to run.
Please report this,
but in the meantime,
an option is provided to disable the sandbox.
Don't use it unless you have other security measures in place.
.SH AUTHOR
Ernesto A. Fernández
\%<ernesto.mnd.fernandez@gmail.com>
.PP
Please report bugs via email or, if preferred, file a github issue at
\%https://github.com/eafer/rdrview/issues.
.PP
Credits to Readability.js by Mozilla;
this tool is mostly a transpilation of their code done by hand.
.SH SEE ALSO
.BR lynx (1),
.BR newsboat (1)