Skipping tables #245

Slona · 2023-02-09T21:36:18Z

Hi!

The bot skips tables in RSS posts, along with all of their content.
For example, here is the markup from the RSS feed:

&lt;table class="table1"&gt;
	&lt;tbody&gt;
		&lt;tr&gt;
			&lt;td  rowspan="2"&gt;&lt;span&gt;&lt;b&gt;№&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;
			&lt;td  rowspan="2"&gt;CODE&lt;/td&gt;
			&lt;td  rowspan="2"&gt;PAPER&lt;/td&gt;
			&lt;td colspan="3"&gt;MINIMUM LEVEL, %&lt;/td&gt;
			&lt;td colspan="2" &gt;LIMIT, pcs.&lt;/td&gt;
			&lt;td rowspan="2"&gt;SHORT RESTRICTION&lt;/td&gt;
			&lt;td rowspan="2"&gt;SOME OTHER ITEM&lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;1ST LEVEL, S1_min&lt;/td&gt;
			&lt;td&gt;2ND LEVEL, S2_min&lt;/td&gt;
			&lt;td&gt;3RD LEVEL, S3_min&lt;/td&gt;
			&lt;td&gt;1 LEVEL&lt;/td&gt;
			&lt;td&gt;2 LEVEL&lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td &gt;1&lt;/td&gt;
			&lt;td &gt;RU000A105SG2&lt;/td&gt;
			&lt;td &gt;ООО &amp;quot;GAZPROM&amp;quot;&lt;/td&gt;
			&lt;td &gt;52%&lt;/td&gt;
			&lt;td &gt;55%&lt;/td&gt;
			&lt;td &gt;58%&lt;/td&gt;
			&lt;td &gt;40 000&lt;/td&gt;
			&lt;td &gt;200 000&lt;/td&gt;
			&lt;td&gt;NO&lt;/td&gt;
			&lt;td&gt;YES&lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td &gt;2&lt;/td&gt;
			&lt;td &gt;RU000A105TK2&lt;/td&gt;
			&lt;td &gt;АО &amp;quot;RHBZ&amp;quot;&lt;/td&gt;
			&lt;td &gt;12%&lt;/td&gt;
			&lt;td &gt;15%&lt;/td&gt;
			&lt;td &gt;18%&lt;/td&gt;
			&lt;td &gt;300 000&lt;/td&gt;
			&lt;td &gt;1 500 000&lt;/td&gt;
			&lt;td&gt;NO&lt;/td&gt;
			&lt;td&gt;NO&lt;/td&gt;
		&lt;/tr&gt;
	&lt;/tbody&gt;
&lt;/table&gt;

and the bot is just skipping all the contents.

Slona · 2023-02-09T21:56:52Z

Here is how it looks on the web.

Okay, Telegram can't render tables, but it could at least display their text contents.

Slona · 2023-02-09T22:01:15Z

I tested reposting to Telegraph, and the posts there are also missing the tables and their contents.

Rongronggg9 · 2023-02-10T13:52:24Z

Telegraph lacks support for HTML tables too.

Slona · 2023-02-10T18:04:07Z

Any way to extract text from tables? Or make an image?

Rongronggg9 · 2023-02-13T17:55:27Z

> Any way to extract text from tables?

I don't think it will be readable, especially when there are rowspan/colspan attributes or cells with long text.

> Or make an image?

It is possible... technically, and impossible... practically. RSStT does have a table-to-image converter module powered by matplotlib, but it can handle neither rowspan/colspan nor cells with long text. Rendering an HTML table "perfectly" requires a browser or a browser-like renderer; some projects, e.g. wkhtmltopdf and html2image, can do that. The problem, however, is that rendering is not the only consideration: security always matters more. Passing untrusted HTML to a browser or a browser-like renderer could result in RCE (Remote Code Execution) or DoS (Denial of Service), two well-known classes of vulnerability.
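
To illustrate why rowspan/colspan are awkward for a simple grid-based renderer, here is a minimal sketch (not RSStT's actual converter) that expands spanned cells into a rectangular grid using only the standard library. The names `expand_spans`, `_CellCollector`, `_span` and the `MAX_SPAN` cap are hypothetical, and the sketch assumes a single, non-nested table with explicit closing tags.

```python
from html.parser import HTMLParser

MAX_SPAN = 1000  # assumed cap so hostile span values cannot blow up the grid


def _span(value):
    """Parse a rowspan/colspan attribute, falling back to 1 on bad input."""
    try:
        return min(max(int(value), 1), MAX_SPAN)
    except (TypeError, ValueError):
        return 1


class _CellCollector(HTMLParser):
    """Collect (text, rowspan, colspan) tuples for every <tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # one list of cells per <tr>
        self._cell = None  # (text fragments, rowspan, colspan) of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = ([], _span(a.get("rowspan")), _span(a.get("colspan")))

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            text = " ".join("".join(parts).split())  # collapse whitespace
            self.rows[-1].append((text, rowspan, colspan))
            self._cell = None


def expand_spans(markup):
    """Return the table as a list of rows, repeating the text of spanned cells."""
    parser = _CellCollector()
    parser.feed(markup)
    parser.close()
    n_rows = len(parser.rows)
    grid = {}  # (row, column) -> cell text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # slot already taken by a rowspan from above
                c += 1
            for dr in range(min(rowspan, n_rows - r)):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_cols = max((c for _, c in grid), default=-1) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

Repeating the text of a spanned cell in every slot it covers is only one possible policy. A real renderer would more likely draw one merged cell, which is the part a plain grid layout struggles with.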

Slona · 2023-03-02T06:53:06Z

> I don't think it will be readable, especially when there are rowspan/colspan attributes or cells with long text.

I guess in our case, the text data itself is more vital than readability.

Also, I have another script in use that converts emails to Telegram messages, and there I found the following part, which converts HTML to plain text, ignoring all tags:

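import re

import mistletoe
from bs4 import BeautifulSoup  # the "lxml" parser below also needs lxml installed
from html2text import HTML2Text

def html2md(html):
	# Convert HTML to Markdown with html2text (no images, no hard line wrapping)
	parser = HTML2Text()
	parser.ignore_images = True
	parser.ignore_anchors = True
	parser.body_width = 0
	md = parser.handle(html)
	return md

def html2plain(html):
	md = html2md(html)
	# Normalize bullet-like characters at the start of a line to Markdown "*" bullets
	md = re.sub(r'(^|\n) ? ? ?\\?[•·\u2013\u2014*-]( \w)', r'\1  *\2', md)
	# Render the Markdown back to simple HTML, then keep only its text
	html_simple = mistletoe.markdown(md)
	soup = BeautifulSoup(html_simple, features="lxml")
	text = soup.get_text()
	# Drop the leading "|" left at the start of table rows
	text = re.sub(r'(^|\n)\|\s*', r'\1', text)
	# Drop leftover bold markers
	text = re.sub(r'\*\*', '', text)
	# Strip trailing spaces at the end of the text
	text = re.sub(r' *$', '', text)
	# Collapse runs of blank lines and remove leading newlines
	text = re.sub(r'\n\n+', r'\n\n', text)
	text = re.sub(r'^\n+', '', text)
	return text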
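
For comparison, here is a dependency-free sketch of the same idea limited to tables: it emits one line of text per table row, with cells joined by " | ". It is an illustration only, not the script above and not how the bot works. `TableTextParser`, `table_to_lines` and the `escaped` flag are hypothetical names, and the sketch assumes explicit closing tags and no nested tables.

```python
from html import unescape
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Turn each <tr> into one line of text, with cells joined by ' | '."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace inside the cell before storing it
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.lines.append(" | ".join(self._row))
            self._row = None


def table_to_lines(markup, escaped=False):
    """Return one text line per table row.

    Pass escaped=True when the markup is entity-escaped (&lt;td&gt;...),
    as in the RSS snippet at the top of this issue.
    """
    if escaped:
        markup = unescape(markup)
    parser = TableTextParser()
    parser.feed(markup)
    parser.close()
    return parser.lines
```

On the sample table from this issue, the two header rows would come out as separate lines with different cell counts, because rowspan and colspan are ignored here.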

Rongronggg9 · 2023-03-02T14:08:29Z

> I guess in our case, the text data itself is more vital than readability.

Not everyone will be pleased when such an unreadable chunk messes up their messages.

> Also, I have another script in use that converts emails to Telegram messages, and there I found the following part, which converts HTML to plain text, ignoring all tags

I know exactly how to convert a table to plain text or an image, but the problem is not "how to" but "should we". At least for me, converting a table into plain text is never what I want. I will keep looking for the possibility of converting it to an image securely.

Slona · 2023-03-03T09:24:11Z

Ah, I understand now, sorry!
Thanks in advance!

Slona · 2023-03-27T13:20:49Z

I've seen that you added tables! That's great! Thank you very much!

Rongronggg9 added the "enhancement" and "help wanted" labels on Mar 2, 2023
