Skipping tables #245

Open
Slona opened this issue Feb 9, 2023 · 9 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@Slona

Slona commented Feb 9, 2023

Hi!

The bot skips tables in RSS posts, along with all of their content.
For example, here is the HTML from the RSS feed:

<table class="table1">
	<tbody>
		<tr>
			<td  rowspan="2"><span><b>№</b></span></td>
			<td  rowspan="2">CODE</td>
			<td  rowspan="2">PAPER</td>
			<td colspan="3">MINIMUM LEVEL, %</td>
			<td colspan="2" >LIMIT, pcs.</td>
			<td rowspan="2">SHORT RESTRICTION</td>
			<td rowspan="2">SOME OTHER ITEM</td>
		</tr>
		<tr>
			<td>1ST LEVEL, S1_min</td>
			<td>2ND LEVEL, S2_min</td>
			<td>3RD LEVEL, S3_min</td>
			<td>1 LEVEL</td>
			<td>2 LEVEL</td>
		</tr>
		<tr>
			<td >1</td>
			<td >RU000A105SG2</td>
			<td >ООО "GAZPROM"</td>
			<td >52%</td>
			<td >55%</td>
			<td >58%</td>
			<td >40 000</td>
			<td >200 000</td>
			<td>NO</td>
			<td>YES</td>
		</tr>
		<tr>
			<td >2</td>
			<td >RU000A105TK2</td>
			<td >АО "RHBZ"</td>
			<td >12%</td>
			<td >15%</td>
			<td >18%</td>
			<td >300 000</td>
			<td >1 500 000</td>
			<td>NO</td>
			<td>NO</td>
		</tr>
	</tbody>
</table>

and the bot simply skips all of its contents.

@Slona
Author

Slona commented Feb 9, 2023

Here is how it looks on the web:
image

Okay, Telegram can't render tables, but it could at least display their text contents.

@Slona
Author

Slona commented Feb 9, 2023

I also tested reposting to Telegraph: the posts there are missing the tables and their contents as well.

image

@Rongronggg9
Owner

Telegraph lacks support for HTML tables too.

@Slona
Author

Slona commented Feb 10, 2023

Any way to extract text from tables? Or make an image?

@Rongronggg9
Owner

Rongronggg9 commented Feb 13, 2023

> Any way to extract text from tables?

I don't think it will be readable, especially when there are rowspan/columnspan or cells with long text.
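To make the readability concern concrete, here is a minimal sketch of naive text extraction using only Python's standard `html.parser`. It is not RSStT's code: the names `TableTextExtractor` and `table_to_text` are hypothetical, and the sample is a reduced version of the table from the first comment. It ignores `rowspan`/`colspan` and nested tables entirely.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None outside a row
        self._cell = None   # text fragments of the cell being parsed, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse internal whitespace so each cell becomes a single line.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def table_to_text(html, sep=" | "):
    """Return one line of text per table row; spans are ignored."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(sep.join(row) for row in parser.rows)


sample = """
<table>
  <tr><td rowspan="2">CODE</td><td colspan="2">MINIMUM LEVEL, %</td></tr>
  <tr><td>1ST LEVEL</td><td>2ND LEVEL</td></tr>
  <tr><td>RU000A105SG2</td><td>52%</td><td>55%</td></tr>
</table>
"""
print(table_to_text(sample))
```

With spans ignored, the second header row has only two cells, so "1ST LEVEL" lands under "CODE" and the columns no longer line up with the data row. That misalignment is the kind of unreadable output described above.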

> Or make an image?

It is possible technically, but impossible practically. There is a table-to-image converter module powered by matplotlib in RSStT, but it can handle neither rowspan/columnspan nor cells with long text. To render an HTML table "perfectly", a browser or a browser-like renderer is required; some projects, e.g. wkhtmltopdf and html2image, can achieve that. The problem, however, is that rendering is not the only consideration: security always matters more. Passing untrusted HTML to a browser or a browser-like renderer could result in RCE (Remote Code Execution) or DoS (Denial of Service), two well-known kinds of vulnerability.
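As an illustration of what handling rowspan/colspan without a browser would involve, below is a rough sketch of one possible preprocessing step: expanding spanned cells into a plain rectangular grid, which a simple grid-based renderer could then draw. This is an assumption about how one might approach it, not how RSStT's converter works. All names (`SpanTableParser`, `expand_spans`, `table_to_grid`, `MAX_COLSPAN`) are hypothetical, and the span caps are arbitrary limits chosen because the feed content is untrusted.

```python
from html.parser import HTMLParser

MAX_COLSPAN = 1000  # arbitrary cap (assumption) so a hostile feed cannot request a huge grid


class SpanTableParser(HTMLParser):
    """Collect rows of (text, rowspan, colspan) tuples from a single table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the current <tr>, or None outside a row
        self._cell = None   # [text_fragments, rowspan, colspan], or None

    @staticmethod
    def _span(attrs, name):
        # Missing or malformed span attributes fall back to 1.
        try:
            return max(1, int(dict(attrs).get(name) or 1))
        except ValueError:
            return 1

    def _flush_cell(self):
        if self._cell is not None and self._row is not None:
            fragments, rowspan, colspan = self._cell
            text = " ".join("".join(fragments).split())
            self._row.append((text, rowspan, colspan))
        self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()
            self._cell = [[], self._span(attrs, "rowspan"), self._span(attrs, "colspan")]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr":
            self._flush_row()


def expand_spans(rows):
    """Place each cell on a rectangular grid, repeating its text across its span."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip slots taken by a rowspan from above
                c += 1
            rowspan = min(rowspan, len(rows) - r)  # never extend past the last row
            colspan = min(colspan, MAX_COLSPAN)
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def table_to_grid(html):
    parser = SpanTableParser()
    parser.feed(html)
    parser.close()
    return expand_spans(parser.rows)


sample = """
<table>
  <tr><td rowspan="2">CODE</td><td colspan="2">MINIMUM LEVEL, %</td></tr>
  <tr><td>1ST LEVEL</td><td>2ND LEVEL</td></tr>
  <tr><td>RU000A105SG2</td><td>52%</td><td>55%</td></tr>
</table>
"""
for line in table_to_grid(sample):
    print(line)
```

Repeating the text in every covered slot is only one design choice; leaving the extra slots empty would be another. Either way, this does nothing about cells with long text, and it only parses the HTML rather than rendering it, which avoids handing untrusted markup to a browser-like engine.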

@Slona
Author

Slona commented Mar 2, 2023

> I don't think it will be readable, especially when there are rowspan/columnspan or cells with long text.

I guess that in our case, having the contents as text matters more than readability.

Also, I have another script in use that converts emails to Telegram messages, and in it I found the following part, which converts HTML to plain text, ignoring all tags:

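import re

import mistletoe
from bs4 import BeautifulSoup
from html2text import HTML2Text

def html2md(html):
	# Convert HTML to Markdown, dropping images and link targets.
	parser = HTML2Text()
	parser.ignore_images = True
	# html2text names this option ignore_links; the original set ignore_anchors.
	parser.ignore_links = True
	parser.body_width = 0  # do not hard-wrap lines
	md = parser.handle(html)
	return md

def html2plain(html):
	md = html2md(html)
	# Normalize bullet characters (•, ·, dashes, *) to Markdown "*" list items.
	md = re.sub(r'(^|\n) ? ? ?\\?[•·–—*-]( \w)', r'\1  *\2', md)
	# Render the Markdown back to simple HTML, then take only its text.
	html_simple = mistletoe.markdown(md)
	soup = BeautifulSoup(html_simple, features="lxml")
	text = soup.get_text()
	# Drop the leading "|" that table rows leave at the start of a line.
	text = re.sub(r'(^|\n)\|\s*', r'\1', text)
	# Remove leftover bold markers.
	text = re.sub(r'\*\*', '', text)
	# Without re.MULTILINE this trims spaces only at the end of the text.
	text = re.sub(r' *$', '', text)
	# Collapse runs of blank lines and strip leading newlines.
	text = re.sub(r'\n\n+', r'\n\n', text)
	text = re.sub(r'^\n+', '', text)
	return text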

@Rongronggg9
Owner

> I guess that in our case, having the contents as text matters more than readability.

Not everyone will be pleased when such an unreadable chunk messes up their messages.

> Also, I have another script in use that converts emails to Telegram messages, and in it I found the following part, which converts HTML to plain text, ignoring all tags

I know exactly how to convert a table to plain text or an image, but the problem is not "how to" but "should we". At least for me, converting a table into plain text is never what I want. I will keep looking for the possibility of converting it to an image securely.

Rongronggg9 added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels Mar 2, 2023

@Slona
Author

Slona commented Mar 3, 2023

Ah, I understand now, sorry!
Thanks in advance!

@Slona
Author

Slona commented Mar 27, 2023

I've seen that you added tables! That's great! Thank you very much!

