Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds sanitize_html, a whitelist based HTML sanitizer. #171

Merged
merged 7 commits into from
May 30, 2024

Conversation

Kapu1178
Copy link
Contributor

@Kapu1178 Kapu1178 commented Apr 21, 2024

Adds a customizable HTML sanitizer function using the Ammonia crate. Out of the box, it will:

  • Strip <script> and <style> attributes, as well as their contents.
  • Prune all URL schemes, including byond://
  • Prune all HTML attributes and CSS tags, but not their contents.

By providing json encoded lists, you can whitelist given attributes or tags to not be pruned. I have included a curated tag list in the dm source file for this module that will whitelist most safe CSS attributes.

It occured to me that alot of servers run things like old papercode, which does not sanitize on the server side before being viewable by a client. Sanitizing strings with DM would be an absolute performance nuke, assuming you could even make it bulletproof in the first place.
Here is a recommended default tag whitelist

list(
	"b","br",
	"center", "code",
	"dd", "del", "div", "dl", "dt",
	"em",
	"font",
	"h1", "h2", "h3", "h4", "h5", "h6", "hr",
	"i", "ins",
	"li",
	"menu",
	"ol",
	"p", "pre",
	"span", "strong",
	"table",
	"tbody",
	"td",
	"th",
	"thead",
	"tfoot",
	"tr",
	"u",
	"ul",
)

@Kapu1178
Copy link
Contributor Author

Kapu1178 commented Apr 21, 2024

Error: "sanitize = ["ammonia", "maplit", "serde_json"] is not sorted in Cargo.toml default features"

I am unsure how to fix this.

src/sanitize.rs Outdated Show resolved Hide resolved
* * attribute_whitelist_json: a json_encode()'d list of HTML attributes to allow in the final string.
* * tag_whitelist_json: a json_encode()'d list of HTML tags to allow in the final string.
*/
#define rustg_sanitize_html(text, attribute_whitelist_json, tag_whitelist_json) RUSTG_CALL(RUST_G, "sanitize_html")(text, attribute_whitelist_json, tag_whitelist_json)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semicolon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, am I missing something here?

Comment on lines +6 to +7
* * attribute_whitelist_json: a json_encode()'d list of HTML attributes to allow in the final string.
* * tag_whitelist_json: a json_encode()'d list of HTML tags to allow in the final string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface should take a list and json_encode in itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do this so that you can store pre-encoded global lists to save on perf.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that mean it's encoding on every call? The thing is that this will likely be called many times with only one or a few lists, so this introduces extra overhead.

.link_rel(Some("noopener")) // https://mathiasbynens.github.io/rel-noopener/
.url_schemes(prune_url_schemes)
.generic_attributes(attribute_whitelist)
.tags(tag_whitelist)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it make sense to keep this around rather than build it anew on every invocation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to hash the arguments and such and that's out of my skill set presently.

dmsrc/sanitize.dm Outdated Show resolved Hide resolved
@Kapu1178 Kapu1178 requested a review from ZeWaka May 13, 2024 19:34
Copy link
Contributor

@ZeWaka ZeWaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks about right.

@optimumtact
Copy link
Member

looks about right :+2:

@Kapu1178
Copy link
Contributor Author

mods? mergies? @ZeWaka

@ZeWaka ZeWaka merged commit 6ef3516 into tgstation:master May 30, 2024
2 checks passed
@Kapu1178 Kapu1178 deleted the bbcode branch May 30, 2024 01:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants