Recursive sanitizer/filter to manipulate live WHATWG DOMs rather than HTML, for the browser and Node.js.
Direct DOM manipulation has gotten a bad reputation in the last decade of web development. From Ruby on Rails to React, the DOM was seen as something to gloriously destroy and re-render from the server or even from the browser. Never mind that the browser had already expended a lot of effort parsing HTML and constructing this tree! Mind-numbingly complex regular-expression tests and manipulations of HTML strings had to deal with low-level details of the HTML syntax to insert, delete and change elements, sometimes on every keystroke! In contrast, DOM functions like createElement, remove and insertBefore were largely unknown and unused, except perhaps in jQuery.
Processing of HTML is destructive: The original DOM is destroyed and garbage collected after some delay. Attached event handlers are detached and garbage collected. A completely new DOM is created by parsing the new HTML set via .innerHTML =. Event listeners then have to be re-attached from userland (this is not an issue when using on* HTML attributes, but that approach has disadvantages as well).
It doesn't have to be this way. Do not eliminate, but manipulate!
sanitize-dom crawls a DOM subtree (beginning from a given node, all the way down to its descendant leaves) and filters and manipulates it non-destructively. This is very efficient: The browser doesn't have to re-render everything; it only re-renders what has been changed (sound familiar from React?).
The benefits of direct DOM manipulation:
- Nodes stay alive.
- References to nodes (e.g. stored in a Map or WeakMap) stay alive.
- Already attached event handlers stay alive.
- The browser doesn't have to re-render entire sections of a page; thus no flickering, no scroll jumping, no big CPU spikes.
- CPU cycles for repeatedly parsing and dumping of HTML are eliminated.
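The point about references can be illustrated with a small sketch. Plain objects stand in for DOM nodes here (an assumption, so that the snippet runs without a DOM), and `metadata`, `rerender` and `mutateInPlace` are hypothetical names used only for this illustration; none of them are part of sanitize-dom.

```javascript
// Plain objects stand in for DOM nodes so this runs without a DOM (assumption).
const metadata = new WeakMap();

// "Destructive" update: builds a brand-new node, as re-parsing HTML would.
function rerender(node, text) {
  return { nodeName: node.nodeName, textContent: text };
}

// "Non-destructive" update: changes the existing node in place.
function mutateInPlace(node, text) {
  node.textContent = text;
  return node;
}

const original = { nodeName: 'P', textContent: 'old' };
metadata.set(original, { clicks: 3 });

const replaced = rerender(original, 'new');
const mutated = mutateInPlace(original, 'new');

console.log(metadata.has(replaced)); // false: the new object is a different key
console.log(metadata.has(mutated));  // true: same object, same key
```

The same reasoning applies to event listeners: they belong to a node object, so they survive only as long as that object does.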
sanitize-dom's further advantages:
- No dependencies.
- Small footprint (only about 7 kB minimized).
- Faster than other HTML sanitizers because there is no HTML parsing and serialization.
Aside from the browser, sanitize-dom can also be used in Node.js by supplying a WHATWG DOM implementation like jsdom.
The test file describes additional usage patterns and features.
For the usage examples below, I'll use sanitizeHtml just to be able to illustrate the HTML output.
By default, all tags are 'flattened', i.e. only their inner text is kept:
sanitizeHtml(document, '<div><p>abc <b>def</b></p></div>');
"abc def"
Selective joining of same-tag siblings:
// Joins the two I tags.
sanitizeHtml(document, '<i>Hello</i> <i>world!</i> <em>Goodbye</em> <em>world!</em>', {
allow_tags_deep: { '.*': '.*' },
join_siblings: ['I'],
});
"<i>Hello world!</i> <em>Goodbye</em> <em>world!</em>"
Removal of redundant nested nodes (ubiquitous when using a WYSIWYG contenteditable editor):
sanitizeHtml(document, '<i><i>H<i></i>ello</i> <i>world! <i>Good<i>bye</i></i> world!</i>', {
allow_tags_deep: { '.*': '.*' },
flatten_tags_deep: { i: 'i' },
});
"<i>Hello world! Goodbye world!</i>"
Remove redundant empty tags:
sanitizeHtml(document, 'H<i></i>ello world!', {
allow_tags_deep: { '.*': '.*' },
remove_empty: true,
});
"Hello world!"
By default, all classes and attributes are removed:
// Keep all nodes, but remove all of their attributes and classes:
sanitizeHtml(document, '<div><p>abc <b class="green" data-type="test">def</b></p></div>', {
allow_tags_deep: { '.*': '.*' },
});
"<div><p>abc <b>def</b></p></div>"
Keep all nodes and all their attributes and classes:
sanitizeHtml(document, '<div><p class="red green">abc <b class="green" data-type="test">def</b></p></div>', {
allow_tags_deep: { '.*': '.*' },
allow_attributes_by_tag: { '.*': '.*' },
allow_classes_by_tag: { '.*': '.*' },
});
'<div><p class="red green">abc <b class="green" data-type="test">def</b></p></div>'
White-listing of classes and attributes:
// Keep only data- attributes and 'green' classes
sanitizeHtml(document, '<div><p class="red green">abc <b class="green" data-type="test">def</b></p></div>', {
allow_tags_deep: { '.*': '.*' },
allow_attributes_by_tag: { '.*': 'data-.*' },
allow_classes_by_tag: { '.*': 'green' },
});
'<div><p class="green">abc <b class="green" data-type="test">def</b></p></div>'
White-listing of node tags to keep:
// Keep only B tags anywhere in the document.
sanitizeHtml(document, '<i>abc</i> <b>def</b> <em>ghi</em>', {
allow_tags_deep: { '.*': '^b$' },
});
"abc <b>def</b> ghi"
// Keep only DIV children of BODY and I children of DIV.
sanitizeHtml(document, '<div> <i>abc</i> <em>def</em></div> <i>ghi</i>', {
allow_tags_direct: {
body: 'div',
div: '^i',
},
});
"<div> <i>abc</i> def</div> ghi"
Selective flattening of nodes:
// Flatten only EM children of DIV.
sanitizeHtml(document, '<div> <i>abc</i> <em>def</em></div> <i>ghi</i>', {
allow_tags_deep: { '.*': '.*' },
flatten_tags_direct: {
div: 'em',
},
});
"<div> <i>abc</i> def</div> <i>ghi</i>"
// Flatten I tags anywhere in the document.
sanitizeHtml(document, '<div> <i>abc</i> <em>def</em></div> <i>ghi</i>', {
allow_tags_deep: { '.*': '.*' },
flatten_tags_deep: {
'.*': '^i',
},
});
"<div> abc <em>def</em></div> ghi"
Selective removal of tags:
// Remove I children of DIVs.
sanitizeHtml(document, '<div> <i>abc</i> <em>def</em></div> <i>ghi</i>', {
allow_tags_deep: { '.*': '.*' },
remove_tags_direct: {
'div': 'i',
},
});
"<div> <em>def</em></div> <i>ghi</i>"
Sometimes there is more than one way to accomplish the same result, as shown in this advanced example:
// Keep all tags except B, anywhere in the document. Two different solutions:
sanitizeHtml(document, '<div> <i>abc</i> <b>def</b> <em>ghi</em> </div>', {
allow_tags_deep: { '.*': '.*' },
flatten_tags_deep: { '.*': 'B' },
});
"<div> <i>abc</i> def <em>ghi</em> </div>"
sanitizeHtml(document, '<div> <i>abc</i> <b>def</b> <em>ghi</em> </div>', {
allow_tags_deep: { '.*': '^((?!b).)*$' }
});
"<div> <i>abc</i> def <em>ghi</em> </div>"
And finally, filter functions allow ultimate flexibility:
// change B node to EM node with contextual inner text; attach an event listener.
sanitizeHtml(document, '<p>abc <i><b>def</b> <b>ghi</b></i></p>', {
allow_tags_direct: {
'.*': '.*',
},
filters_by_tag: {
B: [
function changesToEm(node, { parentNodes, parentNodenames, siblingIndex }) {
const em = document.createElement('em');
const text = `${parentNodenames.join(', ')} - ${siblingIndex}`;
em.innerHTML = text;
em.addEventListener('click', () => alert(text));
return em;
},
],
},
});
// In a browser, the EM tags would be clickable and an alert box would pop up.
"<p>abc <i><em>I, P, BODY - 0</em> <em>I, P, BODY - 2</em></i></p>"
Run in Node.js:
npm test
For the browser, run:
cd sanitize-dom
npm i -g jspm@2.0.0-beta.7 http-server
jspm install @jspm/core@1.1.0
http-server
Then, in a browser which supports <script type="importmap"></script> (e.g. Google Chrome version >= 81), browse to http://127.0.0.1:8080/test
- sanitizeNode(doc, node, [opts], [nodePropertyMap])
Simple wrapper for sanitizeDom. Processes the node and its childNodes recursively.
- sanitizeChildNodes(doc, node, [opts], [nodePropertyMap])
Simple wrapper for sanitizeDom. Processes only the node's childNodes recursively, but not the node itself.
- sanitizeHtml(doc, html, [opts], [isDocument], [nodePropertyMap]) ⇒ String
Simple wrapper for sanitizeDom. Instead of a DomNode, it takes an HTML string.
- sanitizeDom(doc, contextNode, [opts], [childrenOnly], [nodePropertyMap])
This function is not exported: Please use the wrapper functions instead:
sanitizeHtml, sanitizeNode, and sanitizeChildNodes.
- DomDocument : Object
Implements the WHATWG DOM Document interface.
In the browser, this is window.document. In Node.js, this may for example be new JSDOM().window.document.
- DomNode : Object
Implements the WHATWG DOM Node interface.
Custom properties for each node can be stored in a WeakMap passed as option nodePropertyMap to one of the sanitize functions.
- Tagname : string
Node tag name.
Even though in the WHATWG DOM text nodes (nodeType 3) have a tag name #text, these are referred to by the simpler string 'TEXT' for convenience.
- Regex : string
A string which is compiled to a case-insensitive regular expression new RegExp(regex, 'i'). The regular expression is used to match a Tagname.
- ParentChildSpec : Object.<Regex, Array.<Regex>>
Property names are matched against a (direct or ancestral) parent node's Tagname. Associated values are matched against the current node's Tagname.
- TagAttributeNameSpec : Object.<Regex, Array.<Regex>>
Property names are matched against the current node's Tagname. Associated values are used to match its attribute names.
- TagClassNameSpec : Object.<Regex, Array.<Regex>>
Property names are matched against the current node's Tagname. Associated values are used to match its class names.
- FilterSpec : Object.<Regex, Array.<filter>>
Property names are matched against node Tagnames. Associated values are the filters which are run on the node.
- filter ⇒ DomNode | Array.<DomNode> | null
sanitizeNode(doc, node, [opts], [nodePropertyMap])
Simple wrapper for sanitizeDom. Processes the node and its childNodes recursively.
Kind: global function
Param | Type | Default | Description |
---|---|---|---|
doc | DomDocument | | |
node | DomNode | | |
[opts] | Object | {} | |
[nodePropertyMap] | WeakMap.<DomNode, Object> | new WeakMap() | Additional node properties |
sanitizeChildNodes(doc, node, [opts], [nodePropertyMap])
Simple wrapper for sanitizeDom. Processes only the node's childNodes recursively, but not the node itself.
Kind: global function
Param | Type | Default | Description |
---|---|---|---|
doc | DomDocument | | |
node | DomNode | | |
[opts] | Object | {} | |
[nodePropertyMap] | WeakMap.<DomNode, Object> | new WeakMap() | Additional node properties |
sanitizeHtml(doc, html, [opts], [isDocument], [nodePropertyMap]) ⇒ String
Simple wrapper for sanitizeDom. Instead of a DomNode, it takes an HTML string.
Kind: global function
Returns: String - The processed HTML
Param | Type | Default | Description |
---|---|---|---|
doc | DomDocument | | |
html | string | | |
[opts] | Object | {} | |
[isDocument] | Boolean | false | Set this to true if you are passing an entire HTML document (beginning with the <html> tag). The context node name will be HTML. If false, then the context node name will be BODY. |
[nodePropertyMap] | WeakMap.<DomNode, Object> | new WeakMap() | Additional node properties |
sanitizeDom(doc, contextNode, [opts], [childrenOnly], [nodePropertyMap])
This function is not exported: Please use the wrapper functions instead: sanitizeHtml, sanitizeNode, and sanitizeChildNodes.
Recursively processes a tree with node at the root.
In all descriptions, the term "flatten" means that a node is replaced with the node's childNodes. For example, if the B node in <i>abc<b>def<u>ghi</u></b></i> is flattened, the result is <i>abcdef<u>ghi</u></i>.
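To make the meaning of "flatten" concrete, here is a minimal sketch that reproduces the example above. It is not the library's implementation: it uses a hypothetical plain-object tree instead of a DOM (so it runs anywhere), and it builds a new tree rather than manipulating nodes in place. The names `tree`, `flatten` and `serialize` are assumptions for illustration only.

```javascript
// Stand-in tree (assumption): elements have a tag and children, text nodes are strings.
const tree = {
  tag: 'i',
  children: ['abc', { tag: 'b', children: ['def', { tag: 'u', children: ['ghi'] }] }],
};

// Replace every descendant element whose tag equals `tagToFlatten` with its own children.
function flatten(node, tagToFlatten) {
  if (typeof node === 'string') return node;
  const children = node.children.flatMap((child) => {
    const processed = flatten(child, tagToFlatten);
    const isMatch = typeof processed !== 'string' && processed.tag === tagToFlatten;
    return isMatch ? processed.children : [processed];
  });
  return { tag: node.tag, children };
}

// Serialize the stand-in tree to an HTML-like string for display.
function serialize(node) {
  if (typeof node === 'string') return node;
  return `<${node.tag}>${node.children.map(serialize).join('')}</${node.tag}>`;
}

console.log(serialize(flatten(tree, 'b'))); // <i>abcdef<u>ghi</u></i>
```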
Each node is processed in the following sequence:
- Filters matching the opts.filters_by_tag spec are called. If the filter returns null, the node is removed and processing stops (see filters).
- If the opts.remove_tags_* spec matches, the node is removed and processing stops.
- If the opts.flatten_tags_* spec matches, the node is flattened and processing stops.
- If the opts.allow_tags_* spec matches:
  - All attributes not matching opts.allow_attributes_by_tag are removed.
  - All class names not matching opts.allow_classes_by_tag are removed.
  - The node is kept and processing stops.
- Otherwise, the node is flattened.
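The order of these checks can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the library's code: it models only the `*_direct` specs against a single parent name, leaves out filters, the `*_deep` specs and attribute/class handling, and the helper names `specMatches` and `decide` are hypothetical.

```javascript
// Hypothetical helper: does `spec` (parent regex -> child regex or regexes) match?
function specMatches(spec, parentName, tagName) {
  return Object.entries(spec || {}).some(([parentRe, childRes]) =>
    new RegExp(parentRe, 'i').test(parentName) &&
    [].concat(childRes).some((re) => new RegExp(re, 'i').test(tagName)));
}

// Returns 'remove', 'flatten' or 'keep', checking the specs in the documented order.
function decide(opts, parentName, tagName) {
  if (specMatches(opts.remove_tags_direct, parentName, tagName)) return 'remove';
  if (specMatches(opts.flatten_tags_direct, parentName, tagName)) return 'flatten';
  if (specMatches(opts.allow_tags_direct, parentName, tagName)) return 'keep';
  return 'flatten'; // nothing matched: the node is flattened
}

const opts = {
  remove_tags_direct: { DIV: 'SCRIPT' },
  flatten_tags_direct: { DIV: 'EM' },
  allow_tags_direct: { '.*': '.*' },
};

console.log(decide(opts, 'DIV', 'SCRIPT')); // remove
console.log(decide(opts, 'DIV', 'EM'));     // flatten
console.log(decide(opts, 'DIV', 'I'));      // keep
console.log(decide({}, 'DIV', 'I'));        // flatten
```

Because removal is checked before flattening and flattening before allowing, a catch-all `allow_tags_*` entry does not override a more specific remove or flatten rule.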
Kind: global function
Param | Type | Default | Description |
---|---|---|---|
doc | DomDocument | | The document |
contextNode | DomNode | | The root node |
[opts] | Object | {} | Options for processing. |
[opts.filters_by_tag] | FilterSpec | {} | Matching filters are called with the node. |
[opts.remove_tags_direct] | ParentChildSpec | {} | Matching nodes which are a direct child of the matching parent node are removed. |
[opts.remove_tags_deep] | ParentChildSpec | {'.*': ['style','script','textarea','noscript']} | Matching nodes which are anywhere below the matching parent node are removed. |
[opts.flatten_tags_direct] | ParentChildSpec | {} | Matching nodes which are a direct child of the matching parent node are flattened. |
[opts.flatten_tags_deep] | ParentChildSpec | {} | Matching nodes which are anywhere below the matching parent node are flattened. |
[opts.allow_tags_direct] | ParentChildSpec | {} | Matching nodes which are a direct child of the matching parent node are kept. |
[opts.allow_tags_deep] | ParentChildSpec | {} | Matching nodes which are anywhere below the matching parent node are kept. |
[opts.allow_attributes_by_tag] | TagAttributeNameSpec | {} | Matching attribute names of a matching node are kept. Other attributes are removed. |
[opts.allow_classes_by_tag] | TagClassNameSpec | {} | Matching class names of a matching node are kept. Other class names are removed. If no class names are remaining, the class attribute is removed. |
[opts.remove_empty] | boolean | false | Remove nodes which are completely empty |
[opts.join_siblings] | Array.<Tagname> | [] | Join same-tag sibling nodes of given tag names, unless they are separated by non-whitespace textNodes. |
[childrenOnly] | Bool | false | If false, then the node itself and its descendants are processed recursively. If true, then only the children and their descendants are processed recursively, but not the node itself (use when node is BODY or DocumentFragment). |
[nodePropertyMap] | WeakMap.<DomNode, Object> | new WeakMap() | Additional properties for a DomNode can be stored in an object and will be looked up in this map. The properties of the object and their meaning: skip: If truthy, disables all processing for this node. skip_filters: If truthy, disables all filters for this node. skip_classes: If truthy, disables processing classes of this node. skip_attributes: If truthy, disables processing attributes of this node. See tests for usage details. |
DomDocument : Object
Implements the WHATWG DOM Document interface.
In the browser, this is window.document. In Node.js, this may for example be new JSDOM().window.document.
Kind: global typedef
See: https://dom.spec.whatwg.org/#interface-document
DomNode : Object
Implements the WHATWG DOM Node interface.
Custom properties for each node can be stored in a WeakMap passed as option nodePropertyMap to one of the sanitize functions.
Kind: global typedef
See: https://dom.spec.whatwg.org/#interface-node
Tagname : string
Node tag name.
Even though in the WHATWG DOM text nodes (nodeType 3) have a tag name #text, these are referred to by the simpler string 'TEXT' for convenience.
Kind: global typedef
Example
'DIV'
'H1'
'TEXT'
Regex : string
A string which is compiled to a case-insensitive regular expression new RegExp(regex, 'i'). The regular expression is used to match a Tagname.
Kind: global typedef
Example
'.*' // matches any tag
'DIV' // matches DIV
'(DIV|H[1-3])' // matches DIV, H1, H2 and H3
'P' // matches P and SPAN
'^P$' // matches P but not SPAN
'TEXT' // matches text nodes (nodeType 3)
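The difference between anchored and unanchored patterns in the examples above follows directly from how `RegExp.prototype.test` works. A minimal sketch, assuming the compilation step is exactly `new RegExp(regex, 'i')` as described; the helper name `matchesTagname` is hypothetical.

```javascript
// Compile a Regex string as the typedef describes (case-insensitive) and test a tag name.
function matchesTagname(regex, tagname) {
  return new RegExp(regex, 'i').test(tagname);
}

console.log(matchesTagname('P', 'SPAN'));          // true: unanchored, SPAN contains a P
console.log(matchesTagname('^P$', 'SPAN'));        // false: anchored to the whole name
console.log(matchesTagname('(DIV|H[1-3])', 'h2')); // true: matching ignores case
console.log(matchesTagname('(DIV|H[1-3])', 'H4')); // false
```

When a spec should apply to exactly one tag, anchoring the pattern with `^` and `$` avoids accidental matches on longer tag names.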
ParentChildSpec : Object.<Regex, Array.<Regex>>
Property names are matched against a (direct or ancestral) parent node's Tagname. Associated values are matched against the current node's Tagname.
Kind: global typedef
Example
{
'(DIV|SPAN)': ['H[1-3]', 'B'], // matches H1, H2, H3 and B within DIV or SPAN
'STRONG': ['.*'] // matches all tags within STRONG
}
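The distinction between "direct" and "ancestral" parents can be sketched as follows. This is an illustration under assumptions, not the library's matcher: the function `matchesParentChildSpec` is hypothetical, ancestors are assumed to be listed nearest first (as in the `I, P, BODY` output of the filter example), and values may be a single string or an array because the usage examples pass plain strings.

```javascript
// `parentNodenames` lists ancestors from nearest to farthest, e.g. ['I', 'P', 'BODY'].
function matchesParentChildSpec(spec, parentNodenames, tagname, deep) {
  // "direct" considers only the nearest parent; "deep" considers every ancestor.
  const candidates = deep ? parentNodenames : parentNodenames.slice(0, 1);
  return Object.entries(spec).some(([parentRegex, childRegexes]) => {
    const parentOk = candidates.some((p) => new RegExp(parentRegex, 'i').test(p));
    // Accept a single string as well as an array of strings.
    const childOk = [].concat(childRegexes).some((r) => new RegExp(r, 'i').test(tagname));
    return parentOk && childOk;
  });
}

const spec = { '(DIV|SPAN)': ['H[1-3]', 'B'], STRONG: ['.*'] };

// A B node whose parent is P and whose grandparent is DIV:
console.log(matchesParentChildSpec(spec, ['P', 'DIV', 'BODY'], 'B', false)); // false
console.log(matchesParentChildSpec(spec, ['P', 'DIV', 'BODY'], 'B', true));  // true
// Any tag directly inside STRONG:
console.log(matchesParentChildSpec(spec, ['STRONG', 'BODY'], 'EM', false));  // true
```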
TagAttributeNameSpec : Object.<Regex, Array.<Regex>>
Property names are matched against the current node's Tagname. Associated values are used to match its attribute names.
Kind: global typedef
Example
{
'H[1-3]': ['id', 'class'], // matches 'id' and 'class' attributes of all H1, H2 and H3 nodes
'STRONG': ['data-.*'] // matches all 'data-.*' attributes of STRONG nodes.
}
TagClassNameSpec : Object.<Regex, Array.<Regex>>
Property names are matched against the current node's Tagname. Associated values are used to match its class names.
Kind: global typedef
Example
{
'DIV|SPAN': ['blue', 'red'] // matches 'blue' and 'red' class names of all DIV and SPAN nodes
}
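A rough sketch of how such a spec could be applied to a node's class list. The helper `filterClassNames` is hypothetical, and it assumes class-name patterns are compiled the same way as tag-name patterns (unanchored and case-insensitive); the source only states that behavior for Tagname matching, so treat this as an approximation.

```javascript
// Keep only the class names allowed for this tag by a TagClassNameSpec-like object.
function filterClassNames(spec, tagname, classNames) {
  // Collect the class patterns of every entry whose tag pattern matches.
  const allowed = Object.entries(spec)
    .filter(([tagRegex]) => new RegExp(tagRegex, 'i').test(tagname))
    .flatMap(([, classRegexes]) => [].concat(classRegexes));
  return classNames.filter((name) => allowed.some((r) => new RegExp(r, 'i').test(name)));
}

const classSpec = { 'DIV|SPAN': ['blue', 'red'] };

console.log(filterClassNames(classSpec, 'DIV', ['blue', 'green'])); // [ 'blue' ]
console.log(filterClassNames(classSpec, 'P', ['blue', 'red']));     // []
```

An attribute-name spec could be applied in the same way, with attribute names in place of class names.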
FilterSpec : Object.<Regex, Array.<filter>>
Property names are matched against node Tagnames. Associated values are the filters which are run on the node.
filter ⇒ DomNode | Array.<DomNode> | null
Filter functions can either...
- return the same node (the first argument),
- return a single, or an Array of, newly created DomNode(s), in which case node is replaced with the new node(s),
- return null, in which case node is removed.
Note that newly generated DomNode(s) are processed by running sanitizeDom on them, as if they had been part of the original tree. This has the following implication:
If a filter returns a newly generated DomNode with the same Tagname as node, it would cause the same filter to be called again, which may lead to an infinite loop if the filter always returns the same result (this would be a badly behaved filter). To protect against infinite loops, the author of the filter must acknowledge this circumstance by setting a boolean property called 'skip_filters' for the DomNode (in a WeakMap which the caller must provide to one of the sanitize functions as the argument nodePropertyMap). If 'skip_filters' is not set, an error is thrown. With well-behaved filters it is possible to continue subsequent processing of the returned node without causing an infinite loop.
Kind: global typedef
Param | Type | Description |
---|---|---|
node | DomNode | Currently processed node |
opts | Object | |
opts.parents | Array.<DomNode> | The parent nodes of node. |
opts.parentNodenames | Array.<Tagname> | The tag names of the parent nodes |
opts.siblingIndex | Integer | The number of the current node amongst its siblings |
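The skip_filters requirement described above can be sketched like this. To keep the snippet runnable without a DOM, `fakeDocument` is a stand-in for a DomDocument and plain objects stand in for nodes; both are assumptions, and `uppercaseB` is a hypothetical filter written only for this illustration.

```javascript
// Caller-provided map, to be passed to a sanitize function as nodePropertyMap.
const nodePropertyMap = new WeakMap();

// Stand-in for a DomDocument so the sketch runs without a DOM (assumption).
const fakeDocument = {
  createElement: (tag) => ({ nodeName: tag.toUpperCase(), textContent: '' }),
};

// A filter for B nodes that returns a NEW B node. Because the new node has the same
// Tagname, it is marked with skip_filters so the filter is not run on it again.
function uppercaseB(node) {
  const replacement = fakeDocument.createElement('b');
  replacement.textContent = node.textContent.toUpperCase();
  nodePropertyMap.set(replacement, { skip_filters: true });
  return replacement;
}

const input = { nodeName: 'B', textContent: 'def' };
const output = uppercaseB(input);

console.log(output.textContent);                       // DEF
console.log(nodePropertyMap.get(output).skip_filters); // true
```

With a real document, such a filter would presumably be registered under `filters_by_tag` (for example `{ B: [uppercaseB] }`) and the same `nodePropertyMap` passed as the last argument of `sanitizeNode` or `sanitizeChildNodes`, following the signatures listed earlier.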