tl;dr: In PHP, make sure you remove the id attribute from a DOM element before you try to use the same id on another DOM element, even if you have already removed the first element from the document. This is not necessary in browser DOM.
I stumbled across something really bizarre in PHP’s document object model, although I shouldn’t be too surprised since the PHP DOM differs in ways both subtle and significant from browser DOM. Alain Schlesser (@schlessera) has done a lot of work on Hacking the DOM Object Hierarchy and he has documented many gotchas for which we’ve implemented workarounds. I have encountered yet another gotcha where PHP DOM behaves differently than browser DOM.
In this scenario I have an element with an id which I need to replace with another element that has the same id. For example, let’s say I want to do a performance optimization to replace an animated GIF img with a video (which is 93% smaller as MP4 for the nodding guy meme on Giphy). So I want to take this:
<img
id="nod"
src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.gif"
>
And replace it with this:
<video
id="nod"
src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.mp4"
autoplay
loop
muted
playsinline
>
</video>
This seems straightforward enough to do in PHP:
// The DOMDocument constructor takes a version and encoding, not markup,
// so the HTML has to be loaded separately.
$dom = new DOMDocument();
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$img_query = $xpath->query( '
    //img[
        starts-with(
            @src,
            "https://media.giphy.com/media/"
        )
        and
        contains( @src, ".gif" )
    ]
' );
foreach ( $img_query as $img ) {
    $video = $dom->createElement( 'video' );

    // Copy all attributes from img to video.
    foreach ( $img->attributes as $attr ) {
        $video->setAttribute( $attr->name, $attr->value );
    }

    // Add necessary video attributes.
    $boolean_attrs = [ 'autoplay', 'loop', 'muted', 'playsinline' ];
    foreach ( $boolean_attrs as $boolean_attr ) {
        $video->setAttributeNode( $dom->createAttribute( $boolean_attr ) );
    }

    // Replace gif with mp4 in src.
    $video->setAttribute(
        'src',
        preg_replace(
            ':/giphy(-.+?)?\.gif:',
            '/giphy.mp4',
            $img->getAttribute( 'src' )
        )
    );

    // Finally, swap out the img with the video.
    $img->parentNode->replaceChild( $video, $img );
}
That appears to work just fine.
But let’s say I also have this CSS on the page to give the img dimensions:
img#nod {
aspect-ratio: 1 / 1;
width: 100%;
}
The img selector is a problem since the element is now a video. To fix this, let’s just try inlining any style rule with an ID selector as an inline style attribute on the target element. (Or I could replace img with video in the selector, but this will better illustrate the problem with ID handling in PHP DOM.) So after this optimization, the video should then look like this:
<video
id="nod"
src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.mp4"
autoplay
loop
muted
playsinline
style="aspect-ratio: 1/1; width: 100%;"
>
</video>
To do this processing, I’ll use sabberworm/php-css-parser as follows:
use Sabberworm\CSS\Parser;
use Sabberworm\CSS\RuleSet\DeclarationBlock;

// Process each style element in the document.
foreach ( $xpath->query( '//style' ) as $style ) {
    $parser = new Parser( $style->textContent );
    $parsed = $parser->parse();

    // Iterate over all style rules.
    foreach ( $parsed->getAllDeclarationBlocks() as $declaration_block ) {
        // Extract the IDs from the selectors.
        $ids = [];
        foreach ( $declaration_block->getSelectors() as $selector ) {
            // This ID extractor is admittedly VERY rudimentary.
            if ( preg_match( '/#([a-z0-9_-]+)/', $selector, $matches ) ) {
                $ids[] = $matches[1];
            } else {
                // If we couldn't parse an ID out, then skip the whole block.
                continue 2;
            }
        }

        // Now inline the declaration blocks as style attributes.
        foreach ( $ids as $id ) {
            $element = $dom->getElementById( $id );
            if ( ! $element ) {
                continue;
            }
            $styles = [];
            foreach ( $declaration_block->getRules() as $rule ) {
                $styles[] = (string) $rule;
            }
            if ( $element->hasAttribute( 'style' ) ) {
                $styles[] = $element->getAttribute( 'style' );
            }
            $element->setAttribute( 'style', implode( '', $styles ) );
        }
        $parsed->remove( $declaration_block );
    }

    // Update the stylesheet to remove the inlined style rules.
    $css_text = $parsed->render();
    if ( ! $css_text ) {
        $style->parentNode->removeChild( $style );
    } else {
        $style->textContent = $css_text;
    }
}
When I run this, however, it does not work as expected. The style rule does indeed get removed from the stylesheet, but if I look at the video element, there is no style attribute:
<video
id="nod"
src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.mp4"
autoplay
loop
muted
playsinline
>
</video>
Where did it go? An element was returned by $dom->getElementById( $id ) and yet setting the style attribute didn’t seem to stick. But was that element the video? Let’s check:
$dom->saveHTML( $dom->getElementById( 'nod' ) )
This returns:
<img
id="nod"
src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.gif"
style="aspect-ratio: 1/1;width: 100%;"
>
What!? It’s the img! But I replaced the img with a video! And yet here it is, and it’s this element that got the inline style attribute, not the video element as I intended. Somehow DOMDocument::getElementById() is holding onto references to ID’ed elements that have been removed from the document. This is not the case, however, when using XPath to query for elements with a given ID:
$dom->saveHTML( $xpath->query( '//*[ @id = "nod" ]' )->item( 0 ) )
This returns the video as expected, and not an img.
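Since the XPath query only sees elements that are actually in the tree, one option is to wrap it in a small lookup helper and use that instead of getElementById(). The following is a minimal sketch, not something from the PHP DOM API: the function names get_attached_element_by_id() and xpath_string_literal() are hypothetical, and the markup is a cut-down stand-in for the page above.

```php
<?php
/**
 * Quote a string as an XPath 1.0 literal. XPath 1.0 has no escape
 * sequences, so a value containing both quote types needs concat().
 */
function xpath_string_literal( string $value ): string {
	if ( false === strpos( $value, '"' ) ) {
		return '"' . $value . '"';
	}
	if ( false === strpos( $value, "'" ) ) {
		return "'" . $value . "'";
	}
	// Split on double quotes and re-join the pieces with a quoted '"'.
	return 'concat("' . implode( '", \'"\', "', explode( '"', $value ) ) . '")';
}

/**
 * Hypothetical helper: find the first element with the given id that is
 * still in the document tree, using XPath instead of getElementById().
 */
function get_attached_element_by_id( DOMXPath $xpath, string $id ): ?DOMElement {
	$nodes = $xpath->query( '//*[ @id = ' . xpath_string_literal( $id ) . ' ]' );
	if ( false === $nodes ) {
		return null;
	}
	$node = $nodes->item( 0 );
	return $node instanceof DOMElement ? $node : null;
}

$dom = new DOMDocument();
$dom->loadHTML( '<body><img id="nod" src="giphy.gif"></body>' );
$xpath = new DOMXPath( $dom );

// Replace the img with a video that reuses the same id, without removing
// the id from the img first (the problematic case).
$img   = $dom->getElementById( 'nod' );
$video = $dom->createElement( 'video' );
$video->setAttribute( 'id', 'nod' );
$img->parentNode->replaceChild( $video, $img );

$found = get_attached_element_by_id( $xpath, 'nod' );
echo $found ? $found->tagName : 'null', "\n"; // video
```

The trade-off is that an XPath scan walks the tree on every call, so this is a correctness workaround rather than a fast path.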
So how can this issue be worked around? It turns out to be simple, albeit annoying: make sure you remove the ID from the element being removed before you assign the same ID to the element being added. The addition is the block that captures and removes the ID right after the video element is created:
// The DOMDocument constructor takes a version and encoding, not markup,
// so the HTML has to be loaded separately.
$dom = new DOMDocument();
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );
$img_query = $xpath->query( '
    //img[
        starts-with(
            @src,
            "https://media.giphy.com/media/"
        )
        and
        contains( @src, ".gif" )
    ]
' );
foreach ( $img_query as $img ) {
    $video = $dom->createElement( 'video' );

    // Capture and remove ID so it can be added to replacement.
    $id = $img->getAttribute( 'id' );
    if ( $id ) {
        $img->removeAttribute( 'id' );
        $video->setAttribute( 'id', $id );
    }

    // Copy all attributes from img to video.
    foreach ( $img->attributes as $attr ) {
        $video->setAttribute( $attr->name, $attr->value );
    }

    // Add necessary video attributes.
    $boolean_attrs = [ 'autoplay', 'loop', 'muted', 'playsinline' ];
    foreach ( $boolean_attrs as $boolean_attr ) {
        $video->setAttributeNode( $dom->createAttribute( $boolean_attr ) );
    }

    // Replace gif with mp4 in src.
    $video->setAttribute(
        'src',
        preg_replace(
            ':/giphy(-.+?)?\.gif:',
            '/giphy.mp4',
            $img->getAttribute( 'src' )
        )
    );

    // Finally, swap out the img with the video.
    $img->parentNode->replaceChild( $video, $img );
}
Alternatively, if you want to work with an attribute node instead, make sure you pass that same attribute node to setAttributeNode on the replacement element, like so:
$id_attr = $img->getAttributeNode( 'id' );
if ( $id_attr instanceof DOMAttr ) {
$img->removeAttributeNode( $id_attr );
$video->setAttributeNode( $id_attr );
}
But beware of working with attribute nodes, because I’ve also found that if you update an attribute by setting its nodeValue property, it can fail if the string contains ampersands. For some reason PHP DOM attempts to parse HTML entities in strings assigned to DOMAttr::$nodeValue, but it doesn’t for strings supplied via DOMElement::setAttribute().
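To illustrate the safer of the two paths, here is a minimal sketch that sets a value containing an ampersand via setAttribute() and reads it back unchanged. The markup and the example.com URL are made up for the demonstration.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML( '<body><a id="link" href="#">link</a></body>' );
$link = $dom->getElementById( 'link' );

// A query string with a raw ampersand, which is where nodeValue gets into trouble.
$url = 'https://example.com/?a=1&b=2';

// setAttribute() treats the string as a literal value, not as markup.
$link->setAttribute( 'href', $url );

$stored = $link->getAttribute( 'href' );
echo $stored, "\n"; // https://example.com/?a=1&b=2
```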
If you obtained an attribute node but then use setAttribute to set the attribute on a new element, it will fail if you leave a lingering reference to the attribute node. You have to unset the variable referencing the attribute node:
$id_attr = $img->getAttributeNode( 'id' );
if ( $id_attr instanceof DOMAttr ) {
$img->removeAttributeNode( $id_attr );
$video->setAttribute( $id_attr->name, $id_attr->value );
unset( $id_attr ); // This is required for the ID to be transferred!
}
This shows what I believe is going on under the hood: PHP DOM appears to use a reference count to determine whether an ID is unique or not. If it is not unique, then it ignores any new duplicate IDs being added. So make sure that either the id attribute is removed from the original element, or that any lingering reference to the removed node is unset. Either of these will work:
$dom = new DOMDocument();
$dom->loadHTML( '<body><span id="test"></span></body>' );
$body = $dom->getElementsByTagName( 'body' )->item( 0 );
$initial = $dom->getElementById( 'test' );
$body->removeChild( $initial );

switch ( $argc > 1 ? $argv[1] : null ) {
    case 'remove':
        $initial->removeAttribute( 'id' );
        break;
    case 'unset':
        unset( $initial );
        break;
}

$added = $dom->createElement( 'div' );
$added->setAttribute( 'id', 'test' );
$body->appendChild( $added );

if ( 'div' === $dom->getElementById( 'test' )->tagName ) {
    echo 'PASS';
} else {
    echo 'FAIL';
}
This also shows it’s not just an issue with replaceChild as in my initial scenario.
Update: Alain found this reported as PHP Bug #77686: Removed elements are still returned by getElementById.
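The "remove the id first" workaround can be folded into a reusable function so it isn’t forgotten at each call site. This is only a sketch of that idea: replace_element_transferring_id() is a hypothetical name of my own, and it assumes the replacement should always inherit the original’s id.

```php
<?php
/**
 * Hypothetical helper: replace $old with $new, moving the id attribute
 * across BEFORE the swap so the stale element no longer claims the ID.
 */
function replace_element_transferring_id( DOMElement $old, DOMElement $new ): void {
	if ( $old->hasAttribute( 'id' ) ) {
		$id = $old->getAttribute( 'id' );
		$old->removeAttribute( 'id' );
		$new->setAttribute( 'id', $id );
	}
	if ( $old->parentNode ) {
		$old->parentNode->replaceChild( $new, $old );
	}
}

$dom = new DOMDocument();
$dom->loadHTML( '<body><span id="test"></span></body>' );

$span = $dom->getElementById( 'test' );
$div  = $dom->createElement( 'div' );
replace_element_transferring_id( $span, $div );

$found = $dom->getElementById( 'test' );
echo $found ? $found->tagName : 'null', "\n"; // div
```

Note that $span is still referenced after the swap, which is exactly the situation where leaving the id on the old element would cause trouble.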
Initial Duplicate IDs
There is another issue with IDs in PHP DOM. If you have multiple elements with the same ID, and you remove the first one, the second one will never get returned by getElementById. The first will be returned even after it is removed from the document. It will only cease to be returned when the id attribute is removed. After this, getElementById will then just return null instead of returning the element with the duplicate ID. In the following example, I’d expect it to print div and div but PHP prints span and null:
$dom = new DOMDocument();
$dom->loadHTML(
    '<span id="test"></span>
    <div id="test"></div>'
);

$span = $dom->getElementById( 'test' );
$span->parentNode->removeChild( $span );

$test = $dom->getElementById( 'test' );
printf( "%s\n", $test ? $test->tagName : 'null' );
// => span, but expected div

$test->removeAttribute( 'id' );
$test = $dom->getElementById( 'test' );
printf( "%s\n", $test ? $test->tagName : 'null' );
// => null, but expected div
Compare that with browser DOM, where it does output div and div as expected:
<span id="test"></span>
<div id="test"></div>
<script>
const span = document.getElementById( 'test' );
span.parentNode.removeChild( span );
test = document.getElementById( 'test' );
console.info( test ? test.tagName : 'null' );
// => div
span.removeAttribute( 'id' );
test = document.getElementById( 'test' );
console.info( test ? test.tagName : 'null' );
// => div
</script>
It’s with good reason that PHP outputs the following warning when parsing a document with duplicate IDs:
Warning: DOMDocument::loadHTML(): ID test already defined in Entity
Update: Alain also found this reported as PHP Bug #79701: getElementById does not correctly work with duplicate definitions.
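If you can’t guarantee that incoming markup has unique IDs, it may be worth checking up front before relying on getElementById(). Here is a rough sketch that counts id attribute values with XPath. find_duplicate_ids() is a hypothetical helper, and the p element with a unique id is only there to show that non-duplicates are ignored.

```php
<?php
/**
 * Hypothetical helper: return the id values used by more than one element
 * in the document tree.
 *
 * @return string[]
 */
function find_duplicate_ids( DOMDocument $dom ): array {
	$xpath  = new DOMXPath( $dom );
	$counts = [];
	foreach ( $xpath->query( '//*[ @id ]' ) as $element ) {
		$id            = $element->getAttribute( 'id' );
		$counts[ $id ] = ( $counts[ $id ] ?? 0 ) + 1;
	}

	$duplicates = [];
	foreach ( $counts as $id => $count ) {
		if ( $count > 1 ) {
			// Numeric-looking ids become integer array keys, so cast back.
			$duplicates[] = (string) $id;
		}
	}
	return $duplicates;
}

// Collect libxml's duplicate-ID warning internally instead of emitting it.
$previous = libxml_use_internal_errors( true );
$dom      = new DOMDocument();
$dom->loadHTML( '<span id="test"></span><div id="test"></div><p id="unique"></p>' );
libxml_clear_errors();
libxml_use_internal_errors( $previous );

$duplicates = find_duplicate_ids( $dom );
print_r( $duplicates ); // A one-item array containing "test".
```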
Element Queried By ID Before Being Added to Document
One more example to show the difference between PHP DOM and browser DOM. In browser DOM, getElementById will only return an element that is located inside the DOM tree. In PHP DOM, however, the element can be retrieved by ID even before it has been added to the document. Consider this example:
$dom = new DOMDocument();
$dom->loadHTML('<html></html>');
$div = $dom->createElement( 'div' );
$div->setAttribute( 'id', 'foo' );
print_r( $dom->getElementById( 'foo' ) );
// => DOMElement Object
Compare this with browser DOM where getElementById returns null in this scenario:
const div = document.createElement('div');
div.setAttribute('id', 'foo');
console.info( document.getElementById('foo') );
// => null
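To get closer to the browser behavior in PHP, you can check whether whatever getElementById() returned is actually attached to the tree by walking up its ancestors. A minimal sketch, where is_attached_to_document() is a hypothetical helper name and not part of PHP DOM:

```php
<?php
/**
 * Hypothetical helper: walk up the parentNode chain and report whether it
 * ends at the DOMDocument, i.e. whether the node is part of the tree.
 */
function is_attached_to_document( DOMNode $node ): bool {
	for ( $current = $node; null !== $current; $current = $current->parentNode ) {
		if ( $current instanceof DOMDocument ) {
			return true;
		}
	}
	return false;
}

$dom = new DOMDocument();
$dom->loadHTML( '<html><body></body></html>' );

$div = $dom->createElement( 'div' );
$div->setAttribute( 'id', 'foo' );

// Created but not yet inserted: its parentNode is null.
$before = is_attached_to_document( $div );
var_dump( $before ); // bool(false)

$dom->getElementsByTagName( 'body' )->item( 0 )->appendChild( $div );

// Now the chain is div > body > html > document.
$after = is_attached_to_document( $div );
var_dump( $after ); // bool(true)
```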
In conclusion, the ubiquity of PHP-based CMSes (in particular WordPress) makes working with PHP DOM essential. Just beware that it has somewhat of an id-entity crisis.