Unexpected Handling of Element IDs in PHP DOM

tl;dr: In PHP, make sure you remove the id attribute from a DOM element before you try to use the same id on another DOM element, even if you already removed the first element from the document. This is not necessary in browser DOM.

I stumbled across something really bizarre in PHP’s document object model, although I shouldn’t be too surprised since the PHP DOM differs in ways both subtle and significant from browser DOM. Alain Schlesser (@schlessera) has done a lot of work on Hacking the DOM Object Hierarchy and he has documented many gotchas for which we’ve implemented workarounds. I have encountered yet another gotcha where PHP DOM behaves differently than browser DOM.

In my scenario, I have an element with an id which I need to replace with another element that has the same id. For example, let’s say I want to do a performance optimization to replace an animated GIF img with a video (which is 93% smaller as MP4 for the nodding guy meme on Giphy). So I want to take this:

<img 
	id="nod"
	src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.gif"
>

And replace it with this:

<video
	id="nod"
	src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.mp4"
	autoplay
	loop
	muted
	playsinline
>
</video>

This seems straightforward enough to do in PHP:

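// The DOMDocument constructor takes a version and encoding, not markup,
// so the HTML has to be loaded separately.
$dom = new DOMDocument();
$dom->loadHTML( $html );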
$xpath = new DOMXPath( $dom );
$img_query = $xpath->query( '
	//img[ 
		starts-with( 
			@src, 
			"https://media.giphy.com/media/" 
		) 
		and 
		contains( @src, ".gif" )
	]
' );
foreach ( $img_query as $img ) {
	$video = $dom->createElement( 'video' );

	// Copy all attributes from img to video.
	foreach ( $img->attributes as $attr ) {
		$video->setAttribute( $attr->name, $attr->value );
	}

	// Add necessary video attributes.
	$boolean_attrs = [ 'autoplay', 'loop', 'muted', 'playsinline' ];
	foreach ( $boolean_attrs as $boolean_attr ) {
		$video->setAttributeNode( $dom->createAttribute( $boolean_attr ) );
	}

	// Replace gif with mp4 in src.
	$video->setAttribute(
		'src',
		preg_replace(
			':/giphy(-.+?)?\.gif:',
			'/giphy.mp4',
			$img->getAttribute( 'src' )
		)
	);

	// Finally, swap out the img with the video.
	$img->parentNode->replaceChild( $video, $img );
}

That appears to work just fine.

But let’s say I also have this CSS on the page to give the img dimensions:

img#nod {
    aspect-ratio: 1 / 1;
    width: 100%;
}

The img selector is a problem since the element is now a video. To fix this, let’s try inlining any style rule that has an ID selector as a style attribute on the target element. (Or I could replace img with video in the selector, but this will better illustrate the problem with ID handling in PHP DOM.) So after this optimization, the video should then look like this:

<video
	id="nod"
	src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.mp4"
	autoplay
	loop
	muted
	playsinline
	style="aspect-ratio: 1/1; width: 100%;"
>
</video>

To do this processing, I’ll use sabberworm/php-css-parser as follows:

use Sabberworm\CSS\Parser;
use Sabberworm\CSS\RuleSet\DeclarationBlock;

// Process each style element in the document.
foreach ( $xpath->query( '//style' ) as $style ) {
	$parser = new Parser( $style->textContent );
	$parsed = $parser->parse();

	// Iterate over all style rules.
	foreach ( $parsed->getAllDeclarationBlocks() as $declaration_block ) {

		// Extract the IDs from the selectors.
		$ids = [];
		foreach ( $declaration_block->getSelectors() as $selector ) {
			// This ID extractor is admittedly VERY rudimentary.
			if ( preg_match( '/#([a-z0-9_-]+)/', $selector, $matches ) ) {
				$ids[] = $matches[1];
			} else {
				// If we couldn't parse an ID out, then skip the whole block.
				continue 2;
			}
		}

		// Now inline the declaration blocks as style attributes.
		foreach ( $ids as $id ) {
			$element = $dom->getElementById( $id );
			if ( ! $element ) {
				continue;
			}

			$styles = [];
			foreach ( $declaration_block->getRules() as $rule ) {
				$styles[] = (string) $rule;
			}
			if ( $element->hasAttribute( 'style' ) ) {
				$styles[] = $element->getAttribute( 'style' );
			}

			$element->setAttribute( 'style', implode( '', $styles ) );
		}

		$parsed->remove( $declaration_block );
	}

	// Update the stylesheet to remove the inlined style rules.
	$css_text = $parsed->render();
	if ( ! $css_text ) {
		$style->parentNode->removeChild( $style );
	} else {
		$style->textContent = $css_text;
	}
}

When I run this, however, it does not work as expected. The style rule does indeed get removed from the stylesheet, but if I look at the video element, there is no style attribute:

<video
	id="nod"
	src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.mp4"
	autoplay
	loop
	muted
	playsinline
>
</video>

Where did it go? An element was returned by $dom->getElementById( $id ) and yet setting the style attribute didn’t seem to stick. But was that element the video? Let’s check:

$dom->saveHTML( $dom->getElementById( 'nod' ) )

This returns:

<img 
	id="nod"
	src="https://media.giphy.com/media/xSM46ernAUN3y/giphy.gif"
	style="aspect-ratio: 1/1;width: 100%;"
>

What!? It’s the img! But I replaced the img with a video! And yet here it is, and it’s this element that got the inline style attribute, not the video element as I intended. Somehow DOMDocument::getElementById() is holding onto references to ID’ed elements that have been removed from the document. This is not the case, however, when using XPath to query for elements with a given ID:

$dom->saveHTML( $xpath->query( '//*[ @id = "nod" ]' )->item( 0 ) )

This returns the video as expected, and not an img.
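
Since the XPath query only sees nodes that are currently in the tree, one option is to wrap it in a small lookup helper and use that in place of getElementById. The following is only a minimal sketch: the function name get_attached_element_by_id is hypothetical (it is not part of PHP DOM), and the markup is a trimmed-down stand-in for the example above.

```php
<?php
/**
 * Look up an element by ID using XPath, so that only nodes currently
 * attached to the document tree are considered.
 *
 * Hypothetical helper; the name is not part of PHP DOM.
 */
function get_attached_element_by_id( DOMDocument $dom, string $id ): ?DOMElement {
	$xpath = new DOMXPath( $dom );

	// XPath 1.0 has no string escaping, so pick a quote style the ID does not contain.
	if ( false === strpos( $id, '"' ) ) {
		$literal = '"' . $id . '"';
	} elseif ( false === strpos( $id, "'" ) ) {
		$literal = "'" . $id . "'";
	} else {
		return null; // IDs containing both quote types are not handled in this sketch.
	}

	$node = $xpath->query( '//*[ @id = ' . $literal . ' ]' )->item( 0 );
	return $node instanceof DOMElement ? $node : null;
}

$dom = new DOMDocument();
$dom->loadHTML( '<body><img id="nod" src="giphy.gif"></body>' );

// Replace the img with a video that reuses the same ID, without removing the ID first.
$img   = $dom->getElementById( 'nod' );
$video = $dom->createElement( 'video' );
$video->setAttribute( 'id', 'nod' );
$img->parentNode->replaceChild( $video, $img );

echo get_attached_element_by_id( $dom, 'nod' )->tagName, "\n"; // video
```

The trade-off is that an XPath query walks the tree on every call, so this is likely slower than getElementById on large documents.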

So how can this issue be worked around? It turns out to be simple, albeit annoying: make sure you remove the ID from the element being removed before you assign the same ID to the element being added. The additions are the lines under the “Capture and remove ID” comment:

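// The DOMDocument constructor takes a version and encoding, not markup,
// so the HTML has to be loaded separately.
$dom = new DOMDocument();
$dom->loadHTML( $html );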
$xpath = new DOMXPath( $dom );
$img_query = $xpath->query( '
	//img[ 
		starts-with( 
			@src, 
			"https://media.giphy.com/media/" 
		) 
		and 
		contains( @src, ".gif" )
	]
' );
foreach ( $img_query as $img ) {
	$video = $dom->createElement( 'video' );

	// Capture and remove ID so it can be added to replacement.
	$id = $img->getAttribute( 'id' );
	if ( $id ) {
		$img->removeAttribute( 'id' );
		$video->setAttribute( 'id', $id );
	}

	// Copy all attributes from img to video.
	foreach ( $img->attributes as $attr ) {
		$video->setAttribute( $attr->name, $attr->value );
	}

	// Add necessary video attributes.
	$boolean_attrs = [ 'autoplay', 'loop', 'muted', 'playsinline' ];
	foreach ( $boolean_attrs as $boolean_attr ) {
		$video->setAttributeNode( $dom->createAttribute( $boolean_attr ) );
	}

	// Replace gif with mp4 in src.
	$video->setAttribute(
		'src',
		preg_replace(
			':/giphy(-.+?)?\.gif:',
			'/giphy.mp4',
			$img->getAttribute( 'src' )
		)
	);

	// Finally, swap out the img with the video.
	$img->parentNode->replaceChild( $video, $img );
}

Alternatively, if you want to work with an attribute node instead, make sure you pass that same attribute node to setAttributeNode on the replacement element, like so:

$id_attr = $img->getAttributeNode( 'id' );
if ( $id_attr instanceof DOMAttr ) {
	$img->removeAttributeNode( $id_attr );
	$video->setAttributeNode( $id_attr );
}

But beware of working with attribute nodes because I’ve also found that if you update an attribute by setting its nodeValue property then it can fail if the string contains ampersands. For some reason PHP DOM attempts to parse HTML entities for strings supplied when setting DOMAttr::$nodeValue but it doesn’t when supplying strings via DOMElement::setAttribute().
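
If you already hold an attribute node and need to change its value, one way to sidestep the nodeValue problem is to go back through the owner element and call setAttribute there. This is a minimal sketch under that assumption; set_attr_value_safely is a hypothetical helper name, and the span and title attribute are made up for illustration.

```php
<?php
/**
 * Update an attribute's value through its owner element rather than via
 * DOMAttr::$nodeValue. Hypothetical helper name.
 */
function set_attr_value_safely( DOMAttr $attr, string $value ): void {
	$element = $attr->ownerElement;
	if ( $element instanceof DOMElement ) {
		// setAttribute() treats the string as literal text.
		$element->setAttribute( $attr->name, $value );
	}
}

$dom  = new DOMDocument();
$span = $dom->createElement( 'span' );
$span->setAttribute( 'title', 'placeholder' );
$dom->appendChild( $span );

set_attr_value_safely( $span->getAttributeNode( 'title' ), 'Tom & Jerry' );

// The ampersand round-trips unchanged.
var_dump( 'Tom & Jerry' === $span->getAttribute( 'title' ) ); // bool(true)
```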

If you obtained an attribute node but then use setAttribute to set the attribute on a new element, it will fail if you leave a lingering reference to the attribute node. You have to unset the variable referencing the attribute node:

$id_attr = $img->getAttributeNode( 'id' );
if ( $id_attr instanceof DOMAttr ) {
	$img->removeAttributeNode( $id_attr );
	$video->setAttribute( $id_attr->name, $id_attr->value );
	unset( $id_attr ); // This is required for the ID to be transferred!
}

This shows what I believe is going on under the hood. PHP DOM appears to use a reference count to determine whether an ID is unique. If it is not unique, then any new duplicate ID being added is ignored. So either remove the id attribute from the original element, or unset every reference to the removed node. Either of these will work:

$dom = new DOMDocument();
$dom->loadHTML( '<body><span id="test"></span></body>' );
$body = $dom->getElementsByTagName( 'body' )->item( 0 );

$initial = $dom->getElementById( 'test' );
$body->removeChild( $initial );

switch ( $argc > 1 ? $argv[1] : null ) {
	case 'remove':
		$initial->removeAttribute( 'id' );
		break;
	case 'unset':
		unset( $initial );
		break;
}

$added = $dom->createElement( 'div' );
$added->setAttribute( 'id', 'test' );
$body->appendChild( $added );

if ( 'div' === $dom->getElementById( 'test' )->tagName ) {
	echo 'PASS';
} else {
	echo 'FAIL';
}

This also shows it’s not just an issue with replaceChild as in my initial scenario.

Update: Alain found this reported as PHP Bug #77686: Removed elements are still returned by getElementById.

Initial Duplicate IDs

There is another issue with IDs in PHP DOM. If you have multiple elements with the same ID, and you remove the first one, the second one will never get returned by getElementById. The first will be returned even after it is removed from the document. It will only cease to be returned when the id attribute is removed. After this, getElementById will then just return null instead of returning the element with the duplicate ID. In the following example, I’d expect it to print div and div but PHP prints span and null:

$dom = new DOMDocument();
$dom->loadHTML(
	'<span id="test"></span>
	<div id="test"></div>'
);

$span = $dom->getElementById( 'test' );
$span->parentNode->removeChild( $span );

$test = $dom->getElementById( 'test' );
printf( "%s\n", $test ? $test->tagName : 'null' );
// => span, but expected div

$test->removeAttribute( 'id' );
$test = $dom->getElementById( 'test' );
printf( "%s\n", $test ? $test->tagName : 'null' );
// => null, but expected div

Compare that with browser DOM, where it does output div and div as expected:

<span id="test"></span>
<div id="test"></div>

<script>
const span = document.getElementById( 'test' );
span.parentNode.removeChild( span );

let test = document.getElementById( 'test' );
console.info( test ? test.tagName : 'null' );
// => div

span.removeAttribute( 'id' );
test = document.getElementById( 'test' );
console.info( test ? test.tagName : 'null' );
// => div
</script>

It’s with good reason that PHP outputs the following warning when parsing a document with duplicate IDs:

Warning: DOMDocument::loadHTML(): ID test already defined in Entity

Update: Alain also found this reported as PHP Bug #79701: getElementById does not correctly work with duplicate definitions.
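
Given how unreliable getElementById becomes once duplicate IDs are involved, it may be worth detecting them up front. The sketch below counts id attributes with XPath, which only looks at the tree itself. The helper name find_duplicate_ids is hypothetical, and the extra p element is added purely to show that unique IDs are filtered out.

```php
<?php
/**
 * Return a map of ID => count for every ID used more than once in the tree.
 * Hypothetical helper name.
 */
function find_duplicate_ids( DOMDocument $dom ): array {
	$counts = [];
	$xpath  = new DOMXPath( $dom );
	foreach ( $xpath->query( '//*[ @id ]' ) as $element ) {
		$id            = $element->getAttribute( 'id' );
		$counts[ $id ] = ( $counts[ $id ] ?? 0 ) + 1;
	}

	// Keep only the IDs that occur more than once.
	return array_filter(
		$counts,
		static function ( $count ) {
			return $count > 1;
		}
	);
}

// Collect libxml warnings (including the duplicate ID warning) instead of emitting them.
$previous = libxml_use_internal_errors( true );
$dom      = new DOMDocument();
$dom->loadHTML( '<span id="test"></span><div id="test"></div><p id="other"></p>' );
libxml_clear_errors();
libxml_use_internal_errors( $previous );

print_r( find_duplicate_ids( $dom ) ); // Lists "test" with a count of 2.
```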

Element Queried By ID Before Being Added to Document

One more example to show the difference between PHP DOM and browser DOM. In browser DOM, getElementById will only return an element that is located inside the DOM tree. In PHP DOM, however, the element can be retrieved by ID even before it has been added to the document. Consider this example:

$dom = new DOMDocument();
$dom->loadHTML('<html></html>');
$div = $dom->createElement( 'div' );
$div->setAttribute( 'id', 'foo' );
print_r( $dom->getElementById( 'foo' ) );
// => DOMElement Object

Compare this with browser DOM where getElementById returns null in this scenario:

const div = document.createElement('div');
div.setAttribute('id', 'foo');
console.info( document.getElementById('foo') );
// => null
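
If code needs the browser-like guarantee that an element returned by ID is actually in the document, one approach is to check the element's ancestry before trusting it. This is a minimal sketch; is_attached_to_document is a hypothetical helper name, and it simply walks parentNode until it either reaches the DOMDocument or runs out of ancestors.

```php
<?php
/**
 * Determine whether a node is part of the document tree by walking up
 * its ancestors. Hypothetical helper name.
 */
function is_attached_to_document( DOMNode $node ): bool {
	for ( $current = $node; null !== $current; $current = $current->parentNode ) {
		if ( $current instanceof DOMDocument ) {
			return true;
		}
	}
	return false;
}

$dom = new DOMDocument();
$dom->loadHTML( '<html><body></body></html>' );
$div = $dom->createElement( 'div' );
$div->setAttribute( 'id', 'foo' );

var_dump( is_attached_to_document( $div ) ); // bool(false)

$dom->getElementsByTagName( 'body' )->item( 0 )->appendChild( $div );
var_dump( is_attached_to_document( $div ) ); // bool(true)
```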

In conclusion, the ubiquity of PHP-based CMSes (in particular WordPress) makes working with PHP DOM essential. Just beware that it has somewhat of an identity crisis.

Featured image taken from poster of Robert Redford nod of approval on YouTube.
