Oliver Nassar

Strip HTML Comments which contain Tags

February 21, 2013

I was running into an issue while performing an HTML diff that was causing quite a headache. The specific problem aside, it became necessary to strip only those HTML comments which contain an HTML tag. For example:

<!-- <a href="#">Oliver</a>-->

There could be quite a few variations of this, so I wanted to be sure the expression I wrote was liberal enough. This is what I came up with:

<!--.*<[a-z]+\s.*-->

In PHP, this is run as follows:

return preg_replace('/<!--.*<[a-z]+\s.*-->/Ui', '', $markup);

The logic is as follows:

  1. Start and end with the comment delimiters <!-- and -->, respectively
  2. Match 0 or more characters directly after the <!--, which could be numbers, letters, etc. (note that . only matches newlines if the s modifier is also set)
  3. Look for a < followed by a string representing the tag type (eg. <a, <strong, <script), requiring at least 1 character
  4. Look for a whitespace character (eg. a space or newline)
  5. Allow for more characters after the whitespace character. While in most real-world cases this will be followed by a number of characters, it is flexible enough to catch both of the following cases:

  6. The U and i flags prevent the expression from being greedy, and enforce case-insensitivity, respectively.

While the latter may not be what you're looking for, it helps for me to be rid of anything that could be interpreted as a tag.
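The steps above can be sketched as a small function. This is only a minimal illustration: the function name strip_tag_comments and the sample strings are made up for the example, while the pattern and flags are the ones described above.

```php
<?php
// Hypothetical wrapper name; the pattern and flags are the ones from the post.
function strip_tag_comments(string $markup): string
{
    // U: quantifiers are lazy by default; i: tag names match case-insensitively.
    return preg_replace('/<!--.*<[a-z]+\s.*-->/Ui', '', $markup);
}

// A comment wrapping a tag is removed; the surrounding markup is kept.
$stripped = strip_tag_comments('<p>Hello</p><!-- <a href="#">Oliver</a>--><p>World</p>');

// Uppercase tag names are caught as well, thanks to the i flag.
$upper = strip_tag_comments('<!-- <A HREF="#">Oliver</A> -->');

// A comment with nothing tag-like inside it is left alone.
$note = strip_tag_comments('<!-- just a note -->');

var_dump($stripped, $upper, $note);
```

The first call should leave only the two paragraphs, the second an empty string, and the third the original comment.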

I had to be careful when writing this, as HTML documents often contain <style> blocks whose contents are wrapped in HTML comments. I wanted to be sure not to strip those. They look like:

<style type="text/css">
<!--
    body {
        background-color: red;
    }
-->
</style>

Test Cases

The following cases would not be caught, and I'll go over why:

<style type="text/css">
<!--
    body {
        background-color: red;
        background-image: url('http://google.com/?img<and&something-here');
    }
-->
</style>

<!--
    Page Documentation
    @author Oliver Nassar <onassar@gmail.com>
-->

In both cases, while the early conditions are met, there is no whitespace character directly after the tag name (which in these cases would be "and" and "onassar").
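A quick way to check this reasoning is to run the expression against both cases. The sketch below collapses each comment onto a single line, which is an assumption made here to isolate the whitespace rule (in the multi-line originals, . would not cross the newlines anyway unless the s modifier were added). The variable names are illustrative only.

```php
<?php
$pattern = '/<!--.*<[a-z]+\s.*-->/Ui';

// Single-line stand-ins for the two cases above (collapsed for the example).
$css = "<!-- body { background-image: url('http://google.com/?img<and&something-here'); } -->";
$doc = '<!-- @author Oliver Nassar <onassar@gmail.com> -->';

// "<and" is followed by "&" and "<onassar" by "@", so \s never matches
// and preg_replace() hands back each string untouched.
$cssUnchanged = preg_replace($pattern, '', $css) === $css;
$docUnchanged = preg_replace($pattern, '', $doc) === $doc;

var_dump($cssUnchanged, $docUnchanged);
```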

Outliers

While I'm sure there will be pieces that get stripped without intention (regular expressions are great for doing what you know you want to do, but challenging when you want to prevent what you don't know that you want to prevent), I'm okay with being a little overly greedy when stripping HTML comments.
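As a rough illustration of the kind of unintended stripping this refers to, here is one hypothetical input (made up for this example, not taken from the original diff problem) where two harmless comments sit on the same line as a real link.

```php
<?php
$pattern = '/<!--.*<[a-z]+\s.*-->/Ui';

// Hypothetical markup: two tag-free comments around a live link, all on one line.
$markup = '<p>Intro</p><!-- nav --><a href="/">Home</a><!-- /nav --><p>Outro</p>';

// The match starts at the first <!--, runs through "<a " and ends at the
// first --> after it, so the link is removed along with both comments.
$result = preg_replace($pattern, '', $markup);

var_dump($result);
```

Here only the two paragraphs survive. Whether that trade-off is acceptable depends on the markup being processed.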