Oliver Nassar

I can be reached at onassar@gmail.com.

For my open source work, check out github.com/onassar


RegEx for extracting a <title> tag from a string


In writing my page parser, I've needed to extract the title tag from a page. You wouldn't think the regex is too complicated, but there were definitely a few twists and turns. I'll throw up the regex I'm using and walk through it (this is for me so later on when I'm looking at it, I'll understand wtf I was thinking).

preg_match('/<title[^>]*>([^<]+)<\/title>/im', $this->_response, $titles);

The first thing, obviously, is to match the opening title tag. Inside the tag, [^>]* leaves room for attributes and their values. I originally assumed attributes on a title tag weren't valid W3C markup, but HTML does allow a few (lang and dir, for example), and I've found them on various sites.
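To make that concrete, here's a minimal sketch that runs the pattern against a bare title tag and one carrying attributes. The sample markup and variable names ($plain, $withAttr and so on) are made up for illustration; they aren't from my parser.

```php
<?php
// Same pattern as above; [^>]* swallows anything between "<title" and ">".
$pattern = '/<title[^>]*>([^<]+)<\/title>/im';

// Hypothetical sample pages, for illustration only.
$plain    = '<html><head><title>Plain page</title></head></html>';
$withAttr = '<html><head><title id="page-title" lang="en">Page with attributes</title></head></html>';

preg_match($pattern, $plain, $plainMatch);
preg_match($pattern, $withAttr, $attrMatch);

echo $plainMatch[1], "\n"; // Plain page
echo $attrMatch[1], "\n";  // Page with attributes
```

One caveat with this approach: [^>]* is loose, so something like <titlebar> would also be accepted as an opening tag. That hasn't bitten me, but it's worth knowing.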

Then I begin capturing the title itself. Why don't I use (.*) instead of ([^<]+)? You'd expect (.*) to capture everything, but by default the dot does not match newlines. Pages often spit their title tags out over three lines: the first containing the opening tag, the second the title copy/string, and the third the closing tag. The expression I'm using, however, is a negated character class. It matches any character except <, and that includes the newline character :) The captured string keeps those newlines and any indentation, so it's worth trimming afterwards.
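Here's a small sketch of that difference, assuming a title spread over three lines. The $html string is invented sample input, and the trim() call is my own addition to tidy up the captured whitespace.

```php
<?php
// Hypothetical page fragment with the title spread over three lines.
$html = "<head>\n<title>\n  My page title\n</title>\n</head>";

// Dot version: without the s flag the dot stops at the first newline,
// so the pattern never reaches the closing tag and preg_match returns 0.
$dotMatched = preg_match('/<title[^>]*>(.*)<\/title>/i', $html, $dotMatch);

// Negated class: [^<] matches newlines too, so the whole title is captured.
$negMatched = preg_match('/<title[^>]*>([^<]+)<\/title>/i', $html, $negMatch);

// The raw capture includes the surrounding newlines and indentation.
$title = $negMatched ? trim($negMatch[1]) : null;

var_dump($dotMatched); // int(0)
echo $title, "\n";     // My page title
```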

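Following this I search for the closing title tag. Of the two flags I throw on the end, i is the one that ignores the case of the title tags. The m flag only changes how ^ and $ behave, and since the pattern uses neither, it has no effect here.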
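A quick sketch of what the flags do, using an upper-case tag as made-up sample input. The variable names are illustrative only.

```php
<?php
// Hypothetical markup using upper-case tags.
$html = '<HEAD><TITLE>Shouting page</TITLE></HEAD>';

// No flags: the lower-case "title" in the pattern does not match "TITLE".
$noFlag = preg_match('/<title[^>]*>([^<]+)<\/title>/', $html, $m0);

// i flag: case is ignored, so the tag matches.
$iFlag = preg_match('/<title[^>]*>([^<]+)<\/title>/i', $html, $m1);

// i and m together: same result, since m only affects ^ and $.
$imFlag = preg_match('/<title[^>]*>([^<]+)<\/title>/im', $html, $m2);

var_dump($noFlag);  // int(0)
echo $m1[1], "\n";  // Shouting page
echo $m2[1], "\n";  // Shouting page
```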

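So why didn't I just add the s flag, which makes the dot match everything, newlines included? Well, I tried that, and in some cases it failed. I never worked out exactly why.

My guess was that it was being applied to the first expression and then skipping the actual copy/string. That probably isn't it, though: s only changes what the dot matches, and [^>]* contains no dot. A more likely culprit is greediness. With s, (.*) runs all the way to the last </title> in the document instead of stopping at the first. But all in all, I think the version I'm using should capture the title nicely.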
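I can't say for sure what went wrong with the s flag on the pages I tested, but here's a sketch of one way it can misbehave: a page with a second title element further down (an inline SVG, say). The sample $html is invented, and the lazy (.*?) variant is an alternative I'm adding for comparison, not something from my parser.

```php
<?php
// Hypothetical page with a second <title> inside an inline SVG.
$html = "<head>\n<title>\nReal title\n</title>\n</head>\n"
      . "<body><svg><title>Icon label</title></svg></body>";

// Greedy dot with s: runs to the LAST </title>, dragging markup along with it.
preg_match('/<title[^>]*>(.*)<\/title>/is', $html, $greedy);

// Lazy dot with s: stops at the first </title>.
preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $lazy);

// Negated class, no s needed: cannot run past the next "<".
preg_match('/<title[^>]*>([^<]+)<\/title>/i', $html, $negated);

var_dump(strpos($greedy[1], 'Icon label') !== false); // bool(true)
echo trim($lazy[1]), "\n";    // Real title
echo trim($negated[1]), "\n"; // Real title
```

The negated class and the lazy dot agree here. The trade-off is that [^<]+ won't match an empty title or one that contains a stray <, while (.*?) with s will.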

Note

Ignore the space in the closing title tag. It's my crappy word-wrapper throwing a space in there. I'll be fixing that shortly and will follow up with how insanely difficult word-wrapping really is if you want to get it perfect.