Oliver Nassar

Regexes are so complicated...

November 14, 2010

I'm working on a scraper for web pages that basically curls a page and checks the source for a title, description, favicon and its images. Sounds simple enough, but I've spent maybe 30-35 hours on it so far. I even had a head start, as I had some old crap-code. But in rewriting it, I found a few good tutorials on regexes. Here's how I capture a favicon, along with a couple of good links.

Grabbing the favicon from some x/html source would seem simple, but here's the finished code first of all:

/**
 * _parseFavicon function.
 *
 * @access private
 * @return string
 */
private function _parseFavicon()
{
    // generate default
    $parsed = parse_url($this->_url);
    $default = ($parsed['scheme']) . '://' . ($parsed['host']) . '/favicon.ico';

    // get the page links (icon attribute value leading)
    preg_match_all('/<link.+[^-]icon.+href=[\'"](.+)[\'"]/imU',
        $this->_response, $links);
    if (empty($links[1])) {
        // get the page links (icon attribute value trailing)
        preg_match_all('/<link.+href=[\'"](.+)[\'"].+[^-]icon/imU',
            $this->_response, $links);
        if (empty($links[1])) {
            return $default;
        }
    }

    // resolve full path
    $favicon = array_pop($links[1]);
    $favicon = trim($favicon);
    $favicon = $this->_resolveFullPath($favicon, $this->getBase());
    $favicon = str_replace(PHP_EOL, '', $favicon);
    return $favicon;
}

As a quick walkthrough, here's what I'm doing:

  1. Generate and store the default favicon path for a url
  2. Run a regex that checks for a <link> tag where an attribute like rel="icon" comes before the href (leading)
  3. If none found, do the same but with the rel="icon" attribute coming after the href (trailing)
  4. If none found, return the default favicon
  5. Grab the last favicon matched by whichever of the previous searches succeeded
  6. Trim any whitespace from it
  7. Resolve the full path to the favicon (in case it was referenced in the style: href="../../fav.gif")
  8. Strip any newlines from the path
  9. Return it
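
To poke at the two patterns outside the class, here's a rough standalone sketch of the steps above. find_favicon() and resolve_full_path() are made-up names for illustration, and resolve_full_path() is only a guess at what _resolveFullPath() does (the real helper isn't shown here). It assumes an http(s) page url and ignores ports, query strings and <base> tags.

```php
<?php
// Hypothetical stand-in for _resolveFullPath(); simplified on purpose.
function resolve_full_path(string $href, string $base): string
{
    // already absolute
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];

    // protocol-relative, e.g. //cdn.example.com/fav.ico
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    // root-relative, e.g. /img/fav.ico
    if (strpos($href, '/') === 0) {
        return $root . $href;
    }

    // relative: start in the page's directory and walk the segments
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    $stack = array();
    foreach (explode('/', $dir . $href) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;
        }
        if ($segment === '..') {
            array_pop($stack);
        } else {
            $stack[] = $segment;
        }
    }
    return $root . '/' . implode('/', $stack);
}

// Steps 1-9 from the walkthrough, minus the class plumbing.
function find_favicon(string $html, string $pageUrl): string
{
    // 1. default favicon path
    $parts = parse_url($pageUrl);
    $default = $parts['scheme'] . '://' . $parts['host'] . '/favicon.ico';

    // 2. icon attribute before href
    preg_match_all('/<link.+[^-]icon.+href=[\'"](.+)[\'"]/imU', $html, $links);
    if (empty($links[1])) {
        // 3. icon attribute after href
        preg_match_all('/<link.+href=[\'"](.+)[\'"].+[^-]icon/imU', $html, $links);
        if (empty($links[1])) {
            // 4. nothing found
            return $default;
        }
    }

    // 5-9. last match, trimmed, resolved, newlines stripped
    $favicon = trim(array_pop($links[1]));
    $favicon = resolve_full_path($favicon, $pageUrl);
    return str_replace(PHP_EOL, '', $favicon);
}

$page = 'http://example.com/a/b/page.html';
echo find_favicon('<link rel="shortcut icon" href="../../fav.gif" />', $page), PHP_EOL;
echo find_favicon('<link href="/img/fav.png" rel="icon">', $page), PHP_EOL;
echo find_favicon('<title>no icon here</title>', $page), PHP_EOL;
```

The first call goes through the leading pattern, the second falls back to the trailing one, and the third returns the default /favicon.ico path.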

Two sweet advanced RegEx tutorials that I used elsewhere in my scraper are as follows: