RegEx's are so complicated...

November 14, 2010

I'm working on a scraper for web pages that basically curls a page and checks the source for a title, description, favicon and its images. Sounds simple enough, but I've spent maybe 30-35 hours on it so far. I even had a head start, as I had some old crap-code. But in rewriting it, I found a few good tutes on regexes. Here's how I capture a favicon, plus a couple of good links.

Grabbing the favicon from some (X)HTML source would seem simple, but it takes a bit more than you'd think. Here's the finished code first of all:

/**
 * _parseFavicon function.
 *
 * @access private
 * @final
 * @return string
 */
final private function _parseFavicon()
{
    // generate default
    $parsed = parse_url($this->_url);
    $default = ($parsed['scheme']) . '://' . ($parsed['host']) . '/favicon.ico';

    // get the page links (icon attribute value leading);
    // [^-] skips values such as apple-touch-icon
    preg_match_all('/<link.+[^-]icon.+href=[\'"](.+)[\'"]/imU',
    $this->_response, $links);
    if (empty($links[1])) {
        // get the page links (icon attribute value trailing)
        preg_match_all('/<link.+href=[\'"](.+)[\'"].+[^-]icon/imU',
        $this->_response, $links);
        if (empty($links[1])) {
            return $default;
        }
    }

    // resolve full path
    $favicon = array_pop($links[1]);
    $favicon = trim($favicon);
    $favicon = $this->_resolveFullPath($favicon, $this->getBase());
    // strip both CR and LF; PHP_EOL only covers the server's own line ending
    $favicon = str_replace(array("\r", "\n"), '', $favicon);
    return $favicon;
}
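
If you want to poke at the two patterns outside the class, here's a minimal standalone sketch. The function name findIconHrefs and the sample markup are made up for illustration; the patterns are the same two as above, with the quotes escaped so they fit inside a single-quoted PHP string.

```php
<?php
// Hypothetical helper, not part of the scraper class: returns every href
// captured from <link> tags whose markup mentions "icon".
function findIconHrefs(string $html): array
{
    // Pass 1: icon attribute value comes before href
    preg_match_all('/<link.+[^-]icon.+href=[\'"](.+)[\'"]/imU', $html, $links);
    if (empty($links[1])) {
        // Pass 2: href comes before the icon attribute value
        preg_match_all('/<link.+href=[\'"](.+)[\'"].+[^-]icon/imU', $html, $links);
    }
    return array_map('trim', $links[1]);
}

// Made-up sample tags covering both attribute orders
$leading  = '<link rel="shortcut icon" href="/img/fav.png">';
$trailing = "<link href='../fav.gif' rel='icon'>";

echo implode(', ', findIconHrefs($leading)), PHP_EOL;   // /img/fav.png
echo implode(', ', findIconHrefs($trailing)), PHP_EOL;  // ../fav.gif
```

Note that there's no s modifier, so . won't cross a newline: a <link> tag split over several lines slips past both patterns and you end up with the default favicon.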

As a quick walkthrough, here's what I'm doing:

1. Generate and store the default favicon path for a url
2. Run a regex that checks for a <link> tag where an attribute like rel="icon" comes before the href
3. If none found, do the same but search for the rel="icon" attribute coming after the href
4. If none found, return the default favicon
5. Grab the last favicon found by whichever search matched
6. Trim any whitespace from it
7. Resolve the full path to the favicon (in case it was referenced in the style: href="../../fav.gif")
8. Strip any newlines from the path
9. Return it
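
The resolving step leans on _resolveFullPath(), which isn't shown in this post. As a rough idea of what that step has to deal with, here's a sketch of a stand-in. The name resolveFullPath and everything inside it are my assumptions for illustration, not the scraper's actual code, and it assumes the base is an absolute http(s) URL with a scheme and host.

```php
<?php
// Hypothetical stand-in for _resolveFullPath(); the real method is not shown.
// Assumes $base is an absolute URL that has both a scheme and a host.
function resolveFullPath(string $href, string $base): string
{
    // Already absolute (has a scheme): nothing to do
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: //cdn.example.com/fav.ico
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    // Root-relative: /img/fav.png
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Path-relative: start from the directory part of the base path
    $basePath = isset($parts['path']) ? $parts['path'] : '/';
    $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);

    $segments = array();
    foreach (explode('/', $dir . $href) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;               // skip empty and current-dir segments
        }
        if ($segment === '..') {
            array_pop($segments);   // go up one level, never above the root
            continue;
        }
        $segments[] = $segment;
    }
    return $origin . '/' . implode('/', $segments);
}

echo resolveFullPath('../../fav.gif', 'http://example.com/a/b/page.html'), PHP_EOL;
// http://example.com/fav.gif
```

It's only a sketch: it drops trailing slashes and doesn't treat query strings specially, which is fine for a favicon path but not for general URL resolution.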

Two sweet advanced RegEx tutorials that I used elsewhere in my scraper are as follows: