Oliver Nassar

HTMLDiff Software Projects and Repos (Open Source)

November 21, 2012

In my attempt (and hope) to follow through with my talks idea, I've come to the point in development where I need to analyze differences in text.

Specifically, I need to periodically diff many different blobs of text. I'm looking for what has been removed, what has been added, and what has been changed, ignoring markup changes.

For example, I'm interested in the title of a blog post changing, but not the attribute value of a link on the page.
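To make that requirement concrete, here is a minimal sketch in Python (not PHP, and not taken from any of the tools listed below). It assumes that "ignoring markup" means comparing only the text nodes of two documents, so a changed title shows up while a changed `href` does not. The names `TextExtractor`, `visible_text` and `text_changes` are made up for this illustration, and only the standard library is used.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring tags, attributes, scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise whitespace so reformatted markup does not register as a change.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the document's text nodes as a list of strings."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def text_changes(old_html, new_html):
    """Return (removed, added) text chunks between two HTML documents."""
    removed, added = [], []
    for line in difflib.ndiff(visible_text(old_html), visible_text(new_html)):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


old_page = '<h1>My First Post</h1><p>Read <a href="/about">more</a></p>'
new_page = '<h1>My Renamed Post</h1><p>Read <a href="/about-me">more</a></p>'
removed, added = text_changes(old_page, new_page)
print(removed)  # ['My First Post']
print(added)    # ['My Renamed Post']
```

The title change is reported and the link's attribute change is not. Note that this only yields removals and additions; a "changed" chunk appears as one removal plus one addition.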

In looking into all the different software out there, it seems that I'll be able to get 2 out of those 3 requirements: namely, the ability to figure out what's been removed and what's been added (line by line).

The following is a breakdown of the resources I stumbled on during my research process.

Websites

ChangeDetection.com

Web-based detection service which emails you when there are changes. It's a good proof-of-concept of what I'm looking for.

In actuality, I'm looking for granular control of the diff process, and need it integrated directly into my application, so I'm not able to use this, but it's good if you're looking to be emailed the differences in a web page as they happen.

W3C HTML Diff Service

Another web-based service, which attempts to find the diff between two provided HTML documents. There were a lot of false positives, but I tested it on more complicated examples (e.g. http://www.amazon.com/Restful-Web-Services-Leonard-Richardson/dp/0596529260), so it may do better on smaller documents.

Software

Text Diff

There are three examples available: Diff, Match and Patch.

These seem to work pretty well, but unfortunately they don't fit what I'm looking for. I need the ability to detect changes within markup (X/HTML) while disregarding the unimportant stuff (unimportant, for me, being structural changes in the markup).

C# HTML Diff

A full write-up can be found here, but in short, it's a C# port of a Ruby diff library, which I include below.

Although I'm using PHP, I would consider this as it seems designed to handle what I'm looking for.

htmldiff: Python Command Line Diff

A Python library that I haven't been able to test, so I'm not sure if its capabilities include detecting changes in markup. Seems to be a port of another library, which I couldn't find :(

Diff Colour Coordination

A cool website that quickly colour-coordinates a diff found online (I believe this could be useful for highlighting git/svn changes). Not what I'm looking for at the moment, though.

Ruby Diff

The library I mentioned above. A Ruby diff library, which seems to do a solid job of detecting insertions and deletions within a text source.
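For a feel of what detecting insertions and deletions within a text source can look like, here is a rough word-level sketch. It uses Python's standard `difflib` rather than the Ruby library, so it shows the general technique, not that library's algorithm or output format. The function name `mark_changes` and the choice of `<ins>`/`<del>` tags for marking are assumptions made for the example.

```python
import difflib
from html import escape


def mark_changes(old_text, new_text):
    """Wrap deleted words in <del> and inserted words in <ins>."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        # A "replace" is emitted as a deletion followed by an insertion.
        if op in ("delete", "replace"):
            out.append("<del>" + escape(" ".join(old_words[a1:a2])) + "</del>")
        if op in ("insert", "replace"):
            out.append("<ins>" + escape(" ".join(new_words[b1:b2])) + "</ins>")
        if op == "equal":
            out.append(escape(" ".join(old_words[a1:a2])))
    return " ".join(out)


marked = mark_changes("the quick brown fox", "the slow brown fox jumps")
print(marked)
# the <del>quick</del> <ins>slow</ins> brown fox <ins>jumps</ins>
```

Splitting on whitespace keeps the sketch short, but it also discards the original spacing and line breaks, which a real tool would want to preserve.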

Lisp HTML Differ (limited resources)

A Lisp HTML Diff tool, which doesn't seem to contain much information or examples, and was last worked on 3+ years ago. Not sure how effective it is.

JS/Node Diff Tool

A JavaScript diff tool which can be run on both the server (Node.js) and the client. I tested it, and it works pretty well. I may use this to show diffs without a server reload, loading in the entire document as plain text and then comparing it to a previously fetched text blob.

DaisyDiff

The source library for the VisualDiff tool below. It seems to be a Java library, and it ought to work the same way as VisualDiff, which is a PHP port of it.

VisualDiff (fork of PHP DaisyDiff)

A PHP library developed by a MediaWiki member, which now seems to be no longer used by MediaWiki. It was published online at http://gitorious.org/htmldiff.

lxml.html

A Python diff library which appears to be exceptionally well documented, along with many more advanced features (e.g. parsing email, testing doctypes, etc.)

C Library

I'm adding this one in now (December 3rd) as I stumbled on it, and it may be worth a try. While it doesn't have a website or documentation, it allows you to download the C files and comes with an installation guide.

I assume it'll be pretty speedy since it's a C library.

DIY

Originally, before I had discovered the myriad of resources available, I contemplated, perhaps foolishly (we'll never know), building my own engine. These are some resources which could be helpful in that kind of pursuit.

jQuery Compare

A jQuery library which allows you to compare nodes in a document and see if they are equal to others. It also marks whether they are missing from that document entirely, or whether they come before or after the node you're comparing them to.

Example on how to run a command-line diff tool from PHP

This link is more or less for myself, as it contains an example of how to run a shell script (eg. node, c++, python) from within PHP (since that's the environment I'm working in anyhow).
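The same idea, shelling out to an external diff tool and reading back its output, can be sketched in Python as follows. This is an illustration rather than the PHP example from the link. It assumes a Unix-style `diff` binary is on the PATH, and the helper names `run_diff_tool` and `changed_lines` are invented for the example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_diff_tool(old_text, new_text, command=("diff", "-u")):
    """Write both texts to temp files, run an external diff command, return its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        old_path = Path(tmp) / "old.txt"
        new_path = Path(tmp) / "new.txt"
        old_path.write_text(old_text, encoding="utf-8")
        new_path.write_text(new_text, encoding="utf-8")
        result = subprocess.run(
            [*command, str(old_path), str(new_path)],
            capture_output=True,
            text=True,
            check=False,  # diff exits with 1 when the files differ
        )
    # Assumption: like diff, the tool uses exit codes above 1 for real errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout


def changed_lines(unified_diff):
    """Keep only the +/- content lines of a unified diff, dropping file headers."""
    return [
        line
        for line in unified_diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Only run the demo when the assumed tool is actually installed.
if shutil.which("diff"):
    output = run_diff_tool("title: Hello\n", "title: Goodbye\n")
    print(changed_lines(output))
```

Passing the command as a list (instead of building one shell string) avoids quoting problems when file paths contain spaces.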

PHPQuery

A traversing library which allows you to use CSS3 selectors to traverse a document, instead of XPath. Seems pretty powerful. Kind of wish I'd discovered it during my MetaParser library development :)

DOM Searching

A post that provides helpful information on how to search for text nodes in a document in PHP.

DOM Parser

A PHP library which provides jQuery-inspired selectors to search through a document and access nodes.

DOM Text Node

A quick walkthrough on how to get the text for a node in PHP.

DOM PHP Documentation

Quick note on how to load an HTML document into PHP's DOMDocument library.

Other

MediaWiki Documentation of Diff Software

A page dedicated to covering some of the different HTML Diff software available online.

Conclusion

No conclusion just yet. It's possible that I'll use a non-PHP library and access it through shell_exec, and while I think the Node/JS one is sweet, I'm not sure it would get me to MVP fastest.

Hopefully this helps someone out there looking for software, and for me, when I get around to writing the code :)

Resources I haven't yet looked into fully

Python Diff Script
Perl Diff Script
Python Diff Script
http://code.google.com/p/html-diff/downloads/detail?name=2010_05_20_v1.0_binary%20for%20unix.zip&can=2&q=
Python Diff Script
Docs regarding lxml diff library
Example lxml Script