Oliver Nassar

I can be reached at onassar@gmail.com.

For my open source work, check out github.com/onassar


Multibyte error with character set encoding


I ran into a curious bug yesterday while trying to crawl an Amazon page. The error, specifically:

[Sat Dec 01 01:54:33 2012] [error] [client 10.211.55.2] htmlentities(): Invalid multibyte sequence in argument in ...

Here's the flow I was running that presented the bug:

That's when the bug popped up.
It was confusing, because I'd crawled other pages without incident.

Here's what I figured out. The Amazon page's encoding was set to ISO-8859-1. That alone shouldn't have caused an encoding error, but it's possible that some characters were misencoded (is that a word?) along the way, and those characters were then breaking my call to htmlentities.
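One plausible way bytes like these end up "invalid" can be sketched as follows. This assumes htmlentities was treating the input as UTF-8 (the original call isn't shown, so that is a guess) and that the mbstring extension is available. The sample string is made up for illustration.

```php
<?php
// "caf\xE9" is "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "caf\xE9";

// In UTF-8, 0xE9 announces a three-byte sequence, but nothing follows it,
// so the string is not valid UTF-8.
$isValidUtf8 = mb_check_encoding($latin1, 'UTF-8');

var_dump($isValidUtf8); // bool(false)
```

A check like this is a cheap way to confirm that the input, rather than htmlentities itself, is the problem.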

A way around this is to convert the string from ISO-8859-1 to UTF-8 using PHP's iconv function, and then run htmlentities on the converted string.
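A minimal sketch of that conversion, not the exact code from my library: the function name encodeFromLatin1 and the sample string are hypothetical, and the ENT_QUOTES flag is just a reasonable choice for illustration.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 input to UTF-8, then encode
// HTML entities while telling htmlentities the string is UTF-8.
function encodeFromLatin1($raw)
{
    // iconv returns false if the conversion fails
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);
    if ($utf8 === false) {
        return false;
    }
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8');
}

// "caf\xE9" is "café" in ISO-8859-1
echo encodeFromLatin1("caf\xE9 <b>"), PHP_EOL; // caf&eacute; &lt;b&gt;
```

The order matters: converting first means htmlentities only ever sees valid UTF-8.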

Success :)

While googling, I stumbled on the post PHP htmlspecialchars()/htmlentities() invalid multibyte/UTF-8 gotcha with display_errors=true. It describes a way to suppress the error, which suggested to me that the error was in fact legitimate.

Finally, the Stack Overflow post htmlentities, htmlspecialchars, and "invalid multibyte sequence" hinted at the conversion I needed to make.

I made the modifications in my short PHP-Security functions library. Long term, I may need to update the encode function in that library to accommodate multiple possible character encodings, but for now I'm okay with handling just ISO-8859-1.
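If I do get around to supporting more encodings, it might look something like the sketch below. The name encodeWithCharset and its signature are hypothetical, not what the library currently exposes, and it assumes the caller already knows the source charset (from a Content-Type header or meta tag, say).

```php
<?php
// Hypothetical variant of an encode function that accepts the source
// charset, converts to UTF-8 when needed, then encodes HTML entities.
function encodeWithCharset($raw, $sourceCharset = 'UTF-8')
{
    // Skip the conversion when the input is already declared as UTF-8
    if (strcasecmp($sourceCharset, 'UTF-8') !== 0) {
        $converted = iconv($sourceCharset, 'UTF-8', $raw);
        if ($converted === false) {
            return false; // unknown charset or unconvertible input
        }
        $raw = $converted;
    }
    return htmlentities($raw, ENT_QUOTES, 'UTF-8');
}

echo encodeWithCharset("caf\xE9", 'ISO-8859-1'), PHP_EOL; // caf&eacute;
echo encodeWithCharset("caf\xC3\xA9"), PHP_EOL;           // caf&eacute;
```

Passing the charset in explicitly avoids guessing, since detecting an encoding from the bytes alone is unreliable.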