I was having trouble crawling (a.k.a. scraping) content from Craigslist.
At first I was routing my requests through PHP Curler and thought the problem might be the headers I was passing. But alas, it seems to be unrelated to how the request is formed. And what's left? My IP address.
I tried running the following:
Run from my Ubuntu VM on my OS X machine, that query goes through without a problem. The file is saved.
Running it from my AWS EC2 instance? I get 403'd with the response:
Connecting to berlin.en.craigslist.de (berlin.en.craigslist.de)|188.8.131.52|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-03-12 18:12:24 ERROR 403: Forbidden.
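For anyone who wants to repeat the comparison, here is a minimal Python sketch (not the command I ran) that fetches a URL and reports the HTTP status, so the same script can be run on a local machine and on an EC2 box. The function names `fetch_status` and `describe`, the `Mozilla/5.0` user agent, and the timeout are my own illustrative choices.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent="Mozilla/5.0", timeout=10):
    """Return the HTTP status code for a GET request to url."""
    # The user agent string is an assumption; swap in whatever headers you normally send.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still what we want to compare.
        return err.code


def describe(status):
    """Turn a status code into a short verdict for side-by-side comparison."""
    if status == 403:
        return "403 Forbidden: request refused, possibly because of the source IP"
    if 200 <= status < 300:
        return f"{status}: request went through"
    return f"{status}: unexpected status, inspect manually"
```

Calling `print(describe(fetch_status("http://berlin.en.craigslist.de/")))` on each machine (the bare host is taken from the output above; the exact path is an assumption) gives one line per host. If the headers are identical and only one host gets the 403, the request itself is probably not the problem.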
I'm guessing Craigslist blocks a range of IP addresses (probably with good reason), and my EC2 instance falls inside it. A heads up to anyone out there hoping to curl Craigslist from an EC2 box.