Screen scraping with PHP

This is a quick post on some screen scraping techniques in PHP.  I don’t endorse scraping; the content I scraped, I scraped with permission because an API was not available.

Some of your options are the following.

There is a good thread over at Stack Overflow discussing this.

I started with simplehtmldom.  It works great for DOMs that aren’t very deep.  It absolutely dies on complex DOMs, though: running it in my local environment crashed my Sandy Bridge MBP.  There have also been complaints about memory leaks.

The next thing I discovered is that none of the above, except RegEx, can get at content that is embedded in JS, which a lot of my content was.  A DOM parser only sees the markup as served and never runs the scripts.  So I said screw it and used RegEx, which I have a love/hate relationship with.
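To make the problem concrete, here is a minimal sketch, assuming a made-up page where the price only exists inside a script block. The markup, the `price` id, and the `productData` variable are all hypothetical, and PHP's built-in DOMDocument stands in for a DOM parser. The DOM query finds an empty element, while a regular expression run against the raw source pulls the value out.

```php
<?php
// Hypothetical page: the price is filled in by JavaScript at runtime,
// so the static HTML only contains an empty placeholder element.
$html = <<<'HTML'
<html><body>
<div id="price"></div>
<script>
var productData = {"name": "Widget", "price": "19.99"};
document.getElementById("price").innerHTML = productData.price;
</script>
</body></html>
HTML;

// A DOM parser sees the placeholder exactly as served: empty.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="price"]')->item(0);
$domPrice = $node !== null ? trim($node->textContent) : '';

// A regex runs on the raw source, so it can reach into the script block.
$regexPrice = null;
if (preg_match('/"price"\s*:\s*"([^"]+)"/', $html, $m) === 1) {
    $regexPrice = $m[1];
}

var_dump($domPrice);   // empty string: the DOM has no price text
var_dump($regexPrice); // the value taken from the script block
```

The pattern here is tied to this one made-up page. A real page would need its own pattern, and that pattern will break whenever the site changes its script.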

My savior in this was the gskinner RegEx utility.  I flew through the different parsers I needed using it.  Remember to test your parser against negative control content, meaning pages where the thing you are looking for is absent, to make sure it returns nothing rather than a bunch of junk DOM elements.
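The negative-control advice can be sketched roughly like this. The `extractTitle` helper, the pattern, and both sample strings are hypothetical and are not the parsers used for this project. The idea is that the parser returns null when the target is missing, instead of handing back whatever happened to match.

```php
<?php
// Hypothetical parser: pull the text of <h2 class="title">...</h2>.
// Returns null when the element is absent instead of a junk match.
function extractTitle(string $html): ?string
{
    $pattern = '/<h2\s+class="title"\s*>(.*?)<\/h2>/is';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    $title = trim(strip_tags($m[1]));
    return $title === '' ? null : $title;
}

// Positive control: content that contains what we are looking for.
$positive = '<div><h2 class="title">Quarterly Report</h2><p>Body</p></div>';

// Negative control: similar markup, but the target element is missing.
$negative = '<div><h2 class="subtitle">Other</h2><p>Body</p></div>';

var_dump(extractTitle($positive)); // string "Quarterly Report"
var_dump(extractTitle($negative)); // NULL
```

Running every parser against at least one page that should produce nothing is a cheap way to catch patterns that are too loose.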

