PHP Screen scraping...
- ********
I'm doing some screen scraping of a site while waiting for them to create an API in the spring of next year. However, I want to reduce the overhead of the text I'm parsing through.
The HTML page is fairly standard, in that I know I only want the text after "STARTTEXT" and up to "ENDTEXT", for example.
Is there a simple string function in PHP that does such a thing?
- ********0
If not, I'm guessing the easiest way is to just find where the element exists and then start from there?
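A rough sketch of that "find it and start from there" idea, assuming the markers appear literally in the page. textBetween is a made-up helper name, not a PHP built-in; it just wraps strpos and substr:

```php
<?php
// Return the text between two markers, or null if either marker is missing.
// textBetween() is a hypothetical helper; STARTTEXT/ENDTEXT are the
// placeholder markers from the question.
function textBetween(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);           // skip past the start marker itself
    $to = strpos($html, $end, $from);  // look for the end marker after it
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

// Made-up sample page for illustration.
$page = '<html>junk STARTTEXT the bit I want ENDTEXT more junk</html>';
$wanted = textBetween($page, 'STARTTEXT', 'ENDTEXT');
```

This only trims what gets parsed afterwards; the whole page still has to be downloaded first.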
- maximillion_0
You'll need to pull the whole page in first before you search within it.
I'd also run some tests with multiple queries, as some sites will only allow, say, 3 requests from your script within 15 mins. Or make your script send headers like a browser would (user agent etc.).
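Something like this is one way to send browser-style headers with plain file_get_contents. It's only a sketch: browserContext is a made-up helper name, and the user agent string, header values, URL and delay are placeholders, not anything the target site is known to require:

```php
<?php
// Build a stream context that sends browser-like request headers.
// browserContext() is a hypothetical helper; all values are placeholders.
function browserContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'header'     => "Accept: text/html\r\nAccept-Language: en\r\n",
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds before giving up on a slow page
        ],
    ]);
}

$context = browserContext('Mozilla/5.0 (compatible; ExampleScraper/1.0)');

// Usage (left commented out so the sketch makes no network calls):
// $html = file_get_contents('http://example.com/page.html', false, $context);
// sleep(2); // pause between requests to stay under any rate limit
```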
- ********0
ta,
Yeah, I've got about 40 of these to grab, so I'm trying to make the download as small as possible. At the moment it is taking about 1.5 seconds each, and I don't want people waiting a minute for a site to load, so if I can get it down to about 30 secs that will be a bit better...
- maximillion_0
will it scrape for every hit on your site?
- ********0
That's a very good point! Didn't think of that... guess I need a 'scrape' button in my CMS.
- maximillion_0
or use a CRON job if you have access to it
- fugged0
I'm assuming you've tried using regex to get at the content you want? If the site is coded XHTML strict, you could always parse it as XML, but that might actually be slower.
maximillion is right, I think. Set up a cron job to download and process that content ahead of time. If your site's traffic spikes really high, the scraping process might be too much load on the server. If it's 1.5 secs with one user, imagine 30+ at one time.
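A minimal sketch of what the cron-run script could look like, so visitors only ever read a local file. refreshCache, the callables and the paths are all made up for illustration; $fetch could be 'file_get_contents' and $extract whatever pulls out the wanted text:

```php
<?php
// Fetch each source page, extract the wanted text, and write it to a cache file.
// refreshCache() is a hypothetical helper.
//   $fetch   - callable returning the page HTML for a URL, or false on failure
//   $extract - callable turning the HTML into the text to keep
// Returns the number of pages successfully cached.
function refreshCache(array $urls, callable $fetch, callable $extract, string $cacheFile): int
{
    $chunks = [];
    foreach ($urls as $url) {
        $html = $fetch($url);
        if ($html === false) {
            continue; // skip pages that failed to download
        }
        $chunks[] = $extract($html);
    }
    // Write to a temp file, then rename, so a page hit never reads a
    // half-written cache.
    $tmp = $cacheFile . '.tmp';
    file_put_contents($tmp, implode("\n", $chunks));
    rename($tmp, $cacheFile);
    return count($chunks);
}

// Example crontab entry (path and schedule are placeholders), every 15 minutes:
// */15 * * * * php /path/to/scrape.php
```

The pages on the site then just read the cache file, so the 1.5 secs per source is paid once per cron run rather than once per visitor.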
- ********0
yep think the cron is the way forward
- ********0
maybe this can help:
http://blogoscoped.com/archive/2…
- jasonistaken0
preg_match('/STARTTEXT(.*?)ENDTE... $text, $matches);
Although your best bet, if you are scraping a site that has relatively valid XHTML, is to load it into a SimpleXML parser; then you can traverse the DOM or use XPath.
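The pattern above got cut off, so this is only a guess at the full call, using the STARTTEXT/ENDTEXT markers from the question, followed by a small SimpleXML/XPath sketch. The sample strings, element names and the "content" id are all made up:

```php
<?php
// Regex route: the "s" modifier lets "." match newlines, and ".*?" is
// non-greedy so the match stops at the first ENDTEXT.
$text = "header\nSTARTTEXT line one\nline two ENDTEXT footer";
$wanted = null;
if (preg_match('/STARTTEXT(.*?)ENDTEXT/s', $text, $matches)) {
    $wanted = trim($matches[1]); // $matches[1] holds the captured group
}

// XML route: only works if the markup is well-formed.
// The structure and the "content" id are assumptions for this example.
$xhtml = '<html><body><div id="content"><p>First</p><p>Second</p></div></body></html>';
$doc = simplexml_load_string($xhtml);
$paragraphs = [];
if ($doc !== false) {
    foreach ($doc->xpath('//div[@id="content"]/p') as $p) {
        $paragraphs[] = (string) $p; // cast the element to get its text
    }
}
```

One catch with the XML route: real XHTML pages usually declare the XHTML namespace on the html element, and then the XPath needs a prefix registered with registerXPathNamespace before queries like this match anything.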
- sublocked0
google: ruby / hpricot / mechanize
all the answers await you there.
- ********0
Sounds as stable as the one I've written... I'm now doing a cron job that pumps out the data as a text file, which is as small as possible rather than a rather bloated XML-type file.