PHP Screen scraping...

12 Responses

********
I'm doing some screen scraping of a site while waiting for them to release an API in the spring of next year. In the meantime I want to reduce the amount of text I'm parsing through.
The HTML page is fairly standard, so I know that I only want the text after "STARTTEXT" and up to "ENDTEXT", for example.
Is there a simple string function in PHP that does such a thing?
********
If not, I'm guessing the easiest way is to just find where the element exists and then start from there?
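A minimal sketch of that "find where it is and start from there" idea, using only strpos() and substr(). The helper name text_between and the sample page are made up for illustration; only the STARTTEXT and ENDTEXT markers come from the question.

```php
<?php
// Hypothetical helper: returns the text between two markers, or null if
// either marker is missing.
function text_between(string $html, string $start, string $end): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);           // skip past the start marker itself
    $to = strpos($html, $end, $from);  // look for the end marker after it
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

// Invented sample input standing in for the downloaded page.
$page = '<html>junk STARTTEXT the bit I want ENDTEXT more junk</html>';
$wanted = text_between($page, 'STARTTEXT', 'ENDTEXT');
```

This only trims what gets parsed afterwards; the whole page still has to be downloaded first.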
maximillion_
You'll need to pull the whole page in first before you can search within it.
I'd also run some tests with multiple queries, as some sites will only allow, say, 3 requests from your script within 15 minutes. Alternatively, make your script send headers like a browser would (user agent, etc.).
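One way to send browser-like headers, sketched with PHP's built-in HTTP stream context. The function name, the User-Agent string, the URL, and the 2-second pause are all placeholders, not values from this thread.

```php
<?php
// Hypothetical helper: builds a stream context that sends browser-like
// headers with every request made through it.
function browser_like_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Accept: text/html\r\nAccept-Language: en\r\n",
            'timeout'    => 10,   // seconds; give up on slow responses
        ],
    ]);
}

// Placeholder User-Agent string.
$context = browser_like_context('Mozilla/5.0 (compatible; ExampleScraper/1.0)');

// Usage (not run here; the URL is a placeholder):
// $html = file_get_contents('http://example.com/page.html', false, $context);
// sleep(2); // pause between requests so a rate limit is less likely to trip
```

Whether a given site throttles by request count, by user agent, or by something else is something you would have to find out by testing against it.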
********
Ta,
Yeah, I've got about 40 of these to grab, so I'm trying to make each download as small as possible. At the moment each one takes about 1.5 seconds, and I don't want people waiting a minute for the site to load. If I can get it down to about 30 seconds, that will be a bit better.
maximillion_
Will it scrape on every hit to your site?
********
That's a very good point! I didn't think of that. I guess I need a 'scrape' button in my CMS.
maximillion_
Or use a cron job, if you have access to one.
fugged
I'm assuming you've tried using regex to get at the content you want? If the site is coded as XHTML Strict, you could always parse it as XML, but that might actually be slower.
maximillion_ is right, I think. Set up a cron job to download and process that content ahead of time. If your site's traffic spikes really high, the scraping process might put too much load on the server. If it's 1.5 secs with one user, imagine 30+ at one time.
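A rough sketch of the cron approach: one script scrapes and writes a cache file, and page views only ever read that file. The function names, the cache path, and the crontab schedule in the comment are assumptions for illustration.

```php
<?php
// Hypothetical cron-side script. A crontab entry along the lines of
//   */30 * * * * php /path/to/scrape.php
// would run it every 30 minutes (path and interval are placeholders).

// Writes the scraped text to a cache file: write to a temp file first,
// then rename, so page views never read a half-written file.
function write_cache(string $cacheFile, string $text): bool
{
    $tmp = $cacheFile . '.tmp';
    if (file_put_contents($tmp, $text) === false) {
        return false;
    }
    return rename($tmp, $cacheFile);
}

// Called on each page view: reads the cached copy, never the remote site.
function read_cache(string $cacheFile, string $fallback = ''): string
{
    if (!is_readable($cacheFile)) {
        return $fallback;
    }
    $text = file_get_contents($cacheFile);
    return $text === false ? $fallback : $text;
}

// Placeholder cache location and content for the example.
$cacheFile = sys_get_temp_dir() . '/scrape_cache_example.txt';
write_cache($cacheFile, 'text scraped by the cron job');
$shown = read_cache($cacheFile);
```

With this split, the 1.5 seconds per page is paid once per cron run rather than once per visitor.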
********
Yep, I think cron is the way forward.
********
Maybe this can help:
http://blogoscoped.com/archive/2…
jasonistaken
preg_match('/STARTTEXT(.*?)ENDTE... $text, $matches);
Although, if you are scraping a site that has relatively valid XHTML, your best bet is to load it into a SimpleXML parser; then you can traverse the DOM or use XPath.
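Both routes, sketched side by side. The regex assumes the truncated pattern above was meant to end at the ENDTEXT marker from the original question; the sample strings, the "story" class, and the XPath expression are invented for the example.

```php
<?php
// Regex route: the /s modifier lets "." match newlines, and ".*?" is
// non-greedy so matching stops at the first ENDTEXT.
$text = "header\nSTARTTEXT line one\nline two ENDTEXT footer";
$found = null;
if (preg_match('/STARTTEXT(.*?)ENDTEXT/s', $text, $matches)) {
    $found = trim($matches[1]);
}

// XPath route: only works if the markup is well-formed XML.
// The element and class names here are invented for the example.
$xhtml = '<html><body><div class="story"><p>First</p><p>Second</p></div></body></html>';
$paragraphs = [];
$doc = simplexml_load_string($xhtml);
if ($doc !== false) {
    foreach ($doc->xpath('//div[@class="story"]/p') as $p) {
        $paragraphs[] = (string) $p;
    }
}
```

On real pages that are not well-formed, simplexml_load_string() will return false, so the regex route is the more forgiving of the two.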
sublocked
Google: Ruby / Hpricot / Mechanize
All the answers await you there.
********
Sounds as stable as the one I've written... I'm now running a cron job that pumps out the data as a text file, which is as small as possible, rather than a rather bloated XML-type file.