How to create a web Scraper using PHP and jQuery


All that is needed is some code for retrieving the content of a webpage, and some code to push the retrieved content to the client. The easiest way to retrieve content from a webpage in PHP is by using file_get_contents or, much better, through the cURL library.


The basics of a PHP web scraper class

// The constant name must be a quoted string; a bare CURL_ENABLED is an undefined constant here.
define('CURL_ENABLED', true);

class Scraper  {
	private $url;
	private $header = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8';
	public $result;

	public function __construct($url) {
		$this->url=$url;
	}

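	private function exec_CURL() {
		$ch = curl_init();
		curl_setopt($ch, CURLOPT_URL, $this->url);
		// Send the browser string as the User-Agent; CURLOPT_HEADER is only a boolean flag for including response headers in the output.
		curl_setopt($ch, CURLOPT_USERAGENT, $this->header);
		// Make curl_exec() return the response instead of printing it and returning true.
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
		curl_setopt($ch, CURLOPT_TRANSFERTEXT, 1);
		curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
		curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
		$this->result = curl_exec($ch);
		curl_close($ch);
	}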

	private function exec_FGC() {
		$this->result=file_get_contents($this->url);
	}
	
	public function run() {
		switch (CURL_ENABLED) {
			case true : $this->exec_CURL(); break;
			case false : $this->exec_FGC(); break;
		}
	}
}

Place this class in a file called Scraper.php that is accessible through AJAX. Notice the CURL_ENABLED flag: I assume that the cURL library is enabled on the web server. If it is not, set the flag to false and the class falls back to file_get_contents. Add the following lines to handle the argument (the URL to be scraped) and to instantiate the class:

$url = $_GET['url'];
$scraper = new Scraper($url);
$scraper->run();
echo '
';
print_r($scraper->result);
echo '
';

The last part is a client-side script that pushes the content of the scraped webpage into an HTML element. Some preprocessing of the HTML content may also be desirable. Here is the code used by the scraper demo:

$("#submit").click(function () {
	var url=$("#url").val();
	var div="#result";
	url = 'Scraper.php?url='+encodeURIComponent(url);
	$.ajax({
		url: url,
		cache: false,
		async: true,
		timeout : 6000,
		cleanUp : function(div, tag) {
			var t = div.getElementsByTagName(tag);
			var ii = t.length;
			while (ii--) {
				t[ii].parentNode.removeChild(t[ii]);
			}
			return div;
		},
		dataFilter: function(response) {
			var div = document.createElement('div');
			div.innerHTML = response;
			div = this.cleanUp(div,'script');
			div = this.cleanUp(div,'meta');
			div = this.cleanUp(div,'link');
			div = this.cleanUp(div,'style');
			return div.innerHTML;
		},
		success: function(html) {
			$(div).html(html);
		}
	});
});

About the preprocessing (dataFilter)

The preprocessing of the scraped content is mostly cosmetic. Many webpages do a lot of their styling in scripts, have absolutely positioned elements and so on. Inserted without modification into a visible element inside another page, such content can mess up the styling of the "parent" page.
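
The same filtering idea can also be tried on the raw response string, for example to experiment with it outside a browser. The sketch below is not the demo's code: it swaps the DOM-based cleanUp for regular expressions, the helper name stripTags is hypothetical, and regular expressions are not a real HTML parser (a literal "</script>" inside a script string would confuse it, for instance). Treat it as a rough approximation only.

```javascript
// Hypothetical helper (not part of the original demo).
// pairedTags: tags removed together with their content, e.g. script, style.
// voidTags:   tags without a closing tag, e.g. meta, link.
function stripTags(html, pairedTags, voidTags) {
  let out = html;
  for (const tag of pairedTags) {
    // Non-greedy match from the opening tag to the first matching closing tag.
    const re = new RegExp('<' + tag + '\\b[^>]*>[\\s\\S]*?</' + tag + '\\s*>', 'gi');
    out = out.replace(re, '');
  }
  for (const tag of voidTags) {
    // Match the single tag including its attributes.
    const re = new RegExp('<' + tag + '\\b[^>]*>', 'gi');
    out = out.replace(re, '');
  }
  return out;
}
```

Called as stripTags(response, ['script', 'style'], ['meta', 'link']), it removes the same four kinds of tags as the dataFilter above. The DOM approach remains the more robust choice inside a browser.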


Why use plain JavaScript, not jQuery, for preprocessing? (cleanUp)

Because I simply could not get it to work properly with jQuery! jQuery strips <head>, <html> and <body> tags, since only elements that are allowed inside a <div> are valid and will be added. Using .find(), .remove() and so on seems to completely ignore tags like <link> and <meta>, even though they actually are included in the HTML response and will end up in the <div> if not removed. I suspect jQuery of actively ignoring tags like <meta> as a "service", but I have not investigated the issue.

Anyway - consider <link>, <meta> etc. as garbage - you are interested in the content of the page, not meta tags, unless you want to use the scraper as a substitute for an <iframe>.


Now you can see how it works on the scraper demo tab. Remember, this is the most basic scraper imaginable. In real life you will need to consider issues such as redirection and error handling.
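
One small piece of that error handling can be sketched on the client side: checking the input before any request is sent. The helper below is an assumption, not part of the demo. The name buildScraperRequest and its endpoint parameter are hypothetical, and it relies on the standard URL constructor being available in the browser.

```javascript
// Hypothetical helper (not part of the original demo): build the request URL
// for Scraper.php, or return null if the input is not an absolute http(s) URL.
function buildScraperRequest(input, endpoint = 'Scraper.php') {
  let parsed;
  try {
    parsed = new URL(String(input).trim());
  } catch (e) {
    return null; // not parseable as an absolute URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null; // reject file:, ftp:, javascript: and similar schemes
  }
  return endpoint + '?url=' + encodeURIComponent(parsed.href);
}
```

In the click handler, a null result could be used to show a message in #result instead of calling $.ajax. A check like this is only a convenience for the user: it can be bypassed, so Scraper.php would need its own validation of $_GET['url'] as well.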

