If you want to perform web searches in your web application, or need to parse SERPs (search engine result pages) for whatever reason, there is no need to use the large APIs provided by Google or Yahoo.
By using a scraper, it is quite easy to perform a server-side web search and deliver the SERP back to the client. But the web search is not very useful without some way to parse and index the search engine result page.
No long explanation is necessary. The various search engines differ in the details, but basically each SERP is built around an easily detectable, almost identical HTML structure. For each result, the major search engines always deliver at least the following items:
- A page title
- A page description / snippet
- A URL to the page
For the code examples below, it is assumed that the SERP is loaded into an element (a div, a span, whatever) with the id "result". The JavaScript code (which uses jQuery) is designed to operate on the first result item only, with a method for removing that item. This works well because the SERP can be iterated while each parsed item is removed in the same loop.
For Google, Bing and Yahoo you can iterate through the SERP using the generic pseudo code below:
var element = "#result";
while (<PARSER>.hasMore(element)) {
    var raw_SERP_item = <PARSER>.SERP(element);
    var title = <PARSER>.SERP_title(element);
    var URL = <PARSER>.SERP_url(element);
    var description = <PARSER>.SERP_text(element);
    // Remove the parsed item so the next one becomes the first.
    <PARSER>.SERP_remove(element);
}
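As a minimal sketch, the loop above can be wrapped in a reusable function that gathers every result into an array. The name collectResults and the shape of the returned objects are assumptions made for illustration; the function only relies on the six parser methods used in the pseudo code.

```javascript
// Hypothetical helper: drains every result item from `element` using any
// parser object that exposes the six methods from the pseudo code.
function collectResults(parser, element) {
  var results = [];
  while (parser.hasMore(element)) {
    results.push({
      raw: parser.SERP(element),
      title: parser.SERP_title(element),
      url: parser.SERP_url(element),
      description: parser.SERP_text(element)
    });
    // Remove the item just read so the next pass sees the following one.
    parser.SERP_remove(element);
  }
  return results;
}
```

With one of the parser objects from this article it could be called as collectResults(GoogleParser, "#result"). Note that the loop empties the result element as a side effect.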
Parsing Google SERPs
Google stores each result as a list item <li> with the class name "g", the page title in an <h3> with the class "r", and the page description in an element with the class "s". This gives the following simple JavaScript object:
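var GoogleParser = {
    // True while at least one unparsed result item remains.
    hasMore : function(element) {
        return ($(element).find('li.g').length > 0);
    },
    // Raw HTML of the first result item.
    SERP : function(element) {
        return $(element).find('li.g:first').html();
    },
    // Page title of the first result item.
    SERP_title : function(element) {
        return $(element).find('li.g:first').find('h3.r').text();
    },
    // Link target of the first result item.
    SERP_url : function(element) {
        return $(element).find('li.g:first').find('h3.r').find('a').attr('href');
    },
    // Description text; works on a clone so the page itself is untouched.
    SERP_text : function(element) {
        var g = $(element).find('li.g:first').find('.s').clone();
        // Drop the '.osl' and '.f' children before reading the text.
        g.find('.osl').remove();
        g.find('.f').remove();
        return g.text();
    },
    // Removes the first result item so the next one becomes first.
    SERP_remove : function(element) {
        $(element).find('li.g:first').remove();
    }
};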
Parsing Bing SERPs
Microsoft's Bing also stores each result as a list item <li>, here with the class name "sa-wr". The page title is again in an <h3>, but the page description sits in a more logical <p> tag. This gives the following simple JavaScript object:
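var BingParser = {
    // True while at least one unparsed result item remains.
    hasMore : function(element) {
        return ($(element).find('li.sa-wr').length > 0);
    },
    // Raw HTML of the first result item.
    SERP : function(element) {
        return $(element).find('li.sa-wr:first').html();
    },
    // Page title of the first result item.
    SERP_title : function(element) {
        return $(element).find('li.sa-wr:first').find('h3').text();
    },
    // Link target of the first result item.
    SERP_url : function(element) {
        return $(element).find('li.sa-wr:first').find('h3').find('a').attr('href');
    },
    // Description text, taken from the <p> tag.
    SERP_text : function(element) {
        return $(element).find('li.sa-wr:first').find('p').text();
    },
    // Removes the first result item so the next one becomes first.
    SERP_remove : function(element) {
        $(element).find('li.sa-wr:first').remove();
    }
};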
Parsing Yahoo SERPs
Yahoo is a little different. Yahoo delivers the results inside an overall container with the id "web", and each result item as an <li> in an ordered list:
var YahooParser = {
    // True while at least one unparsed result item remains.
    hasMore : function(element) {
        return ($(element).find('#web').find('ol li').length > 0);
    },
    // Raw HTML of the '.res' block inside the first result item.
    SERP : function(element) {
        return $(element).find('#web').find('ol li:first').find('.res').html();
    },
    // Page title of the first result item.
    SERP_title : function(element) {
        return $(element).find('#web').find('ol li:first').find('h3').text();
    },
    // Link target of the first result item.
    SERP_url : function(element) {
        return $(element).find('#web').find('ol li:first').find('h3').find('a').attr('href');
    },
    // Description text, taken from the '.abstr' element.
    SERP_text : function(element) {
        return $(element).find('#web').find('ol li:first').find('.abstr').text();
    },
    // Removes the first result item so the next one becomes first.
    SERP_remove : function(element) {
        $(element).find('#web').find('ol li:first').remove();
    }
};
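Since all three parsers expose the same six methods, the <PARSER> placeholder in the generic loop can be filled in at runtime. The sketch below is one possible way to do that. The names selectParser and PARSER_METHODS, and the idea of a registry object keyed by lower-case engine name, are assumptions for illustration and not part of the parsers themselves.

```javascript
// Method names shared by the parser objects in this article.
var PARSER_METHODS = [
  'hasMore', 'SERP', 'SERP_title', 'SERP_url', 'SERP_text', 'SERP_remove'
];

// Hypothetical helper: looks up a parser by engine name in `registry`
// and verifies that it implements every method the generic loop calls.
function selectParser(engine, registry) {
  var key = String(engine).toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(registry, key)) {
    throw new Error('No parser registered for "' + engine + '"');
  }
  var parser = registry[key];
  PARSER_METHODS.forEach(function (name) {
    if (typeof parser[name] !== 'function') {
      throw new Error('Parser "' + key + '" lacks method ' + name);
    }
  });
  return parser;
}
```

A registry could look like { google: GoogleParser, bing: BingParser, yahoo: YahooParser }, so that selectParser("Bing", registry) hands back BingParser.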
Now you have the basic methods for parsing SERPs. Of course, like me, you will probably want methods for counting the results, looking only for certain domains and so on. But these should be easy to implement for anyone who understands the basic technique.
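As a rough sketch of such extras, the helpers below count results per host and keep only results from a given domain. They assume the parsed items have already been gathered into an array of objects with a url property. The names hostOf, filterByDomain and countByHost are hypothetical, and the sketch relies on the standard URL constructor found in current browsers and Node.js.

```javascript
// Returns the host name of an absolute URL, or null if it cannot be parsed.
function hostOf(url) {
  try {
    return new URL(url).hostname.toLowerCase();
  } catch (e) {
    return null;
  }
}

// Keeps only the results whose host is `domain` or one of its subdomains.
function filterByDomain(results, domain) {
  var wanted = domain.toLowerCase();
  return results.filter(function (r) {
    var host = hostOf(r.url);
    return host !== null &&
      (host === wanted || host.endsWith('.' + wanted));
  });
}

// Counts how many results each host contributes; unparsable URLs are skipped.
function countByHost(results) {
  var counts = Object.create(null);
  results.forEach(function (r) {
    var host = hostOf(r.url);
    if (host === null) return;
    counts[host] = (counts[host] || 0) + 1;
  });
  return counts;
}
```

The total number of results is then simply the length of the array, and filterByDomain(results, "example.com").length gives the number of hits for one site.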
