Duplicate Content Finder FAQ
► An instructive example ...
The best way to explain the duplicate content problem, and how this tool can help, is a single example that shows it all. The Wikipedia biography of Bill Gates has been copied to a great many webpages, where it is misused to attract traffic and to fake credibility.
Run the Duplicate Content Finder for the first 3 paragraphs of the Bill Gates biography
► Understanding the findings
SERP %
SERP % is the searchable (e.g. top-indexed) part of the duplicate content you are looking for. A high SERP % (above 60%) indicates a very high duplicate content rate.
Real DC %
Real DC % is the amount of actual duplicate content on the page being investigated. A percentage above 70% can be considered a near-total or total duplicate, or plagiarism.
Page Title / URL
Part of the <title> tag and a clickable URL for the page in question.
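To make the idea of a duplicate percentage concrete, here is a minimal sketch. It is not the tool's actual formula, which is not documented here. It assumes that Real DC % is the share of the source's three-word groups ("shingles") that also occur on the page. The function names and the shingle size are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of consecutive word groups ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        # Texts shorter than one shingle count as a single group.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def real_dc_percent(source, page, size=3):
    """Assumed measure: share of the source's shingles found on the page (0-100)."""
    source_shingles = shingles(source, size)
    if not source_shingles:
        return 0.0
    shared = source_shingles & shingles(page, size)
    return 100.0 * len(shared) / len(source_shingles)
```

Under this assumption, a page that contains the whole source text scores 100, and a page that shares no three-word group with it scores 0.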

► High SERP but low Real DC - why?
This is typically the case for portals, news sites and similar pages. The page was indexed by search engines under phrases matching the content you searched for, but has since been updated with other content. More rarely, the server hosting the target URL blocks automated page loads. An example is Facebook.com, which strongly dislikes any kind of "unusual" sniffing. In those cases, if the SERP % is high, you can be sure that the page contains duplicate content.
► 100% duplicate content, but Real DC is below 100% - why?
Both the source (your input) and the content of the duplicate content candidates are processed through multiple filters. This is necessary to make a real content comparison possible. "This is a test", "this,is a test" and "THIS.is.A.test" look different, but to a search engine, or in a duplicate content context, they are pretty much alike.
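The kind of filtering described above can be sketched as follows. The tool's real filters are not listed here, so this assumes just three steps: lowercasing, turning punctuation into spaces, and collapsing whitespace. The function name normalize is hypothetical.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = text.lower()
    # Anything that is neither a word character nor whitespace becomes a space.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


samples = ["This is a test", "this,is a test", "THIS.is.A.test"]
print({normalize(s) for s in samples})
```

All three sample strings reduce to the same text, "this is a test", which is why they count as the same content once filtered.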
► "Scanner" - which to choose?
Scanner is the search engine used for the preliminary internet scan. Currently Bing, Yahoo and Google are supported. It does not really make a big difference which one you choose. Google delivers more results and therefore produces a higher SERP % for more pages, but if there is duplicate content, Bing or Yahoo will find it as well. The duplicate content issue is not a matter of being indexed by a search engine, but of the good rank the search engine gives to unique content. If you choose Google as the scanner, you may see blank results after heavy use. This is because Google suspects your computer of generating automated requests (which is true) and therefore inserts a 302 redirect. It happens very rarely, but if it does, switch to Bing or Yahoo; Google should be up and running again after a few hours. I have not yet made the effort to show a "nice" 302 message.
► Tolerance - what is it?
Tolerance expresses the minimum SERP % that qualifies a URL as a possible duplicate content candidate. A high tolerance gives fewer and more certain results, but is likely to overlook webpages with obvious duplicate content. A lower tolerance gives more results and also catches poorly indexed pages, but may deliver some results that are completely out of context. However, since "Cleanup" is checked by default, this is not a problem. Generally it should not be necessary to change Tolerance.
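As an illustration of how a tolerance threshold behaves, here is a minimal sketch. The URLs and SERP % values are made up, and the function name candidates is hypothetical. The only assumption is that a URL qualifies when its SERP % is at least the tolerance.

```python
def candidates(results, tolerance):
    """Keep the URLs whose SERP % reaches the tolerance (assumed rule)."""
    return [url for url, serp_percent in results if serp_percent >= tolerance]


# Invented scan results: (URL, SERP %)
results = [
    ("http://example.com/copy", 80),
    ("http://example.org/partial", 35),
    ("http://example.net/unrelated", 10),
]

print(candidates(results, tolerance=50))  # high tolerance: one certain result
print(candidates(results, tolerance=20))  # lower tolerance: also the partial match
```

Lowering the tolerance lets the poorly indexed page through, at the cost of a longer list for the later page-by-page check.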
► Options
Cleanup
If checked, analyzed pages with no duplicate content are removed from the final list.
Sort
If checked, the final list of results is sorted with the most serious duplicate content pages on top.
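The two options can be pictured as a final pass over the result list. This is only a sketch: it assumes "no duplicate content" means a Real DC % of zero and "most serious" means the highest Real DC %. The field names and the function finalize are hypothetical.

```python
def finalize(results, cleanup=True, sort=True):
    """Apply the Cleanup and Sort options to a list of result rows."""
    rows = list(results)
    if cleanup:
        # Assumed meaning of "no duplicate content": Real DC % of zero.
        rows = [row for row in rows if row["real_dc"] > 0]
    if sort:
        # Assumed meaning of "most serious": highest Real DC % first.
        rows = sorted(rows, key=lambda row: row["real_dc"], reverse=True)
    return rows


# Invented rows for illustration
rows = [
    {"url": "http://example.org/partial", "real_dc": 40},
    {"url": "http://example.net/unrelated", "real_dc": 0},
    {"url": "http://example.com/copy", "real_dc": 95},
]
print([row["url"] for row in finalize(rows)])
```

With both options checked, the unrelated page disappears and the near-total copy moves to the top.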
► Timeout
While the Real DC value and "Show actual duplicate content" are being calculated, a list of "suspicious" webpages is continuously scanned. Timeout is the maximum waiting time in milliseconds for each webpage to be loaded and scanned. A high timeout gives a more accurate result, and the main difference is a bigger set of matches, but it increases the overall time consumed by the Duplicate Content Finder. The default timeout of 6000 ms should be enough in most cases. Reasons for setting a higher timeout may be a very large number of "hits" or a lot of slow pages blocking each other.
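A per-page timeout of this kind can be sketched with Python's standard library. This is not how the tool itself is implemented. The function name load_page and the opener parameter are hypothetical, and a page that fails to load in time is simply skipped by returning None.

```python
import urllib.request


def load_page(url, timeout_ms=6000, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if it cannot be loaded in time."""
    try:
        # urlopen expects its timeout in seconds, so convert from milliseconds.
        with opener(url, timeout=timeout_ms / 1000.0) as response:
            return response.read()
    except OSError:
        # Timeouts and URL or connection errors are all subclasses of OSError.
        return None
```

A caller would skip pages for which load_page returns None, which is why a short timeout yields a smaller set of matches.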
