
Note: The Keyword Extractor & Analysis is purely experimental and far from complete.
► The Report tables
Levenshtein / title
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to transform one word into another.
It is desirable for a webpage to contain variants of its keywords, as well as words that lead to "Did you mean ..." suggestions in Google.
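As an illustration of the distance itself, here is a minimal Python sketch of the classic dynamic-programming approach. It is not the Keyword Extractor's own code; the function name levenshtein is chosen for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # Keep the shorter word in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

With this sketch, levenshtein("keyword", "keywords") is 1, so the two words count as close variants of each other.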
► FAQ - Frequently Asked Questions
Some words appear in the report that I cannot see on the webpage. Why?
The Keyword Extractor extracts all content, including hidden content, such as content styled with display: none;
or content hidden in containers with no size. An example is YouTube, which has a lot of such hidden content.
► Definitions - Word, Keyword, Clause and Sentence
Words
A word is any sequence of characters between punctuation marks or stop characters. Stop characters are spaces, periods, commas etc.
Keywords
A keyword is a word considered particularly important in its context.
Clauses
A clause is a sequence of words between two punctuation marks or two stop characters.
Sentences
A sentence is a sequence of clauses between periods, exclamation points or question marks.
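The definitions above can be sketched in Python roughly as follows. This is only an approximation for illustration: the exact set of clause delimiters is an assumption, and the function names are made up for this example.

```python
import re

SENTENCE_END = r"[.!?]+"      # periods, exclamation points, question marks
CLAUSE_END = r"[,;:()\"]+"    # assumed clause punctuation, not the tool's actual list

def split_sentences(text):
    """Split text into sentences at ., ! and ?."""
    return [s.strip() for s in re.split(SENTENCE_END, text) if s.strip()]

def split_clauses(sentence):
    """Split one sentence into clauses at the assumed punctuation marks."""
    return [c.strip() for c in re.split(CLAUSE_END, sentence) if c.strip()]

def split_words(clause):
    """Split one clause into words at whitespace."""
    return clause.split()
```

For the text "Keywords matter, but context matters more. Is that clear?" this gives two sentences, the first of which has the two clauses "Keywords matter" and "but context matters more".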
► The Keyword extraction and analysis cycle explained
Distillation
The content of the webpage being processed is cleaned of tags, comments, out-of-scope garbage, scripts, noscript blocks and so on.
"Homogenization" (for lack of a better word)
The content is stripped of unindexable, non-searchable characters and/or character sequences and is then split up into regular text fragments. HTML entities and other language-specific content are also resolved as unambiguously as possible.
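The first two steps could look something like the following Python sketch, built on the standard html.parser and unicodedata modules. It is an assumption about how such a cleaner might work, not the tool's real implementation: the names TextDistiller, distill and homogenize are hypothetical, and Unicode NFKC normalization is just one possible way to fold characters such as ligatures into plain searchable text.

```python
import re
import unicodedata
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "noscript", "style"}  # assumed out-of-scope containers

class TextDistiller(HTMLParser):
    """Collects text fragments, dropping tags, comments and skipped containers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # HTML entities are decoded for us
        self.fragments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped implicitly.
        if self._skip_depth == 0:
            self.fragments.append(data)

def distill(html_text):
    parser = TextDistiller()
    parser.feed(html_text)
    parser.close()
    return parser.fragments

def homogenize(fragments):
    """Normalize each fragment and drop the ones that end up empty."""
    result = []
    for fragment in fragments:
        text = unicodedata.normalize("NFKC", fragment)  # e.g. folds ligatures
        # Remove control and format characters, but keep whitespace for now.
        text = "".join(ch for ch in text
                       if unicodedata.category(ch)[0] != "C" or ch.isspace())
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            result.append(text)
    return result
```

Note that this sketch does not look at CSS at all, so text inside an element styled with display: none is kept, which matches the behaviour described in the FAQ.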
Text processing
The now pure regular text is run through some processing filters:
A. Some algorithms for identifying the content.
B. Customized filters, e.g. Number Processing, Experimental processing.
C. Justification, for example the experimental dictionaries.
Extraction
The actual splitting of the text into Words, Clauses and Sentences.
Analysis
It is now possible to count words, recognise keywords, calculate density, relevance to the page title and more.
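As a rough illustration of the counting part, the Python sketch below computes word counts and a simple density. Density is assumed here to mean occurrences divided by the total number of words, which may differ from the formula the tool actually uses; the function name keyword_density is hypothetical.

```python
from collections import Counter

def keyword_density(words, top=5):
    """Return (word, count, density) for the most frequent words.

    Words are compared case-insensitively; density is count / total words.
    """
    counts = Counter(word.lower() for word in words)
    total = sum(counts.values())
    return [(word, n, n / total) for word, n in counts.most_common(top)]
```

For the words SEO, keyword, seo, page this reports "seo" first, with a count of 2 and a density of 0.5.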
► Number processing
Exclude Numerals
Strips out numbers, that is, integers only, such as 0, 117, 3456.
Text containing numbers, such as 87a, 100k, 80ties, is not stripped.
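A filter with this behaviour could be sketched in Python as below. The function name exclude_numerals is made up for the example, and only the ASCII digits 0-9 are treated as numerals here, which is an assumption.

```python
import re

INTEGER = re.compile(r"[0-9]+")

def exclude_numerals(words):
    """Drop words consisting of digits only; mixed words such as 87a are kept."""
    return [word for word in words if INTEGER.fullmatch(word) is None]
```

Applied to 0, 117, 3456, 87a, 100k, 80ties, only 87a, 100k and 80ties remain.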
Exclude Roman Numbers
Strips out any Roman numeral between I and MMMCMXCIX (1-3999).
Only correct Roman numeral notation is considered. VIII is a Roman numeral, IIX is not (in theory both mean "8", but only VIII is correct notation).
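Strict notation in the range 1-3999 can be checked with a well-known regular expression, sketched here in Python. The function names are hypothetical, and matching uppercase letters only is an assumption; the tool may treat case differently.

```python
import re

# Thousands, hundreds, tens and ones, each in strict subtractive notation.
ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)

def is_roman_numeral(word):
    """True for correctly written Roman numerals from I to MMMCMXCIX."""
    # The pattern also matches the empty string, so that case is excluded first.
    return bool(word) and ROMAN.fullmatch(word) is not None

def exclude_roman_numerals(words):
    return [word for word in words if not is_roman_numeral(word)]
```

One consequence of any such filter is that ordinary words that happen to be valid numerals, such as the pronoun I or the word MIX (1009), are stripped as well.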
Exclude spelled numbers
Strips out a broad range of spelled numbers, such as 1st, 17th, billions...
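A small Python sketch of the idea follows. The word list is a tiny illustrative sample and the names are hypothetical; the range the tool really covers is not documented here.

```python
import re

# Digits followed by an ordinal suffix, e.g. 1st, 17th.
ORDINAL = re.compile(r"[0-9]+(st|nd|rd|th)", re.IGNORECASE)

# Illustrative sample only, not the tool's actual list.
NUMBER_WORDS = {
    "one", "two", "three", "ten", "hundred", "hundreds",
    "thousand", "thousands", "million", "millions",
    "billion", "billions", "first", "third",
}

def is_spelled_number(word):
    return ORDINAL.fullmatch(word) is not None or word.lower() in NUMBER_WORDS

def exclude_spelled_numbers(words):
    return [word for word in words if not is_spelled_number(word)]
```

In this sketch 1st, 17th and billions are removed, while a word like 80ties is kept because it matches neither the ordinal pattern nor the sample list.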
► Experimental processing
Exclude days and abbreviations
Strips out weekdays, e.g. monday, mon ...
Exclude months and abbreviations
Strips out months, e.g. october, oct ...
Exclude Colors
Strips out color names, such as red, purple, magenta ...
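All three experimental filters can be thought of as simple exclusion lists. The Python sketch below shows the idea; the lists are deliberately short samples and the function name exclude_terms is an assumption, not the tool's API.

```python
# Short illustrative samples, not the tool's actual lists.
WEEKDAYS = {"monday", "mon", "tuesday", "tue", "wednesday", "wed",
            "thursday", "thu", "friday", "fri", "saturday", "sat",
            "sunday", "sun"}
MONTHS = {"january", "jan", "february", "feb", "march", "mar",
          "april", "apr", "may", "june", "jun", "july", "jul",
          "august", "aug", "september", "sep", "october", "oct",
          "november", "nov", "december", "dec"}
COLORS = {"red", "purple", "magenta", "green", "blue", "yellow"}

def exclude_terms(words, *term_sets):
    """Drop every word that appears (case-insensitively) in any given set."""
    excluded = set().union(*term_sets)
    return [word for word in words if word.lower() not in excluded]
```

For example, exclude_terms(words, WEEKDAYS, MONTHS, COLORS) removes Monday, oct and red in one pass. Abbreviations such as may, mar, sat or sun are also ordinary words, so a filter like this will occasionally strip more than intended, which may be part of why these options are marked experimental.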