A step-by-step walkthrough: raw counts → relative frequencies → z-scores → document similarity
Stylometrists typically analyse the most frequent function words: grammatical words such as "the", "a", "of", "at", and "to". Unlike content words, function words are used largely unconsciously and are hard for an author to control deliberately, which makes them strong markers of individual style. The table below shows raw counts of five such words across several documents, along with each document’s total word count.
Documents have different lengths. A 10,000-word novel will naturally contain more occurrences of "the" than a 2,000-word essay, even if both authors use the word at the same rate. We therefore cannot compare raw counts directly; we first need to normalise them by document length.
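To make the length problem concrete, here is a minimal Python sketch that divides each raw count by its document's total word count. The document names, the counts, and the helper `relative_frequencies` are all invented for illustration; they are not the values from the table.

```python
# Hypothetical data for illustration only; NOT the counts from the table.
raw_counts = {
    "novel": {"the": 600, "of": 300},
    "essay": {"the": 120, "of": 60},
}
total_words = {"novel": 10_000, "essay": 2_000}


def relative_frequencies(counts, totals):
    """Divide every word count by the total word count of its document."""
    return {
        doc: {word: n / totals[doc] for word, n in words.items()}
        for doc, words in counts.items()
    }


rel = relative_frequencies(raw_counts, total_words)

# Print each document's rates, rounded for readability.
for doc, freqs in rel.items():
    print(doc, {word: round(f, 4) for word, f in freqs.items()})
```

In this made-up example the novel contains five times as many occurrences of "the" as the essay, yet both documents use the word at the same rate of 6 per 100 words. The raw counts hide that equivalence, and the relative frequencies expose it.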