Google has long been using something like TF*IDF (or TF-IDF, TFIDF, TF.IDF — the Artist Formerly Known as Prince of ranking factors) as a ranking signal for your content, as the search engine seems to focus more on term frequency than on simply counting keywords. While the visual complexity of the algorithm might put a lot of people off, being able to compute TF*IDF by hand matters far less than understanding how it works.
TF*IDF is used by search engines to better understand content that would otherwise be undervalued. For example, if you search for the term "Coke" on Google, TF*IDF helps Google figure out whether a page titled "COKE" is about:
a) Coca-Cola.
b) Cocaine.
c) A solid, carbon-rich residue derived from the distillation of crude oil.
d) A county in Texas.
The aim of this article is to guide content writers and SEO experts through the often-misunderstood topic of TF*IDF. By better understanding how Google uses this algorithm, content writers can reverse engineer TF*IDF and optimize a website's content for both users and search engines. And SEOs can use it as a tool for hunting keywords with higher search volume and comparatively lower competition.
What is TF*IDF?
TF*IDF is an information retrieval technique that weighs a term's frequency (TF) and its inverse document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the TF and IDF scores of a term is called the TF*IDF weight of that term.
Put simply, the higher the TF*IDF score (weight), the rarer the term and vice versa.
The TF*IDF algorithm weighs a keyword in any piece of content and assigns it an importance based on the number of times it appears in the document. More importantly, it checks how relevant the keyword is across the whole collection of documents, which is referred to as the corpus.
For a term t in a document d, the weight W(t,d) of term t in document d is given by:
W(t,d) = TF(t,d) × log(N / DF(t))
Where:
TF(t,d) is the frequency of t in document d (the number of occurrences divided by the document's word count, as in the example below).
DF(t) is the number of documents containing the term t.
N is the total number of documents in the corpus.
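In code, the formula above can be sketched as a small Python function. The function and argument names here are my own, not taken from any particular library, and the base-10 logarithm matches the worked example later in this article:

```python
import math

def tfidf_weight(term_count, doc_length, docs_with_term, total_docs):
    """W(t,d) = TF(t,d) * log(N / DF(t))."""
    tf = term_count / doc_length                    # TF(t,d)
    idf = math.log10(total_docs / docs_with_term)   # log(N / DF(t))
    return tf * idf
```

Note that a term appearing in every document gets an IDF of log(1) = 0, so ubiquitous words contribute no weight at all.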
All right. Don’t panic if you feel a headache coming on.
Let’s define this more concretely.
TF*IDF Defined
How is TF*IDF calculated? The TF (term frequency) of a word is the number of times it appears in a document divided by the document's total word count. When you know it, you can see whether you're using a term too much or too little.
For example, when a 100-word document contains the term "cat" 12 times, the TF for the word "cat" is:
TF(cat) = 12/100 = 0.12
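Computed from raw text, TF might look like this minimal sketch (a hypothetical helper using naive whitespace tokenization; real search engines tokenize far more carefully, handling punctuation, stemming, and so on):

```python
def term_frequency(term, text):
    # Split on whitespace and lowercase everything for a rough word list.
    words = text.lower().split()
    # Occurrences of the term divided by the document's word count.
    return words.count(term.lower()) / len(words)
```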
The IDF (inverse document frequency) of a word is the measure of how significant that term is in the whole corpus.
Let's say our corpus (i.e. the web) contains 10,000,000 documents, and 300,000 of them contain the term "cat". The IDF (i.e. log(N/DF)) is then given by the log of the total number of documents (10,000,000) divided by the number of documents containing the term "cat" (300,000).
IDF(cat) = log(10,000,000/300,000) ≈ 1.52
∴ W(cat) = (TF*IDF)(cat) = 0.12 × 1.52 ≈ 0.18
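You can verify the arithmetic of this worked example directly in Python (base-10 log, rounded to two decimal places):

```python
import math

tf = 12 / 100                           # TF(cat) = 0.12
idf = math.log10(10_000_000 / 300_000)  # IDF(cat), approx. 1.52
weight = tf * idf                       # W(cat), approx. 0.18
```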
Now that you have this figured out (right?), let’s look at how this can benefit you.
How you can benefit from using TF*IDF
Gather your words. Write your content. Run a TF*IDF report for your words and get their weights. The higher the numerical weight, the rarer the term; the smaller the weight, the more common the term. Then compare the terms with high TF*IDF weights against their search volumes on the web, and select those with higher search volume and lower competition. Work smart.
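As a rough illustration of such a report, here is a small self-contained sketch (not the output format of any real SEO tool) that scores every term in every document of a toy corpus using the formula from earlier:

```python
import math
from collections import Counter

def tfidf_report(docs):
    """Return one {term: TF*IDF weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()          # DF(t): how many documents contain term t
    for words in tokenized:
        df.update(set(words))
    reports = []
    for words in tokenized:
        counts = Counter(words)
        reports.append({
            term: (count / len(words)) * math.log10(n / df[term])
            for term, count in counts.items()
        })
    return reports
```

Terms that appear in every document score exactly 0 (log of 1), and terms unique to one document score highest, which is precisely the "rare term, high weight" behavior described above.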