What is TF-IDF?
TF-IDF, short for Term Frequency–Inverse Document Frequency, is a numerical statistic that describes one way a search engine can judge how relevant a text is to the terms in a search query. TF-IDF is a basic mathematical model; modern search engines layer far more sophisticated techniques, such as neural matching, on top of this kind of word counting.
How does TF-IDF work?
As the name suggests, the statistic combines two parts to produce a relevancy score.
The first part is the “term frequency” score. This part of the calculation assumes that the more often a term is used in a text, the more important that term is in determining what the text is about. If a search engine applied only this logic, then when a user searched for something like “website analytics”, the first result would be the page with the highest frequency of the words “website” and “analytics”.
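The term-frequency idea can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function name and example sentence are invented for the demo, and it uses the common “raw count divided by document length” variant of TF:

```python
def term_frequency(term, text):
    """Fraction of the words in the text that match the term
    (one common way to compute TF)."""
    words = text.lower().split()
    return words.count(term.lower()) / len(words)

# Hypothetical example document: "website" is 2 of 7 words.
doc = "website analytics help you understand website traffic"
print(term_frequency("website", doc))  # 2/7 ≈ 0.286
```

Under this logic alone, a page that repeats “website” many times would rank highly for any query containing that word, which is exactly the weakness the second part of the calculation addresses.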
But the word “website” is common across a huge range of topics on the internet, and the second part of the TF-IDF calculation accounts for this. The “inverse document frequency” reduces the weight of terms that appear in many of the documents in the collection. In our example, the algorithm therefore gives more weight to the term “analytics” when calculating TF-IDF. In general, this part of the calculation favors the more specific terms in the query.
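Putting the two parts together can be sketched as follows. This is a simplified version using the basic log-ratio form of IDF (real implementations usually add smoothing to avoid division by zero); the three-document corpus is invented for illustration:

```python
import math

def tf(term, doc):
    """Term frequency: share of the document's words that match the term."""
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)

def idf(term, corpus):
    """Inverse document frequency: log of (total docs / docs containing the term).
    Assumes the term appears in at least one document."""
    n_containing = sum(1 for doc in corpus if term.lower() in doc.lower().split())
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Hypothetical corpus: "website" appears in all three documents,
# "analytics" in only one.
corpus = [
    "website analytics for your website",
    "build a website with a website builder",
    "host your website on a fast website server",
]

print(idf("website", corpus))                      # log(3/3) = 0.0
print(tf_idf("analytics", corpus[0], corpus))      # positive score
print(tf_idf("website", corpus[0], corpus))        # 0.0
```

Note how “website”, despite appearing twice in the first document, scores zero because it occurs in every document, while the rarer, more specific term “analytics” gets a positive score. That is the behavior described above: the IDF part rewards the terms that actually distinguish one document from another.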
Of course, as stated above, search engines today use far more complex models to determine relevancy. Because modern engines understand semantics, a page can be highly relevant to a search query even if the query term never appears on it. This can happen when a synonym is used, or when the engine detects a cluster of words that commonly co-occur in texts about that term, even though the term itself is absent. This is why simple TF-IDF on its own is no longer enough.