Jonathan Law

Document Class Comparison with TF-IDF & Python

There are many different techniques within the world of natural language processing, ranging from the very simple to the very complex. In this tutorial, we're going to be looking at one of the simpler techniques. Although it is simple, it is powerful. Using a concept called TF-IDF and a bit of linear algebra, we'll be able to compare any two documents or any two classes of documents for similarity. Let's get started!

TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. The term frequency is the number of times a word shows up in a particular document, divided by the total number of words in the document.

The inverse document frequency, on the other hand, measures how rare a term is across your corpus: the total number of documents divided by the number of documents that contain the term, run through a natural logarithm. It is calculated using the following equation:

IDF(t) = ln(Total # of documents / # of documents containing the term t)

A term has a high IDF score if it does not appear in very many documents. As you'll see soon, this helps us to filter out common words like "the", "a", or "is".

To get the total TF-IDF score, just multiply the two numbers together.
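
To make the arithmetic concrete, here's a minimal sketch that computes TF-IDF by hand (the toy corpus below is made up purely for illustration):

import math

# A toy corpus of three tiny "documents"
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the weasel slept in the burrow",
]

def tf(term, document):
    # Term frequency: occurrences of the term / total words in the document
    words = document.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: ln(total docs / docs containing the term)
    containing = sum(1 for doc in corpus if term in doc.split())
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

print(tf_idf("the", corpus[0], corpus))  # 0.0 -- "the" appears in every document
print(tf_idf("cat", corpus[0], corpus))  # ~0.18 -- "cat" is distinctive to the first document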

Okay, but what does it mean?

The TF-IDF score is a representation of how important a particular word is to a particular document, compared to how important it is generally. If you took a book about biology, the word mitochondria would probably show up fairly often. However, if you looked at 100 other books, mitochondria probably wouldn't show up in many of them. This tells us that mitochondria is probably an important word in the biology book compared to the other books in the set. The TF-IDF score of mitochondria, therefore, would probably be fairly high for the biology book.
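
To put rough numbers on it (these figures are invented for illustration): if mitochondria makes up 1 in every 200 words of the biology book, its TF is 0.005, and if it appears in only 2 of the 101 books, its IDF is ln(101 / 2) ≈ 3.92, for a TF-IDF score of about 0.02. A word like "the", which appears in all 101 books, has an IDF of ln(101 / 101) = 0, so its TF-IDF score is 0 no matter how often it occurs.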

TF-IDF Uses

Although the concept of TF-IDF is very simple and quite easy to calculate, it can be extremely useful. If you wanted to categorize documents or add tags to them according to their topic, you could use TF-IDF to find words that are significant within the text and offer them as candidates. You can also use TF-IDF (along with other, more complicated algorithms) to do text summarization, picking out the most important sentences so an article can be shortened while keeping its information essentially intact.
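
As a quick illustration of the tagging idea, here's a hedged sketch (the two documents are invented) that pulls each document's highest-scoring words out of a fitted TfidfVectorizer:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The mitochondria is the powerhouse of the cell",
    "Stocks fell as the market reacted to interest rates",
]

# Dropping English stop words keeps common words from dominating short documents
tfidf = TfidfVectorizer(stop_words='english')
scores = tfidf.fit_transform(docs).A           # dense (n_docs, n_terms) matrix
vocab = tfidf.get_feature_names_out()          # column index -> word

for doc_scores in scores:
    top = np.argsort(doc_scores)[::-1][:3]     # indices of the 3 highest scores
    print([vocab[i] for i in top])             # candidate tags for the document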

The use we'll be examining is document comparison using TF-IDF vectorization. This process essentially transforms a given document into a vector containing the TF-IDF score of every word in the vocabulary (all of the unique words within the corpus, or collection of documents). Using these vectors, we can create a correlation matrix to see how similar any two documents are.

To demonstrate, I'll be using the synopses from two movies: Inception and Shutter Island, which I find to be similar in the ways they deal with reality and the complexities of the human mind.

inception = ("A young man, exhausted and delirious, washes up on a beach, "
             "looking up momentarily to see two young children...")

shutter_island = ("U.S. Marshals Teddy Daniels (Leonardo DiCaprio) and "
                  "Chuck Aule (Mark Ruffalo) are on a ferryboat...")

Once we have them loaded in, our first step is to turn them into TF-IDF vectors, using a simple tool from scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
vecs = tfidf.fit_transform([inception, shutter_island])  # sparse matrix with one row per document
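
If you want to peek at what fit_transform produced, the vectorizer exposes its learned vocabulary (the exact numbers depend on the synopses, so treat the comments as placeholders):

print(vecs.shape)                         # (2, number of unique words across both synopses)
print(tfidf.get_feature_names_out()[:5])  # the first few vocabulary words, alphabetically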

Once we have the vectors (which are stored as a sparse matrix), we can multiply the matrix by its own transpose (using .T) and convert the result into a dense array (using .A) to make a correlation matrix between the two documents.

corr_matrix = (vecs * vecs.T).A  # 2x2 matrix of pairwise document similarities
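
Why does this work? TfidfVectorizer L2-normalizes each row by default, so the dot product of any two rows is their cosine similarity. As a cross-check, the same matrix can be produced with scikit-learn's built-in helper:

from sklearn.metrics.pairwise import cosine_similarity

corr_matrix = cosine_similarity(vecs)  # identical to (vecs * vecs.T).A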

The result

We can then select the value we want using the index [0,1], giving the two documents a similarity score of about 79%. Not bad!

Class Comparison

If we wanted to compare two classes of documents, say, the similarity between Christopher Nolan's movies and Guillermo del Toro's movies, we could compare each of Nolan's movies to each of del Toro's movies individually and take the average. But, using the same tool we used above, we can do it in one fell swoop.

document_dict = {
    'Python': [
        "We're looking for a Python developer with strong programming skills",
        "The applicant should have experience with Python, SQL, Numpy, Pandas, and other data cleaning methods.",
        "We're looking for a rockstar developer who's an expert in python, machine learning, data science, and data cleaning."],
    'Business Analyst': [
        'Candidate must have experience with SQL, PowerBI, and Excel, as well as a strong understanding of finance',
        "Requirements: Data Visualization, SQL, Excel, Bloomberg Terminal familiarity, any scripting language (Python, R)"]}

It'd be a little time-consuming to gather the entire works of del Toro and Nolan, so instead we'll use some made-up job descriptions for a Python Developer and a Business Analyst. If we feed these into our algorithm above:

document_list = document_dict['Python'] + document_dict['Business Analyst']
vecs = tfidf.fit_transform(document_list)
matrix = (vecs * vecs.T).A  # 5x5 similarity matrix

We get a 5×5 matrix of pairwise similarities. The first three rows and columns are the Python jobs, and the last two are the Business Analyst jobs, so the cross-class similarities sit in the two off-diagonal blocks: the bottom-left 2×3 block and its mirror image, the top-right 3×2 block.
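
Since a raw array is hard to eyeball, one way to see the block structure is to label the rows and columns (a quick sketch using pandas; the labels are just for display):

import pandas as pd

labels = ['Python'] * 3 + ['Business Analyst'] * 2
print(pd.DataFrame(matrix, index=labels, columns=labels).round(2))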

These two blocks contain the same numbers (one is the transpose of the other) and represent the similarity of each Python job to each Business Analyst job. All we have to do is slice out one of them (using the number of documents in the first category, Python) and take the mean:

dim = len(document_dict['Python'])
similarity = matrix[dim:, :dim].mean()  # bottom-left block: Business Analyst rows vs. Python columns
# returns 0.10057696805298395

So in our estimation, these jobs are only about 10% similar.
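
The same recipe generalizes to any two classes of documents. Here's a hedged helper that wraps it up (the function name and structure are mine, not from the original post):

from sklearn.feature_extraction.text import TfidfVectorizer

def class_similarity(class_a, class_b):
    # Mean pairwise TF-IDF cosine similarity between two lists of documents
    tfidf = TfidfVectorizer()
    vecs = tfidf.fit_transform(class_a + class_b)  # one shared vocabulary
    matrix = (vecs * vecs.T).A                     # pairwise cosine similarities
    dim = len(class_a)
    return matrix[dim:, :dim].mean()               # cross-class block only

print(class_similarity(document_dict['Python'],
                       document_dict['Business Analyst']))  # ~0.10, matching the result above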

Conclusion

TF-IDF is a simple but very powerful tool in the realm of natural language processing, and its uses are nearly endless. Even outside of language, the concept of term frequency and inverse "document" frequency can be extremely useful.

Big thanks to tfidf.com for their clear and concise explanation of TF-IDF; I highly recommend visiting for a more in-depth explanation.

Full Code:

from sklearn.feature_extraction.text import TfidfVectorizer

""" Single Document Comparison """

# I copy-pasted these from IMDb into files
with open('inception_synopsis.txt', 'r') as f:
    inception = f.read()

with open('shutterisland_synopsis.txt', 'r') as f:
    shutter_island = f.read()

tfidf = TfidfVectorizer()
vecs = tfidf.fit_transform([inception, shutter_island])
corr_matrix = (vecs * vecs.T).A
similarity = corr_matrix[0,1]

""" Document Class Comparison """
document_dict = {
    'Python': [
        "We're looking for a Python developer with strong programming skills",
        "The applicant should have experience with Python, SQL, Numpy, Pandas, and other data cleaning methods.",
        "We're looking for a rockstar developer who's an expert in python, machine learning, data science, and data cleaning."],
    'Business Analyst': [
        'Candidate must have experience with SQL, PowerBI, and Excel, as well as a strong understanding of finance',
        "Requirements: Data Visualization, SQL, Excel, Bloomberg Terminal familiarity, any scripting language (Python, R)"]}

document_list = document_dict['Python'] + document_dict['Business Analyst']
vecs = tfidf.fit_transform(document_list)
matrix = (vecs * vecs.T).A

dim = len(document_dict['Python'])
similarity = matrix[dim:, :dim].mean()
# returns 0.10057696805298395