Jaccard similarity in Python.

When we want to measure the similarity between two sets or documents, we generally use cosine similarity or Jaccard similarity. In this blog, we will discuss Jaccard similarity and implement it in Python, first with our own function and then using various libraries.

What is Jaccard Similarity?

Jaccard similarity is also known as the Jaccard index or intersection over union (IoU). It quantifies the similarity between two sets by comparing the size of their intersection to the size of their union.

Jaccard similarity formula

The formula for Jaccard similarity is given by:

J(A,B)=∣A∩B∣ / ∣A∪B∣

Where:

∣A∩B∣ is the number of elements that are present in both set A and set B.
∣A∪B∣ is the number of elements in the union of sets A and B, that is, all elements that are in set A, in set B, or in both.

Interpretation of Jaccard Similarity

When we find Jaccard Similarity between two sets, its value ranges from 0 to 1.

A value of 1 indicates that the sets are identical.
A value of 0 indicates no common elements between the sets.

How to calculate Jaccard similarity

Consider we have two sets:

Set A = {1, 2, 3, 4}
Set B = {3, 4, 5}

The intersection (common elements) A∩B={3,4} has a size of 2, while the union A∪B={1,2,3,4,5} has a size of 5. Thus, the Jaccard similarity is:

J(A,B)=2/5=0.4

A value of 0.4 means that two of the five distinct elements are shared, which indicates a moderate level of similarity between the two sets.

How to calculate Jaccard similarity in Python

In Python, we can calculate Jaccard similarity with our own function built on sets, or with the built-in functions of libraries like Scikit-learn, Scipy, and Textdistance.

Implementing Jaccard similarity in Python using a function

Let’s create a function jaccard_similarity that takes two documents as input, splits each one into a set of lowercase words, and computes the Jaccard similarity of those sets.

def jaccard_similarity(doc1, doc2):
    # Convert documents to sets of words
    words_doc1 = set(doc1.lower().split())
    words_doc2 = set(doc2.lower().split())
    
    # Calculate intersection and union
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    
    # Calculate Jaccard similarity
    return len(intersection) / len(union)

Now we can test our function with some sample documents.

doc1 = "Data is the new oil of the digital economy"
doc2 = "Data is a new oil"

similarity = jaccard_similarity(doc1, doc2)
print(f"Jaccard Similarity: {similarity:.4f}")

Output

Jaccard Similarity: 0.4444

We can also take a more compact approach when the inputs are already sets: apply len() to the result of the intersection (&) and union (|) operators.

def jaccard_similarity(set1, set2):
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union

set1 = {'apple', 'banana', 'cherry'}
set2 = {'banana', 'cherry', 'date'}

print(jaccard_similarity(set1, set2))

Output

0.5
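
One edge case is worth handling: if both sets are empty, the union is empty too and the division raises a ZeroDivisionError. Below is a minimal sketch of a guarded version. The name jaccard_similarity_safe and the empty_value parameter are our own illustrative choices, and returning 1.0 for two empty sets is only one possible convention.

```python
def jaccard_similarity_safe(set1, set2, empty_value=1.0):
    """Jaccard similarity that does not fail when both sets are empty."""
    union_size = len(set1 | set2)
    if union_size == 0:
        # Both sets are empty, so 0/0 is undefined; return the chosen convention
        return empty_value
    return len(set1 & set2) / union_size


print(jaccard_similarity_safe({'apple', 'banana', 'cherry'}, {'banana', 'cherry', 'date'}))
print(jaccard_similarity_safe(set(), set()))
```

Some applications prefer to treat two empty sets as having no similarity; in that case we can pass empty_value=0.0.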

Implementing Jaccard similarity in Python using textdistance library

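Textdistance is a Python library for comparing strings, and it includes Jaccard similarity. Install it with pip:

pip install textdistance

When we pass two strings, textdistance compares them character by character (counting repeated characters) rather than word by word, so the result differs from a word-based Jaccard similarity.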

import textdistance

text1 = "I love machine learning"
text2 = "I love deep learning"

# Strings are compared character by character, not word by word
jaccard_sim = textdistance.jaccard(text1, text2)
print(jaccard_sim)

Output

0.6538461538461539
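
To see where this number comes from, here is a small sketch that uses only the standard library. It assumes that the result above is a Jaccard similarity over bags (multisets) of characters, where repeated characters and spaces count. The function name char_bag_jaccard is our own, not part of textdistance.

```python
from collections import Counter


def char_bag_jaccard(text1, text2):
    """Jaccard similarity over character multisets (repeated characters count)."""
    counts1 = Counter(text1)
    counts2 = Counter(text2)
    # Counter & Counter keeps the minimum count per character (multiset intersection)
    intersection = sum((counts1 & counts2).values())
    # Counter | Counter keeps the maximum count per character (multiset union)
    union = sum((counts1 | counts2).values())
    return intersection / union


# The two sentences share 17 characters out of a multiset union of 26
print(char_bag_jaccard("I love machine learning", "I love deep learning"))
```

The ratio 17/26 is about 0.6538, which agrees with the output above. A word-level comparison of the same two sentences gives a different value, because three of the five distinct words are shared.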

Implementing Jaccard similarity in Python using Scikit-learn library

Scikit-learn also provides a function for computing the Jaccard similarity in its metrics module.

In machine learning, the Jaccard score measures the overlap between the predicted classes and the actual classes, for example in binary classification.

from sklearn.metrics import jaccard_score

# Example binary arrays
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]

score = jaccard_score(y_true, y_pred)
print(score)

Output

0.6666666666666666
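
For binary labels, this score can be understood as TP / (TP + FP + FN), where TP, FP, and FN are the true positives, false positives, and false negatives for the positive class. The sketch below recomputes the value by hand without Scikit-learn. The function name binary_jaccard_score is hypothetical, and returning 0.0 when there are no positives at all is our own choice.

```python
def binary_jaccard_score(y_true, y_pred, positive=1):
    """Jaccard score for the positive class: TP / (TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = tp + fp + fn
    if denominator == 0:
        # No positive labels in either list; 0.0 is a convention chosen here
        return 0.0
    return tp / denominator


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]

# TP = 2, FP = 0, FN = 1, so the score is 2 / 3
print(binary_jaccard_score(y_true, y_pred))
```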

Implementing Jaccard similarity in Python using Scipy library

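Scipy provides a jaccard function in its scipy.spatial.distance module. It returns the Jaccard distance (dissimilarity) between two boolean vectors, so we subtract the result from 1 to get the Jaccard similarity.

from scipy.spatial.distance import jaccard

set1 = [1, 0, 1, 1, 0]
set2 = [1, 1, 0, 1, 0]

# jaccard() returns the Jaccard distance, so similarity = 1 - distance
similarity = 1 - jaccard(set1, set2)
print(similarity)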

Output

0.5
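
Note that the Jaccard function in scipy.spatial.distance is defined as a dissimilarity, and in the example above the distance and the similarity both happen to equal 0.5. The following sketch recomputes both values by hand on a pair of vectors where they differ. The function name jaccard_distance and the handling of two all-zero vectors are assumptions made for illustration.

```python
def jaccard_distance(u, v):
    """Jaccard distance between two equal-length 0/1 vectors."""
    # Keep only positions where at least one of the vectors is non-zero
    active = [(a, b) for a, b in zip(u, v) if a or b]
    if not active:
        # Two all-zero vectors; treating them as identical is a convention chosen here
        return 0.0
    mismatches = sum(1 for a, b in active if bool(a) != bool(b))
    return mismatches / len(active)


u = [1, 1, 1, 0]
v = [1, 0, 0, 0]

distance = jaccard_distance(u, v)   # 2 mismatches over 3 active positions
similarity = 1 - distance           # 1 shared position over 3 active positions
print(distance, similarity)
```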

Handling Larger Datasets

When we have a large dataset or many documents, calculating the Jaccard similarity for every pair is computationally expensive. To optimize it we can use:

MinHash: This technique gives a fast approximation of Jaccard similarity by reducing dimensionality while preserving similarities.

Locality Sensitive Hashing (LSH): This method hashes input items so that similar items map to the same “buckets” with high probability. This can significantly speed up the search for similar items.
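
The MinHash idea can be sketched with the standard library alone. Each set is reduced to a short signature that holds the minimum hash value under several seeded hash functions, and the fraction of positions where two signatures agree approximates the Jaccard similarity. Everything here is an illustrative assumption rather than a production implementation: the helper names, the use of hashlib.blake2b as the hash family, and the choice of 256 hash functions. The sets must be non-empty.

```python
import hashlib


def seeded_hash(item, seed):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """Signature = minimum hash value of the set under each seeded hash function."""
    return [min(seeded_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


set_a = set(range(0, 100))
set_b = set(range(50, 150))

exact = len(set_a & set_b) / len(set_a | set_b)  # 50 shared out of 150 distinct
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

The estimate gets closer to the exact value as num_hashes grows, at the cost of longer signatures. LSH can then group signatures into buckets so that only likely matches are compared.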

FAQ on Jaccard similarity

Q1. Jaccard similarity vs cosine similarity

In Jaccard similarity, we measure set similarity by comparing the intersection and the union of the sets.

Jaccard Similarity=∣A∩B∣ / ∣A∪B∣

where A and B are two sets. It ranges from 0 to 1.

In cosine similarity, we measure the cosine of the angle between two vectors in a multi-dimensional space. Its formula is

Cosine Similarity=A⋅B / (∥A∥ * ∥B∥)

where A and B are vectors. It ranges from -1 to 1: 1 represents identical directions, 0 means the vectors are orthogonal (no similarity), and -1 represents opposite directions.
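
To make the difference concrete, the sketch below applies both measures to the same pair of binary vectors. The helper names cosine_similarity and jaccard_binary are our own, and both assume non-zero vectors of equal length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_binary(a, b):
    """Jaccard similarity of two 0/1 vectors, read as set membership flags."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either


vec_a = [1, 0, 1, 1, 0]
vec_b = [1, 1, 0, 1, 0]

print(cosine_similarity(vec_a, vec_b))  # about 0.667: dot product 2, both norms sqrt(3)
print(jaccard_binary(vec_a, vec_b))     # 0.5: 2 shared positions out of 4 active ones
```

The two measures give different values for the same input because cosine similarity normalizes by the vector lengths, while Jaccard similarity normalizes by the size of the union.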

Q2. Jaccard distance vs Jaccard similarity

Jaccard similarity measures how similar two sets are, whereas Jaccard distance measures how different they are. Jaccard distance is the complement of Jaccard similarity:

Jaccard Distance=1−Jaccard Similarity

Jaccard Distance ranges from 0 (identical sets) to 1 (completely disjoint sets).

Q3. Jaccard similarity vs Levenshtein distance

Jaccard similarity compares two sets based on the ratio of their intersection to their union, whereas Levenshtein distance, also known as edit distance, measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
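
The contrast is easiest to see in code. Below is a minimal sketch of the standard dynamic-programming approach to Levenshtein distance, next to a Jaccard similarity over the sets of characters in each string. The function names are our own, and char_set_jaccard assumes at least one non-empty string.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def char_set_jaccard(s, t):
    """Jaccard similarity over the sets of distinct characters in two strings."""
    chars_s, chars_t = set(s), set(t)
    return len(chars_s & chars_t) / len(chars_s | chars_t)


print(levenshtein("kitten", "sitting"))       # 3 edits
print(char_set_jaccard("kitten", "sitting"))  # 3 shared characters out of 7 distinct

# Order matters for edit distance but not for Jaccard similarity
print(levenshtein("abc", "cba"))       # 2
print(char_set_jaccard("abc", "cba"))  # 1.0
```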

Applications of Jaccard Similarity

Jaccard similarity can be used:

To measure text similarity between documents based on their content.
In recommendation systems, to show similar content based on what a user has viewed earlier.
To cluster similar documents together.
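
As a small illustration of the first two uses, the sketch below ranks a few candidate documents by their word-level Jaccard similarity to a query document. The sample documents and the helper names word_jaccard and rank_by_similarity are made up for this example.

```python
def word_jaccard(doc1, doc2):
    """Jaccard similarity over the sets of lowercase words in two documents."""
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 0.0


def rank_by_similarity(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    scored = [(word_jaccard(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


query = "Data is the new oil of the digital economy"
documents = [
    "Deep learning needs a lot of data",
    "How to detect a QR code in an image",
    "Data is a new oil",
]

ranked = rank_by_similarity(query, documents)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

A recommendation system could apply the same ranking to sets of item IDs instead of words.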

Conclusion

In this blog, we learned about Jaccard similarity and how to implement it in Python. We can use plain Python sets or libraries that have built-in functions for calculating Jaccard similarity.

In simple terms, Jaccard similarity quantifies how similar two sets are based on their intersection and union, and it ranges from 0 (no similarity) to 1 (identical sets). To speed up Jaccard similarity on larger datasets, we can use techniques like MinHash or LSH.

You can also read our other articles on why is the central limit theorem important and how to detect QR code in an image.
