When we want to measure the similarity between two sets, we generally use Jaccard similarity or cosine similarity. In this blog, we will discuss Jaccard similarity and implement it using various libraries in Python.
What is Jaccard Similarity?
Jaccard similarity is also known as the Jaccard index or intersection over union (IoU). It quantifies the similarity between two sets by comparing the size of their intersection to the size of their union.
Jaccard similarity formula
The formula for Jaccard similarity is given by:
J(A,B)=∣A∩B∣ / ∣A∪B∣
Where:
- ∣A∩B∣ is the number of elements that are present in both set A and set B.
- ∣A∪B∣ is the number of elements in the union of sets A and B, that is, all elements that are in set A, in set B, or in both.
Interpretation of Jaccard Similarity
When we find Jaccard Similarity between two sets, its value ranges from 0 to 1.
- A value of 1 indicates that the sets are identical.
- A value of 0 indicates no common elements between the sets.
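The two boundary cases can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: `jaccard_or_default` and its `empty_value` parameter are hypothetical names introduced here. They are needed because the formula gives 0/0 when both sets are empty, so the result in that case has to be chosen by convention (1.0 is assumed below).

```python
def jaccard_or_default(a, b, empty_value=1.0):
    """Jaccard similarity of two collections, treated as sets.

    `empty_value` is an assumption of this sketch: the formula is 0/0
    when both sets are empty, so that result is set by convention.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return empty_value
    return len(a & b) / len(union)


print(jaccard_or_default({1, 2, 3}, {1, 2, 3}))  # identical sets
print(jaccard_or_default({1, 2}, {3, 4}))        # no common elements
print(jaccard_or_default(set(), set()))          # both empty: convention applies
```

Identical sets give 1.0 and disjoint sets give 0.0, matching the interpretation above.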
How to calculate Jaccard similarity
Consider two sets:
Set A = {1, 2, 3, 4}
Set B = {3, 4, 5}
The intersection (common elements) A∩B = {3, 4} has a size of 2, while the union A∪B = {1, 2, 3, 4, 5} has a size of 5. Thus, the Jaccard similarity is:
J(A,B) = 2/5 = 0.4
A value of 0.4 indicates a moderate level of similarity between the two sets.
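The same arithmetic can be reproduced step by step with Python's built-in set operators. The sketch below uses `fractions.Fraction` only to show the exact ratio before converting it to a decimal; the variable names are illustrative.

```python
from fractions import Fraction

A = {1, 2, 3, 4}
B = {3, 4, 5}

intersection = A & B  # elements present in both sets
union = A | B         # elements present in either set

similarity = Fraction(len(intersection), len(union))

print(sorted(intersection))  # [3, 4]
print(sorted(union))         # [1, 2, 3, 4, 5]
print(similarity)            # 2/5
print(float(similarity))     # 0.4
```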
How to calculate Jaccard similarity in Python
In Python, we can calculate Jaccard similarity with a hand-written function based on the built-in set type, or with ready-made functions from libraries such as Scikit-learn, SciPy, and Textdistance.
Implementing Jaccard similarity in Python using a function
Let’s create a function jaccard_similarity that takes two documents as input and computes their Jaccard similarity.
def jaccard_similarity(doc1, doc2):
    # Convert documents to sets of lowercase words
    words_doc1 = set(doc1.lower().split())
    words_doc2 = set(doc2.lower().split())

    # Calculate intersection and union
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    # Calculate Jaccard similarity
    return len(intersection) / len(union)
Now we can test our function with some sample documents.
doc1 = "Data is the new oil of the digital economy"
doc2 = "Data is a new oil"

similarity = jaccard_similarity(doc1, doc2)
print(f"Jaccard Similarity: {similarity:.4f}")
Output
Jaccard Similarity: 0.4444
We can also take a more compact approach that works directly on sets, combining len() with the intersection (&) and union (|) operators.
def jaccard_similarity(set1, set2):
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union

set1 = {'apple', 'banana', 'cherry'}
set2 = {'banana', 'cherry', 'date'}
print(jaccard_similarity(set1, set2))
Output
0.5
Implementing Jaccard similarity in Python using textdistance library
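Textdistance is a Python library for comparing sequences such as strings. It includes an implementation of Jaccard similarity.
Install textdistance using pip:
pip install textdistance
Note that when it is given two strings, textdistance.jaccard compares them character by character by default, not word by word.
import textdistance

text1 = "I love machine learning"
text2 = "I love deep learning"

# Character-level comparison of the two strings
jaccard_sim = textdistance.jaccard(text1, text2)
print(jaccard_sim)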
Output
0.6538461538461539
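This value is not what a word-level comparison of the two sentences would give. The sketch below, which uses only the standard library, assumes that the default comparison treats each string as a multiset (bag) of characters; that assumption is consistent with the output above. `multiset_jaccard` is a hypothetical helper written for this illustration and is not part of textdistance.

```python
from collections import Counter

text1 = "I love machine learning"
text2 = "I love deep learning"


def multiset_jaccard(seq1, seq2):
    """Jaccard similarity on multisets, so repeated items are counted."""
    c1, c2 = Counter(seq1), Counter(seq2)
    intersection = sum((c1 & c2).values())  # minimum count of each item
    union = sum((c1 | c2).values())         # maximum count of each item
    return intersection / union


# Characters as items: 17 shared out of 26 in the multiset union
char_level = multiset_jaccard(text1, text2)

# Words as items: 3 shared words out of 5 distinct words
word_level = multiset_jaccard(text1.split(), text2.split())

print(char_level)
print(word_level)
```

The character-level result is 17/26, the same value as the library output, while the word-level result is 3/5. Which of the two is appropriate depends on whether the items being compared are characters or words.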
Implementing Jaccard similarity in Python using Scikit-learn library
Scikit-learn provides the jaccard_score function in its metrics module. It compares two label arrays, so each set is represented here as a binary array in which 1 marks membership.
from sklearn.metrics import jaccard_score

# Example binary arrays
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]

score = jaccard_score(y_true, y_pred)
print(score)
Output
0.6666666666666666
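Because jaccard_score works on binary arrays rather than Python sets, sets of items have to be encoded first. The following is one possible sketch of that encoding, written in plain Python so it runs without Scikit-learn; the `vocabulary`, `vec1`, and `vec2` names are illustrative.

```python
set1 = {"apple", "banana", "cherry"}
set2 = {"banana", "cherry", "date"}

# A fixed, sorted vocabulary so both vectors use the same positions
vocabulary = sorted(set1 | set2)

# 1 marks that the item at this position belongs to the set
vec1 = [1 if item in set1 else 0 for item in vocabulary]
vec2 = [1 if item in set2 else 0 for item in vocabulary]

# Jaccard from the indicator vectors: positions where both are 1,
# divided by positions where at least one is 1
both = sum(1 for x, y in zip(vec1, vec2) if x and y)
either = sum(1 for x, y in zip(vec1, vec2) if x or y)

print(vocabulary)     # ['apple', 'banana', 'cherry', 'date']
print(vec1, vec2)     # [1, 1, 1, 0] [0, 1, 1, 1]
print(both / either)  # 0.5
```

Passing `vec1` and `vec2` to `jaccard_score` should give the same value, since both computations count shared positive positions over positions that are positive in at least one array.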
Implementing Jaccard similarity in Python using Scipy library
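SciPy provides the jaccard function in its scipy.spatial.distance module. Note that it returns the Jaccard distance (dissimilarity) between two boolean arrays, so we subtract the result from 1 to get the Jaccard similarity.
from scipy.spatial.distance import jaccard

set1 = [1, 0, 1, 1, 0]
set2 = [1, 1, 0, 1, 0]

# jaccard() returns a distance, so convert it to a similarity
similarity = 1 - jaccard(set1, set2)
print(similarity)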
Output
0.5
Handling Larger Datasets
Calculating Jaccard similarity for every pair of documents in a large dataset is computationally expensive, because the number of pairs grows quadratically with the number of documents. Two techniques are commonly used to reduce this cost:
- MinHash: This technique gives a fast approximation of Jaccard similarity by replacing each set with a short, fixed-length signature. It reduces dimensionality while preserving similarity.
- Locality Sensitive Hashing (LSH): This method hashes input items so that similar items map to the same “buckets” with high probability. This can significantly speed up the search for similar items.
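To make the two ideas concrete, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not a production implementation: the seeded SHA-1 hashing, the 256-value signature, the 64 bands, and all function names are choices made for this example.

```python
import hashlib


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of `item` for a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=256):
    """Keep the minimum hash value per seed; `items` must be non-empty."""
    return [min(seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)


def lsh_band_keys(signature, bands=64):
    """Split a signature into bands; each band acts as one bucket key."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows]))
            for b in range(bands)}


set_a = {f"item{i}" for i in range(0, 100)}
set_b = {f"item{i}" for i in range(50, 150)}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)

exact = len(set_a & set_b) / len(set_a | set_b)
approx = estimate_jaccard(sig_a, sig_b)
print(f"exact={exact:.3f} approx={approx:.3f}")

# Two sets become a candidate pair if they share at least one bucket key
print(bool(lsh_band_keys(sig_a) & lsh_band_keys(sig_b)))
```

The estimate gets closer to the exact value as `num_hashes` grows. With LSH, only sets that share a bucket key need to be compared directly, which avoids scoring every pair.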
FAQ on Jaccard similarity
Q1. Jaccard similarity vs cosine similarity
Jaccard similarity measures how similar two sets are by comparing the size of their intersection to the size of their union:
Jaccard Similarity = ∣A∩B∣ / ∣A∪B∣
where A and B are two sets. It ranges from 0 to 1.
Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. Its formula is
Cosine Similarity = (A⋅B) / (∥A∥ ∥B∥)
where A and B are vectors. It ranges from -1 to 1: 1 means the vectors point in the same direction, 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions.
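The two formulas can be compared side by side on the same binary vectors. This is a small sketch in plain Python; `jaccard_binary` and `cosine` are hypothetical helper names, and both assume non-zero input vectors.

```python
import math


def jaccard_binary(u, v):
    """Jaccard similarity of two binary vectors of equal length."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 0, 1]

print(jaccard_binary(u, v))     # 2 shared positions out of 3
print(cosine(u, v))             # 2 / (sqrt(3) * sqrt(2))
print(cosine([1, 0], [-1, 0]))  # opposite directions
```

On the same input the two measures give different values, and only cosine similarity can be negative, which happens when the vectors point in opposite directions.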
Q2. Jaccard distance vs Jaccard similarity
Jaccard similarity measures how similar two sets are, whereas Jaccard distance measures how different they are. Jaccard distance is the complement of Jaccard similarity:
Jaccard Distance=1−Jaccard Similarity
Jaccard Distance ranges from 0 (identical sets) to 1 (completely disjoint sets).
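As a quick check of this relationship, the sketch below computes the distance for the sets used in the worked example earlier in this blog. `jaccard_distance` is a hypothetical helper and assumes that at least one of the sets is non-empty.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity; assumes a non-empty union."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)


print(jaccard_distance({1, 2, 3, 4}, {3, 4, 5}))  # 1 - 2/5
print(jaccard_distance({1, 2}, {1, 2}))           # identical sets
print(jaccard_distance({1, 2}, {3, 4}))           # disjoint sets
```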
Q3. Jaccard similarity vs Levenshtein distance
Jaccard similarity compares two sets based on the ratio of their intersection to their union, whereas Levenshtein distance, also known as edit distance, measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
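The difference shows up clearly on anagrams, where the character sets are identical but the order differs. The sketch below uses a standard dynamic-programming formulation of edit distance; `levenshtein` and `char_jaccard` are illustrative helper names.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def char_jaccard(s, t):
    """Jaccard similarity of the sets of characters in two non-empty strings."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b)


print(levenshtein("kitten", "sitting"))  # 3
print(char_jaccard("listen", "silent"))  # 1.0
print(levenshtein("listen", "silent"))   # 4
```

For "listen" and "silent" the Jaccard similarity of the character sets is 1.0 because order is ignored, while the edit distance is 4 because order matters.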
Applications of Jaccard Similarity
Jaccard similarity can be used for:
- Measuring text similarity between documents based on their content.
- Powering recommendation systems that suggest similar content based on what a user has viewed earlier.
- Clustering similar documents together.
Conclusion
In this blog, we learned about Jaccard similarity and how to implement it in Python. We can use the built-in set type or libraries that provide functions for calculating Jaccard similarity.
In simple terms, Jaccard similarity quantifies how similar two sets are based on their intersection and union, and it ranges from 0 (no similarity) to 1 (identical sets). To scale Jaccard similarity to larger datasets, we can use techniques like MinHash or LSH.