QQ Plots in Python

Hi people! Before we jump into implementing QQ plots in Python, let’s quickly understand what a QQ plot is and what assumptions come with it.

What is a QQ plot?

A QQ (quantile-quantile) plot is a graphical method to compare two probability distributions by plotting their quantiles against each other.

How does the QQ plot work?

A QQ plot pairs up the quantiles of two distributions and plots one set against the other. If the points lie close to a straight line, the distributions are similar. If the points deviate from the straight line, the distributions differ.
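
The pairing described above can be sketched by hand with NumPy and Scipy. This is a minimal illustration, not how any particular library implements it: the plotting positions (i - 0.5) / n are one common convention and are an assumption here, and the seed and sample size are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
sample = rng.normal(loc=0, scale=1, size=500)

# Sample quantiles: the sorted observations.
sample_quantiles = np.sort(sample)

# Plotting positions (i - 0.5) / n: one common convention (an assumption here;
# libraries may use slightly different positions).
n = sample_quantiles.size
probs = (np.arange(1, n + 1) - 0.5) / n

# Theoretical quantiles: the normal inverse CDF evaluated at those positions.
theoretical_quantiles = stats.norm.ppf(probs)

# If the two distributions match, the pairs fall close to a straight line,
# so their correlation is close to 1.
r = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(f"Correlation between paired quantiles: {r:.4f}")
```

Plotting theoretical_quantiles on the x-axis against sample_quantiles on the y-axis gives the QQ plot itself.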

Why do we require a QQ plot?

A QQ plot is most often used to check whether data follows a normal distribution, which is an assumption behind many statistical tests.

QQ Plot Assumptions

When interpreting a QQ plot, there are a few key assumptions to keep in mind:

  1. Theoretical Distribution: The QQ plot compares the data to a theoretical distribution such as normal, uniform, etc. For example, if we compare data to a normal distribution, the QQ plot will show how the data’s quantiles match the expected quantiles of a normal distribution.
  2. Sample Size: A large sample gives a more reliable QQ plot than a small one. With small datasets, random variation alone can produce patterns that are easy to misinterpret.
  3. Shape of distribution: A QQ plot primarily checks the shape of the distribution (normality, skewness, kurtosis, etc.) rather than specific parameters such as mean and variance. In simple terms, the plot is not a formal hypothesis test; it only suggests how plausible normality (or another distribution) is.
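
The third point can be checked numerically. The sketch below assumes Scipy's stats.probplot called without the plot argument, in which case it returns the quantile pairs and a least-squares fit (slope, intercept, r). The shift of 5 and scale of 3 are arbitrary values picked for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0, scale=1, size=1000)
shifted = 5 + 3 * data  # same shape, different mean and standard deviation

# Without plot=..., probplot returns the quantile pairs and a least-squares fit.
(_, _), (slope_a, intercept_a, r_a) = stats.probplot(data, dist="norm")
(_, _), (slope_b, intercept_b, r_b) = stats.probplot(shifted, dist="norm")

print(f"original: slope={slope_a:.3f}, intercept={intercept_a:.3f}, r={r_a:.4f}")
print(f"shifted:  slope={slope_b:.3f}, intercept={intercept_b:.3f}, r={r_b:.4f}")
```

Changing the mean and standard deviation only moves and tilts the fitted line. The correlation r, which measures how straight the points are, stays the same.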

Creating a QQ Plot in Python

Python has multiple libraries, such as Statsmodels and Scipy, that help us create a QQ plot. Let us explore an example for each library.

Creating a QQ plot in Python Using Scipy

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate random data
data = np.random.normal(loc=0, scale=1, size=1000)

# Create QQ plot
stats.probplot(data, dist="norm", plot=plt)
plt.title('QQ Plot with Scipy')
plt.show()
Output
QQ plot using scipy

The stats.probplot function from Scipy is used to generate a QQ plot. The dist="norm" parameter specifies that we want to compare the data to a normal distribution. The plot is then displayed using the matplotlib library.

Creating a QQ plot in Python Using Statsmodels

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate random data
data = np.random.normal(loc=0, scale=1, size=1000)

# Create QQ plot
sm.qqplot(data, line ='45')
plt.title('QQ Plot with Statsmodels')
plt.show()
Output
QQ plot using statsmodels

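Statsmodels provides the qqplot function to generate a QQ plot. The line='45' parameter adds a 45-degree reference line, which helps us visually assess how well the data fits the normal distribution. This line is a fair reference here because the data was drawn from a standard normal distribution; data with a different mean or standard deviation would not lie on it even if it were normal.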

Seaborn has no QQ plot function of its own, but we can use its themes to make the plot drawn by Scipy more aesthetically pleasing. Let me show you an example.
import seaborn as sns
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate random data
data = np.random.normal(loc=0, scale=1, size=1000)

# Apply the seaborn theme first so it affects the axes created below
sns.set(style="darkgrid")

# Create QQ plot
stats.probplot(data, dist="norm", plot=plt)
plt.title('QQ Plot with Seaborn')
plt.show()
Output
QQ plot using seaborn

QQ Plot Between Two Variables

QQ plots are generally used to compare a dataset to a theoretical distribution, but they can also be used to compare two datasets. In this case, the QQ plot displays how the quantiles of the two datasets align with each other.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate random data for two variables
x = np.random.normal(0, 1, 100)  # Variable 1
y = np.random.normal(0, 1.5, 100)  # Variable 2

# Create a Q-Q plot comparing x and y
# Passing ax avoids the extra empty figure that plt.figure() would leave behind
fig, ax = plt.subplots(figsize=(8, 8))
sm.qqplot_2samples(x, y, xlabel="Quantiles of X", ylabel="Quantiles of Y", line='45', ax=ax)
plt.title("Q-Q Plot Comparing Two Variables")
plt.grid()
plt.show()
Output
QQ plot for two variable

The qqplot_2samples function from statsmodels.api compares the quantiles of x against the quantiles of y. If the points fall along the 45-degree line, the two datasets follow similar distributions. Here y has a larger standard deviation than x, so the points still form a straight line, but one with a slope different from 1.
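
When the two samples have different sizes, their quantiles can still be paired by evaluating both at the same probabilities. The following is a minimal NumPy sketch of that idea, not the statsmodels implementation. The grid of 99 probabilities and the sample sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1.0, 2000)  # sample 1
y = rng.normal(0, 1.5, 1500)  # sample 2, with a different size and spread

# Evaluate both samples at the same probabilities so the quantiles pair up.
probs = np.linspace(0.01, 0.99, 99)
qx = np.quantile(x, probs)
qy = np.quantile(y, probs)

# A straight-line fit through the paired quantiles summarizes the relationship:
# a slope near 1 and an intercept near 0 would mean similar distributions.
slope, intercept = np.polyfit(qx, qy, 1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

For two normal samples, the slope estimates the ratio of their standard deviations, so it should land near 1.5 here rather than 1.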

QQ Plot for Multiple Variables

In some cases, you may want to compare multiple variables to see how they compare with a theoretical distribution or with each other. This can be done by plotting multiple QQ plots in a grid or side by side.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate random datasets (all normal, with different means and spreads)
data1 = np.random.normal(loc=0, scale=1, size=1000)
data2 = np.random.normal(loc=0, scale=1.5, size=1000)
data3 = np.random.normal(loc=1, scale=1, size=1000)

datasets = {
    'Data 1 (loc=0, scale=1)': data1,
    'Data 2 (loc=0, scale=1.5)': data2,
    'Data 3 (loc=1, scale=1)': data3,
}

# Set up one subplot per dataset, side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot the QQ plot for each dataset on its own axes
for ax, (label, data) in zip(axes, datasets.items()):
    stats.probplot(data, dist="norm", plot=ax)
    ax.set_title(label)
    ax.grid(True)

# Customize the figure to make it clearer
fig.suptitle('QQ Plots for Multiple Variables')
plt.tight_layout()
plt.show()

Output
QQ plot for multiple variable

This code generates three QQ plots for data1, data2, and data3, each compared with a normal distribution.

What Does a QQ Plot Show?

A QQ plot provides a visual method for assessing whether a dataset follows a specific distribution. It compares the observed data to the theoretical quantiles. The following patterns can be observed:

  • Straight Line: If the points lie along a straight line, it suggests that the data follows the specified distribution.
  • Curvature: If the points deviate in a curved pattern, the shape of the data differs from the expected distribution. An S-shaped curve suggests tails that are heavier or lighter than expected, while a single bend (an arc) suggests skewness.
  • Outliers: Points that deviate significantly from the line indicate potential outliers or extreme values in the data.

QQ Plot Interpretation

  • Normal Distribution: If the points of the QQ plot follow a straight line, the data is likely normally distributed.
  • Right-Skewed Distribution: If the points bend upward on the right side, it indicates a right-skewed distribution, where the data has a long tail on the right side.
  • Left-Skewed Distribution: If the points bend downward on the left side, it indicates a left-skewed distribution, where the data has a long tail on the left side.
  • Heavy Tails: If the points deviate from the straight line at both extremes (below it on the left and above it on the right), the data has heavy tails, meaning it may contain more extreme values than a normal distribution.
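
These patterns can also be read off numerically. The sketch below draws a normal, a right-skewed, and a heavy-tailed sample and, for each, reports the correlation r from stats.probplot and how far the smallest and largest points sit from the fitted line. The choice of an exponential and a t distribution with 3 degrees of freedom is an assumption made only to produce the two shapes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 1000
samples = {
    "normal": rng.normal(size=n),
    "right-skewed (exponential)": rng.exponential(size=n),
    "heavy-tailed (t, 3 d.o.f.)": rng.standard_t(3, size=n),
}

results = {}
for name, values in samples.items():
    (osm, osr), (slope, intercept, r) = stats.probplot(values, dist="norm")
    # Distance of the smallest and largest observation from the fitted line:
    # a positive gap means the point sits above the line.
    lower_gap = osr[0] - (slope * osm[0] + intercept)
    upper_gap = osr[-1] - (slope * osm[-1] + intercept)
    results[name] = (r, lower_gap, upper_gap)
    print(f"{name:28s} r={r:.4f}  lower gap={lower_gap:+.2f}  upper gap={upper_gap:+.2f}")
```

A positive upper gap together with a negative lower gap matches the heavy-tails pattern, while a large positive upper gap on its own side of a curved plot matches right skew.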

QQ Plot of Residuals

In regression analysis, QQ plots can be used to assess the residuals of a model. Many inference procedures for linear regression assume normally distributed errors, so residuals that follow a straight line on the QQ plot are consistent with that assumption. Deviation from normality in the residuals might suggest skewed or heavy-tailed errors, heteroscedasticity, or model misspecification.

import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data and fit a linear regression model
X = np.random.normal(0, 1, 100)
y = 3*X + np.random.normal(0, 1, 100)

# Fit linear regression model
X = sm.add_constant(X)  # Adds constant (intercept) term
model = sm.OLS(y, X).fit()

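# Create QQ plot of residuals
# line='45' works here because the simulated noise has unit variance;
# for residuals on another scale, line='s' or fit=True is the safer choice
sm.qqplot(model.resid, line ='45')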
plt.show()
Output
QQ plot of residuals
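
Real residuals rarely have unit variance, so it helps to put them on the standard normal scale before comparing against a 45-degree line. The following sketch uses np.polyfit instead of statsmodels to keep it short. The noise standard deviation of 5 and the use of ddof=2 (two fitted parameters) are assumptions for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 200)
y = 3 * x + rng.normal(0, 5, 200)  # noise standard deviation of 5, not 1

# Fit a straight line with NumPy and compute the residuals.
slope_fit, intercept_fit = np.polyfit(x, y, 1)
residuals = y - (slope_fit * x + intercept_fit)

# Standardize so the residuals share the scale of standard normal quantiles.
standardized = (residuals - residuals.mean()) / residuals.std(ddof=2)

# The slope of the QQ fit shows the scale of each set of residuals.
(_, _), (qq_slope_raw, _, _) = stats.probplot(residuals, dist="norm")
(_, _), (qq_slope_std, _, _) = stats.probplot(standardized, dist="norm")
print(f"QQ slope, raw residuals:          {qq_slope_raw:.2f}")
print(f"QQ slope, standardized residuals: {qq_slope_std:.2f}")
```

Only the standardized residuals have a QQ slope near 1, which is what a 45-degree reference line expects.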

How to Tell if Data is Normally Distributed from a QQ Plot

If the points fall along a straight line in a QQ plot, the data is likely normally distributed. Minor deviations from the straight line, especially at the tails, are usually acceptable because of random sampling variation. Significant deviations such as curves, bends, or outliers suggest a departure from normality.
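
How much deviation counts as minor depends on the sample size. One way to get a feel for it is to simulate data that really is normal and see how straight its QQ plots come out. This is a rough sketch: the helper qq_correlations is hypothetical, and the sample sizes and 200 repeats are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def qq_correlations(n, repeats=200):
    """Return probplot's correlation r for `repeats` normal samples of size n."""
    return np.array([
        stats.probplot(rng.normal(size=n), dist="norm")[1][2]
        for _ in range(repeats)
    ])

for n in (20, 1000):
    r = qq_correlations(n)
    print(f"n={n:4d}: lowest r={r.min():.4f}, median r={np.median(r):.4f}")
```

Small samples from a truly normal distribution produce noticeably less straight QQ plots than large ones, so a wobble that is unremarkable at n=20 deserves a closer look at n=1000.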

Conclusion

QQ plots in Python can be implemented using libraries like Scipy, Statsmodels, etc. Interpreting the plot helps us make informed decisions about the distribution of our data.
