Solve ValueError: Found Input Variables with Inconsistent Numbers of Samples

The error “ValueError: Found input variables with inconsistent numbers of samples” is raised by Python libraries such as scikit-learn when the feature matrix (X) and the target vector (y) do not have the same number of rows. The fix is to check the shapes of both, find where rows were dropped or added on only one side, and realign them so that X and y have the same number of records.

Reproducing the Error

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([10, 20, 30])  # Only 3 samples

model = LinearRegression()
model.fit(X, y)

Output:

ValueError: Found input variables with inconsistent numbers of samples: [4, 3]

Fixing the Error by Correcting the Dataset Alignment

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3]])
y = np.array([10, 20, 30])

model = LinearRegression()
model.fit(X, y)

Now the model trains successfully.

What Does the ValueError: Found Input Variables with Inconsistent Numbers of Samples Error Mean?

Supervised machine learning models, from linear regression to RandomForestClassifier, commonly need two inputs:

X (input features) has shape (n_samples, n_features)
y (target/output) has shape (n_samples,)

When the number of samples in the input features and the target does not match, we get the error ValueError: Found input variables with inconsistent numbers of samples. To summarize, the error comes from data alignment; it is not a model error.
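
As a minimal sketch of the condition described above, the snippet below compares the first axis of X and y, which is the sample axis in both. The arrays are hypothetical toy data, not taken from a real dataset.

```python
import numpy as np

# Hypothetical toy data: 4 feature rows but only 3 targets.
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape (4, 1)
y = np.array([10.0, 20.0, 30.0])             # shape (3,)

n_samples_X = X.shape[0]  # first axis of X is the sample axis
n_samples_y = y.shape[0]  # first axis of y is the sample axis

# The counts differ, which is the condition behind the ValueError.
samples_match = n_samples_X == n_samples_y
print(n_samples_X, n_samples_y, samples_match)  # 4 3 False
```

The number of features (the second axis of X) plays no role here; only the row counts have to agree.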

Why This Error Happens (Top Causes)

Now, let’s look at common real-world scenarios that produce inconsistent sample counts. They go by several names:

Length mismatch
Shape mismatch
Misaligned rows
Unequal samples
Training label mismatch
Feature target mismatch

These mismatches usually appear during preprocessing. Let’s discuss the main causes.

Dropped Rows During Pre-processing

Examples:

Removing missing values with dropna()
Filtering a DataFrame on a condition in pandas
Removing outliers

If any of these steps drops rows from X but not from y, the sample counts no longer match.
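
The following sketch reproduces this situation with pandas and shows one possible way to realign the target afterwards. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing feature value in row 1.
X = pd.DataFrame({"age": [25, np.nan, 40, 31], "income": [50, 60, 70, 80]})
y = pd.Series([0, 1, 0, 1], name="label")

X_clean = X.dropna()          # drops row 1 from X only
print(len(X_clean), len(y))   # 3 4 -> fit() would raise the ValueError

# Keep only the labels whose index survived in X.
y_clean = y.loc[X_clean.index]
print(len(X_clean), len(y_clean))  # 3 3
```

Selecting y by the surviving index of X works here because both objects still share the original row labels.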

Incorrect Train-Test Split

Examples:

Passing unrelated arrays to train_test_split
Splitting X and y separately instead of together
Shuffling or assembling the datasets by hand

Feature Engineering Adds or Removes Rows

Examples:

Using rolling windows (leading NaN rows that are later dropped)
Generating lag features (same effect)
Merging datasets (unmatched keys remove rows, duplicate keys add rows)
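
To illustrate the rolling-window and lag cases, here is a small sketch with a hypothetical sales series. The column names and window size are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical daily sales series.
df = pd.DataFrame({"sales": [10, 12, 11, 15, 14, 18]})

# A 3-step rolling mean and a 1-step lag both leave NaN at the start.
df["roll_mean_3"] = df["sales"].rolling(window=3).mean()
df["lag_1"] = df["sales"].shift(1)

X = df[["roll_mean_3", "lag_1"]].dropna()  # loses the first 2 rows
y = df["sales"]                             # still has all 6 rows

print(len(X), len(y))  # 4 6
```

Because y was taken from the unfiltered frame, it keeps the two leading rows that X lost.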

Loading Data Incorrectly

Examples:

Loading X and y from CSV files with different numbers of rows
Slicing manually with different ranges, e.g. X = df.iloc[:-1], y = df.iloc[:]
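
A short sketch of the slicing mistake, next to one way to avoid it: select columns rather than row ranges, so X and y always come from the same rows. The frame and its "target" column are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with a "target" column.
df = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [5, 6, 7, 8],
    "target": [0, 1, 0, 1],
})

# Mismatched row slices.
X_bad = df.iloc[:-1]   # 3 rows
y_bad = df.iloc[:]     # 4 rows

# Selecting columns instead of row ranges keeps every row paired.
X = df.drop(columns=["target"])
y = df["target"]

print(len(X_bad), len(y_bad), len(X), len(y))  # 3 4 4 4
```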

How to Fix the ValueError: Found Input Variables with Inconsistent Numbers of Samples

Let’s go straight to the code.

Verify the shape of X and y

print(X.shape)
print(y.shape)
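
If printing shapes by hand becomes repetitive, the check can be wrapped in a small function. The helper below is hypothetical (it is not part of scikit-learn) and is only a sketch of the idea.

```python
import numpy as np

def assert_same_n_samples(X, y):
    """Raise a descriptive error if X and y differ in row count.

    Hypothetical helper, not part of scikit-learn.
    """
    n_X, n_y = len(X), len(y)
    if n_X != n_y:
        raise ValueError(
            f"X has {n_X} rows but y has {n_y}; "
            f"difference = {abs(n_X - n_y)}"
        )
    return n_X

X = np.zeros((5, 2))
y = np.zeros(5)
print(assert_same_n_samples(X, y))  # 5
```

Calling it right after each preprocessing step narrows down which step introduced the mismatch.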

Reset indexes after filtering

# Give both objects a fresh 0..n-1 index so later masks and joins line up
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

Apply the same preprocessing step to both X and y

For example, to drop rows with missing values in the feature DataFrame, build the mask once and apply it to both:

mask = X.notnull().all(axis=1)
X = X[mask]
y = y[mask]
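
The same idea can be extended to cover missing targets as well. The sketch below assumes X and y share the same index and uses made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a missing feature in row 1 and a missing target in row 2.
X = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([10.0, 20.0, np.nan, 40.0])

# A row is kept only if every feature AND the target are present.
mask = X.notna().all(axis=1) & y.notna()

X_clean = X[mask].reset_index(drop=True)
y_clean = y[mask].reset_index(drop=True)
print(X_clean.shape, y_clean.shape)  # (2, 2) (2,)
```

Building one mask and applying it to both sides means the two row counts cannot drift apart at this step.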

Use train_test_split correctly

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Incorrect usage:

X_train, X_test = train_test_split(X)
y_train, y_test = train_test_split(y)  # Wrong: separate calls shuffle X and y independently
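
To make the difference concrete, the sketch below uses toy data in which each target is ten times its feature, so correct pairing is easy to verify. The data and the split sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data where each target is 10x its feature.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 10

# Joint split: one shuffle is applied to both arrays.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=42
)
pairs_intact = bool(np.all(X_train.ravel() * 10 == y_train))

# Separate splits with different sizes: the lengths no longer agree.
X_tr_bad, _ = train_test_split(X, test_size=2, random_state=42)
y_tr_bad, _ = train_test_split(y, test_size=3, random_state=42)

print(pairs_intact, len(X_tr_bad), len(y_tr_bad))  # True 8 7
```

Even when separate calls happen to return equal lengths, nothing guarantees that the rows still pair up, so splitting X and y in one call is the safer habit.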

Real-World Scenarios and How to Handle Them

Handling missing values in finance or retail forecasting

Data pipelines often drop NA rows, which creates misalignment. Add a check before training the model:

# Check for misalignment between X and y
if len(X) != len(y):
    print("Warning: Misalignment detected between X and y.")

Reset the indexes after cleanup:

# Reset index after cleanup
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

Merging Datasets in Healthcare Analytics

Suppose we have two datasets, one with patient data and another with outcomes, that need to be merged. In that case:

Always merge using clean keys.

import pandas as pd

# Example of a clean merge using patient_id as the key
merged_df = pd.merge(patient_data, outcome_data, on='patient_id', how='inner')

Check for orphaned rows.

# Check for orphaned rows (rows in one dataset but not the other)
audit = pd.merge(patient_data, outcome_data, on='patient_id', how='outer', indicator=True)
orphaned_rows = audit[audit['_merge'] != 'both']

Run shape audits after each transformation.

# Run shape audits after each transformation
print("Shape of merged dataset:", merged_df.shape)
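
Duplicate keys are another way a merge can change the row count. As a sketch, the snippet below uses two hypothetical tables and the validate argument of pd.merge, which raises an error instead of silently duplicating rows.

```python
import pandas as pd

# Hypothetical tables; patient 2 has two outcome records.
patient_data = pd.DataFrame({"patient_id": [1, 2], "age": [34, 51]})
outcome_data = pd.DataFrame({"patient_id": [1, 2, 2], "outcome": [0, 1, 1]})

merged = pd.merge(patient_data, outcome_data, on="patient_id", how="inner")
print(len(patient_data), len(merged))  # 2 3 -> the merge added a row

# validate= makes pandas raise instead of silently duplicating rows.
try:
    pd.merge(patient_data, outcome_data, on="patient_id",
             how="inner", validate="one_to_one")
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True
print(duplicate_keys_found)  # True
```

Whether duplicates should be removed or aggregated depends on the data; the check only makes the problem visible early.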

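Feature engineering for time-series models

Rolling statistics and lag features leave NaN values in the first rows.

# Feature engineering for time-series models

# Create lag features (assumed here to be lags of the target column)
df['lag1'] = df['target'].shift(1)
df['lag2'] = df['target'].shift(2)

# Drop missing values and reset index after creating lag features
df = df.dropna().reset_index(drop=True)

# Selecting features for model
X = df[['lag1', 'lag2']]
y = df['target']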

To summarize, the table below lists the causes of ValueError: Found Input Variables with Inconsistent Numbers of Samples and the possible fixes.

Cause and Fix for Error

Cause	Symptom	Fix
Dropping rows only in X	y has more rows	Apply same mask to y
Incorrect train_test_split	Mismatched splits	Pass X and y together
Rolling / lag features	Leading NaN rows in X	Drop NA rows after feature generation, before separating X and y
Merging datasets incorrectly	Orphan rows	Use inner join or remove unmatched rows
Misaligned DataFrame indexes	Masks or joins between X and y drop or add rows unexpectedly	Reset index on both
Separate imports of X and y	Different row counts	Validate using `assert len(X) == len(y)` before training

FAQ

Q. Why do we get the “inconsistent numbers of samples” error? The model, or a helper such as train_test_split, receives input arrays (X, y) in which the number of rows in X does not match the number of rows in y.

Q. Does this error occur only in scikit-learn? This exact message comes from scikit-learn, but other model training frameworks such as TensorFlow and PyTorch raise similar errors whenever feature and target data are misaligned.

Q. How do I quickly debug the error? Print the lengths, indexes, and shapes of both inputs:

print(len(X), len(y))
print(X.index, y.index)  # pandas objects only
print(X.shape, y.shape)
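
When X and y are pandas objects, comparing their indexes can show exactly which rows are missing on one side. The snippet below is a sketch with hypothetical data.

```python
import pandas as pd

# Hypothetical data: row 1 was dropped from y but not from X.
X = pd.DataFrame({"f": [1, 2, 3, 4]}, index=[0, 1, 2, 3])
y = pd.Series([10, 30, 40], index=[0, 2, 3])

only_in_X = X.index.difference(y.index)   # labels missing from y
only_in_y = y.index.difference(X.index)   # labels missing from X
print(list(only_in_X), list(only_in_y))   # [1] []

# Keep only the labels present on both sides.
common = X.index.intersection(y.index)
X_aligned, y_aligned = X.loc[common], y.loc[common]
```

This only helps while both objects still carry their original index, so it is best run before any reset_index call.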

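Conclusion

The ValueError: Found input variables with inconsistent numbers of samples is caused by data alignment issues, not by the model. Check the shapes of X and y, apply every preprocessing step to both, and split them together.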

This blog is written by Jagdish Kharatmol, who has 3+ years of experience in Python and is a researcher in the field of AI/ML.
