The error “ValueError: Found input variables with inconsistent numbers of samples” is raised by Python libraries such as scikit-learn when the feature matrix (X) and the target vector (y) do not have the same number of rows. The fix is to verify the shapes of both, remove or realign the mismatched rows, and make sure X and y contain the same number of records.
Reproducing the Error
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3], [4]])
y = np.array([10, 20, 30])  # Only 3 samples
model = LinearRegression()
model.fit(X, y)
Output
ValueError: Found input variables with inconsistent numbers of samples: [4, 3]
Fixing the Error by Correcting the Dataset Alignment
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3]])
y = np.array([10, 20, 30])
model = LinearRegression()
model.fit(X, y)
Now, the model will train successfully.
What Does the ValueError: Found Input Variables with Inconsistent Numbers of Samples Error Mean?
Supervised models in machine learning, from LinearRegression to RandomForestClassifier, need two inputs with the same number of samples:
- X (input features) has shape (n_samples, n_features)
- y (target/output) has shape (n_samples,)
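The length check described above can be run directly, without fitting a model. The sketch below uses `sklearn.utils.check_consistent_length`, the scikit-learn utility that raises this error; the arrays are illustrative values, not data from a real project.

```python
import numpy as np
from sklearn.utils import check_consistent_length

X = np.array([[1], [2], [3], [4]])  # 4 samples, 1 feature
y = np.array([10, 20, 30])          # only 3 samples

try:
    # Compares the first dimension of every array passed in
    check_consistent_length(X, y)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Running this check right after loading or preprocessing surfaces the mismatch earlier than waiting for `fit` to fail.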
Why This Error Happens (Top Causes)
Let’s look at common real-world scenarios that produce inconsistent sample counts. The same problem goes by several names:
- Length mismatch
- Shape mismatch
- Misaligned rows
- Unequal samples
- Training label mismatch
- Feature target mismatch
These mismatches usually appear during preprocessing. Let’s discuss the main causes.
Dropped Rows During Pre-processing
Examples:
- While removing missing value using dropna()
- Filtering dataframe based on a condition in Pandas
- Removing outliers
If any of these steps drops rows from X but not from y, the sample counts no longer match.
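A minimal sketch of how this happens, assuming a small illustrative DataFrame with made-up column names `feature` and `target`. Calling `dropna()` on X alone leaves y one row longer; selecting the labels by the surviving index is one possible way to realign them.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing value in the feature column
df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, 4.0],
    "target": [10, 20, 30, 40],
})

X = df[["feature"]].dropna()  # drops the row with the missing value
y = df["target"]              # still contains every row

print(len(X), len(y))  # 3 4

# Keep only the labels whose index survived in X
y_aligned = y.loc[X.index]
print(len(X), len(y_aligned))  # 3 3
```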
Incorrect Train-Test Split
Examples:
- Passing different arrays to train_test_split
- Splitting X and y separately instead of together
- Shuffling X and y independently, or building the splits by hand
Feature Engineering Changes the Row Count
Examples:
- Using rolling windows (the leading rows become NaN)
- Generating lag features (the leading rows become NaN)
- Merging datasets (unmatched or duplicate keys remove or add rows)
Loading data incorrectly
Examples:
- Loading X and y from CSV files that have different numbers of rows
- Making a mistake in manual slicing, such as X = df.iloc[:-1], y = df.iloc[:]
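The slicing mistake is easy to miss because `iloc[:-1]` slices rows, not columns. The sketch below contrasts it with column slicing, assuming an illustrative DataFrame whose last column is the target; the column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [5, 6, 7, 8],
    "target": [0, 1, 0, 1],
})

# Row slicing by mistake: X loses its last row, y keeps all rows
X_bad = df.iloc[:-1]
y_bad = df.iloc[:]
print(X_bad.shape, y_bad.shape)  # (3, 3) (4, 3)

# Column slicing: every row is kept, the last column becomes the target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
print(X.shape, y.shape)  # (4, 2) (4,)
```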
How to Fix the ValueError: Found Input Variables with Inconsistent Numbers of Samples
Verify the shape of X and y
print(X.shape)
print(y.shape)
Reset Indexes after filtering
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
Apply the same preprocessing step to both X and y
For example, to drop rows with missing values from the feature DataFrame, build the mask once and apply it to both X and y:
mask = X.notnull().all(axis=1)
X = X[mask]
y = y[mask]
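An alternative that avoids a separate mask is to clean the combined DataFrame first and split it into X and y afterwards. This is a sketch under the assumption that features and target live in one DataFrame; the column names `age`, `income`, and `label` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50, 60, np.nan, 80],
    "label": [0, 1, 1, 0],
})
feature_cols = ["age", "income"]

# Drop rows with a missing feature value from the whole frame at once
clean = df.dropna(subset=feature_cols)

# X and y come from the same cleaned frame, so they stay aligned
X = clean[feature_cols]
y = clean["label"]

print(X.shape, y.shape)  # (2, 2) (2,)
```

Because both objects are derived from `clean`, they share the same rows and the same index.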
Use train_test_split correctly
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Incorrect usage:
X_train, X_test = train_test_split(X)
y_train, y_test = train_test_split(y)  # Wrong: separate calls shuffle X and y independently
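To see that a joint split keeps each feature row paired with its label, the sketch below builds illustrative arrays in which row `i` of X is `[2*i, 2*i + 1]` and its label is `i`, then checks the pairing after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # row i is [2*i, 2*i + 1]
y = np.arange(10)                 # label of row i is i

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, y_train.shape)  # (8, 2) (8,)
print(X_test.shape, y_test.shape)    # (2, 2) (2,)

# The first column divided by 2 recovers the original row number,
# which must equal the label if rows and labels are still paired
rows_match = np.array_equal(X_train[:, 0] // 2, y_train)
print(rows_match)  # True
```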
Real-world example scenarios and how to handle them
- Data pipelines often drop NA rows, which creates misalignment.
- Apply a check before training the model, for example:
# Check for misalignment in dataset
if len(X) != len(y):
    print("Warning: Misalignment detected between X and y.")
Reset the indexes after cleanup:
# Reset index after cleanup
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
Merging Datasets in Healthcare Analytics
If we have two datasets, one with patient data and another with outcomes, they need to be merged before training. When merging:
Always merge using clean keys.
# Example of clean merge using patient_id as key
import pandas as pd
merged_df = pd.merge(patient_data, outcome_data, on='patient_id', how='inner')
Check for orphaned rows.
# Check for orphaned rows (rows in one dataset but not the other)
audit_df = pd.merge(patient_data, outcome_data, on='patient_id', how='outer', indicator=True)
orphaned_rows = audit_df[audit_df['_merge'] != 'both']
Run shape audits after each transformation.
# Run shape audits after each transformation
print("Shape of merged dataset:", merged_df.shape)
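The three merge steps can be tried end to end on toy data. This is a sketch with made-up patient records; the `age` and `readmitted` columns are assumptions for illustration. It uses the `indicator` and `validate` parameters of `pd.merge` to find orphaned rows and to reject duplicate keys.

```python
import pandas as pd

# Illustrative data: patient 1 has no outcome, outcome 5 has no patient
patient_data = pd.DataFrame({"patient_id": [1, 2, 3, 4], "age": [34, 51, 47, 29]})
outcome_data = pd.DataFrame({"patient_id": [2, 3, 4, 5], "readmitted": [0, 1, 0, 1]})

# Outer merge with indicator=True adds a _merge column that says
# whether each row came from the left frame, the right frame, or both
audit = pd.merge(patient_data, outcome_data, on="patient_id", how="outer", indicator=True)
orphaned_rows = audit[audit["_merge"] != "both"]
print(sorted(orphaned_rows["patient_id"]))  # [1, 5]

# Inner merge keeps matched rows only; validate raises on duplicate keys
merged_df = pd.merge(
    patient_data, outcome_data, on="patient_id", how="inner", validate="one_to_one"
)
X = merged_df[["age"]]
y = merged_df["readmitted"]
print(X.shape, y.shape)  # (3, 1) (3,)
```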
Feature engineering for time-series models
Rolling statistics and lag features leave NaN values in the leading rows.
# Drop missing values and reset index after creating lag features
df = df.dropna().reset_index(drop=True)
# Select features and target from the same cleaned DataFrame
X = df[['lag1', 'lag2']]
y = df['target']
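For completeness, here is a sketch of the whole sequence, including the step that creates `lag1` and `lag2`. It assumes the lags are built with `Series.shift` on an illustrative `target` series; a real pipeline may derive them differently.

```python
import pandas as pd

df = pd.DataFrame({"target": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]})

# shift(n) moves values down by n rows, leaving NaN in the first n rows
df["lag1"] = df["target"].shift(1)
df["lag2"] = df["target"].shift(2)
rows_before = len(df)

# Drop the leading NaN rows from the whole frame, then split
df = df.dropna().reset_index(drop=True)
X = df[["lag1", "lag2"]]
y = df["target"]

print(rows_before, len(X), len(y))  # 6 4 4
```

Dropping rows before selecting X and y means both lose the same two leading rows.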
Causes and Fixes for the Error
| Cause | Symptom | Fix |
|---|---|---|
| Dropping rows only in X | y has more rows | Apply same mask to y |
| Incorrect train_test_split | Mismatched splits | Pass X and y together |
| Rolling / lag features | Leading NaN rows in X | Drop NA rows after feature generation |
| Merging datasets incorrectly | Orphan rows | Use inner join or remove unmatched rows |
| Misaligned DataFrame indexes | Shape looks equal, but error | Reset index |
| Separate imports of X and y | Different row counts | Validate using assert len(X) == len(y) before training |
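The checks in the table can be bundled into one small guard that runs before training. `assert_aligned` below is a hypothetical helper written for this example, not part of scikit-learn or pandas; it compares lengths and, for pandas objects, index labels.

```python
import pandas as pd

def assert_aligned(X, y):
    """Raise ValueError if X and y differ in row count or index labels."""
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} rows but y has {len(y)} rows")
    pandas_types = (pd.DataFrame, pd.Series)
    # Index labels only exist on pandas objects, so compare them only there
    if isinstance(X, pandas_types) and isinstance(y, pandas_types):
        if not X.index.equals(y.index):
            raise ValueError("X and y have equal length but different index labels")

# Example usage with illustrative data
X = pd.DataFrame({"f1": [1, 2, 3]})
y = pd.Series([0, 1, 0])
assert_aligned(X, y)  # passes silently

try:
    assert_aligned(X, y.iloc[:2])
    error_text = None
except ValueError as exc:
    error_text = str(exc)
print(error_text)
```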
FAQ
print(len(X), len(y))
print(X.index, y.index)
print(X.shape, y.shape)
Conclusion:
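The ValueError: Found input variables with inconsistent numbers of samples is caused by data alignment issues: X and y reach the model with different numbers of rows. Checking shapes and applying the same preprocessing to both prevents it.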
This blog is written by Jagdish Kharatmol, who has 3+ years of experience in Python and is a researcher in the field of AI/ML.