Solve ValueError: Found Input Variables with Inconsistent Numbers of Samples

The error “ValueError: Found input variables with inconsistent numbers of samples” appears in Python libraries such as scikit-learn when the feature matrix (X) and the target vector (y) do not have the same number of rows. To fix it, check the shapes of both objects, find the step where their row counts diverged, and make sure X and y contain the same records in the same order.

Reproducing the Error
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([10, 20, 30])  # Only 3 samples

model = LinearRegression()
model.fit(X, y)
Output:
ValueError: Found input variables with inconsistent numbers of samples: [4, 3]
Fixing the Error by Correcting the Dataset Alignment
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3]])
y = np.array([10, 20, 30])

model = LinearRegression()
model.fit(X, y)
Now the model trains successfully, because X and y both contain 3 samples.

What Does “ValueError: Found Input Variables with Inconsistent Numbers of Samples” Mean?

Supervised models in machine learning, from LinearRegression to RandomForestClassifier, commonly need two inputs with matching row counts:

  • X (input features) has shape (n_samples, n_features)
  • y (target/output) has shape (n_samples,)
When the number of rows in the input features and the target does not match, we get ValueError: Found input variables with inconsistent numbers of samples. To summarize, the error is a data alignment problem, not a model problem.
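As a rough illustration of what is being compared, scikit-learn exposes a helper, sklearn.utils.check_consistent_length, that raises the same message when its arguments differ in their first dimension. The sketch below uses made-up arrays; only the row counts matter.

```python
import numpy as np
from sklearn.utils import check_consistent_length

# Made-up data for illustration: only the number of rows matters here
X = np.zeros((4, 2))   # 4 samples, 2 features
y_ok = np.zeros(4)     # 4 targets, matches X
y_bad = np.zeros(3)    # 3 targets, one short

check_consistent_length(X, y_ok)  # passes silently

try:
    check_consistent_length(X, y_bad)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

The numbers in square brackets at the end of the message are the row counts of each input, in the order they were passed.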

Why This Error Happens (Top Causes)

The same problem, inconsistent sample counts, is described under several names:

  1. Length mismatch
  2. Shape mismatch
  3. Misaligned rows
  4. Unequal samples
  5. Training label mismatch
  6. Feature target mismatch

These mismatches usually creep in during preprocessing. Let’s look at the common real-world scenarios that produce them.

Dropped Rows During Pre-processing

Examples:

  • Removing missing values with dropna()
  • Filtering a DataFrame on a condition in pandas
  • Removing outliers

If any of these steps drops rows from X but not from y, the sample counts no longer match.
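A minimal sketch of this failure mode, using a small made-up DataFrame (the column names age, income and label are assumptions for illustration). It shows the counts diverging after dropna() and one way to realign y by index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two feature columns with gaps, one label column
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50, 60, np.nan, 80],
    "label": [0, 1, 0, 1],
})
X = df[["age", "income"]]
y = df["label"]

# Rows 1 and 2 are removed from X only, so the lengths now differ
X_clean = X.dropna()
print(len(X_clean), len(y))

# Realign y by keeping only the index labels that survived in X
y_clean = y.loc[X_clean.index]
print(len(X_clean), len(y_clean))
```

This works only while X and y still share the original index, which is one reason to realign before calling reset_index.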

Incorrect Train-Test Split

Examples

  • Passing different arrays to train_test_split
  • Splitting X and y separately instead of together
  • Shuffling X and y independently, or building the splits by hand

Feature Engineering That Changes the Row Count

Examples

  • Using rolling windows
  • Generating lag features
  • Merging datasets
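To make the merging case concrete, here is a small sketch with made-up tables (features, lookup and the id column are assumptions for illustration). A duplicated key in the right-hand table silently adds a row to the merged features, while y keeps its original length.

```python
import pandas as pd

# Hypothetical feature table and target, three samples each
features = pd.DataFrame({"id": [1, 2, 3], "x": [0.1, 0.2, 0.3]})
y = pd.Series([10, 20, 30])

# Hypothetical lookup table in which id 2 appears twice
lookup = pd.DataFrame({"id": [1, 2, 2, 3], "region": ["a", "b", "c", "d"]})

# The left join repeats the row for id 2, so X grows to 4 rows
merged = features.merge(lookup, on="id", how="left")
print(len(merged), len(y))

# One possible fix: keep a single lookup row per key before merging
deduped = lookup.drop_duplicates(subset="id")
merged_fixed = features.merge(deduped, on="id", how="left")
print(len(merged_fixed), len(y))
```

Whether dropping duplicates is the right fix depends on the data; the point is that the row count should be checked after every merge.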

Loading Data Incorrectly

Examples

  • Loading X and y from separate CSV files that have different numbers of rows
  • Making a mistake in manual slicing, for example X = df.iloc[:-1], y = df.iloc[:]
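The slicing mistake is easy to reproduce. The sketch below uses a made-up DataFrame (the feature and target column names are assumptions) and shows one way to avoid it: slice the rows once, then select the columns from that single slice.

```python
import pandas as pd

# Hypothetical data with one feature column and one target column
df = pd.DataFrame({"feature": [1, 2, 3, 4], "target": [10, 20, 30, 40]})

# Off-by-one: X loses its last row, y keeps all rows
X_bad = df[["feature"]].iloc[:-1]
y_bad = df["target"].iloc[:]
print(len(X_bad), len(y_bad))

# Safer: take one row slice, then split it into X and y
subset = df.iloc[:-1]
X = subset[["feature"]]
y = subset["target"]
print(len(X), len(y))
```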

You can learn the difference between loc vs iloc here.

How to Fix the ValueError: Found Input Variables with Inconsistent Numbers of Samples

Let’s jump straight to the code.

Verify the shape of X and y
print(X.shape)
print(y.shape)
Reset Indexes after filtering
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
Apply the same preprocessing step to both X and y

For example, to drop rows with missing values from the feature DataFrame, build one mask and apply it to both X and y.

mask = X.notnull().all(axis=1)
X = X[mask]
y = y[mask]
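The mask above is a pandas Series, so it lines up with y by index labels when y is also a pandas object. If y is a plain NumPy array, one option is to convert the mask to a positional boolean array first. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: X is a DataFrame, y is a NumPy array without an index
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
y = np.array([0, 1, 0])

# True for rows of X that have no missing values
mask = X.notnull().all(axis=1)

X_clean = X[mask]
# Convert the Series mask to a NumPy boolean array for positional filtering
y_clean = y[mask.to_numpy()]

print(X_clean.shape, y_clean.shape)
```

This positional approach assumes the rows of X and y are still in the same order.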
Use train_test_split correctly
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Incorrect usage:

X_train = train_test_split(X)
y_train = train_test_split(y)  # Wrong, don't split separately
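To see why the separate calls go wrong, it helps to look at what train_test_split returns. The sketch below uses made-up arrays in which each target equals half of its first feature value, so the pairing can be checked after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: row i of X is [2*i, 2*i + 1] and y[i] is i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# A single-array call returns a list [train, test], not one array
parts = train_test_split(X, test_size=0.2, random_state=42)
print(type(parts), len(parts))

# Passing X and y together applies the same shuffle to both
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every feature row is still paired with its own target
print((X_train[:, 0] == 2 * y_train).all())
```

So in the incorrect usage, X_train and y_train are each a two-element list rather than training arrays.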

Real-world example scenarios and how to handle them

Handling missing values in finance or retail forecasting
  • Data pipelines often drop NA rows, which creates misalignment.
  • Run a check before training the model:

Check for misalignment in the dataset

# Compare the row counts of X and y before training
if len(X) != len(y):
    print("Warning: Misalignment detected between X and y.")
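A slightly stricter variant of this check can also compare index labels when both inputs are pandas objects. The helper name find_alignment_problems below is hypothetical, not part of any library; treat it as a sketch of the idea rather than a complete validator.

```python
import pandas as pd

def find_alignment_problems(X, y):
    """Return a list of human-readable alignment problems (hypothetical helper)."""
    problems = []
    pandas_types = (pd.DataFrame, pd.Series)
    if len(X) != len(y):
        problems.append(f"X has {len(X)} rows but y has {len(y)}")
    elif isinstance(X, pandas_types) and isinstance(y, pandas_types):
        # Same length, but the index labels may still differ or be reordered
        if not X.index.equals(y.index):
            problems.append("X and y have equal length but different index labels")
    return problems

# Made-up inputs for illustration
X = pd.DataFrame({"a": [1, 2, 3]})
y = pd.Series([0, 1, 0])

print(find_alignment_problems(X, y))           # aligned
print(find_alignment_problems(X, y.iloc[:2]))  # y is one row short
```

Returning a list instead of raising keeps the helper usable inside logging or data-quality reports; raising an exception is an equally reasonable design.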
Reset indexes after cleanup
# Reset index after cleanup
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
Merging Datasets in Healthcare Analytics

Suppose we have two datasets, one with patient data and another with outcomes, that need to be merged. In that case:

Always merge using clean keys.
import pandas as pd

# Example of a clean merge using patient_id as the key
merged_df = pd.merge(patient_data, outcome_data, on='patient_id', how='inner')

Check for orphaned rows.

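# Check for orphaned rows (rows in one dataset but not the other).
# An inner join has already dropped them, so audit with an outer join instead.
audit = pd.merge(patient_data, outcome_data, on='patient_id', how='outer', indicator=True)
orphaned_rows = audit[audit['_merge'] != 'both']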
Run shape audits after each transformation.

# Run shape audits after each transformation
print("Shape of merged dataset:", merged_df.shape)
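Duplicate keys are another thing worth auditing, since they change the row count of the merged result. pandas can check for them during the merge through the validate argument. The tables below are made up for illustration, with patient 2 recorded twice in the outcome data.

```python
import pandas as pd

# Hypothetical tables: patient_id 2 appears twice in outcome_data
patient_data = pd.DataFrame({"patient_id": [1, 2, 3], "age": [34, 51, 47]})
outcome_data = pd.DataFrame({"patient_id": [1, 2, 2], "outcome": [0, 1, 1]})

try:
    # validate="one_to_one" requires unique keys on both sides
    pd.merge(
        patient_data, outcome_data,
        on="patient_id", how="inner", validate="one_to_one",
    )
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True

print(duplicate_keys_found)
```

Failing early here is usually cheaper than discovering a row count mismatch at training time.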
Feature engineering for time-series models

Rolling statistics and lag features leave NaN values in the leading rows, so drop those rows from the whole DataFrame before separating X and y.
# Drop missing values and reset index after creating lag features
df = df.dropna().reset_index(drop=True)

# Selecting features for model
X = df[['lag1', 'lag2']]
y = df['target']
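For completeness, here is one way the lag1 and lag2 columns could be produced before that cleanup step. The data is made up, and using shift() on the target column is an assumption about how the lags were built.

```python
import pandas as pd

# Hypothetical series of six observations
df = pd.DataFrame({"target": [10, 12, 15, 11, 18, 20]})

# shift(k) moves values down by k rows, leaving k leading NaN values
df["lag1"] = df["target"].shift(1)
df["lag2"] = df["target"].shift(2)

# Drop the incomplete leading rows from the whole frame, then split
df = df.dropna().reset_index(drop=True)
X = df[["lag1", "lag2"]]
y = df["target"]

print(X.shape, y.shape)
```

Because the rows are dropped before X and y are separated, both keep the same four samples.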
To summarize, the table below lists common causes of ValueError: Found Input Variables with Inconsistent Numbers of Samples and a possible fix for each.

Causes and Fixes for the Error

| Cause | Symptom | Fix |
| --- | --- | --- |
| Dropping rows only in X | y has more rows | Apply the same mask to y |
| Incorrect train_test_split | Mismatched splits | Pass X and y together |
| Rolling / lag features | Leading NaN rows in X | Drop NA rows after feature generation, before separating X and y |
| Merging datasets incorrectly | Orphan rows | Use an inner join or remove unmatched rows |
| Misaligned DataFrame indexes | Shapes look equal, but rows do not line up | Reset the index |
| Loading X and y separately | Different row counts | Validate with assert len(X) == len(y) before training |

FAQ

Q. Why do we get the “inconsistent numbers of samples” error? The model (or a helper such as train_test_split) receives input arrays X and y whose row counts do not match.

Q. Does this error occur only in scikit-learn? This exact message comes from scikit-learn, but the same problem occurs in other model training frameworks such as TensorFlow and PyTorch, which report their own, differently worded errors whenever feature and target data are misaligned.

Q. How can we quickly debug the error?

 print(len(X), len(y))
 print(X.index, y.index)
 print(X.shape, y.shape)

Conclusion:

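The ValueError: Found input variables with inconsistent numbers of samples is caused by data alignment issues, not by the model. Check the shapes of X and y after every preprocessing step, and apply each transformation to both together.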

This blog was written by Jagdish Kharatmol, who has 3+ years of experience in Python and is a researcher in the field of AI/ML.
