Pandas Interview Questions for 2025

Pandas is one of the most important Python libraries for data manipulation. It covers the core data workflow: loading, cleaning, and transforming data.

In this blog we will go through frequently asked pandas interview questions covering DataFrame creation and manipulation, aggregation and grouping, filtering and selection, merging, data cleaning and transformation, reading and writing data, visualization, and statistical functions.

Pandas interview questions on DataFrame Creation and Manipulation

How can you create a pandas DataFrame from a dictionary?

import pandas as pd
# Create a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

# Create DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

Output

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
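
A DataFrame can also be created from a list of dictionaries, where each dictionary becomes one row:

# Each dictionary in the list becomes a row
rows = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'}
]
df = pd.DataFrame(rows)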

 

What are the methods for adding a new column to a pandas DataFrame?

There are several methods for adding a column:

1. Direct assignment

We can directly assign a new column by specifying the column name and the values as a list, Series, scalar, and so on.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Add a new column by direct assignment
df['City'] = ['New York', 'Los Angeles', 'Chicago']

2. Using the assign() method

The assign() method returns a new DataFrame with the added column(s).

df = df.assign(Country=['USA', 'USA', 'USA'])

print(df)

Output

      Name  Age  Country
0    Alice   25      USA
1      Bob   30      USA
2  Charlie   35      USA

3. Using the insert() method

The insert() method allows us to insert a new column at a particular position in the DataFrame.

# Insert a new column at the second position (index 1)
df.insert(1, 'Gender', ['Female', 'Male', 'Male'])
print(df)

Output

      Name  Gender  Age  Country
0    Alice  Female   25      USA
1      Bob    Male   30      USA
2  Charlie    Male   35      USA

4. Using loc[]

The loc[] indexer can also be used to add a new column to a DataFrame (iloc[] cannot, since it does not create new column labels).

df.loc[:, 'Salary'] = [50000, 60000, 70000]
print(df)

Output

      Name  Gender  Age  Country  Salary
0    Alice  Female   25      USA   50000
1      Bob    Male   30      USA   60000
2  Charlie    Male   35      USA   70000

 

How do you set a column as the index of a pandas DataFrame?

We have a function called set_index(); its key parameters are the column name to use and the optional inplace flag.

df.set_index('column_name', inplace=False)
  • “column_name” is the name of the column you want to set as the index.
  • “inplace=True” modifies the original DataFrame, while inplace=False (default) returns a new DataFrame with the updated index.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Set 'Name' column as the index
df_new = df.set_index('Name')

The code above sets the 'Name' column as the index of the returned DataFrame df_new.
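
Printing df_new shows 'Name' as the index:

         Age         City
Name
Alice     25     New York
Bob       30  Los Angeles
Charlie   35      Chicago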


 

How do you reset the index of a pandas DataFrame?

The syntax for resetting the index:

df.reset_index(drop=False, inplace=False)
  • “drop=False”: This keeps the current index as a column in the DataFrame. If you set it to True, the index will be removed and not added as a column.
  • “inplace=False”: By default, reset_index() returns a new DataFrame. If you want to modify the DataFrame in place, you can set inplace=True.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=['a', 'b', 'c'])

# Reset the index without dropping the index column
df_reset = df.reset_index()
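
Printing df_reset shows the old index kept as a new 'index' column, with a fresh integer index:

   index     Name  Age
0      a    Alice   25
1      b      Bob   30
2      c  Charlie   35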

 

What are the ways to delete a column from a pandas DataFrame?

We can use drop(), del, loc[], and pop() methods to delete a column in Pandas DataFrame.

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

Using the drop() method

# Drop 'Age' column
df_dropped = df.drop(columns=['Age'])

Using the del statement

del df['Age']

Using the pop() method

city_column = df.pop('City')

Using loc[]

While loc and iloc are generally used for indexing, you can use them to select specific columns that you want to keep, effectively removing the ones you don’t want.

# Select all columns except 'City' using `loc`
df = df.loc[:, df.columns != 'City']
print(df)

 

How can you transpose a pandas DataFrame?

The .T attribute is a shorthand for transposing the DataFrame.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Before transpose
print(df)

Output

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

# Transpose the DataFrame
df_transposed = df.T
print(df_transposed)

Output

             0            1        2
Name     Alice          Bob  Charlie
Age         25           30       35
City  New York  Los Angeles  Chicago

Pandas interview questions on Data Aggregation and Grouping

How do you group data by a specific column in pandas and calculate the sum?

Use groupby() followed by sum().

import pandas as pd

# Sample DataFrame
data = {'column_name': ['A', 'B', 'A', 'B', 'A'],
        'values': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data)

# Group by 'column_name' and calculate the sum
result = df.groupby('column_name').sum()
print(result)

Output

             values
column_name
A                90
B                60
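
To aggregate just one column rather than every numeric column, select it after the groupby (this returns a Series):

# Sum only the 'values' column per group
result = df.groupby('column_name')['values'].sum()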

 

How can you get the count of unique values in a column using pandas?

Use nunique() to count distinct values.

import pandas as pd

# Sample DataFrame
data = {'column_name': ['A', 'B', 'A', 'B', 'C']}

df = pd.DataFrame(data)

# Count unique values in 'column_name'
unique_count = df['column_name'].nunique()
print(f"Unique values count: {unique_count}")

Output

Unique values count: 3
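
Relatedly, value_counts() reports how often each distinct value occurs (here A and B appear twice, and C once):

# Count occurrences of each distinct value
counts = df['column_name'].value_counts()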

 

What is the agg() function in pandas and how do you use it?

The agg() function in pandas is used to apply one or more aggregation operations to the columns of a DataFrame or Series.

import pandas as pd

# Sample DataFrame
data = {'category': ['A', 'B', 'A', 'B', 'A'],
        'value1': [10, 20, 30, 40, 50],
        'value2': [100, 200, 300, 400, 500]}

df = pd.DataFrame(data)

# Using agg() with multiple aggregation functions
result = df.groupby('category').agg({
    'value1': ['sum', 'mean'],  # Sum and mean of 'value1'
    'value2': 'max'             # Maximum of 'value2'
})

print(result)

Output

         value1      value2
            sum mean    max
category
A            90   30    500
B            60   30    400
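
agg() also supports named aggregation, which produces flat, readable column names instead of a MultiIndex:

# Named aggregation: output columns are named on the left
result = df.groupby('category').agg(
    value1_sum=('value1', 'sum'),
    value1_mean=('value1', 'mean'),
    value2_max=('value2', 'max')
)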

Pandas interview questions on Filtering and Selection

How do you filter rows in pandas based on a single condition and multiple conditions?

Let’s take an example where we want to keep only the rows whose value in a particular column is greater than 10.

import pandas as pd

data = {'column_name': [5, 15, 20, 8, 30]}
df = pd.DataFrame(data)

# Filter rows where column_name > 10
filtered_df = df[df['column_name'] > 10]
print(filtered_df)

Output

   column_name
1           15
2           20
4           30

For multiple conditions, we can combine filters with the & (AND) and | (OR) operators; each condition must be wrapped in parentheses.

Let’s consider an example where column1 > 10 and column2 < 50.

data = {'column1': [5, 15, 20, 8, 30], 'column2': [25, 40, 30, 45, 50]}
df = pd.DataFrame(data)

# Filter rows where column1 > 10 and column2 < 50
filtered_df = df[(df['column1'] > 10) & (df['column2'] < 50)]
print(filtered_df)
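
Output

   column1  column2
1       15       40
2       20       30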

 

How can you select columns from a pandas DataFrame by their names?

To select specific columns from a DataFrame by their names, pass a list of column names to the DataFrame.

data = {'column1': [1, 2, 3], 'column2': [4, 5, 6], 'column3': [7, 8, 9]}
df = pd.DataFrame(data)

# Select specific columns by name
selected_columns = df[['column1', 'column2']]
print(selected_columns)

Output

   column1  column2
0        1        4
1        2        5
2        3        6


 

What does the query() function do in pandas and how do you use it?

The query() function lets us filter a DataFrame with a string expression, which makes code more readable for complex conditions.

data = {'column1': [5, 15, 20, 8, 30], 'column2': [25, 40, 30, 45, 50]}
df = pd.DataFrame(data)

# Using query() for filtering
filtered_df = df.query('column1 > 10 and column2 < 50')
print(filtered_df)

Output

   column1  column2
1       15       40
2       20       30
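
query() can also reference Python variables by prefixing them with @:

# Filter using an external variable
threshold = 10
filtered_df = df.query('column1 > @threshold')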


How can you filter rows where a column value falls within a specific range using pandas?

We can use the between() function to filter rows where column values fall within a specific range.

data = {'column_name': [5, 15, 20, 8, 30]}
df = pd.DataFrame(data)

# Filter rows where column_name is between 10 and 20 (inclusive)
filtered_df = df[df['column_name'].between(10, 20)]
print(filtered_df)

Output

   column_name
1           15
2           20



 

How do you check if values from one column exist in another column using isin() in pandas?

The isin() method in pandas checks whether each element of a column is contained in another column, list, or other iterable.

data = {'column1': ['A', 'B', 'C', 'D'], 'column2': ['B', 'C', 'E', 'F']}
df = pd.DataFrame(data)

# Check if values in column1 exist in column2
filtered_df = df[df['column1'].isin(df['column2'])]
print(filtered_df)

Output

   column1  column2
1        B        C
2        C        E

How can you select rows by index in pandas?

We can use loc[] for label-based indexing or iloc[] for position-based indexing to select rows.

By index label:

data = {'column_name': [5, 15, 20, 8, 30]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# Select row by index label
row_by_label = df.loc['c']
print(row_by_label)

Output

column_name    20
Name: c, dtype: int64

By index position:

# Select row by position (2nd row)
row_by_position = df.iloc[2]
print(row_by_position)

Output

column_name    20
Name: c, dtype: int64

Pandas interview questions on Merging and Joining DataFrames

What is the difference between merge() and join() in pandas?

merge() is more flexible: it lets you specify the columns (or indexes) to merge on and supports various types of join (inner, outer, left, or right).

join() is mainly used to combine DataFrames on their index (or on a key column of the caller). It is the simpler option when you want to align data by index.
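
A minimal sketch of the difference (the DataFrames here are illustrative):

import pandas as pd

left = pd.DataFrame({'value_l': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'value_r': [3, 4]}, index=['b', 'c'])

# join() aligns on the index by default
joined = left.join(right, how='inner')

# merge() needs the keys spelled out (here, the indexes)
merged = pd.merge(left, right, left_index=True, right_index=True, how='inner')

Both produce a single row for the shared index 'b'.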


How do you merge two pandas DataFrames on a specific column?

import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'age': [25, 30, 35]
})

# Merging on a specific column (e.g., 'id')
merged_df = pd.merge(df1, df2, on='id')
print("Merged DataFrame on 'id':")
print(merged_df)

Output

Merged DataFrame on 'id':

   id     name  age
0   2      Bob   25
1   3  Charlie   30

How can you perform an outer join between two pandas DataFrames?

import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'age': [25, 30, 35]
})

# Performing an outer join on 'id'
outer_joined_df = pd.merge(df1, df2, on='id', how='outer')
print("\nOuter Join on 'id':")
print(outer_joined_df)

Output

Outer Join on 'id':

   id     name   age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0
3   4      NaN  35.0

Pandas interview questions on Data Cleaning and Transformation

How can you change the data type of a column in pandas?

To change the data type of a column in pandas we can use the astype() method.

import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'age': ['25', '30', '35']
})

# Changing the 'age' column to integer type
df['age'] = df['age'].astype(int)
print(df)

Output

   age
0   25
1   30
2   35

How do you replace missing values in a pandas DataFrame?

We can use the fillna() method.

import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({
    'name': ['Alice', 'Bob', np.nan],
    'age': [25, np.nan, 35]
})

# Replacing missing values with a specific value
df_filled = df.fillna({'name': 'Unknown', 'age': 30})
print(df_filled)

Output

      name   age
0    Alice  25.0
1      Bob  30.0
2  Unknown  35.0
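
Missing values are also often replaced with a column statistic, such as the mean:

# Fill missing ages with the mean of the column
df['age'] = df['age'].fillna(df['age'].mean())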

How can you normalize a column in pandas?

To normalize data we can use either min-max normalization or z-score normalization, depending on our needs.

Min-Max Normalization

This scales the data to a range between 0 and 1.

import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'value': [10, 20, 30, 40, 50]
})

# Min-Max Normalization
df['value_normalized'] = (df['value'] - df['value'].min()) / (df['value'].max() - df['value'].min())
print(df)

Output

   value  value_normalized
0     10              0.00
1     20              0.25
2     30              0.50
3     40              0.75
4     50              1.00

Z-Score Normalization

In z-score normalization, we subtract the mean of the dataset from each value and then divide by the standard deviation.

import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'value': [10, 20, 30, 40, 50]
})
# Z-Score Normalization
df['value_normalized'] = (df['value'] - df['value'].mean()) / df['value'].std()
print(df)

Output

   value  value_normalized
0     10         -1.264911
1     20         -0.632456
2     30          0.000000
3     40          0.632456
4     50          1.264911

(Pandas’ .std() uses the sample standard deviation, ddof=1, by default.)

How do you handle missing data with backward fill (bfill) in pandas?

The bfill (backward fill) method fills each missing value with the next valid value below it. If no later valid value exists, as in the last row of the example, the value stays NaN.

import pandas as pd
import numpy as np
# Sample DataFrame with missing values (NaN)
df = pd.DataFrame({
    'name': ['Alice', np.nan, 'Charlie', np.nan],
    'age': [25, np.nan, 35, np.nan]
})

# Backward fill to replace missing values with the next valid value
# (fillna(method='bfill') also works but is deprecated in recent pandas)
df_bfilled = df.bfill()
print(df_bfilled)

Output

      name   age
0    Alice  25.0
1  Charlie  35.0
2  Charlie  35.0
3      NaN   NaN

How does the qcut() function work in pandas for binning data?

qcut() splits input data into q bins, where q is the number of quantiles you want. Each bin will contain approximately the same number of data points.

Returns a categorical object indicating which bin each data point belongs to.

pandas.qcut(x, q, labels=False, ...)
  • x is the input data like a series which needs to be binned.
  • q is the number of quantiles or bins we want. Here we provide an integer value.
  • labels: pass False to get integer bin identifiers back, or a list of names to label the bins.
import pandas as pd

# Sample DataFrame with continuous data
df = pd.DataFrame({
    'score': [15, 20, 35, 40, 50, 60, 70, 85, 90, 100]
})

# Binning the 'score' column into 4 equal-frequency bins using qcut()
df['score_binned'] = pd.qcut(df['score'], q=4, labels=False)

print(df)

Output

   score  score_binned
0     15             0
1     20             0
2     35             0
3     40             1
4     50             1
5     60             2
6     70             2
7     85             3
8     90             3
9    100             3

Pandas interview questions on Reading and Writing Data

How do you read a CSV file into a pandas DataFrame?

import pandas as pd

# Reading a CSV file into a DataFrame
df = pd.read_csv('your_file.csv')
# Displaying the first few rows of the DataFrame
print(df.head())

Output (depends on the contents of your_file.csv; for example):

       Name  Age
0        JK   25
1  Aakansha   24

How do you write a pandas DataFrame to an Excel file?

import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Writing the DataFrame to an Excel file
# (requires an Excel engine such as openpyxl to be installed)
df.to_excel('output.xlsx', index=False)
print("DataFrame has been written to 'output.xlsx'")

Output

DataFrame has been written to 'output.xlsx'

How do you write a pandas DataFrame to a database using to_sql()?

import pandas as pd
import sqlite3
# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Create a connection to an SQLite database (or any database you use)
conn = sqlite3.connect('example.db')  # Replace with your database connection

# Writing DataFrame to SQL database
df.to_sql('people', conn, if_exists='replace', index=False)
# Close the connection
conn.close()

print("DataFrame has been written to the database.")

Output

DataFrame has been written to the database.
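
To verify the write, the table can be read back with read_sql():

import sqlite3
conn = sqlite3.connect('example.db')
df_back = pd.read_sql('SELECT * FROM people', conn)
conn.close()
print(df_back)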

Pandas interview questions on Visualization

How do you create a bar plot using pandas?

We can use the .plot() method with the kind='bar' argument, which generates a vertical bar plot.

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Creating a bar plot for the 'age' column
df.plot(kind='bar', x='name', y='age', legend=False)

# Display the plot
plt.show()


How can you visualize the distribution of data using a boxplot in pandas?

Similar to how we plot a bar chart; just pass kind='box' instead.

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'age': [23, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
})

# Creating a boxplot for the 'age' column
df.plot(kind='box', y='age')

# Display the plot
plt.show()


Which function is used to create a scatter plot in pandas?

Use the df.plot() method with kind='scatter', passing the x and y column names.
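
A minimal sketch (the column names here are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'height': [150, 160, 170, 180, 190],
    'weight': [50, 58, 65, 80, 90]
})

# Scatter plot of 'weight' against 'height'
df.plot(kind='scatter', x='height', y='weight')
plt.show()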



How do you generate a heatmap from a pandas DataFrame?

Pandas itself does not provide a heatmap function; a common approach is to pass the DataFrame to seaborn's heatmap().

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame with numerical values
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}

df = pd.DataFrame(data)

# Create a heatmap using seaborn
plt.figure(figsize=(8, 6))  # Optional: Set the size of the plot
sns.heatmap(df, annot=True, cmap='coolwarm', linewidths=0.5)

# Show the plot
plt.title('Heatmap Example')
plt.show()
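
In practice, a heatmap is often drawn over a correlation matrix rather than the raw values:

# Heatmap of pairwise correlations between columns
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()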

Pandas interview questions on Statistical Functions

How do you calculate correlation between two columns in pandas?

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': [2, 4, 5, 4, 5]
})

# Calculating the correlation between 'col1' and 'col2'
correlation = df['col1'].corr(df['col2'])

print(f"Correlation between col1 and col2: {correlation}")

Output

Correlation between col1 and col2: 0.7745966692414834

 

What function in pandas can be used to calculate variance for a column?

We can use the .var() function to calculate the variance of a column.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'data': [10, 20, 30, 40, 50]
})

# Calculating the variance of the 'data' column
variance = df['data'].var()

print(f"Variance of the 'data' column: {variance}")

Output

Variance of the 'data' column: 250.0

Note that pandas’ .var() computes the sample variance (ddof=1) by default.

 

How can you calculate the quantiles of a DataFrame using pandas?

To calculate quantiles of a DataFrame column we can use the .quantile() method. Here is an example:

import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'data1': [1, 2, 3, 4, 5, 6, 7],
    'data2': [10, 20, 30, 40, 50, 60, 70]
})

# Calculate the 25th, 50th, and 75th percentiles (quantiles)
quantiles_data1 = df['data1'].quantile([0.25, 0.5, 0.75])
quantiles_data2 = df['data2'].quantile([0.25, 0.5, 0.75])

print(f"Quantiles for 'data1':\n{quantiles_data1}")
print(f"\nQuantiles for 'data2':\n{quantiles_data2}")

Output

Quantiles for 'data1':
0.25    2.5
0.50    4.0
0.75    5.5
Name: data1, dtype: float64

Quantiles for 'data2':
0.25    25.0
0.50    40.0
0.75    55.0
Name: data2, dtype: float64
