Pandas is an essential library for data manipulation in Python. Data loading, data cleaning, and data transformation can all be done with pandas.
In this blog, we will go through 55 pandas interview questions.
Pandas interview questions on DataFrame Creation and Manipulation
How can you create a pandas DataFrame from a dictionary?
import pandas as pd

# Create a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

# Create DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
Output
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
What are the methods for adding a new column to a pandas DataFrame?
There are different methods for adding a column:
1. Direct assignment
We can directly assign a new column by specifying the column name and the values as a list, Series, etc.
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Add a new column by direct assignment
df['City'] = ['New York', 'Los Angeles', 'Chicago']
2. Using the assign() method
df = df.assign(Country=['USA', 'USA', 'USA'])
print(df)
Output
      Name  Age Country
0    Alice   25     USA
1      Bob   30     USA
2  Charlie   35     USA
3. Using the insert() method
The insert() method allows us to insert a new column at a particular position in the DataFrame.
# Insert a new column at the second position (index 1)
df.insert(1, 'Gender', ['Female', 'Male', 'Male'])
print(df)
Output
      Name  Gender  Age Country
0    Alice  Female   25     USA
1      Bob    Male   30     USA
2  Charlie    Male   35     USA
4. Using loc[]
The loc[] indexer can also be used to add a new column to a DataFrame.
df.loc[:, 'Salary'] = [50000, 60000, 70000]
print(df)
Output
      Name  Gender  Age Country  Salary
0    Alice  Female   25     USA   50000
1      Bob    Male   30     USA   60000
2  Charlie    Male   35     USA   70000
How do you set a column as the index of a pandas DataFrame?
pandas provides the set_index() method; its two most commonly used parameters are the column name and inplace.
df.set_index('column_name', inplace=False)
- “column_name” is the name of the column you want to set as the index.
- “inplace=True” modifies the original DataFrame, while inplace=False (default) returns a new DataFrame with the updated index.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Set 'Name' column as the index
df_new = df.set_index('Name')
The code above sets the Name column as the index.
How do you reset the index of a pandas DataFrame?
The syntax for resetting the index:
df.reset_index(drop=False, inplace=False)
- “drop=False”: This keeps the current index as a column in the DataFrame. If you set it to True, the index will be removed and not added as a column.
- “inplace=False”: By default, reset_index() returns a new DataFrame. If you want to modify the DataFrame in place, you can set inplace=True.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=['a', 'b', 'c'])

# Reset the index without dropping the index column
df_reset = df.reset_index()
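To make the effect of the drop parameter concrete, here is a minimal sketch (the two-row DataFrame is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob']}, index=['a', 'b'])

# drop=False (default): the old index becomes a regular column
print(df.reset_index())
#   index   Name
# 0     a  Alice
# 1     b    Bob

# drop=True: the old index is discarded entirely
print(df.reset_index(drop=True))
#     Name
# 0  Alice
# 1    Bob
```

In both cases the DataFrame gets a fresh default integer index; drop only controls whether the old index survives as a column.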
What are the ways to delete a column from a pandas DataFrame?
We can use the drop() method, the del statement, the pop() method, or loc[]-based column selection to delete a column from a pandas DataFrame.
# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
Using the drop() method

# Drop the 'Age' column
df_dropped = df.drop(columns=['Age'])
Using the del statement
del df['Age']
Using the pop() method
city_column = df.pop('City')
Using loc[]
While loc and iloc are generally used for indexing, you can use them to select specific columns that you want to keep, effectively removing the ones you don’t want.
# Select all columns except 'City' using loc
df = df.loc[:, df.columns != 'City']
print(df)
How can you transpose a pandas DataFrame?
The .T attribute is a shorthand for transposing the DataFrame.
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Before transpose
print(df)
Output
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
# Transpose the DataFrame
df_transposed = df.T
print(df_transposed)
Output
             0            1        2
Name     Alice          Bob  Charlie
Age         25           30       35
City  New York  Los Angeles  Chicago
Pandas interview questions on Data Aggregation and Grouping
How do you group data by a specific column in pandas and calculate the sum?
Use groupby() followed by sum():
import pandas as pd

# Sample DataFrame
data = {'column_name': ['A', 'B', 'A', 'B', 'A'],
        'values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Group by 'column_name' and calculate the sum
result = df.groupby('column_name').sum()
print(result)
Output
             values
column_name
A                90
B                60
How can you get the count of unique values in a column using pandas?
Use nunique() to count distinct values:
import pandas as pd

# Sample DataFrame
data = {'column_name': ['A', 'B', 'A', 'B', 'C']}
df = pd.DataFrame(data)

# Count unique values in 'column_name'
unique_count = df['column_name'].nunique()
print(f"Unique values count: {unique_count}")
Output
Unique values count: 3
What is the agg() function in pandas and how do you use it?
The agg() function in pandas is used to apply one or more aggregation operations on columns of a DataFrame or series.
import pandas as pd

# Sample DataFrame
data = {'category': ['A', 'B', 'A', 'B', 'A'],
        'value1': [10, 20, 30, 40, 50],
        'value2': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)

# Using agg() with multiple aggregation functions
result = df.groupby('category').agg({
    'value1': ['sum', 'mean'],  # Sum and mean of 'value1'
    'value2': 'max'             # Maximum of 'value2'
})
print(result)
Output
         value1       value2
            sum  mean    max
category
A            90  30.0    500
B            60  30.0    400
Pandas interview questions on Filtering and Selection
How do you filter rows in pandas based on a single condition and multiple conditions?
Let's take an example where we want to keep only the rows whose column value is greater than 10.
import pandas as pd

data = {'column_name': [5, 15, 20, 8, 30]}
df = pd.DataFrame(data)

# Filter rows where column_name > 10
filtered_df = df[df['column_name'] > 10]
print(filtered_df)
Output
   column_name
1           15
2           20
4           30
For multiple conditions, we can combine them with the AND (&) and OR (|) operators.
Let's consider an example where column1 > 10 and column2 < 50.
data = {'column1': [5, 15, 20, 8, 30],
        'column2': [25, 40, 30, 45, 50]}
df = pd.DataFrame(data)

# Filter rows where column1 > 10 and column2 < 50
filtered_df = df[(df['column1'] > 10) & (df['column2'] < 50)]
print(filtered_df)
How can you select columns from a pandas DataFrame by their names?
To select specific columns from a DataFrame by their names, pass a list of column names to the DataFrame.
data = {'column1': [1, 2, 3], 'column2': [4, 5, 6], 'column3': [7, 8, 9]}
df = pd.DataFrame(data)

# Select specific columns by name
selected_columns = df[['column1', 'column2']]
print(selected_columns)
Output
   column1  column2
0        1        4
1        2        5
2        3        6
What does the query() function do in pandas and how do you use it?
The query() function allows us to filter a DataFrame with a string expression, which makes the code more readable for complex conditions.
data = {'column1': [5, 15, 20, 8, 30],
        'column2': [25, 40, 30, 45, 50]}
df = pd.DataFrame(data)

# Using query() for filtering
filtered_df = df.query('column1 > 10 and column2 < 50')
print(filtered_df)
Output
   column1  column2
1       15       40
2       20       30
How can you filter rows where a column value falls within a specific range using pandas?
We can use the between() function to filter rows where column values fall within a specific range.
data = {'column_name': [5, 15, 20, 8, 30]}
df = pd.DataFrame(data)

# Filter rows where column_name is between 10 and 20 (inclusive)
filtered_df = df[df['column_name'].between(10, 20)]
print(filtered_df)
Output
   column_name
1           15
2           20
How do you check if values from one column exist in another column using isin() in pandas?
The isin() method in pandas checks whether each element of a column is contained in another column or in a list.
data = {'column1': ['A', 'B', 'C', 'D'],
        'column2': ['B', 'C', 'E', 'F']}
df = pd.DataFrame(data)

# Check if values in column1 exist in column2
filtered_df = df[df['column1'].isin(df['column2'])]
print(filtered_df)
Output
  column1 column2
1       B       C
2       C       E
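isin() also accepts a plain Python list, which is handy for whitelisting specific values. A minimal sketch (the column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'column1': ['A', 'B', 'C', 'D']})

# Keep only rows whose value appears in the given list
filtered = df[df['column1'].isin(['B', 'D'])]
print(filtered)
#   column1
# 1       B
# 3       D
```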
How can you select rows by index in pandas?
We can use loc[] for the label-based indexing or iloc[] for position-based indexing to select rows.
By index label:
data = {'column_name': [5, 15, 20, 8, 30]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# Select row by index label
row_by_label = df.loc['c']
print(row_by_label)
Output
column_name    20
Name: c, dtype: int64
By index position:
# Select row by position (the 3rd row, at position 2)
row_by_position = df.iloc[2]
print(row_by_position)
Output
column_name    20
Name: c, dtype: int64
Pandas interview questions on Merging and Joining DataFrames
What is the difference between merge() and join() in pandas?
merge() is more flexible: it lets you specify the columns to merge on and supports different join types (inner, outer, left, and right).
join() is mainly used to combine DataFrames based on their index (or on a key column of the calling DataFrame). It is a simpler interface when you want to align data by index.
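The contrast above can be sketched with a small example (the DataFrames and column names are illustrative): join() aligns on the index by default, while merge() needs to be told explicitly which keys to use.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[1, 2])
df2 = pd.DataFrame({'age': [25, 30]}, index=[2, 3])

# join() aligns on the index by default (left join)
joined = df1.join(df2)
print(joined)
#     name   age
# 1  Alice   NaN
# 2    Bob  25.0

# The equivalent merge() call must name the keys explicitly
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
```

Both calls produce the same result here; join() is simply the shorter spelling for the index-aligned case.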
How do you merge two pandas DataFrames on a specific column?
import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'age': [25, 30, 35]
})

# Merging on a specific column (e.g., 'id')
merged_df = pd.merge(df1, df2, on='id')
print("Merged DataFrame on 'id':")
print(merged_df)
Output
Merged DataFrame on 'id':
   id     name  age
0   2      Bob   25
1   3  Charlie   30
How can you perform an outer join between two pandas DataFrames?
import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'age': [25, 30, 35]
})

# Performing an outer join on 'id'
outer_joined_df = pd.merge(df1, df2, on='id', how='outer')
print("\nOuter Join on 'id':")
print(outer_joined_df)
Output
Outer Join on 'id':
   id     name   age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0
3   4      NaN  35.0
Pandas interview questions on Data Cleaning and Transformation
How can you change the data type of a column in pandas?
To change the data type of a column in pandas we can use the astype() method.
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'age': ['25', '30', '35']
})

# Changing the 'age' column to integer type
df['age'] = df['age'].astype(int)
print(df)
Output
   age
0   25
1   30
2   35
How do you replace missing values in a pandas DataFrame?
We can use the fillna() method.
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({
    'name': ['Alice', 'Bob', np.nan],
    'age': [25, np.nan, 35]
})

# Replacing missing values with a specific value per column
df_filled = df.fillna({'name': 'Unknown', 'age': 30})
print(df_filled)
Output
      name   age
0    Alice  25.0
1      Bob  30.0
2  Unknown  35.0
How can you normalize a column in pandas?
To normalize data, we can use either min-max normalization or z-score normalization, depending on our needs.
Min Max Normalization
This scales the data to a range between 0 and 1.
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'value': [10, 20, 30, 40, 50]
})

# Min-Max Normalization
df['value_normalized'] = (df['value'] - df['value'].min()) / (df['value'].max() - df['value'].min())
print(df)
Output
   value  value_normalized
0     10              0.00
1     20              0.25
2     30              0.50
3     40              0.75
4     50              1.00
Z-Score Normalization
In z-score normalization, the mean of the dataset is subtracted from each value, and the result is divided by the standard deviation.
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'value': [10, 20, 30, 40, 50]
})

# Z-Score Normalization (note: pandas .std() uses the sample standard deviation, ddof=1)
df['value_normalized'] = (df['value'] - df['value'].mean()) / df['value'].std()
print(df)
Output
   value  value_normalized
0     10         -1.264911
1     20         -0.632456
2     30          0.000000
3     40          0.632456
4     50          1.264911
How do you handle missing data with backward fill (bfill) in pandas?
Backward fill (bfill) replaces each missing value with the next valid value that comes after it. Older pandas code used fillna(method='bfill'); that form is deprecated, and the dedicated bfill() method is now preferred.
import pandas as pd
import numpy as np

# Sample DataFrame with missing values (NaN)
df = pd.DataFrame({
    'name': ['Alice', np.nan, 'Charlie', np.nan],
    'age': [25, np.nan, 35, np.nan]
})

# Backward fill: replace missing values with the next valid value
df_bfilled = df.bfill()
print(df_bfilled)
Output
      name   age
0    Alice  25.0
1  Charlie  35.0
2  Charlie  35.0
3      NaN   NaN
How does the qcut() function work in pandas for binning data?
qcut() splits input data into q bins, where q is the number of quantiles you want. Each bin will contain approximately the same number of data points.
It returns a categorical object (or integer codes, when labels=False) indicating which bin each data point belongs to.
pandas.qcut(x, q, labels=False, ...)
- x is the input data like a series which needs to be binned.
- q is the number of quantiles or bins we want. Here we provide an integer value.
- labels: set labels=False to return integer bin indices; alternatively, pass a list of strings to name the bins.
import pandas as pd

# Sample DataFrame with continuous data
df = pd.DataFrame({
    'score': [15, 20, 35, 40, 50, 60, 70, 85, 90, 100]
})

# Binning the 'score' column into 4 equal-frequency bins using qcut()
df['score_binned'] = pd.qcut(df['score'], q=4, labels=False)
print(df)
Output
   score  score_binned
0     15             0
1     20             0
2     35             0
3     40             1
4     50             1
5     60             2
6     70             2
7     85             3
8     90             3
9    100             3
Pandas interview questions on Reading and Writing Data
Write code for reading a CSV file into a pandas DataFrame.
import pandas as pd

# Reading a CSV file into a DataFrame
df = pd.read_csv('your_file.csv')

# Displaying the first few rows of the DataFrame
print(df.head())
Output (depends on the contents of your_file.csv)
       Name  Age
1        JK   25
2  Aakansha   24
Give code for writing a pandas DataFrame to an Excel file.
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Writing the DataFrame to an Excel file (requires an Excel writer package such as openpyxl)
df.to_excel('output.xlsx', index=False)
print("DataFrame has been written to 'output.xlsx'")
Output
DataFrame has been written to 'output.xlsx'
How do you write a pandas DataFrame to a database using to_sql()?
import pandas as pd
import sqlite3

# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Create a connection to an SQLite database (or any database you use)
conn = sqlite3.connect('example.db')  # Replace with your database connection

# Writing DataFrame to SQL database
df.to_sql('people', conn, if_exists='replace', index=False)

# Close the connection
conn.close()

print("DataFrame has been written to the database.")
Output
DataFrame has been written to the database.
Pandas interview questions on Visualization
How do you create a bar plot in pandas?
We can use the .plot() method with the kind='bar' argument, which generates a vertical bar plot.
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Creating a bar plot for the 'age' column
df.plot(kind='bar', x='name', y='age', legend=False)

# Display the plot
plt.show()
How can you visualize the distribution of data using a boxplot in pandas?
Similar to how we plot a bar chart; just use kind='box' instead.
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'age': [23, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
})

# Creating a boxplot for the 'age' column
df.plot(kind='box', y='age')

# Display the plot
plt.show()
Which function is used to create a scatter plot in pandas?
Use the df.plot() function with the kind parameter set to 'scatter', along with the x and y column names.
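As a minimal sketch (the height/weight data here is illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame with two numeric columns
df = pd.DataFrame({
    'height': [150, 160, 170, 180, 190],
    'weight': [50, 60, 65, 80, 90]
})

# Scatter plots require explicit x and y columns
ax = df.plot(kind='scatter', x='height', y='weight')

# Display the plot
plt.show()
```

Unlike bar or box plots, kind='scatter' raises an error if x and y are not both specified, since it needs two numeric columns to plot against each other.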
Give code to generate a heatmap from a pandas DataFrame.
pandas has no built-in heatmap plot, so seaborn's heatmap() function is commonly used with a DataFrame:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame with numerical values
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# Create a heatmap using seaborn
plt.figure(figsize=(8, 6))  # Optional: set the size of the plot
sns.heatmap(df, annot=True, cmap='coolwarm', linewidths=0.5)

# Show the plot
plt.title('Heatmap Example')
plt.show()
Pandas interview questions on Statistical Functions
How do you calculate correlation between two columns in pandas?
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': [2, 4, 5, 4, 5]
})

# Calculating the correlation between 'col1' and 'col2'
correlation = df['col1'].corr(df['col2'])
print(f"Correlation between col1 and col2: {correlation}")
Output
Correlation between col1 and col2: 0.7745966692414834
What function in pandas can be used to calculate variance for a column?
We can use the .var() function to calculate the variance of a column.
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'data': [10, 20, 30, 40, 50]
})

# Calculating the variance of the 'data' column (sample variance, ddof=1)
variance = df['data'].var()
print(f"Variance of the 'data' column: {variance}")
Output
Variance of the 'data' column: 250.0
How can you calculate the quantiles of a DataFrame using pandas?
To calculate quantiles of a DataFrame, we can use the .quantile() method. Here is an example:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'data1': [1, 2, 3, 4, 5, 6, 7],
    'data2': [10, 20, 30, 40, 50, 60, 70]
})

# Calculate the 25th, 50th, and 75th percentiles (quantiles)
quantiles_data1 = df['data1'].quantile([0.25, 0.5, 0.75])
quantiles_data2 = df['data2'].quantile([0.25, 0.5, 0.75])

print(f"Quantiles for 'data1':\n{quantiles_data1}")
print(f"\nQuantiles for 'data2':\n{quantiles_data2}")
Output
Quantiles for 'data1':
0.25    2.5
0.50    4.0
0.75    5.5
Name: data1, dtype: float64

Quantiles for 'data2':
0.25    25.0
0.50    40.0
0.75    55.0
Name: data2, dtype: float64