Pandas Interview Questions
Pandas is a software library written for Python that is mainly used to analyze and manipulate data. It is an open-source, cross-platform library written by Wes McKinney and released in 2008. It offers data structures and operations for manipulating numerical and time-series data.

You can install Pandas using pip or with the Anaconda distribution. With this package, you can easily and quickly perform data analysis and manipulation operations on tabular data.

We can analyze data in pandas using :
 
* Series
* DataFrames

Pandas is free software released under the three-clause BSD license.
Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.)
 
Axis labels are collectively called the index. A Pandas Series is analogous to a single column in an Excel sheet.
 
Creating a Series from an array :

import pandas as pd
import numpy as np
# build a Series from the characters of the word 'pandas'
data = np.array(['p', 'a', 'n', 'd', 'a', 's'])
myseries = pd.Series(data)
print(myseries)

Output :
0    p
1    a
2    n
3    d
4    a
5    s
dtype: object
DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a dict of Series objects.
 
Creating a DataFrame from a dictionary :

import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

print(df)

 

Output :

   col1  col2
0     1     3
1     2     4
A DataFrame in Pandas can be created either directly from a dictionary or by combining multiple Series.
 
import pandas as pd
country_population = {'India': 1600000000, 'China': 1730000000, 'USA': 390000000, 'UK': 450000000}
population = pd.Series(country_population)
#print(population)

country_land = {'India': '2547869 hectares', 'China': '9543578 hectares', 'USA': '5874658 hectares',  'UK': '6354652 hectares'}
area = pd.Series(country_land)
#print(area)

df = pd.DataFrame({'Population': population, 'SpaceOccupied': area})
print(df)
 
Output :
 
       Population     SpaceOccupied
India  1600000000  2547869 hectares
China  1730000000  9543578 hectares
USA     390000000  5874658 hectares
UK      450000000  6354652 hectares
 
Indexes can be created using the Pandas Index function. Indexes support set operations such as intersection and union.
 
import pandas as pd
index_A = pd.Index([1, 3, 5, 7, 9])
index_B = pd.Index([2, 3, 5, 7, 11])
print(index_A.intersection(index_B))  # elements common to both: 3, 5, 7
print(index_A.union(index_B))         # all elements: 1, 2, 3, 5, 7, 9, 11
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection.

The axis labeling information in pandas objects serves many purposes :
 
* Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
* Enables automatic and explicit data alignment.
* Allows intuitive getting and setting of subsets of the data set.
 
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.
Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.
 
Multiple operations can be accomplished through indexing like :
 
* Reorder the existing data to match a new set of labels.
* Insert missing value (NA) markers in label locations where no data for the label existed.
 
Example :

import pandas as pd
import numpy as np

N=20

df = pd.DataFrame({
   'A': pd.date_range(start='2021-01-18',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})

#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print(df_reindexed)

 

Output :

           A       C   B
0 2021-01-18  Medium NaN
2 2021-01-20    High NaN
5 2021-01-23     Low NaN

 

Reindexing can introduce NaN values; the .bfill and .ffill methods are used to handle them :
 
bfill : backward fill; fills a NaN with the next valid value below it.
ffill : forward fill; fills a NaN with the last valid value above it.

Reindexing without using any method (bfill or ffill) :
 
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

print(df2.reindex_like(df1))

 

Output :

       col1      col2      col3
0 -0.641715  1.031070 -0.208415
1 -1.560385 -0.584403  0.291666
2       NaN       NaN       NaN
3       NaN       NaN       NaN

 

Reindexing using a method (ffill) :

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

print(df2.reindex_like(df1, method='ffill'))


Output :

       col1      col2      col3
0  1.332612 -0.479218 -1.016999
1 -1.091319 -0.844934 -0.492755
2 -1.091319 -0.844934 -0.492755
3 -1.091319 -0.844934 -0.492755

 

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

print(df2.reindex_like(df1, method='bfill'))

 

Output :

       col1      col2      col3
0  0.526663 -0.450748  0.791112
1 -1.805287  0.641050  1.864871
2       NaN       NaN       NaN
3       NaN       NaN       NaN

 
 
Some of the major features of Python Pandas are,
 
* Fast and efficient handling of data with its DataFrame object.
* Tools for loading data into in-memory data objects from various file formats.
* High-performance merging and joining of data sets.
* Time Series functionality.
* Label-based slicing, fancy indexing, and subsetting of large data sets.
* Reshaping and pivoting of data sets.
Different types of data structures available in Pandas are,
 
Series : a one-dimensional, homogeneous array data structure that is immutable in size (values are mutable, but the length cannot change).

DataFrame : a tabular data structure which comprises rows and columns. Here, both data and size are mutable.

Panel : a three-dimensional data structure for storing data heterogeneously. It was deprecated and removed in pandas 0.25; multi-dimensional data is now handled with a MultiIndex DataFrame.
GroupBy is used to split the data into groups. It groups the data based on some criteria. Grouping also provides a mapping of labels to the group names. It has a lot of variations that can be defined with the parameters and makes the task of splitting the data quick and easy.
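As a minimal sketch of splitting and aggregating, using made-up team/points data for illustration:

```python
import pandas as pd

# Hypothetical scores grouped by team
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'points': [10, 15, 7, 3]})

# Split the rows by 'team', then sum each group's points
totals = df.groupby('team')['points'].sum()
print(totals)
```

Each distinct value of team becomes a group label in the result, so totals['A'] is 25 and totals['B'] is 10.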
Pylab is a module in the Matplotlib library that acts as a procedural interface to Matplotlib. Matplotlib is an object-oriented plotting library. Pylab combines Matplotlib with the NumPy module for graphical plotting. It is not a separate package but is embedded inside Matplotlib to provide a MATLAB-like experience for the user.
Matplotlib is the most popular data visualization library used to plot data. This comprehensive library is used for creating static, animated, and interactive visualizations. Developed by John D. Hunter, this open-source library was first released in 2003. Matplotlib also provides various toolkits that extend its functionality, such as Basemap, Cartopy, GTK tools, and more.
Some of the statistical functions in Python Pandas are :
 
sum() : returns the sum of the values.
 
mean() : returns the mean, i.e. the average of the values.
 
std() : returns the standard deviation of the numerical columns.
 
min() : returns the minimum value.
 
max() : returns the maximum value.
 
abs() : returns the absolute value.
 
prod() : returns the product of the values.
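A short illustration of these functions on a small made-up Series:

```python
import pandas as pd

s = pd.Series([2, -4, 6])

print(s.sum())            # 4
print(s.mean())           # 1.333...
print(s.min(), s.max())   # -4 6
print(s.abs().tolist())   # [2, 4, 6]
print(s.prod())           # -48
```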
loc : slices a DataFrame based on labels.

iloc : slices a DataFrame based on integer positions.

ix : sliced a DataFrame based on both labels and integers; it was deprecated and has been removed in modern versions of pandas.
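A quick sketch of the difference between label-based and position-based lookup, with invented index labels:

```python
import pandas as pd

df = pd.DataFrame({'col1': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df.loc['b', 'col1'])   # label-based lookup -> 20
print(df.iloc[1, 0])         # integer-position lookup -> 20
```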
NaN values in a Pandas DataFrame can be handled in the following three ways :
 
dropna : removes the rows (or columns) of a DataFrame that contain NaN values.

pad : replaces a NaN with the previous non-NaN value, i.e. the value just above it in the same column (forward fill).

backfill : replaces a NaN with the next non-NaN value, i.e. the value just below it in the same column (backward fill).
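The three strategies can be sketched on a small Series; the modern ffill()/bfill() methods are the equivalents of pad and backfill:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.dropna().tolist())  # [1.0, 3.0] -- the NaN row is removed
print(s.ffill().tolist())   # [1.0, 1.0, 3.0] -- pad: NaN takes the value above it
print(s.bfill().tolist())   # [1.0, 3.0, 3.0] -- backfill: NaN takes the value below it
```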
pandas.concat() function does all the heavy lifting of performing concatenation operations along an axis of pandas objects while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.

Syntax : concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)

Parameters :

objs : Series or DataFrame objects

axis : axis to concatenate along; default = 0

join : way to handle indexes on other axis; default = ‘outer’

ignore_index : if True, do not use the index values along the concatenation axis; default = False

keys : sequence to add an identifier to the result indexes; default = None

levels : specific levels (unique values) to use for constructing a MultiIndex; default = None

names : names for the levels in the resulting hierarchical index; default = None

verify_integrity : check whether the new concatenated axis contains duplicates; default = False

sort : sort non-concatenation axis if it is not already aligned when join is ‘outer’; default = False

copy : if False, do not copy data unnecessarily; default = True

Returns : type of objs (Series or DataFrame)


Example  : Concatenating 2 Series with default parameters.
 
import numpy as np
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
print(pd.concat([s1, s2]))


Output :

0    a
1    b
0    c
1    d
dtype: object

pandas.Series.copy
 
Series.copy(deep=True)
 
Make a deep copy, including a copy of the data and the indices. With deep=False, neither the indices nor the data are copied. Note that when deep=True the data is copied, but actual Python objects are not copied recursively, only the references to them.
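A small sketch of the difference; note that the shallow-copy behaviour shown applies to pandas' classic (non-copy-on-write) semantics:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
deep = s.copy(deep=True)
shallow = s.copy(deep=False)

s.iloc[0] = 99
print(deep.iloc[0])     # 1 -- the deep copy owns its own data
print(shallow.iloc[0])  # shares data with s under classic (non-CoW) semantics
```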
Time series : Time series data is a sequence of data points indexed in time order. It is an essential source of information used across many businesses, from the conventional finance industry to the education industry.
 
Time series forecasting is the machine learning modeling that deals with the Time Series data for predicting future values through Time Series modeling.
 
Time Offset : The offset specifies a set of dates that conform to the DateOffset. We can create the DateOffsets to move the dates forward to valid dates.
 
Time Periods : The Time Periods represent the time span, e.g., days, years, quarter or month, etc. It is defined as a class that allows us to convert the frequency to the periods.
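A brief sketch of offsets and periods:

```python
import pandas as pd

# A DateOffset moves a date forward by a calendar-aware amount
ts = pd.Timestamp('2021-01-30')
print(ts + pd.DateOffset(months=1))  # 2021-02-28 (clipped to a valid date)

# A Period represents a whole time span, here one month
p = pd.Period('2021-01', freq='M')
print(p.start_time, p.end_time)      # the span covers the entire month
```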
In Pandas, there are different useful data operations for DataFrame, which are as follows:
 
Row and column selection : We can select any row or column of the DataFrame by passing its name. When a single row or column is selected from the DataFrame, it becomes one-dimensional and is treated as a Series.
 
Filter Data : We can filter the data by providing some of the boolean expressions in DataFrame.
 
Null values : A Null value occurs when no data is provided to the items. The various columns may contain no values, which are usually represented as NaN.
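The three operations can be sketched together on a small made-up table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Ana', 'Bo', 'Cy'],
                   'score': [85, np.nan, 60]})

col = df['score']                # column selection -> a one-dimensional Series
passed = df[df['score'] > 70]    # boolean filtering keeps only matching rows
print(passed['name'].tolist())   # ['Ana']
print(df['score'].isna().sum())  # 1 -- one null value in the column
```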
Categorical data refers to real-time data that can be repetitive; for instance, data values under categories such as country, gender, codes will always be repetitive. Categorical values in pandas can also take only a limited and fixed number of possible values. 
 
Numerical operations cannot be performed on such data. All values of categorical data in pandas are either in categories or np.nan.
 
This data type can be useful in the following cases :
 
* If a string variable contains only a few different values, converting it into a categorical variable can save some memory.
* It is useful as a signal to other Python libraries because this column must be treated as a categorical variable.
* A lexical order can be converted to a categorical order to be sorted correctly, like a logical order.
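The last point can be sketched by giving string labels a logical order; the low/medium/high labels are invented for illustration:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['low', 'high', 'low', 'medium'])

# An ordered categorical sorts by the declared logical order,
# not by the lexical (alphabetical) order of the strings
order = CategoricalDtype(['low', 'medium', 'high'], ordered=True)
print(s.astype(order).sort_values().tolist())  # ['low', 'low', 'medium', 'high']
```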
To add rows to a DataFrame, we can use .loc[] and .iloc[] (and, in older versions of pandas, .ix[]). .loc[] is label-based, .iloc[] is integer-based, and .ix[] supported both labels and integers before it was removed.

To add columns to the DataFrame, we can again use .loc[] or .iloc[].
The DataFrame is the pandas data structure for working with two-dimensional, tabular data. It stores data in rows and columns, each with its own index.

Example : 
import pandas as pd
info = pd.DataFrame()
print(info)


Output :
Empty DataFrame
Columns: []
Index: []

We can identify if a DataFrame has missing values by using the isnull() and isna() methods.

missing_data_count=df.isnull().sum()


We can handle missing values by either replacing the values in the column with 0 as follows:

df['column_name'] = df['column_name'].fillna(0)


Or by replacing it with the mean value of the column

df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

Multi-index allows you to select more than one row and column in your index. It is a multi-level or hierarchical index object for pandas objects. There are various methods for building a MultiIndex, such as MultiIndex.from_arrays, MultiIndex.from_tuples, MultiIndex.from_product, and MultiIndex.from_frame, which help us create multiple indexes from arrays, tuples, DataFrames, etc.
 
Syntax : pandas.MultiIndex(levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, verify_integrity=True)
 
levels : a sequence of arrays which shows the unique labels for each level.
codes : also a sequence of arrays, where the integers at each level designate the labels in that location.
sortorder : optional int. Sorts the levels lexicographically.
dtype : data-type (size of the data, which can be 32 bits or 64 bits).
copy : a boolean parameter with default value False. It copies the metadata.
verify_integrity : a boolean parameter with default value True. It checks the integrity of the levels and codes, i.e. whether they are valid.
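A short sketch using MultiIndex.from_tuples; the year/quarter data is made up for illustration:

```python
import pandas as pd

# Build a two-level (year, quarter) index from tuples
idx = pd.MultiIndex.from_tuples(
    [('2021', 'Q1'), ('2021', 'Q2'), ('2022', 'Q1')],
    names=['year', 'quarter'])
s = pd.Series([10, 20, 30], index=idx)

print(s.loc['2021'])          # partial indexing: all quarters of 2021
print(s.loc[('2022', 'Q1')])  # full tuple lookup -> 30
```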
When data from a very large table needs to be summarised in a sophisticated manner so that it can be easily understood, pivot tables are a natural choice. The summarization can be based on a variety of statistical concepts like sums, averages, etc. For designing these pivot tables, the pivot_table() method in the pandas library can be used.
 
Syntax : pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)
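A minimal sketch of pivot_table with invented sales data:

```python
import pandas as pd

df = pd.DataFrame({'region': ['East', 'East', 'West', 'West'],
                   'product': ['A', 'B', 'A', 'B'],
                   'sales': [100, 150, 200, 50]})

# Summarise sales with regions as rows and products as columns
table = pd.pivot_table(df, values='sales', index='region',
                       columns='product', aggfunc='sum')
print(table)
```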