Pandas Interview Questions
Pandas is an open-source, cross-platform Python library used mainly to analyze and manipulate data. It was written by Wes McKinney and released in 2008. The library offers data structures and operations for manipulating numerical tables and time-series data.

You can install pandas using pip or as part of the Anaconda distribution. With this package, you can quickly load, clean, and transform tabular data, for example when preparing inputs for machine learning.

Pandas provides two main data structures for analyzing data:
 
* Series
* DataFrames

Pandas is free software released under the three-clause BSD license.
A pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.).
 
The axis labels are collectively called the index. A Series can be thought of as a single column in an Excel sheet.
 
Creating a Series from an array:

import pandas as pd
import numpy as np
# build a Series from a NumPy array of characters
data = np.array(['p', 'a', 'n', 'd', 'a', 's'])
myseries = pd.Series(data)
print(myseries)

Output :
0    p
1    a
2    n
3    d
4    a
5    s
dtype: object
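The index does not have to be the default 0, 1, 2, ... sequence. The following is a minimal sketch of a Series with explicit labels; the variable name scores and the labels 'a' to 'c' are made up for illustration.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Lookups go through the index labels rather than positions.
print(scores['b'])          # 20
print(list(scores.index))   # ['a', 'b', 'c']
```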
DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a dict of Series objects.
 
Creating a DataFrame from a dictionary:

Syntax : 

import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

print(df)

 

Output :

   col1  col2
0     1     3
1     2     4
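Because each column is stored separately, the columns of one DataFrame can have different types. A small sketch, assuming a made-up table called people, shows how to inspect the type of each column with the dtypes attribute:

```python
import pandas as pd

# Hypothetical data: one text column, one integer column, one float column.
people = pd.DataFrame({
    'name': ['Ann', 'Bob'],
    'age': [31, 45],
    'height_m': [1.62, 1.80],
})

# dtypes returns one entry per column.
print(people.dtypes)
```

The exact label printed for the text column depends on the pandas version, so no output is shown here.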

A DataFrame in pandas can be created either directly from a dictionary or by combining several Series.
 
import pandas as pd
# the figures below are illustrative only, not real statistics
country_population = {'India': 1600000000, 'China': 1730000000, 'USA': 390000000, 'UK': 450000000}
population = pd.Series(country_population)
#print(population)

country_land = {'India': '2547869 hectares', 'China': '9543578 hectares', 'USA': '5874658 hectares',  'UK': '6354652 hectares'}
area = pd.Series(country_land)
#print(area)

df = pd.DataFrame({'Population': population, 'SpaceOccupied': area})
print(df)
 
Output :
 
       Population     SpaceOccupied
India  1600000000  2547869 hectares
China  1730000000  9543578 hectares
USA     390000000  5874658 hectares
UK      450000000  6354652 hectares
 
Index objects can be created with the pandas Index constructor. They support set operations such as intersection and union.
 
import pandas as pd
index_A = pd.Index([1, 3, 5, 7, 9])
index_B = pd.Index([2, 3, 5, 7, 11])
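The two Index objects above can be combined with the intersection and union methods. A minimal sketch, repeating the definitions so that it runs on its own (the names common and combined are introduced here for illustration):

```python
import pandas as pd

index_A = pd.Index([1, 3, 5, 7, 9])
index_B = pd.Index([2, 3, 5, 7, 11])

# Labels present in both indexes.
common = index_A.intersection(index_B)

# Labels present in either index (sorted by default).
combined = index_A.union(index_B)

print(list(common))     # [3, 5, 7]
print(list(combined))   # [1, 2, 3, 5, 7, 9, 11]
```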
Indexing in pandas simply means selecting particular rows and columns of data from a DataFrame. It could mean selecting all of the rows and some of the columns, some of the rows and all of the columns, or some of each. Indexing is also known as subset selection.

The axis labeling information in pandas objects serves many purposes:
 
* Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
* Enables automatic and explicit data alignment.
* Allows intuitive getting and setting of subsets of the data set.
 
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.
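The kinds of subset selection described above can be sketched with the loc (label-based) and iloc (position-based) accessors. The DataFrame and its row labels r1 to r3 are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical 3x3 table with string row labels.
df = pd.DataFrame(
    {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]},
    index=['r1', 'r2', 'r3'],
)

# All rows, some columns.
all_rows_some_cols = df[['col1', 'col3']]

# Some rows, all columns (selected by label).
some_rows_all_cols = df.loc[['r1', 'r3']]

# Some rows and some columns (selected by label).
some_of_each = df.loc[['r1', 'r2'], ['col2']]

# The same idea by integer position: first two rows, first two columns.
by_position = df.iloc[0:2, 0:2]

print(by_position)
```

loc selects by label and iloc by integer position, so the two only coincide when the index is the default 0, 1, 2, ... sequence.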
Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.
 
Several operations can be accomplished through reindexing, such as:
 
* Reorder the existing data to match a new set of labels.
* Insert missing value (NA) markers in label locations where no data for the label existed.
 
Example :

import pandas as pd
import numpy as np

N = 20

df = pd.DataFrame({
   'A': pd.date_range(start='2021-01-18', periods=N, freq='D'),
   'x': np.linspace(0, stop=N-1, num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
   'D': np.random.normal(100, 10, size=N).tolist()
})

# reindex the DataFrame: keep rows 0, 2, 5 and columns A, C, B
# (column 'B' does not exist in df, so it is filled with NaN)
df_reindexed = df.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])

print(df_reindexed)

 

Output (the values in column C are random and will vary between runs) :

           A       C   B
0 2021-01-18  Medium NaN
2 2021-01-20    High NaN
5 2021-01-23     Low NaN

 

Reindexing can introduce NaN values. The ffill and bfill methods are used to fill them:
 
bfill : fills a NaN with the next valid value that follows it (backward fill)
ffill : fills a NaN with the last valid value that precedes it (forward fill)
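The difference between the two fill directions is easiest to see on fixed data. The sketch below uses a small hypothetical Series and the method argument of reindex; the names plain, forward, and backward are introduced here for illustration.

```python
import pandas as pd

# Hypothetical Series with values only at labels 0 and 2.
s = pd.Series([10, 20], index=[0, 2])
new_index = [0, 1, 2, 3]

# No fill method: labels 1 and 3 become NaN.
plain = s.reindex(new_index)

# Forward fill: each new label takes the last valid value before it.
forward = s.reindex(new_index, method='ffill')

# Backward fill: each new label takes the next valid value after it.
# Label 3 has nothing after it, so it stays NaN.
backward = s.reindex(new_index, method='bfill')

print(forward.tolist())   # [10, 10, 20, 20]
print(backward.tolist())  # [10.0, 20.0, 20.0, nan]
```

Passing a fill method requires the original index to be sorted (monotonic), which holds for the labels used here.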

Reindexing without a fill method (neither bfill nor ffill):
 
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

print(df2.reindex_like(df1))

 

Output :

       col1      col2      col3
0 -0.641715  1.031070 -0.208415
1 -1.560385 -0.584403  0.291666
2       NaN       NaN       NaN
3       NaN       NaN       NaN

 

Reindexing with a fill method (ffill or bfill):

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

print(df2.reindex_like(df1, method='ffill'))


Output :

       col1      col2      col3
0  1.332612 -0.479218 -1.016999
1 -1.091319 -0.844934 -0.492755
2 -1.091319 -0.844934 -0.492755
3 -1.091319 -0.844934 -0.492755

 

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])

print(df2.reindex_like(df1, method='bfill'))

 

Output :

       col1      col2      col3
0  0.526663 -0.450748  0.791112
1 -1.805287  0.641050  1.864871
2       NaN       NaN       NaN
3       NaN       NaN       NaN

With bfill, rows 2 and 3 remain NaN because df2 has no later rows to fill backward from.

 
 
Some of the major features of pandas are:
 
* A fast and efficient DataFrame object for handling data.
* Tools for loading data into in-memory data objects from various file formats.
* High-performance merging and joining of data sets.
* Time-series functionality.
* Label-based slicing, fancy indexing, and subsetting of large data sets.
* Reshaping and pivoting of data sets.
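Two of the features listed above, merging and pivoting, can be sketched in a few lines. The orders and customers tables, their column names, and their values are all hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing a 'customer_id' key.
orders = pd.DataFrame({'customer_id': [1, 2, 1], 'amount': [50, 20, 30]})
customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ann', 'Bob']})

# Merge: attach the customer name to every order (left join on the key).
merged = orders.merge(customers, on='customer_id', how='left')

# Pivot: one row per customer name, with the summed order amount.
totals = merged.pivot_table(index='name', values='amount', aggfunc='sum')

print(totals)
```

Here Ann's two orders (50 and 30) are summed to 80, and Bob's single order stays at 20.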
The data structures available in pandas are:
 
Series : A one-dimensional, homogeneous labeled array. Its values are mutable, but its size is fixed.

DataFrame : A two-dimensional tabular data structure made up of rows and columns, whose columns can hold different types. Both its data and its size are mutable.

Panel : A three-dimensional data structure for heterogeneous data. Panel was deprecated in pandas 0.20 and removed in 0.25, so it is not available in current releases.
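For data with more than two dimensions, one commonly used option in current pandas is a DataFrame with a MultiIndex on its rows. A minimal sketch, assuming hypothetical year and country labels and made-up values:

```python
import pandas as pd

# Hypothetical two-level row index: every (year, country) pair.
idx = pd.MultiIndex.from_product(
    [['2021', '2022'], ['India', 'USA']],
    names=['year', 'country'],
)
df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=idx)

# Select one "slice" of the extra dimension: all countries for 2022.
year_2022 = df.loc['2022']

# Select a single cell by giving both levels.
single = df.loc[('2021', 'USA'), 'value']

print(year_2022)
print(single)   # 2
```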