Data Science Interview Questions
Data Science is a combination of algorithms, tools, and machine learning techniques.

Data science is a multi-disciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations. Data science encompasses preparing data for analysis and processing, performing advanced data analysis, and presenting the results to reveal patterns and enable stakeholders to draw informed conclusions.
Data preparation can involve cleansing, aggregating, and manipulating the data so that it is ready for specific types of processing. Analysis requires the development and use of algorithms, analytics, and AI models.
There are several ways to handle missing values in the given data :
* Dropping the values
* Deleting the observation (not always recommended).
* Replacing the value with the mean, median, or mode of the variable.
* Predicting value with regression
* Finding appropriate value with clustering
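The deletion and replacement options above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `impute` and `drop_missing` and the sample `ages` column are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with a statistic computed from the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # Pick the summary statistic requested by the caller.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def drop_missing(rows):
    """Delete every observation (row) that contains at least one missing value."""
    return [row for row in rows if all(v is not None for v in row)]

# Hypothetical column with two missing entries.
ages = [25, None, 31, 40, None, 31]
print(impute(ages, "mean"))    # missing entries become 31.75
print(impute(ages, "median"))  # missing entries become 31.0
print(impute(ages, "mode"))    # missing entries become 31
print(drop_missing([(25, "a"), (None, "b"), (31, None)]))  # only (25, "a") survives
```

Deleting rows is the simplest option but discards information, which is why it is not always recommended; replacing with a summary statistic keeps every row at the cost of shrinking the variable's variance.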
Data Science : 
Definition : Data Science is not exactly a subset of machine learning, but it uses machine learning to analyze data and make predictions.
Role : It can take on a business role.
Scope : Data Science is a broad term for diverse disciplines and is not merely about developing and training models.
AI : Loosely integrated
Machine Learning :
Definition : A subset of AI that focuses on a narrow range of activities.
Role : It is a purely technical role.
Scope : Machine learning fits within the data science spectrum.
AI : Machine learning is a subfield of AI and is tightly integrated.
Artificial Intelligence : 
Definition : A wide term that focuses on applications ranging from Robotics to Text Analysis.
Role : It is a combination of both business and technical aspects.
Scope : AI is a sub-field of computer science.
AI : A sub-field of computer science consisting of various tasks like planning, moving around in the world, recognizing objects and sounds, speaking, translating, performing social or business transactions, creative work.
The common differences between data science and big data are : 
Big Data : 
* Large collection of data sets that cannot be stored in a traditional system
* Popular in the field of communication, purchase and sale of goods, financial services, and educational sector
* Big Data solves problems related to data management and handling, and analyzes data for insights that lead to informed decision making
* Popular tools are Hadoop, Spark, Flink, NoSQL, Hive, etc.

Data Science : 
* An interdisciplinary field that includes analytical aspects, statistics, data mining, machine learning, etc.
* Common applications are digital advertising, web research, recommendation systems (Netflix, Amazon, Facebook), speech and handwriting recognition applications
* Data Science uses machine learning algorithms and statistical methods to obtain accurate predictions from raw data
* Popular tools are Python, R, SAS, SQL, etc.
Cluster analysis, or clustering, is an unsupervised machine learning task.
It involves automatically discovering natural groupings in data. Unlike supervised learning (such as predictive modeling), clustering algorithms only interpret the input data and find natural groups, or clusters, in feature space.
Clustering is used in fields such as image recognition, pattern analysis, medical informatics, genomics, and data compression.
According to the formal definition, K-means clustering is an iterative algorithm that partitions a data set of n observations into k clusters. Each of the n observations belongs to the cluster with the nearest mean (centroid).
K-means clustering is one of the most popular unsupervised learning algorithms. It is easy to understand and implement.
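The iteration described above can be sketched as follows. This is a minimal version of the standard assign-then-update loop (often called Lloyd's algorithm), not a production implementation: the function name `kmeans`, the sample points, and the hand-picked starting centroids are all assumptions made for illustration, and real libraries choose the starting centroids more carefully.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and updating means until nothing moves."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid changed
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D data with two visible groups, k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

On this data the loop settles after the second pass, with the three points near the origin in one cluster and the three points near (8, 8) in the other.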
* KNN (K-nearest neighbors) is a supervised learning algorithm. To train it, we require labeled data.
* K-means is an unsupervised learning algorithm that looks for patterns that are intrinsic to the data, so it needs no labels.
* The K in KNN is the number of nearest data points consulted for a prediction. In contrast, the K in K-means specifies the number of centroids (clusters).
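To make the contrast concrete, here is a minimal sketch of KNN classification. Unlike K-means, it cannot run without labels: every training point carries one, and K only controls how many neighbors vote. The function name `knn_predict` and the labeled sample data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Sort the labeled points by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: (features, label) pairs.
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b")]
print(knn_predict(train, (2, 2)))  # a
print(knn_predict(train, (7, 7)))  # b
```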
A p-value is the probability of obtaining results at least as extreme as the results actually observed, assuming that the null hypothesis is correct. It measures how compatible the observed data are with the null hypothesis; it is not the probability that the null hypothesis is true.
A low p-value (≤ 0.05, by the common convention) means that the null hypothesis can be rejected: the data are unlikely under a true null.
A high p-value (> 0.05) means that there is not enough evidence to reject the null hypothesis: the data are consistent with a true null. It does not prove that the null hypothesis is true.
A p-value of exactly 0.05 sits on the threshold, so the decision can go either way.
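A small worked example may help. The sketch below computes an exact two-sided p-value for a hypothetical experiment: a coin is flipped 10 times and lands heads 9 times, and the null hypothesis is that the coin is fair. The function names and the experiment itself are assumptions for illustration; the two-sided rule used here (sum the probabilities of all outcomes that are no more likely than the observed one) is one common convention among several.

```python
from math import comb

def binomial_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(n, k_observed, p0=0.5):
    """Total probability, under the null, of outcomes at least as extreme as observed."""
    observed = binomial_pmf(n, k_observed, p0)
    # "At least as extreme" = no more probable than the observed outcome.
    # The small tolerance guards against floating-point ties.
    return sum(
        binomial_pmf(n, k, p0)
        for k in range(n + 1)
        if binomial_pmf(n, k, p0) <= observed * (1 + 1e-9)
    )

p = two_sided_p_value(n=10, k_observed=9)
print(round(p, 4))  # 0.0215
print("reject the null" if p <= 0.05 else "fail to reject the null")
```

Here the outcomes counted as extreme are 0, 1, 9, and 10 heads, giving 22/1024, which is below 0.05, so a fair coin would be rejected at that threshold.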
KPI : KPI stands for Key Performance Indicator, a metric that measures how well the business is achieving its objectives.
Lift : This is a performance measure of the target model measured against a random choice model. Lift indicates how good the model is at prediction versus if there was no model.
Model fitting : This indicates how well the model under consideration fits given observations.
Robustness : This represents the system’s capability to handle differences and variances effectively.
DOE : DOE stands for design of experiments, the design of a task that aims to describe and explain how information varies under conditions that are hypothesized to reflect that variation.
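Of the terms above, lift is the one that is easiest to pin down with arithmetic. One common way to compute it, sketched below, is to rank cases by model score, take the top fraction, and divide the response rate in that group by the overall response rate (which is what a random choice would achieve on average). The function name `lift`, the `fraction` parameter, and the sample data are hypothetical.

```python
def lift(actuals, scores, fraction=0.2):
    """Response rate in the top-scored fraction divided by the overall response rate."""
    # Rank cases from highest to lowest model score.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(actual for _, actual in ranked[:top_n]) / top_n
    base_rate = sum(actuals) / len(actuals)  # assumes at least one positive case
    return top_rate / base_rate

# Hypothetical outcomes (1 = responded) and model scores for ten customers.
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(round(lift(actuals, scores, fraction=0.2), 2))  # 3.33
```

A lift of 1 means the model does no better than picking at random; here the top 20% of scores contains responders at about 3.3 times the base rate.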
True Positive Rate (TPR) is the ratio of true positives to the sum of true positives and false negatives, that is, TPR = TP / (TP + FN). It is the probability that an actual positive will test as positive.
False Positive Rate (FPR) is the ratio of false positives to all actual negatives (false positives plus true negatives), that is, FPR = FP / (FP + TN). It is the probability of a false alarm, i.e., a positive result will be given when the case is actually negative.
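Both rates can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN); the function name `tpr_fpr` and the sample labels are hypothetical, with 1 marking the positive class.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives that raised a false alarm
    return tpr, fpr

# Hypothetical ground truth and predictions: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr)            # 0.75
print(round(fpr, 3))  # 0.167
```

Note that the two denominators differ: TPR is measured over actual positives, FPR over actual negatives, so neither depends on how common the positive class is.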