Data Science Interview Questions
Data Science is a combination of algorithms, tools, and machine learning techniques.

Data science is a multi-disciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations. Data science encompasses preparing data for analysis and processing, performing advanced data analysis, and presenting the results to reveal patterns and enable stakeholders to draw informed conclusions.
 
Data preparation can involve cleansing, aggregating, and manipulating the data so that it is ready for specific types of processing. Analysis requires the development and use of algorithms, analytics, and AI models.
There are several ways to handle missing values in the given data (a short sketch follows the list) :
 
* Dropping the values
* Deleting the observation (not always recommended).
* Replacing the value with the mean, median, or mode of the observation.
* Predicting value with regression
* Finding appropriate value with clustering
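
For illustration only, here is a minimal pandas sketch (hypothetical toy data) showing the dropping and mean-replacement options:

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset with missing values
df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                   "income": [50000, 62000, np.nan, 58000, 61000]})

# Option 1: drop rows that contain any missing value
dropped = df.dropna()

# Option 2: replace missing values with the column mean (median/mode work similarly)
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)
```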
Data Science : 
Definition : Data Science is not exactly a subset of machine learning but it uses machine learning to analyze and make future predictions.
Role : It can take on a business role.
Scope : Data Science is a broad term for diverse disciplines and is not merely about developing and training models.
AI : Loosely integrated
 
 
Machine Learning :
Definition : A subset of AI that focuses on a narrow range of activities.
Role : It is a purely technical role.
Scope : Machine learning fits within the data science spectrum.
AI : Machine learning is a subfield of AI and is tightly integrated.
 
 
Artificial Intelligence : 
Definition : A wide term that focuses on applications ranging from Robotics to Text Analysis.
Role : It is a combination of both business and technical aspects.
Scope : AI is a sub-field of computer science.
AI : A sub-field of computer science consisting of various tasks like planning, moving around in the world, recognizing objects and sounds, speaking, translating, performing social or business transactions, creative work.
The common differences between data science and big data are : 
 
Big Data : 
* Large collection of data sets that cannot be stored in a traditional system
* Popular in the field of communication, purchase and sale of goods, financial services, and educational sector
* Big Data solves problems related to data management and handling, and analyzes the data to extract insights that result in informed decision making
* Popular tools are Hadoop, Spark, Flink, NoSQL, Hive, etc.


Data Science : 
* An interdisciplinary field that includes analytical aspects, statistics, data mining, machine learning, etc.
* Common applications are digital advertising, web research, recommendation systems (Netflix, Amazon, Facebook), speech and handwriting recognition applications
* Data Science uses machine learning algorithms and statistical methods to obtain accurate predictions from raw data
* Popular tools are Python, R, SAS, SQL, etc.
Cluster analysis, or clustering, is an unsupervised machine learning task.
 
It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters in feature space.
 
Clustering is used in various fields like image recognition, pattern analysis, medical informatics, genomics, data compression etc. It is part of the unsupervised learning algorithm in machine learning.
According to the formal definition of K-means clustering – K-means clustering is an iterative algorithm that partitions a group of data containing n values into k subgroups. Each of the n values belongs to the cluster with the nearest mean.
 
K-means clustering is one of the most popular unsupervised learning algorithms. It is easy to understand and implement.
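
As a minimal sketch, assuming scikit-learn is available, K-means can be run on synthetic data like this (the blob data and parameter choices are illustrative, not part of the original answer):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Partition the points into k = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the k means (centroids)
print(labels[:10])              # cluster assignment for the first 10 points
```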
* Firstly, KNN is a supervised learning algorithm. In order to train this algorithm, we require labeled data.
* K-means is an unsupervised learning algorithm that looks for patterns that are intrinsic to the data.
* The K in KNN is the number of nearest data points considered. In contrast, the K in K-means specifies the number of centroids (clusters).
A p-value is the probability of obtaining results equal to or more extreme than the results actually observed, assuming that the null hypothesis is correct. It represents the probability that the observed difference occurred purely by chance.
 
A low p-value (≤ 0.05) means that the null hypothesis can be rejected: the observed data would be unlikely if the null hypothesis were true.
A high p-value (≥ 0.05) indicates evidence in favor of the null hypothesis: the observed data would be likely if the null hypothesis were true.
A p-value of exactly 0.05 is marginal, and the decision can go either way.
KPI : KPI stands for Key Performance Indicator that measures how well the business achieves its objectives.
Lift : This is a performance measure of the target model measured against a random choice model. Lift indicates how good the model is at prediction versus if there was no model.
Model fitting : This indicates how well the model under consideration fits given observations.
Robustness : This represents the system’s capability to handle differences and variances effectively.
DOE : Stands for Design of Experiments, the design of a task that aims to describe and explain the variation of information under conditions hypothesized to reflect the variables.
The True Positive Rate (TPR) is the ratio of True Positives to the sum of True Positives and False Negatives. It is the probability that an actual positive will test as positive.

TPR = TP / (TP + FN)

The False Positive Rate (FPR) is the ratio of False Positives to the sum of False Positives and True Negatives, i.e. to all actual negatives. It is the probability of a false alarm, i.e., a positive result will be given when the instance is actually negative.

FPR = FP / (FP + TN)
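
A quick worked example with hypothetical confusion-matrix counts (the numbers are made up for illustration):

```python
# Hypothetical counts from a binary classifier's confusion matrix
TP, FN, FP, TN = 80, 20, 10, 90

tpr = TP / (TP + FN)  # true positive rate (sensitivity / recall)
fpr = FP / (FP + TN)  # false positive rate (probability of a false alarm)

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.80, FPR = 0.10
```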
The precision-recall (PR) curve plots precision against recall, where Precision = TP/(TP + FP) and Recall = TP/(TP + FN), and its AUC is the area under that curve. This is in contrast with the ROC (Receiver Operating Characteristic) curve, which plots the True Positive Rate against the False Positive Rate.
As the name suggests, data cleansing is a process of removing or updating the information that is incorrect, incomplete, duplicated, irrelevant, or formatted improperly. It is very important to improve the quality of data and hence the accuracy and productivity of the processes and organisation as a whole.
 
Real-world data is often captured in formats which have hygiene issues. Errors from various sources can make the data inconsistent, and sometimes they affect only some features of the data. Hence data cleansing is done to filter the usable data from the raw data; otherwise many systems consuming the data will produce erroneous results.
Different types of data require different types of cleaning, the most important steps of Data Cleaning are :
 
* Data Quality
* Removing Duplicate Data (also irrelevant data)
* Structural errors
* Outliers
* Treatment for Missing Data

Data cleaning is an important step before analysing data; it helps to increase the accuracy of the model and enables organisations to make informed decisions.
 
Data scientists usually spend around 80% of their time cleaning data.
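
A minimal pandas sketch of a few of these cleaning steps, using a small hypothetical dataset (column names and thresholds are illustrative):

```python
import pandas as pd

# Hypothetical raw data with duplicates, structural errors, and an extreme value
raw = pd.DataFrame({"city": ["Delhi", "delhi ", "Mumbai", "Mumbai"],
                    "salary": ["50000", "50000", "65000", "9999999"]})

clean = (raw
         .assign(city=raw["city"].str.strip().str.title(),   # fix structural errors
                 salary=pd.to_numeric(raw["salary"]))         # correct the data type
         .drop_duplicates())                                  # remove duplicate rows

# Cap extreme values; on real data the 99th percentile tames obvious outliers
clean["salary"] = clean["salary"].clip(upper=clean["salary"].quantile(0.99))
print(clean)
```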
Statistics provides tools and methods to identify patterns and structures in data and to provide deeper insight into it. It serves a great role in data acquisition, exploration, analysis, and validation, and plays a really powerful role in Data Science.
 
Data Science is a derived field formed from the overlap of statistics, probability, and computer science. Whenever one needs to do estimation, statistics is involved. Many algorithms in data science are built on top of statistical formulae and processes. Hence statistics is an important part of data science.
Supervised learning is a type of machine learning where a function is inferred from labeled training data. The training data contains a set of training examples.
 
Unsupervised learning, on the other hand, is a type of machine learning where inferences are drawn from datasets containing input data without labeled responses. Following are the various other differences between the two types of machine learning:
 
Algorithms Used : Supervised learning makes use of Decision Trees, K-nearest Neighbor algorithm, Neural Networks, Regression, and Support Vector Machines. Unsupervised learning uses Anomaly Detection, Clustering, Latent Variable Models, and Neural Networks.

Enables : Supervised learning enables classification and regression, whereas unsupervised learning enables clustering, dimensionality reduction, and density estimation

Use : While supervised learning is used for prediction, unsupervised learning finds use in analysis
Selection bias is typically associated with research that doesn’t have a random selection of participants. It is a type of error that occurs when a researcher decides who is going to be studied. On some occasions, selection bias is also referred to as the selection effect.
 
In other words, selection bias is a distortion of statistical analysis that results from the sample collecting method. When selection bias is not taken into account, some conclusions made by a research study might not be accurate. Following are the various types of selection bias:
 
Sampling Bias : A systematic error resulting from a non-random sample of a population, causing some members to be less likely to be included than others and producing a biased sample.
Time Interval : A trial might be ended at an extreme value, usually due to ethical reasons, but the extreme value is most likely to be reached by the variable with the most variance, even though all variables have a similar mean.
Data : Results when specific data subsets are selected to support a conclusion, or when bad data is rejected arbitrarily.
Attrition : Caused due to attrition, i.e. loss of participants, discounting trial subjects or tests that didn’t run to completion.
When studying the target population spread throughout a wide area becomes difficult and applying simple random sampling becomes ineffective, the technique of cluster sampling is used. A cluster sample is a probability sample, in which each of the sampling units is a collection or cluster of elements.
 
Following the technique of systematic sampling, elements are chosen from an ordered sampling frame. The list is advanced in a circular fashion: once the end of the list is reached, it is progressed again from the start, or top, of the list.
The linear regression equation is a one-degree equation, with the most basic form being Y = mX + C, where m is the slope of the line and C is the intercept. It is used when the response variable is continuous in nature, for example height, weight, or the number of hours. It is a simple linear regression if it involves a continuous dependent variable with one independent variable, and a multiple linear regression if it has multiple independent variables.
 
Linear regression is a standard statistical practice to calculate the best fit line passing through the data points when plotted. The best fit line is chosen in such a way so that the distance of each data point is minimum from the line which reduces the overall error of the system. Linear regression assumes that the various features in the data are linearly related to the target. It is often used in predictive analytics for calculating estimates in the foreseeable future.
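
A minimal sketch with scikit-learn, using hypothetical hours-vs-score data, to show how the slope m and intercept C are fitted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied (X) vs. exam score (y)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 58, 61, 67, 73])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # slope m and intercept C of Y = mX + C
print(model.predict([[6]]))               # estimate for a future observation
```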
These are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point in time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.
 
If the analysis attempts to understand the difference between 2 variables at the time as in a scatterplot, then it is referred to as bivariate analysis. For example, analyzing the volume of sales and spending can be considered as an example of bivariate analysis.
 
Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.
The response variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique to transform a non-normal dependent variable into a normal shape. Since most statistical techniques assume normality, applying a Box-Cox transformation when the data is not normal means that you can run a broader number of tests.
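
A short sketch using scipy.stats.boxcox on hypothetical right-skewed data (the data and seed are illustrative):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed response variable (Box-Cox requires positive values)
y = np.random.default_rng(0).exponential(scale=2.0, size=500)

# boxcox returns the transformed data and the lambda that best normalises it
y_transformed, best_lambda = stats.boxcox(y)
print(best_lambda)
```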
SVM stands for support vector machine. SVMs are used for classification and prediction tasks. An SVM consists of a separating plane that discriminates between the two classes of variables. This separating plane is known as a hyperplane. Some of the kernels used in SVM are (a short sketch follows the list) :
 
* Polynomial Kernel
* Gaussian Kernel
* Laplace RBF Kernel
* Sigmoid Kernel
* Hyperbolic Kernel
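
A minimal scikit-learn sketch; note that scikit-learn exposes only a subset of the kernels listed above (linear, polynomial, RBF/Gaussian, sigmoid), and the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit an SVM with each available kernel and compare training accuracy
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))
```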
Cross-validation is essentially a technique used to assess how well a model performs on a new independent dataset. The simplest example of cross-validation is when you split your data into two groups: training data and testing data, where you use the training data to build the model and the testing data to test the model.
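
A minimal 5-fold cross-validation sketch with scikit-learn (the iris dataset and logistic regression model are just illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the held-out fold, repeat
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```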
Reducing the number of features for a given dataset is known as dimensionality reduction. There are many techniques used to reduce dimensionality such as :
 
* Feature Selection Methods
* Matrix Factorization
* Manifold Learning
* Autoencoder Methods
* Linear Discriminant Analysis (LDA)
* Principal component analysis (PCA)

One of the main reasons for dimensionality reduction is the curse of dimensionality. When the number of features increases, the model becomes more complex. If the number of data points is small relative to the number of features, the model starts memorizing (overfitting) the data and will not generalize. This is known as the curse of dimensionality. A short PCA sketch follows the list of benefits below.
 
Other benefits of dimensionality reduction include :
 
* Training time and storage space are reduced.
* It becomes easier to visualize and visually represent the data in 2D or 3D.
* Space complexity is reduced.
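
A short PCA sketch with scikit-learn (the digits dataset and the choice of 2 components are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)       # 64 features per image
pca = PCA(n_components=2)                 # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)          # (1797, 64) -> (1797, 2)
print(pca.explained_variance_ratio_)      # share of variance kept by each component
```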
* Time series data can be thought of as an extension of linear regression which uses terms like autocorrelation and moving averages to summarize historical data of the y-axis variable for predicting a better future.

* Forecasting and prediction is the main goal of time series problems where accurate predictions can be made but sometimes the underlying reasons might not be known.

* Having Time in the problem does not necessarily mean it becomes a time series problem. There should be a relationship between target and time for a problem to become a time series problem.

* Observations close to one another in time are expected to be more similar than observations far apart, which provides a basis for accounting for seasonality.

* For instance, today’s weather would be similar to tomorrow’s weather but not similar to weather from 4 months from today. Hence, weather prediction based on past data becomes a time series problem.
Although these two terms are used for establishing a relationship and dependency between any two random variables, the following are the differences between them :
 
Correlation : This technique is used to measure and estimate the quantitative relationship between two variables, expressed in terms of how strongly the variables are related.

Covariance : It represents the extent to which the variables change together. This explains the systematic relationship between a pair of variables, where changes in one accompany changes in the other variable.

Mathematically, consider two random variables, X and Y, with means μX and μY, standard deviations σX and σY, and let E represent the expected value operator. Then:
 
covariance(X, Y) = E[(X - μX)(Y - μY)]
correlation(X, Y) = E[(X - μX)(Y - μY)] / (σX σY)

so that

correlation(X, Y) = covariance(X, Y) / (σX σY)

Based on the above formulas, we can deduce that correlation is dimensionless, whereas covariance is expressed in units obtained by multiplying the units of the two variables.
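
A quick numerical check of these formulas with NumPy, on hypothetical simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=1000)
Y = 3 * X + rng.normal(size=1000)        # Y moves together with X

cov_xy = np.cov(X, Y)[0, 1]              # units: units(X) * units(Y)
corr_xy = np.corrcoef(X, Y)[0, 1]        # dimensionless, between -1 and 1

# correlation = covariance / (std_X * std_Y)
print(cov_xy, corr_xy, cov_xy / (X.std(ddof=1) * Y.std(ddof=1)))
```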
The impact of missing values can be known after identifying what kind of variables have the missing values.
 
* If the data analyst finds any pattern in these missing values, then there are chances of finding meaningful insights.
* In case patterns are not found, these missing values can either be ignored or replaced with default values such as the mean, minimum, maximum, or median values.
* If the missing values belong to categorical variables, then they are assigned with default values. If the data has a normal distribution, then mean values are assigned to missing values.
* If 80% of the values are missing, then it is up to the analyst to either replace them with default values or drop the variable.
Statistical analyses are classified based on the number of variables processed at a given time.
 
Univariate analysis :  This analysis deals with analysing only one variable at a time.
Example : Sales pie charts based on territory.
 
Bivariate analysis : This analysis deals with the statistical study of two variables at a given time.
Example : Scatterplot of Sales and spend volume analysis study.
 
Multivariate analysis : This analysis deals with statistical analysis of more than two variables and studies the responses.
Example : Study of the relationship between people's social media habits and their self-esteem, which depends on multiple factors like age, number of hours spent, employment status, relationship status, etc.
* The method of regularization entails the addition of penalties on different parameters of the machine learning model to reduce the freedom of the model and avoid overfitting.

* There are various regularization methods available, such as linear model regularization, Lasso/L1 regularization, etc. Linear model regularization applies a penalty to the coefficients that multiply the predictors. Lasso/L1 regularization can shrink some coefficients to exactly zero, thereby making the corresponding features eligible to be removed from the model.
While doing binary classification, if the data set is imbalanced, the model's performance cannot be judged correctly using accuracy alone. For example, if the data belonging to one of the two classes is very small in quantity compared to the other class, traditional accuracy gives very little weight to the smaller class. If only 5% of the examples belong to the smaller class and the model classifies every output as the larger class, the accuracy would still be around 95%, which is misleading. To deal with this, we can do the following (a short sketch follows the list) :
 
* Use other methods for calculating the model performance like precision/recall, F1 score, etc.
* Resample the data with techniques like undersampling (reducing the sample size of the larger class) or oversampling (increasing the sample size of the smaller class using repetition, SMOTE, and other such techniques).
* Using K-fold cross-validation
* Using ensemble learning such that each decision tree considers the entire sample of the smaller class and only a subset of the larger class.
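
A minimal sketch combining two of these ideas, class weighting and precision/recall reporting, on hypothetical imbalanced data (scikit-learn assumed; SMOTE itself would require the separate imbalanced-learn package):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical data where only ~5% of examples belong to the positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' penalises mistakes on the rare class more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Report precision, recall and F1 instead of plain accuracy
print(classification_report(y_te, clf.predict(X_te)))
```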
Data Modeling : It can be considered as the first step towards the design of a database. Data modeling creates a conceptual model based on the relationships between various data entities. The process involves moving from the conceptual stage to the logical model to the physical schema, and it involves the systematic application of data modeling techniques.

Database Design : This is the process of designing the database. The database design creates an output which is a detailed data model of the database. Strictly speaking, database design includes the detailed logical model of a database, but it can also include physical design choices and storage parameters.
An error is the difference between the observed values and the true values of a dataset, whereas a residual is the difference between the observed values and the values predicted by the model. The reason we use the residual error to evaluate the performance of an algorithm is that the true values are never known. Hence, we use the observed values to measure the error using residuals. It helps us get an accurate estimate of the error.
Recommender Systems is a subclass of information filtering systems, meant for predicting the preferences or ratings awarded by a user to some product.
 
An application of a recommender system is the product recommendations section in Amazon. This section contains items based on the user’s search history and past orders.
Data cleaning can be a daunting task due to the fact that with the increase in the number of data sources, the time required for cleaning the data increases at an exponential rate.
 
This is due to the vast volume of data generated by additional sources. Also, data cleaning can solely take up to 80% of the total time required for carrying out a data analysis task.
 
Nevertheless, there are several reasons for using data cleaning in data analysis. Two of the most important ones are:
 
* Cleaning data from different sources helps in transforming the data into a format that is easy to work with
* Data cleaning increases the accuracy of a machine learning model
Logistic regression is a technique in predictive analytics which is used when the variable to be predicted is dichotomous (binary) in nature, for example yes/no or true/false. The model is of the form P(Y = 1) = 1 / (1 + e^-(b0 + b1X)), i.e. the sigmoid of a linear combination of the inputs. It is used for classification-based tasks: it finds the probability that a data point belongs to a particular class.
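
A minimal scikit-learn sketch on hypothetical hours-vs-pass data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))   # P(fail), P(pass) for a new data point
print(model.predict([[3.5]]))         # predicted class
```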
Normal Distribution is also called the Gaussian Distribution. It is a type of probability distribution in which most of the values lie near the mean. It has the following characteristics :

* The total area under the curve is 1
* The distribution has a bell-shaped curve
* The mean, median, and mode of the distribution coincide
* Exactly half of the values are to the right of the centre, and the other half to the left of the centre
Recall is the fraction of the actual positive instances that the model correctly identifies as positive, i.e. TP/(TP + FN). Precision, on the other hand, is the fraction of the instances predicted as positive that are actually positive, i.e. TP/(TP + FP). A model can have high recall with low precision, or vice versa, so the two are usually reported together.
Correlation is defined as the measure of the relationship between two variables.
* If two variables are directly proportional to each other, then it is a positive correlation.
* If the variables are indirectly proportional to each other, it is known as a negative correlation.

Covariance is the measure of how much two random variables vary together.
The native data structures of python are :
 
* Lists
* Tuples
* Sets
* Dictionary

Tuples are immutable. Others are mutable.
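
A tiny sketch illustrating the mutability difference:

```python
point_list = [1, 2]        # list: mutable
point_tuple = (1, 2)       # tuple: immutable
unique_ids = {1, 2, 2}     # set: mutable, duplicates removed -> {1, 2}
lookup = {"a": 1, "b": 2}  # dictionary: mutable key-value mapping

point_list[0] = 99         # fine
lookup["c"] = 3            # fine
unique_ids.add(4)          # fine

try:
    point_tuple[0] = 99    # raises TypeError: tuples cannot be changed in place
except TypeError as err:
    print(err)
```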
The error introduced in your model because of the over-simplification of the algorithm is known as bias. On the other hand, variance is the error introduced in your model because of the complexity of the machine learning algorithm. In this case, the model also learns noise and performs poorly on the test dataset.
 
The bias-variance tradeoff is the optimum balance between bias and variance in a machine learning model. If you try to decrease bias, the variance will increase and vice-versa.
 
Total Error = Bias² + Variance + Irreducible Error. The bias-variance tradeoff is the process of finding the right level of model complexity (for example, the number of features used) during model creation such that the error is kept to a minimum, while also taking care that the model neither overfits nor underfits.
A confusion matrix is a 2X2 table that consists of four outputs provided by the binary classifier.
 
A binary classifier predicts all data instances of a test dataset as either positive or negative. This produces four outcomes-
 
True positive(TP) : Correct positive prediction
False-positive(FP) : Incorrect positive prediction
True negative(TN) : Correct negative prediction
False-negative(FN) : Incorrect negative prediction

It helps in calculating various measures including error rate (FP+FN)/(P+N), specificity(TN/N), accuracy(TP+TN)/(P+N), sensitivity (TP/P), and precision( TP/(TP+FP) ).
 
A confusion matrix is essentially used to evaluate the performance of a classification model when the true labels of the test data are already known; it can also be extended to targets with more than two categories. It helps in the visualisation and evaluation of the results of the statistical process.
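
A short sketch computing these measures from a confusion matrix with scikit-learn (the labels are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical truth values and binary predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)      # a.k.a. recall / TPR
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)

print(tp, fp, fn, tn, accuracy, sensitivity, specificity, precision)
```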
A validation set can be considered part of the training set as it is used for parameter selection and to avoid overfitting the model being built. On the other hand, a test set is used for testing or evaluating the performance of a trained machine learning model.

In simple terms, the differences can be summarised as :

* Training Set is to fit the parameters, i.e. weights.
* Test Set is to assess the performance of the model, i.e. evaluating the predictive power and generalisation.
* The validation set is to tune the hyperparameters.
Outlier values can be identified by using univariate or any other graphical analysis method. If the number of outlier values is few then they can be assessed individually but for a large number of outliers, the values can be substituted with either the 99th or the 1st percentile values.

All extreme values are not outlier values. The most common ways to treat outlier values are :

1) To change the value and bring it within a range
2) To just remove the value.
There are various methods to assess the results of a logistic regression analysis :
* Using Classification Matrix to look at the true negatives and false positives.
* Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.
* Lift helps assess the logistic model by comparing it with random selection.
Weights are initialized in neural networks to avoid problems such as vanishing and exploding gradients. If the weights are not initialized properly, the gradient can vanish or explode rapidly during the passes through a deep neural network. That can cause slow convergence of the network, or the network may not converge at all in some cases. Proper initialization also ensures that the optimization does not oscillate near the minima.
Dropout is a regularisation method used for deep neural networks to train different neural network architectures on a given dataset. When the neural network is trained on a dataset, a few nodes of the architecture are randomly dropped out of the network. This method introduces noise into the network by compelling nodes within a layer to probabilistically take on more or less responsibility for the input values. Thus, dropout makes the neural network model more robust, since the remaining units must learn to compensate for the dropped ones instead of relying on any fixed set of units in the prior layers.
Mean squared error (MSE) is the average of the squared differences (actual value - predicted value) over all data points. It gives an estimate of the overall squared error. Root mean squared error (RMSE) is the square root of the mean squared error.
Wide-format is where we have a single row for every data point with multiple columns to hold the values of various attributes. The long format is where for each data point we have as many rows as the number of attributes and each row contains the value of a particular attribute for a given data point.
SVM is an ML algorithm which is used for classification and regression. For classification, it finds a multi-dimensional hyperplane to distinguish between classes. SVM uses kernels, namely linear, polynomial, and RBF. There are a few parameters which need to be passed to SVM in order to specify the points to consider while calculating the hyperplane.
It completely depends on the accuracy and precision required at the point of delivery and also on how much new data we have to train on. For a model trained on 10 million rows, it is important for the new data to have a similar volume, or close to it. Training on 1 million new data points every alternate week or fortnight won't add much value in terms of increasing the efficiency of the model.
Most statistics and ML projects need to fit a model on training data to be able to create predictions. There can be two problems while fitting a model- overfitting and underfitting.
 
* Overfitting is when a model captures random error/noise rather than the expected relationship. If a model has a large number of parameters or is too complex, there can be overfitting. This leads to bad performance because minor changes to the training data greatly change the model's results.

* Underfitting is when a model is not able to understand the trends in the data. This can happen if you try to fit a linear model to non-linear data. This also results in bad performance.
Three disadvantages of the linear model are :
 
* The assumption of linearity of the errors.
* You can’t use this model for binary or count outcomes
* There are many overfitting problems that it cannot solve on its own
A/B testing is used to conduct random experiments with two variants, A and B. The goal of this testing method is to find out what changes to a web page maximize or increase the outcome of a strategy.
The ensemble is a method of combining a diverse set of learners together to improvise on the stability and predictive power of the model. Two types of Ensemble learning methods are :
 
* Bagging : The bagging method helps you implement similar learners on small sample populations and then combines their predictions. It helps you to make predictions that are closer to the actual values by reducing variance.
 
* Boosting : Boosting is an iterative method which allows you to adjust the weight of an observation depending upon the last classification. Boosting decreases the bias error and helps you to build strong predictive models.
Artificial Neural Networks (ANN) are a special set of algorithms that have revolutionized machine learning. They help the model adapt according to changing input, so the network generates the best possible result without redesigning the output criteria.
Reinforcement Learning is a learning mechanism for mapping situations to actions. The end goal is to maximize a numerical reward signal. In this method, the learner is not told which action to take but instead must discover which actions offer the maximum reward, as the method is based on a reward/penalty mechanism.
Data Science and Machine Learning are two terms that are closely related but are often misunderstood. Both of them deal with data. However, there are some fundamental distinctions that show us how they are different from each other.
 
Data Science is a broad field that deals with large volumes of data and allows us to draw insights out of this voluminous data. The entire process of Data Science takes care of multiple steps that are involved in drawing insights out of the available data. This process includes crucial steps such as data gathering, data analysis, data manipulation, data visualization, etc.
 
Machine Learning, on the other hand, can be thought of as a sub-field of Data Science. It also deals with data, but here, we are solely focused on learning how to convert the processed data into a functional model, which can be used to map inputs to outputs, e.g., a model that can accept an image as an input and tell us if that image contains a flower as an output.
 
In short, Data Science deals with gathering data, processing it, and finally, drawing insights from it. The field of Data Science that deals with building models using algorithms is called Machine Learning. Therefore, Machine Learning is an integral part of Data Science.
RMSE stands for the Root Mean Square Error. It is a measure of accuracy in regression. RMSE allows us to calculate the magnitude of error produced by a regression model.

The way RMSE is calculated is as follows :
 
First, we calculate the errors in the predictions made by the regression model. For this, we calculate the differences between the actual and the predicted values. Then, we square the errors. After this step, we calculate the mean of the squared errors, and finally, we take the square root of the mean of these squared errors. This number is the RMSE, and a model with a lower value of RMSE is considered to produce lower errors, i.e., the model will be more accurate.
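
A quick worked sketch of this calculation with NumPy (values are hypothetical):

```python
import numpy as np

actual    = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then take the square root
print(rmse)
```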
In statistics, a confounder is a variable that influences both the dependent variable and independent variable.
 
For example, if you are researching whether a lack of exercise leads to weight gain,
 
lack of exercise = independent variable
weight gain = dependent variable.
 
A confounding variable here would be any other variable that affects both of these variables, such as the age of the subject.
TF–IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
 
The TF–IDF value increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
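
A minimal sketch with scikit-learn's TfidfVectorizer on a tiny hypothetical corpus (assumes a recent scikit-learn version for get_feature_names_out):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["data science uses data",
          "machine learning uses data",
          "deep learning is a subset of machine learning"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)          # one weighted vector per document

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))                   # words frequent everywhere get lower weights
```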
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
 
Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
Combinatorics is defined as a branch of mathematics that is concerned with sets of objects that meet certain conditions; in other words, it is the study of countable sets. In computer science, combinatorics is used to study algorithms, the sets of steps or rules devised to address a specific problem. Combinatorial optimisation is a subfield of combinatorics related to algorithm theory, machine learning, image analysis, and ANNs. Machine learning is related to computational statistics, which focuses on prediction making through the use of computers. Probability theory uses combinatorics to assign probability values between 0 and 1 to events and compare them with probability models. Real-world machine learning tasks frequently involve combinatorial structure.

We often model, infer, or predict with graphs, matchings, hierarchies, informative subsets, or other discrete structures that underlie the data. In artificial neural networks, combinatorics appears in feature selection and in parameter optimisation of feed-forward networks. In feature selection, you are trying to find an optimal combination of features to use from a finite set of possibilities; greedy algorithms, meta-heuristics, and information-gain filtering are all common approaches. Back-propagation is an algorithm used in artificial neural networks to find a near-optimal set of weights/parameters.
The Box-Cox transformation is a method of normalising data, named after the two statisticians who introduced it, George Box and David Cox. Each data point, X, is transformed using the formula X^a, where a represents the power to which each data point is raised. The Box-Cox transformation searches over values of 'a' from -5 to +5 until the value that best normalizes the data is identified.
When the observations in a dataset are spread equally across the range of the distribution, it is referred to as a uniform distribution. There are no clear peaks in a uniform distribution. Distributions that have more observations on one side of the graph than the other are referred to as skewed distributions. Distributions with fewer observations on the left (towards lower values) are skewed left, and distributions with fewer observations on the right (towards higher values) are skewed right.
In a Bayesian estimate, we have some knowledge about the data/problem (the prior). There may be several values of the parameters which explain the data, and hence we can look for multiple parameters, for example 5 gammas and 5 lambdas, that do this. As a result of the Bayesian estimate, we get multiple models for making multiple predictions, i.e. one for each pair of parameters but with the same prior. So, if a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose.
 
Maximum likelihood estimation does not take the prior into consideration (it ignores the prior), so it is like being a Bayesian who uses some kind of flat prior.
ADAM ALGORITHM : Adaptive Moment Estimation, shortly called ADAM, is a combination of Momentum and RMSProp. In the AdaGrad algorithm, the sum of squared gradients is accumulated; this sum only grows, so learning slows down over time. RMSProp (root mean square propagation) fixes the issue by introducing a decay factor. In the Adam algorithm, two decay rates are used, namely beta1 and beta2, where beta1 controls the first moment (a decayed sum of gradients) and beta2 controls the second moment (a decayed sum of squared gradients). Since the momentum term gives us a faster path and RMSProp lets the step size adapt in different directions, the combination of the two works well. Thus, the Adam algorithm is considered the go-to choice among deep learning optimizers.

MOMENTUM ALGORITHM : Vanilla gradient descent with momentum is a method of accelerating gradient descent so that it moves faster towards the global minimum. Mathematically, a decay rate is multiplied by the previous sum of gradients and added to the present gradient to get a new sum of gradients. When the decay rate is set to zero, it reduces to normal gradient descent. When the decay rate is set to 1, it oscillates like a ball in a frictionless bowl without ever settling. Hence the decay rate is typically chosen around 0.8 to 0.9 so that the optimization comes to an end. The momentum algorithm gives us the advantage of escaping local minima and reaching the global minimum.
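
A bare-bones sketch of the momentum update described above, minimising a simple quadratic (learning rate and decay rate are illustrative choices):

```python
# Minimise f(x) = x^2 with vanilla gradient descent plus momentum
grad = lambda x: 2 * x

x, velocity = 5.0, 0.0
learning_rate, decay_rate = 0.1, 0.9   # decay_rate is the momentum term

for _ in range(100):
    velocity = decay_rate * velocity + grad(x)   # decayed sum of past gradients plus current gradient
    x -= learning_rate * velocity

print(x)   # close to the global minimum at 0
```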
Pruning is the process of reducing the size of a decision tree. The reason for pruning is that the trees prepared by the base algorithm can be prone to overfitting as they become incredibly large and complex.
Handling missing data is a common challenge in data science, as missing values can have a significant impact on the results of data analysis and modeling. There are several techniques to handle missing data, including:

Deletion : This involves removing observations or variables with missing data. This method is simple but can result in loss of information and reduced sample size, which can impact the validity of the results.

Mean/Median/Mode Imputation : This involves replacing missing values with the mean, median or mode of the available data for the variable. This is simple but can be biased if the missing data is not missing at random.

Predictive imputation : This involves using a predictive model to estimate missing values based on the values of other variables in the dataset. This can be a more sophisticated approach but requires careful consideration of the choice of model and the potential for introducing bias.

Multiple Imputation : This involves creating multiple imputed datasets, each with different imputed values for the missing data, and combining the results from these datasets to account for the uncertainty in the imputed values.

Data Augmentation : This involves generating new synthetic observations based on the existing data to increase the sample size and reduce the impact of missing data.

The choice of method will depend on the specific situation and the objectives of the analysis, and some methods may be more appropriate in certain circumstances than others.
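
A minimal sketch contrasting mean imputation with a predictive-style (nearest-neighbour) imputation, using scikit-learn and hypothetical data:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical data with missing entries
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Mean imputation: replace NaNs with the column mean
print(SimpleImputer(strategy="mean").fit_transform(X))

# Predictive-style imputation: estimate NaNs from the nearest complete neighbours
print(KNNImputer(n_neighbors=2).fit_transform(X))
```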
The steps involved in a typical data science project can be summarized as follows:

Problem Definition : Identify the problem to be solved and clearly define the objectives and goals of the project.

Data Collection : Gather the necessary data from various sources, such as databases, APIs, or web scraping.

Data Cleaning and Preprocessing : Clean and preprocess the data to handle missing values, outliers, and other issues that may affect the results.

Exploratory Data Analysis (EDA) : Perform an exploratory analysis of the data to gain insights into the underlying structure and relationships, and identify potential challenges and biases.

Feature Engineering : Create new features or transform existing features to improve the performance of the models.

Model Selection : Choose the appropriate machine learning model based on the problem definition and the results of the EDA.

Model Training : Train the model on the cleaned and preprocessed data.

Model Evaluation : Evaluate the performance of the model using appropriate metrics, such as accuracy, precision, recall, or AUC.

Hyperparameter Tuning : Optimize the model's performance by adjusting its hyperparameters.

Deployment : Deploy the model in a production environment and monitor its performance.

Model Maintenance : Regularly update and maintain the model to ensure that it continues to perform well and reflect changes in the underlying data.

These steps are not always performed in a strict sequence and may involve iteration and refinement throughout the project. Additionally, some steps may be omitted or added depending on the specific requirements of the project.
Overfitting is a common problem in machine learning where a model is too complex and fits the training data too closely, capturing the noise or random fluctuations in the data instead of the underlying pattern. This can result in a model that performs well on the training data but poorly on new, unseen data, leading to poor generalization performance.

To prevent overfitting, there are several techniques that data scientists can use :

Regularization : Regularization is a technique that adds a penalty term to the loss function of the model to reduce its complexity and prevent overfitting. Common regularization methods include L1 (Lasso) and L2 (Ridge) regularization.

Cross-validation : Cross-validation is a technique that involves dividing the dataset into multiple parts and using one part for testing and the other parts for training. This allows the data scientist to assess the model's performance on multiple partitions of the data, helping to prevent overfitting.

Early stopping : Early stopping is a technique used in deep learning to prevent overfitting by monitoring the performance on a validation set during training and stopping the training process when the performance on the validation set stops improving.

Ensemble methods : Ensemble methods, such as bagging and boosting, can be used to prevent overfitting by combining the predictions of multiple models.

Simplifying the model : Reducing the complexity of the model, for example, by using a simpler model architecture or fewer features, can help prevent overfitting.

Adding more data : Increasing the size of the training dataset can help prevent overfitting by providing the model with more information to learn from and reducing the impact of any noise in the data.
The bias-variance tradeoff is a fundamental concept in machine learning that refers to the balance between two types of errors that a model can make. These errors are bias and variance.

Bias refers to the error introduced by approximating the relationship between the features and the target variable with an overly simple model. A model with high bias tends to make the same errors consistently and underfits the data, meaning it doesn't capture the complexity of the underlying relationship between the features and target variable.

Variance, on the other hand, refers to the error introduced by the model's sensitivity to small fluctuations in the training data. A model with high variance overfits the data, meaning it captures the noise in the training data rather than the underlying relationship between the features and target variable.

The bias-variance tradeoff refers to the balance between these two types of errors. A model with low bias and high variance is likely to overfit the data, while a model with high bias and low variance is likely to underfit the data. The goal is to find a balance between these two errors to produce a model that generalizes well to new, unseen data.

To balance the bias-variance tradeoff, data scientists can use techniques such as regularization, cross-validation, and ensemble methods. They can also adjust the complexity of the model, such as by increasing or decreasing the number of features or changing the model architecture, to find the optimal tradeoff between bias and variance. Ultimately, the best balance will depend on the specific problem and the characteristics of the dataset.
Evaluating a machine learning model is an important step in the development process to assess its performance and determine its suitability for the task at hand. There are several metrics that can be used to evaluate a machine learning model, including:

Accuracy : This metric measures the proportion of correct predictions made by the model. It is often used for classification problems where the goal is to assign a class label to each instance.

Precision and recall : These metrics measure the ability of the model to identify positive instances while avoiding false positives and false negatives, respectively. They are often used in classification problems where there is a class imbalance or when it is important to minimize false positives or false negatives.

F1 Score : The F1 score is the harmonic mean of precision and recall and is a good metric to use when there is an imbalanced class distribution.

Area under the receiver operating characteristic (ROC) curve (AUC-ROC) : This metric measures the ability of the model to distinguish between positive and negative instances. The ROC curve plots the true positive rate against the false positive rate for a range of threshold values, and the AUC-ROC is the area under this curve.
Mean squared error (MSE) : This metric measures the average squared difference between the predicted values and the true values for a regression problem. The goal is to minimize the MSE to produce accurate predictions.

Mean absolute error (MAE) : This metric measures the average absolute difference between the predicted values and the true values for a regression problem. Like the MSE, the goal is to minimize the MAE to produce accurate predictions.

In addition to these metrics, it is also important to perform a visual analysis of the model's predictions, such as plotting the predicted vs. actual values, to gain a deeper understanding of its performance and identify any patterns or trends in the errors.

It is important to use appropriate metrics that are relevant to the specific problem and to use cross-validation techniques to obtain a more robust estimate of the model's performance. The best way to evaluate a model will depend on the specific problem and the type of data being used.
A decision tree and a random forest are both machine learning algorithms used for classification and regression problems. However, there are several key differences between the two.

A decision tree is a type of model that makes predictions by recursively partitioning the input space into smaller and smaller regions, known as branches or leaves. At each node in the tree, a decision is made based on the value of a feature that maximizes the separation of the target variable. The final prediction is made by following the path from the root of the tree to a leaf node. Decision trees are simple to understand and interpret, but they are prone to overfitting and can easily capture noise in the data.

A random forest, on the other hand, is an ensemble method that builds multiple decision trees and aggregates their predictions to make a final prediction. In a random forest, each tree is built using a random subset of the features, and the final prediction is made by averaging the predictions of all the trees. This randomization helps to reduce the variance of the model and prevent overfitting. The resulting model is more robust and can produce better predictions on unseen data.
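
A short sketch comparing the two on a built-in dataset (the dataset and hyperparameters are illustrative; results will vary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree   = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # many randomised trees

print(cross_val_score(tree, X, y, cv=5).mean())    # single tree: higher variance
print(cross_val_score(forest, X, y, cv=5).mean())  # averaged trees: usually generalises better
```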
The curse of dimensionality refers to the difficulties that arise when working with high-dimensional data. High-dimensional data is data with a large number of features or dimensions, and the curse of dimensionality refers to the fact that many common techniques and algorithms that work well with low-dimensional data become ineffective or even break down entirely when applied to high-dimensional data.

The curse of dimensionality arises due to the following reasons :

Sparsity : With increasing number of dimensions, the amount of data that can be stored in any given region of space decreases rapidly. This means that the data becomes sparse and widely dispersed in high-dimensional space, making it difficult to detect patterns or relationships in the data.

Distance metrics : In high-dimensional space, the distance between two points can become extremely large, even if they are close together in the original space. This makes it difficult to use traditional distance metrics such as Euclidean distance to measure similarity between data points.
Overfitting : In high-dimensional space, the number of features or dimensions can become very large compared to the number of data points. This makes it easy for models to overfit the data, that is, to fit the noise in the data instead of the underlying patterns.

These issues make it difficult to apply traditional machine learning algorithms to high-dimensional data. Dimensionality reduction techniques, such as those mentioned earlier, can help alleviate the curse of dimensionality by reducing the number of dimensions in the data, making it more manageable and allowing traditional algorithms to be applied.
Data visualization is an important step in the data analysis process, as it allows us to gain insights into the data, identify patterns, and communicate findings effectively. Here are some common data visualization techniques:

Line charts : Line charts are used to visualize trends over time, such as stock prices, sales data, or temperature readings.

Bar charts : Bar charts are used to compare the magnitude of different categories, such as the number of products sold by different companies or the popularity of different music genres.

Histograms : Histograms are used to visualize the distribution of a single variable, such as the height or weight of a group of people.

Scatter plots : Scatter plots are used to visualize the relationship between two variables, such as the relationship between height and weight, or the relationship between years of experience and salary.
Box plots : Box plots are used to visualize the distribution of a single variable, by showing the median, quartiles, and outliers.

Heat maps : Heat maps are used to visualize the relationship between two variables, where the color scale represents the magnitude of the relationship.

Pie charts : Pie charts are used to visualize the proportion of different categories, such as the proportion of different expenses in a budget.

Area charts : Area charts are used to visualize trends over time, similar to line charts, but the area under the line is filled in to represent the magnitude of the data.

These are just a few examples of the many data visualization techniques that are commonly used in Data Science. The choice of visualization technique depends on the type of data and the question being asked. It's important to choose a visualization that effectively communicates the insights from the data and is easy to interpret.
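
A minimal matplotlib sketch showing two of these techniques, a histogram and a scatter plot, on hypothetical simulated data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(170, 10, 200)
weights = 0.9 * heights - 90 + rng.normal(0, 5, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights, bins=20)                 # histogram: distribution of one variable
ax2.scatter(heights, weights, s=10)        # scatter plot: relationship between two variables
ax1.set_title("Height distribution")
ax2.set_title("Height vs. weight")
plt.show()
```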
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function. The goal of regularization is to keep the model parameters from becoming too large, which can cause the model to fit the noise in the data instead of the underlying pattern.

L1 and L2 regularization are two commonly used types of regularization.

L1 regularization, also known as Lasso regularization, adds a penalty term to the loss function that is proportional to the absolute value of the coefficients. This has the effect of shrinking the coefficients towards zero, which can lead to sparse solutions, where some of the coefficients are exactly zero. In other words, L1 regularization encourages the model to use only a subset of the features.
L2 regularization, also known as Ridge regularization, adds a penalty term to the loss function that is proportional to the square of the coefficients. This has the effect of shrinking the coefficients towards zero, but unlike L1 regularization, it does not encourage sparse solutions. L2 regularization tends to produce models with small, non-zero coefficients.

Simple Answer :  The main difference between L1 and L2 regularization is the way in which they penalize large coefficients. L1 regularization encourages sparse solutions, while L2 regularization discourages large coefficients, but does not encourage sparse solutions. The choice between L1 and L2 regularization depends on the problem at hand and the desired properties of the solution.
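
A short sketch contrasting the two penalties with scikit-learn's Lasso (L1) and Ridge (L2) on synthetic data (alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data where only 5 of 20 features actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: many coefficients driven exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrunk but rarely exactly zero

print("L1 zero coefficients:", np.sum(lasso.coef_ == 0))
print("L2 zero coefficients:", np.sum(ridge.coef_ == 0))
```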
Feature selection is an important step in the data analysis process, as it can improve the performance of machine learning models by reducing overfitting, improving interpretability, and reducing the computational cost of training.

Here are some popular feature selection methods :

Filter methods : Filter methods evaluate each feature independently and rank them based on a criterion, such as information gain or chi-squared test statistics. Features with the highest ranking are selected for use in the model.

Wrapper methods : Wrapper methods evaluate feature subsets by training a machine learning model and evaluating its performance. The goal is to find the subset of features that results in the best model performance. Wrapper methods can be computationally expensive, as they require training a model multiple times with different feature subsets.
Embedded methods : Embedded methods use the learning algorithm itself to perform feature selection. Regularization methods, such as L1 regularization, are examples of embedded methods, as they shrink the coefficients of less important features towards zero.

Hybrid methods : Hybrid methods combine elements of filter, wrapper, and embedded methods to produce more effective feature selection results. For example, a hybrid method might use a filter method to pre-select a set of promising features, and then use a wrapper method to further refine the selection.

These are some of the most popular feature selection methods in Data Science. The choice of feature selection method depends on the problem at hand, the available computational resources, and the desired trade-off between computational cost and accuracy. It's also common to use multiple feature selection methods and compare their results, as different methods may produce different results for the same problem.
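
A minimal filter-method sketch with scikit-learn's SelectKBest (the dataset and k are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)       # (569, 30) -> (569, 10)
print(selector.get_support(indices=True))    # indices of the selected features
```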