Machine Learning Interview Questions

Machine learning is a form of Artificial Intelligence that automates data analysis, enabling systems to learn and act from experience without being explicitly programmed.

Machine Learning algorithms can be primarily classified depending on the presence/absence of target variables.

Machine Learning works with two types of data :

***** Structured Data

***** Unstructured Data.

Some of the popular Machine Learning algorithms are :

Here, it’s important to remember that once in a while, the model needs to be checked to make sure it’s working correctly. It should be modified to make sure that it is up-to-date.

There are various means to select important variables from a data set that include the following :

Causality applies to situations where one action, say X, causes an outcome, say Y, whereas Correlation is just relating one action (X) to another action(Y) but X does not necessarily cause Y.

Supervised learning is a machine learning approach that infers a function from labeled training data. The training data consists of a set of training examples.

For example, identifying a person's gender from their height and weight. Below are the popular supervised learning algorithms.

If you build a T-shirt classifier, the labels will be “this is an S, this is an M and this is L”, based on showing the classifier examples of S, M, and L.

Unsupervised learning is also a type of machine learning algorithm, used to find patterns in a given set of data. Here, we don't have any dependent variable or label to predict. Unsupervised Learning Algorithms :

In the same example, a T-shirt clustering will categorize as “collar style and V neck style”, “crew neck style” and “sleeve types”.

At times when the model begins to underfit or overfit, regularization becomes necessary. It is a form of regression that shrinks (regularizes) the coefficient estimates towards zero. It reduces flexibility and discourages a model from learning overly complex patterns, avoiding the risk of overfitting. The model complexity is reduced and it becomes better at predicting.

Standard deviation refers to the spread of your data from the mean. Variance is the average squared deviation of each point from the mean. The two are directly related: standard deviation is the square root of variance.

Overfitting can be seen in machine learning when a statistical model describes random error or noise instead of the underlying relationship. Overfitting is usually observed when a model is excessively complex. It happens when a model has too many parameters relative to the number of training examples. A model that has been overfitted displays poor predictive performance on new data.

Overfitting occurs when a model tries to learn from a small dataset. Overfitting can be avoided by using a large amount of data. But if we have a small dataset and are forced to build a model on it, we can use a technique known as cross-validation. In this method, the model is trained on a portion of the known data and tested against a held-out portion it has not seen. The primary aim of cross-validation is to define a dataset to "test" the model during the training phase. Techniques such as regularization can also be used to prevent overfitting.

Both bias and variance are errors. Bias is an error due to erroneous or overly simplistic assumptions in the learning algorithm. It can lead to the model under-fitting the data, making it hard to have high predictive accuracy and generalize the knowledge from the training set to the test set.

Variance is an error due to too much complexity in the learning algorithm. It leads to the algorithm being highly sensitive to high degrees of variation in the training data, which can lead the model to overfit the data.

To optimally reduce the number of errors, we will need to tradeoff bias and variance.

In machine learning and other areas of information science, the set of data used to discover potentially predictive relationships is known as the 'Training Set'. The training set is the set of examples given to the learner, while the test set, which is held back from the learner, is used to test the accuracy of the hypotheses the learner generates. The training set is distinct from the test set.

Genetic programming is one of the techniques used in machine learning. The model is based on testing and selecting the best choice among a set of results.

Let’s consider a scenario of a fire emergency :

Applications of supervised machine learning include :

Supervised learning uses data that is completely labeled, whereas unsupervised learning uses no training data.

In the case of semi-supervised learning, the training data contains a small amount of labeled data and a large amount of unlabeled data.

**Precision = TP/(TP+FP)**

**Recall = TP/(TP+FN)**

**F1-Score = 2*(Recall * Precision) / (Recall + Precision)**

**Accuracy = (TP+TN)/(TP+TN+FP+FN)**
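These formulas can be checked with a small sketch; the counts below (80 true positives, 20 false positives, 10 false negatives, 90 true negatives) are invented purely for illustration.

```python
# Computing the four metrics above from raw confusion-matrix counts.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall, equivalent to 2TP/(2TP + FP + FN)
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

print(precision(80, 20))         # 0.8
print(recall(80, 10))            # 80/90 ≈ 0.889
print(accuracy(80, 90, 20, 10))  # 170/200 = 0.85
```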

Yes, machine learning can be used for time series analysis, but it depends on the application. Predictive models based on machine learning have wide applicability across time series projects. These models help in facilitating the predictive distribution of time and resources. The most widely applied machine learning methods for time series forecasting projects are :

These are important feature extraction techniques, which are majorly used for dimensionality reduction.

It is a Receiver Operating Characteristic curve, a fundamental tool for diagnostic test evaluation. The ROC curve plots sensitivity (the true positive rate) against 1 − specificity (the false positive rate) for the possible cut-off points of a diagnostic test. It is the graphical representation of the trade-off between true positive rates and false positive rates at different thresholds.

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation that converts a set of correlated variables to a set of uncorrelated variables. PCA is the most widely used tool in exploratory data analysis and in machine learning for predictive models. Moreover, PCA is an unsupervised statistical technique used to examine the interrelations among a set of variables. It is also known as a general factor analysis where regression determines a line of best fit.

It is the process of reducing random variables under consideration. Dimensionality reduction can be classified as feature selection and feature extraction.

Feature selection tries to find the subset of input variables, while feature extraction begins from an initial set of measured data and builds derived values.

Boltzmann Machines have a simple learning algorithm that helps to discover interesting features in training data. These were among the first neural networks to learn internal representations and are capable of solving difficult combinatorial problems.

Different types of Genetic Programming are :

A regression model that uses L1 Regularization is called Lasso Regression, and the Model which uses L2 Regularization is called Ridge Regression.

***** L1 regularization helps in eliminating the features that are not important.
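A short sketch (not from the original text) of the difference in practice, using scikit-learn's Lasso (L1) and Ridge (L2) on synthetic data where only the first two features matter; the data and the alpha value are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(500, 5)
# Only the first two features carry signal; the other three are pure noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(500)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1: drives irrelevant coefficients to (near-)zero
ridge = Ridge(alpha=0.5).fit(X, y)   # L2: shrinks all coefficients, but keeps them non-zero

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```

Note how Lasso eliminates the unimportant features outright, which is why L1 regularization doubles as a feature-selection tool.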

It is a technique for improving model performance by training and evaluating on multiple samples drawn from the dataset. The sampling is done by breaking the data into smaller parts that have the same number of rows. In each round, one part is selected as the test set while the remaining parts form the training set.
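As a minimal sketch of this idea, k-fold cross-validation with scikit-learn; the bundled iris dataset and the logistic regression model here are illustrative choices, not taken from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Split the data into 5 equal-sized folds; each fold serves once as the test set.
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average accuracy across the folds
```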

It consists of the following techniques :

It consists of the following techniques :

From the perspective of **inductive learning**, we are given input samples (x) and output samples (f(x)) and the problem is to estimate the function (f). Specifically, the problem is to generalize from the samples and the mapping to be useful to estimate the output for new samples in the future.

In practice it is almost always too hard to estimate the function, so we are looking for very good approximations of the function.

Some practical examples of induction are :

There are problems where inductive learning is not a good idea. It is important to know when to use and when not to use supervised machine learning.

Deductive learning is a subclass of machine learning that studies algorithms for learning provably correct knowledge. Typically such methods are used to speed up problem solvers by adding knowledge to them that is deductively entailed by existing knowledge, but that may result in faster solutions.

Random forests pool a large number of decision trees using averages or majority voting at the end. Gradient boosting machines also combine decision trees, but unlike random forests they do so from the beginning of the process: random forest creates each tree independently of the others, while gradient boosting develops one tree at a time. Gradient boosting yields better outcomes than random forests if parameters are carefully tuned, but it is not a good option if the data set contains a lot of outliers/anomalies/noise, as it can result in overfitting of the model. Random forests perform well for multiclass object detection. Gradient boosting performs well when the data is imbalanced, such as in real-time risk assessment.

Confusion matrix (also called the error matrix) is a table that is frequently used to illustrate the performance of a classification model i.e. classifier on a set of test data for which the true values are well-known.

It allows us to visualize the performance of an algorithm/model. It allows us to easily identify the confusion between different classes. It is used as a performance measure of a model/algorithm.

A confusion matrix is a summary of predictions on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class label. It gives us information about the errors made by the classifier and also the types of errors made.

ILP stands for Inductive Logic Programming. It is a part of machine learning that uses logic programming. It aims at searching for patterns in data that can be used to build predictive models. In this process, logic programs are used as the hypothesis representation.

The classifier is called "naive" because it makes assumptions that may or may not turn out to be correct.

The algorithm assumes that the presence of one feature of a class is not related to the presence of any other feature (absolute independence of features), given the class variable.

For instance, a fruit may be considered to be a cherry if it is red in color and round in shape, regardless of other features. This assumption may or may not be right (as an apple also matches the description).
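The independence assumption can be seen in a tiny sketch with scikit-learn's Gaussian Naive Bayes; the height/weight numbers and the child/adult labels below are entirely hypothetical.

```python
from sklearn.naive_bayes import GaussianNB

# Features: [height_cm, weight_kg]; labels: 0 = child, 1 = adult (invented data).
# GaussianNB models each feature's distribution per class independently.
X = [[100, 20], [110, 22], [105, 19], [170, 70], [180, 80], [175, 75]]
y = [0, 0, 0, 1, 1, 1]

clf = GaussianNB().fit(X, y)
print(clf.predict([[108, 21], [178, 77]]))  # → [0 1]
```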

Normalisation adjusts the data; regularisation adjusts the prediction function. If your data is on very different scales (especially low to high), you would want to normalise the data: alter each column to have compatible basic statistics. This can be helpful to make sure there is no loss of accuracy. One of the goals of model training is to identify the signal and ignore the noise; if the model is given free rein to minimize error, there is a possibility of suffering from overfitting. Regularization imposes some control on this by favoring simpler fitting functions over complex ones.

Before starting linear regression, the assumptions to be met are as follow :

SVM stands for **Support Vector Machine**. SVMs are supervised learning models with an associated learning algorithm that analyzes data used for classification and regression analysis.

The classification methods that SVM can handle are :

There are six types of kernels in SVM :

Entropy in Machine Learning measures the randomness in the data that needs to be processed. The more entropy in the given data, the more difficult it becomes to draw any useful conclusion from the data. For example, let’s take the incident of flipping a coin. The result of this is random as it does not favor heads or tails. Here, the result for any number of tosses cannot be predicted easily as there is no definite relationship between the action of flipping and the possible outcomes.
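The coin example can be made concrete with a small sketch of Shannon entropy: a fair coin carries exactly 1 bit of uncertainty, while a biased coin carries less.

```python
import math

def entropy(probabilities):
    # H = -sum(p * log2(p)) over outcomes with non-zero probability
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # fair coin → 1.0 bit (maximum randomness)
print(entropy([0.9, 0.1]))  # biased coin → ≈ 0.469 bits
print(entropy([1.0]))       # certain outcome → 0.0 bits
```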

Epoch in Machine Learning is used to indicate the count of passes in a given training dataset where the Machine Learning algorithm has done its job. Generally, when there is a huge chunk of data, it is grouped into several batches. Here, each of these batches goes through the given model, and this process is referred to as iteration. Now, if the batch size comprises the complete training dataset, then the count of iterations is the same as that of epochs.

In case there is more than one batch, the formula used is **d*e = i*b**, wherein **'d'** is the dataset size, **'e'** is the number of epochs, **'i'** is the number of iterations, and **'b'** is the batch size.
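A quick arithmetic check of the relation with concrete (hypothetical) numbers: a dataset of 1,000 samples trained for 5 epochs with a batch size of 100.

```python
d, e, b = 1000, 5, 100   # dataset size, epochs, batch size
i = (d * e) // b         # number of iterations (batch updates)
print(i)                 # 50
assert d * e == i * b    # the relation d*e = i*b holds
```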

Logistic regression is the proper regression analysis used when the dependent variable is categorical or binary. Like all regression analyses, logistic regression is a technique for predictive analysis. Logistic regression is used to explain data and the relationship between one dependent binary variable and one or more independent variables. Also, it is employed to predict the probability of a categorical dependent variable.

We can use logistic regression in the following scenarios :

There are three types of logistic regression:

Temporal Difference Learning Method is a mix of Monte Carlo method and Dynamic programming method. Some of the advantages of this method include :

Sampling Techniques can help with an imbalanced dataset. There are two ways to perform sampling, Under Sample or Over Sampling.

In under-sampling, we reduce the size of the majority class to match the minority class; this helps by improving performance w.r.t. storage and run-time execution, but it potentially discards useful information.

In over-sampling, we upsample the minority class and thus solve the problem of information loss; however, we run into the problem of overfitting.
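Both strategies can be sketched with plain numpy (a dedicated library such as imbalanced-learn would normally be used; the 900/100 class split below is invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
majority = np.arange(900)        # indices of 900 majority-class samples
minority = np.arange(900, 1000)  # indices of 100 minority-class samples

# Under-sampling: shrink the majority class to the minority size (discards data).
under = rng.choice(majority, size=len(minority), replace=False)

# Over-sampling: duplicate minority samples up to the majority size (risks overfitting).
over = rng.choice(minority, size=len(majority), replace=True)

print(len(under), len(over))  # 100 900
```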

The KNN machine learning algorithm is called a lazy learner. K-NN is a lazy learner because it does not learn any parameters from the given training data; instead, it memorizes the training dataset and dynamically calculates distances every time it needs to classify a new point.

The random forest can be defined as a supervised learning algorithm used for classification and regression. The random forest algorithm creates decision trees on data samples, gets a prediction from each tree, and finally selects the best one by means of voting.
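A minimal random forest sketch with scikit-learn, where each tree votes and the majority class wins; the bundled iris dataset here is purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample; predictions are pooled by voting.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))  # held-out accuracy
```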

In **Machine Learning,** we encounter the Vanishing Gradient Problem while training the Neural Networks with gradient-based methods like Back Propagation. This problem makes it hard to tune and learn the parameters of the earlier layers in the given network.

The vanishing gradients problem can be taken as one example of the unstable behavior that we may encounter when training the deep neural network.

It describes a situation where the deep multilayer feed-forward network or the recurrent neural network is not able to propagate the useful gradient information from the given output end of the model back to the layers close to the input end of the model.

A classifier in machine learning is defined as an algorithm that automatically categorizes the data into one or more of a group of “**classes**”. One of the common examples is an email classifier that can scan the emails to filter them by the given class labels: Spam or Not Spam.

We have five types of classification algorithms, namely :

Sequence prediction aims to predict elements of the sequence on the basis of the preceding elements.

A prediction model is trained with the set of training sequences. On training, the model is used to perform sequence predictions. A prediction comprises predicting the next items of a sequence. This task has a number of applications like web page prefetching, weather forecasting, consumer product recommendation, and stock market prediction.

Examples of sequence prediction problems include :

Data augmentation is a machine learning strategy that enables users to remarkably increase the diversity of data available for training models, drawing on internal and external sources within an enterprise. This does not require any new data collection.

Modification in images is one of the most helpful examples of data augmentation. We can easily perform the following activities on an image and modify it :

In machine learning projects, both R and Python come with their own advantages. However, Python is more useful for data manipulation and repetitive tasks, making it the right choice if you plan to build a digital product based on machine learning. On the other hand, R is more suitable for developing a tool for ad-hoc analysis at an early stage of the project.

Rotation is a significant step in PCA as it maximizes the separation within the variance obtained by components. Due to this, the interpretation of components becomes easier.

The motive behind doing PCA is to choose fewer components that can explain the greatest variance in a dataset. When rotation is performed, the original coordinates of the points get changed. However, there is no change in the relative position of the components.

If the components are not rotated, then we need more extended components to describe the variance.

In real-world scenarios, the attributes present in data come in varying ranges. Rescaling the characteristics to a common scale therefore helps algorithms process the data efficiently.

We can rescale the data using Scikit-learn. The code for rescaling the data using MinMaxScaler is as follows:

```
# Rescaling data with MinMaxScaler
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler

# 'url' must be defined to point to a CSV file with nine columns:
# eight input attributes followed by one output class column
names = ['attr1', 'attr2', 'attr3', 'attr4', 'attr5', 'attr6', 'attr7', 'attr8', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# Splitting the array into input and output
X = array[:, 0:8]
Y = array[:, 8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# Summarizing the modified data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])
```

Most machine learning algorithms require numbers as input. That is why we convert categorical values into factors to get numerical values. This also means we don't have to deal with dummy variables.

The functions `factor()` and `as.factor()` are used to convert variables into factors.

A rule of thumb for interpreting the variance inflation factor :

A pipeline is a sophisticated way of writing software such that each intended action while building a model can be serialized, with the process calling the individual functions for the individual tasks. The tasks are carried out in sequence for a given sequence of data points, and the entire process can be run on n threads by use of composite estimators in scikit-learn.
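A short sketch of a scikit-learn Pipeline chaining a scaling step with a classifier so both run as one serialized estimator; the iris data and the two chosen steps are illustrative, not from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                     # step 1: rescale features to [0, 1]
    ("model", LogisticRegression(max_iter=1000)),  # step 2: fit the classifier
])
pipe.fit(X, y)                # each step runs in sequence on the data
print(pipe.score(X, y))       # accuracy of the whole pipeline
```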

SVM algorithms basically have advantages in terms of complexity. First, it is worth clarifying that both logistic regression and SVM can form non-linear decision surfaces and can be coupled with the kernel trick. If logistic regression can be coupled with a kernel, why use SVM?

***** SVM is found to have better performance practically in most cases.

***** SVM is computationally cheaper, O(N^2*K) where K is the number of support vectors (support vectors are those points that lie on the class margin), whereas logistic regression is O(N^3).

***** The classifier in SVM depends only on a subset of points. Since we need to maximize the distance between the closest points of the two classes (aka the margin), we need to care about only a subset of points, unlike logistic regression.

The first reason is that **XGBoost** is an ensemble method that uses many trees to make a decision, so it gains power by repeating itself.

We will implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set of training data samples.

In the FIND-S algorithm, we initialize the hypothesis as an array of ϕ; then, in the first step, we replace it with the first positive row of our dataset, which is the most specific hypothesis.

In the next step, we traverse the dataset and check whether the target value of each row is positive; we only consider positive rows. For each positive row, we traverse it from start to end and check whether every element matches the corresponding element of our hypothesis. If an element does not match the hypothesis, we generalize the hypothesis by replacing that element with '?'.
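The steps above can be sketched compactly; the toy weather-style dataset and its column layout are invented for illustration (the last column is the target).

```python
def find_s(examples):
    positives = [row[:-1] for row in examples if row[-1] == "Yes"]
    hypothesis = list(positives[0])        # start from the first positive row
    for row in positives[1:]:
        for i, value in enumerate(row):
            if hypothesis[i] != value:     # mismatch → generalize that position to '?'
                hypothesis[i] = "?"
    return hypothesis

data = [
    ["Sunny", "Warm", "Normal", "Strong", "Yes"],
    ["Sunny", "Warm", "High",   "Strong", "Yes"],
    ["Rainy", "Cold", "High",   "Strong", "No"],   # negative rows are skipped
    ["Sunny", "Warm", "High",   "Strong", "Yes"],
]
print(find_s(data))  # → ['Sunny', 'Warm', '?', 'Strong']
```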

Let’s have a look at this table before directly jumping into the F1 score.

Prediction | Predicted Yes | Predicted No
---|---|---
Actual Yes | True Positive (TP) | False Negative (FN)
Actual No | False Positive (FP) | True Negative (TN)

In binary classification we consider the **F1** score to be a measure of the model's accuracy. The **F1** score is a weighted average of the precision and recall scores.

**F1 = 2TP/(2TP + FP + FN)**

We see scores for F1 between 0 and 1, where 0 is the worst score and 1 is the best score.

The F1 score is typically used in information retrieval to see how well a model retrieves relevant results and how well our model is performing.

Overfitting means the model fits the training data too well; in this case, we need to resample the data and estimate the model accuracy using techniques like k-fold cross-validation.

Whereas for the Underfitting case we are not able to understand or capture the patterns from the data, in this case, we need to change the algorithms, or we need to feed more data points to the model.

There are two kinds of methods that include direct methods and statistical testing methods :

The silhouette score is the one most frequently used while determining the optimal value of k.

Visually, we can use plots. A few of the normality checks are as follows :

The process of choosing among diverse mathematical models that describe the same data is known as **Model Selection**. Model selection is applied in the fields of **statistics**, **data mining**, and **machine learning**.

There are several essential steps we must follow to achieve a good working model while doing a Machine Learning Project.

Those steps may include **parameter tuning**, **data preparation**, **data collection**, **training the model**, **model evaluation**, and **prediction**, etc.

In **machine learning**, **lazy learning** can be described as a method where induction and generalization are delayed until classification is performed. Because of this property, an instance-based learning algorithm is sometimes called a lazy learning algorithm.

**Regularization** is a form of regression which constrains/**regularizes** or **shrinks** the coefficient estimates towards zero. In other words, it discourages learning a more complex or flexible model, to avoid the risk of overfitting. It reduces the variance of the model without a substantial increase in its bias.

Regularization is used to address overfitting problems as it penalizes the loss function by adding a multiple of an **L1 (LASSO)** or an **L2 (Ridge)** norm of weights vector w.

Dimensionality reduction is a technique for reducing the number of features in a dataset while still retaining the important information. This can be useful in many areas, such as image and speech recognition, natural language processing, and even in stock market analysis.

The basic idea behind dimensionality reduction is to project the high-dimensional data onto a lower-dimensional space while preserving the structure and relationships between the data points. There are many techniques for doing this, including:

**Principal Component Analysis (PCA) :** This is a linear dimensionality reduction technique that transforms the data to a lower-dimensional space by finding the directions of maximum variance in the data. The transformed data is then projected onto these directions, effectively removing the redundancy in the data.

**Singular Value Decomposition (SVD) :** This is a linear dimensionality reduction technique that uses matrix decomposition to reduce the dimensionality of the data.

**t-SNE (t-distributed Stochastic Neighbor Embedding) :** This is a non-linear dimensionality reduction technique that maps high-dimensional data to a lower-dimensional space while preserving the local structure of the data.

**Auto-Encoder :** This is a deep learning-based approach to dimensionality reduction that involves training a neural network to reconstruct the original data from a lower-dimensional representation.

In each of these techniques, the goal is to find a lower-dimensional representation of the data that retains the important information. The reduced representation can then be used for further analysis or as input to a machine learning algorithm.
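As a minimal sketch of the first technique above, PCA via scikit-learn, projecting 4-dimensional data down to 2 uncorrelated components; the bundled iris dataset is used here only as an example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

pca = PCA(n_components=2)          # keep the 2 directions of maximum variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```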

Feature engineering is the process of creating new features or transforming existing ones in order to improve the performance of machine learning models. Some popular techniques for feature engineering include:

**Binning :** Binning is the process of converting a continuous feature into a categorical feature by dividing the range of the feature into bins. This can be useful for dealing with non-linear relationships in the data.

**One-hot encoding :** One-hot encoding is a technique for converting categorical features into a numerical representation that can be used by machine learning algorithms. The idea is to convert each unique category into a new binary feature, with a value of 1 indicating the presence of that category and a value of 0 indicating its absence.

**Polynomial features :** Polynomial features are new features generated by raising the original features to a power and combining them in a polynomial expression. This can be useful for capturing non-linear relationships in the data.

**Interaction features :** Interaction features are new features generated by combining two or more existing features. This can be useful for capturing the combined effect of multiple features on the target variable.

**Logarithmic transformation :** Logarithmic transformation is a technique for transforming a feature by taking the logarithm of its values. This can be useful for reducing the skew in the distribution of the feature and improving the linearity of the relationship between the feature and the target variable.

**Scaling and normalization :** Scaling and normalization are techniques for transforming features so that they have similar ranges and distributions. This can be important for ensuring that the features have similar importance in the machine learning algorithms.

**Aggregation :** Aggregation is the process of summarizing data by aggregating it into a smaller number of features. This can be useful for reducing the noise in the data and improving the interpretability of the features.

These are just a few of the many techniques that can be used for feature engineering. The choice of technique will depend on the specific problem and the characteristics of the data. The goal of feature engineering is to create new features that capture the relationships and patterns in the data in a way that will improve the performance of the machine learning models.
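Two of the techniques above, one-hot encoding and binning, can be sketched with pandas; the toy frame, bin edges, and labels below are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"], "age": [5, 23, 47, 61]})

# One-hot encoding: each category becomes its own 0/1 column.
encoded = pd.get_dummies(df["size"], prefix="size")

# Binning: the continuous 'age' column becomes three categorical buckets.
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 50, 100],
                       labels=["child", "adult", "senior"])

print(encoded.columns.tolist())  # ['size_L', 'size_M', 'size_S']
print(df["age_bin"].tolist())    # ['child', 'adult', 'adult', 'senior']
```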

In summary, supervised learning is best suited for problems where the desired output is known for each input, and the goal is to learn the mapping between inputs and outputs. Reinforcement learning is best suited for problems where the goal is to learn a policy that maximizes a reward signal in an environment through trial and error.