Machine Learning Interview Questions
An epoch in Machine Learning indicates one complete pass of the training dataset through the algorithm. When there is a huge amount of data, it is usually grouped into several batches; each batch passing through the model is referred to as an iteration. If the batch size equals the complete training dataset, then the number of iterations is the same as the number of epochs.
 
When there is more than one batch, the relationship d*e = i*b holds, where 'd' is the size of the dataset, 'e' is the number of epochs, 'i' is the number of iterations, and 'b' is the batch size.
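The relationship above can be sketched as a quick calculation (the dataset size, epoch count, and batch size below are hypothetical):

```python
# From d * e = i * b, the number of iterations is i = (d / b) * e,
# i.e. (batches per epoch) * (number of epochs).

def iterations(dataset_size, epochs, batch_size):
    """Number of gradient-update iterations, assuming full batches."""
    return (dataset_size // batch_size) * epochs

# Hypothetical example: 1,000 samples, batch size 100, 5 epochs
print(iterations(1000, 5, 100))  # 10 batches per epoch * 5 epochs = 50
```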
Logistic regression is the appropriate regression analysis to use when the dependent variable is categorical or binary. Like all regression analyses, logistic regression is a technique for predictive analysis. It is used to explain the relationship between one dependent binary variable and one or more independent variables, and to predict the probability of a categorical dependent variable.
 
We can use logistic regression in the following scenarios:
 
* To predict whether a citizen is a Senior Citizen (1) or not (0)
* To check whether a person has a disease (Yes) or not (No)


There are three types of logistic regression:
 
* Binary Logistic Regression: In this, there are only two possible outcomes.
   Ex: To predict whether it will rain (1) or not (0)
 
* Multinomial Logistic Regression: In this, the output consists of three or more unordered categories.
   Ex: Predicting the regional language of a text (Kannada, Telugu, Marathi, etc.)
 
* Ordinal Logistic Regression: In this, the output consists of three or more ordered categories.
   Ex: Rating an Android application from 1 to 5 stars.
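The binary case above can be sketched with a minimal logistic regression trained by gradient descent; the 1-D dataset and hyperparameters below are entirely made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy separable data: label is 1 when x > 2, else 0 (hypothetical values)
X = [0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
y = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):                       # passes over the data
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi + b)             # predicted probability of class 1
        w -= lr * (p - yi) * xi             # gradient of log-loss w.r.t. w
        b -= lr * (p - yi)                  # gradient of log-loss w.r.t. b

pred = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in X]
print(pred)  # recovers the original labels [0, 0, 0, 1, 1, 1]
```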
The Temporal Difference (TD) learning method is a mix of the Monte Carlo method and dynamic programming. Some of the advantages of this method include:
 
* It can learn at every step, online or offline.
* It can learn from incomplete sequences as well.
* It can work in continuous environments.
* It has lower variance than the Monte Carlo (MC) method and is more efficient.


The limitations of the TD method are:
 
* It is a biased estimate.
* It is more sensitive to initialization.
Sampling techniques can help with an imbalanced dataset. There are two main ways to perform sampling: undersampling and oversampling.
 
In undersampling, we reduce the size of the majority class to match the minority class. This improves performance with respect to storage and run-time execution, but it potentially discards useful information.
 
In oversampling, we upsample the minority class, which avoids the problem of information loss; however, we run into the risk of overfitting.
 
There are other techniques as well:

Cluster-Based Oversampling: The K-means clustering algorithm is applied independently to the minority and majority class instances to identify clusters in the dataset. Each cluster is then oversampled so that all clusters of the same class have an equal number of instances and all classes have the same size.
 
Synthetic Minority Oversampling Technique (SMOTE): A subset of the minority class is taken as an example, and new synthetic instances similar to it are created and added to the original dataset. This technique works well for numerical data points.
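The core SMOTE idea — synthesizing a new minority point by interpolating between existing minority points — can be sketched as below. This is a simplified illustration with made-up data, not the full SMOTE algorithm (which interpolates toward one of the k nearest minority neighbours):

```python
import random

random.seed(42)
minority = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0)]   # hypothetical minority class

def synthetic_sample(points):
    a, b = random.sample(points, 2)                # pick two minority points
    t = random.random()                            # interpolation factor in [0, 1]
    # new point lies on the line segment between a and b
    return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))

new_points = [synthetic_sample(minority) for _ in range(3)]
print(new_points)  # three synthetic points lying between the originals
```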
Exploratory Data Analysis (EDA) helps analysts understand the data better and forms the foundation of better models.
 
Visualization :
* Univariate visualization
* Bivariate visualization
* Multivariate visualization


Missing Value Treatment : Replace missing values with the mean or median
 
Outlier Detection : Use a boxplot to inspect the distribution and spot outliers, then apply the IQR rule to set the boundaries beyond which points are treated as outliers
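The IQR rule flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]; a minimal sketch on made-up data:

```python
import statistics

data = [12, 14, 14, 15, 16, 17, 18, 19, 20, 45]   # 45 is a planted outlier

q1, _, q3 = statistics.quantiles(data, n=4)       # quartiles Q1, Q2, Q3
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # IQR fences
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # -> [45]
```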
 
Transformation : Based on the distribution, apply a transformation to the features
 
Scaling the Dataset : Apply a min-max scaler or a standard (z-score) scaler to scale the data.
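The two common scaling mechanisms can be sketched in a few lines on a made-up feature column:

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]   # hypothetical feature column

# Min-max scaling: rescale to the [0, 1] range
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: zero mean, unit (population) standard deviation
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
zscore = [(v - mu) / sigma for v in values]

print(minmax)   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(zscore)   # symmetric around 0
```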
 
Feature Engineering : Domain and SME knowledge helps the analyst derive new fields that reveal more about the nature of the data
 
Dimensionality Reduction : Helps reduce the volume of data without losing much information
Association rule mining (ARM) aims to find the association rules that satisfy a predefined minimum support and confidence in a database. AMO is mainly used to reduce the number of association rules with new fitness functions that can incorporate frequent rules.
The KNN Machine Learning algorithm is called a lazy learner because it does not learn any parameters or variables from the given training data; instead, it memorizes the training dataset and dynamically computes distances every time it needs to classify a point.
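The "lazy" behavior is visible in a minimal k-NN sketch: there is no fitting step, and all distance computation happens at prediction time (the 2-D points and labels below are made up):

```python
import math
from collections import Counter

# No training phase: the model simply stores the labeled points.
train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'),
         ((4.0, 4.2), 'B'), ((4.1, 3.9), 'B')]

def predict(point, k=3):
    # Distances to every stored point are computed only at query time
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]   # majority vote

print(predict((1.1, 0.9)))  # -> 'A'
print(predict((4.0, 4.0)))  # -> 'B'
```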
Random forest is a supervised learning algorithm used for both classification and regression. The algorithm creates decision trees on samples of the data, gets a prediction from each tree, and finally selects the best answer by means of voting.
In Machine Learning, we encounter the vanishing gradient problem while training neural networks with gradient-based methods like backpropagation. This problem makes it hard to learn and tune the parameters of the earlier layers in the network.
 
The vanishing gradient problem is one example of the unstable behavior we may encounter when training a deep neural network.
 
It describes a situation in which a deep multilayer feed-forward network or a recurrent neural network is unable to propagate useful gradient information from the output end of the model back to the layers near the input end.
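The mechanism can be sketched numerically: backpropagation multiplies one derivative factor per layer, and the sigmoid's derivative is at most 0.25, so the product shrinks geometrically with depth (weights are taken as 1.0 here purely for illustration):

```python
import math

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # maximum value is 0.25, attained at z = 0

# Gradient signal flowing back through a 20-layer chain of sigmoids,
# evaluated at pre-activation z = 0 (the best case for the sigmoid).
grad = 1.0
for layer in range(20):
    grad *= sigmoid_deriv(0.0)

print(grad)  # 0.25**20, roughly 9.1e-13: almost no signal reaches early layers
```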
A classifier in machine learning is an algorithm that automatically categorizes data into one of a set of "classes". A common example is an email classifier that scans emails and filters them by class label: Spam or Not Spam.
 
We have five types of classification algorithms, namely:
 
* Decision Tree
* Naive Bayes Classifier
* K-Nearest Neighbors
* Support Vector Machines
* Artificial Neural Networks