Data Science Interview Questions
A validation set can be considered part of the training process, as it is used for model and hyperparameter selection and to avoid overfitting the model being built. A test set, on the other hand, is used only to evaluate the performance of the final trained machine learning model.

In simple terms, the differences can be summarised as follows:

* The training set is used to fit the model's parameters, i.e. the weights.
* The test set is used to assess the performance of the final model, i.e. to evaluate its predictive power and generalisation.
* The validation set is used to tune the hyperparameters and select between candidate models.
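
As a quick illustration, here is a minimal sketch of carving out all three sets, assuming scikit-learn and a toy feature matrix and label vector made up purely for this example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data purely for illustration.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# Hold out 20% as the test set (used only for the final evaluation).
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remainder into training (fit weights) and validation (tune hyperparameters).
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

# Result: roughly 60% train, 20% validation, 20% test.
```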
Outlier values can be identified using univariate plots (such as box plots) or other graphical analysis methods. If there are only a few outliers, they can be assessed individually; for a large number of outliers, the values can be substituted with the 99th or 1st percentile values.

Not all extreme values are outliers. The most common ways to treat outlier values are:

1) Cap the value to bring it within an acceptable range (as in the percentile substitution above).
2) Remove the value altogether.
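
A minimal sketch of both options, assuming a pandas Series of numeric values with a few injected extremes:

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.normal(50, 10, 1000))
values.iloc[:5] = [500, -400, 350, -300, 600]  # inject a few extreme values

low, high = values.quantile(0.01), values.quantile(0.99)

# Option 1: cap (winsorise) the outliers at the 1st/99th percentile values.
capped = values.clip(lower=low, upper=high)

# Option 2: simply drop the outlying rows.
filtered = values[(values >= low) & (values <= high)]
```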
There are various methods to assess the results of a logistic regression analysis:
* Using the confusion (classification) matrix to look at the true/false positives and true/false negatives.
* Concordance, which measures the ability of the logistic model to differentiate between the event happening and not happening.
* Lift, which assesses the logistic model by comparing its performance with random selection.
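
For the first point, a small scikit-learn sketch (using a synthetic dataset only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
```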
Weights are initialised carefully in neural networks to avoid the vanishing and exploding gradient problems. If the weights are not initialised properly, the activations and gradients can vanish or explode rapidly as they propagate through a deep neural network. That can cause slow convergence, or the network may not converge at all. Good initialisation also helps ensure that we do not oscillate near the minima.
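
As an illustration of how the scale of the initial weights is controlled, here is a small NumPy sketch of He initialisation, one common scheme (the exact choice depends on the activation function; the function name `he_init` is just for this example):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    # He initialisation: variance scaled by 2 / fan_in keeps activations
    # from shrinking or blowing up as they pass through ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = he_init(784, 256)
W2 = he_init(256, 64)
```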
Dropout is a regularisation method for deep neural networks that, in effect, trains many different network architectures on a given dataset. During training, individual nodes within a layer are randomly dropped from the network with some probability. This introduces noise by compelling the remaining nodes to probabilistically take on more or less responsibility for the inputs. As a result, units cannot rely too heavily on specific units in prior layers, which makes the neural network model more robust.
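
A minimal sketch of how a dropout layer behaves during training, using the "inverted dropout" convention (assumed here; the function name `dropout_forward` is illustrative):

```python
import numpy as np

def dropout_forward(activations, drop_prob=0.5, training=True, rng=np.random.default_rng(0)):
    if not training or drop_prob == 0.0:
        return activations  # dropout is disabled at inference time
    # Randomly zero out a fraction of the units and rescale the survivors
    # so the expected activation stays the same.
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

hidden = np.ones((4, 8))
print(dropout_forward(hidden, drop_prob=0.5))
```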
Mean squared error (MSE) is the mean of the squared differences between actual and predicted values over all data points; it gives an estimate of the average squared error. Root mean squared error (RMSE) is the square root of the MSE, which brings the error back to the same units as the target variable.
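
In code, a small NumPy sketch (with made-up values):

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((actual - predicted) ** 2)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target variable
print(mse, rmse)
```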
Wide format is where we have a single row for every data point, with multiple columns holding the values of its various attributes. Long format is where, for each data point, we have as many rows as there are attributes, and each row contains the value of one particular attribute for that data point.
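
For instance, with pandas (a minimal sketch using made-up columns):

```python
import pandas as pd

# Wide format: one row per subject, one column per attribute.
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "height": [170, 165],
    "weight": [70, 60],
})

# Long format: one row per (subject, attribute) pair.
long = wide.melt(id_vars="subject", var_name="attribute", value_name="value")

# And back to wide format again.
wide_again = long.pivot(index="subject", columns="attribute", values="value").reset_index()
```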
SVM (Support Vector Machine) is an ML algorithm used for classification and regression. For classification, it finds a hyperplane in a multi-dimensional space that separates the classes. SVM can use kernels such as linear, polynomial, and RBF. A few hyperparameters (for example, the regularisation parameter C and the kernel coefficient gamma) need to be passed to the SVM to control which points are considered when calculating the hyperplane.
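
A scikit-learn sketch showing the kernel choice and the main hyperparameters, on a synthetic dataset used only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# kernel: "linear" / "poly" / "rbf"; C controls the margin-vs-error trade-off;
# gamma controls how far the influence of a single training point reaches.
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```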
It completely depends on the accuracy and precision required at the point of delivery, and on how much new data is available to train on. For a model trained on 10 million rows, it is important that the new data has the same volume, or close to it. Training on 1 million new data points every alternate week, or fortnight, won't add much value in terms of improving the model.
Most statistics and ML projects need to fit a model on training data to be able to make predictions. Two problems can arise while fitting a model: overfitting and underfitting.
 
* Overfitting is when a model captures the random error/noise in the training data rather than the underlying relationship. Models with a very large number of parameters, or models that are too complex, are prone to overfitting. This leads to poor performance because minor changes to the training data cause large changes in the model's predictions.

* Underfitting is when a model is not able to capture the underlying trend in the data. This can happen if you try to fit a linear model to non-linear data, and it also results in poor performance.
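
A small sketch contrasting the two, assuming scikit-learn and a synthetic non-linear target (the data and polynomial degree are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 40)

# Underfitting: a straight line cannot capture the sinusoidal trend.
underfit = LinearRegression().fit(X, y)

# Overfitting: a high-degree polynomial chases the noise in the training points.
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X, y)

# The overfit model scores near-perfectly on its own training data,
# but would generalise poorly to new points.
print(underfit.score(X, y), overfit.score(X, y))
```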