Google News
Deep Learning Interview Questions
There are five main steps that are used to initialize and use the gradient descent algorithm :
* Initialize biases and weights for the network
* Send input data through the network (the input layer)
* Calculate the difference (the error) between expected and predicted values
* Change values in neurons to minimize the loss function
* Multiple iterations to determine the best weights for efficient working
Hyperparameters are variables used to determine the structure of a neural network. They are also used to understand parameters, such as the learning rate and the number of hidden layers, and more, present in the neural network.
Hyperparameters can be trained using four components as shown below :
Batch size : This is used to denote the size of the input chunk. Batch sizes can be varied and cut into sub-batches based on the requirement.

Epochs : An epoch denotes the number of times the training data is visible to the neural network so that it can train. Since the process is iterative, the number of epochs will vary based on the data.

Momentum : Momentum is used to understand the next consecutive steps that occur with the current data being executed at hand. It is used to avoid oscillations when training.

Learning rate : Learning rate is used as a parameter to denote the time required for the network to update the parameters and learn.
Transfer learning is a learning technique that allows data scientists to use what they've learned from a previous machine learning model that was used for a similar task. The ability of humans to transfer their knowledge is used as an example in this learning. You can learn to operate other two-wheeled vehicles more simply if you learn to ride a bicycle. A model trained for autonomous automobile driving can also be used for autonomous truck driving. The features and weights can be used to train the new model, allowing it to be reused. When there is limited data, transfer learning works effectively for quickly training a model.
transfer learning
In the above image, the first diagram represents training a model from scratch while the second diagram represents using a model already trained on cats and dogs to classify the different class of vehicles, thereby representing transfer learning.
Following are the advantages of transfer learning :
Better initial model : In other methods of learning, you must create a model from scratch. Transfer learning is a better starting point because it allows us to perform tasks at a higher level without having to know the details of the starting model.

Higher learning rate : Because the problem has already been taught for a similar task, transfer learning allows for a faster learning rate during training.

Higher accuracy after training : Transfer learning allows a deep learning model to converge at a higher performance level, resulting in more accurate output, thanks to a better starting point and higher learning rate.
A tensor is a multidimensional array that represents a generalization of vectors and matrices. It is one of the key data structures used in deep learning. Tensors are represented as n-dimensional arrays of base data types. The data type of each element in the Tensor is the same, and the data type is always known. It's possible that only a portion of the shape (that is, the number of dimensions and the size of each dimension) is known. Most operations yield fully-known tensors if their inputs are likewise fully known, however, in other circumstances, the shape of a tensor can only be determined at graph execution time.

Tensor in Deep Learning
The LSTM model is considered a special case of RNNs. The problems of vanishing gradients and exploding gradients we saw earlier are a disadvantage while using the plain RNN model.
In LSTMs, we add a forget gate, which is basically a memory unit that retains information that is retained across timesteps and discards the other information that is not needed. This also necessitates the need for input and output gates to include the results of the forget gate as well.
As you can see, the LSTM model can become quite complex. In order to still retain the functionality of retaining information across time and yet not make a too complex model, we need GRUs.
Basically, in GRUs, instead of having an additional Forget gate, we combine the input and Forget gates into a single Update Gate :
It is this reduction in the number of gates that makes GRU less complex and faster than LSTM.
An optimization algorithm that is used to minimize some function by repeatedly moving in the direction of steepest descent as specified by the negative of the gradient is known as gradient descent. It's an iteration algorithm, in every iteration algorithm, we compute the gradient of a cost function, concerning each parameter and update the parameter of the function via the following formula:
Gradient Descent
Θ - is the parameter vector,
α - learning rate,
J(Θ) - is a cost function
In machine learning, it is used to update the parameters of our model. Parameters represent the coefficients in linear regression and weights in neural networks.
Stochastic Gradient Descent : Stochastic gradient descent is used to calculate the gradient and update the parameters by using only a single training example.

Batch Gradient Descent :
Batch gradient descent is used to calculate the gradients for the whole dataset and perform just one update at each iteration.

Mini-batch Gradient Descent :
Mini-batch gradient descent is a variation of stochastic gradient descent. Instead of a single training example, mini-batch of samples is used. Mini-batch gradient descent is one of the most popular optimization algorithms.