Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.
Deep learning is a branch of machine learning built on large neural networks. It teaches computers to do what comes naturally to humans: learn by example. It is a key technology behind driverless cars and is also used in medical research to detect cancer cells, where it can achieve state-of-the-art accuracy that sometimes exceeds human-level performance.
The term “deep” usually refers to the number of hidden layers in the neural network. These models are trained on large sets of labeled data with neural network architectures that learn features directly from the data, without the need for manual feature extraction. Examples of deep neural networks include MLPs and CNNs.
A perceptron (a single neuron model) is always feedforward, that is, all the arrows point in the direction of the output. In addition, it is assumed that all the arrows go from layer ‘i’ to layer ‘i+1’, and it is usual, at least to start with, that all the arcs from layer ‘i’ to layer ‘i+1’ are present.
Finally, having multiple layers means more than two layers, that is, you have hidden layers. A multilayer perceptron consists of at least 3 layers of nodes: a. input, b. hidden, and c. output.
A perceptron is a network with two layers, one input and one output whereas a multilayered network means that you have at least one hidden layer (we call all the layers between the input and output layers hidden).
(where y_hat is the final class label that we return as the prediction based on the input x, ‘a’ denotes the activated neurons, and ‘w’ denotes the weight coefficients.)
The input layer holds the training observations that are fed through the neurons.
It computes a linear function (z = Wx+b) followed by an activation function.
These are the intermediate layers between input and output which help the neural networks to learn the complicated relationships involved in data.
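The linear step z = Wx + b followed by an activation can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sigmoid activation, the helper name `neuron_forward`, and all numeric values are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(W, x, b):
    """Linear step z = Wx + b followed by an activation (sigmoid assumed here)."""
    z = W @ x + b
    return sigmoid(z)

# Hypothetical values: a layer of 2 neurons receiving 3 input features.
W = np.array([[0.5, -0.2, 0.1],
              [0.0, 0.3, -0.4]])
x = np.array([1.0, 2.0, 3.0])
b = np.array([0.1, -0.1])

a = neuron_forward(W, x, b)  # one activation value per neuron
```

Each row of `W` holds the weights of one neuron, so the layer produces one activation per row.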
The sigmoid function is a logistic function bounded by 0 and 1. It is an activation function that is continuous and differentiable. Its nonlinearity helps to increase performance, and its smoothness ensures that small changes in the weights and bias cause only small changes in the output of the neuron.
where Alpha is the slope parameter of the above function.
The maxout function was introduced in 2013 by Ian Goodfellow (later a research scientist at Google Brain) and his co-authors. It facilitates optimization by dropout and improves the accuracy of dropout’s fast approximate model averaging technique. It learns not just the relationship between the hidden units, but also the activation function of each hidden unit.
As a result, the weights are pushed towards becoming smaller (closer to 0).
Normalizing the input x makes the cost function faster to optimize.
We use epsilon during normalization to avoid division by zero.
We should try random values rather than carrying out a systematic (grid) search, because we don’t know in advance which hyperparameter is more important than another. At the same time, the choice of alpha (the learning rate) also matters a lot.
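As a rough sketch of trying random values for alpha, the snippet below draws a handful of candidate learning rates. Sampling the exponent uniformly (a log scale) and the range 1e-4 to 1e-1 are assumptions made for this illustration, not something the text prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed search range for alpha: 10^-4 up to 10^-1.
# Sampling the exponent uniformly spreads candidates evenly across
# orders of magnitude instead of clustering them near the upper bound.
exponents = rng.uniform(-4, -1, size=5)
learning_rates = 10.0 ** exponents
```

Each candidate would then be used for a short training run, and the best-performing value kept.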
We use it to pass the variables computed during the forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute the derivatives.
Data normalization is a very important preprocessing step, used to rescale values to fit in a specific range to assure better convergence during backpropagation. In general, it boils down to subtracting the mean of each feature from every data point and dividing by that feature's standard deviation.
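A minimal sketch of this rescaling, assuming the statistics are computed per feature (column) and that a small epsilon guards against division by zero. The function name `standardize` and the sample matrix are hypothetical.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Subtract the per-feature mean and divide by the per-feature std.

    eps keeps the division safe when a feature is constant (std == 0).
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)

# Hypothetical data: 3 observations, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_norm = standardize(X)
```

After the transformation both columns have roughly zero mean and unit standard deviation, so no single feature dominates the gradient.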
A neural network transforms the data into a form that makes the desired problem easier to solve, which is why this is called representation learning.
Logistic regression has a non-linear activation function that squashes the linear input, meaning that it returns a conditional probability, but the model still depends on its inputs only through a linear combination with the weight coefficients. Hence, it is a generalized linear model.
Neural networks use non-linear activation functions such as ReLU in their hidden layers, which gives them the power to approximate non-linear functions.
The goal of an activation function is to introduce nonlinearity (a non-linear decision boundary, obtained via non-linear combinations of the weighted inputs) into the neural network so that it can learn more complex functions. It converts the processed input into an output called the activation value. Without it, the neural network could only learn functions that are linear combinations of its input data.
Some examples are the step function, ReLU, tanh, and softmax.
C = ½ (y – y_hat)^2
where C is the cost function, y is the original output, and y_hat is the predicted output.
Gradient descent is an optimization algorithm that minimizes the cost function in order to maximize the performance of the model. It aims to find a local or global minimum of the function. If your cost is a function of K variables, the gradient is the length-K vector that points in the direction in which the cost increases most rapidly, so you follow the negative of the gradient until you reach a point where the cost is at a minimum.
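To make this concrete, here is a small sketch of gradient descent on the quadratic cost C = ½ (y – y_hat)^2 for a one-weight model y_hat = w * x. The model, the data, the learning rate and the number of steps are all assumptions chosen to keep the example tiny.

```python
import numpy as np

def cost(w, x, y):
    """Quadratic cost C = 1/2 * (y - y_hat)^2, averaged over the samples."""
    y_hat = w * x
    return 0.5 * np.mean((y - y_hat) ** 2)

def gradient(w, x, y):
    """Derivative of the averaged cost with respect to the weight w."""
    y_hat = w * x
    return -np.mean((y - y_hat) * x)

# Hypothetical data generated with a true weight of 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0        # starting point
alpha = 0.05   # learning rate (assumed value)
history = []
for _ in range(200):
    history.append(cost(w, x, y))
    w -= alpha * gradient(w, x, y)  # step against the gradient
```

With these values the cost shrinks at every step and `w` approaches 3. A learning rate that is too large for the data would make the updates overshoot instead.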
Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. It is a training algorithm used for a multilayer neural network. It moves the error information from the end of the network to all the weights inside the network and thus allows for efficient computation of the gradient. (The technique used to minimize the cost function is called “gradient descent”.)
In feed forward neural networks –
The signal travels in one direction, from input to the output.
There are no feedback connections or loops.
It considers only the current input.
It cannot memorize previous inputs (e.g., CNNs).
In Recurrent neural networks –
The signal can travel in both directions, i.e., the network contains loops through which it passes information back to itself.
It considers the current input along with the previously received inputs to generate the output of a layer.
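The recurrent behaviour can be sketched as a single update rule applied at every time step. The tanh activation, the weight names `W_x` and `W_h`, and the sizes below are assumptions for illustration only.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4  # hypothetical sizes
W_x = rng.standard_normal((hidden_size, input_size)) * 0.1
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b = np.zeros(hidden_size)

sequence = rng.standard_normal((5, input_size))  # 5 time steps
h = np.zeros(hidden_size)
for x_t in sequence:
    # h carries information from all earlier inputs into this step
    h = rnn_step(x_t, h, W_x, W_h, b)
```

A feed-forward layer would correspond to dropping the `W_h @ h_prev` term, so that each output depends on the current input alone.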
This is a simple network with one input and multiple outputs. Example: image captioning, where the picture goes through a CNN model and is then fed to an RNN.
Example: It can be used in sentiment analysis and text mining, where you have a lot of text, such as a customer’s comment, and you need to gauge the chance that the comment is positive or negative.
Translation services such as Google Translate and subtitle generation for movies are examples of many-to-many networks.
The softmax function calculates the probability distribution of an event over ‘n’ different events.
In general terms, this function calculates the probability of each target class over all possible target classes. The calculated probabilities are later used to determine the target class for the given inputs.
So, the softmax function squashes the output of each unit to be between 0 and 1, just like a sigmoid function. But it also divides each output such that the total sum of the outputs is equal to 1 (check it on the figure above), so that the output is equivalent to a categorical probability distribution.
Mathematically, the softmax function is softmax(z)_j = exp(z_j) / (exp(z_1) + ... + exp(z_K)), where z is a vector of the inputs to the output layer (if you have 10 output units, then there are 10 elements in z). Again, j indexes the output units, so j = 1, 2, ..., K.
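The softmax described above can be sketched in NumPy as follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the original description, and the input vector is a made-up example.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)   # stability trick; does not change the result
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])             # hypothetical output-layer inputs
probs = softmax(z)
predicted_class = int(np.argmax(probs))   # index of the most probable class
```

The largest input receives the largest probability, so taking the argmax of `probs` selects the target class.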
ReLU, also known as the rectified linear unit, outputs 0 if the input is less than 0 and the raw input otherwise. That is, if the input is greater than 0, the output is equal to the input.
It results in much faster training for larger networks.
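In code, this rule is a single element-wise maximum. The sample inputs are arbitrary values picked for illustration.

```python
import numpy as np

def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return np.maximum(0, x)

out = relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0]))
```

Negative entries are clipped to zero while positive entries pass through unchanged.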
Hyperparameters cannot be learned from the data; they are set before the training phase, such as –
If the learning rate is too high:
If the learning rate is too low:
Dropout is a regularization technique for reducing overfitting in neural networks by dropping out the neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.
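A minimal sketch of dropout at training time, assuming the widely used "inverted dropout" variant in which the surviving activations are rescaled by the keep probability. The function name, the rescaling and the example sizes are assumptions, not details given in the text.

```python
import numpy as np

def dropout_forward(a, p_drop, rng, training=True):
    """Zero each activation with probability p_drop during training.

    Survivors are divided by the keep probability so the expected
    activation stays the same (inverted dropout, an assumed variant).
    """
    if not training or p_drop == 0.0:
        return a
    keep_prob = 1.0 - p_drop
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))  # hypothetical activations
a_dropped = dropout_forward(a, p_drop=0.5, rng=rng)
```

At inference time (`training=False`) the activations pass through untouched, so no neuron is dropped when making predictions.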
Batch normalization is a technique that normalizes the inputs of a layer over each mini-batch. To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance. As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper.
Batch normalization re-establishes these normalizations for every mini-batch, and changes are back-propagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters. Batch normalization additionally acts as a regularizer, reducing or even eliminating the need for dropout.
It is usually applied after a fully connected or convolutional layer and before the non-linearity. It aims at allowing higher learning rates and reducing the strong dependence on initialization.
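The per-mini-batch normalization can be sketched as below. This covers only the training-time forward pass. The learnable scale `gamma` and shift `beta` come from the standard formulation rather than from the text, and the batch shape is hypothetical.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = z.mean(axis=0)                     # per-feature batch mean
    var = z.var(axis=0)                     # per-feature batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * z_hat + beta

rng = np.random.default_rng(0)
# Hypothetical pre-activations: batch of 32 samples, 4 features,
# deliberately far from zero mean and unit variance.
z = rng.standard_normal((32, 4)) * 5.0 + 10.0
out = batchnorm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` set to ones and `beta` to zeros, each output feature has approximately zero mean and unit variance. The result would then be passed to the non-linearity.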
It has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
Overfitting happens when a model learns the details as well as the noise in the training data to the degree that it adversely impacts the performance of the model on new data.
It is more likely to occur with non-linear models that have more flexibility when learning a target function.
Underfitting is when a model is neither trained well on the training dataset nor able to generalize properly to new data. It usually happens when there is too little data, or unsuitable data, to train the model.
It results in poor performance and accuracy.
Combating overfitting and underfitting:
When you set all the weights to 0, the derivative of the loss function with respect to every ‘w’ in W^l is the same, so all the weights still have identical values in every subsequent iteration. Setting the weights to zero therefore makes the model no better than (in effect, equivalent to) a linear model.
W = np.zeros((layer_size[l], layer_size[l-1]))
Here, weights are assigned randomly by initializing them very close to zero.
This gives the model better accuracy, since every neuron performs a different computation. But it can potentially lead to 2 issues: vanishing gradients and exploding gradients.
The problem of the vanishing gradient was first discovered by Sepp (Joseph) Hochreiter back in 1991.
With a vanishing gradient, the weight updates are minor, which results in slower convergence. This makes the optimization of the loss function slow. In the worst case, it may completely stop the neural network from training further, or leave half of the network untrained.
As a result of setting weights in the network to zero, all the neurons at each layer are producing the same output and the same gradients during backpropagation.
The network can’t learn at all because there is no source of asymmetry between neurons. That is why we need to add randomness to the weight initialization process.
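The lack of asymmetry can be observed directly in a small experiment. The sketch below trains a tiny two-layer sigmoid network from all-zero weights. The architecture, the data, the labels and the learning rate are all hypothetical choices made for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 20))           # 3 features, 20 samples
Y = (X[0:1] + X[1:2] > 0).astype(float)    # hypothetical binary labels, shape (1, 20)

# Every weight and bias starts at zero.
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

for _ in range(50):
    # Forward pass
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # Backward pass (cross-entropy loss with a sigmoid output)
    dZ2 = (A2 - Y) / X.shape[1]
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    # Gradient descent update
    W1 -= 0.5 * dW1
    b1 -= 0.5 * db1
    W2 -= 0.5 * dW2
    b2 -= 0.5 * db2

# All 4 hidden neurons still share exactly the same weights.
rows_identical = np.allclose(W1, W1[0])
```

Because every hidden neuron receives the same gradient at every step, the four rows of `W1` never diverge, and the hidden layer behaves like a single neuron. Replacing the zeros with small random values breaks this symmetry.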
Convolutional Neural Networks are very similar to ordinary Neural Networks. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they also have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer.
The change is that ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
As mentioned above, pooling is a down-sampling operation, typically applied after a convolutional layer, that provides some spatial invariance and reduces the spatial dimensions of the CNN's feature maps.
It creates a pooled feature map by sliding a filter matrix over the input matrix. In particular, max and average pooling are special kinds of pooling where the max and average values are taken, respectively.
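A small sketch of max and average pooling on a single 2-D feature map, assuming a 2x2 window with stride 2. The helper name `pool2d` and the input values are made up for the example.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Slide a size x size window over x and keep the max or mean of each window."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

x = np.array([[1.0, 3.0, 2.0, 4.0],
              [5.0, 6.0, 1.0, 2.0],
              [7.0, 2.0, 9.0, 0.0],
              [1.0, 8.0, 3.0, 4.0]])

max_pooled = pool2d(x, mode="max")    # [[6, 4], [8, 9]]
avg_pooled = pool2d(x, mode="mean")   # [[3.75, 2.25], [4.5, 4.0]]
```

The 4x4 input shrinks to 2x2 in both cases, which is the down-sampling effect described above.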
The fully connected layer operates on the flattened input where each input is connected to all the neurons. If present, the fully connected layers are usually found towards the end of the CNN architecture and can be used to optimize the objectives such as class scores.
Cross-entropy is used to define a loss function in machine learning and optimization. Also called the log loss, it measures the performance of a classification model whose output is a probability value between 0 and 1.
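For the binary case, the log loss can be sketched as follows. The clipping constant `eps` is an added safeguard against taking log(0), and the labels and predictions are invented for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average log loss for predicted probabilities of the positive class."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # keep log() finite
    return -np.mean(y_true * np.log(y_prob)
                    + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.2]))
confident_wrong = binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.2, 0.8]))
```

Predictions that put high probability on the correct class give a small loss, while confident mistakes are penalized heavily.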
It is a regularization technique that stops the training process as soon as the validation loss reaches a plateau or starts to increase.
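The stopping rule can be sketched as a check on the history of validation losses. The `patience` parameter (how many non-improving epochs to tolerate) is an assumption added here, as is the example loss curve.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Hypothetical validation losses: improving, then flat or rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.58]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

Here the best loss is reached at epoch 3, and training stops at epoch 5 after two epochs without improvement. In practice the weights from the best epoch would normally be the ones kept.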
In the case of the exploding gradient, you can:
In case of vanishing gradient:
LSTMs are a special kind of recurrent neural network capable of learning long-term dependencies, i.e., remembering information for a long period of time is their default behavior.
There are 3 steps in the process:
TF-IDF, which stands for term frequency-inverse document frequency, is a scoring measure widely used in text summarization. TF-IDF is intended to reflect how relevant a term is in a given document.
For example: Consider a document containing 100 words wherein the word Rajat appears 3 times.
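The arithmetic for this example can be sketched as below. Only the term count (3) and the document length (100) come from the text. The corpus size, the number of documents containing the term, and the use of a base-10 logarithm are assumptions made to complete the illustration.

```python
import math

def term_frequency(term_count, total_terms):
    """Share of the document's words taken up by the term."""
    return term_count / total_terms

def inverse_document_frequency(num_documents, documents_with_term):
    """Log-scaled rarity of the term across the corpus (base 10 assumed)."""
    return math.log10(num_documents / documents_with_term)

tf = term_frequency(3, 100)  # from the example: 3 occurrences in 100 words

# Hypothetical corpus: 10 million documents, 1,000 of which contain the term.
idf = inverse_document_frequency(10_000_000, 1_000)

tf_idf = tf * idf
```

Under these assumed corpus numbers, the term frequency is 0.03, the inverse document frequency is 4, and the TF-IDF score is their product, 0.12.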
A higher keep probability (that is, a lower dropout rate) means that more neurons are active, so there is less regularization.
Model capacity is the ability to approximate any given function. The higher the model capacity, the larger the amount of information that can be stored in the network.
VGG is a convolutional neural network architecture named after the Visual Geometry Group from Oxford, who developed it.
For code snippets to extract features with VGG: https://keras.io/applications/
This line is used to convey the fact that we wish to tune the value of the keep probability of Dropout and find the best fit among the range of real numbers between 0 and 1.
Data compression is a big topic that’s used in computer vision, computer networks, computer architecture, and many other fields. The point of data compression is to convert our input into a smaller representation from which we can recreate it to some degree of quality. This smaller representation is what would be passed around, and, when anyone needs the original, they reconstruct it from the smaller representation.
Autoencoders are unsupervised neural networks that use machine learning to do this compression for us. The aim of an autoencoder is to learn a compressed, distributed representation for the given data, typically for the purpose of dimensionality reduction.
Steps include –
It works by compressing the input to a latent-space representation and then reconstructing the output from the representation.
The * operator indicates element-wise multiplication. Element-wise multiplication requires the two matrices to have the same (or broadcast-compatible) dimensions, so this is going to be an error.
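A short illustration of this behaviour in NumPy. The shapes (4, 3) and (3, 2) are hypothetical, chosen only because they cannot be broadcast against each other.

```python
import numpy as np

a = np.random.randn(4, 3)  # hypothetical shapes
b = np.random.randn(3, 2)

try:
    c = a * b              # element-wise: shapes (4, 3) and (3, 2) do not match
    elementwise_failed = False
except ValueError:
    elementwise_failed = True

# The matrix product is a different operation and is valid here,
# because the inner dimensions (3 and 3) agree.
matrix_product = a @ b
```

NumPy raises a `ValueError` for the element-wise product, whereas `a @ b` succeeds and has shape (4, 2).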
C = a + b.T
# layer_dims lists the number of units in each layer, input layer first
for i in range(1, len(layer_dims)):
    parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i - 1]) * 0.01
    parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01
Bidirectional RNNs are based on the idea that the output at a given time may depend not only on the previous elements in the sequence but also on future elements. For example:
To predict a missing word in a sequence, you want to look at both the left and the right context. Bidirectional RNNs are quite simple. They are just two RNNs stacked on top of each other. The output is computed based on the hidden states of both RNNs.
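A rough sketch of the idea: one RNN reads the sequence left to right, a second reads it right to left, and their hidden states are combined at each position. The tanh cell, the weight shapes, and combining by concatenation are assumptions for this illustration.

```python
import numpy as np

def run_rnn(sequence, W_x, W_h):
    """Run a simple tanh RNN and return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in sequence:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 3, 4, 5  # hypothetical sizes
sequence = rng.standard_normal((steps, input_size))

# Two independent sets of weights, one per direction.
fwd_W_x = rng.standard_normal((hidden_size, input_size)) * 0.1
fwd_W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1
bwd_W_x = rng.standard_normal((hidden_size, input_size)) * 0.1
bwd_W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1

forward_states = run_rnn(sequence, fwd_W_x, fwd_W_h)
# Read the sequence in reverse, then flip back so positions line up.
backward_states = run_rnn(sequence[::-1], bwd_W_x, bwd_W_h)[::-1]

# At each position: left context (forward) plus right context (backward).
combined = np.concatenate([forward_states, backward_states], axis=1)
```

Each row of `combined` summarizes everything before and after that position, which is what a missing-word prediction would be based on.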
A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.
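The shrinking effect is visible in the update rule itself. Below is a minimal sketch, assuming plain gradient descent with an L2 penalty of strength `lam`. The function name and all values are hypothetical.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=0.01):
    """Gradient step where the L2 penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

w = np.array([1.0, -2.0, 0.5])
# With a zero data gradient, only the decay term acts on the weights.
w_next = sgd_step_with_weight_decay(w, grad=np.zeros_like(w))
```

Even with no data gradient, every weight is multiplied by (1 - lr * lam) = 0.999, so its magnitude shrinks slightly on each iteration.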
LSTMs have a chain like structure similar to standard RNNs. But instead of having a single neural network layer, there are four, interacting in a very special way.
In the below figure,
Thus, every block has three inputs (xt, ht-1, and ct-1) and two outputs (ht and ct).
Each line denotes vector transfer.
Lines merging denote concatenation.
A line forking shows that the information is copied and transferred further in 2 different directions.
Each circle points to a pointwise operation, like vector addition.
Each yellow box is a learned neural network layer.
Understanding Step-by-Step: Idea behind the LSTMs
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.
An LSTM has three of these gates, to protect and control the cell state.
Here 1 represents “completely keep this”, while 0 represents “completely get rid of this”.
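One LSTM step, with its three inputs (xt, ht-1, ct-1) and two outputs (ht, ct), can be sketched as follows. The equations follow the standard LSTM formulation. Packing the four learned layers into a single weight matrix `W`, along with the sizes and random values, is an implementation choice assumed for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, hidden + input)."""
    hidden = h_prev.shape[0]
    concat = np.concatenate([h_prev, x_t])   # merge previous state and input
    gates = W @ concat + b                   # four learned layers at once
    f = sigmoid(gates[0 * hidden:1 * hidden])        # forget gate: 0 = drop, 1 = keep
    i = sigmoid(gates[1 * hidden:2 * hidden])        # input gate
    c_tilde = np.tanh(gates[2 * hidden:3 * hidden])  # candidate cell values
    o = sigmoid(gates[3 * hidden:4 * hidden])        # output gate
    c_t = f * c_prev + i * c_tilde           # remove old and add new information
    h_t = o * np.tanh(c_t)                   # expose a filtered view of the cell
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5  # hypothetical sizes
W = rng.standard_normal((4 * hidden_size, hidden_size + input_size)) * 0.1
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((6, input_size)):  # 6 time steps
    h, c = lstm_step(x_t, h, c, W, b)
```

The three sigmoid outputs `f`, `i` and `o` are the gates. Each lies between 0 and 1 and is applied by pointwise multiplication.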
For further clarification, please refer: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Here are a few ideas to keep in mind when manually optimizing the hyperparameters:
A Boltzmann machine is a network of symmetrically coupled stochastic binary units. In its restricted form, it is a shallow 2-layer neural net that makes stochastic decisions about whether a neuron should be on or off, where the 1st layer is the visible layer and the 2nd layer is the hidden layer.
In this restricted form, nodes are connected to each other across the layers, but no two nodes of the same layer are connected. Hence it is known as a Restricted Boltzmann Machine.
Fig: Boltzmann machine