- 4.6 Rating
- 61 Question(s)
- 70 Mins of Read
- 6354 Reader(s)

Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.

Deep learning is the study of large neural networks. It is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example. It is a key technology behind driverless cars and is also deployed in medical research to detect cancer cells, where it can achieve state-of-the-art accuracy, sometimes exceeding human-level performance.

The term “deep” usually refers to the number of hidden layers in the neural network. These models are trained on large sets of labeled data using neural network architectures that learn features directly from the data, without the need for manual feature extraction. Examples of deep neural network architectures include MLPs (multilayer perceptrons) and CNNs (convolutional neural networks).

- Google and Facebook translate text into hundreds of languages at a time. This is done through deep learning models applied to NLP tasks and is a major success story.
- Conversational agents like Siri, Alexa, and Cortana rely on speech recognition techniques built on LSTMs and RNNs.
- Deep learning is used in impactful computer vision applications such as OCR (Optical Character Recognition) and real-time language translation.
- Multimedia sharing apps like Snapchat and Instagram apply facial feature detection, another application of deep learning.
- Deep learning is used in the healthcare domain to locate malignant cells and other foreign bodies in order to detect complex diseases.

- sales forecasting
- industrial process control
- customer research
- data validation
- risk management
- target marketing

A perceptron (a single-neuron model) is always feedforward; that is, all the arrows go in the direction of the output. In addition, it is assumed that in a perceptron all the arrows go from layer ‘i’ to layer ‘i+1’, and it is also usual, at least to start with, that all the arcs from layer ‘i’ to ‘i+1’ are present.

Finally, having multiple layers means more than two layers; that is, you have hidden layers. A multilayer perceptron consists of 3 layers of nodes: input, hidden, and output.

A perceptron is a network with two layers, one input and one output whereas a multilayered network means that you have at least one hidden layer (we call all the layers between the input and output layers hidden).

(where y_hat is the final class label that we return as the prediction based on the input x, ‘a’ is the activated neurons and ‘w’ are the weight coefficients.)

The input layer consists of the training observations that are fed through the neurons.

It computes a linear function (z = Wx+b) followed by an activation function.

The sigmoid function is a logistic function bounded by 0 and 1. It is a continuous, differentiable, non-linear activation function, which helps performance by ensuring that small changes in the weights and bias cause small changes in the neuron's output.

where Alpha is the slope parameter of the above function.
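The sigmoid described above is a one-liner in NumPy (a minimal sketch, not tied to any particular framework):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Bounded by 0 and 1; smooth, so small input changes give small output changes.
print(sigmoid(0.0))    # 0.5
```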

- It can be trained as a supervised learning problem
- It is applicable when the input/output is a sequence. (i.e. a sequence of words)

As a result, the weights are then pushed towards becoming smaller (closer to 0)

Normalizing the input x makes the cost function faster to optimize.

- Try using the Adam optimizer
- Try a better random initialization for the weights
- Try tuning the learning rate ‘alpha’
- Avoid initializing all the weights to 0, since that prevents the network from learning

We use epsilon during normalization to avoid the division by zero.

The goal of an activation function is to introduce non-linearity (a non-linear decision boundary) via non-linear combinations of the weighted inputs, so that the neural network can learn more complex functions; it converts the processed input into an output called the activation value. Without it, the neural network would only be able to learn functions that are linear combinations of its input data.

Some of the examples are – Step Function, ReLU, Tanh, Softmax.

- It is a measure of how good your model's performance is, also referred to as ‘Loss’ or ‘Error’.
- It is used to compute the error of the output layer during backpropagation.
- Mean squared error and sum of squared errors are examples of popular cost functions.

C = ½ (y – y_hat)^2

where C = cost function, y = original output, and y_hat = predicted output.
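The cost above can be computed directly; this sketch averages it over a small batch of hypothetical predictions:

```python
import numpy as np

def cost(y, y_hat):
    """Squared-error cost C = 1/2 * (y - y_hat)^2, averaged over the samples."""
    return 0.5 * np.mean((y - y_hat) ** 2)

y = np.array([1.0, 0.0, 1.0])        # original outputs (illustrative)
y_hat = np.array([0.9, 0.2, 0.8])    # predicted outputs (illustrative)
print(cost(y, y_hat))                # 0.015
```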

- Forward propagation of the training data through the network in order to generate the output.
- Compute the derivative of the error with respect to the output activations, using the target value and the output value.
- Then, we backpropagate to compute the derivative of the error with respect to the output activations in the previous layer.
- We continue the same for all the hidden layers.
- Use the previously calculated derivatives for output and all the hidden layers, to calculate the derivative of the error w.r.t weights.
- Update the weights.
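The steps above can be sketched end to end for a tiny two-layer network. The data, layer sizes, and learning rate here are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, XOR-like binary targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 3 units (sizes are illustrative)
W1, b1 = rng.normal(0, 0.5, (2, 3)), np.zeros(3)
W2, b2 = rng.normal(0, 0.5, (3, 1)), np.zeros(1)
lr = 0.5
losses = []

for _ in range(5000):
    # 1. Forward propagation
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)
    losses.append(0.5 * np.mean((y_hat - y) ** 2))
    # 2-4. Error derivative at the output, backpropagated layer by layer
    d2 = (y_hat - y) * y_hat * (1 - y_hat)     # dC/dz2 for squared error
    d1 = (d2 @ W2.T) * a1 * (1 - a1)           # dC/dz1
    # 5-6. Derivatives w.r.t. the weights, then update
    W2 -= lr * a1.T @ d2; b2 -= lr * d2.sum(0)
    W1 -= lr * X.T @ d1;  b1 -= lr * d1.sum(0)

print(losses[0], losses[-1])   # the loss decreases over training
```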

**In feed-forward neural networks –**

The signal travels in one direction, from input to output.

There are no feedbacks or loops.

It considers only the current input.

It cannot memorize previous inputs. (e.g. CNNs)

**In recurrent neural networks –**

The signal travels in both directions, i.e. they can pass information onto themselves.

It considers the current input along with the previously received inputs to generate the output of a layer.

**One to Many**

This is a simple network with one input and multiple outputs. Example: image captioning, where the picture goes through a CNN model and the result is then fed to an RNN.

**Many to One**

Example: It can be used in sentiment analysis and text mining, where you have a lot of text, such as a customer’s comment, and you need to gauge the chance that this comment is positive or negative.

**Many to Many**

Translation (such as Google Translate) and generating subtitles for movies are examples of many-to-many networks.

**Softmax Function-**

The softmax function calculates the probability distribution of an event over ‘n’ different events.

In other words, this function calculates the probability of each target class over all possible target classes. The calculated probabilities then help determine the target class for the given inputs.

So, the softmax function squashes the output of each unit to be between 0 and 1, just like a sigmoid function, but it also divides each output so that the outputs sum to 1, forming a categorical probability distribution.

Mathematically, softmax(z)_j = e^(z_j) / Σ_{k=1..K} e^(z_k), where z is the vector of inputs to the output layer (if you have 10 output units, there are 10 elements in z) and j indexes the output units, so j = 1, 2, ..., K.
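A minimal NumPy implementation of the softmax described above (subtracting the max is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis: exponentiate, then normalize to sum to 1."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))   # stability trick
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])   # illustrative logits
p = softmax(z)
print(p, p.sum())               # probabilities in (0, 1) summing to 1
```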

**ReLU –**

Also known as rectified linear units. ReLU outputs 0 if the input is less than 0, and the raw input otherwise; that is, if the input is greater than 0, the output is equal to the input.

It results in much faster training for larger networks.
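ReLU is one line in NumPy:

```python
import numpy as np

def relu(z):
    """ReLU: 0 for negative inputs, the raw input otherwise."""
    return np.maximum(0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 3.0])))   # [0. 0. 0. 3.]
```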

Hyperparameters are not learned from the data; they are set before the training phase. Examples include –

**Number of epochs –** In the context of training the model, an epoch refers to one iteration in which the model sees the whole training set to update its weights.

**Batch size –** The number of training examples in one forward/backward pass.

**Learning rate –** The learning rate, often noted ‘alpha’, indicates the pace at which the weights get updated. It can be fixed or adaptively changed. The currently most popular method is ‘Adam’, which adapts the learning rate.

**If the learning rate is too high:**

- It may cause undesirable divergent behavior in the loss function due to drastic weight updates.
- It may fail to converge, or even diverge.

**If the learning rate is too low:**

- This will lead to very slow progress in training of the model as we make tiny updates in weights.
- It’ll take many updates before reaching a minimum point.

Batch normalization is a technique that normalizes the inputs of each layer over every mini-batch. To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance. As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper.

Batch normalization re-establishes these normalizations for every mini-batch and changes are back-propagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters. Batch normalization additionally acts as a regularizer, reducing/ even eliminating the need for Dropout.

*It is usually done after a fully connected/convolutional layer and before a non-linearity layer which aims at allowing higher learning rates and reducing the strong dependence on initialization.*
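A sketch of the batch-normalization forward pass for one mini-batch (training-time statistics only; gamma and beta are the learned scale and shift, and the running-average logic used at inference is omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch to zero mean / unit variance per feature,
    then scale and shift with the learned parameters gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * x_hat + beta

# Two features on very different scales (illustrative values)
x = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
out = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0))   # approximately [0, 0] per feature
```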

**Batch gradient descent –**Vanilla gradient descent, aka batch gradient descent, calculates the gradient over the whole dataset and performs just one update at each iteration. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

**Stochastic gradient descent –**It uses only a single training example at a time to calculate the gradient and update the parameters. It is fast and performs frequent updates with high variance.

It has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
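The two variants can be contrasted on a toy linear-regression problem (the data, slope, and learning rate here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # toy data, true slope = 3
lr = 0.1

# Batch gradient descent: one update per pass over the whole dataset
w = 0.0
for _ in range(100):
    grad = np.mean((w * x - y) * x)   # gradient of 1/2 * mean((wx - y)^2)
    w -= lr * grad

# Stochastic gradient descent: one (noisy) update per training example
w_sgd = 0.0
for _ in range(5):
    for i in rng.permutation(len(y)):
        grad = (w_sgd * x[i] - y[i]) * x[i]
        w_sgd -= lr * grad

print(w, w_sgd)   # both end up close to the true slope 3
```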

*Overfitting* happens when a model learns the details as well as the noise in the training data to the degree that it adversely impacts the model's performance on new data.

It is more likely to occur with non-linear models that have more flexibility when learning a target function.

*Underfitting* is when a model is neither trained well on the training dataset nor able to generalize to new data. It usually happens when there is too little or improper data to train the model.

This results in poor performance and accuracy.

Combating overfitting and underfitting:

- K-fold cross-validation - Resample the data to estimate the accuracy.
- Having a validation dataset to evaluate the model.
- Regularization techniques such as Weight regularization ( Lasso, ridge), Dropout regularization, early stopping (It is when you stop the training process as soon as the validation loss reaches a plateau or starts to increase)

- Initializing all the weights to zero –

When you set all the weights to 0, the derivative of the loss function w.r.t. every ‘w’ in W^l is the same, so all the weights have the same values in subsequent iterations. Setting them to zero therefore makes the model no better than a linear model.

W = np.zeros((layer_size[l], layer_size[l-1]))

- Initializing weights Randomly –

Here, weights are assigned randomly, initialized very close to zero.

W = np.random.randn(layer_size[l], layer_size[l-1]) * 0.01

It gives better accuracy to the model since every neuron performs different computations. But this can potentially lead to 2 issues – Vanishing gradient and Exploding gradient.

The problem of the vanishing gradient was first discovered by Sepp (Joseph) Hochreiter back in 1991.

- We know that information travels through time in RNNs, meaning that information from previous time points is used as input for the next time points. Secondly, you can calculate the cost function, or your error, at each time point.
- Basically, during the training, your cost function compares your outcomes to your desired output.
- You’ve calculated the cost function, and now you want to propagate your cost function back through the network because you need to update the weights, meaning that you’ve to propagate all the way back through the time to these neurons.
- This is where the problem lies, for any activation function, abs(dW) will get smaller and smaller as we go back with every layer during back propagation.

*The weight updates become minor, resulting in slower convergence and making optimization of the loss function slow. In the worst case, this may completely stop the neural network from training further, or leave half of the network untrained.*

The network can’t learn at all because there is no source of asymmetry between neurons. That is why we need to add randomness to the weight initialization process.

Convolutional Neural Networks are very similar to ordinary Neural Networks. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they also have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer.

The change is that ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.

**Convolutional Layer –** A layer that performs the convolution operation on two functions to produce a third function, which expresses how the shape of one is modified by the other.

**ReLU Layer –** It brings non-linearity to the network and converts all negative pixels to zero; the output is a rectified feature map.

**Pooling Layer –** A down-sampling operation that reduces the dimensionality of the feature map.

As mentioned above, pooling is a down-sampling operation, typically applied after a convolutional layer, that provides some spatial invariance and reduces the spatial dimensions of the CNN.

It creates a pooled feature map by sliding a filter matrix over the input matrix. In particular, max and average pooling are special kinds of pooling where the max and average values are taken, respectively.
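Max and average pooling with a non-overlapping 2×2 window can be sketched as follows (the feature-map values are illustrative):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    h, w = x.shape
    x = x[: h // size * size, : w // size * size]      # drop any ragged edge
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.array([[1, 3, 2, 1],
                 [4, 2, 0, 1],
                 [5, 1, 2, 2],
                 [0, 1, 3, 4]])
print(pool2d(fmap))                 # max pooling:  [[4 2] [5 4]]
print(pool2d(fmap, mode="mean"))    # average pooling per 2x2 block
```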

**In the case of the exploding gradient, you can:**

- Truncated Backpropagation - Stop back-propagating after a certain point, which is usually not optimal because not all the weights get updated.
- Penalties - Penalize or artificially reduce the gradient
- Gradient clipping - Put a maximum limit on a gradient
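Gradient clipping, the last option above, is just a few lines (max_norm is an arbitrary illustrative threshold):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm; else leave it."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([3.0, 4.0])                 # norm 5, exceeds the limit
print(clip_by_norm(g, max_norm=1.0))     # [0.6 0.8], norm 1
```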

**In case of vanishing gradient:**

- Weight initialization - Initialize the weight so that the potential of the vanishing gradient is minimized
- Have Echo State Networks
- Have Long Short-Term Memory Networks (LSTMs)

LSTMs are a special kind of recurrent neural network capable of learning long-term dependencies; remembering information for long periods of time is their default behavior.

There are 3 steps in the process:

- Decides what to forget and what to remember
- Selectively updates the cell state values
- Decides what part of the current state makes it to the output

- A gated recurrent unit (GRU) is basically an LSTM without an output gate; it has 2 gates – reset and update – and therefore fully writes the contents of its memory cell to the larger network at each time step.
- The GRU unit controls the flow of information like the LSTM unit, but without having to use a separate memory unit. It just exposes the full hidden content without any control.
- GRUs are relatively new, and their performance is on par with LSTMs while being computationally more efficient (a less complex structure, as pointed out), so we are seeing them used more and more.

TF-IDF, which stands for term frequency-inverse document frequency, is a scoring measure widely used in text summarization. TF-IDF is intended to reflect how relevant a term is in a given document.

- The intuition behind it is that if a word occurs multiple times in a document, we should boost its relevance as it should be more meaningful than other words that appear fewer times (TF).
- At the same time, if a word occurs many times in a document but also appears in many other documents, it may be because it is simply a frequent word, not because it is relevant or meaningful (IDF).

For example: Consider a document containing 100 words wherein the word Rajat appears 3 times.

- The term frequency (tf) for ‘Rajat’ is then TF = (3 / 100) = 0.03.
- Now, assume we have 10 million documents and the word ‘Rajat’ appears in 1,000 of these. Then, the inverse document frequency (IDF) is calculated as IDF = log10(10,000,000 / 1,000) = 4.
- Thus, the Tf-idf weight is the product of these quantities TF-IDF = 0.03 * 4 = 0.12.
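The worked example can be reproduced directly (using a base-10 log for IDF, as the example does):

```python
import math

def tf_idf(term_count, doc_len, n_docs, docs_with_term):
    """TF-IDF with a base-10 log for IDF, matching the worked example."""
    tf = term_count / doc_len                    # term frequency
    idf = math.log10(n_docs / docs_with_term)    # inverse document frequency
    return tf * idf

# 'Rajat' appears 3 times in a 100-word document,
# and in 1,000 out of 10 million documents.
print(tf_idf(3, 100, 10_000_000, 1000))   # 0.03 * 4 = 0.12
```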

If the dropout rate is interpreted as the probability of *keeping* a neuron (as with TensorFlow's keep_prob), a higher rate means more neurons stay active, so there is less regularization; if it is the fraction of neurons *dropped* (as in Keras), a higher rate means stronger regularization.

VGG is a convolutional neural network architecture named after the Visual Geometry Group from Oxford, who developed it.

**Built using:**

- Convolution layers (only 3×3 filters)
- Max pooling layers (only 2×2 windows)
- Fully connected layers at the end

**Application:**

- Given an image, it finds the name of the object in the image
- It can classify an image into any one of 1000 categories
- It takes an input image of size 224 × 224 × 3 (RGB image)

For code snippets to extract features with VGG: https://keras.io/applications/

Data compression is a big topic that’s used in computer vision, computer networks, computer architecture, and many other fields. The point of data compression is to convert our input into a smaller representation that we recreate, to a degree of quality. This smaller representation is what would be passed around, and, when anyone needed the original, they would reconstruct it from the smaller representation.

Autoencoders are unsupervised neural networks that use machine learning to do this compression for us. The aim of an autoencoder is to learn a compressed, distributed representation for the given data, typically for the purpose of dimensionality reduction.

Steps include –

- Let’s take a neural network that has 3 layers. (refer to the image attached)
- The network is then trained to reconstruct its inputs; the number of input neurons equals the number of output neurons.
- The network’s target output is the same as its input. It uses dimensionality reduction to restructure the input.

It works by compressing the input to a latent-space representation and then reconstructing the output from the representation.

The `*` operator indicates element-wise multiplication, which requires the two matrices to have the same dimensions; otherwise it results in an error.

C = a + b.T
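Assuming shapes like a: (3, 4) and b: (4, 1) (the original shapes are not given here), broadcasting makes the sum work while element-wise `*` fails:

```python
import numpy as np

a = np.ones((3, 4))    # illustrative shapes, not from the original text
b = np.ones((4, 1))

c = a + b.T            # b.T has shape (1, 4) and broadcasts over a's rows
print(c.shape)         # (3, 4)

try:
    a * b              # element-wise * with incompatible shapes (3,4) vs (4,1)
except ValueError as e:
    print("error:", e)
```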

These are based on the idea that the output at a time may not only depend on the previous elements in the sequence. But also future elements. For example:

To predict a missing word in a sequence, you want to look at both the left and the right context. Bidirectional RNNs are quite simple: they are just two RNNs stacked on top of each other, and the output is computed based on the hidden states of both RNNs.

LSTMs have a chain like structure similar to standard RNNs. But instead of having a single neural network layer, there are four, interacting in a very special way.

In the below figure,

**Ct-1:** the cell state from time step t-1; **Xt:** the input at time step t; **ht:** the output at time step t, which goes both to the output layer and to the hidden layer at the next time step.

Thus, every block has three inputs (xt, ht-1, and ct-1) and two outputs (ht and ct).

In the figure’s notation:

- an arrow denotes vector transfer
- merging lines denote concatenation
- a forking line shows that the information is copied and transferred in 2 different directions
- a circle denotes pointwise operations, like vector addition
- a yellow box denotes a learned neural network layer

Understanding Step-by-Step: Idea behind the LSTMs

- We have a new value xt and the value from the previous node, ht-1.
- In the first step, these values are combined and go through the sigmoid activation function, where it is decided whether the forget valve should be open, closed, or open to some extent.
- In the next step, the same values (actual vectors of values) go in parallel through a tanh layer operation, where we decide what value we are going to pass to the memory pipeline, and through a sigmoid layer operation, where it is decided whether that value will be passed to the memory pipeline and to what extent.
- Then, we have memory flowing through the top pipeline. If the forget valve is open and the memory valve is closed, the memory will not change; if the forget valve is closed and the memory valve is open, the memory will be updated completely.
- Finally, xt and ht-1 are combined to decide what part of the memory pipeline is going to become the output of this module.

The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

An LSTM has three of these gates, to protect and control the cell state.

- The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at ht−1 and xt, and outputs a number between 0 and 1 for each number in the cell state Ct−1.

Here 1 represents “completely keep this”, while 0 represents “completely get rid of this”.

- The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the input gate layer decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, ~Ct, that could be added to the state.
- Finally, we need to decide what we’re going to output. This output layer will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
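The three gate steps above can be condensed into a single time-step function. This is a sketch with one assumed weight layout (the four gate pre-activations stacked in a single matrix); other conventions exist:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to the four stacked gate
    pre-activations: forget, input, candidate, output."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:n])                # forget gate
    i = sigmoid(z[n:2 * n])           # input gate
    c_tilde = np.tanh(z[2 * n:3 * n]) # candidate values ~Ct
    o = sigmoid(z[3 * n:])            # output gate
    c_t = f * c_prev + i * c_tilde    # update the cell state
    h_t = o * np.tanh(c_t)            # filtered output
    return h_t, c_t

# Illustrative sizes and random weights, just to run one step
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
W = rng.normal(0, 0.1, (4 * n_h, n_h + n_x))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_x), h, c, W, b)
print(h.shape, c.shape)
```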

For further clarification, please refer: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Here are a few ideas to keep in mind when manually optimizing the hyperparameters:

- Using regularization methods that include l1, l2, dropout, and others help avoid overfitting.
- More data is always better
- Train over multiple epochs
- Early stopping - evaluate the test set performance at each epoch to know when to stop
- Learning rate is the single most important parameter
- For LSTMs use soft sign over tanh as it is faster and less prone to saturation (~0 gradients)
- Optimizers such as RMSprop, AdaGrad or momentum are usually good choices
- Finally, remember data normalization, an MSE loss function + identity activation function for regression, and Xavier weight initialization

A Boltzmann machine is a network of symmetrically coupled stochastic binary units. In other words, it is a shallow 2-layer neural net that makes stochastic decisions about whether a neuron should be on or off, where the 1st layer is the visible layer and the 2nd layer is the hidden layer.

When the nodes are connected to each other across the layers but no two nodes of the same layer are connected, the network is known as a Restricted Boltzmann Machine (a general Boltzmann machine also allows connections within a layer).

Fig: Boltzmann machine

Deep learning is large neural networks. It is a machine learning technique that teaches the computers to do what comes naturally to humans, learn by example. It is a key technology behind driverless cars, also deployed in medical research to detect the cancer cells which may as well achieve state of the art accuracy sometimes exceeding the human level performance.

The term “deep” usually refers to the number of hidden layers in the neural network. These models are trained by using large sets of labeled data and neural network architectures that learn features directly from the data without the need for manual feature extraction. Some of the deep neural networks are MLP, CNNs

- Google and Facebook are translating text into hundreds of languages at a time. This is being done through some deep learning models being applied to NLP tasks and is a major success story.
- Conversational agents like Siri, Alexa, Cortana basically work on simplifying the speech recognition techniques through LSTMs and RNNs.
- Deep learning is being used in impactful computer vision applications such as OCR (Optical Character Recognition) and real-time language translation
- Multimedia sharing apps like Snapchat and Instagram apply facial feature detection which is another application of deep learning.
- Deep Learning is being used in the Healthcare domain to locate malignant cells and other foreign bodies in order to detect complex diseases.

- sales forecasting
- industrial process control
- customer research
- data validation
- risk management
- target marketing

A perceptron (a single neuron model) is always feedforward, that is, all the arrows are going in the direction of the output. In addition, it is assumed that in a perceptron, all the arrows are going from layer ‘i’ to layer ‘i+1’, and it is also usual (to start with having) that all the arcs from layer ‘i’ to ‘i+1’ are present.

Finally, having multiple layers means more than two layers, that is, you have hidden layers. It consists of 3 layers of nodes: a. Input, b. Hidden c. and output.

A perceptron is a network with two layers, one input and one output whereas a multilayered network means that you have at least one hidden layer (we call all the layers between the input and output layers hidden).

(where y_hat is the final class label that we return as the prediction based on the input x, ‘a’ is the activated neurons and ‘w’ are the weight coefficients.)

Input layers are the training observations that are fed through the neurons.

It computes a linear function (z = Wx+b) followed by an activation function.

The sigmoid function is a logistic function bounded by 0 and. This is an activation function that is continuous and differentiable. It is nonlinear in nature that helps to increase the performance making sure that small changes in the weights and bias causes small changes in the output of the neuron.

where Alpha is the slope parameter of the above function.

- It can be trained as a supervised learning problem
- It is applicable when the input/output is a sequence. (i.e. a sequence of words)

As a result, the weights are then pushed towards becoming smaller (closer to 0)

Normalizing the input x makes the count function faster to optimize.

- We should try using adam optimizer
- Try for better random initialization for the weights
- Try tuning the learning rate ‘alpha’
- Must as well try to initialize the weight to 0.

We use epsilon during normalization to avoid the division by zero.

The goal of an activation function is to introduce nonlinearity/a non-linear decision boundary via non-linear combinations of the weighted inputs into the neural network so that it can learn more complex function i.e. converts the processed input into an output called the activation value. Without it, the neural network would be only able to learn function which is a linear combination of its input data.

Some of the examples are – Step Function, ReLU, Tanh, Softmax.

- It is a measure to evaluate how good your model's performance is. Also referred to as ‘ Loss/Error’
- It is used to compute the error of the output layer during back propagation.
- Mean squared error or Sum of squared errors is an example of the popular cost function.

C = ½ (y – y_hat)^2

Where c = cost function, y = Original output & y_hat = Predicted output.

- Forward propagation of the training data through the network in order to generate the output.
- Compute the error derivative using the target value and the output value with respect to output activations.
- Then, we backpropagate to compute the derivative of the error with respect to the output activations in the previous layer.
- We continue the same for all the hidden layers.
- Use the previously calculated derivatives for output and all the hidden layers, to calculate the derivative of the error w.r.t weights.
- Update the weights.

**In feed forward neural networks – **

The signal travels in one direction, from input to the output.

There are No feedbacks or loops

It considers only the current input

It cannot memorize the previous inputs. (eg. CNNs)

**In Recurrent neural networks – **

The signal travels in both the directions i.e. they can pass the information onto themselves.

It considers the current input along with the previously received inputs to generate the output of a layer.

**One to Many**

This is the simple network with one input and multiple outputs. Example: It helps you caption an image, where the picture goes through the CNN model and then fed to the RNN.

**Many to One**

Example: It can be used in sentiment analysis and text mining, where you have a lot of text such as a customer’s comment and you need to gauge that what’s the chance that this comment is positive or negative.

**Many to Many**

Translations such as Google translator and generating subtitles for the movies is an example of many to many types of network.

**Softmax Function-**

Softmax function calculates the probabilities distribution of the event over ‘n’ different events.

In general way of saying, this function will calculate the probabilities of each target class over all possible target classes. Later the calculated probabilities will be helpful for determining the target class for the given inputs.

So, softmax function squashes the outputs of each unit to be between 0 and 1, just like a sigmoid function. But it also divides each output such that the total sum of the outputs is equal to 1 (check it on the figure above) where an output is equivalent to a categorical probability distribution.

Mathematically the softmax function is shown to the right, where z is a vector of the inputs to the output layer (if you have 10 output units, then there are 10 elements in z). And again, j indexes the output units, so j = 1, 2, ..., K.

**ReLU –**

Also, known as Rectified linear units. It reproduces output 0 if the input is less than 0, and raw output otherwise. That is, if the input is greater than 0, the output is equal to the input.

It results in much faster training for larger networks.

Hyperparams can’t learn from the data, they are set before the training phase such as –

**A number of epochs –**In the context of training the model, epoch is a term used to refer to one iteration where model sees the whole training set to update its-weights.

**Batch Size –**refers to the number of training examples in one forward/backward pass.

**Learning rate –**The learning rate, often noted as ‘alpha’, indicates at which pace the weight gets updated. It can be fixed or adaptively changed. The current most popular method is called ‘Adam’, a method that adapts the learning rate.

**If the learning rate is too high:**

- The loss function may show undesirable divergent behavior due to drastic updates in the weights.
- The model may fail to converge, or even diverge.

**If the learning rate is too low:**

- Training progresses very slowly, since we make only tiny updates to the weights.
- It takes many updates before reaching a minimum point.

Batch normalization is a technique that normalizes the activations of each mini-batch. To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance. As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper.

Batch normalization re-establishes these normalizations for every mini-batch, and changes are back-propagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to parameter initialization. Batch normalization additionally acts as a regularizer, reducing (and sometimes even eliminating) the need for Dropout.

*It is usually done after a fully connected/convolutional layer and before a non-linearity layer which aims at allowing higher learning rates and reducing the strong dependence on initialization.*
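The per-mini-batch normalization step can be sketched in NumPy as follows (a simplification: the learned gamma/beta parameters are fixed here, and the running statistics used at inference time are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1.0, 50.0],
                  [2.0, 60.0],
                  [3.0, 70.0]])
normed = batch_norm(batch)
# Each column now has (approximately) zero mean and unit variance.
```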

**Batch Gradient descent –**Vanilla gradient descent, aka batch gradient descent, calculates the gradient over the whole dataset and performs just one update at each iteration. Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

**Stochastic gradient descent –**It uses only a single training example at a time to calculate the gradient and update the parameters. It is fast and performs frequent updates with high variance.

It has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behavior as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
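The difference between the two update schedules can be sketched on a toy one-parameter regression problem (synthetic data; the learning rates here are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Batch gradient descent: one update per pass over the WHOLE dataset.
w, lr = 0.0, 0.1
for _ in range(100):
    grad = np.mean(2 * (w * X[:, 0] - y) * X[:, 0])
    w -= lr * grad

# Stochastic gradient descent: one update per single training example.
w_sgd = 0.0
for _ in range(5):                       # a few passes over shuffled data
    for i in rng.permutation(len(y)):
        grad_i = 2 * (w_sgd * X[i, 0] - y[i]) * X[i, 0]
        w_sgd -= 0.01 * grad_i

# Both estimates should land near the true slope of 3.0.
```

Batch gradient descent takes smooth, expensive steps; SGD takes cheap, noisy ones, which is why its updates have high variance.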

*Overfitting* happens when a model learns the details as well as the noise in the training data to the degree that it adversely impacts the model's performance on new data.

It is more likely to occur with non-linear models that have more flexibility when learning a target function.

*Underfitting* happens when a model is neither trained well on the training dataset nor able to generalize to new data. It usually occurs when there is too little, or poor-quality, data to train the model.

It results in poor performance and accuracy.

Combating overfitting and underfitting:

- K-fold cross-validation - Resample the data to estimate the accuracy.
- Having a validation dataset to evaluate the model.
- Regularization techniques such as weight regularization (Lasso, Ridge), Dropout regularization, and early stopping (stopping the training process as soon as the validation loss reaches a plateau or starts to increase)
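Early stopping, from the list above, is easy to sketch as a patience loop over validation losses (the losses here are made up to show the typical overfitting curve):

```python
# Validation losses that improve, plateau, then rise - a typical overfitting curve.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.44, 0.45, 0.47, 0.50, 0.55]

best_loss, best_epoch, patience, bad_epochs = float("inf"), 0, 2, 0
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch, bad_epochs = loss, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # stop once validation stops improving
            break
# Training stops shortly after epoch 5, where validation loss was lowest.
```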

- Initializing all the weights to zero –

When you set all the weights to 0, the derivative with respect to the loss function is the same for every ‘w’ in W^l, so all the weights take the same values in every subsequent iteration. Setting them all to zero therefore makes the model no better than a linear model.

W = np.zeros((layer_size[l], layer_size[l-1]))

- Initializing weights Randomly –

Here, weights are assigned randomly by initializing them very close to zero.

W = np.random.randn(layer_size[l], layer_size[l-1]) * 0.01

It gives better accuracy to the model since every neuron performs different computations. But this can potentially lead to 2 issues – Vanishing gradient and Exploding gradient.

The problem of the vanishing gradient was first discovered by Sepp (Josef) Hochreiter back in 1991.

- As we know, information travels through time in RNNs, which means that information from previous time points is used as input for the next time points. Secondly, you can calculate the cost function, or your error, at each time point.
- Basically, during the training, your cost function compares your outcomes to your desired output.
- You’ve calculated the cost function, and now you want to propagate it back through the network because you need to update the weights, meaning that you have to propagate all the way back through time to these neurons.
- This is where the problem lies: for saturating activation functions such as sigmoid and tanh, abs(dW) gets smaller and smaller as we go back through the layers during backpropagation.

*The weight updates become minor, which results in slower convergence and makes the optimization of the loss function slow. In the worst case, this may completely stop the neural network from training further, or leave the earlier layers of the network effectively untrained.*

The network can’t learn at all because there is no source of asymmetry between neurons. That is why we need to add randomness to the weight initialization process.

Convolutional Neural Networks are very similar to ordinary Neural Networks. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they also have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer.

The change is that ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.

**Convolutional Layer –**A layer that performs the convolution operation on 2 functions to produce a third function that expresses how the shape of one is modified by the other.

**ReLU Layer –**It brings non-linearity to the network and converts all the negative pixels to zero, producing a rectified feature map as output.

**Pooling layer –**A down-sampling operation that reduces the dimensionality of the feature map.

As mentioned above, it is a down-sampling operation that is typically applied after a convolutional layer; it provides a degree of spatial invariance and reduces the spatial dimensions of the CNN.

It creates a pooled feature map by sliding a filter matrix over the input matrix. In particular, max and average pooling are special kinds of pooling where the maximum and average values are taken, respectively.
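Max pooling on a small feature map can be sketched in NumPy (a 2×2 window with stride 2, the most common configuration):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over a 2-D feature map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop odd edge rows/cols
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 0],
                 [7, 2, 9, 8],
                 [0, 1, 3, 4]])
pooled = max_pool_2x2(fmap)
# -> [[6 4]
#     [7 9]]
```

Each output cell is the maximum of one non-overlapping 2×2 block, halving each spatial dimension.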

**In the case of the exploding gradient, you can:**

- Truncated Backpropagation - Stop back-propagating after a certain point, which is usually not optimal because not all the weights get updated.
- Penalties - Penalize or artificially reduce the gradient
- Gradient clipping - Put a maximum limit on a gradient

**In case of vanishing gradient:**

- Weight initialization - Initialize the weight so that the potential of the vanishing gradient is minimized
- Have Echo State Networks
- Have Long Short-Term Memory Networks (LSTMs)
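Gradient clipping, listed above as an exploding-gradient remedy, can be sketched as rescaling the gradient by its L2 norm (the threshold of 5.0 here is an arbitrary illustration):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploding = np.array([30.0, 40.0])      # L2 norm 50
clipped = clip_by_norm(exploding)       # L2 norm 5, same direction
```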

LSTMs are a special kind of Recurrent Neural Network capable of learning long-term dependencies, i.e. remembering information for a long period of time is their default behavior.

There are 3 steps in the process:

- Decides what to forget and what to remember
- Selectively updates the cell state values
- Decides what part of the current state makes it to the output

- A gated recurrent unit (GRU) is basically an LSTM without an output gate; it has 2 gates – a reset gate and an update gate – and therefore fully writes the contents of its memory cell to the larger network at each time step.
- The GRU unit controls the flow of information like the LSTM unit, but without having to use a memory unit. It just exposes the full hidden content without any control.
- GRU is relatively new, and from my perspective, its performance is on par with LSTM but computationally more efficient (it has a less complex structure, as pointed out). So we are seeing it being used more and more.

TF-IDF, which stands for term frequency-inverse document frequency, is a scoring measure widely used in text summarization. TF-IDF is intended to reflect how relevant a term is in a given document.

- The intuition behind it is that if a word occurs multiple times in a document, we should boost its relevance as it should be more meaningful than other words that appear fewer times (TF).
- At the same time, if a word occurs many times in a document but also appears in many other documents, it may be because the word is simply frequent, not because it is relevant or meaningful (IDF).

For example: Consider a document containing 100 words wherein the word Rajat appears 3 times.

- The term frequency (tf) for ‘Rajat’ is then TF = (3 / 100) = 0.03.
- Now, assume we have 10 million documents and the word ‘Rajat’ appears in 1,000 of these. Then, the inverse document frequency (idf) is calculated as IDF = log10(10,000,000 / 1,000) = 4.
- Thus, the Tf-idf weight is the product of these quantities TF-IDF = 0.03 * 4 = 0.12.
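The worked example above translates directly into code (assuming, as the numbers imply, a base-10 logarithm):

```python
import math

def tf_idf(term_count, doc_len, num_docs, docs_with_term):
    """TF-IDF score for a term in one document, using a base-10 IDF."""
    tf = term_count / doc_len
    idf = math.log10(num_docs / docs_with_term)
    return tf * idf

score = tf_idf(term_count=3, doc_len=100,
               num_docs=10_000_000, docs_with_term=1_000)
# -> 0.03 * 4 = 0.12, matching the worked example
```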

A higher dropout rate means that more neurons are dropped (deactivated) during training, so there is more regularization; a lower rate keeps more neurons active and regularizes less.
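The standard "inverted dropout" trick can be sketched in NumPy; here `rate` is the probability of dropping a unit, which is the convention used by most modern frameworks:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, rate=0.5):
    """Inverted dropout: zero out a fraction `rate` of units at training
    time, scaling the survivors so the expected activation is unchanged."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones(1000)
dropped = dropout(a, rate=0.5)
# Roughly half the units are zeroed; the survivors are scaled up to 2.0.
```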

VGG is a convolutional neural network architecture named after the Visual Geometry Group from Oxford, who developed it.

**Built using:**

- Convolutional layers (using only 3×3 filters)
- Max pooling layers (using only 2×2 windows)
- Fully connected layers at the end

**Application:**

- Given an image, it finds the name of the object in the image
- It can classify an image into any one of 1,000 object categories
- It takes input image of size 224 * 224 * 3 (RGB image)

For code snippets to extract features with VGG: https://keras.io/applications/

Data compression is a big topic that’s used in computer vision, computer networks, computer architecture, and many other fields. The point of data compression is to convert our input into a smaller representation that we recreate, to a degree of quality. This smaller representation is what would be passed around, and, when anyone needed the original, they would reconstruct it from the smaller representation.

Autoencoders are unsupervised neural networks that use machine learning to do this compression for us. The aim of an autoencoder is to learn a compressed, distributed representation for the given data, typically for the purpose of dimensionality reduction.

Steps include –

- Let’s take a neural network that has 3 layers. (refer to the image attached)
- The network is then trained to reconstruct its inputs. Here, input neurons are equal to the output neurons.
- And, the network’s target output is same as the input. It uses the dimensionality reduction to restructure the input.

It works by compressing the input to a latent-space representation and then reconstructing the output from the representation.
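A deliberately tiny linear autoencoder in plain NumPy illustrates the compress-then-reconstruct idea (a sketch: real autoencoders add non-linearities and use a framework's autodiff; the data here is synthetic and chosen to be exactly 2-dimensional):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4-dimensional inputs that really live on a 2-D subspace.
latent = rng.normal(size=(200, 2))
mix = np.array([[1.0, 0.0, 1.0, 0.0],
                [0.0, 1.0, 0.0, 1.0]])
X = latent @ mix

# Linear autoencoder: 4 -> 2 (encoder) -> 4 (decoder).
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))
lr = 0.05

for _ in range(2000):
    Z = X @ W_enc                       # compressed latent representation
    X_hat = Z @ W_dec                   # reconstruction of the input
    err = X_hat - X
    # Gradients of the mean squared reconstruction error.
    grad_dec = (Z.T @ err) / len(X)
    grad_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
# mse approaches 0: a 2-unit bottleneck is enough for 2-D data.
```

Because the target output equals the input, the network is forced to learn a compact latent representation in the bottleneck layer.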

The ‘*’ operator indicates element-wise multiplication, which requires the two matrices to have the same dimensions; if the shapes are incompatible, it raises an error. The addition below, by contrast, can succeed through broadcasting once b is transposed:

C = a + b.T
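With illustrative shapes (assumed here, since the original question's dimensions are not shown: `a` as (3, 4) and `b` as (4, 1)), NumPy behaves as follows:

```python
import numpy as np

a = np.random.randn(3, 4)
b = np.random.randn(4, 1)

try:
    bad = a * b          # (3, 4) and (4, 1) are not broadcast-compatible
except ValueError:
    bad = None           # element-wise multiply raises an error here

C = a + b.T              # b.T has shape (1, 4) and broadcasts over a's rows
# C.shape -> (3, 4)
```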

These are based on the idea that the output at a time step may depend not only on the previous elements in the sequence, but also on future elements. For example:

To predict a missing word in a sequence, you want to look at both the left and the right context. Bidirectional RNNs are quite simple: they are just two RNNs stacked on top of each other, and the output is computed based on the hidden states of both RNNs.

LSTMs have a chain like structure similar to standard RNNs. But instead of having a single neural network layer, there are four, interacting in a very special way.

In the below figure,

**Ct-1:**the cell state from time point t-1;**Xt:**the input at time point t;**ht:**the output at time point t, which goes both to the output layer and to the hidden layer at the next time point.

Thus, every block has three inputs (xt, ht-1, and ct-1) and two outputs (ht and ct).

Legend for the figure:

- Arrow: denotes vector transfer
- Merging lines: denote concatenation
- Forking lines: show that the information is copied and transferred in 2 different directions
- Circle: denotes pointwise operations like vector addition
- Yellow box: a learned neural network layer

Understanding the idea behind LSTMs, step by step:

- We have the new value xt and the value ht-1 from the previous node.
- In the first step, these values are combined and passed through the sigmoid activation function, where it is decided whether the forget valve should be open, closed or open to some extent.
- In the next step, the same values, or actual vectors of values, go in parallel through a tanh layer operation, where we decide what value we’re going to pass to the memory pipeline, and through another sigmoid layer operation, where it is decided whether that value is going to be passed to the memory pipeline and to what extent.
- Then, we have the memory flowing through the top pipeline. If the forget valve is open and the memory valve is closed, then the memory will not change. Otherwise, if the forget valve is closed and the memory valve is open, the memory will be updated completely.
- Finally, xt and ht-1 are combined to decide what part of the memory pipeline is going to become the output of this module.

The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

An LSTM has three of these gates, to protect and control the cell state.

- The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at ht−1and xt, and outputs a number between 0 and 1 for each number in the cell state Ct−1.

Here, 1 represents “completely keep this”, while 0 represents “completely get rid of this”.

- The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the input gate layer decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, ~Ct, that could be added to the state.
- Finally, we need to decide what we’re going to output. This output layer will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
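The three gates described above can be sketched as a single LSTM time step in NumPy (a minimal illustration; the weight layout, variable names and sizes here are my own choices, with the four gates' weights stacked in one matrix):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to the 4 stacked gates."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * n:1 * n])        # forget gate
    i = sigmoid(z[1 * n:2 * n])        # input gate
    c_tilde = np.tanh(z[2 * n:3 * n])  # candidate cell values
    o = sigmoid(z[3 * n:4 * n])        # output gate
    c_t = f * c_prev + i * c_tilde     # update the cell state
    h_t = o * np.tanh(c_t)             # filtered output
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, inputs = 5, 3
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inputs), h, c, W, b)
```

Because the output is a sigmoid-gated tanh of the cell state, every component of h stays strictly between −1 and 1.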

For further clarification, please refer: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Here are few ideas to keep in mind when manually optimizing the hyperparameters:

- Using regularization methods that include l1, l2, dropout, and others help avoid overfitting.
- More data is always better
- Train over multiple epochs
- Early stopping - evaluate the test set performance at each epoch to know when to stop
- Learning rate is the single most important parameter
- For LSTMs, use softsign over tanh as it is faster and less prone to saturation (~0 gradients)
- Optimizers such as RMSprop, AdaGrad or momentum are usually good choices
- Finally remember data normalization, MSE Loss function + identity activation function for regression, Xavier weight initialization

A Boltzmann machine is a network of symmetrically coupled stochastic binary units. In other words, it is a shallow 2-layer neural net that makes stochastic decisions about whether a neuron should be on or off, where the 1st layer is the visible layer and the 2nd layer is the hidden layer.

When the nodes are connected to each other only across the layers, and no two nodes of the same layer are connected, the model is known as a Restricted Boltzmann Machine (RBM).

Fig: Boltzmann machine


