TensorFlow Interview Questions

TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.



for epoch in range(training_epochs):
    for (x, y) in zip(train_X, train_Y):
        sess.run(optimizer, feed_dict={X: x, Y: y})

Here, the initializer is run and all the training data is fit by running a loop for all the epochs.

x = tf.constant(35, name = 'x')
y = tf.Variable(x+5, name = 'y')

Output  is 

<tf.Variable 'y:0' shape=() dtype=int32_ref>

Here, y is effectively an equation that means “when this variable is computed, take the value of x (as it is then) and add 5 to it”. Note that x is a constant and y is a variable, not placeholders; neither has a computed value yet, because nothing is evaluated until the variables are initialized and the graph is run in a session.

In short, the result of these lines of code is an abstract tensor in the computation graph: it defines the model, but no process has run to calculate the result.

import tensorflow as tf
# Initialize two constants
x1 = tf.constant([1,2,3,4])
x2 = tf.constant([5,6,7,8])
# Multiply
result = tf.multiply(x1, x2)
# Initialize the Session
sess = tf.Session()
# Print the result
print(sess.run(result))
# Close the session
sess.close()

Output : 

[ 5 12 21 32]
tf.convert_to_tensor(tensor1d, dtype = tf.float64)
  1. Import data, generate data, or set up a data pipeline through placeholders.
  2. Feed data through the computational graph.
  3. Evaluate output on loss function.
  4. Use backpropagation to modify the variables.
  5. Repeat until stopping condition.

The data is usually not in the dimension or type that our TensorFlow algorithms expect. Since most of the algorithms expect normalized data, we transform our data before we can use it. 

TensorFlow has built-in functions that can normalize the data for you.

data = tf.nn.batch_norm_with_global_normalization()

Tensorflow depends on us telling it what it can and cannot modify. Tensorflow will modify the variables during optimization to minimize a loss function. To accomplish this, we feed in data through placeholders. We need to initialize both of these, variables and placeholders with size and type, so that Tensorflow knows what to expect.

Example: Below is an implementation of declaring a placeholder

We declare a placeholder by using TensorFlow's function, tf.placeholder(), which accepts a data-type argument (tf.float32) and a shape argument. Note that the shape can be a tuple or a list.

a_var = tf.constant(42) 
x_input = tf.placeholder(tf.float32, [None, input_size]) 
y_input = tf.placeholder(tf.float32, [None, num_classes])
  • Model Structure - We define the model after we have the data and have initialized the variables and placeholders. This is done by building a ‘computational graph’. Here, we tell TensorFlow what operations must be done on the variables and the placeholders to arrive at our model predictions. 
y_pred = tf.add(tf.matmul(x_input, weight_matrix), b_matrix)
  • Loss functions - After defining the model, we must be able to evaluate the output. This is where we declare the loss function. The loss function is very important as it tells us how far off our predictions are from the actual values. 
loss = tf.reduce_mean(tf.square(y_actual - y_pred))

A. Numerical Loss Functions

# x_vals holds the predicted values
  1. L2 Loss
# L2 loss
# L2 = (pred - actual)^2
l2_loss = tf.square(target - x_vals)
  2. L1 Loss

This is very similar to L2 except that we take the absolute value of the difference instead of squaring it.

# L1 loss
# L1 = abs(pred - actual)
l1_loss = tf.abs(target - x_vals)
  3. Pseudo-Huber Loss

The pseudo-Huber loss function is a smooth approximation to the L1 loss as the (predicted - target) values get larger. When the predicted values are close to the target, the pseudo-Huber loss behaves like the L2 loss.

# L = delta^2 * (sqrt(1 + ((pred - actual)/delta)^2) - 1)
# Pseudo-Huber with delta = 0.25
delta1 = tf.constant(0.25)
phuber = tf.multiply(tf.square(delta1), tf.sqrt(1. + tf.square((target - x_vals)/delta1)) - 1.)
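
To make the comparison between the three numerical losses concrete, here is a minimal plain-Python sketch of the same formulas. It does not use TensorFlow; the function names and the sample residuals are illustrative assumptions, and delta = 0.25 is taken from the snippet above.

```python
import math

# Plain-Python versions of the three formulas (illustrative names, not TensorFlow APIs).
def l2_loss(pred, actual):
    return (pred - actual) ** 2

def l1_loss(pred, actual):
    return abs(pred - actual)

def pseudo_huber_loss(pred, actual, delta=0.25):
    return delta ** 2 * (math.sqrt(1.0 + ((pred - actual) / delta) ** 2) - 1.0)

target = 0.0
for pred in (0.01, 0.1, 1.0, 10.0):
    # Columns: prediction, L2, L1, pseudo-Huber
    print(pred, l2_loss(pred, target), l1_loss(pred, target),
          pseudo_huber_loss(pred, target))
```

For small residuals the pseudo-Huber value is close to half the L2 loss, while for large residuals it grows roughly linearly (about delta times the L1 loss), which is why it is less sensitive to outliers than L2.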

B. Categorical Loss Functions

  1. Hinge Loss

# Hinge loss
# Use for predicting binary (-1, 1) classes
# L = max(0, 1 - (pred * actual))
hinge = tf.maximum(0., 1. - tf.multiply(target, x_vals))
  2. Cross Entropy Loss

The cross entropy loss measures the loss between categorical targets and the model's predictions. In the formula below, pred is treated as a probability between 0 and 1.

# Cross entropy loss
# L = -actual * (log(pred)) - (1-actual)(log(1-pred))
C_entropy = - tf.multiply(target, tf.log(x_vals)) - tf.multiply((1. - target), tf.log(1. - x_vals))
  3. Sigmoid Cross Entropy Loss

# L = -actual * (log(sigmoid(pred))) - (1-actual)(log(1-sigmoid(pred)))
# or
# L = max(pred, 0) - pred * actual + log(1 + exp(-abs(pred)))
x_val_input = tf.expand_dims(x_vals, 1)
target_input = tf.expand_dims(targets, 1)
entropy_sigmoid = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_val_input, labels=target_input)
  4. Weighted Cross Entropy Loss

TensorFlow also has a weighted version of the sigmoid cross entropy loss above: it applies a weight to the positive targets.

# Weighted cross entropy loss
# L = -actual * (log(sigmoid(pred))) * weights - (1-actual)(log(1-sigmoid(pred)))
# or
# L = (1 - actual) * pred + (1 + (weights - 1) * actual) * log(1 + exp(-pred))
weight = tf.constant(0.5)
entropy_weighted = tf.nn.weighted_cross_entropy_with_logits(logits=x_vals, targets=targets, pos_weight=weight)
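
As a rough check on the sigmoid cross entropy formulas, the sketch below evaluates them in plain Python, without TensorFlow. It assumes the numerically stable form max(x, 0) - x * z + log(1 + exp(-abs(x))), where x is the logit (the prediction before the sigmoid) and z is the label; all function names here are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_ce_naive(logit, label):
    # Direct definition: -z * log(sigmoid(x)) - (1 - z) * log(1 - sigmoid(x))
    p = sigmoid(logit)
    return -label * math.log(p) - (1.0 - label) * math.log(1.0 - p)

def sigmoid_ce_stable(logit, label):
    # Algebraically equal form that avoids overflow for large |x|
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def weighted_sigmoid_ce(logit, label, pos_weight):
    # Same as the naive form, but the positive-label term is scaled by pos_weight
    p = sigmoid(logit)
    return -pos_weight * label * math.log(p) - (1.0 - label) * math.log(1.0 - p)

for logit in (-3.0, -0.5, 0.0, 2.0):
    for label in (0.0, 1.0):
        print(logit, label, sigmoid_ce_naive(logit, label),
              sigmoid_ce_stable(logit, label))
```

With pos_weight = 1 the weighted loss reduces to the ordinary sigmoid cross entropy; values below 1 make errors on positive labels cheaper, values above 1 make them costlier.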

For performing linear regression, we will do the following – 

1. Create the linear regression computational graph output. This means we will accept an input, x, and generate the output, Ax + b.

2. We create a loss function, the L2 loss, and use that output with the learning rate to compute the gradients of the model variables, A and b, to minimize the loss.

import tensorflow as tf
# Creating variable for parameter slope (W) with initial value as 0.4
W = tf.Variable([.4], dtype=tf.float32)
# Creating variable for parameter bias (b) with initial value as -0.4
b = tf.Variable([-0.4], dtype=tf.float32)
# Creating placeholders for providing input or independent variable, denoted by x
x = tf.placeholder(tf.float32)
# Equation of Linear Regression
linear_model = W * x + b
# Initializing all the variables
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
# Running regression model to calculate the output w.r.t. the provided x values
print(sess.run(linear_model, {x: [1, 2, 3, 4]}))
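
The two steps described above (compute Ax + b, then follow the gradient of the L2 loss) can be sketched without TensorFlow as follows. This is only an illustration under stated assumptions: the gradients are derived by hand for the mean squared error, and the learning rate, step count and target data y = 2x are arbitrary choices, not values from the original.

```python
def fit_line(xs, ys, A=0.4, b=-0.4, learning_rate=0.05, steps=2000):
    """Fit y = A*x + b by plain gradient descent on the mean L2 loss."""
    n = len(xs)
    for _ in range(steps):
        errors = [(A * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of mean((A*x + b - y)^2) with respect to A and b
        grad_A = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        A -= learning_rate * grad_A
        b -= learning_rate * grad_b
    return A, b

A, b = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(A, b)  # approaches A = 2, b = 0
```

In TensorFlow the gradient computation and the update are handled by an optimizer, so only the model and the loss need to be written by hand.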
z = -x  # z = tf.negative(x)
z = x + y  # z = tf.add(x, y)
z = x - y  # z = tf.subtract(x, y)
z = x * y  # z = tf.multiply(x, y)
z = x / y  # z = tf.div(x, y)
z = x // y  # z = tf.floordiv(x, y)
z = x % y  # z = tf.mod(x, y)
z = x ** y  # z = tf.pow(x, y)
z = x @ y  # z = tf.matmul(x, y)
z = x > y  # z = tf.greater(x, y)
z = x >= y  # z = tf.greater_equal(x, y)
z = x < y  # z = tf.less(x, y)
z = x <= y  # z = tf.less_equal(x, y)
z = abs(x)  # z = tf.abs(x)
z = x & y  # z = tf.logical_and(x, y)
z = x | y  # z = tf.logical_or(x, y)
z = x ^ y  # z = tf.logical_xor(x, y)
z = ~x  # z = tf.logical_not(x)

A loss function measures how far the current output of the model is from the desired or target output. Here, we’ll use a loss function commonly used for linear regression models, called the Sum of Squared Errors or SSE. SSE is calculated w.r.t. the model output (represented by linear_model) and the desired or target output (y).

import tensorflow as tf
# Creating variable for parameter slope (W) with initial value as 0.4
W = tf.Variable([.4], dtype=tf.float32)
# Creating variable for parameter bias (b) with initial value as -0.4
b = tf.Variable([-0.4], dtype=tf.float32)
# Creating placeholders for providing input or independent variable, denoted by x
x = tf.placeholder(tf.float32)
# Equation of Linear Regression
linear_model = W * x + b
# Initializing all the variables
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
# Placeholder for the desired or target output, denoted by y
y = tf.placeholder(tf.float32)
error = linear_model - y
squared_errors = tf.square(error)
loss = tf.reduce_sum(squared_errors)
print(sess.run(loss, {x: [1, 2, 3, 4], y: [2, 4, 6, 8]}))
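
To see what value the SSE takes for these inputs, the same arithmetic can be worked through in plain Python. This is a sketch of the calculation only, using W = 0.4, b = -0.4 and the x and y values fed to the session above.

```python
W, b = 0.4, -0.4
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Model output W*x + b for every input: 0.0, 0.4, 0.8, 1.2
predictions = [W * x + b for x in xs]
# Squared differences: 4.0, 12.96, 27.04, 46.24
squared_errors = [(p - y) ** 2 for p, y in zip(predictions, ys)]
sse = sum(squared_errors)
print(round(sse, 2))  # prints 90.24
```

A large SSE like this one simply reflects that the initial W and b are far from the values that fit the data (W = 2, b = 0).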

The lower-level APIs inside TensorFlow, such as tf.matmul or tf.nn.relu, are used to build a neural network architecture.

APIs outside Tensorflow are -

  • TFLearn:

It should not be confused with TF Learn, which is TensorFlow’s tf.contrib.learn. It is a separate Python package.

  • TensorLayer: 

It comes as a separate package and differs from what TensorFlow’s own layers API offers.

  • Pretty Tensor: 

It is actually a Google project which offers a fluent interface with chaining.

  • Sonnet

It is a project of Google’s DeepMind which features a modular approach.

Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. To make sense of an expected prediction, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis, creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models' predictions are from the correct value.

Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.

Essentially, bias is how removed a model's predictions are from correctness, while variance is the degree to which these predictions vary between model iterations.
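
The thought experiment of rebuilding the model many times can be simulated directly. The sketch below is an illustration under assumptions that are not in the original: the "model" is just an estimate of a single true value (5.0) from 10 noisy observations, and two hypothetical estimators are compared, the sample mean and a deliberately shrunk mean.

```python
import random
import statistics

def simulate(estimator, true_value=5.0, n_points=10, n_models=2000, seed=0):
    """Rebuild the 'model' n_models times on fresh noisy data and
    return (bias, variance) of its predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = [true_value + rng.gauss(0.0, 1.0) for _ in range(n_points)]
        predictions.append(estimator(sample))
    bias = statistics.mean(predictions) - true_value
    variance = statistics.pvariance(predictions)
    return bias, variance

def sample_mean(sample):
    return statistics.mean(sample)

def shrunk_mean(sample):
    # Hypothetical estimator: halving the mean lowers variance but adds bias
    return 0.5 * statistics.mean(sample)

bias_a, var_a = simulate(sample_mean)
bias_b, var_b = simulate(shrunk_mean)
print("sample mean: bias", bias_a, "variance", var_a)
print("shrunk mean: bias", bias_b, "variance", var_b)
```

The sample mean shows almost no bias but a larger variance, while the shrunk mean trades a large bias for a smaller variance.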

It is useful because it normalizes (adjusts) all the inputs before sending them to the subsequent layer.

Word embeddings are used in Natural Language Processing as a representation of words. They can be learned in TensorFlow, where Word2vec is a well-known approach.

The two models used are the continuous bag-of-words model and the skip-gram model.

model = tf.keras.Sequential([
  tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(4,)),  # input shape required
  tf.keras.layers.Dense(10, activation=tf.nn.relu),
  tf.keras.layers.Dense(3)  # output layer with 3 nodes, one per label
])

The TensorFlow tf.keras API is used to create models and layers which makes it easier to build models and experiment while Keras handles the complexity of connecting everything together.

The tf.keras.Sequential model is a linear stack of layers. Its constructor takes a list of layer instances, in this case, two Dense layers with 10 nodes each, and an output layer with 3 nodes representing our label predictions. The first layer's input_shape parameter corresponds to the number of features from the dataset, and is required.

The activation function determines the output of each node in the layer. These non-linearities are important: without them the model would be equivalent to a single layer. 

There are many available activations, such as sigmoid and hyperbolic tangent, but ReLU is common for hidden layers.
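
The claim that a model without non-linearities collapses into a single layer can be checked with a tiny scalar example. This is a plain-Python sketch, not Keras code; the weights and biases are arbitrary illustrative numbers.

```python
def dense(x, w, b, activation=None):
    """One-unit layer: w*x + b, optionally passed through an activation."""
    y = w * x + b
    return activation(y) if activation else y

def relu(v):
    return max(0.0, v)

def two_linear_layers(x):
    # -3 * (2x + 1) + 0.5
    return dense(dense(x, 2.0, 1.0), -3.0, 0.5)

def single_linear_layer(x):
    # The same function written as one layer: -6x - 2.5
    return dense(x, -6.0, -2.5)

def two_layers_with_relu(x):
    return dense(dense(x, 2.0, 1.0, activation=relu), -3.0, 0.5)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, two_linear_layers(x), single_linear_layer(x), two_layers_with_relu(x))
```

The first two columns always agree, while the ReLU version differs for inputs where the hidden unit is switched off, which is what lets deeper networks represent more than a single linear map.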

Estimators is a high-level API that reduces much of the code you previously needed to write when training a TensorFlow model. Estimators are very flexible, allowing you to override the default behavior if you have specific requirements for your model.

There are two possible ways you can build your model using Estimators:

  • Pre-made Estimator - These are predefined Estimators, created to generate a specific type of model. For example, DNNClassifier is a pre-made Estimator.

  • Estimator (base class) - Gives you complete control of how your model should be created by using a model_fn function.

  1. Total steps to train
  2. Number of samples per batch
  3. Number of Inputs
  4. Number of features
  5. Number of Trees
  6. Max nodes

Example code:

# Parameters
num_steps = 500
batch_size = 1024
num_classes = 10
num_features = 784
num_trees = 10
max_nodes = 1000
# Input and Target data
X = tf.placeholder(tf.float32, shape=[None, num_features])
# For random forest, labels must be integers (the class id)
Y = tf.placeholder(tf.int32, shape=[None])
# Random Forest Parameters
hparams = tensor_forest.ForestHParams(num_classes=num_classes,
                                      num_features=num_features,
                                      num_trees=num_trees,
                                      max_nodes=max_nodes).fill()

Below is an implementation of the logistic regression algorithm using Tensorflow library. Here we’ll make use of the famous MNIST dataset.

import tensorflow as tf
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
# Parameters
learning_rate = 0.01
training_epochs = 25
batch_size = 100
display_step = 1
# tf Graph Input
x = tf.placeholder(tf.float32, [None, 784]) # mnist data image of shape 28*28=784
y = tf.placeholder(tf.float32, [None, 10]) # 0-9 digits recognition => 10 classes
# Set model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
# Construct model
pred = tf.nn.softmax(tf.matmul(x, W) + b) # Softmax
# Minimize error using cross entropy
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
# Start training
with tf.Session() as sess:
   # Run the initializer
   sess.run(init)
   # Training cycle
   for epoch in range(training_epochs):
       avg_cost = 0.
       total_batch = int(mnist.train.num_examples/batch_size)
       # Loop over all batches
       for i in range(total_batch):
           batch_xs, batch_ys = mnist.train.next_batch(batch_size)
           # Fit training using batch data
           _, c = sess.run([optimizer, cost], feed_dict={x: batch_xs,
                                                         y: batch_ys})
           # Compute average loss
           avg_cost += c / total_batch
       # Display logs per epoch step
       if (epoch+1) % display_step == 0:
           print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost))
   # Test model
   correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
   # Calculate accuracy for 3000 examples
   accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
   print("Accuracy:", accuracy.eval({x: mnist.test.images[:3000], y: mnist.test.labels[:3000]}))

Source: aymericdamien/Tensorflow

TensorFlow provides a wide variety of statistical distribution functions, located inside:


including but not limited to distributions like Bernoulli, Beta, Chi2, Dirichlet, Gamma, Uniform, etc. They are important building blocks when it comes to building machine learning algorithms, especially for probabilistic approaches like Bayesian models.

Word2Vec algorithm is used to compute the vector representations of the words. 

Parameters to use:

embedding_size # Dimension of the embedding vector
max_vocabulary_size # Total number of different words in the vocabulary
min_occurrence # Remove all words that do not appear at least n times
skip_window # How many words to consider left and right
num_skips # How many times to reuse an input to generate a label
num_sampled # Number of negative examples to sample

Below is an implementation of the nearest neighbor (KNN) algorithm in TensorFlow.

import numpy as np
import tensorflow as tf
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
# In this example, we limit mnist data
Xtrain, Ytrain = mnist.train.next_batch(5000) #5000 for training (nn candidates)
Xtest, Ytest = mnist.test.next_batch(200) #200 for testing
# tf Graph Input
xtrain = tf.placeholder("float", [None, 784])
xtest = tf.placeholder("float", [784])
# Nearest Neighbor calculation using L1 Distance
# Calculate L1 Distance
distance = tf.reduce_sum(tf.abs(tf.add(xtrain, tf.negative(xtest))), reduction_indices=1)
# Prediction: Get min distance index (Nearest neighbor)
pred = tf.argmin(distance, 0)
accuracy = 0.
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
# Start training
with tf.Session() as sess:
   # Run the initializer
   sess.run(init)
   # loop over test data
   for i in range(len(Xtest)):
       # Get nearest neighbor
       nn_index = sess.run(pred, feed_dict={xtrain: Xtrain, xtest: Xtest[i, :]})
       # Get nearest neighbor class label and compare it to its true label
       print("Test", i, "Prediction:", np.argmax(Ytrain[nn_index]),
             "True Class:", np.argmax(Ytest[i]))
       # Calculate accuracy
       if np.argmax(Ytrain[nn_index]) == np.argmax(Ytest[i]):
           accuracy += 1./len(Xtest)
   print("Accuracy:", accuracy)

In TensorFlow you create a graph and pass values to it. The graph does all the hard work and generates the output based on the configuration you have made in the graph. To pass values to the graph, you first need to create a TensorFlow session.

Once the session is initialized, you are supposed to use that session, because all the variables and settings are now part of it.

So, there are two ways to pass external values to the graph so that it accepts them. One is to call .run() on the session being executed. The other, which is basically a shortcut, is to use .eval(). It is a shortcut because the full form of values.eval() is tf.get_default_session().run(values).

If you replace values.eval() with tf.get_default_session().run(values), you get the same behavior: eval() uses the default session and then executes run().

Weighted standard error (WeightedR2) is a metric that computes the coefficient of determination. 

It is useful for evaluating a linear regression.

# To be used with TFLearn estimators
weighted_r2 = WeightedR2()
regression = regression(net, metric=weighted_r2)

The ROC AUC score measures overall performance across the full range of threshold levels.

tflearn.objectives.roc_auc_score(y_pred, y_true)

1. Stochastic Gradient Descent

SGD Optimizer accepts learning rate decay. When training a model, it is often recommended to lower the learning rate as the training progresses.

# With TFLearn estimators.
sgd = SGD(learning_rate=0.01, lr_decay=0.96, decay_step=100)
regression = regression(net, optimizer=sgd)
# Without TFLearn estimators (returns tf.Optimizer).
sgd = SGD(learning_rate=0.01).get_tensor()

2. RMSprop

RMSprop maintains a moving (discounted) average of the square of gradients and divides the gradient by the root of this average.

# With TFLearn estimators.
rmsprop = RMSProp(learning_rate=0.1, decay=0.999)
regression = regression(net, optimizer=rmsprop)
# Without TFLearn estimators (returns tf.Optimizer).
rmsprop = RMSProp(learning_rate=0.01, decay=0.999).get_tensor()
# or
rmsprop = RMSProp(learning_rate=0.01, decay=0.999)()

3. Adam

Adam is a method for stochastic optimization. The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is 1.0 or 0.1.

# With TFLearn estimators
adam = Adam(learning_rate=0.001, beta1=0.99)
regression = regression(net, optimizer=adam)
# Without TFLearn estimators (returns tf.Optimizer)
adam = Adam(learning_rate=0.01).get_tensor()

4. Momentum

Momentum Optimizer accepts learning rate decay. When training a model, it is often recommended to lower the learning rate as the training progresses. The function returns the decayed learning rate. 

# With TFLearn estimators
momentum = Momentum(learning_rate=0.01, lr_decay=0.96, decay_step=100)
regression = regression(net, optimizer=momentum)
# Without TFLearn estimators (returns tf.Optimizer)
mm = Momentum(learning_rate=0.01, lr_decay=0.96).get_tensor()

5. AdaGrad

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

# With TFLearn estimators
adagrad = AdaGrad(learning_rate=0.01, initial_accumulator_value=0.01)
regression = regression(net, optimizer=adagrad)
# Without TFLearn estimators (returns tf.Optimizer)
adagrad = AdaGrad(learning_rate=0.01).get_tensor()

6. AdaDelta

An Adaptive Learning Rate Method

tflearn.optimizers.AdaDelta(learning_rate=0.001, rho=0.1, epsilon=1e-08, use_locking=False, name='AdaDelta')
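
To illustrate what one of these optimizers does internally, here is a minimal plain-Python sketch of the RMSprop rule described above (a moving average of squared gradients, with the gradient divided by its root). It is not the TFLearn implementation; the function name, the test function f(w) = w^2 and the hyperparameter values are assumptions chosen for illustration.

```python
import math

def rmsprop_minimize(grad_fn, w, learning_rate=0.01, decay=0.9,
                     epsilon=1e-8, steps=500):
    """Minimize a 1-D function given its gradient, using the RMSprop update."""
    avg_sq_grad = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # Moving (discounted) average of the squared gradient
        avg_sq_grad = decay * avg_sq_grad + (1.0 - decay) * g * g
        # Divide the gradient by the root of that average
        w -= learning_rate * g / (math.sqrt(avg_sq_grad) + epsilon)
    return w

# f(w) = w^2 has gradient 2w and its minimum at w = 0
w_final = rmsprop_minimize(lambda w: 2.0 * w, w=1.0)
print(w_final)
```

Because the gradient is rescaled by its recent magnitude, each step has a size of roughly the learning rate regardless of how steep the function is, so the result ends up close to the minimum rather than exactly on it.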

a) FeedDictFlow - Generates a stream of batches from a dataset. It uses two queues, one for generating batches of data ids, and the other to load data and apply preprocessing. 

If continuous is True, the data flow never ends until stop is invoked or the coordinator interrupts the threads.


It takes the following arguments:

1. feed_dict: A TensorFlow formatted feed dict (with placeholders as keys and data as values).

2. coord: A TensorFlow coordinator.

3. num_threads: Total number of simultaneous threads to process data.

4. max_queue: Maximum number of data items stored in a queue.

5. shuffle: If True, data will be shuffled.

6. continuous: If True, when an epoch is over, the same data will be fed again.

and a few others.

b) ArrayFlow - Convert array samples to tensors and store them in a queue.

tflearn.data_flow.ArrayFlow()

Arguments -

1. X: The features data array.

2. Y: The targets data array.

3. multi_inputs: Set to True if X has multiple input sources (i.e. X is a list of arrays).

4. batch_size: The batch size.

5. shuffle: If True, data will be shuffled.

The argument ‘num_words = 10000’ keeps the 10,000 most frequently occurring words in the training dataset and discards the rarely occurring ones.


train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy) 

Building the neural network requires you to configure the layers of the model and then after the model is ready, you compile the model.

Here, when compiling the model, we’ve added a loss function, which measures how accurate the model is during training. Next, we’ve made use of the ‘adam’ optimizer, which updates the model based on the data it sees and its loss function. Finally, the accuracy metric is used to monitor the training and testing steps. It tells you the fraction of images that are correctly classified.

To answer in a nutshell, both of them are the loss functions used in classification tasks.

  • If your targets are one-hot encoded (when categorical variables are represented as binary vectors), we use categorical cross entropy.

  • If the targets are integers, we use sparse categorical cross entropy.
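
The two losses compute the same quantity and differ only in how the target is encoded, which the following plain-Python sketch tries to show. The function names mirror the loss names but are illustrative re-implementations, not the Keras functions, and the probability vector is made up.

```python
import math

def categorical_crossentropy(one_hot_target, probs):
    # Target given as a one-hot vector, e.g. [0, 1, 0]
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

def sparse_categorical_crossentropy(class_index, probs):
    # Target given as an integer class id, e.g. 1
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]  # predicted class probabilities
loss_one_hot = categorical_crossentropy([0, 1, 0], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)
print(loss_one_hot, loss_sparse)  # both equal -log(0.7)
```

The sparse form avoids materializing one-hot vectors, which saves memory when there are many classes.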

It can be done in the following ways :

  • Convert the arrays into vectors of 0s and 1s indicating word occurrence, similar to a one-hot encoding. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. Then, make the first layer in our network a dense layer that can handle floating point vector data.
  • Or we can pad the arrays so that they all have the same length, and then create an integer tensor of shape ‘a*b’, whose two dimensions are the number of examples and the padded length.


tf.pad() – this operation pads a tensor according to the paddings you specify, where paddings are a tensor of type int32.
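
Both options can be sketched in a few lines of plain Python. The helpers below are hypothetical stand-ins written for illustration, not TensorFlow or Keras functions, and padding at the end of each sequence with 0 is an assumption.

```python
def multi_hot(sequence, dimension):
    """Option 1: vector of 0s and 1s marking which word indices occur."""
    vector = [0.0] * dimension
    for index in sequence:
        vector[index] = 1.0
    return vector

def pad_sequences(sequences, max_length, pad_value=0):
    """Option 2: truncate or right-pad every sequence to max_length."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:max_length])
        padded.append(trimmed + [pad_value] * (max_length - len(trimmed)))
    return padded

print(multi_hot([3, 5], 10))
# prints [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(pad_sequences([[1, 2, 3], [4], [5, 6, 7, 8, 9]], max_length=4))
# prints [[1, 2, 3, 0], [4, 0, 0, 0], [5, 6, 7, 8]]
```

The multi-hot form discards word order and needs one float per vocabulary entry, while the padded form keeps order and is the usual input to an embedding layer.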

The idea behind hashing trick is that instead of maintaining a one-to-one mapping of categorical feature values to locations in the feature vector, we use a hash function to determine the feature's location in a vector of lower dimension.

Let's say our text is "the quick brown fox" and we would like to represent it as a vector. The first thing we need to do is fix the length of the vector; let's say we would like to use 5 dimensions.

Once we fix the number of dimensions, we need a hash function that takes a string and returns a number between 0 and n-1; in this case, between 0 and 4. 

We can use any good hash function h and take h(string) mod n to get a number between 0 and n-1.

For example:

h(the) mod 5 = 0
h(quick) mod 5 = 1
h(brown) mod 5 = 1
h(fox) mod 5 = 3

Once we have this we can simply construct our vector as: (1,2,0,1,0)

What we’ve done above is we’ve simply added 1 to the nth dimension of the vector each time our hash function returns that dimension for a word in the text. This is called feature hashing or "the hashing trick".
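
The worked example above can be reproduced in a short sketch. The hash values in TOY_HASH are made-up numbers chosen only so that h(word) mod 5 matches the example (0, 1, 1, 3); the crc32-based function is one possible real hash and is likewise an assumption, not something the original prescribes.

```python
import zlib

# Hypothetical hash values picked so that h(word) % 5 matches the example above
TOY_HASH = {"the": 10, "quick": 6, "brown": 11, "fox": 8}

def crc32_hash(token):
    """A deterministic real hash that could be used instead of the toy table."""
    return zlib.crc32(token.encode("utf-8"))

def hash_features(tokens, n_dims, hash_fn):
    """Add 1 to dimension hash_fn(token) % n_dims for every token."""
    vector = [0] * n_dims
    for token in tokens:
        vector[hash_fn(token) % n_dims] += 1
    return vector

tokens = "the quick brown fox".split()
vector = hash_features(tokens, 5, TOY_HASH.__getitem__)
print(vector)  # prints [1, 2, 0, 1, 0]
print(hash_features(tokens, 5, crc32_hash))  # same idea with a real hash
```

Note that "quick" and "brown" collide in dimension 1; such collisions are the price paid for a fixed, small vector size.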

The above implementation is for building a text classification model. Here, the layers are stacked sequentially to build a classifier.

  • The first layer is the embedding layer, which takes the integer-encoded words and looks up the embedding vector for each word index. These vectors are learned as the model trains, and they add a dimension to the output array.

  • Next, a global average pooling 1D layer returns a fixed-length output vector. This allows the model to handle input of variable length in the simplest way possible.

  • The above output vector is then passed through a fully-connected Dense layer with 16 hidden units.

  • The last layer is again densely connected, but with a single output node, where we’ve made use of the sigmoid activation function, which gives a float value between 0 and 1, representing a probability.

Dense is the only actual network layer in the model, and in fact the most basic layer in a network. It feeds all the outputs from the previous layer to all of its neurons, each neuron providing one output to the next layer.

It applies an element-wise non-linear transformation, such as the hyperbolic tangent (‘tanh’), resulting in a vector whose size is the number of neurons.

For example, according to the Keras documentation, Dense implements the operation:

output = activation(dot(input, kernel) + bias)

where activation is the element-wise activation function, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer.
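
The operation output = activation(dot(input, kernel) + bias) can be spelled out for a single input vector in plain Python. This is a sketch of the arithmetic only, not the Keras layer; the shapes (2 inputs, 3 neurons) and the numbers are illustrative assumptions.

```python
import math

def dense_forward(inputs, kernel, bias, activation):
    """inputs: n_in values, kernel: n_in x n_out matrix, bias: n_out values."""
    outputs = []
    for j in range(len(bias)):
        # dot(input, kernel) for neuron j, plus its bias
        pre_activation = sum(inputs[i] * kernel[i][j]
                             for i in range(len(inputs))) + bias[j]
        # element-wise activation
        outputs.append(activation(pre_activation))
    return outputs

inputs = [1.0, 2.0]
kernel = [[0.5, -1.0, 0.0],
          [0.25, 0.0, 1.0]]
bias = [0.0, 0.5, -2.0]
outputs = dense_forward(inputs, kernel, bias, math.tanh)
print(outputs)  # tanh of the pre-activations 1.0, -0.5 and 0.0
```

The output has one entry per neuron, which matches the statement that the result is a vector whose size is the number of neurons.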

Linear Classifier:


Linear classifiers take less time to train and give you only linear decision boundaries.

Kernel Classifier:


According to the TensorFlow documentation, it is a pre-packaged tf.contrib.learn Estimator that combines the power of explicit kernel mappings with linear models. It needs more complicated training algorithms (often involving convex quadratic programming) and provides non-linear classification boundaries.

So, when the data classes are linearly separable, we can use linear classifiers. Otherwise, kernel classifiers are a better option.

Word embeddings can be considered the building blocks for using neural networks to do NLP. They let us represent words in the form of vectors. These are not random vectors: the aim is to represent words such that similar words, or words used in similar contexts, are closer to each other.

So a natural language modelling technique like Word Embedding is used to map words or phrases from a vocabulary to a corresponding vector of real numbers. 

It has 2 important and advantageous properties:

  • Dimensionality reduction

  • Contextual similarity 

In other words, it helps in building a low-dimensional vector representation from corpus of text, which preserves the contextual similarity of words.

For example:

Words like cats and dogs, which are similar in a lot of ways, should be mapped close together, whereas Audi and BMW, both automobile companies, should be close to each other but far from cat and dog.

It also learns relations such as:

King -> Man, then Queen -> Woman

So the purpose of the word embeddings is to turn words into numbers, which algorithms like deep learning can then ingest and process, to formulate an understanding of Natural language.

KL divergence, also known as Kullback-Leibler divergence, is a measure of relative entropy which is used for evaluating the difference/distance between two probability distributions.

For example:

Given 2 probability distributions, A and B, the KL divergence from the distribution A to distribution B is defined as:

$$D_{KL}(A \parallel B) = \sum_{i=1}^{n} a_i \log \frac{a_i}{b_i}$$

where a_i and b_i are the probabilities that A and B assign to event i, for i = 1 to n (that is, a total of n events).

KL divergence is therefore one way to measure the distance between two distributions. But since it does not satisfy the properties of a metric (for example, it is not symmetric), it is called a divergence.
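
A small numeric sketch may help show why KL divergence is not a true distance. It assumes the standard discrete definition, the sum over events of a_i * log(a_i / b_i) with the natural logarithm; the two example distributions are made up.

```python
import math

def kl_divergence(a, b):
    """Sum of a_i * log(a_i / b_i); terms with a_i == 0 contribute nothing."""
    return sum(a_i * math.log(a_i / b_i) for a_i, b_i in zip(a, b) if a_i > 0)

A = [0.5, 0.5]
B = [0.9, 0.1]

print(kl_divergence(A, A))  # prints 0.0: a distribution does not diverge from itself
print(kl_divergence(A, B))  # roughly 0.51
print(kl_divergence(B, A))  # roughly 0.37, so the measure is not symmetric
```

Swapping the arguments changes the result, which is exactly the property that prevents KL divergence from being a metric.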


This layer creates a convolutional kernel that is convolved with the layer input to produce a tensor of outputs. 

Some of the arguments it takes are as follows:

  • Tensor input

  • No of filters in the convolution

  • Kernel size, which is an integer or list of 2 numbers.

  • Strides

  • Padding: One of ‘valid’ or ‘same’ (plus a few other arguments).

# Below is a code snippet for use of one single convolutional layer to modify an image,

input = image,
filters = 32,
kernel_size = [5,5],
strides = [1,3,3,1],
padding = 'SAME',
activation = tf.nn.relu,

# In continuous bag of words:

The aim is to fill in the missing word given its neighboring context. 

For example: given “When”, “in”, “____”, “speak”, “French”, the algorithm learns that “France” is the obvious choice.

# In skip gram model

Given a word, the algorithm predicts its context. 

So from the above example: given France (Input layer) predict ‘When’, ‘in’, ‘speak’, ‘French’ (Output) as its neighboring words.
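
The difference between the two models shows up clearly in the training examples each one is built from. The sketch below generates them for the sentence used above; the helper names and the window size of 2 are assumptions made for illustration.

```python
def cbow_examples(tokens, window):
    """CBOW: (context words) -> target word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_pairs(tokens, window):
    """Skip-gram: target word -> each context word, one pair at a time."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, c) for c in context)
    return pairs

tokens = ["When", "in", "France", "speak", "French"]

print(cbow_examples(tokens, 2)[2])
# prints (['When', 'in', 'speak', 'French'], 'France')
print([p for p in skipgram_pairs(tokens, 2) if p[0] == "France"])
# prints [('France', 'When'), ('France', 'in'), ('France', 'speak'), ('France', 'French')]
```

CBOW produces one example per position with the whole context as input, while skip-gram produces one example per (word, neighbor) pair.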

The exponential decay function takes the following arguments:

  • learning_rate – the initial learning rate 

  • global_step – the global step used for the decay computation 

  • decay_steps – the number of steps after which the rate has decayed by one factor of decay_rate 

  • decay_rate – the decay rate 

  • staircase (default False) – if True, decay the learning rate at discrete intervals

# Example: Below implementation returns the decayed learning rate

learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
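
How these arguments combine can be sketched in plain Python. The sketch assumes the usual formula learning_rate * decay_rate ^ (global_step / decay_steps), with integer division of the steps when staircase is True; it is a re-implementation for illustration, not the TensorFlow function, and the example values are arbitrary.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Return the decayed learning rate at a given global step."""
    if staircase:
        # Decay in discrete jumps, once every decay_steps steps
        exponent = global_step // decay_steps
    else:
        # Decay smoothly with every step
        exponent = global_step / decay_steps
    return learning_rate * decay_rate ** exponent

for step in (0, 500, 1000, 1500, 2000):
    print(step,
          exponential_decay(0.1, step, 1000, 0.96),
          exponential_decay(0.1, step, 1000, 0.96, staircase=True))
```

With staircase=True the rate stays constant within each block of decay_steps steps, which some training schedules prefer for stability.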

Neural style transfer is an optimization technique used to take three images,

  • a content image,

  •  a style reference image (an artwork by a famous artist), 

  • and the input image that we want to style 

and then blend them together in a way such that the input image is transformed to look like the content image, but painted in the style of the style image.

In neural style transfer, we define two distance functions,

  1. One describes how different the content of two images is.

  2. The other describes the difference between two images in terms of their style. 

Then, given three images, a desired style image, a desired content image, and the input image, we try to transform the input image to minimize its content distance from the content image and its style distance from the style image. In other words, we take the base input image, a content image that we want to match, and the style image that we want to match, and transform the base input image by minimizing the content and style distances (losses) with backpropagation, creating an image that matches the content of the content image and the style of the style image.

Computing style and content losses:

a) Content loss: It is a function that describes the distance between the content of our output image and that of our content image.

def get_content_loss(base_content, target):
  return tf.reduce_mean(tf.square(base_content - target))

b) Style Loss: Here instead of comparing the raw outputs of the base input image and the style image, we compare the Gram matrices of the two outputs.

def gram_matrix(input_tensor):
  channels = int(input_tensor.shape[-1])
  a = tf.reshape(input_tensor, [-1, channels])
  n = tf.shape(a)[0]
  gram = tf.matmul(a, a, transpose_a=True)
  return gram / tf.cast(n, tf.float32)
def get_style_loss(base_style, gram_target):
  height, width, channels = base_style.get_shape().as_list()
  gram_style = gram_matrix(base_style)
  return tf.reduce_mean(tf.square(gram_style - gram_target))

According to the TensorFlow documentation, tf.global_variables_initializer() returns an operation that initializes global variables; in effect, it works like a tf.assign op. 

Running tf.global_variables_initializer() in a session makes your variables hold the values you gave them when you declared them. 

# Now all variables are initialized.

TensorFlow uses a dataflow graph to represent your computation in terms of the dependencies between individual operations. In a dataflow graph, the nodes represent units of computation, and the edges represent the data consumed or produced by a computation.

The constants and operations that we create are automatically added to the graph in TensorFlow. The default graph is instantiated when the library is imported. Creating a graph object instead of using the default graph is useful when creating multiple models in one file that do not depend on each other.

new_graph = tf.Graph()
with new_graph.as_default():
    new_g_const = tf.constant([1., 2.])

Scoping is a mechanism for tensorflow to share the variables.  It is used to control the complexity of the model and makes it easier for us to break them down into individual pieces. 

# Below is an example of a scope nested inside another scope

with tf.name_scope("Scope1"):
    with tf.name_scope("Scope_nested"):
        nested_var = tf.multiply(5, 5)

An attention mechanism, also known as neural attention, equips a neural network with the ability to focus on a subset of its inputs.

Attention can be implemented as:

$$a = f_\phi(x), \qquad g = a \odot z$$

where ‘x’ is an input vector,

‘z’ is a feature vector,

‘a’ is an attention vector, 

‘g’ is an attention glimpse, obtained by element-wise multiplication between a and z,

and $f_\phi$ is the attention network with parameters $\phi$.

Stacking LSTM hidden layers makes the model deeper, more accurately earning the description as a deep learning technique. Increasing the depth of the network provides an alternate solution that requires fewer neurons and trains faster. Ultimately, adding depth is a type of representational optimization.

Also, given that LSTMs operate on sequence data, it means that the addition of layers adds levels of abstraction of input observations over time i.e. in stacked LSTMs, each LSTM layer outputs a sequence of vectors which will be used as an input to a subsequent LSTM layer. This hierarchy of hidden layers enables more complex representation of our time-series data, capturing information at different scales.

A Stacked LSTM architecture can be defined as an LSTM model comprised of multiple LSTM layers.

#Below is an implementation of stacking multiple LSTMs 

a. Using Keras:

b. using MultiRNNCell (Tensorflow) :

#Keras implementation
from keras.models import Sequential
from keras.layers import LSTM
from numpy import array
# define model where LSTM is also output layer
model = Sequential()
model.add(LSTM(1, return_sequences=True, input_shape=(3,1)))
model.compile(optimizer='adam', loss='mse')
# input time steps
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))

# Using Tensorflow

def lstm_cell():
  return tf.contrib.rnn.BasicLSTMCell(lstm_size)
stacked_lstm = tf.contrib.rnn.MultiRNNCell(
    [lstm_cell() for _ in range(number_of_layers)])

initial_state = state = stacked_lstm.zero_state(batch_size, tf.float32)
for i in range(num_steps):
    # The value of state is updated after processing each batch of words.
    output, state = stacked_lstm(words[:, i], state)

#rest of the code
final_state = state

