Artificial Intelligence Interview Questions

Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.

  • 4.6 Rating
  • 60 Question(s)
  • 55 Mins of Read
  • 8524 Reader(s)


Computer vision is the science of making computers understand what is happening within an image, for example which objects it contains. In the case of driverless cars, this covers pedestrian detection, lane detection, traffic sign recognition and so on.

It is a science that allows computers to understand the images and videos and determine what the computer sees or recognizes. 

It is divided into 3 basic categories:

  • Low-level vision: includes processing of images for the feature extraction
  • Intermediate level vision: involves object recognition and 3D scene interpretation
  • High-level vision: includes a conceptual description of the scene like activity etc.

It is applicable everywhere such as – 

  • Face recognition - Facebook
  • Object detection, surveillance detection - Security
  • Handwriting detection
  • Autonomous vehicles, Self-driving cars – driver vigilance monitoring
  • License plate number detection
  • Snapchat filters
  • Industrial automation

Common challenges include:

  • Capturing images from different viewpoints
  • Camera limitation – resolution of the image
  • Lighting – In daylight & night
  • Scaling 
  • Object variation – varying images of the object/different material/different textures

Computer vision allows computers to emulate human vision, that is, to understand the content of an image. Examples: object recognition, defect detection and automatic driving.

Image processing is a part of computer vision. It deals with enhancing the image and manipulating features such as color. Examples: smoothing, sharpening, and contrast adjustment and stretching.

  • A greyscale image is stored using 256 tones, with values ranging from 0 to 255
  • 0 is black and 255 is white
  • Values in between represent shades of grey
  • In binary, an 8-bit pixel is 00000000 for black and 11111111 for white

Note: Binary value of 11111111 is equal to the decimal value of 255.

  • For example, when detecting the lanes for the driverless cars, what if there are no lanes on road.
  • The lighting conditions may as well be dark outside.

Color spaces:

  • RGB – Red, green & blue image
  • HSV – Hue saturation value
  • HED
  • HSL
  • CMY’K
  • Y’UV – where (Y’) is one luma & (UV) is 2 chrominance components
  • Y’CbCr – (Y’) is luma component, Cb and Cr are the blue difference and red difference chroma components.
  • YIQ
  • LAB – (L) is lightness, and A and B are the two color components (green-red and blue-yellow); Euclidean distance in this space approximates the perceived color difference

Note: OpenCV stores color images in BGR order by default, not RGB.

Image features are distinctive parts of an image, such as edges, corners or objects, that help tell it apart from other images. A single feature is one piece of information about the image content.

They are important because machine learning relies on them to analyze, describe and match images. They are used to train classifiers that detect objects such as pedestrians and cars in the case of autonomous vehicles.

An input image carries a lot of extra information that is not required for image classification. We therefore extract the important information and leave out the rest. For example, running an edge detector on an image simplifies it, retaining the essential information and discarding the non-essential information. This step is called feature extraction.

It converts an image of a fixed size to a feature vector of fixed size.  

  • Haar-like features, introduced by Viola and Jones
  • Histogram of Oriented Gradients (HOG)
  • Scale-Invariant Feature Transform (SIFT)
  • Speeded-Up Robust Features (SURF)

In the HOG feature descriptor, the distribution (histogram) of the directions of gradients (oriented gradients) is used as the feature. It is a hand-designed feature that debuted in 2005. It converts the pixel-based representation into a gradient-based one and is often used with linear classifiers. It is based on the idea that local object appearance can be described effectively by the distribution of edge directions.

  • Signed variables are signed integers, which can represent both positive and negative numbers.
  • Unsigned variables are unsigned integers, which can represent only non-negative numbers.

ROI stands for region of interest: the portion of the image that you want to filter or operate on, which improves both accuracy and performance. For example, in eye detection, instead of searching the whole image we first obtain the face region and search for eyes only there.

  • Image addition using cv.add( )
  • Image blending using  cv.addWeighted( )
  • Bitwise operations such as AND, OR, NOT & XOR.
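
As a rough illustration of what addition and blending do to pixel values, here is a minimal NumPy sketch. It is not OpenCV's implementation; the helper names `saturating_add` and `blend` are made up for this example. It shows why saturating addition differs from plain 8-bit array addition, which wraps around.

```python
import numpy as np

def saturating_add(a, b):
    """Add two uint8 images and clip the result to 0..255 instead of wrapping."""
    total = a.astype(np.int16) + b.astype(np.int16)
    return np.clip(total, 0, 255).astype(np.uint8)

def blend(a, b, alpha):
    """Weighted sum alpha*a + (1 - alpha)*b, rounded back to uint8."""
    mixed = alpha * a.astype(float) + (1.0 - alpha) * b.astype(float)
    return np.clip(np.round(mixed), 0, 255).astype(np.uint8)

x = np.array([250], dtype=np.uint8)
y = np.array([10], dtype=np.uint8)

wrapped = x + y                  # plain uint8 arithmetic wraps: 260 % 256
saturated = saturating_add(x, y) # clipped at the top of the range
blended = blend(x, y, 0.7)       # 70% of x plus 30% of y
print(wrapped, saturated, blended)
```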

The purpose of the image subtraction is to find absolute changes between 2 different images. 

Gradients are the 2D first derivatives of the image; they indicate how the intensity values change across the image. Edges, on the other hand, are a binary indicator of whether a pixel lies on a boundary, that is, where the change is high.

Denoising means removing the noise explicitly. Image denoising can be achieved by applying Gaussian filtering or wavelet thresholding.

Image filtering, on the other hand, is used for image enhancement, edge detection etc.

Usually, there are 3 steps in the edge detection process:

  • Noise reduction

Suppress as much noise as possible without removing the edges

  • Edge enhancement

Highlight edges and weaken elsewhere.

  • Edge localization

Look at the maxima of the output and eliminate the spurious edges.

The Sobel operator, sometimes called the Sobel-Feldman operator, is used within edge detection algorithms to create an image that emphasizes edges. It works by calculating the gradient of the image intensity at each pixel, which gives the direction of the largest increase from dark to light and the rate of change in that direction.
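
The gradient computation can be sketched with the two standard 3x3 Sobel kernels. This is a plain NumPy illustration on a synthetic step edge, not a replacement for cv2.Sobel; it only returns values for interior pixels, and the function name `sobel` is chosen for this example.

```python
import numpy as np

KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal derivative
KY = KX.T                                                         # vertical derivative

def sobel(img):
    """Return gradient magnitude and direction (radians) for the interior pixels."""
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    # Slide the 3x3 kernels over the image by accumulating shifted copies.
    for i in range(3):
        for j in range(3):
            patch = img[i:i + h - 2, j:j + w - 2]
            gx += KX[i, j] * patch
            gy += KY[i, j] * patch
    return np.hypot(gx, gy), np.arctan2(gy, gx)

# Dark on the left, light on the right: a vertical step edge.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
mag, ang = sobel(img)
print(mag[1])   # the response is non-zero only next to the step
```

The direction at the edge pixels is 0 radians, i.e. pointing along +x from the dark side to the light side.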

The edge detection method can be grouped into 2 categories:

  • Gradient

It detects the edges by looking at the minima and maxima in the first derivative of the image.

  • Laplacian

This method searches for the zero crossings in the second derivative of the image.

  • Smooth the image
  • Subtract the smoothed image from the original
  • Add the subtracted result back to the original image
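
The three steps above can be sketched as follows. This is a minimal NumPy version that assumes a 3x3 mean filter as the smoothing step (any smoothing filter could be substituted); `box_blur`, `unsharp_mask` and the `amount` parameter are illustrative names.

```python
import numpy as np

def box_blur(img):
    """3x3 mean filter with edge replication (stand-in for any smoothing filter)."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + w]
    return out / 9.0

def unsharp_mask(img, amount=1.0):
    img = img.astype(float)
    smoothed = box_blur(img)             # step 1: smooth the image
    detail = img - smoothed              # step 2: original minus smoothed
    sharpened = img + amount * detail    # step 3: add the detail back
    return np.clip(sharpened, 0, 255)

# A soft step from 100 to 150; sharpening exaggerates it on both sides.
img = np.full((5, 5), 100.0)
img[:, 3:] = 150.0
sharp = unsharp_mask(img)
print(np.round(sharp[2], 2))
```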

Suppose a wine shop purchases wine from dealers and resells it later. Some dealers sell fake wine, so the shop owner must be able to distinguish between fake and authentic wine. The forger tries different techniques to sell fake wine and keeps the ones that get past the shop owner’s check. The shop owner, in turn, receives feedback from wine experts that some of the wine is not original, and has to improve how he determines whether a wine is fake or authentic.

In a similar manner, there are 2 components of GAN: 

  1. Generator 
  2. Discriminator

The generator is a neural network (convolutional, for images) that keeps producing images closer and closer in appearance to real images, while the discriminator tries to tell the real images from the fake ones.

Canny is a popular edge detection algorithm developed by John F. Canny. It includes 4 steps:

  1. Noise reduction, where you remove the noise in the image using a Gaussian filter
  2. Finding the intensity gradient of the image
  3. Non-maximum suppression, where a full scan of the image removes unwanted pixels that may not constitute an edge
  4. Hysteresis thresholding, which decides whether the remaining candidates are really edges

Below is an implementation of Canny edge detection using OpenCV:

import cv2
import matplotlib.pyplot as plt

# Read the image as greyscale (flag 0); imread returns None if the file is missing
img = cv2.imread('abc.jpg', 0)
# 100 and 200 are the lower and upper hysteresis thresholds
edges = cv2.Canny(img, 100, 200)

plt.subplot(121), plt.imshow(img, cmap='gray')
plt.title('Original Image'), plt.xticks([]), plt.yticks([])
plt.subplot(122), plt.imshow(edges, cmap='gray')
plt.title('Edge Image'), plt.xticks([]), plt.yticks([])
plt.show()

  • Flipping Images
  • Random cropping – Here, we randomly sample a section from the original image and then, resize this section to the original image size.
  • Random Scaling - The image can be scaled outward or inward.
  • Color jittering - One of the color channels of the image is modified by adding or subtracting a random, bounded value
  • Random translation - The image is moved along the X or Y direction (or both)
  • Random shearing - Shearing slants the image (and its bounding box) along one axis; it is applied with a transformation matrix.
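
A few of these augmentations can be sketched with array slicing alone. This is a simplified NumPy illustration with hypothetical helper names; the random crop is returned as-is, without the resize back to the original size, and a fixed seed is used only so the example is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def horizontal_flip(img):
    """Mirror the image left to right."""
    return img[:, ::-1]

def random_crop(img, crop_h, crop_w):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

def translate(img, dx, dy):
    """Shift by (dx, dy) pixels, filling the uncovered pixels with zeros."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    src = img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    r0, c0 = max(0, dy), max(0, dx)
    out[r0:r0 + src.shape[0], c0:c0 + src.shape[1]] = src
    return out

img = np.arange(16).reshape(4, 4)
flipped = horizontal_flip(img)
crop = random_crop(img, 2, 2)
shifted = translate(img, 1, 0)   # one pixel to the right
print(flipped[0], crop.shape, shifted[0])
```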

The Fourier transform (FT) decomposes an image into its sine and cosine components. It is used extensively in image processing and computer vision. For example, convolution, a fundamental image processing operation, can be done much faster by using the fast Fourier transform (FFT). When applying the FT to an image, we transform it from the spatial domain into the frequency domain, which in essence represents the image in terms of how quickly its color and brightness vary across space.

In simple words, it tells you what is happening in the image in terms of the frequencies of its sine and cosine components. The output of the transformation therefore represents the image in the frequency (Fourier) domain.

Numpy has an FFT package, providing us the frequency transform:

np.fft.fft2( ) 
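
As a small illustration, the sketch below builds a synthetic image (an assumption made for this example) containing a single horizontal cosine pattern and checks where its energy lands in the shifted spectrum.

```python
import numpy as np

# Synthetic 64x64 image: a horizontal cosine with 4 cycles across the width.
n = 64
x = np.arange(n)
img = np.tile(np.cos(2 * np.pi * 4 * x / n), (n, 1))

f = np.fft.fft2(img)            # complex spectrum, zero frequency at [0, 0]
fshift = np.fft.fftshift(f)     # move the zero frequency to the centre
magnitude = np.abs(fshift)

# Locate the dominant frequency components.
peak_rows, peak_cols = np.nonzero(magnitude > magnitude.max() / 2)
peaks = sorted(zip(peak_rows.tolist(), peak_cols.tolist()))
print(peaks)
```

The two peaks sit on the centre row, 4 bins to the left and right of the centre column, which corresponds to the horizontal frequencies -4 and +4.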

A kernel, also called a convolution matrix or mask, is a small matrix used for blurring, sharpening, embossing, edge detection and similar operations. These are usually accomplished by a convolution (the integral, or for images the sum, of the product of two functions) between the kernel and the image.

Template matching is a basic technique for object detection: you locate the small parts of an image that match a template image. Let’s say you have a football and you create a template of it. You then slide the template over the image to be scanned, placing it at every possible pixel and comparing pixel by pixel. Using a similarity metric, you find the position giving the maximum match, which is the pattern most similar to your object.

OpenCV comes with the function cv2.matchTemplate( ) for this purpose. 
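
To make the sliding comparison concrete, here is a minimal NumPy sketch that uses the sum of squared differences as the similarity metric (lower is better). It is not OpenCV's implementation, and `match_template_ssd` is an illustrative name.

```python
import numpy as np

def match_template_ssd(image, template):
    """Return the top-left (row, col) with the smallest sum of squared differences."""
    ih, iw = image.shape
    th, tw = template.shape
    best_score, best_pos = None, None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = image[r:r + th, c:c + tw]
            score = float(np.sum((window - template) ** 2))
            if best_score is None or score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

rng = np.random.default_rng(1)
image = rng.random((20, 20))
template = image[5:9, 12:16].copy()   # cut the template out of the image itself

pos, score = match_template_ssd(image, template)
print(pos, score)
```

The double loop is slow for large images; it is written this way only to mirror the "place the template at every possible pixel" description.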

The Hough transform is an efficient method in which spatially extended patterns are transformed to produce compact features in a parameter space. In image processing it is used to detect straight lines in binary images. A line in x-y image space is modeled as –

y = mx + b, or in polar form, rho = x cos(theta) + y sin(theta)

Each line is then represented as a single point with (m, b) coordinates or (rho, theta) parameters.

In short, the transform converts the detection problem in image space into an easier local peak detection problem in parameter space.

# To apply the transform, first run Canny edge detection as pre-processing

cv2.HoughLines( ) # to detect straight lines
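
The voting idea behind the transform can be sketched in NumPy. This is a simplified accumulator with 1-pixel rho bins and 1-degree theta bins (both choices are assumptions for this example), not the OpenCV implementation.

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Vote in (rho, theta) space for every non-zero pixel of a binary edge map."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))              # 0..179 degrees
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)   # rho index is offset by diag
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1        # one vote per theta
    return acc, thetas, diag

edges = np.zeros((50, 50), dtype=np.uint8)
edges[20, :] = 1                                         # a horizontal line y = 20

acc, thetas, diag = hough_lines(edges)
rho_idx, theta_idx = np.unravel_index(np.argmax(acc), acc.shape)
rho, theta_deg = int(rho_idx) - diag, int(theta_idx)
print(rho, theta_deg)
```

The strongest peak corresponds to rho = 20 and theta = 90 degrees, which is the polar form of the line y = 20.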

The idea of mathematical morphology is to fix up the picture by analyzing the shape, size and structure of the objects in it. To do this, we use the concept of a structuring element.

The structuring element is the mask or window that we place on the original image to find the desired output. A structuring element has 2 main characteristics:

Shape: circular, square, rectangular, triangular

Size: typically varies from 3x3 to 21x21

There are two basic operations:

  • Dilation

It adds pixels to the boundaries of the objects in an image, expanding them (formally, a vector addition of the object with the structuring element). It can be used for:

  • Growing features – grows or thickens the objects in a binary image
  • Filling holes and gaps

  • Erosion

It is the opposite of dilation. It removes pixels on the object boundaries, shrinking the objects; in a greyscale image it also lowers the brightness. It is used for:

  • Shrinking features
  • Removing the bridges, branches, protrusions if any. 

Some other operations that are performed :

  • Opening 

An operation that involves erosion followed by dilation

  • Closing 

It involves dilation followed by an erosion.

  • Thinning and thickening
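
The operations above can be sketched for binary images with a 3x3 square structuring element. This is a minimal NumPy version with illustrative function names; pixels outside the image are assumed to be background (0).

```python
import numpy as np

def _neighbourhood_stack(img):
    """Stack the 9 shifted copies of img covered by a 3x3 square structuring element."""
    padded = np.pad(img, 1, mode="constant", constant_values=0)
    h, w = img.shape
    return np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])

def dilate(img):
    return _neighbourhood_stack(img).max(axis=0)   # 1 if any neighbour is 1

def erode(img):
    return _neighbourhood_stack(img).min(axis=0)   # 1 only if all neighbours are 1

def opening(img):
    return dilate(erode(img))                      # erosion followed by dilation

def closing(img):
    return erode(dilate(img))                      # dilation followed by erosion

square = np.zeros((11, 11), dtype=np.uint8)
square[2:7, 2:7] = 1                               # a clean 5x5 object

noisy = square.copy()
noisy[9, 9] = 1                                    # isolated noise pixel
holed = square.copy()
holed[4, 4] = 0                                    # one-pixel hole

print(np.array_equal(opening(noisy), square))      # opening removes the speck
print(np.array_equal(closing(holed), square))      # closing fills the hole
```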

Intuitively, a watershed is an area of high ground from which water flows down to the river. In image processing, it is a technique used to segment images, typically when two regions of interest (ROI) are close to each other, i.e. their edges touch. It treats the image as a topographic surface, and its output can serve as a pre-processing result that improves the later stages of an algorithm.

  • Compute the Gaussian derivatives at each pixel of the image
  • Compute the second-moment matrix M in a Gaussian window around each pixel. In other words, for each pixel, take a small region around it and compute the second-moment matrix within that window
  • Compute the corner response function R from M, using its trace and determinant
  • Threshold R (only lightly, so that detail is not lost)
  • Find the local maxima of the response function using non-maximum suppression

This gives us corner features, which can be used to match features between two images.
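
The response computation in the steps above can be sketched as follows. This simplified NumPy version makes two assumptions: it uses central differences instead of Gaussian derivatives and a flat 3x3 window instead of a Gaussian one. The constant k = 0.05 is a commonly used value, and the thresholding and non-maximum suppression steps are reduced to taking the single strongest response.

```python
import numpy as np

def box_sum(a, r=1):
    """Sum over a (2r+1)x(2r+1) window (a flat window standing in for the Gaussian)."""
    p = np.pad(a, r, mode="edge")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(2 * r + 1) for j in range(2 * r + 1))

def harris_response(img, k=0.05):
    img = img.astype(float)
    iy, ix = np.gradient(img)                   # derivatives along rows (y) and columns (x)
    sxx = box_sum(ix * ix)                      # entries of the second-moment matrix M
    syy = box_sum(iy * iy)
    sxy = box_sum(ix * iy)
    det = sxx * syy - sxy ** 2                  # determinant of M
    trace = sxx + syy                           # trace of M
    return det - k * trace ** 2                 # corner response R

img = np.zeros((12, 12))
img[4:, 4:] = 1.0                               # bright quadrant with its corner at (4, 4)

R = harris_response(img)
corner = tuple(int(v) for v in np.unravel_index(np.argmax(R), R.shape))
print(corner, R[8, 4] < 0, R[1, 1] == 0)
```

R is large and positive at the corner, negative along a straight edge, and zero in flat regions.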

SIFT, the scale-invariant feature transform, is a feature detector published by Lowe in 2004. It handles image rotation, affine transformations, and intensity and viewpoint changes when matching features.

It has 4 basic steps:

  • Estimate the scale-space extrema using the Difference of Gaussians (DoG).
  • Next is key point localization, where the key point candidates are localized and refined by eliminating low-contrast points.
  • Next is orientation assignment, based on the local image gradient.
  • Last is descriptor generation, which computes the local image descriptor for each key point from the image gradient magnitude and orientation.

As the name suggests, SURF (speeded-up robust features) is a speeded-up version of SIFT.

Where SIFT approximates the Laplacian of Gaussian with a Difference of Gaussians, SURF approximates it with box filters. Instead of Gaussian averaging, squares are used for the approximation, since convolution with a square is much faster when the integral image is used. It relies on the determinant of the Hessian matrix for both scale and location. For orientation assignment, it uses wavelet responses in both the horizontal and vertical directions, applying adequate Gaussian weights. SURF also uses the wavelet responses for feature description.
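
The reason box filters are cheap with an integral image is that the sum over any rectangle takes four lookups. A minimal NumPy sketch of that idea (the function names are illustrative):

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[:r, :c]; the extra zero row/column simplifies lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] from four lookups, whatever the box size."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(1, 17).reshape(4, 4)
ii = integral_image(img)

total = int(box_sum(ii, 1, 1, 3, 3))
print(total, int(img[1:3, 1:3].sum()))   # both give the same sum
```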

ORB stands for Oriented FAST and Rotated BRIEF, where BRIEF means Binary Robust Independent Elementary Features. It was presented as an alternative to SIFT that requires less computation with almost similar matching performance.

ORB has 2 parts: FAST & BRIEF

FAST: The detector. It finds the x, y coordinates of points that are stable under transformations such as translation and increase or decrease in size.

BRIEF: The descriptor. It encodes the appearance of the point so that we can tell one feature point from another.

  • It is an important pre-processing step for image segmentation and classification techniques.
  • It is the process of aligning two or more images of the same scene or the same document into a single integrated image.
  • It helps overcome issues such as image rotation, scale, and skew
  • Often used in medical and satellite imagery to align images from different camera sources.

It is a transformation (a 3x3 matrix) that maps the points in one image to the corresponding points in another image (warping one image onto the other). In short, it relates two images of the same plane, or two images taken from the same camera center. For example, it is used for creating panoramas.

Let’s say we have a 3x3 matrix:

H = [[h00, h01, h02], [h10, h11, h12], [h20, h21, h22]]

Let (x1, y1) be the coordinates of a point in the first image and (x2, y2) the coordinates of the same point in the second. Then the homography relates them, up to a scale factor, in the following way:

[x2, y2, 1]^T = H [x1, y1, 1]^T

# calculate homography

H, status = cv2.findHomography(points1, points2)

Here points1 and points2 are arrays of corresponding points and H is the homography matrix that maps points1 onto points2.
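
Once H is known, applying it to points is a matrix product in homogeneous coordinates followed by a division by the scale. The sketch below uses a made-up H (scale by 2, then shift by (10, 5)) purely for illustration; `apply_homography` is a hypothetical helper, not an OpenCV function.

```python
import numpy as np

def apply_homography(H, points):
    """Map an (N, 2) array of (x, y) points through the 3x3 matrix H."""
    pts = np.asarray(points, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])   # (x, y) -> (x, y, 1)
    mapped = homogeneous @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                    # divide by the scale w

# Hypothetical H: scale by 2, then shift by (10, 5).
H = np.array([[2.0, 0.0, 10.0],
              [0.0, 2.0, 5.0],
              [0.0, 0.0, 1.0]])

points1 = np.array([[0, 0], [1, 0], [3, 4]])
points2 = apply_homography(H, points1)
print(points2)
```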

The Viola-Jones detector is a strong binary classifier built from several weak detectors, where each weak detector is an extremely simple binary classifier.

Three major contributions/phases of the algorithm are:

  • Introduction of the integral image – allows the features used by the detector to be computed quickly.
  • Learning algorithm based on AdaBoost – selects the small number of critical features from a larger set and yields extremely efficient classifiers
  • The last core component is a method of combining increasingly complex classifiers in a cascade, which allows background regions of the image to be discarded quickly while more computation is spent on the promising regions.

  • Must be robust to illumination changes
  • It should avoid detecting non-stationary background objects such as rain and snow, and shadows cast by moving objects

It is a technique widely used for tracking where you are in the world and where other things are. The objective of the Kalman filter is to minimize the mean squared error between the actual and the estimated state. It is also described as a recursive least squares filter, and it can be interpreted as a maximum likelihood estimator that fits a set of model parameters to the data.
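
To show the predict/update cycle in its simplest form, here is a scalar Kalman filter that estimates a nearly constant value from noisy measurements. The noise levels, the initial guess and the simulated data are all assumptions made for this sketch; a real tracker would use a vector state (position, velocity) and matrices.

```python
import numpy as np

def kalman_1d(measurements, meas_var, process_var=1e-5, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a (nearly) constant quantity."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + process_var          # predict: state unchanged, uncertainty grows
        k = p / (p + meas_var)       # Kalman gain
        x = x + k * (z - x)          # update with the measurement residual
        p = (1.0 - k) * p            # uncertainty shrinks after the update
        estimates.append(x)
    return np.array(estimates)

rng = np.random.default_rng(0)
true_value = 5.0
z = true_value + rng.normal(0.0, 0.5, size=200)   # noisy readings, std 0.5

est = kalman_1d(z, meas_var=0.25)
print(round(float(est[-1]), 3))
```

The estimate settles close to the true value and fluctuates far less than the raw measurements.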


Here, each pixel coordinate (x, y) of the image contains 3 values, each an intensity in the range 0-255 (8-bit). The image is split into 3 matrices corresponding to red, green and blue (RGB). Any other color can be created by mixing the intensities of R, G and B.

  • Each RGB pixel has 3 sets of 8 bits, which gives 24 bits in total: ‘24-bit color’
  • Assuming the same number of pixels, an RGB image is 3 times bigger in size than a greyscale image.

Yellow - (255, 255, 0)

Orange – (255,128,0)

Pink – (255, 153,255)

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import cv2  # OpenCV library

# Read the image; matplotlib returns the channels in RGB order
image = mpimg.imread("image.jpg")

# Convert to greyscale (RGB2GRAY, because the image was not read with OpenCV)
gray_image = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
plt.imshow(gray_image, cmap="gray")
plt.show()

  • We have to rely on more advanced computer vision techniques such as edge detection
  • We can extract more features from the image
  • In the case of driverless cars, we may also use LIDAR (light detection and ranging) systems

Image search engines that quantify the content of an image are called CBIR (content-based image retrieval) systems. The image is analyzed, quantified and stored so that the system can return similar images during a search, i.e. it performs a search by example.

  • Subtracting the mean of the image intensities and dividing by the standard deviation
  • Gamma correction – Power law equalization
  • A color space transformation when dealing with a colored image
  • You may as well have to crop and resize the input image
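
The first two pre-processing steps can be sketched in NumPy as follows. The function names and the tiny test image are illustrative, and the gamma value 0.5 is an arbitrary choice for the example.

```python
import numpy as np

def standardize(img):
    """Subtract the mean intensity and divide by the standard deviation."""
    img = img.astype(float)
    return (img - img.mean()) / img.std()

def gamma_correct(img, gamma):
    """Power-law transform on an 8-bit image: out = 255 * (in / 255) ** gamma."""
    scaled = img.astype(float) / 255.0
    return np.round(255.0 * scaled ** gamma).astype(np.uint8)

img = np.array([[0, 64], [128, 255]], dtype=np.uint8)

z = standardize(img)                 # zero mean, unit standard deviation
brighter = gamma_correct(img, 0.5)   # gamma < 1 brightens the mid-tones
print(z.mean(), z.std())
print(brighter)
```

Black and white stay fixed under gamma correction; only the values in between move.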

Let’s say we have a 64x128 image,

  • Calculate the gradient
  • Divide the image into 8x8 cells
  • Calculate the distribution of the edge directions in these cells
  • Normalize the histogram
  • Calculate the final feature vector

import numpy as np
import cv2

# Read the image and scale the intensities to [0, 1]
im = cv2.imread('abc.jpg')
im = np.float32(im) / 255.0

# Calculate the gradient using the Sobel operator with kernel size 1
gx = cv2.Sobel(im, cv2.CV_32F, 1, 0, ksize=1)  # horizontal gradient
gy = cv2.Sobel(im, cv2.CV_32F, 0, 1, ksize=1)  # vertical gradient

# Convert to gradient magnitude and direction (in degrees)
mag, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)

  • winSize
  • blockSize
  • blockStride – determines the overlap b/w the neighboring block and controls a degree of contrast normalization.
  • cellSize – chosen based on the scale of the feature
  • nbins
  • derivAperture
  • winSigma
  • histogramNormType
  • L2HysThreshold
  • gammaCorrection
  • nlevels
  • signedGradients

Thresholding converts an image into binary format. Each pixel is compared against a threshold value and set to one of two values, which separates the darker and the lighter regions. The detail that is discarded contributes less to the overall picture, so the essential information is retained.

Three broad types are:

  • Simple or global thresholding

Here one provides the threshold value as an input constant. This threshold is applied to all pixels of the image.

  • Adaptive thresholding

Here the threshold is not a constant scalar; it is computed separately for each small window of pixels.

  • Otsu’s binarization

 It automatically calculates a threshold value from image histogram for a bimodal image.
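
Otsu's method can be sketched as an exhaustive search for the threshold that maximizes the between-class variance of the histogram. The NumPy version below is a simplified illustration on synthetic bimodal data (an assumption for this example), followed by a global threshold using the value it finds.

```python
import numpy as np

def otsu_threshold(img):
    """Return the threshold t maximising between-class variance (classes: <= t and > t)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0 = prob[:t + 1].sum()                 # weight of the dark class
        w1 = 1.0 - w0                           # weight of the bright class
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t + 1] * prob[:t + 1]).sum() / w0
        mu1 = (levels[t + 1:] * prob[t + 1:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Synthetic bimodal "image": dark pixels around 60, bright pixels around 180.
rng = np.random.default_rng(0)
dark = rng.normal(60, 10, 2000)
bright = rng.normal(180, 10, 2000)
img = np.clip(np.concatenate([dark, bright]), 0, 255).astype(np.uint8)

t = otsu_threshold(img)
binary = (img > t).astype(np.uint8) * 255       # global thresholding with Otsu's t
print(t)
```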

Images are not smooth because adjacent pixels differ. Smoothing makes adjacent pixels more similar by replacing each pixel with an average of its neighbors. Smoothing, also known as blurring, removes the high-frequency content from the image by convolving it with a low-pass filter.

Different techniques are:

  • Averaging using cv.blur( )
  • Gaussian blurring using cv.GaussianBlur( ) (the kernel itself can be generated with cv.getGaussianKernel( ))
  • Median blurring using cv.medianBlur( )
  • Bilateral filtering using cv.bilateralFilter( )

Geometric transformations:

  • Scaling

Resizing the image

  • Translation

Shifting of the image

  • Rotation

Turning the image by an angle around a chosen center point.

  • Affine transformation

It keeps parallel lines parallel and is used to correct geometric distortions/deformations.

  • Perspective transformation

It changes the viewpoint of the image while keeping straight lines straight, as when a 3D scene is projected onto a 2D image.

Below is the implementation of image translation for a shift of (220, 50), as described in the OpenCV documentation:

import numpy as np
import cv2 as cv
img = cv.imread('messi5.jpg',0)
rows,cols = img.shape
M = np.float32([[1,0,220],[0,1,50]])
dst = cv.warpAffine(img,M,(cols,rows))

Background subtraction is an important step in video analysis where you separate out the foreground objects from the background in a sequence of video frames.

Frame differencing is the simplest form of background subtraction: the previous frame is subtracted from the current frame, and if the absolute difference for a given pixel is greater than a threshold Th, that pixel is considered part of the foreground.

Users can choose the threshold manually or use an automatic thresholding technique.
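
Frame differencing can be sketched in a few lines of NumPy. The two tiny frames and the threshold of 25 are made-up values for illustration.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, th):
    """Foreground mask: True where |curr - prev| exceeds the threshold th."""
    # Widen to int16 first so the uint8 subtraction cannot wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > th

prev_frame = np.full((6, 6), 50, dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[2:4, 2:4] = 200          # a bright object appears
curr_frame[0, 0] = 53               # small sensor noise, below the threshold

mask = frame_difference_mask(prev_frame, curr_frame, th=25)
print(int(mask.sum()))              # number of foreground pixels
```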

CNNs are among the most powerful algorithms for image classification and analysis. They process visual information in a feed-forward manner, passing an image through filters that extract certain features from the input. These feature-level representations are also useful for image construction, and they form the basis of style transfer, which composes images based on CNN layer activations and extracted features.

When a CNN is trained to classify images, the convolutional layers learn to extract more and more complex features from a given image, while the max-pooling layers in between discard detailed spatial information that is irrelevant to the classification task. The effect is that the input image is transformed into feature maps that increasingly care about the content of the image rather than any detail about the texture or color of pixels. The later layers are sometimes called the content representation of an image.

Style can be described as what is found in the brush strokes of a painting: its textures, colors, curvature and so on. To perform style transfer, we need to combine the content of one image with the style of another.

To represent the style of an image, a feature space designed to capture texture and color information is used. This space essentially looks at the correlations between the feature maps within a layer of the network. For example, is a certain color detected in one map related to the edges and corners detected in another map? The similarities and differences between the features in a layer give us information about the texture and color in the image, while leaving out information about the actual arrangement and identity of the objects in that image.

This is how the style and the content are separated. Now, let's see how style transfer works:

It looks at 2 different images, which we call the content image and the style image. Using a trained CNN, style transfer finds the style of one image and the content of the other. Finally, it merges the two to create a new, third image. In this newly created image, the objects and their arrangement are taken from the content image, and the color and texture from the style image.


The Features from Accelerated Segment Test (FAST) algorithm was proposed by Edward Rosten and Tom Drummond in their 2006 paper ‘Machine learning for high-speed corner detection’. The algorithm is used to extract feature points, which are later used to track and map objects in computer vision tasks.


  1. Select the pixel p in the image which is to be tested as an interest point
  2. Let Ip be its intensity
  3. Select an appropriate threshold value t
  4. Consider the circle of 16 pixels around the pixel under test
  5. Pixel p is a corner if there exists a set of n contiguous pixels in the circle which are all brighter than Ip + t, or all darker than Ip - t (n is typically 12).
  6. A high-speed test is proposed to exclude a large number of non-corners. This test examines only the four pixels at 1, 9, 5 and 13. If p is a corner, then at least three of these must all be brighter than Ip + t or darker than Ip - t. If neither of these is the case, then p cannot be a corner.
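
The segment test in the steps above can be sketched in plain Python and NumPy. This is a simplified illustration, not the OpenCV detector: it assumes n = 12 contiguous pixels, numbers the circle clockwise starting from the pixel directly above p, and skips the machine-learned decision tree and non-maximum suppression of the full algorithm.

```python
import numpy as np

# Offsets (dx, dy) of the 16-pixel circle of radius 3; pixel 1 is directly above p.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def longest_run(flags):
    """Length of the longest circular run of True values."""
    if all(flags):
        return len(flags)
    best = run = 0
    for f in flags + flags:          # doubling the list handles wrap-around
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def is_fast_corner(img, x, y, t, n=12):
    ip = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    # High-speed test on pixels 1, 5, 9 and 13 (list indices 0, 4, 8, 12).
    quick = [ring[i] for i in (0, 4, 8, 12)]
    if sum(v > ip + t for v in quick) < 3 and sum(v < ip - t for v in quick) < 3:
        return False
    brighter = [v > ip + t for v in ring]
    darker = [v < ip - t for v in ring]
    return longest_run(brighter) >= n or longest_run(darker) >= n

# The end of a thin bright line is detected; a point on a straight edge is not.
line_end = np.full((21, 21), 50, dtype=np.uint8)
line_end[10, 10:] = 200
straight_edge = np.full((21, 21), 50, dtype=np.uint8)
straight_edge[:, 10:] = 200

print(is_fast_corner(line_end, 10, 10, t=20))
print(is_fast_corner(straight_edge, 10, 10, t=20))
```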

