Big Data Analytics Interview Questions

Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.



Big data analytics refers to the practice of analyzing large volumes of data, or big data. Big data is the large amount of data gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The main purpose of analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through these insights, businesses may be able to gain an edge over their rivals and make better business decisions.

The most important advantage of big data analysis is that it helps organizations harness their data and use it to identify new opportunities. This, in turn, leads to smarter business moves, more efficient operations, higher profits, and happier customers.

The five V’s of big data are as follows:

  • Volume – The amount of data, which is growing at a high rate, e.g. data volumes measured in petabytes
  • Velocity – The rate at which data grows. Social media plays a major role in the velocity of growing data
  • Variety – The different data types involved, i.e. various data formats such as text, audio, and video
  • Veracity – The uncertainty of the available data. Uncertainty arises mainly because the high volume of data brings incompleteness and inconsistency
  • Value – Turning data into value. By turning the big data they access into value, businesses may generate revenue

Big data technology offers various tools for importing, sorting, and analyzing data. Some of these tools are as follows:

  • Apache Hive
  • Apache Spark
  • MongoDB
  • MapReduce
  • Apache Sqoop
  • Cassandra
  • Apache Flume
  • Apache Pig
  • Splunk
  • Apache Hadoop
  • Tableau
  • RapidMiner
  • OpenRefine
  • Google Search Operators
  • Solver
  • NodeXL
  • Wolfram Alpha
  • Google Fusion Tables

Data cleansing, also known as data scrubbing, is the process of removing data that is incorrect, duplicated, or corrupted. It enhances data quality by eliminating errors and irregularities.
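The definition above can be illustrated with a minimal sketch. The record layout (`name`, `age`), the plausible age range, and the function name `clean_records` are assumptions made for this example only; real cleansing rules depend on the dataset.

```python
def clean_records(records):
    """Drop corrupted, incorrect, and duplicated records, keeping first occurrences."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec.get("name")
        age = rec.get("age")
        # Corrupted: a required field is missing or has the wrong type.
        if not isinstance(name, str) or not isinstance(age, int):
            continue
        # Incorrect: the value falls outside a plausible range (assumed 0 to 120).
        if not 0 <= age <= 120:
            continue
        # Duplicated: the same normalized name and age were already kept.
        key = (name.strip().lower(), age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name.strip(), "age": age})
    return cleaned


raw = [
    {"name": "Asha", "age": 31},
    {"name": "asha ", "age": 31},  # duplicate after normalization
    {"name": "Ben", "age": -4},    # incorrect value
    {"name": None, "age": 40},     # corrupted record
    {"name": "Chen", "age": 58},
]
cleaned = clean_records(raw)
print(cleaned)
```

Only the first "Asha" record and the "Chen" record survive; the other three are dropped for the reasons noted in the comments.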

The sources of unstructured data are as follows:

  • Text files and documents
  • Server, website, and application logs
  • Sensor data
  • Images, videos, and audio files
  • Emails
  • Social media data

Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.

The term outlier is commonly used by data analysts to refer to a value that appears far removed and divergent from the overall pattern in a sample.

There are two kinds of outliers – Univariate and Multivariate.
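As a rough illustration of the univariate case, the sketch below flags values that fall outside 1.5 times the interquartile range. This rule is one common convention, not the only way to define an outlier, and the function name `iqr_outliers` and the sample values are assumptions for the example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


sample = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(sample)
print(outliers)
```

Multivariate outliers need a distance that accounts for several variables at once, which this sketch does not cover.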

K-means is a partitioning technique in which objects are categorized into K groups. The algorithm assumes that the clusters are spherical, with the data points centered around each cluster's centroid, and that the variances of the clusters are similar to one another.

Clustering is the process of grouping similar objects into a set known as a cluster. Objects in one cluster are likely to be different from objects grouped under another cluster. It is one of the main tasks in data mining and is also a technique used in statistical data analysis. Hierarchical, partitioning, density-based, and model-based methods are some of the popular clustering methods.
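A minimal sketch of the usual K-means iteration (often called Lloyd's algorithm) is shown here. It assumes the starting centroids are supplied by the caller; the function name `kmeans` and the sample points are illustrative only.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and updating centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

In practice the result depends on the starting centroids, so implementations typically try several random initializations.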

Some statistical methods are as follows:

  • Markov process
  • Mathematical optimization
  • Imputation techniques
  • Simplex algorithm
  • Bayesian method
  • Rank statistics
  • Spatial and cluster processes

Data mining:

  • A hypothesis is not required in Data Mining
  • Data mining demands clean and well-documented data
  • Results of Data mining are not easy to interpret
  • Data mining algorithms automatically develop an equation

Data Analysis:

  • Data analysis begins with a hypothesis
  • Data analysis involves data cleaning; therefore, it does not require clean and well-documented data
  • Data analysts interpret the results and present them to the stakeholders
  • In data analysis, analysts have to develop their own equations

Collaborative filtering is a simple algorithm for building a recommendation system based on user behavioral data. Its most important components are users, items, and interests.

A good example of collaborative filtering is a statement like “recommended for you” on online shopping sites, which pops up based on your browsing history.
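One simple variant, user-based collaborative filtering, can be sketched as follows. The ratings, user names, and the choice of cosine similarity are assumptions for illustration; production recommenders are considerably more elaborate.

```python
import math

# Hypothetical user -> item -> interest (rating) data.
ratings = {
    "ana":  {"laptop": 5, "mouse": 4, "desk": 1},
    "bob":  {"laptop": 4, "mouse": 5, "monitor": 5},
    "cara": {"desk": 5, "lamp": 4, "laptop": 1},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=2):
    """Score unseen items by the similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


recs = recommend("ana", ratings)
print(recs)
```

Here "ana" is most similar to "bob", so the item only "bob" has rated ranks first.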

Linear Regression:

  • It requires the dependent variable to be continuous
  • It is based on least squares estimation
  • It requires at least 5 cases per independent variable
  • It aims to find the best-fitting straight line, where the distances between the points and the regression line are the errors

Logistic Regression:

  • It can have a dependent variable with more than two categories
  • It is based on maximum likelihood estimation
  • It requires at least 10 events per independent variable
  • It is used to predict a binary outcome, and the resulting graph is an S-shaped curve
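The two ideas above can be sketched side by side, assuming a single predictor. `fit_line` uses the closed-form least squares solution for a straight line, and `sigmoid` is the S-shaped function that logistic regression applies to a linear score. Both names and the data are illustrative, and the sketch does not fit a logistic model by maximum likelihood.

```python
import math


def fit_line(xs, ys):
    """Least squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# The sigmoid is 0.5 at a score of 0 and flattens out toward 0 and 1.
print([round(sigmoid(z), 3) for z in (-4, 0, 4)])
```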

Most of the widely used analytical techniques fall into one of the following categories:

  • Statistical methods
  • Forecasting
  • Regression analysis
  • Database querying
  • Data warehouse
  • Machine learning and data mining

The main task of the P-value is to determine the significance of results after a hypothesis test in statistics.

The P-value helps readers draw conclusions, and it is always between 0 and 1.

  • P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected
  • P-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected
  • P-value = 0.05 is the marginal value, indicating it is possible to go either way
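As a worked illustration, the sketch below computes a two-sided P-value for a one-sample z-test, which assumes the population standard deviation is known. The function name and the numbers are made up for the example.

```python
from statistics import NormalDist


def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """P-value of a one-sample z-test with known standard deviation sigma."""
    z = (sample_mean - null_mean) / (sigma / n ** 0.5)
    # Probability of a result at least this extreme in either tail.
    return 2 * (1 - NormalDist().cdf(abs(z)))


p = two_sided_p_value(sample_mean=103, null_mean=100, sigma=10, n=50)
print(round(p, 4))
print("reject null" if p < 0.05 else "cannot reject null")
```

With these inputs the P-value falls below 0.05, so the null hypothesis would be rejected at that threshold.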

Machine learning is a category of algorithms that helps software applications become more accurate in predicting outcomes without being explicitly programmed. The basic concept of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output while updating outputs as new data becomes available.

It enables computers or machines to make data-driven decisions rather than being explicitly programmed to carry out a certain task.

The main difference between data mining and data profiling is as follows:

  • Data profiling: It targets the instant analysis of individual attributes, such as value range, distinct values and their frequency, occurrence of null values, data type, length, etc.
  • Data mining: It focuses on dependencies, sequence discovery, relations holding between several attributes, cluster analysis, detection of unusual records, etc.

Eigenvalues and eigenvectors are both used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching. An eigenvalue can be described as the strength of the transformation in the direction of its eigenvector, or the factor by which the compression or stretching occurs.
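For a 2×2 symmetric matrix such as a small covariance matrix, the eigenvalues and eigenvectors have a closed form, sketched below. The function name and the example matrix are assumptions for illustration; larger matrices are normally handled by a numerical library.

```python
import math


def eig_symmetric_2x2(a, b, d):
    """Eigenpairs of [[a, b], [b, d]], largest eigenvalue first."""
    if b == 0:
        # Already diagonal: the axes themselves are the eigenvectors.
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    mean = (a + d) / 2
    spread = math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (mean + spread, mean - spread):
        # (b, lam - a) satisfies A v = lam v; normalize it to unit length.
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


# Covariance matrix [[2, 1], [1, 2]]: two positively correlated variables.
(l1, v1), (l2, v2) = eig_symmetric_2x2(2, 1, 2)
print(l1, v1)
print(l2, v2)
```

The larger eigenvalue belongs to the diagonal direction along which the two variables vary together, which is the direction of greatest stretching.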


Data analysis mostly deals with collecting, inspecting, cleaning, transforming and modeling data to gain some valuable insights and support better decision making in an organization. The various steps involved in the data analysis process include: 

Data Exploration:

To identify the business problem, a data analyst has to go through the data provided by the client and analyze the root cause of the problem.

Data Preparation:

Data preparation is one of the most important steps of the data analysis process, in which any data anomalies (such as missing values or outliers) have to be handled in the right way.

Data Modelling:

This step begins once the data has been prepared. In this process, the model is run repeatedly for improvements. Data modeling ensures that the best possible result is found for a given business problem.

Validation:

In this step, the model provided by the client and the model developed by the data analyst are validated against each other to find out whether the developed model will meet the business requirements.

Implementation of the Model and Tracking:

This is the final step of the data analysis process. In this step, the model is implemented in production and tested for accuracy and efficiency.

Suppose you find any suspicious or missing data. In that case:

  • The first step is to prepare a validation report that provides information on the suspected data
  • Have it checked by experienced personnel so that its acceptability can be determined
  • If there is any invalid data, it should be updated with a validation code
  • For this kind of scenario, use the best analysis strategy for working with the missing data, such as simple imputation, the deletion method, or case-wise imputation
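Two of the strategies named in the last step can be sketched as follows, using mean substitution as one form of simple imputation. The row layout, the `income` column, and the function names are assumptions for this example.

```python
from statistics import mean


def listwise_deletion(rows):
    """Deletion method: drop every row that has any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_imputation(rows, column):
    """Simple imputation: replace missing values with the observed mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]}
            for r in rows]


rows = [
    {"id": 1, "income": 40},
    {"id": 2, "income": None},  # missing value
    {"id": 3, "income": 60},
]
deleted = listwise_deletion(rows)
imputed = mean_imputation(rows, "income")
print(deleted)
print(imputed)
```

Deletion shrinks the dataset, while mean imputation keeps every row at the cost of understating the variable's spread.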

In the banking industry, giving loans is the main source of income, but if the repayment rate is poor, the bank will not make any profit and instead risks huge losses.

Banks don’t want to lose good customers, and at the same time they don’t want to acquire bad customers. In this scenario, both false positives and false negatives become very important to measure.

Recently, there have been many cases of players using steroids during sports competitions. Every player has to go through a steroid test before the game starts. A false positive can ruin the career of a great sportsman, and a false negative can make the game unfair.
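Measuring both kinds of error starts with counting them. The sketch below uses made-up steroid test results, where 1 means "used steroids" in `actual` and "tested positive" in `predicted`; the function name is illustrative.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1  # clean player flagged as a user
        elif not p and a:
            counts["fn"] += 1  # user who passed the test
        else:
            counts["tn"] += 1
    return counts


actual    = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 0]
c = confusion_counts(actual, predicted)

false_positive_rate = c["fp"] / (c["fp"] + c["tn"])
false_negative_rate = c["fn"] / (c["fn"] + c["tp"])
print(c, false_positive_rate, false_negative_rate)
```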

In Bayesian estimation, we have some prior knowledge about the data or problem. There may be several values of the parameters that explain the data, so we can look for multiple parameters, like 5 gammas and 5 lambdas, that do this. As a result of Bayesian estimation, we get multiple models for making multiple predictions, i.e. one for each pair of parameters but with the same prior. So, if a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose.

Maximum likelihood does not take the prior into consideration (it ignores the prior), so it is like being a Bayesian while using some kind of flat prior.
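The link between the two can be seen in a small coin-flip sketch. It assumes a binomial likelihood with a Beta prior, a setup chosen for this example rather than taken from the text; the function names are illustrative.

```python
def mle(successes, trials):
    """Maximum likelihood estimate of the success probability."""
    return successes / trials


def map_estimate(successes, trials, alpha, beta):
    """Mode of the Beta(alpha + successes, beta + failures) posterior.

    Valid when both posterior parameters are greater than 1.
    """
    return (alpha + successes - 1) / (alpha + beta + trials - 2)


def posterior_mean(successes, trials, alpha, beta):
    """Mean of the same Beta posterior."""
    return (alpha + successes) / (alpha + beta + trials)


k, n = 7, 10
print(mle(k, n))                    # ignores any prior
print(map_estimate(k, n, 1, 1))     # flat Beta(1, 1) prior: same as the MLE
print(map_estimate(k, n, 5, 5))     # informative prior pulls toward 0.5
print(posterior_mean(k, n, 5, 5))
```

With the flat prior the most probable parameter value matches the maximum likelihood estimate, while an informative prior shifts the estimate toward the prior belief.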

Univariate, bivariate, and multivariate analysis are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable and can be referred to as univariate analysis.

If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analyzing the volume of sales against spending is bivariate analysis.

The analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.

For comparing means between groups:

Use the independent t-test when you have a continuous variable and a categorical variable with two independent categories.

Use the paired t-test when you have a continuous variable and a categorical variable with two dependent or paired categories.

Use one-way ANOVA when you have a continuous variable and a categorical variable with more than two independent categories.

Use GLM Repeated Measures when you have a continuous variable and a categorical variable with more than two dependent categories.
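The first two tests can be sketched by computing their test statistics directly. The independent version below assumes equal variances (the pooled form), the data are made up, and turning the statistic into a P-value requires the t distribution, which is left out here.

```python
import math
from statistics import mean, stdev, variance


def independent_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def paired_t(before, after):
    """t statistic and degrees of freedom for paired measurements."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


t_ind, df_ind = independent_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
t_pair, df_pair = paired_t([10, 12, 9, 11], [12, 14, 10, 13])
print(t_ind, df_ind)
print(t_pair, df_pair)
```

The paired version works on the per-subject differences, which is why it suits dependent categories such as before and after measurements.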

The data cleansing process can be carried out in the following ways:

  • Sort the data by various attributes
  • For large data sheets, clean the data stepwise until the given data reaches the desired result
  • For big projects, break the data sheets into parts and work on them sequentially, which will get you to clean data faster than working on the whole lot at once
  • Build a set of utility tools for the cleansing process, which will help you maximize its speed and reduce the time to completion
  • Arrange the issues by estimated frequency and start by clearing the most common problems first
  • For faster cleaning, analyze the summary of the data
  • By keeping a check on daily data cleansing, you can improve the set of utility tools as requirements change

The various types of data validation methods used are:

  • Field Level Validation – Validation is done in each field as the user enters the data, to avoid errors caused by human interaction
  • Form Level Validation – Validation is done once the user completes the form, before the information is saved
  • Data Saving Validation – Validation is performed during the saving process of the actual file or database record. This is usually done when there are multiple data entry forms
  • Search Criteria Validation – Validation is applied to the search criteria so that they match what the user is looking for to a certain degree. It ensures that results are actually returned
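The first two methods can be sketched as a pair of functions. The field names, the rules, and the error messages are hypothetical; a real form would define its own.

```python
import re


def validate_field(name, value):
    """Field-level check: return an error message, or None if the field is fine."""
    if name == "email":
        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
            return "email is not well formed"
    elif name == "age":
        if not value.isdecimal() or not 0 < int(value) < 120:
            return "age must be a whole number between 1 and 119"
    return None


def validate_form(form):
    """Form-level check: run every field check, then rules spanning fields."""
    errors = {}
    for name, value in form.items():
        message = validate_field(name, value)
        if message:
            errors[name] = message
    # A cross-field rule can only be checked once the whole form is filled in.
    if form.get("password") != form.get("confirm_password"):
        errors["confirm_password"] = "passwords do not match"
    return errors


good = {"email": "a@b.com", "age": "30",
        "password": "x", "confirm_password": "x"}
bad = {"email": "not-an-email", "age": "abc",
       "password": "a", "confirm_password": "b"}
print(validate_form(good))
print(validate_form(bad))
```

Field-level checks can run as each value is typed, while the password comparison shows why some rules have to wait for the form-level pass.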

R programming language: R is an open source programming language with a focus on statistical analysis. It is competitive with commercial tools such as SAS and SPSS in terms of statistical capabilities. Another advantage of R is the large number of open source libraries that are available.

Python for data analysis: Python is a general-purpose programming language, and it has a significant number of libraries devoted to data analysis, such as pandas, scikit-learn, Theano, NumPy, and SciPy.

Most of the things available in R can also be done in Python, but R is simpler to use. If you are working with large datasets, Python is normally a better choice than R. Python can be used quite effectively to clean and process data line by line.

Julia: Julia is a high-level language, mostly used for technical computing. Its syntax is similar to R or Python, so if you are already working with R or Python, it should be quite simple to write the same code in Julia. The language is quite new and has grown significantly in recent years, so it is definitely an option at the moment.

SAS: SAS is a commercial language that is still used for business intelligence. It has a base language that allows the user to program a wide variety of applications. It includes a few commercial products that give non-expert users the ability to use complex tools, such as a neural network library, without the need for programming.

SPSS: SPSS is currently an IBM product for statistical analysis. It is widely used to analyze survey data and is a decent alternative for users who are not able to program. It is probably as simple to use as SAS, but in terms of implementing a model, it is simpler because it provides SQL code to score a model. This code is normally not efficient, but it’s a start, whereas SAS sells the product that scores models for each database separately. For small data and an inexperienced team, SPSS is as good an option as SAS.

The software is, however, rather limited, and experienced users will be orders of magnitude more productive using R or Python.

Matlab, Octave: Other tools are available, such as Matlab or its open source alternative, Octave. These tools are mostly used for research. In terms of capabilities, R or Python can do all that’s available in Matlab or Octave. It only makes sense to buy a license of the product if you are interested in the support they provide.

The primary responsibilities of a data analyst are as follows:

  1.  A data analyst is responsible for all data-related information and for the analysis needed by the staff and the customers
  2.  A data analyst is very useful at the time of an audit
  3.  A data analyst is capable of using statistical techniques and also provides suggestions based on the data
  4.  An analyst must always focus on improving the business process and strive for process optimization
  5.  The main responsibility is to work with the raw data and provide meaningful reports for the managers
  6.  A data analyst is responsible for acquiring data from different primary and secondary sources so that it can be consolidated into one common database

