Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.
The term Big Data analytics refers to the strategy of analyzing large volumes of data, or big data. The large amount of data gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records, is called Big Data. The main purpose of analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through this insight, businesses may be able to gain an edge over their rivals and make superior business decisions.
The most important advantage of Big Data analysis is that it helps organizations harness their data and use it to identify new opportunities. This, in turn, leads to smarter business moves, more efficient operations, higher profits, and happier customers.
The five V's of Big Data are Volume (the sheer amount of data generated), Velocity (the speed at which data is generated and processed), Variety (the different forms the data takes, both structured and unstructured), Veracity (the trustworthiness and quality of the data), and Value (the usefulness of the insights extracted from it).
Big Data technology offers various tools that are deployed for importing, sorting, and analyzing data. Some of these tools are as follows:
Data cleansing, also known as data scrubbing, is the process of removing data that is incorrect, duplicated, or corrupted. It enhances data quality by eliminating errors and irregularities.
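For instance, a minimal pandas sketch of common cleansing steps on a hypothetical customer table (the column names and values are purely illustrative):

```python
import pandas as pd

# Hypothetical customer data with duplicates, bad values, and missing entries
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, -5, -5, 41, None],           # -5 is clearly invalid
    "country": ["US", "uk", "uk", "US ", None],
})

df = df.drop_duplicates()                               # remove duplicated rows
df["country"] = df["country"].str.strip().str.upper()   # normalize text values
df.loc[df["age"] < 0, "age"] = None                     # flag impossible ages as missing
df = df.dropna(subset=["customer_id"])                  # drop rows missing a key field
```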
The sources of Unstructured data are as follows:
Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
The term is commonly used by data analysts to refer to a value that appears far removed and divergent from the overall pattern in a sample.
There are two kinds of outliers – Univariate and Multivariate.
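A simple way to illustrate univariate outlier detection is the IQR rule; the sketch below uses NumPy on made-up numbers (multivariate outliers need distance-based methods such as Mahalanobis distance):

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95, 11, 10])  # 95 is the suspicious point

# IQR rule: anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)  # [95]
```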
K-means is a partitioning technique in which objects are categorized into K groups (clusters). The algorithm assumes that the clusters are roughly spherical, with the data points centered around each cluster's mean, and that the variance of the clusters is similar to one another.
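A minimal illustration with scikit-learn, assuming two obvious groups of two-dimensional points (the data is invented for the example):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two centroids
```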
The process of clustering involves grouping similar objects into a set known as a cluster. Objects in one cluster are likely to be different when compared to objects grouped under another cluster. Clustering is one of the main tasks in data mining and is also a technique used in statistical data analysis. Hierarchical, partitioning, density-based, and model-based methods are some of the popular clustering approaches.
Some statistical methods are as follows:
Data mining: the process of discovering hidden patterns, relationships, and anomalies in large datasets, typically using machine learning, statistics, and database techniques, often without a predefined hypothesis.
Data Analysis: the process of inspecting, cleaning, transforming, and modeling data in order to extract useful information, test hypotheses, and support decision making.
Collaborative filtering is a simple algorithm for creating a recommendation system based on user behavioral data. The most important components of collaborative filtering are users, items, and interest.
A good example of collaborative filtering is when you see a statement like “recommended for you” on online shopping sites, which pops up based on your browsing history.
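As a rough sketch of the idea, user-based collaborative filtering can be built from a user-item matrix and cosine similarity; the ratings below are made up and the helper function is purely illustrative:

```python
import numpy as np

# Rows = users, columns = items; entries are interest scores (0 = no interaction)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [1, 0, 5, 4],
])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 1  # recommend items for the second user
sims = np.array([cosine_sim(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0  # ignore the user's similarity with themselves

# Score unseen items by the similarity-weighted ratings of other users
scores = sims @ ratings
scores[ratings[target] > 0] = -np.inf  # don't recommend items already seen
print("recommended item index:", int(np.argmax(scores)))
```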
Linear Regression: models the relationship between a continuous dependent variable and one or more independent variables by fitting a straight line; its output is a continuous numeric value, so it is used for predicting quantities.
Logistic Regression: models the probability of a categorical (typically binary) outcome using the logistic (sigmoid) function; its output is a probability between 0 and 1, so it is used for classification.
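A small side-by-side sketch with scikit-learn, using invented data, shows the difference in the type of output each model produces:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])

# Linear regression: predict a continuous value (e.g. sales)
y_continuous = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])
lin = LinearRegression().fit(X, y_continuous)
print(lin.predict([[7]]))        # a continuous prediction

# Logistic regression: predict a binary outcome (e.g. churn yes/no)
y_binary = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(X, y_binary)
print(log.predict([[7]]))        # a class label, 0 or 1
print(log.predict_proba([[7]]))  # probabilities between 0 and 1
```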
Most of the widely used analytical techniques fall into one of the following categories:
The main task of the p-value is to determine the significance of results after a hypothesis test in statistics.
Readers can draw conclusions with the help of the p-value, which always lies between 0 and 1; a small p-value (commonly below 0.05) indicates strong evidence against the null hypothesis, while a large p-value indicates weak evidence against it.
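For example, a two-sample t-test in SciPy returns a p-value directly (the sample values below are invented):

```python
from scipy import stats

# Two hypothetical samples, e.g. page-load times for two website designs
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.9, 13.1, 12.8, 13.3, 12.7, 13.0]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)  # a small p-value (e.g. < 0.05) suggests the group means really differ
```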
Machine learning is a category of algorithms that helps software applications become more accurate at predicting outcomes without being explicitly programmed. The basic concept of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, while updating outputs as new data becomes available.
It enables the computers or the machines to make data-driven decisions rather than being explicitly programmed for carrying out a certain task.
The main difference between data mining and data profiling is as follows:
Both of these values are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching. The eigenvalue can be referred to as the strength of the transformation in the direction of the eigenvector, or the factor by which the compression or stretching occurs.
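A short NumPy sketch of the typical workflow, generating made-up correlated data and decomposing its covariance matrix:

```python
import numpy as np

# Hypothetical 2-D data: two correlated measurements
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])

cov = np.cov(data, rowvar=False)               # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)

print(eigenvalues)   # strength of the transformation along each direction
print(eigenvectors)  # columns are the directions (eigenvectors)
```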
Data analysis mostly deals with collecting, inspecting, cleaning, transforming and modeling data to gain some valuable insights and support better decision making in an organization. The various steps involved in the data analysis process include:
Data Exploration:
To identify the business problem, a data analyst has to go through the data provided by the client and analyze the root cause of the problem.
Data Preparation:
Data preparation is one of the most important steps in the data analysis process. In this step, any data anomalies (such as missing values or outliers) have to be detected and treated appropriately.
Data Modelling:
This step begins once the data has been prepared. The model is run repeatedly and refined with each iteration. Data modeling ensures that the best possible result is found for a given business problem.
Validation:
In this step, the model provided by the client and the model developed by the data analyst are validated against each other to find out if the developed model will meet the business requirements.
Implementation of the Model and Tracking:
This step is the final step of the data analysis process. In this process, the model is implemented in production and is tested for accuracy and efficiency.
Suppose you find suspicious or missing data. In that case, you would typically prepare a validation report describing the issue, have the records examined by an experienced analyst, and then correct, impute, or exclude the affected values before continuing the analysis.
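For the missing-value part specifically, a minimal pandas sketch (the column names and the imputation choices are only examples):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"sales": [200, 210, np.nan, 10000, 190],   # NaN and a suspicious spike
                   "region": ["N", "S", "S", None, "N"]})

print(df.isna().sum())                                    # first, quantify what is missing
df["sales"] = df["sales"].fillna(df["sales"].median())    # impute with the median
df["region"] = df["region"].fillna("UNKNOWN")             # flag missing categories
df = df[df["sales"] < 5000]                               # inspect or drop suspicious extreme values
```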
In the banking industry, giving loans is the main source of making money, but at the same time, if the repayment rate is not good, you will not make any profit; rather, you will risk huge losses.
Banks don’t want to lose good customers, and at the same time, they don’t want to acquire bad customers. In this scenario, both false positives and false negatives become very important to measure.
These days we hear of many cases of players using steroids during sports competitions. Every player has to go through a steroid test before the game starts. A false positive can ruin the career of a great sportsman, and a false negative can make the game unfair.
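In both the banking and the doping examples, the two error types can be read directly off a confusion matrix; a small sketch with scikit-learn and invented labels:

```python
from sklearn.metrics import confusion_matrix

# 1 = steroid user / loan defaulter, 0 = clean player / good customer (hypothetical labels)
actual    = [0, 0, 1, 1, 0, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("false positives:", fp)  # innocent player flagged / good customer rejected
print("false negatives:", fn)  # steroid user missed / bad customer approved
```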
In Bayesian estimation, we have some prior knowledge about the data or problem. There may be several values of the parameters that explain the data, and hence we can look for multiple parameters, say 5 gammas and 5 lambdas, that do this. As a result of the Bayesian estimate, we get multiple models for making multiple predictions, i.e. one for each pair of parameters but with the same prior. So, if a new example needs to be predicted, computing the weighted sum of these predictions serves the purpose.
Maximum likelihood does not consider the prior (it ignores the prior), so it is like being Bayesian while using some kind of flat prior.
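A toy numeric illustration of the difference, estimating a coin’s probability of heads (the prior pseudo-counts are an arbitrary choice for the example):

```python
# Probability that a coin lands heads, estimated from 10 tosses
heads, tosses = 7, 10

# Maximum likelihood: ignores any prior, just the observed frequency
mle = heads / tosses                              # 0.70

# Bayesian estimate: a Beta(a, b) prior encodes prior belief (values are illustrative)
a, b = 2, 2
posterior_mean = (a + heads) / (a + b + tosses)   # (2 + 7) / (4 + 10) ≈ 0.643
print(mle, posterior_mean)
```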
These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the difference between two variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analyzing the volume of sales together with spending can be considered an example of bivariate analysis.
The analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.
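In pandas, the three levels of analysis map naturally onto familiar operations; the DataFrame below is invented purely to show the pattern:

```python
import pandas as pd

df = pd.DataFrame({"sales":    [120, 150, 170, 200, 220],
                   "spending": [ 30,  40,  45,  60,  65],
                   "visits":   [100, 130, 140, 180, 190]})

print(df["sales"].describe())             # univariate: distribution of a single variable
print(df["sales"].corr(df["spending"]))   # bivariate: relationship between two variables
print(df.corr())                          # multivariate: relationships among all variables
```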
For comparing the means of groups:
Use an independent t-test when you have a continuous variable and a categorical variable with two independent categories.
Use a paired t-test when you have a continuous variable and a categorical variable with two dependent or paired categories.
Use one-way ANOVA when you have a continuous variable and a categorical variable with more than two independent categories.
Use GLM Repeated Measures when you have a continuous variable and a categorical variable with more than two dependent categories (a SciPy sketch of the first three tests follows this list).
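The sketch below runs the first three of these tests on invented samples; GLM repeated measures is not available in SciPy itself, so the comment points to statsmodels instead:

```python
from scipy import stats

group_a = [5.1, 5.3, 4.9, 5.2, 5.0]
group_b = [5.8, 6.0, 5.9, 6.1, 5.7]
group_c = [6.5, 6.4, 6.6, 6.7, 6.3]

print(stats.ttest_ind(group_a, group_b))          # independent t-test: two independent groups
print(stats.ttest_rel(group_a, group_b))          # paired t-test: two dependent (paired) groups
print(stats.f_oneway(group_a, group_b, group_c))  # one-way ANOVA: more than two groups
# GLM repeated measures is not in SciPy; statsmodels' AnovaRM is one option for that case
```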
The data cleansing process can be done in the following ways:
The various types of data validation methods used are:
R Programming Language: It is an open source programming language with a focus on statistical analysis. It is competitive with commercial tools such as SAS and SPSS in terms of statistical capabilities. Another advantage of R is the large number of open source libraries that are available. In terms of performance, base R can be slow with very large datasets, although optimized libraries exist to mitigate this.
Python for data analysis: Python is a general-purpose programming language, and it contains a significant number of libraries devoted to data analysis, such as pandas, scikit-learn, Theano, NumPy, and SciPy.
Most of the things available in R can also be done in Python, but R is simpler to use in comparison. If you are working with large datasets, Python is normally a better choice than R. Python can be used quite effectively to clean and process data line by line.
Julia: It is a high-level language mostly used for technical computing. Its syntax is similar to R or Python, so if you are already working with either of them, it should be quite simple to write the same code in Julia. The language is relatively new but has grown significantly in recent years, so it is definitely an option to consider at the moment.
SAS: It is mostly a commercial language that is still widely used for business intelligence. It has a base language that allows the user to program a wide variety of applications. In addition, it offers a few commercial products that give non-expert users the ability to use complex tools, such as a neural network library, without the need for programming.
SPSS: SPSS is currently a product of IBM for statistical analysis. It is widely used to analyze survey data and is a decent alternative for users who are not able to program. It is probably as simple to use as SAS, but in terms of implementing a model it is simpler, as it provides SQL code to score a model. This code is normally not efficient, but it’s a start, whereas SAS sells a separate product that scores models for each database. For small data and an inexperienced team, SPSS is an option as good as SAS.
The software is however rather limited, and experienced users will be orders of magnitude more productive using R or Python.
Matlab, Octave: There are other tools available, such as Matlab or its open source counterpart, Octave. These tools are mostly used for research. In terms of capabilities, R or Python can do all that is available in Matlab or Octave. It only makes sense to buy a Matlab license if you are interested in the support that comes with it.
The primary responsibilities of a data analyst are as follows: