There is no fixed framework or template for solving data science problems. Here, we give an overview of the data science workflow and the challenges involved in it.
The strategy changes with each new problem set and project, but the steps applied to solve the problem are largely the same across many different problem statements.
What follows is the high-level data science workflow, which applies to most types of problem statement and is widely used in industry.
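The workflow steps covered here are: defining the problem statement, importing the data, exploring the data, cleaning the data, choosing and building a model, validating the model, and reporting the results.
These are the steps data scientists follow when developing a workflow for different data problems.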
The problem statement is the key to your modeling. If you have no idea what you want to do with your data, you cannot proceed with your data set. Defining your problem is the first step to moving ahead.
The next step is importing your data into the platform where you will analyze it or build models on it. Where can you import data from? From any database, or from a CSV file located anywhere on your system. Data can be structured or unstructured.
Structured Data: As the name suggests, structured data is arranged in a defined format, for example ordered by date or by size (increasing or decreasing), so that information can be read off quickly. Data in an Excel spreadsheet is a typical example.
Unstructured Data: Data whose information is scattered, or that is hard to read without prior knowledge (e.g. emails, images, and text documents). Such data does not fit neatly into the rows and columns of a relational (SQL) database.
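As a rough illustration of importing structured data from a CSV file and from a database, here is a minimal sketch that uses only Python's standard library. The column names, the sample values, and the in-memory SQLite database are all made up for the example; a data-frame library such as pandas is the more usual choice in practice.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be a file on disk,
# opened with open(path, newline="").
csv_text = "customer,spend\nalice,120\nbob,85\n"

# csv.DictReader yields one dict per row, keyed by the header line.
# Every value arrives as a string and still needs type conversion.
rows_from_csv = list(csv.DictReader(io.StringIO(csv_text)))

# An in-memory SQLite database stands in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, spend INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("alice", 120), ("bob", 85)],
)
rows_from_db = conn.execute(
    "SELECT customer, spend FROM sales ORDER BY customer"
).fetchall()
conn.close()

print(rows_from_csv)
print(rows_from_db)
```

Note that the CSV reader returns every field as text, while the database query returns typed values. That difference is one reason a cleaning step follows the import.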
Now that the data is imported into the platform, let us check what kind of variables the data frame contains. Exploring the data frame raises three questions, depending on the behavior of the variables: Is it a classification problem or a regression problem? Is it supervised or unsupervised learning? Is the goal prediction or inference?
Let us discuss each of these questions:
Classification or Regression problem: To decide, check whether the output variable is continuous or categorical. Categorical data takes a limited set of values, such as binary or Boolean labels, while continuous data is numerical. If the output variable is categorical, it is a classification problem; if it is continuous, it is a regression problem.
Supervised or unsupervised learning: In supervised learning, the variables are labeled as either dependent or independent. If a dependent variable is defined in the data frame, we are doing supervised learning, because the model learns from the given examples. Supervised learning can be a regression or a classification problem. In unsupervised learning, no dependent variable is defined and the data contains only independent variables, which means we turn to clustering or association models.
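The two checks above can be sketched as a small helper. This is only a heuristic under stated assumptions: the function name `suggest_task` is hypothetical, strings and Booleans are treated as categorical, and any other numeric column is treated as continuous, which will misjudge numeric codes that really stand for categories.

```python
def suggest_task(target_values):
    """Suggest a modelling task from the output (dependent) variable.

    Pass None (or an empty list) when the data has no dependent variable.
    """
    if target_values is None or len(target_values) == 0:
        return "unsupervised (clustering or association)"
    # Booleans and strings are treated as categorical labels.
    if all(isinstance(v, (bool, str)) for v in target_values):
        return "supervised classification"
    # Remaining numeric targets are treated as continuous. A numeric
    # column with only a few distinct values may still be categorical.
    if all(isinstance(v, (int, float)) for v in target_values):
        return "supervised regression"
    return "unclear: inspect the column manually"


print(suggest_task(None))
print(suggest_task(["yes", "no", "yes"]))
print(suggest_task([12.5, 8.0, 15.25]))
```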
Prediction or Inference: For regression problems, suppose we want to predict the value of the dependent variable for a new value of the independent variable. Let us take an example of marketing data:
X (independent variable)    Y (dependent variable)
1                           1
2                           1
3                           2
4                           2
5                           4
Here, let's feed our model x = 6 and ask it to predict the value of y. Suppose that, given this x value, the model returns y = 7. In inference, by contrast, we want to know how X affects the dependent variable.
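To make the difference between prediction and inference concrete, here is a minimal sketch that fits a straight line by ordinary least squares. It assumes the marketing table reads as the five (x, y) pairs listed in the code. The choice of a straight-line model is also an assumption: the article does not say which model produced its figure, so the value computed here will generally differ from that one.

```python
# Assumed reading of the marketing example: (x, y) pairs.
data = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 4)]


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(data)

# Prediction: estimate y for an unseen x.
prediction = intercept + slope * 6

# Inference: the slope estimates how much y changes per unit change in x.
print(f"slope={slope:.2f}, intercept={intercept:.2f}, y(6)={prediction:.2f}")
```

Prediction uses only the final number for x = 6, while inference reads the slope itself as a statement about how X drives Y.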
Data cleaning means checking the data types and getting all the values into the correct format. This can involve stripping characters from strings or converting floats to integers. There might also be missing values in the data set, which you have to handle by filling in or deleting values.
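The cleaning operations mentioned above might look like the following sketch. The rows, the column names, and the choice to fill missing values with the mean are all assumptions for illustration. Whether to fill or delete depends on the data set.

```python
# Hypothetical raw rows, as they might arrive from a CSV import.
raw_rows = [
    {"price": "$12.50", "units": "3.0"},
    {"price": "$8.00", "units": ""},        # missing value
    {"price": " $15.25 ", "units": "5.0"},
]


def clean_row(row):
    # Strip surrounding whitespace and the currency symbol, then convert.
    price = float(row["price"].strip().lstrip("$"))
    # Convert a float-like string to an integer; keep None when missing.
    units = int(float(row["units"])) if row["units"].strip() else None
    return {"price": price, "units": units}


cleaned = [clean_row(r) for r in raw_rows]

# Option 1: delete rows that have a missing value.
dropped = [r for r in cleaned if r["units"] is not None]

# Option 2: fill missing values with the mean of the observed values.
observed = [r["units"] for r in cleaned if r["units"] is not None]
fill_value = round(sum(observed) / len(observed))
filled = [
    dict(r, units=fill_value if r["units"] is None else r["units"])
    for r in cleaned
]

print(cleaned)
print(dropped)
print(filled)
```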
The choice of model depends on the type of data, continuous or categorical: if it is continuous, apply regression modeling; if it is categorical, apply classification, for example logistic regression.
As a data scientist, you will try many models to find the best fit. I prefer to build linear regression models on continuous data and logistic regression or other classification models on categorical data. I also like K-Nearest Neighbour (K-NN) for classification. If the results are still not satisfying, you can move on to neural networks.
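For readers who have not met K-NN before, here is a minimal sketch of the idea: a new point gets the majority label of its k nearest training points. The toy data, the labels, and the function name `knn_predict` are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (features, label) pairs with numeric feature tuples.
    """
    # Sort the training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.0), "high"),
]

print(knn_predict(train, (1.8, 1.2)))
print(knn_predict(train, (6.2, 6.8)))
```

An odd k avoids tied votes when there are two classes. On real data the features usually need scaling first, since distance is sensitive to units.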
Other problems can occur while building models. In a regression model there can be multicollinearity, where predictors are strongly correlated with each other, and there are various techniques to deal with it. In classification, there can be a multi-class problem, where the output has more than two categories.
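One common way to spot multicollinearity is to look at the correlation between predictors and at the variance inflation factor (VIF). The article does not name a specific technique, so the sketch below is just one option. The predictor names and values are made up, and the formula VIF = 1 / (1 - r^2) is used in its two-predictor form.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical predictors: impressions rise almost in step with ad_spend.
ad_spend = [10, 20, 30, 40, 50]
impressions = [105, 198, 310, 395, 502]

r = pearson(ad_spend, impressions)

# With exactly two predictors, the VIF of each one is 1 / (1 - r^2).
vif = 1 / (1 - r ** 2)

print(f"correlation={r:.4f}, VIF={vif:.1f}")
```

A VIF above roughly 5 to 10 is a widely quoted rule of thumb for problematic collinearity. Dropping or combining one of the correlated predictors is a typical remedy.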
Several other problems can occur while building models, which I am not going to cover here.
After building a model, we need to check how adequate it is, that is, how well it fits our data. For regression models, for example, there are various criteria for selecting the best model.
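Two widely used measures of fit for a regression model are the root mean squared error (RMSE) and the coefficient of determination (R squared). The article does not say which criteria it has in mind, so this is only an illustrative sketch, and the actual and predicted values are invented.

```python
import math


def regression_metrics(actual, predicted):
    """Return (rmse, r_squared) for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares: error left over after the model.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    rmse = math.sqrt(ss_res / n)
    r_squared = 1 - ss_res / ss_tot
    return rmse, r_squared


# Hypothetical held-out values and the model's predictions for them.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

rmse, r_squared = regression_metrics(actual, predicted)
print(f"RMSE={rmse:.3f}, R^2={r_squared:.3f}")
```

A lower RMSE and an R squared closer to 1 indicate a better fit. Both are best computed on data the model did not see during training.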
After all of the above, we need to extract the results and build a report or presentation that shows our progress to the people concerned.
I hope this walk-through of the data science process helps you analyse and overcome data science problems.
That's it for now. Stay tuned for upcoming articles on neural networks, which are on their way. To get the latest updates, follow my blog.