The most talked-about buzzword of 2015, 2016, and 2017 is machine learning. Still, newcomers often do not know how to start their journey into this fascinating world. There is a common belief that machine learning is all about the algorithms. The algorithms are a big deal, but the preprocessing work, everything leading up to algorithm design and implementation, is crucial and largely ignored. Here's my take on the machine learning process.
Definition: "Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed." (often attributed to Arthur Samuel)
Building a machine learning project requires knowledge of several fields: linear algebra (for working with matrices), statistics, machine learning algorithms, data manipulation techniques, and programming languages like Python and R, along with a bunch of other computer science skills. But even with all these skills, one can easily get confused about how to put them in order.
That is why today I will take you through all the steps of building a machine learning project, and when to do each of them.
Ask a Question?
"If getting answers from data is an art, then asking questions is the skill you need," the Wise Man said.
Here, the fact to note is that your data won't give you the answer you want unless you ask the right question. In simple words, imagine your machine learning algorithm is a genie that pledges to answer every question you pose, but it is a mischievous genie (it watches Game of Thrones and is a fan of the Three-Eyed Raven): it will answer your questions exactly as asked. If you ask it "Will I score a girl?", it will answer "Yes". That answer is not helpful, because we don't know who, so we need to ask the question with specific data (in this case, the name of the girl). The choice of questions should be sharp, not vague.
Now, once you have formulated your questions, check your machine learning dictionary to see whether you can answer them (because there are many questions machine learning cannot answer... bummer!). The form of your answer is the target value that the algorithm is going to predict. Following are the types of answers you can get from your questions:
- Classification: Is this A or B?
- Regression: How much, or how many of these?
- Anomaly Detection: Is this anomalous?
- Clustering: How can these elements be grouped?
- Reinforcement Learning: What should I do now?
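To make the mapping concrete, here is a toy lookup from each question type to a few commonly used algorithms. The pairings are illustrative conventions only, not an official taxonomy:

```python
# Illustrative question-type -> (task, example algorithms) lookup.
# The algorithm names are common choices, not an exhaustive list.
QUESTION_TO_ALGORITHMS = {
    "Is this A or B?": ("classification", ["logistic regression", "random forest"]),
    "How much, or how many?": ("regression", ["linear regression", "gradient boosting"]),
    "Is this anomalous?": ("anomaly detection", ["isolation forest", "one-class SVM"]),
    "How can these be grouped?": ("clustering", ["k-means", "DBSCAN"]),
    "What should I do now?": ("reinforcement learning", ["Q-learning", "policy gradients"]),
}

def suggest(question: str):
    """Return (task type, candidate algorithms) for a known question pattern."""
    return QUESTION_TO_ALGORITHMS.get(question, ("unknown", []))
```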
This part depends on the way you want to build your project. According to the wise man above, there are three main directions for building a machine learning project:
- SaaS: pre-built machine learning models
- Data Science and Applied Machine learning
- Machine Learning Research.
Now, let's go through all of them one by one.
1. SaaS: pre-built machine learning models
These are the API-based machine learning gods. They let us apply machine learning models pre-built by many companies and organizations to our own data and get results. The API-based SaaS industry is booming, with large companies offering APIs for almost every type of data we have today. These are some of the companies investing in the SaaS industry:
- Google Cloud: Vision API, Speech API, Jobs API, Video Intelligence API, Natural Language API, Translation API, etc.
- AWS: Rekognition, Lex, Polly, etc.
2.Data Science and Applied Machine Learning
This part deals with using existing libraries on top of the frameworks provided. These frameworks are made to solve specific problems.
3.Machine Learning Research
This part covers the platforms or frameworks that can be used to solve a specific problem with the help of programming languages like Python and R.
Basically, businesses use the first two options, while students and startups often use the third.
This is the most important aspect of building your machine learning project, as this step deals with the manipulation of data. In this step, you work on your actual data and convert it into the form needed to produce the required solution.
The processes involved in this step are as follows:
Here, the data required for the project must be collected from different sources such as web pages, public datasets, local databases, various APIs, and other sources.
This step is crucial and should be implemented carefully as all upcoming steps will be playing with the data collected in this step.
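As a minimal sketch of this collection step, the snippet below pulls records from CSV text and from a JSON API response body using only the standard library. The field names and sample values are hypothetical placeholders for your own sources:

```python
# Collect records from two hypothetical sources: a CSV export and a JSON API.
import csv
import io
import json

def rows_from_csv(text: str):
    """Parse CSV text into a list of dicts, one per record.
    Note: csv.DictReader yields every field as a string."""
    return list(csv.DictReader(io.StringIO(text)))

def rows_from_json(payload: str):
    """Parse a JSON array (e.g. an API response body) into records."""
    return json.loads(payload)

csv_text = "name,age\nAda,36\nAlan,41\n"
api_text = '[{"name": "Grace", "age": 45}]'
records = rows_from_csv(csv_text) + rows_from_json(api_text)
```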
Up to this stage, you have collected all the data you need to address the questions and problem statement you have taken on. Your data will fall into the following types:
- Nominal: Nominal data is mutually exclusive but not ordered; these are usually categories.
- Ordinal: Ordinal data is data where the order matters but not the difference between the values.
- Interval: This data type is a measurement where the difference between two values is meaningful.
- Ratio: This data type has all the properties of an interval variable and also has a clear definition of 0.0 (a true zero point).
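Knowing each feature's measurement scale tells you which statistics are valid for it. A small, hypothetical column-to-scale map (the column names are made up for illustration) might look like this:

```python
# Hypothetical mapping from column name to measurement scale.
MEASUREMENT_SCALE = {
    "blood_group": "nominal",     # categories, no order
    "satisfaction": "ordinal",    # ordered, gaps not meaningful
    "temperature_c": "interval",  # differences meaningful, no true zero
    "income_usd": "ratio",        # true zero, ratios meaningful
}

def valid_stats(column: str):
    """Mean/std only make sense for interval and ratio data;
    nominal/ordinal features get mode and frequency counts instead."""
    scale = MEASUREMENT_SCALE[column]
    return ["mean", "std"] if scale in ("interval", "ratio") else ["mode", "frequency"]
```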
The data exploration is mainly divided into two types: Univariate Analysis and Bivariate Analysis.
Univariate Analysis: This refers to data where we observe only one aspect of something at a time. With single-variable data, we can put all our observations into a single list of numbers. Univariate analysis can be visualized using the following methods:
- Continuous Features: Mean, Median, Mode, Min, Max, Range, Quartile, IQR, Variance, Standard Deviation, Skewness, Histogram, Box Plot.
- Categorical Features: Histogram, Frequency.
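The continuous-feature summary above can be computed with the standard library alone; skewness is derived from its moment definition, since the `statistics` module does not provide it:

```python
# Univariate summary for one continuous feature using only the stdlib.
import statistics as st

def univariate_summary(values):
    mean = st.mean(values)
    sd = st.pstdev(values)  # population standard deviation
    n = len(values)
    # Skewness from the third standardized moment (0 for symmetric data).
    skew = sum((x - mean) ** 3 for x in values) / (n * sd ** 3) if sd else 0.0
    q1, _, q3 = st.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": mean, "median": st.median(values),
        "min": min(values), "max": max(values),
        "iqr": q3 - q1, "std": sd, "skewness": skew,
    }
```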
Bivariate Analysis: This analysis refers to data where we observe two aspects at a time. We can put our observations into a table, the columns-and-rows kind. This analysis can be visualized using the following methods: scatter plot, correlation map (heatmap), two-way table, stacked column chart, chi-square test, Z-test/T-test, ANOVA, etc.
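As a small illustration of bivariate analysis for two categorical features, a two-way (contingency) table, the usual starting point for a chi-square test, can be built with `collections.Counter`. The sample data here are made up:

```python
# Build a two-way table: counts of co-occurring category pairs.
from collections import Counter

def two_way_table(xs, ys):
    """Count co-occurrences of (x, y) category pairs."""
    return Counter(zip(xs, ys))

sex = ["m", "f", "f", "m", "f"]
survived = ["no", "yes", "yes", "no", "no"]
table = two_way_table(sex, survived)
```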
This step deals with cleaning and preprocessing the collected data. It is important because the features that will feed the machine learning algorithm should be clean. This step involves the following processes:
- Missing values: In this process, you can choose to omit the entire elements containing missing values or impute a value in place.
- Special values: Numeric variables are endowed with several formalized special values including ±Inf, NA, and NaN. Calculations involving special values often result in special values and need to be handled/cleaned.
- Outliers: These should be detected, but not necessarily removed; whether to include them is a statistical decision.
- Obvious inconsistencies: These can be located with manual rules: a person's age cannot be negative, a man cannot be pregnant, and an under-aged person cannot possess a driver's license.
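A minimal sketch of these checks, assuming records are plain dicts with a hypothetical `age` field:

```python
# Flag missing values, special numeric values, and one obvious
# domain inconsistency (negative age) in a single record.
import math

def problems(record):
    """Return a list of data-quality issues found in one record."""
    issues = []
    age = record.get("age")
    if age is None:
        issues.append("missing age")
    elif isinstance(age, float) and (math.isnan(age) or math.isinf(age)):
        issues.append("special value in age")
    elif age < 0:
        issues.append("negative age")
    return issues
```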
Feature imputation is the programmer's decision whether to omit an entire element containing null or special values or to impute it. Imputation simply means replacing the null or special value with a more specific and meaningful one. The following techniques are often used:
- Hot-Deck: This technique finds the first missing value and uses the cell value immediately prior to the data that are missing to impute the missing value.
- Cold-Deck: In this technique, you can choose the donor from another dataset to fill for the missing value.
- Mean-substitution: This is another imputation technique which involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
- Regression: A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
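Two of these techniques are simple enough to sketch directly on a single column, with `None` marking a missing value:

```python
# Mean-substitution and a hot-deck variant (carry the previous
# observed value forward) on one column of data.
def mean_substitute(values):
    """Replace each missing value with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def hot_deck(values):
    """Fill each missing value with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is None and last is not None:
            v = last
        filled.append(v)
        if v is not None:
            last = v
    return filled
```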
At this stage, we have a dataset collected from various sources and loaded into our system (I'm referring to the code). We have performed exploratory data analysis, peeked into the data's statistical insights, investigated missing and special values, and, where found guilty, offered some ways to fill the void. Seems like a lot, but trust me, this is just 10% of the total work to be done.
Now, our next weapon against the dataset is feature engineering. This is exactly what it sounds like: we modify the existing attributes (columns) to make the later feature-selection step easier. Some of the techniques are as follows:
- Decompose: Mainly converting composite attributes, most often date and time, into a categorical format, e.g. converting 2014-09-20 20:45:40 into categorical attributes like hour_of_the_day, part_of_day, etc.
- Discretization: For continuous features, data is typically discretized into K partitions of equal length/width (equal intervals) or partitions each holding K% of the total data (equal frequencies). For categorical features, values may be combined, particularly when there are few samples for some categories.
- Reframe Numerical Quantities: Changing from grams to kg, and losing detail might be both wanted and efficient for calculation.
- Crossing: Creating new features as a combination of existing features. Could be multiplying numerical features, or combining categorical variables. This is a great way to add domain expertise knowledge to the dataset.
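The decompose and discretization steps above might be sketched like this (the part-of-day boundaries and bin edges are arbitrary example choices):

```python
# Decompose a timestamp into categorical parts, and discretize a
# continuous value into K equal-width bins.
from datetime import datetime

def decompose(ts: str):
    """Split 'YYYY-MM-DD HH:MM:SS' into hour and part-of-day features."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    part = ("night", "morning", "afternoon", "evening")[dt.hour // 6]
    return {"hour_of_the_day": dt.hour, "part_of_day": part}

def discretize(value, low, high, k):
    """Map value into one of k equal-width bins over [low, high]."""
    width = (high - low) / k
    return min(int((value - low) // width), k - 1)
```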
Now we are entering a very crucial and equally important part of the machine learning model building process: feature selection. The job here is simple: select the features that best fit the machine learning algorithm we want to apply.
The main advantages of feature selection are as follows:
- It enables the machine learning algorithm to train faster
- It reduces the complexity of a model and makes it easier to interpret.
- It improves the accuracy of a model if the right subset is chosen.
- It reduces overfitting.
There are various methods used for feature selection, the following are the ones we have considered appropriate for our discussion.
1. Correlation and Covariance
Most of the other methods depend on correlation. Correlation measures the strength of the linear relationship between two variables. The motive behind calculating it is to find features that are uncorrelated with each other but highly correlated with the feature we are trying to predict. Covariance, by contrast, is a measure of how much two random variables change together; correlation is simply covariance normalized by the two standard deviations.
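Both quantities follow directly from their textbook definitions, with correlation being covariance rescaled into the range [-1, 1]:

```python
# Sample covariance and Pearson correlation of two features.
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))
```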
2. Dimensionality Reduction
Dimensionality reduction methods are unsupervised methods that take the existing features and combine them into new ones. So in this method we obtain new features, unlike the other feature selection methods, where we select the best subset of the existing features. The two most used methods are:
- Principal Component Analysis (PCA): Principal component analysis is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The transformation is defined so that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. You can then plot the variance explained by each component and keep the components that explain the most variance.
- Singular Value Decomposition (SVD): SVD is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any m×n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
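To show what PCA is actually doing, here is a from-scratch projection onto the first principal component for exactly two features, using the closed-form eigendecomposition of a 2x2 covariance matrix. A real project would use numpy or scikit-learn instead:

```python
# From-scratch 1-D PCA for two-feature data.
from math import sqrt, hypot

def pca_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    # Largest eigenvalue via the quadratic formula.
    lam = (a + c + sqrt((a - c) ** 2 + 4 * b * b)) / 2
    # Its eigenvector, normalized to unit length.
    if b:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    return [x * vx + y * vy for x, y in centered]
```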
3. Filter Methods:
Filter type methods select features based only on general metrics like the correlation with the variable to predict. Filter methods suppress the least interesting variables. The other variables will be part of a classification or a regression model used to classify or to predict data. These methods are particularly effective in computation time and robust to overfitting.
The algorithms that come under these methods include:
- Linear Discriminant Analysis
4. Wrapper Methods
Wrapper methods evaluate subsets of variables, which, unlike filter approaches, allows them to detect possible interactions between variables. The two main disadvantages of these methods are:
i. The increasing overfitting risk when the number of observations is insufficient.
ii. The significant computation time when the number of variables is large.
The algorithms in this category are:
- Forward Selection
- Backward Elimination
- Recursive Feature Elimination
- Genetic Algorithm
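Forward selection, the simplest of these, can be sketched as a greedy loop around any user-supplied scoring function. Here a made-up toy score stands in for something like cross-validated accuracy:

```python
# Greedy forward selection: repeatedly add the single feature that most
# improves the score, stopping when no feature helps.
def forward_selection(features, score):
    selected, remaining = [], list(features)
    best = score(selected)
    while remaining:
        # Try each remaining feature and keep the best single addition.
        top_score, top_feat = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best:
            break  # no feature improves the model any further
        best = top_score
        selected.append(top_feat)
        remaining.remove(top_feat)
    return selected

# Toy score: "a" and "b" are informative, anything else hurts (hypothetical).
GAINS = {"a": 2, "b": 1}
def toy_score(fs):
    return sum(GAINS.get(f, -1) for f in fs)

chosen = forward_selection(["a", "b", "c"], toy_score)
```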
5. Embedded Methods
Embedded methods try to combine the advantages of both the previous method types. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.
The main algorithms here are:
- Lasso regression: Performs L1 regularization which adds penalty equivalent to the absolute value of the magnitude of coefficients.
- Ridge regression: Performs L2 regularization which adds penalty equivalent to the square of the magnitude of coefficients.
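The shrinking effect of the L2 penalty is easy to see in the one-feature, through-the-origin case, where ridge regression has a closed-form slope:

```python
# One-feature ridge regression in closed form: the L2 penalty (alpha)
# shrinks the slope toward zero as the regularization strength grows.
def ridge_slope(xs, ys, alpha):
    """Least-squares slope through the origin with an L2 penalty."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)
```

With `alpha = 0` this reduces to ordinary least squares; larger `alpha` values pull the slope closer to zero (lasso's L1 penalty, by contrast, can drive coefficients exactly to zero, which is why it doubles as a feature selector).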
The above processes are very useful when building any machine learning project. They will give you a basic feel for your data through some useful insights, and they will also improve your problem-solving technique. As part of a project, perform the above processes to solve a machine learning problem. This was the first part of the blog series "Machine Learning: where to start and what to do"; I will give you a brief idea of the other processes in coming articles.