There's probably no definition that the whole world would agree on, but there are certainly some core concepts. The core thing that machine learning does is find patterns in data. It then uses those patterns to predict the future. For example, if we have data about previous credit card transactions, we could potentially find patterns in that data that let us detect when a new transaction is likely to be fraudulent. Or maybe we want to determine whether a customer is likely to switch to a competitor. There are lots more examples, but the core idea is that machine learning lets us find patterns in data, then use those patterns to predict the future.
How did we learn to read? In reading, we identified letters, then the patterns of letters that together form words, and we had to recognize those patterns when we saw them again. That's what learning means, and that's what machine learning does with the data we provide. So, suppose I have data about credit card transactions: only four records, each with three fields; the customer's name, the amount of the transaction, and whether it was fraudulent. What pattern does this data suggest for fraudulent transactions? If the name starts with T, they're a criminal? Well, probably not. The problem with having so little data is that it's easy to find patterns, but hard to find patterns that are correct, i.e. predictive patterns that help us understand whether a new transaction is likely to be fraudulent. Suppose I have more data, meaning more records and more fields in each one: I also know where the card was issued, where it was used, and the age of the user. Now what's the pattern for fraudulent transactions? Well, there really is a pattern in this data: a transaction is fraudulent if the cardholder is in their 20s, the card was issued in the USA and used in Russia, and the amount is more than $1,000. We could find that pattern if we looked at this data for a little while. But once again, do we know that pattern is truly predictive? Probably not; we still don't have enough data. To do this well, we should have so much data that people simply can't find the patterns by hand. For that, we have to use software, and that's where machine learning comes in.
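To make that concrete, here is what the hand-found pattern above would look like if a human simply coded it as a rule. Everything here is hypothetical and illustrative; the point is that hand-written rules like this don't scale once the data grows beyond what a person can inspect.

```python
# A hand-written version of the pattern described above (illustrative only):
# flag a transaction as fraudulent if the cardholder is in their 20s,
# the card was issued in the USA, used in Russia, and the amount exceeds $1,000.
def looks_fraudulent(age, issued_in, used_in, amount):
    return (20 <= age <= 29
            and issued_in == "USA"
            and used_in == "Russia"
            and amount > 1000)

print(looks_fraudulent(25, "USA", "Russia", 1500))  # True
print(looks_fraudulent(45, "USA", "Russia", 1500))  # False
```

With enough records and fields, nobody can write rules like this by staring at the data, which is exactly the gap machine learning fills.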
Well, there are several reasons. A big one is that doing machine learning well requires lots of data, and we live in the big data era. It requires lots of compute power, which we have; we live in the cloud era. And it requires effective machine learning algorithms, which we also have, because researchers have spent years, even decades, in this space learning what works. All of these things are now more available than ever, and that's a big reason why machine learning is so popular today.
Who's interested in machine learning? Mainly three groups of people. The first is business leaders: they want solutions to business problems, and good solutions have real business value. The second is software developers: they want to build better, smarter applications, and as we saw, applications can rely on models created via machine learning to make better predictions. The third group is data scientists, who know statistics and want powerful, easy-to-use tools to help them make good predictions.
There's a machine learning technology worth mentioning called R. R is an open source programming language and environment; it's not just a language. It supports machine learning, statistical computing, and more. R has lots of available packages that address machine learning problems and all sorts of other things, and many commercial machine learning offerings support R. In fact, R has been around for a long time; its roots are in the 90s. But it's not the only choice in this area. Python is also increasingly popular as an open source technology for machine learning, and there are now a number of Python libraries and packages as well. So, R is no longer alone as the only open source choice in this area, but it's still fair to say it's the most popular.
Finally, machine learning in a nutshell looks like this. We start with data that contains patterns. We then feed that data into a machine learning algorithm (there can be more than one) that finds patterns in the data. The algorithm generates something called a model. A model is functionality, typically code, that's able to recognize patterns when presented with new data. Applications can then use that model by supplying new data to see whether it matches known patterns, such as data about a new transaction; the model can return the probability that this transaction is fraudulent. Machine learning lets us find patterns in existing data, then create and use a model that recognizes those patterns in new data.
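That data-to-algorithm-to-model loop can be sketched in a few lines. Here a toy one-nearest-neighbour "algorithm" trains on made-up labeled transactions and produces a model that an application can call with new data; a real algorithm and real training set would be far richer, but the shape is the same.

```python
# Minimal sketch of the train -> model -> predict loop.
# Hypothetical training data: (amount, used_abroad) -> 1 = fraudulent, 0 = legitimate.
training_data = [
    ((1500, 1), 1),
    ((1200, 1), 1),
    ((40, 0), 0),
    ((85, 0), 0),
]

def train(data):
    # "Training" here just stores the examples; real algorithms do far more.
    def model(features):
        # Predict the label of the closest known example.
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        _, label = min(data, key=lambda row: dist(row[0], features))
        return label
    return model

model = train(training_data)
print(model((1300, 1)))  # 1: resembles the fraudulent examples
print(model((60, 0)))    # 0: resembles the legitimate examples
```

An application never needs to see the training data; it only calls the model with new feature values.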
Understanding machine learning means understanding the machine learning process, and the machine learning process is iterative: we repeat things over and over, in both big and small ways. The process is also challenging, because we're working with what are often large amounts of potentially complex data, and we're trying to find meaningful, predictive patterns in that data.
Let's look at machine learning concepts in more detail, along with the terminology used in machine learning.
The first thing we need to do is walk through some terminology. Like most fields, machine learning has its own jargon. Let's start with the idea of training data. Training data just means the prepared data that's used to create a model. We say training data rather than prepared data because, in the jargon of machine learning, creating a model is called training a model. So, training data is the data used to train, that is, to create, a model.
There are two big, broad categories of machine learning. One is called supervised learning, which means the value we want to predict is actually in the training data. For instance, in the data for predicting credit card fraud, whether or not a given transaction was fraudulent is contained in each record. In the jargon of machine learning, that data is labeled, so we're doing supervised learning when we try to predict whether a new transaction is fraudulent.
The alternative, unsurprisingly, is called unsupervised learning and here the value we want to predict is not in the training data. The data is unlabeled. Both approaches are used, but it's fair to say that the most common approach is supervised learning.
The machine learning process starts with data. It might be relational data, it might be from a NoSQL database, it might be binary data. Wherever it comes from, we need to read this raw data into data preprocessing modules, typically chosen from what our machine learning technology provides. We have to do this because raw data is very rarely in the right shape to be processed by machine learning algorithms. For example, maybe there are holes in the data (missing values) or duplicates; maybe there's redundant data, where the same thing is expressed in two different ways in different fields; or maybe there's information that we know will not be predictive and won't help us create a good model. We want to deal with all of these issues. The goal is to create training data. Training data, as in the earlier example, commonly has columns, and those columns are called features. For example, in the credit card fraud data, there were columns containing the country where the card was issued, the country where it was used, and the amount of the transaction; those are all features in the jargon of machine learning. In supervised learning, the value we're trying to predict, such as whether a given transaction is fraudulent, is also in the training data; we call that the target value.
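A tiny preprocessing sketch, with entirely hypothetical records: drop exact duplicates, then fill a missing amount with the average of the known ones. Real preprocessing pipelines do much more, but the shape is the same.

```python
# Hypothetical raw transaction records with a duplicate and a missing value.
raw = [
    {"issued_in": "USA", "used_in": "Russia", "amount": 1500},
    {"issued_in": "USA", "used_in": "Russia", "amount": 1500},  # exact duplicate
    {"issued_in": "UK",  "used_in": "UK",     "amount": None},  # missing amount
    {"issued_in": "USA", "used_in": "USA",    "amount": 50},
]

# Drop exact duplicates while preserving order.
seen, deduped = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Fill missing amounts with the average of the known amounts.
known = [r["amount"] for r in deduped if r["amount"] is not None]
average = sum(known) / len(known)
for r in deduped:
    if r["amount"] is None:
        r["amount"] = average

print(len(deduped), average)  # 3 775.0
```

After steps like these, the cleaned records become the training data, with each remaining field a feature and, in supervised learning, one field the target value.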
It's common to group machine learning problems into categories. There are three main categories as discussed below.
The first category is called regression. Here we have data, and we'd like to find a line or a curve that best fits that data. Regression problems are typically supervised learning scenarios, and an example question would be something like: how many units of this product will we sell next month?
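The simplest instance of regression is fitting a straight line by least squares. Here's a self-contained sketch with made-up monthly sales figures, using the closed-form slope and intercept.

```python
# Hypothetical units sold over five months.
months = [1, 2, 3, 4, 5]
units  = [110, 125, 138, 152, 165]

# Closed-form least-squares fit: slope = cov(x, y) / var(x).
n = len(months)
mean_x = sum(months) / n
mean_y = sum(units) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, units))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x

# Use the fitted line to predict next month's sales.
print(round(slope * 6 + intercept))  # 179
```

The fitted line is the model; the prediction for month 6 is the kind of answer a regression question asks for.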
The second category of machine learning problems is called classification. Here we have data that we want to group into classes, at least two and sometimes more. When new data comes in, we want to determine which class that data belongs to. Classification is commonly used with supervised learning, and an example question would be something like: is this credit card transaction fraudulent? When a new transaction comes in, we want to predict which class it's in, fraudulent or not fraudulent. And often what we get back is not a hard yes or no, but a probability.
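One way to see the "probability rather than yes/no" point: a toy frequency model over hypothetical labeled transactions, estimating the chance of fraud given a single feature. Real classifiers are far more sophisticated, but they report probabilities in the same spirit.

```python
# Hypothetical labeled history: (used_abroad, fraudulent).
history = [
    (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True),
]

def p_fraud(used_abroad):
    # Estimate P(fraud | used_abroad) as a simple frequency over matching records.
    matching = [fraud for abroad, fraud in history if abroad == used_abroad]
    return sum(matching) / len(matching)

print(p_fraud(True))   # 2 of 3 matching transactions were fraudulent
print(p_fraud(False))  # 1 of 3 were
```

An application can then decide its own threshold, for example flagging any transaction whose fraud probability exceeds 0.5 for review.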
The third category of machine learning problems is commonly called clustering. Here we have data and we want to find the clusters within it. This is a good example of when to use unsupervised learning, because we don't have labeled data and don't necessarily know what we're looking for. An example question here is something like: what are our customer segments? We might not know these segments up front, but we can use unsupervised machine learning to help us figure them out.
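To show the idea, here's a bare-bones k-means sketch (k = 2) on made-up customer data, where each point is (annual spend, visits per month). Production clustering libraries handle initialization, convergence checks, and feature scaling much more carefully.

```python
# Hypothetical customers: (annual spend, visits per month).
points = [(100, 1), (120, 2), (110, 1), (900, 10), (950, 12), (880, 9)]

def kmeans(points, centers, steps=10):
    for _ in range(steps):
        # Assign each point to its nearest centre.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centers = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans(points, centers=[points[0], points[3]])
print(sorted(len(c) for c in clusters))  # [3, 3]: two customer segments emerge
```

Note that no labels were involved: the algorithm discovered the low-spend and high-spend segments on its own, which is exactly what makes this unsupervised.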
The kinds of problems that machine learning addresses aren't the only things that can be categorized. It's also useful to think about the styles of machine learning algorithms used to solve those problems. For example, there are decision tree algorithms. There are algorithms that use neural networks, which in some ways emulate how the brain works. There are Bayesian algorithms that use Bayes' theorem to compute probabilities. There are k-means algorithms used for clustering. There are lots more, and having some broad sense of these styles is certainly useful.
Models are very important; they're what applications actually use. An application, for example, can call a model, providing values for the features the model requires. Models make predictions based on the features that were chosen when the model was trained. The model can then return a value predicted from those features. That value might be whether or not a transaction is fraudulent, estimated revenue, a list of movie recommendations, or something else.
Let's take a closer look at the process of creating and training a model. We start with our training data; because we're using supervised learning, the target value is part of that training data. In the credit card example, the target value is whether a transaction is fraudulent. Our first problem is to choose the features we think will be most predictive of that target value. In the credit card case, maybe we decide that the country in which the card was issued, the country it's used in, and the age of the user are the features most likely to help us predict fraud. We've chosen, let's say, features 1, 3, and 6 in our training data. We then input that training data into our chosen learning algorithm, but we only send in, say, 75% of all the data for the features we've chosen. How do we decide which features are most predictive, and how do we choose a learning algorithm? There are lots of options, as we've seen. If it's a simple problem, or our machine learning technology is simple, the choices can be limited and it's not too hard. If we have a more complex problem, though, with lots of data and a powerful machine learning technology with lots of algorithms, it can be hard. What if our training data has 100 features? Which ones are predictive? How many should we use: 5, 10, 50? This is what data scientists are for. People who have knowledge of and facility with these technologies, as well as domain knowledge about a particular problem, are so valuable because they can help us make these decisions. It can be a hard problem. In any case, the result of this step is a candidate model.
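The 75/25 split described above can be sketched in a few lines. The records here are just stand-ins for prepared training rows, and the seed is fixed only so the example is reproducible.

```python
import random

# Stand-ins for 100 prepared, labeled records.
rows = list(range(100))

# Shuffle before splitting so neither portion is biased by record order.
random.seed(0)
random.shuffle(rows)

# Send 75% to the learning algorithm; hold back 25% for testing.
cut = int(len(rows) * 0.75)
train_rows, test_rows = rows[:cut], rows[cut:]

print(len(train_rows), len(test_rows))  # 75 25
```

The held-back 25% is deliberately never shown to the learning algorithm, which is what makes it useful for the evaluation step that follows.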
The next problem is to work out whether this model is any good. In supervised learning, we do that like this. We input test data to the candidate model. That test data is the remaining 25%, the data we held back, for the features we're using, in this case 1, 3, and 6. We use that data because our candidate model can now generate target values from it, and we know what those target values should be, because they're in the labeled data we held back. All we have to do is compare the target values produced by our candidate model from the test data with the real target values. That's how we figure out whether or not our model is predictive when we're doing supervised learning. Suppose our model's just not very good. How can we improve it? One possibility is that we've chosen the wrong features, so this time we choose different ones, like 1, 2, and 5. We might also have bad data, so we can get new data. The problem might be the algorithm, so we can modify some of its parameters or choose a new one. Whatever we do generates another candidate model, and we test it, and the process repeats. It iterates and evolves. This process is called machine learning, but notice how much people do: people make decisions about features, about algorithms, about parameters. The process is very human, even though it's called machine learning.
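Scoring the candidate model then comes down to comparing its predictions on the held-back data with the known labels. A minimal accuracy calculation, with hypothetical values:

```python
# Known labels from the held-back test data, and the candidate model's predictions.
actual    = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]

# Accuracy: the fraction of test records the model got right.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)  # 0.75
```

If that score isn't good enough, this is the point where we loop back: different features, better data, or a different algorithm, then test the new candidate the same way.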
In this way, machine learning has at long last grown up. It's no longer a technology just for researchers in far-off labs. Machine learning also isn't hard to understand; I hope we agree on that by now, even though it can be hard to do well. Finally, machine learning can help people build better applications and contribute a great deal to society. I hope this tutorial gave you all the information you need to clearly understand the human side of the machine learning process.