I often get asked by my friends and classmates — “How do I get started with Machine Learning or Data Science?”
So, here is my answer.
Earlier, I wasn’t so sure. I would say something like “do this course” or “read this tutorial” or “learn Python first” (just the things that I did). But now, as I go deeper into the field, I am beginning to realise the drawbacks of the approach that I took.
So, in hindsight, I believe that the best way to “get into” ML or Data Science might be through Kaggle.
In this article, I will tell you why I think so and how you can do that if you are convinced by my reasoning.
(Caution: I am a student, not a Data Scientist or an ML engineer by profession, and definitely not an expert at Kaggle. So take my advice and opinions with a healthy grain of salt. :-) )
But first, let me introduce Kaggle and clear some misconceptions about it.
You might have heard of Kaggle as a website that awards mind-boggling cash prizes for ML competitions.
Competitions hosted on Kaggle with the maximum prize money (yes those are MILLION DOLLAR+ prizes!)
It is this very fame that also causes a lot of misconceptions about the platform and makes newcomers more hesitant to start than they should be.
(Oh, and don’t worry if you have never heard of Kaggle before and therefore don’t share any of the misconceptions mentioned below. This article will still make complete sense. Just treat the next section as my introduction to Kaggle.)
This is such an incomplete description of what Kaggle is! I believe that competitions (and their highly lucrative cash prizes) are not even the true gems of Kaggle. Take a look at their website’s header—
Competitions are just one part of Kaggle
Along with hosting competitions (about 300 of them so far), Kaggle also hosts three very important things:
Datasets, even ones not related to any competition: it houses 9,500+ datasets as compared to just 300 competitions (at the time of writing). So you can sharpen your skills on whatever dataset amuses or interests you.
Kernels: these are just Kaggle’s version of Jupyter notebooks, which, in turn, are a really effective and cool way of sharing code along with lots of visualisations, outputs and explanations. The “Kernels” tab takes you to a list of public kernels in which people showcase some new tool or share their expertise or insights about particular datasets.
Learn: this tab contains free, practical, hands-on courses that cover the minimum prerequisites needed to get started in the field quickly. The best thing about them? Everything is done using Kaggle’s kernels (described above). This means that you can interact and learn; no more passive reading through hours of learning material!
All of these together have made Kaggle much more than simply a website that hosts competitions. It has now also become a complete project-based learning environment for data science. I will talk about that aspect of Kaggle in detail after this section.
If you think Kaggle is only for experts, I urge you to read this —
TL;DR: A high school kid became a Kaggle Competitions Master by simply following his curiosity and diving into the competitions. In his own words —
The most important parts of machine learning are Exploratory Data Analysis (EDA) and feature engineering, not model fitting. In fact, many Kaggle masters believe that newcomers move to complex models too soon, when the truth is that simple models can get you very far.
Besides, many challenges have structured data, meaning that all the data sits in neat rows and columns. There is no complex text or image data, and simple algorithms (no fancy neural nets) are often the winning ones for such datasets. EDA is probably what differentiates a winning solution from the others in such cases.
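To make the "simple models go far" point concrete, here is a minimal sketch in plain Python. The tiny housing-style dataset is hypothetical and only for illustration: a quick bit of EDA (summary statistics), a predict-the-mean baseline, and a hand-fitted one-feature least-squares line.

```python
from statistics import mean

# Hypothetical toy dataset: (house size in sq. m, sale price in $1000s).
rows = [(50, 150), (70, 200), (80, 230), (100, 280), (120, 330)]
sizes = [x for x, _ in rows]
prices = [y for _, y in rows]

# A first bit of EDA: look at basic summaries before any modelling.
print(f"size:  min={min(sizes)}, max={max(sizes)}, mean={mean(sizes):.1f}")
print(f"price: min={min(prices)}, max={max(prices)}, mean={mean(prices):.1f}")

# Baseline "model": always predict the mean price.
baseline = mean(prices)
mae_baseline = mean(abs(y - baseline) for y in prices)

# Simple model: a one-feature least-squares line, fitted by hand.
mx, my = mean(sizes), mean(prices)
slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in sizes)
intercept = my - slope * mx
mae_linear = mean(abs(y - (slope * x + intercept)) for x, y in rows)

print(f"MAE baseline: {mae_baseline:.1f}, MAE linear: {mae_linear:.1f}")
```

Even this tiny example shows the pattern: the baseline tells you what "no model" achieves, and a very simple model already improves on it dramatically; fancy algorithms come much later, if at all.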
Now, let’s move on to why you should use Kaggle to get started with ML or Data Science.
The Machine Learning course on Kaggle Learn won’t teach you the theory and the mathematics behind ML algorithms. Instead, it focuses on teaching only those things that are absolutely necessary in analysing and modelling a dataset. Similarly, the Python course over there won’t make you an expert at Python but it will ensure that you know just enough to go to the next level.
This minimises the time that you need to spend in passive learning and makes sure that you are ready to take on interesting challenges ASAP.
I believe that doing projects is so effective that it is worth centring your entire learning around completing one. Instead of searching for a relevant project after you learn something, it might be better to start with a project and then learn everything you need to bring that project to life.
I believe that learning is more exciting and effective this way.
(I wrote an article about this methodology a few weeks ago. It’s called “How (and why) to start building useful, real-world software with no experience” and it seems to be doing well on Medium. So, check that out if you can. :-) )
But this idea totally fails when you don’t have a project to leap towards. And doing an interesting project is difficult for a few reasons:
a) Finding an interesting project idea is hard
Finding ideas for Data Science projects seems to be more difficult than other programming fields because of the added requirement of having suitable datasets. It often feels like all the data being generated is just being hoarded away by the tech companies for their private use.
b) You need help learning the missing prerequisites
Sometimes, when I start a project, it feels like there are just so many things that I still don’t know; so many things that no online course taught me. Was I supposed to know all that beforehand? Am I just out of my depth? I feel like I don’t even know the prerequisites for learning this thing. So, how do I go about learning what I don’t know?
And that’s when all the motivation starts to wane.
c) You need help during the building process
It seems like I keep hitting one roadblock after another during the building process. And even if I do manage to build the first version, how do I improve it? What’s the best way forward? It would be so good if I could talk to a group of people and hear how they would tackle the problem.
And here’s how Kaggle seems to be the perfect solution to all those problems —
Soln. a → Datasets and Competitions:
With around 300 competition challenges, all accompanied by public datasets, and 8,500+ datasets in total (with more being added constantly), there is no shortage of ideas here.
Soln. b → Kernels and Learn:
Every challenge has public kernels that you can use to get started, and a lot of popular challenges have kernels intended specifically for newcomers. Apart from that, Kaggle seems to be making a real effort to include newcomers in its community: it recently added the Learn section, which now features on the website’s main header.
Kaggle Learn provides courses to give a practical intro to Data Science using Kaggle’s challenges
Soln. c → Kernels and Discussion:
Along with the public Kernels section, each competition has its own Discussion forum, where you will often find some really useful analysis or insights. As written in the article “Learning From the Best” on Kaggle’s blog —
“During the competitions, many participants write interesting questions which highlight features and quirks in the data set, and some participants even publish well-performing benchmarks with the code on the forums. After the competitions, it is common for the winners to share their winning solutions”
All of this can give you ideas for improving your own approach and even guide you towards what you need to learn next.
The challenges on Kaggle are hosted by real companies looking to solve real problems that they encounter. The datasets they provide are real. All that prize money is real. This means that you get to learn Data Science/ML and practise your skills by solving (at least what feel like) real-world problems.
If you have tried competitive programming before, you might relate when I say that the problems hosted on such websites can feel too unrealistic. I mean, why should I write a program to count the Pythagorean triplets in an array? What is that going to accomplish?
I am not trying to assert that such problems are easy; I find them extremely difficult. Nor am I trying to undermine the importance of websites that host such problems; they are a good way to test and improve your data structures and algorithms knowledge.
All I’m saying is that it all feels way too fictional to me. When the problem that you are trying to solve is real, you will always want to work on improving your solution. That will provide the motivation to learn and grow. And that’s what you can get from participating in a Kaggle challenge.
I would be remiss not to mention the other side of this debate, which argues that Machine Learning Isn’t Kaggle Competitions. Some people even go so far as to say that Kaggle competitions represent only the “touristy sh*t” of actual Data Science work and that the data there is artificially clean.
Well, maybe that is true. Maybe real data science work doesn’t resemble the approach one takes in Kaggle competitions. I haven’t worked in a professional capacity, so I don’t know enough to comment.
But what I have done, plenty of times, is use tutorials and courses to learn ML/Data Science. And each of those times, I felt a disconnect between the tutorial/course and my motivation to learn. I would learn something just because it was there in the tutorial/course and hope that it would come of use in some distant, mystical future.
On the other hand, when I’m doing a Kaggle challenge, I have a stage that allows me to immediately apply what I have learned and see its effects. That gives me motivation, and the glue that helps all that knowledge stick. So Kaggle, in itself, is probably not enough to make someone a Data Scientist, but it seems to be really effective in helping someone start their journey towards becoming one.
Having all those ambitious, real problems has a downside, though — it can be an intimidating place for beginners to get into.
I understand this feeling as I have recently started with Kaggle myself (with the Housing Prices Prediction and the Costa Rican Household Poverty Level Prediction challenges). But once I overcame that initial barrier, I was completely awed by its community and the learning opportunities that it has given me.
So, here I try to lay down how I started with my Kaggle journey (and how you can start yours too):
Step 1: Choose a language (Python or R).
Once you have done that, head over to Kaggle Learn to quickly understand the basics of that language, machine learning and data visualisation techniques.
Courses on Kaggle Learn
Step 2: Choose an interesting, archived competition. Remember, your goal isn’t to win a competition; it is to learn and improve your knowledge of Data Science and ML. It’s easy to feel small, lost and out of place if you start by competing in an active challenge as a beginner.
I would suggest that you choose a competition from this list of competitions (sorted by the number of participating teams, highest to lowest). This means that you will mostly encounter the introductory and archived competitions before the active ones, and that is a good thing! When you choose such a competition rather than an active one, you have the insights and analysis of a lot more people to draw on.
Go through the list and choose a competition whose dataset or problem statement seems interesting to you; that interest is what will help sustain your learning.
These kernels will help you understand the general workflow of data exploration -> feature engineering -> modeling, as well as the particular approach that other people are taking for this competition. Get a feel for how things are done by reading through other people’s public kernels.
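That three-stage workflow can be sketched in miniature. The rows below are a hypothetical, Titanic-flavoured toy dataset; the point is only the shape of the stages, not the specific numbers.

```python
from statistics import median

# Hypothetical mini-dataset in the spirit of a Kaggle CSV: 'age' may be missing.
passengers = [
    {"age": 22,   "fare": 7.25,  "survived": 0},
    {"age": 38,   "fare": 71.28, "survived": 1},
    {"age": None, "fare": 8.05,  "survived": 0},
    {"age": 35,   "fare": 53.10, "survived": 1},
    {"age": None, "fare": 8.46,  "survived": 0},
]

# 1. Data exploration: how much data is missing?
n_missing = sum(1 for p in passengers if p["age"] is None)
print(f"{n_missing} of {len(passengers)} rows are missing 'age'")

# 2. Feature engineering: impute missing ages with the median,
#    and derive a crude 'high_fare' indicator feature.
age_median = median(p["age"] for p in passengers if p["age"] is not None)
for p in passengers:
    p["age"] = p["age"] if p["age"] is not None else age_median
    p["high_fare"] = int(p["fare"] > 10)

# 3. Modeling: a one-rule "model" that predicts survival from high_fare.
correct = sum(1 for p in passengers if p["high_fare"] == p["survived"])
print(f"one-rule accuracy: {correct / len(passengers):.0%}")
```

Real kernels do each stage far more thoroughly (and with libraries like pandas), but the loop is the same: look at the data, shape better features, then fit something simple.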
Often, these kernels and discussions will tell you what you don’t know in ML/ Data Science. Don’t feel discouraged when you encounter a technical term that you are unfamiliar with.
Knowing what you need to know is the first step to knowledge.
So, it’s okay if you don’t know the difference between continuous and discrete variables, or what k-fold cross-validation is, or how to produce those compelling graphs and visuals. These are just the things that you need to learn to help you grow. But before you do that...
Step 3: Now go work on your own analysis. Implement whatever you learned from the previous step in your own kernel. Build as much as you can with your current knowledge.
Also, remember that the open-source philosophy grants “the freedom to run it, to study and change it”. So don’t shy away from using someone else’s approach in your own implementation; it’s not cheating to copy. Just don’t take this as an excuse to slack off: make sure that you copy only what you understand, and be honest with yourself. (It would also be nice to credit and link back to the original authors if you use parts of their notebooks.)
Step 4: This is where you do the learning. Sometimes it is just a short article, while at other times it can be a meaty tutorial or course.
Just remember that you need to go back to step 3 and use what you learn in your kernel. This way, you create the cycle of “Learn, Leap and Repeat”!
Step 5: You come to this step once you have built an entire prediction model. Congratulations!
Now you probably want to improve your analysis. To do that you can go back to step 2 and look at what other people have done. That can give you ideas about improving your model. Or, if you feel like you have tried everything but have hit a wall, then asking for help on the discussion forums might help.
An example of such a discussion
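On the question of whether a change actually improved your model, the k-fold cross-validation mentioned earlier is the standard tool: rather than trusting one lucky train/test split, you rotate which slice of the data is held out and average the errors. Here is a minimal plain-Python sketch of the idea on a hypothetical toy dataset, using the simplest possible "model" (predict the training mean); real projects would use a library implementation such as scikit-learn's.

```python
from statistics import mean

# Hypothetical toy regression data: (x, y) pairs.
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 6.0)]

def k_fold_score(data, k=3):
    """Estimate out-of-sample error by rotating which fold is held out."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal slices
    errors = []
    for i in range(k):
        held_out = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        # "Fit" the simplest possible model: predict the training mean of y.
        prediction = mean(y for _, y in train)
        errors.extend(abs(y - prediction) for _, y in held_out)
    return mean(errors)  # mean absolute error over all held-out points

print(f"3-fold MAE of the mean baseline: {k_fold_score(data):.2f}")
```

Because every point gets held out exactly once, the score reflects how the model behaves on data it was not fitted on, which is the number you want to watch while iterating on your kernel.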
So — learn, leap and repeat!
Weekly Kernels Award Winner Announcement thread: a while back, Kaggle started an initiative of choosing the best public kernel each week (by its own judgment) and awarding it a cash prize. These kernels include some excellent analysis and visualisations. On some weeks, Kaggle also fixes the dataset from which the winning kernel will be chosen. This means that you get lots of excellent public kernels for those datasets, which makes them perfectly apt for step 1 of the “How” section above!
Data Science glossary on Kaggle: this public kernel uses the Meta Kaggle database to build a glossary of the most popular public kernels, grouped by the tools and techniques they use. It might be one of the best resources on the internet for understanding practical implementations of ML algorithms. It is also one of the winning kernels from the Weekly Kernels Award mentioned above (so now you know how useful that thread is).
No Free Hunch — the official blog of Kaggle: among other things, this blog contains interviews with the top performers in Kaggle competitions. They discuss their strategies and dissect their approaches for the respective challenges.
Reflecting back on one year of Kaggle contests: A Kaggle Master shares his year-long experience on how he became good at Kaggle competitions.
A Getting Started discussion thread on How to Become a Data Scientist on Your Own: it contains lots of links to various free learning resources.
Learning From the Best: this article, published on the Kaggle blog, contains advice from some of Kaggle’s top performers on how to do well in competitions. There’s a lot in there that I don’t understand yet, but it all seems really interesting and I have bookmarked it.
And finally, Machine Learning Isn’t Kaggle competitions: this is the second time I am linking to this article. In it, Julia Evans explains how she found Kaggle competitions vastly different from her day-to-day job as a Machine Learning engineer at Stripe. Things I took away from this post —
* Doing well on the leaderboard isn’t the end of the world
* How little actual ML work is about the fancy algorithms
* What one may expect at an ML/ Data Science job
Alright then. Thank you for reading. I hope this introduction to learning through Kaggle has been helpful.
I really believe that learning by building something is a very rewarding experience, but it is difficult, and Kaggle makes it easy for you. Kaggle competitions take care of coming up with a task, acquiring the data, cleaning it into a usable form and defining a metric to optimize towards.
But as pointed out by other people, that is 80% of a Data Scientist’s work. So, although Kaggle is a great tool to start your journey, it is not enough to take you to the end. You need to do other things to showcase in your Data Science portfolio.
And that is why I am trying to start a community — Build To Learn. It is a place where people can share their project ideas (weird ideas welcome!) or wishes for tools, and build them with the help of other members. It is a community of web developers, mobile app developers and ML engineers, so no matter what domain your idea or problem falls in, you can expect to get at least some help from your fellow members.
Let me know your thoughts in the comments section below.