A Data Scientist is a professional who stands at the confluence of technology, domain knowledge, and business to tackle the data revolution. A Data Scientist needs to be a mathematician, computer programmer, analyst, statistician, and effective communicator to turn insights into actions.
It is not just technical skills that make Data Scientist the most in-demand job of the 21st century; it takes a lot more. A Data Scientist is a professional who uses new-age tools to manage, analyse and visualize data.
Let us take an example to better understand a day in the life of a Data Scientist. On a typical day, a Data Scientist may be given an open-ended problem such as “We need our customers to stay longer and watch/read more content”. The following are a few steps they might start with:
Additional steps follow. The Data Scientist would build models, such as recommender engines, to actually increase the time spent on the website, then share results and fine-tune those models with business teams. Finally, they would move the work to a production environment, where it can be tested and put to use.
The above example is an over-simplified version of tasks a typical data scientist performs. Yet, it should give you a glimpse into how different skill-sets are utilized by such a professional.
Data Science can be defined in many ways. One of the most interesting and accurate definitions describes it as the fourth paradigm of science, the first three being experimental, theoretical and computational science. The fourth paradigm, Dr. Jim Gray explains, is the answer to the tremendous flood of data being collected and generated every day.
In simple words, Data Science is thus a new generation of scientific & computing tools which can help to manage, analyse and visualize such huge amounts of data.
The explanations of the terms Data Scientist and Data Science seem to indicate that this is a completely new field with its own set of techniques and tools. This is true to a certain degree, but not entirely. Data Science, as mentioned above, is at the confluence of technology, domain knowledge and business understanding. It therefore draws on tools and techniques from various fields to form an encompassing set of methodologies for turning data into insights.
Statistics has traditionally been the go-to subject for analysing data and testing hypotheses. Statistical methods are based on established theories and years of research.
Data Science and Statistics have similar goals (and overlapping techniques in certain cases), namely to use data to reach conclusions and share insights, but they are not the same. Statistics predates the computing era, while Data Science is a new-age amalgamation of interdisciplinary knowledge.
There is a never-ending debate on the definitions of Data Science and Statistics. The old school believes Data Science is merely a rebranding of Statistics, while the new-age experts strongly disagree. Amongst all this, an interesting and fairly accurate take on the issue was presented in an article on the Priceonomics website:
“Statistics was primarily developed to help people deal with pre-computer data problems like testing the impact of fertilizer in agriculture or figuring out the accuracy of an estimate from a small sample. Data science emphasizes the data problems of the 21st Century, like accessing information from large databases, writing code to manipulate data, and visualizing data.”
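To make "figuring out the accuracy of an estimate from a small sample" a little more concrete, here is a minimal sketch in Python using only the standard library. The yield figures are made up purely for illustration; the point is simply that the standard error of the mean gives a rough sense of how far a small-sample estimate might be from the true value.

```python
import math
import statistics

# Hypothetical crop yields (tonnes per hectare) from eight fertilized plots.
# These numbers are invented for illustration only.
yields = [4.1, 3.8, 4.4, 4.0, 4.3, 3.9, 4.2, 4.1]

n = len(yields)
mean_yield = statistics.mean(yields)

# Sample standard deviation (divides by n - 1, appropriate for a small sample).
sample_sd = statistics.stdev(yields)

# Standard error of the mean: how much the estimate is expected to vary
# from one sample of this size to another.
standard_error = sample_sd / math.sqrt(n)

print(f"mean = {mean_yield:.2f}, standard error = {standard_error:.3f}")
```

A smaller standard error means the sample mean is a more precise estimate; collecting more plots would shrink it roughly in proportion to the square root of the sample size.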
It is worth reiterating that Data Science is an interdisciplinary field. This makes sense, as Data Science is not limited to one field of study or industry. It is used in every field that generates, or can generate, data. It is no surprise, then, to see Data Scientists coming from varied academic backgrounds. Yet there are a few important skills such professionals have in common. The educational qualifications required to become a Data Scientist can be summarized as follows:
This is the trickiest part of the whole journey. While being interdisciplinary is good in most respects, it also presents a daunting question for beginners. Data Scientists are storytellers. They turn raw data into actionable insights, leveraging tools and techniques from various fields along the way. Yet generic programming skills remain the common denominator. Apart from programming skills, the following are a few important technical skills a Data Scientist usually has:
Though there are no hard and fast rules, most Data Scientists rely on programming/scripting languages like Python, R, Scala, Julia, Java or SAS to perform everyday tasks, from raw data to insights.
Turning data into insights is easier said than done. A typical Data Science project involves many important sub-tasks which need to be performed efficiently and correctly. Let us break down the learning path into milestones and discuss how to go about the journey.
There are a number of courses on platforms like Coursera and Udemy to get you started with these languages. Some of the courses are:
Julia and similar languages are up-and-coming options with a special focus on Data Science and Machine Learning. They have the advantage of treating Data Science as a core concept, unlike traditional languages which have been extended to cater to DS/ML needs. Again, deciding which language to choose boils down to personal choice and comfort.
Mathematics and Statistics give you the understanding needed to learn the tools and techniques that leverage Machine Learning to solve real-world problems. ML techniques expand a Data Scientist’s capability to handle data sets of different types and sizes. It is a vast subject in its own right, which can be broadly categorized into:
The following are a few helpful resources to get you started on the subject:
Dr. Jason Brownlee also discusses a detailed list of datasets on his blog, machinelearningmastery.com.
Apart from these datasets, there are regular competitions on Data Science problems on websites like Kaggle, Analytics Vidhya, KDnuggets and so on. It is worth participating in these competitions to learn the tricks of the trade from some of the seasoned performers.
Typically, Data Scientists present their portfolios along with their CVs so that interviewers and prospective employers can better understand their capabilities. Code repositories can be maintained on websites like GitHub, Bitbucket and so on. Maintaining a blog to share your findings, commentary and research with a broader audience, along with some self-promotion, is also quite common.
Each of these platforms provides you with an ecosystem of experts and recruiters who can help you land a job or a freelancing project. These platforms also give you an opportunity to fine-tune your skills and make them market-ready.
The educational requirements to become a Data Scientist were discussed previously. Apart from traditional quantitative fields of study, many reputed universities across the globe now offer specialized Data Science courses for undergraduate, graduate and online audiences. Some of the top US universities offering such courses are:
There are numerous courses from other top universities in Europe and Asia as well. MOOCs from platforms like Coursera, Udemy, Khan Academy and others have also gained popularity lately.
The role and responsibilities of a Data Scientist vary greatly from one organization to another. Since the life cycle of a data science project involves many intricate pieces, each important in its own right, a Data Scientist might be required to perform different tasks. Typically, a day in a Data Scientist’s life comprises one or more of the following tasks:
The above list is by no means exhaustive. Specific tasks may be required for specific organizations and/or scenarios. Depending upon the set of tasks assigned or the strengths of a particular individual, the Data Scientist role may have different facets to it. Some organizations divide the above set of tasks into specific roles like:
Though some organizations separate out the roles and responsibilities, others choose to have a common Data Scientist title.
The most-coveted job of the 21st century ought to come with an equally tempting salary, and the data confirms this hypothesis from several angles. Surveys from across the world have analysed the salaries of Data Scientists, and the results are astonishing.
The Burtch Works Study for Salaries of Data Scientists is one such survey:
The median base salary for Managers starts out at around $145K and goes up to $250K (for 10+ years of experience).
A survey by PromptCloud along similar lines tried to identify the skills required across different Data Scientist job postings. The results show Python as the topmost skill required, followed by SQL, R and others. This showcases how important Python and its ecosystem are to Data Science work and the community.
The Glassdoor 50 Best Jobs in America for 2018 ranks Data Scientist first, with an average salary of around USD 120k. The study also identifies other related Data Science job titles like Data Analyst and Quantitative Analyst.
Similar results from Payscale, LinkedIn and others reconfirm the fact: Data Scientists are highly sought after across the globe.
With advancements in compute and storage and the corresponding fall in hardware costs, technology is now part and parcel of almost every industry. From aerospace to mining, from the internet to farming, every sphere of commerce is generating an immense flood of data. Where there’s data, there’s data science. Almost every industry today is leveraging the benefits of Data Science.
These are some of the big names in their respective fields. There are a lot of start-ups along with small-medium sized enterprises that are also leveraging Data Scientists to make an impact in their respective fields.
Our discussion so far has revolved around Data Science and related concepts. In the same context, there’s another important term, Artificial Intelligence (AI). At times, terms like AI and Data Science are used interchangeably, while there are people who perceive them differently as well. To understand each side, let us first try to understand the term Artificial Intelligence.
Artificial Intelligence can be defined in many ways. The most consistent and commonly accepted definition states:
“The designing and building of intelligent agents that receive percepts from the environment and take actions that affect that environment”
The above definition comes from AI heavyweights Dr. Peter Norvig and Dr. Stuart Russell. In simple words, this definition highlights the presence of intelligent agents which act based on stimuli from the environment, and whose actions in turn have an effect on that environment. This sounds very similar to how we, as humans, function.
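The percept-action loop in this definition can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not code from Norvig and Russell: the thermostat agent, the `Room` environment, and every name and number below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Room:
    """Hypothetical environment: a room whose temperature the agent can change."""
    temperature: float

    def percept(self) -> float:
        # What the agent senses from the environment.
        return self.temperature

    def apply(self, action: str) -> None:
        # The agent's action affects the environment.
        if action == "heat":
            self.temperature += 1.0
        elif action == "cool":
            self.temperature -= 1.0
        # Any other action (e.g. "idle") leaves the room unchanged.


def thermostat_agent(percept: float, target: float = 21.0) -> str:
    """Hypothetical agent: maps a percept to an action."""
    if percept < target:
        return "heat"
    if percept > target:
        return "cool"
    return "idle"


def run(room: Room, steps: int) -> List[str]:
    """Repeat the cycle: receive a percept, choose an action, change the environment."""
    actions = []
    for _ in range(steps):
        action = thermostat_agent(room.percept())
        room.apply(action)
        actions.append(action)
    return actions


room = Room(temperature=18.0)
history = run(room, steps=5)
print(history, room.temperature)
```

Real agents are far more sophisticated, of course, but the shape is the same: the environment supplies percepts, the agent picks an action, and that action changes what the agent perceives next.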
The genesis of Artificial Intelligence as a field of study and research is credited to the famous Dartmouth workshop of 1956. The workshop was organized by John McCarthy and Marvin Minsky, amongst other prominent personalities from the computer science and AI space. It provided the first glimpse of intelligent systems/agents: programs that learned strategies for the game of checkers. These programs were reported to play better than average human beings by 1959, a remarkable feat in itself. Since then, the field of AI has gone through a great many changes and seen both theoretical and practical advancements.
The field of AI is focussed on maximizing an agent’s chances of achieving a stated goal. The goal can be termed simple (if it is only about winning or losing) or complex (taking next steps based on rewards from past moves). Based on these goal categories, AI has focussed on solving problems in the following high-level domains over the course of its history:
The ability to move and explore the environment is an important characteristic, heavily utilized in the robotics space. Industrial robots, robotic arms and the amazing machines from groups like Boston Dynamics are prime examples.
The domains of Learning Tasks, characterized as supervised and unsupervised learning, along with Natural Language Processing tasks, have traditionally been associated with AI. Yet, with recent advancements in these fields, they are sometimes seen as separate from, or no longer part of, AI. This is known as the AI effect or Tesler’s Theorem. The AI effect simply states:
“AI is whatever hasn’t been done yet”
On the same grounds, OCR (optical character recognition), speech translation and others have become everyday technologies. As a result of this advancement, these technologies are no longer considered part of AI research.
Before we move on, there is another important detail about AI. Artificial Intelligence is divided into two broad categories. These are:
Deep AI or AGI seems like a far-fetched dream, yet advancements like Transfer Learning and Reinforcement Learning techniques are steps in the right direction.
Now that we understand Artificial Intelligence and its history, let us try to understand how it differs from Data Science. Data Science, as we know, is an amalgamation of tools and techniques from different fields (similar to AI). From the above discussion, we see that there is a definite overlap between the definition of weak/narrow AI and Data Science tasks. Yet Data Science is considered to be more data-driven and more focussed on business outcomes and objectives. It is a more application-oriented study and use of tools and techniques. Though there are certain overlaps and similarities in the areas of research and tools, Data Science and AI are certainly not the same. It would even be hard to describe them as subset and superset. They are best seen as interdisciplinary fields which make the best of uncertainties.
Data Science has been THE keyword for every industry for quite a few years now. In this article on What is a Data Scientist, we covered a lot of ground in terms of concepts and related aspects. The aim was to help you understand what really makes Data Scientist the “top and trending” job of the 21st century.
The discussion started off with a formal definition of Data Science and how it is ushering in the fourth paradigm to tackle this constant flood of data. We then briefly touched upon the subtle differences between Data Science and Statistics, along with the point of contention between the experts from the two fields. We also presented an honest opinion on what it takes, in terms of technical skills and educational qualifications, to become a Data Scientist. Sure, it is cool to be one, but it is not as easy as it seems.
Along with the skills, we touched upon the learning path to become a Data Scientist. In that section, we covered everything from the fundamental concepts one should know to advanced techniques like Reinforcement Learning.
The world faces a deep shortage of Data Scientists. Top universities have taken up the challenge of upskilling the existing and next generation of the workforce. We discussed some of the courses being offered by these universities from across the globe. We also touched upon different companies that are hiring Data Scientists, and at what salaries.
In the final leg, we introduced concepts related to Artificial Intelligence. It is imperative to understand how different yet overlapping Data Science and AI are.
With this, we hope you are equipped to get started on your journey to become a Data Scientist and contribute. If you are already working in this space, this article aimed to demystify some commonly used terms and provide a high-level overview of Data Science.