How to Become a Data Scientist


Raghav Bali
22nd Mar, 2019

What is a Data Scientist?

A Data Scientist is a professional who stands at the confluence of technology, domain knowledge, and business to tackle the data revolution. A Data Scientist needs to be a mathematician, computer programmer, analyst, statistician, and effective communicator to turn insights into actions.

Figure: Data Scientist Venn diagram

It is not just technical skills that make Data Scientist the most in-demand job of the 21st century; it takes a lot more. A Data Scientist is a professional who uses new-age tools to manage, analyse and visualize data.

Let us take an example to better understand a day in the life of a Data Scientist. On a typical day, a Data Scientist may be given an open-ended problem such as “We need our customers to stay longer and watch/read more content”. The following are a few steps they might start with:

  • The Business Hat

    The first job of the Data Scientist is to translate this problem statement into a quantifiable data science problem. To do so, they might first identify how much time users currently spend and then discuss with the business teams how to quantify “more”.
  • The Programming Hat

    They would then move on to data collection. This means working with different teams to understand what kind of data is available and what the analysis requires. Once it is clear what data is needed and where it lives, they would extract and prepare it for analysis.
  • The Analytical Hat

    Here they would use their analytical and statistical skills to ask important questions of the data. This typically involves exploratory analysis, descriptive analysis and so on.
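
The three hats can be made concrete with a minimal Python sketch. Everything in it is hypothetical: the session records, the user ids and the 10% uplift target are made up for illustration, and a real project would pull the data from the organization's own logs and agree on the target with the business teams.

```python
from statistics import mean, median

# Hypothetical session log: (user_id, minutes spent in one session).
sessions = [
    ("u1", 12.0), ("u1", 8.0),
    ("u2", 30.0),
    ("u3", 5.0), ("u3", 7.0), ("u3", 6.0),
]


def minutes_per_user(records):
    """Programming hat: aggregate raw session records into total minutes per user."""
    totals = {}
    for user_id, minutes in records:
        totals[user_id] = totals.get(user_id, 0.0) + minutes
    return totals


def baseline_summary(totals):
    """Analytical hat: describe the time users currently spend."""
    values = list(totals.values())
    return {"users": len(values), "mean": mean(values), "median": median(values)}


def target_minutes(baseline, uplift=0.10):
    """Business hat: turn "more" into a number (the 10% uplift is an assumption)."""
    return baseline * (1 + uplift)


totals = minutes_per_user(sessions)
summary = baseline_summary(totals)
target = target_minutes(summary["median"])
print(summary)
print(f"target median minutes per user: {target:.1f}")
```

The median is used as the baseline here because a few very heavy users can pull the mean upwards; which statistic to track is itself something to settle with the business teams.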

After this, the data scientist would build models that actually improve the time spent on the website (for example, recommender engines), share the results with the business teams and fine-tune the models based on their feedback. The models would then be taken to a production environment, where they can be tested and finally used.

The above example is an over-simplified version of the tasks a typical data scientist performs. Still, it should give you a glimpse of how such a professional uses different skill sets.

Data Science vs. Statistics

Data Science can be defined in many ways. One of the most interesting definitions describes it as the fourth paradigm of science, the first three being experimental, theoretical and computational science. The fourth paradigm, Dr. Jim Gray explains, is the answer to the tremendous flood of data being collected and generated every day.

In simple words, Data Science is thus a new generation of scientific & computing tools which can help to manage, analyse and visualize such huge amounts of data.

The explanation of the terms Data Scientist and Data Science seems to indicate that it is a completely new field with its own set of techniques and tools. This is true to a certain degree, but not entirely. Data Science, as mentioned above, is at the confluence of technology, domain knowledge and business understanding. Thus, it draws on tools and techniques from various fields to form an encompassing set of methodologies for turning data into insights.

Statistics has traditionally been the go-to subject for analysing data and testing hypotheses. Statistical methods are based on established theories and years of research.

Statistics of data science

Even though Data Science and Statistics have similar goals (and overlapping techniques in certain cases), i.e. to use data to reach conclusions and share insights, they are not the same. Statistics predates the computing era, while Data Science is a new-age amalgamation of interdisciplinary knowledge.

There is a never-ending debate on the definitions of Data Science and Statistics. The old school believes Data Science is merely a rebranding of Statistics, while the new-age experts strongly disagree. Amid all this, an interesting and somewhat accurate take on the issue was presented in an article on the Priceonomics website:

“Statistics was primarily developed to help people deal with pre-computer data problems like testing the impact of fertilizer in agriculture or figuring out the accuracy of an estimate from a small sample. Data science emphasizes the data problems of the 21st Century, like accessing information from large databases, writing code to manipulate data, and visualizing data.”
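To make the "accuracy of an estimate from a small sample" in the quote concrete, here is a minimal sketch of the classical calculation. The yield figures are hypothetical, and the interval uses a normal approximation (z = 1.96) as a simplifying assumption; for a sample this small, a t-distribution critical value would normally be used and would give a wider interval.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical crop yields (tonnes per hectare) from a small fertilizer trial.
yields = [4.1, 3.8, 4.4, 4.0, 4.2]

def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))

def approx_interval(sample, z=1.96):
    """Rough 95% interval via the normal approximation (an assumption;
    a t critical value would be wider for a sample this small)."""
    centre = mean(sample)
    margin = z * standard_error(sample)
    return centre - margin, centre + margin

low, high = approx_interval(yields)
print(f"mean={mean(yields):.2f}, interval=({low:.2f}, {high:.2f})")
```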

Educational Qualifications to become a Data Scientist

It is worth reiterating that Data Science is an interdisciplinary field. It is not limited to one field of study or industry; it is used in every field that generates, or can generate, data. It is therefore no surprise to see Data Scientists coming from varied academic backgrounds. Yet such professionals share a few important skills. The educational qualifications required to become a Data Scientist can be summarized as follows:

  • A graduate degree in a quantitative field of study. Mathematics, computer science, engineering, statistics, physics, social science, economics and related fields are most common.
  • Newer options like bootcamps and MOOCs (Massive Open Online Courses) are quite popular among professionals pivoting into Data Science.
  • An advanced degree in the form of a Master’s or even a PhD certainly helps. Increasingly, many Data Science professionals hold such advanced degrees.

Technical Skills required to become a Data Scientist

This is the trickiest part of the whole journey. While being interdisciplinary is good in most respects, it also presents a daunting question for beginners: which skills to learn first. Data Scientists are storytellers. They turn raw data into actionable insights, leveraging tools and techniques from various fields along the way. Yet generic programming skills remain the common denominator. Apart from programming skills, the following are a few important technical skills a Data Scientist usually has:

  • Mathematical background/understanding (linear algebra, calculus, and probability are important)
  • Machine Learning concepts and algorithms.
  • Statistical concepts (hypothesis testing, sampling techniques and so on)
  • Computer Science/Software Engineering skills (data structures, algorithms)
  • Data Visualization skills (tools like d3.js, ggplot, matplotlib, etc)
  • Data Handling (RDBMS, Big Data tools like Hive, Spark)

Though there are no hard and fast rules, most Data Scientists rely on programming/scripting languages like Python, R, Scala, Julia, Java or SAS to perform everyday tasks, from raw data to insights.

Learning Path for Data Scientist - From Fundamentals, Statistics to Problem Solving

Turning data into insights is easier said than done. A typical Data Science project involves a lot of important sub-tasks which need to be performed efficiently and correctly. Let us break down the learning path into milestones and discuss how to go about the journey.

  • Step 1: Select a Programming Language

    R and Python are widely accepted and used programming languages in the data science community. There are other languages like Java, Scala, Julia, Matlab and, to a certain extent, even SAS. Yet R and Python have huge ecosystems and communities that make them better every day. Though there is no such thing as the best programming language for Data Science, there are some popular favorites. When starting off on your Data Science journey, it may be confusing which one to choose. The following are a few pointers that might be helpful:
    • R

      R is the most popular language when it comes to statistical analysis and time series modeling. It also has a good number of machine learning algorithms and visualization packages. It can have a peculiar learning curve, yet it is good for exploring your data, one-off projects or quick prototypes. It is also usually the go-to language for academic reports and research papers.
    • Python

      Python is one of the most widely used programming languages. It is also sometimes referred to as a popular scientific language. Its ever-expanding community, ease of writing code, ecosystem, and support are reasons for its popularity. Python packages like NumPy, pandas, and scikit-learn (sklearn) enable Data Scientists and researchers to work with matrices and other mathematical concepts with ease.
    • The Java Family

      R and Python are great languages and are of great help when it comes to quick prototyping (though that is changing slowly, with Python being used in production as well). The heavyweights of the industry are still the languages from the Java family. Java itself is a mature and proven technology with an extensive list of packages for machine learning, natural language processing and so on. Scala derives heavily from Java and is one of the go-to languages for handling big data.

There are a number of courses on platforms like Coursera and Udemy to get you started with these languages. Some of the courses are:

Julia and languages of its type are up-and-coming, with a special focus on Data Science and Machine Learning. These languages have the advantage of treating Data Science as a core concept, unlike traditional languages which have been extended to cater to DS/ML needs. Again, deciding which language to choose boils down to personal choice and comfort.

  • Step 2: Learn Statistics and Mathematics

    These are the basic concepts required to understand the intricacies of more involved ones. The most essential ones are:
  • Step 3: Power Up with Machine Learning

Mathematics and Statistics give you the understanding needed to learn the tools and techniques of Machine Learning and use them to solve real-world problems. ML techniques expand a Data Scientist’s ability to handle data sets of different types and sizes. It is a vast subject in its own right, which can be broadly categorized into:

  • Supervised Methods like classification and regression algorithms
  • Unsupervised Methods like different clustering techniques
  • Reinforcement Learning like Q-learning, etc.
  • Deep Learning (spanning across the above three types, it is slowly emerging as a specialized field of its own)
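The difference between the first two categories above can be sketched in a few lines. This is a minimal illustration written with the standard library only, not how one would do it in practice (scikit-learn or similar would be typical). The data and variable meanings are hypothetical. The supervised example learns from labeled pairs (x, y); the unsupervised one groups points that carry no labels.

```python
from statistics import mean

def fit_line(xs, ys):
    """Supervised: least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def kmeans_1d(points, centres, rounds=10):
    """Unsupervised: group unlabeled 1-D points around k centres."""
    for _ in range(rounds):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Keep the old centre if a cluster ends up empty.
        centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters

# Hypothetical labeled data: hours of content watched vs. months retained.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# Hypothetical unlabeled data: session lengths in minutes.
centres, clusters = kmeans_1d([1, 2, 3, 20, 21, 22], centres=[0, 10])
print(centres, clusters)
```

The starting centres passed to `kmeans_1d` are an arbitrary assumption; real implementations usually pick them randomly or with a seeding scheme and run the algorithm several times.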

The following are a few helpful resources to get you started on the subject:

Dr. Jason Brownlee also discusses a detailed list of datasets on his blog, machinelearningmastery.com.
Apart from these datasets, there are regular competitions on Data Science problems on websites like Kaggle, Analytics Vidhya, KDnuggets and so on. It is worth participating in these competitions to learn the tricks of the trade from some of the seasoned performers.

  • Step 5: Build a Portfolio

    Just like a photographer or a painter, a Data Scientist is very much an artist. While working on different datasets and competitions, you can build a portfolio of your completed work to showcase your findings and learnings. This will not only help you showcase your talent but also give you a view of your progress as you learn new and complex methods. A machine learning/data science portfolio is a collection of independent projects which use machine learning in one way or another. A typical machine learning portfolio can give you the following benefits:
    • Showcase: your skill set and technical understanding
    • Reusable code base: As you work on more and more projects, there are certain components which would be required time and again. Your portfolio can be a repository of such reusable components.
    • Progress Map: A portfolio is also a map of your progress over time. With every project, you would be getting better and learning new complex concepts. This is a great way to keep yourself motivated as well.

Typically, Data Scientists use their portfolios along with their CVs in interviews to give prospective employers a better understanding of their capabilities. Code repositories can be maintained on websites like GitHub, Bitbucket and so on. Maintaining a blog to share your findings, commentary and research with a broader audience, along with some self-promotion, is also quite common.

Each of these platforms provides you with an ecosystem of experts and recruiters who can help you land a job or a freelancing project. These platforms also provide you with an opportunity to fine tune your skills and make them market ready.

Top Universities offering a Data Scientist Course

The educational requirements to become a Data Scientist were discussed previously. Apart from traditional quantitative fields of study, many reputed universities across the globe also offer specialized Data Science courses for undergraduate, graduate and online audiences. Some of the top US universities offering such courses are:

1. Information Technology and Data Management courses at the Colorado Technical University

  • Course Name: Professional Master of Science in Computer Science
  • Course Duration: 2 years
  • Location: Boulder, Colorado
  • Courses: Machine Learning, Neural Networks, and Deep Learning, Natural Language Processing, Big Data, HCC Big Data Computing and many more
  • Tracks available: Data Science and Engineering
  • Credits: 30

2. MS in Data Science, Columbia University

  • Course Name: Master of Science in Data Science
  • Course Duration: 1.5 years
  • Location: New York City, New York
  • Core courses: Probability Theory, Algorithms for Data Science, Statistical Inference and Modelling, Computer Systems for Data Science, Machine Learning for Data Science, and Exploratory Data Analysis and Visualization
  • Credits: 30

3. MS in Computational Data Science, Carnegie Mellon University

  • Course Name: Master of Computational Data Science
  • Course duration: 2 years
  • Location: Pittsburgh, Pennsylvania
  • Core courses: Machine Learning, Cloud Computing, Interactive Data Science, and Data Science Seminar
  • Tracks available: Systems, Analytics, and Human-Centered Data Science
  • Units to complete: 144

4. MS in Data Science, Stanford University

  • Course Name: M.S. in Statistics: Data Science
  • Course Duration: 2 years
  • Location: Stanford, California
  • Core courses: Numerical Linear Algebra, Discrete Mathematics and Algorithms, Optimization, Stochastic Methods in Engineering or Randomized Algorithms and Probabilistic Analysis, Introduction to Statistical Inference, Introduction to Regression Models and Analysis of Variance or Introduction to Statistical Modeling, Modern Applied Statistics: Learning, and Modern Applied Statistics: Data Mining
  • Tracks available: The program in itself is a track
  • Units to complete: 45

5. MS in Analytics, Georgia Institute of Technology

  • Course Name: Master of Science in Analytics
  • Course Duration: 1 year
  • Location: Atlanta, Georgia
  • Core courses: Big Data Analytics in Business, and Data and Visual Analytics
  • Tracks available: Analytical Tools, Business Analytics, and Computational Data Analytics
  • Credits: 36

There are numerous courses from other top universities in Europe and Asia as well. MOOCs from platforms like Coursera, Udemy, Khan Academy and others have also gained popularity lately.

Roles and Responsibilities of a Data Scientist - What does a Data Scientist do?

The roles and responsibilities of a Data Scientist vary greatly from one organization to another. Since the life cycle of a data science project involves a lot of intricate pieces, each with its own importance, a Data Scientist might be required to perform different tasks. Typically, a day in a Data Scientist’s life comprises one or more of the following tasks:

  • Formulate open-ended questions and perform research into different areas
  • Extract data from different sources from within and outside the organization
  • Develop ETL pipelines to prepare data for analysis
  • Employ sophisticated statistical and/or machine learning techniques/algorithms to solve problems at hand
  • Perform exploratory and descriptive analysis of data
  • Visualize data at different stages of the project
  • Tell the story: communicate results and findings to end-consumers/IT teams/business teams
  • Deploy intelligent solutions to automate tasks
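Of the tasks listed, the ETL (extract, transform, load) pipeline is perhaps the easiest to show in miniature. The sketch below is an illustration under stated assumptions: the CSV content, the column names and the `sessions` table are all hypothetical, and an in-memory SQLite database stands in for whatever warehouse an organization actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """user_id,minutes,country
u1,12.5,US
u2,,IN
u3,40,us
"""

def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with missing minutes, cast types, normalize case."""
    return [
        (r["user_id"], float(r["minutes"]), r["country"].upper())
        for r in rows
        if r["minutes"].strip()
    ]

def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions "
        "(user_id TEXT, minutes REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO sessions VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute(
    "SELECT country, SUM(minutes) FROM sessions GROUP BY country"
).fetchall())
```

Keeping the three stages as separate functions is a design choice rather than a requirement: it lets each stage be tested on its own and swapped out (for example, a different source in `extract`) without touching the others.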

The above list is by no means exhaustive. Specific tasks may be required for specific organizations and/or scenarios. Depending upon the set of tasks assigned or the strengths of a particular individual, the Data Scientist role may have different facets to it. Some organizations divide the above set of tasks into specific roles like:

  • Data Engineer: concentrates more on developing ETL pipelines and Big Data infrastructure.
  • Data Analyst: concentrates on hypothesis testing, A/B testing and so on
  • BI Analyst: concentrates on visualizations, BI reporting and so on
  • Machine Learning/Data Science Engineer: concentrates on implementing ML solutions into production systems
  • Research Scientist: concentrates on researching new techniques, open-ended problems, etc.

Though some organizations separate out the roles and responsibilities, others choose to have a common Data Scientist title.
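The A/B testing mentioned under the Data Analyst role can be sketched as a two-proportion z-test. This is one common way to compare two conversion rates, shown here as an illustration rather than as the method any particular team uses. The conversion counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: variant B converts 120/1000 vs. 100/1000 for A.
z, p_value = two_proportion_z_test(100, 1000, 120, 1000)
print(f"z={z:.2f}, p={p_value:.3f}")
```

A small p-value suggests the observed difference would be unlikely if both variants truly converted at the same rate; the threshold for "small" (often 0.05) is a convention to agree on before running the test.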

Salaries of a Data Scientist

The title of the most-coveted job of the 21st century ought to come with an equally tempting salary, and the data confirms this. Different surveys from across the world have analysed the salaries of Data Scientists, and the results are astonishing.

The Burtch Works Study for Salaries of Data Scientists is one such survey:

  • The survey points out that, after the peak increases in data scientist salaries across different levels in 2015-2016, salaries for 2018 have been more or less steady at the previous year's levels.
  • The median base salary for a starting position is around $95K, rising to $165K for 9+ years of experience (for individual contributors).
  • The median base salary for managers starts at around $145K and goes up to $250K (for 10+ years of experience).

A survey by PromptCloud along similar lines tried to identify the skills required in different Data Scientist job postings. The results show Python as the topmost required skill, followed by SQL, R and others. This shows how important Python and its ecosystem are to Data Science work and the community.

skills in the Job requirement for data scientist

The Glassdoor 50 Best Jobs in America for 2018 rates Data Scientist as number one, with an average salary of around USD 120k. The study also identifies related Data Science job titles like Data Analyst and Quantitative Analyst.

Data Scientist salaries

Similar results from Payscale, LinkedIn and others reconfirm the fact: Data Scientists are highly sought after across the globe.

Top companies hiring Data Scientist

With the advancements in compute & storage and corresponding lowering of cost for hardware, technology is part and parcel of almost every industry. From aerospace to mining, from the internet to farming, every sphere of commerce is generating an immense flood of data. Where there’s data, there’s data science. Almost every industry today is leveraging the benefits of Data Science.

Some of the top companies hiring for Data Scientists are:

Google      Twitter   GE-Health            HP
Microsoft   Airbnb    GE-Aviation          IBM
Apple       Uber      UnitedHealth Group   Intel
Facebook    Amazon    Boeing               American Express

These are some of the big names in their respective fields. There are a lot of start-ups along with small-medium sized enterprises that are also leveraging Data Scientists to make an impact in their respective fields.

How is Data Science different from Artificial Intelligence?

Our discussion so far has revolved around Data Science and related concepts. In the same context, there’s another important term, Artificial Intelligence (AI). There are times when terms like AI and Data Science are used interchangeably, while there are people who perceive them differently as well. To understand each side, let us first try to understand the term Artificial Intelligence.

Artificial Intelligence can be defined in many ways. The most consistent and commonly accepted definition states:

“The designing and building of intelligent agents that receive percepts from the environment and take actions that affect that environment”

The above definition comes from AI heavyweights Dr. Peter Norvig and Dr. Stuart Russell. In simple words, this definition highlights the presence of intelligent agents which act based on stimulus from the environment, which in turn has an effect on the environment as well. Sounds very similar to how we, as humans, function.
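The loop in this definition (percept in, action out, environment changed) can be illustrated with a toy example. The sketch below is a deliberately simple reflex agent, not a learning system; the `Room` environment, the thermostat rule and all numbers are hypothetical.

```python
class Room:
    """A toy environment: a room whose temperature the agent can change."""

    def __init__(self, temperature):
        self.temperature = temperature

    def percept(self):
        """What the agent senses: the current temperature."""
        return self.temperature

    def apply(self, action):
        """The action feeds back into the environment."""
        self.temperature += {"heat": 1.0, "cool": -1.0, "idle": 0.0}[action]

def thermostat_agent(percept, target=21.0, tolerance=0.5):
    """A simple reflex agent: map the current percept directly to an action."""
    if percept < target - tolerance:
        return "heat"
    if percept > target + tolerance:
        return "cool"
    return "idle"

room = Room(temperature=18.0)
history = []
for _ in range(5):
    action = thermostat_agent(room.percept())
    room.apply(action)
    history.append((action, room.temperature))
print(history)
```

The agent heats the room until the temperature is within tolerance of the target and then idles, which shows both halves of the definition: the action depends on the percept, and the next percept depends on the action.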

The genesis of Artificial Intelligence as a field of study and research is credited to the famous Dartmouth workshop of 1956. The workshop was organized by John McCarthy and Marvin Minsky, among other prominent personalities from the computer science and AI space. It provided the first glimpse of intelligent systems/agents: programs that were learning strategies for the game of checkers and that were reported to play better than the average human being by 1959, a remarkable feat in itself. Since then, the field of AI has gone through a great many changes and theoretical and practical advancements.

The field of AI is focused on maximizing the agent’s chances of achieving a stated goal. The goal can be termed simple (if it's only about winning or losing) or complex (take the next steps based on rewards from past moves). Based on these goal categories, AI has focused on solving problems in the following high-level domains over the course of its history:

  • Knowledge Representation

    This is one of the core concepts in classical AI research. As part of Knowledge Representation or Knowledge Engineering, we try to capture the world knowledge (where world is some specific narrow domain) possessed by experts. This was the foremost area of research for expert systems. The field of Ontology is highly associated with Knowledge Representation.
  • Problem Solving and Reasoning Tasks

    This is one of the earliest areas of research. Here, researchers focused on mimicking human reasoning step by step for tasks such as puzzle solving and logical deduction.
  • Perception

    The ability to utilize input from different sensors such as microphones, cameras, radars, temperature sensors and so on for decision making. This is also termed as Machine Perception with modern day applications like speech recognition, object detection and so on.
  • Motion and Manipulation

    The ability to move and explore the environment is an important characteristic, heavily used in the robotics space. Industrial robots, robotic arms and the amazing machines from groups like Boston Dynamics are prime examples.

  • Social Intelligence

    It is considered one of the more far-fetched goals, wherein intelligent systems are expected to understand human emotions and motives in order to make decisions. Current-day virtual assistants (the likes of Google Assistant, Alexa, Cortana, etc.) provide a glimpse of such capabilities in their ability to converse, joke and make small talk.

The domains of Learning Tasks, characterized as supervised and unsupervised learning, along with Natural Language Processing tasks have traditionally been associated with AI. Yet, with recent advancements in these fields, they are sometimes seen as separate from, or no longer part of, AI. This is known as the AI effect or Tesler’s Theorem. The AI effect simply states:

“AI is whatever hasn’t been done yet”

On the same grounds, OCR (optical character recognition), speech translation and others have become everyday technologies, and as a result they are no longer considered part of AI research.

Before we move on, there is another important detail about AI. Artificial Intelligence is categorized into two broad categories. These are:

  • Narrow AI

    Also termed weak AI. This category is focused on tractable AI tasks. Specifically, most current-day research is focused on narrow tasks like developing autonomous vehicles, automated speech recognition, machine translation and so on. These areas work towards building intelligent systems which mimic human-level performance but are limited to specific areas only.
  • Deep AI

    This is also termed as strong AI or better, Artificial General Intelligence. If an intelligent agent is capable of performing any intellectual task, it is considered to possess Artificial General Intelligence. AGI is considered to be a summation of knowledge representation, reasoning, planning, learning, and communication.

Deep AI or AGI seems like a far-fetched dream, yet advancements like Transfer Learning and Reinforcement Learning techniques are steps in the right direction.

Artificial intelligence

Now that we understand Artificial Intelligence and its history, let us try to understand how it differs from Data Science. Data Science, as we know, is an amalgamation of tools and techniques from different fields (similar to AI). From the above discussion, we see that there is a definite overlap between the definition of weak/narrow AI and Data Science tasks. Yet Data Science is considered to be more data-driven and focused on business outcomes and objectives. It is a more application-oriented study and use of tools and techniques. Though there are certain overlaps and similarities in the areas of research and tools, Data Science and AI are certainly not the same. It would even be hard to treat them as subset and superset. They are best seen as interdisciplinary fields which make the best of uncertainties.

Summary

Data Science has been THE keyword for every industry for quite a few years now. In this article on what a Data Scientist is, we covered a lot of ground in terms of concepts and related aspects. The aim was to help you understand what really makes Data Scientist the “top and trending” job of the 21st century.

The discussion started off with a formal definition of Data Science and how it is ushering in the fourth paradigm to tackle the constant flood of data. We then briefly touched upon the subtle differences between Data Science and Statistics, along with the point of contention between the experts from the two fields. We also presented an honest opinion on what it takes, in terms of technical skills and educational qualifications, to become a Data Scientist. Sure, it is cool to be one, but it is not as easy as it seems.

Along with the skills, we touched upon the learning path to become a Data Scientist. In this section, we covered everything from the fundamental concepts one should know to advanced techniques like Reinforcement Learning.

The world is in deep shortage of Data Scientists. Top universities have taken up this challenge to upskill the existing and next generation of workforce. We discussed some of the courses being offered by these universities from across the globe. We also touched upon different companies that are hiring data scientists and at what salaries.

In the final leg, we introduced concepts related to Artificial Intelligence. It is imperative to understand how different yet overlapping Data Science and AI are.

With this, we hope you are equipped to get started on your journey to become a Data Scientist and contribute. If you are already working in this space, the article aims to demystify some commonly used terms and provide a high-level overview of Data Science.

Raghav Bali

Blog Author

Raghav Bali is a Senior Data Scientist at one of the world's largest health care organizations. His work involves research & development of enterprise level solutions based on Machine Learning, Deep Learning and Natural Language Processing for Healthcare & Insurance related use cases. In his previous role at Intel, he was involved in enabling proactive data driven IT initiatives using Natural Language Processing, Deep Learning and traditional statistical methods.
