What is Data Science?
Data Science is a multidisciplinary field that uses scientific inference and mathematical algorithms to extract meaningful knowledge and insights from large amounts of structured and unstructured data. These algorithms are implemented as computer programs, which are usually run on powerful hardware because they require a significant amount of processing. Data Science combines statistics, machine learning, data analysis and visualization, domain knowledge, and computer science.
As is apparent from the name, the most important component of Data Science is “Data” itself. No amount of algorithmic computation can draw meaningful insights from improper data. Data Science involves various types of data, for example, image data, text data, video data, and time-dependent data.
History of Data Science
The term “Data Science” has been mentioned in various contexts over the past thirty years, but it is only recently that it became internationally established and recognized. The term became a buzzword in 2012, when Harvard Business Review called the Data Scientist role “The Sexiest Job of the 21st Century”.
Origin of the Concept
Though it is unclear when and where the concept was originally developed, William S. Cleveland proposed “Data Science” as an independent discipline in 2001. Shortly thereafter, in April 2002 and January 2003 respectively, the launch of the “CODATA Data Science Journal” by the International Council for Science: Committee on Data for Science and Technology and of the “Journal of Data Science” by Columbia University kickstarted the journey of Data Science.
It was also around this time that the “dot-com” bubble was in full swing, which led to the widespread adoption of the internet and, in turn, the generation of a huge amount of data. This, together with advances in technology that made computation faster and cheaper, launched the concept of “Data Science” to the world.
Recent Additions to the Field of Data Science
The field of Data Science has been expanding ever since its onset in the early 2000s. With time, more and more cutting-edge technologies are being incorporated into the field. Some of the more recent additions are listed below:
- Artificial Intelligence: Machine Learning has been one of the core elements of Data Science. However, with the increased parallel compute capabilities, Deep Learning has been the latest and one of the most significant additions to the Data Science field.
- Smart Apps or Intelligent Systems: The development of data-driven intelligent applications and their accessibility in a portable form factor has led to the inclusion of a part of this field in Data Science. This is primarily because a large portion of Data Science is built around Machine Learning, which is also what Smart Apps and Intelligent Systems are based on.
- Edge Computing: Edge computing is a recently developed concept and is related to IoT (Internet of Things). Edge computing puts the Data Science pipeline of information collection, delivery, and processing closer to the source of information. This is achievable through IoT and has recently become a part of Data Science.
- Security: Security has been a major challenge in the digital space. Malware injection and hacking are quite common, and all digital systems are vulnerable to them. Fortunately, there have been a few recent technological advancements that apply Data Science techniques to prevent the exploitation of digital systems. For example, Machine Learning techniques have proven more capable of detecting computer viruses and malware than traditional algorithms.
Blurring the lines between Data Science and Data Analytics
The buzzwords “Data Science” and “Data Analytics” are often used interchangeably. Even though these two fields are closely related, they do not mean the same thing. In summary, Data Science is an umbrella term which consists of the fields of Machine Learning, Data Analytics, and Data Mining.
In terms of job description, a “Data Scientist” and a “Data Analyst” also work with different but related technologies.
|Parameters|Data Scientist|Data Analyst|
|---|---|---|
|Definition|A person who is skilled at handling a huge amount of data to build models and extract meaningful insights from them with the help of statistical and machine learning algorithms using computer science concepts.|A person whose primary job is to sift through a huge amount of data, wrangle and visualize them and determine what insights the data is hiding.|
|Skills|Machine Learning, Statistics, Data Visualization, Databases, Software Engineering, Data Mining, Domain Knowledge|Statistics, Data Visualization, Data Wrangling, Databases, Data Mining|
|Technologies|Python, R, SQL, AWS, Machine Learning Libraries|Java, Hadoop, Hive, Spark, AWS, SQL, Tableau|
Role of Big Data in Data Science
The term “Big Data” refers to a large collection of structured, semi-structured, or unstructured heterogeneous data. Traditional databases are usually not capable of handling such voluminous datasets.
As mentioned earlier, the key component of Data Science is data. As a rule of thumb, “the more data, the better the insights”. Hence, Big Data plays a very important role in the field of Data Science. Big Data is characterized by its variety and volume, both of which are essential for Data Science. Data Science captures the complex patterns in Big Data by developing Machine Learning models and algorithms.
Applications of Data Science
Data Science can be applied in almost every industry to solve complex problems. Each company applies Data Science to a different application with a view to solving a different problem. Some companies depend completely on Data Science and Machine Learning techniques to solve problems that could not otherwise have been solved. Some of these applications of Data Science, and the companies behind them, are listed below.
- Internet Search Results (Google): When a user searches for something on Google, complex Machine Learning algorithms determine which are the most relevant results for the search term(s). These algorithms help to rank pages such that the most relevant information is provided to the user at the click of a button.
- Recommendation Engine (Spotify): Spotify is a music streaming service that is popular for its ability to recommend music that matches the taste of the user. This is a very good example of Data Science at play. Spotify’s algorithms use the data generated by each user over time to learn the user’s taste in music and recommend similar music in the future. This allows the company to attract more users, since the service is convenient to use and does not demand much attention.
- Intelligent Digital Assistants (Google Assistant): Google Assistant, similar to other voice or text-based digital assistants (also known as chatbots) is one example of advanced Machine Learning algorithms put to use. These algorithms are able to convert the speech of a person (even with different accents and languages) to text, understand the context of the text/command and provide relevant information or perform a desired task, all just by speaking to the device.
- Autonomous Driving Vehicle (Waymo): Autonomous vehicles are at the bleeding edge of technology. Companies like Waymo use high-resolution cameras and LIDARs to capture live video and 3D maps of the surroundings, which are fed through Machine Learning algorithms that assist in driving the car autonomously. Here, the data is the videos and 3D maps captured by the sensors.
- Spam Filter (Gmail): Another key application of Data Science which we use in our day-to-day life is the spam filters in our emails. These filters automatically separate the spam emails from the rest, effectively giving the user a much cleaner email experience. Just like the other applications, Data Science is the key building block here.
- Abusive Content and Hate Speech Filter (Facebook): Similar to the spam filter, Facebook and other social media platforms use Data Science and Machine Learning algorithms to keep abusive and age-restricted content away from unintended audiences.
- Robotics (Boston Dynamics): A key component of Data Science is Machine Learning, which is exactly what fuels most of the robotics operations. Companies like Boston Dynamics are at the forefront of the robotics industry and develop autonomous robots that are capable of humanoid movements and actions.
- Automatic Piracy Detection (YouTube): Most videos that are uploaded to YouTube are original content created by content creators. However, quite often, pirated and copied videos are also uploaded to YouTube, which is against their policy. Due to the sheer volume of daily uploads, it is not possible to manually detect and take down such pirated videos. This is where Data Science is used to automatically detect pirated videos and remove them from the platform.
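To make the spam filter example above more concrete, here is a minimal sketch of a naive Bayes text classifier written with only the Python standard library. It is an illustration under stated assumptions, not a description of how Gmail actually filters mail: the training messages are hand-made and hypothetical, and the function names `train` and `classify` are chosen only for this example.

```python
import math
from collections import Counter

# Tiny hand-made training set (hypothetical examples, not real data).
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("limited offer win cash now", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(examples):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability under naive Bayes."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: how common the class is in the training set.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("claim your free prize", model))
print(classify("project meeting on monday", model))
```

The same pattern, learning word statistics from labeled examples and then scoring new text, is one simple way such filters can be built. Production systems typically use far larger datasets and more sophisticated models.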
The Life Cycle of Data Science
Data Science is not a single-step process. It involves several steps, which are listed below.
- Project Analysis: This step is more inclined towards Project Management and Resource Assessment than towards a direct implementation of algorithms. Instead of starting a project blindly, it is crucial to determine the requirements of the project: the source of the data and its availability, the number of people available, and whether the budget allocated for the project is sufficient to complete it successfully.
- Data Preparation: In this step, the raw data is converted to structured data and is cleaned. This involves Data Analysis, Data Cleaning, Handling of Missing Values, Transformation of data, and Visualization. From this step onwards, programming languages like R and Python are used to achieve results for big datasets.
- Exploratory Data Analysis (EDA): This is a crucial step in Data Science, where the Data Scientist explores the data from various angles and tries to draw initial conclusions from the data. This includes Data Visualization, Rapid Prototyping, Feature Selection, and finally Model Selection. A different set of tools is used in this step. The most commonly used are R or Python for scripting and data manipulation, SQL for interacting with databases, and various libraries for data manipulation and visualization.
- Model Building: Once the type of model to be used is determined from the EDA, most of the resources are channeled towards developing the model with ideal hyperparameters (modifiable parameters), such that it can perform predictive analysis on similar but unseen data. Various Machine Learning techniques, like Clustering, Regression, Classification, or PCA (Principal Component Analysis), are applied to the data in order to extract valuable insights from it.
- Deployment: After the model has been built successfully, it is time to bring the model out of its sandbox and into the real world. This is where model deployment comes into the picture. Up until now, all the steps were dedicated to rapid prototyping. However, once the model has been successfully built and trained, its main application is in the real world, where it is deployed. This can be in the form of a web app or a mobile app, or the model can run in the back end of a server to crunch high-frequency data.
- Real World Testing and Results: After the model has been deployed, it faces unseen data from the real world in real time. The model may perform very well in the sandbox but fail to perform adequately after deployment. This is the phase where constant monitoring of the model output is required in order to detect scenarios where the model fails. If it does fail at some point, the development process goes back to the first step, Project Analysis. If the model succeeds, the key findings are noted and reported to the stakeholders.
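The steps above can be sketched in a few lines of plain Python. This is a deliberately small illustration, assuming a hypothetical dataset of study hours and exam scores: it drops rows with missing values (Data Preparation), fits a straight line by gradient descent with the learning rate and number of epochs as hyperparameters (Model Building), and checks the error on held-out rows against a hypothetical threshold (Real World Testing). In practice, libraries such as Pandas and Sklearn would do most of this work.

```python
# Hypothetical raw records (hours_studied, exam_score); None marks a missing value.
RAW = [(1.0, 52.0), (2.0, 54.0), (3.0, None), (4.0, 58.0),
       (5.0, 60.0), (6.0, 62.0), (7.0, 64.0), (8.0, 66.0)]


def prepare(records):
    """Data Preparation: drop rows that contain a missing value."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def fit(rows, learning_rate=0.05, epochs=2000):
    """Model Building: fit y = w * x + b by batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in rows) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in rows) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(rows, w, b):
    """Mean squared error of the model on the given rows."""
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)


clean = prepare(RAW)
train_rows, test_rows = clean[:5], clean[5:]  # hold out the last rows as "unseen" data
w, b = fit(train_rows)
test_error = mse(test_rows, w, b)

# Real World Testing: flag the model if the error on unseen data is too high.
ERROR_THRESHOLD = 1.0  # hypothetical acceptance threshold
needs_rework = test_error > ERROR_THRESHOLD
print(f"w={w:.3f}, b={b:.3f}, test MSE={test_error:.6f}, rework={needs_rework}")
```

The learning rate matters here: a value that is too large makes gradient descent diverge on this data, which is one reason hyperparameters are tuned during Model Building rather than guessed.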
Where does Data Science fit when compared to the other Buzzwords - AI, Machine Learning, Deep Learning
“Data Science” seems to be a rather confusing term, without a clear definition or boundaries. The buzzwords “Artificial Intelligence”, “Machine Learning” and “Deep Learning” are often used interchangeably with “Data Science” or in association with it. Let us clearly define the boundaries of each of these terms.
As mentioned earlier, Machine Learning is a part of Data Science. As shown in the figure below, Deep Learning is a part of Machine Learning, and Machine Learning is in turn a part of Artificial Intelligence.
Even though Data Science includes a portion of each of Artificial Intelligence, Machine Learning and Deep Learning, it contains more than just these three subdomains inside it. Data Science also contains Statistical Programming, Data Analysis, Data Mining, Big Data and more recent additions like IoT, Edge Computing and Security.
Hence, Data Science is a complex field of the scientific study of data, which contains a significant portion of some of the most recent advancements in Computer Science and Mathematics.
Skills required to become a Data Scientist
As mentioned in the previous section, Data Science is a complex field. Hence, it requires the mastery of multiple sub-fields, which together add up to the complete knowledge required to be a Data Scientist.
1. Mathematics: The first and the most important field of study in order to become a Data Scientist is mathematics; more specifically, Probability and Statistics, Linear Algebra, and some basic Calculus.
- Statistics: It is essential in EDA and in developing algorithms to conduct statistical inference on the data. Additionally, most Machine Learning algorithms use statistics as their fundamental building blocks.
- Linear Algebra: Working with a huge amount of data means working with high dimensional matrices and matrix operations. The data that the model takes in and the one that it gives as output are in the form of matrices and hence any operation that is conducted on them uses the fundamentals of Linear Algebra.
- Calculus: Since Data Science includes Deep Learning, calculus is of immense importance. In Deep Learning, gradients are computed at every training step of a Neural Network in order to update its parameters. This requires a sound knowledge of differential and integral calculus.
2. Algorithmic Knowledge: Even though Data Science typically does not involve designing and developing new algorithms to the extent that other applications of Computer Science do, it is still imperative for a Data Scientist to have sound knowledge of algorithms. This is because, at the end of the day, Data Scientists are programmers who are expected to develop programs that derive meaningful insights from data. Algorithmic knowledge allows the Data Scientist to write meaningful, efficient code, which saves both time and resources and hence is highly valued.
3. Programming Languages (R and Python): Although any programming language can be used for Data Science, the most commonly used languages are R and Python. Both languages are open source and hence have huge community support, have multiple libraries developed with Data Science in mind, and are relatively easy to learn and use. Without knowledge of a programming language, a Data Scientist cannot apply any kind of algorithmic or mathematical knowledge to the data.
4. Proper Programming Environment: Since sound programming knowledge is one of the key requirements for Data Science, there needs to be a convenient platform to write and execute the code. This platform is called the IDE or Integrated Development Environment. There are several IDEs to choose from, and some of them have been specifically developed for Data Science. This article talks about the Top 10 Python IDEs.
5. Machine Learning Frameworks: Machine Learning is an important part of Data Science, and its implementation involves certain libraries and frameworks, the knowledge of which is essential for any Data Scientist. Some of the most commonly used libraries and frameworks are listed here.
- NumPy: This is a library which allows the easy implementation of linear algebra and data manipulation.
- Pandas: This library is used to load, modify and save data. It is also used in data wrangling.
- Matplotlib: This is one of the most commonly used libraries for data visualization.
- Seaborn: This is a wrapper over Matplotlib, which is used to visualize more complex data.
- scikit-learn (sklearn): This is used to apply and implement most of the machine learning algorithms and data preprocessing techniques.
- TensorFlow: This is a deep learning framework backed by Google and allows easy implementation of various types of neural networks.
- PyTorch: Similar to TensorFlow, this is also a frequently used deep learning framework.
- Keras: This is a high-level wrapper which works alongside TensorFlow and allows relatively easy implementation of Deep Learning techniques.
- OpenCV: This is a computer vision framework and is usually used for image processing and image manipulation. It is used for video or image-based data.
6. SQL: Databases are of immense importance in the field of Data Science since they are the most suitable method of storing data. Thorough knowledge of one or more database technologies like MySQL, MariaDB, PostgreSQL, MS SQL Server, MongoDB, Oracle NoSQL, etc. is also important.
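As a small illustration of the SQL item above, the following sketch uses Python’s built-in `sqlite3` module to store a few records and aggregate them with a query. SQLite is used here only because it ships with Python and needs no server; the `plays` table, its columns, and the rows are hypothetical. The same `SELECT ... GROUP BY` pattern carries over to the other relational databases mentioned above.

```python
import sqlite3

# In-memory database; the table, column names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (user_name TEXT, genre TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [
        ("alice", "jazz", 30.0),
        ("alice", "rock", 10.0),
        ("bob", "rock", 50.0),
        ("bob", "jazz", 5.0),
        ("carol", "jazz", 20.0),
    ],
)

# Aggregate listening time per genre, largest total first.
rows = conn.execute(
    "SELECT genre, SUM(minutes) AS total "
    "FROM plays GROUP BY genre ORDER BY total DESC"
).fetchall()
conn.close()

for genre, total in rows:
    print(genre, total)
```

Pushing the aggregation into the database, rather than loading every row into Python first, is usually the better choice once the data no longer fits comfortably in memory.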
Salaries of a Data Scientist
Data Science is one of the highest paying fields in the software domain. It also pays the most for the lowest amount of relevant work experience when compared to any other field in the software domain, as shown in the figure below. This data has been sourced from the Stack Overflow 2019 Developer Survey.
Some of the salaries offered are listed below.
- According to DataJobs, the salary range for Data Scientists in the USA is $85,000 to $170,000.
- According to PayScale, the salary range in India is ₹305,000 to ₹2,000,000, with a median salary of ₹620,000.
- Glassdoor states the Average Base Pay for Data Scientists in India as ₹947,698 per annum.
Future of Data Science
Data Science is an ever-growing field and is expected to grow in demand for the foreseeable future. Some of the key expected changes are listed below.
- Data: With the radical increase in the generation of data, the performance of predictive algorithms is going to improve over time as more structured data becomes available to draw inferences from. This phenomenon is fueled by the growth of Social Media and IoT-based devices, which generate a lot more structured data.
- Algorithms: Machine Learning algorithms like Genetic Algorithms and Reinforcement Learning algorithms are expected to improve over time, leading to more intelligent systems.
- Distributed Computing: With the advancement of blockchain technology, TPU (Tensor Processing Unit) development, and faster GPUs (Graphics Processing Units) available in the cloud, Data Science sees a future where more powerful computational hardware supports algorithms of increasing complexity.
More Data and improved Algorithms and Hardware together are expected to bring significant improvements in the field of Data Science in the near future.
Data Science is a hyped-up, complex field of study. For the most part, the hype is justified, and it delivers solutions to problems as promised. Some fields of Data Science have even started to outperform humans, and that trend is expected to continue in the near future. You can take up Data Science training to enhance your career.
Data Science is definitely the “Sexiest” job in the 21st century. It defines the bleeding edge of technology at present and promises further technological advancements in the near future. It is also one of the most in-demand and high paying jobs in the industry. Hence, there is no better time to be a Data Scientist than now!