Facts are stubborn, but Statistics are pliable — Mark Twain
Descriptive statistics consists of methods for organizing and summarizing information (Weiss, 1999).
Descriptive statistics include the construction of graphs, charts, and tables, and the calculation of various descriptive measures such as averages, measures of variation, and percentiles.
Let’s consider an example of rolling a die in order to understand statistics for Data Science. The die is rolled 100 times, and the results form the sample data. Descriptive statistics is used to group the sample data into a frequency table that shows how often each face came up.
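As a rough illustration of this grouping step, the sketch below simulates 100 rolls of a fair six-sided die and tallies them into a frequency table. The rolls are generated with an arbitrary fixed seed (an assumption made only for repeatability), so the counts are simulated and are not the data from the original example.

```python
import random
from collections import Counter

# Simulated sample: 100 rolls of a fair six-sided die.
# The seed value is arbitrary; it only makes the run repeatable.
rng = random.Random(42)
rolls = [rng.randint(1, 6) for _ in range(100)]

# Group the raw sample into a frequency table: face -> number of occurrences.
frequency = Counter(rolls)

print("Face  Count  Relative frequency")
for face in range(1, 7):
    count = frequency[face]  # Counter returns 0 for a face that never appeared
    print(f"{face:>4}  {count:>5}  {count / len(rolls):>18.2f}")
```

The counts always add up to the sample size (100), and the relative frequencies add up to 1.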
It is almost always necessary to use methods of descriptive statistics to organize and summarize the information obtained from a sample before methods of inferential statistics can be used to make a more thorough analysis of the subject under investigation. Sometimes it is possible to collect data from the whole population. In that case, a descriptive study can be performed on the population itself rather than, as is usual, on a sample.
Let’s see what descriptive statistics is and how to apply statistics to Data Science and Machine Learning.
In descriptive statistics, before you summarize the data, you need to get the data first. In most cases the data is captured in a setting in which the effect of the variables under study can be observed. Creating an unbiased and proper environment for collecting data is equally important, because ultimately good data leads to good and meaningful results.
So, we’ll start with Design of Experiments. In its simplest form, an experiment aims at predicting the outcome by introducing a change in the preconditions, which is reflected in a variable called the predictor (independent) variable. The change in the predictor is generally hypothesized to result in a change in a second variable, hence called the outcome (dependent) variable. Experimental design involves not only the selection of suitable predictors and outcomes but also planning the delivery of the experiment under statistically optimal conditions, given the constraints of available resources.
Khan Academy has a very good video explanation of this topic, and it is well worth about 10 minutes of your time.
Now that we have data and have analyzed it using Design of Experiments, the next important step is to convey that information visually so that it is more effective. Data Visualization is the way to do it.
Data Visualization is the technique to maximize how quickly and accurately people decode information from graphics.
To achieve this, Data Visualization researchers have focused on two areas:
1. Preattentive cognition: This includes the concepts that use cognitive understanding to decode information from graphics.
2. Accuracy: This includes the concepts that maximize the accuracy with which people interpret visualizations.
When using visualization techniques, aim for engagement from the users, understanding of the concepts, memorability of the information, and an emotional connection between users and the content.
Descriptive statistics is all about summarizing and describing the data (building descriptive intuition, not generalizing beyond the data). A number of mathematical techniques revolve around descriptive statistics.
Descriptive measures that indicate where the center, or the most typical value, of a variable lies in a collected set of measurements are called measures of center or Central Tendency. Measures of center are often referred to as averages. The median and the mean apply only to quantitative data (information about quantities), whereas the mode can be used with either quantitative or qualitative data (information about qualities).
The mean of a variable is the sum of the observed values in the data divided by the number of observations, where x1, x2, x3, ..., xn are the observed values and n is the number of observed values.
The formula for calculating the mean, or average, is: mean = (x1 + x2 + x3 + ... + xn) / n
For example, 7 participants in horse riding had the following finishing times in minutes: 27, 21, 25, 23, 21, 28, 24. What is the mean?
Using the formula for the mean, we take (27 + 21 + 25 + 23 + 21 + 28 + 24) / 7 = 169 / 7 ≈ 24.14 minutes as the mean.
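The same calculation can be sketched in Python. The variable names are illustrative, and the standard library's `statistics.mean` is used only as a cross-check on the manual sum-divided-by-count calculation.

```python
from statistics import mean

# Finishing times (minutes) of the 7 horse-riding participants.
finishing_times = [27, 21, 25, 23, 21, 28, 24]

# Mean = sum of observed values / number of observations.
manual_mean = sum(finishing_times) / len(finishing_times)

# Cross-check with the standard library.
library_mean = mean(finishing_times)

print(sum(finishing_times))   # 169
print(round(manual_mean, 2))  # 24.14
```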
The median is found by arranging the observed values of a variable in increasing order. The sample median of a quantitative variable is the value that divides the set of observed values in half, so that the observed values in one half are less than or equal to the median and the observed values in the other half are greater than or equal to the median. To obtain the median, we arrange the observed values in the data set in increasing (ascending) order and then determine the middle value in the ordered list. When the number of observations is even, the median is taken as the mean of the two middle values.
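A minimal sketch of this procedure, applied to the horse-riding finishing times from the mean example. The function name `sample_median` is hypothetical, and for an even number of observations it follows the common convention of averaging the two middle values.

```python
from statistics import median

def sample_median(values):
    """Return the middle value of the sorted data (hypothetical helper)."""
    ordered = sorted(values)          # arrange in increasing order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[middle]
    # even count: mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

finishing_times = [27, 21, 25, 23, 21, 28, 24]
print(sorted(finishing_times))         # [21, 21, 23, 24, 25, 27, 28]
print(sample_median(finishing_times))  # 24
```

The standard library's `statistics.median` gives the same result and can be used directly in practice.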
The mode is found by obtaining the frequency of each observed value of the variable in the data and noting the greatest frequency.
1. If the greatest frequency is 1 (no value occurs more than once) then the variable has no mode.
2. If the greatest frequency is 2 or greater, then any value that occurs with that greatest frequency is called a mode of the variable.
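The two rules above can be sketched as follows. The function name `modes` is hypothetical, and returning an empty list for "no mode" is a choice made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return all modes of the data, or [] if no value repeats (hypothetical helper)."""
    if not values:
        return []
    counts = Counter(values)              # frequency of each observed value
    greatest = max(counts.values())       # the greatest frequency
    if greatest == 1:                     # rule 1: nothing occurs more than once
        return []
    # rule 2: every value that occurs with the greatest frequency is a mode
    return sorted(value for value, count in counts.items() if count == greatest)

print(modes([27, 21, 25, 23, 21, 28, 24]))  # [21]
print(modes([1, 2, 3]))                     # []
print(modes([1, 1, 2, 2, 3]))               # [1, 2]
```

Because the helper only counts occurrences, it works equally well on qualitative data, for example `modes(["red", "blue", "red"])`.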
The sample range is obtained by computing the difference between the largest and the smallest observed value of the variable in a data set.
Range = max - min
For example, 8 participants in horse riding had the following finishing times in minutes: 28, 22, 26, 29, 21, 23, 24, 50.
What is the range?
We take 50 - 21 = 29 minutes as the range.
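In Python, the same calculation is a one-liner with the built-in `max` and `min`. The variable names are illustrative.

```python
# Finishing times (minutes) of the 8 horse-riding participants.
finishing_times = [28, 22, 26, 29, 21, 23, 24, 50]

# Range = max - min
sample_range = max(finishing_times) - min(finishing_times)
print(sample_range)  # 29

# The range depends only on the two extreme values: dropping the single
# 50-minute time shrinks it considerably.
without_largest = [t for t in finishing_times if t != 50]
print(max(without_largest) - min(without_largest))  # 8
```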
A boxplot is based on the five-number summary (the min, the three quartiles, and the max, written in increasing order) and can be used to provide a graphical presentation of the center and variation of the observed values of a variable in a data set.
But how do you draw a boxplot?
1. Determine the five-number summary first (the min, the three quartiles, and the max).
2. Draw a horizontal or vertical axis on which these numbers can be located. Mark the quartiles and the min and max values with short lines alongside the axis.
3. Connect the quartile marks to each other to form a box, then connect the box to the min and max values with lines.
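Step 1 can be sketched with the standard library, here using the 8 finishing times from the range example. Note the assumption: `statistics.quantiles` (Python 3.8+) uses its default "exclusive" method, and other textbooks or tools may compute the first and third quartiles slightly differently.

```python
from statistics import quantiles

# Finishing times (minutes) from the range example.
finishing_times = [28, 22, 26, 29, 21, 23, 24, 50]

# Three quartiles (Q1, median, Q3) using the default "exclusive" method.
q1, q2, q3 = quantiles(finishing_times, n=4)

# Five-number summary in increasing order.
five_number_summary = (min(finishing_times), q1, q2, q3, max(finishing_times))
print(five_number_summary)
```

These five values are the positions of the two whisker ends, the two box edges, and the line inside the box.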
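The sample standard deviation is the most frequently used measure of variability for a variable x.
The sample standard deviation, denoted by s, is: s = sqrt( Σ (xi - x̄)² / (n - 1) ), where x̄ is the sample mean and n is the number of observations in the sample.
The population standard deviation formula is: σ = sqrt( Σ (xi - μ)² / N ), where μ is the population mean and N is the number of observations in the population.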
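The two standard deviations differ only in the divisor: n - 1 for a sample and N for a whole population. The sketch below computes both by hand for the 7 finishing times from the mean example and cross-checks them against the standard library's `statistics.stdev` and `statistics.pstdev`. Treating the same 7 values first as a sample and then as a population is purely for illustration.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Finishing times (minutes) from the mean example.
finishing_times = [27, 21, 25, 23, 21, 28, 24]

n = len(finishing_times)
center = mean(finishing_times)

# Sum of squared deviations from the mean, shared by both formulas.
squared_deviations = sum((x - center) ** 2 for x in finishing_times)

sample_sd = sqrt(squared_deviations / (n - 1))   # divisor n - 1
population_sd = sqrt(squared_deviations / n)     # divisor N

print(round(sample_sd, 3), round(stdev(finishing_times), 3))
print(round(population_sd, 3), round(pstdev(finishing_times), 3))
```

For the same data, the sample value is always a little larger than the population value because it divides by a smaller number.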
This is how you collect, analyze, summarize, and build descriptive intuition from data using descriptive statistics. Hopefully this tutorial on statistics for data science helped you learn the mathematical techniques that revolve around descriptive statistics.