Statistics in Data Science

Mardiyyah Oduwole
3 min read · Mar 13, 2021

To describe the role of statistics in data science, we start by defining both statistics and data science.

According to Wikipedia, statistics is a branch of mathematics that deals with the collection, analysis, interpretation and presentation of masses of numerical data.

Wikipedia also says data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data across a broad range of application domains.

Statistics is an essential part of the mathematics behind data science because it supports the analysis, interpretation and presentation of numerical data, which is how we derive insights across a broad range of applications. With statistics, we gain a deeper understanding of how data is structured, and we can make concrete, accurate estimates from our data rather than drawing arbitrary conclusions.

Statistics can be divided into 2 categories:

  • Descriptive Statistics: This is the analysis of data that describes and summarizes it in a meaningful way so that patterns can emerge. Descriptive statistics helps us visualize and present our data in a simpler form.

Generally, we use 2 types of measures to describe data:

  1. Measures of Central Tendency: These are ways of describing the central position of a frequency distribution for a group of data. We can describe central tendency using a number of statistics, including the mode, median and mean. These measures help us draw accurate insights from our datasets. For example, a mean that sits far from the median suggests that the data is skewed or contains outliers.
  2. Measures of Spread: These are ways of summarizing a group of data by describing how the scores are spread out. Measures of spread help us deal with datasets that have outliers more accurately and efficiently. To describe the spread, we use the range, quartiles, absolute deviation, variance and standard deviation.
  • Inferential Statistics: When we talk about inferential statistics, we talk about samples, i.e. using samples of a particular population to derive insights and then making generalizations about that population. When using inferential statistics, it is important to be sure that the samples accurately represent the population. The process of achieving this is called sampling. Inferential statistics is born out of the fact that sampling naturally incurs sampling error, so a sample is not expected to represent the population perfectly. In data science, we use inferential statistics when data for the whole population is not available and a representative sample is the closest practical substitute.
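To make the two categories above concrete, here is a minimal sketch using only Python's standard library. The exam scores, the simulated population of heights and the sample size of 200 are all made-up values chosen for illustration, not data from this article. The first half computes the measures of central tendency and spread; the second half draws a random sample and uses it to estimate the population mean, with a rough 95% interval based on the normal approximation.

```python
import random
import statistics

# --- Descriptive statistics ---
# Hypothetical data: ten exam scores, one of which (250) is an obvious outlier.
scores = [52, 55, 55, 58, 60, 61, 63, 65, 68, 250]

# Measures of central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of spread
value_range = max(scores) - min(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
variance = statistics.variance(scores)          # sample variance
std_dev = statistics.stdev(scores)              # sample standard deviation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, quartiles=({q1}, {q2}, {q3})")
print(f"variance={variance:.1f}, std_dev={std_dev:.1f}")
# The mean is pulled well above the median by the single outlier,
# which is a hint that the data is skewed.

# --- Inferential statistics ---
# Hypothetical population: 10,000 simulated heights in cm (assumed values).
rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Sampling: draw 200 individuals at random, without replacement.
sample = rng.sample(population, 200)

sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% interval for the population mean (normal approximation).
low = sample_mean - 1.96 * standard_error
high = sample_mean + 1.96 * standard_error

print(f"sample mean={sample_mean:.2f}, interval=({low:.2f}, {high:.2f})")
print(f"population mean={statistics.mean(population):.2f}")
```

The interval is the sampling error made visible: the sample mean will rarely equal the population mean exactly, but a representative sample usually lands close to it.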

WHY STATISTICS IS IMPORTANT IN DATA SCIENCE

  1. Statistics helps us figure out which features are important and which are less important.
  2. Since statistics helps us understand the dataset better, it also helps us decide on the metrics to use while building our ML models.
  3. Statistics helps determine which columns feature engineering should be done on.
  4. Statistics gives deep insight into the dataset, i.e. it makes it easy to see the correlations between columns and to determine the kind of correlation they have.
  5. Descriptive statistics is used to transform raw data into insights that make sense, while inferential statistics is used to study small samples of data and then extend the findings to the entire population.
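As a rough illustration of the point about correlations between columns, the sketch below computes the Pearson correlation coefficient by hand with the standard library. The three columns (hours studied, absences, exam score) are invented for the example, and `pearson` is a helper written here, not a library function. In practice, libraries such as pandas offer the same calculation for every pair of columns through `DataFrame.corr()`.

```python
import math
import statistics


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length columns."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical columns from a small student dataset (assumed values).
hours_studied = [1, 2, 3, 4, 5]
absences = [10, 8, 6, 4, 2]
exam_score = [2, 4, 6, 8, 10]

r_hours = pearson(hours_studied, exam_score)
r_absences = pearson(absences, exam_score)

print(f"hours studied vs exam score: r = {r_hours:.2f}")
print(f"absences vs exam score:      r = {r_absences:.2f}")
```

A coefficient near +1 indicates a strong positive correlation, near -1 a strong negative one, and near 0 little or no linear relationship. This is the "kind of correlation" that helps decide which columns are worth keeping or engineering further.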

In conclusion, statistics is very important in data science, and I have found it very useful in deriving insights from datasets. I hope this article has given you a better understanding of statistics and its applications in data science.


Mardiyyah Oduwole

Python || Data Scientist || Machine Learning Engineer || ML Researcher