Timeless Statistical Concepts Every Data Scientist Must Master - With Links to Visual Illustrations & Examples

In this blog post, I outline the foundational statistical concepts that every Data Scientist should know. Unlike the constantly evolving versions of the software we use for data analysis, these concepts are timeless and fundamental to Data Science. Hence, if you learn them thoroughly, you will stay up to date and be better equipped with the skills needed to excel.
  1. Scales of measurement (nominal, ordinal, interval and ratio) - understanding them is essential for deciding on the appropriate analysis to perform on each type of scale.
  2. Degrees of Freedom - the number of values that are free to vary
  3. Z-score - how many standard deviations a data point is from the mean
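  4. Central Limit Theorem - why the distribution of sample means approaches a normal distribution as the sample size grows
  5. Standard Deviation vs Standard Error - two related measures of variability that are often confused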
  6. Confidence Interval - useful for interpretation
  7. Confusion matrix: a useful tool for measuring the accuracy of a classification model.
  8. Occam's Razor, Bias-Variance Tradeoff, No Free Lunch Theorem and The Curse of Dimensionality - to understand the limitations of machine learning
  9. Train-Test split and Cross-validation: for building an optimum model which neither underfits nor overfits the dataset.
  10. Components of Time Series (TCSI): this is the fundamental concept for time series analysis.




Non-linear Relationships: When a 0 Pearson Correlation Coefficient Can Be Surprisingly Meaningful

We know that the Pearson correlation coefficient (r) ranges from -1 to +1, and that a Pearson correlation coefficient of zero means there is no linear relationship between the variables.

Here the word linear is crucial. Why? Let's find out using an example where the Pearson correlation coefficient is 0.

Consider a case where Y = X².
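
A minimal sketch of this case, assuming X values chosen symmetrically around zero (the specific values and the helper name pearson_r are illustrative, not taken from the original example). Y is completely determined by X, yet the Pearson coefficient comes out as zero because the relationship is not linear:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Assumed illustrative data: X values symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # Y is fully determined by X

r = pearson_r(xs, ys)
print(r)  # zero, up to floating-point rounding
```

The positive and negative halves of X cancel each other in the covariance term, which is why a perfect but non-linear dependence can still give r = 0.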

Understanding Confidence Intervals with an Intuitive Example

The concept of confidence intervals (CI) is commonly used in data science. Hence, using an intuitive example, let us learn it with confidence!

Imagine you are waiting for the bus at a bus stop. Usually, the bus arrives at 9.30 am. But the arrival time varies.

Another person arrives at the bus stop to catch the same bus and asks you, "Based on your experience, what percentage of the time has the bus arrived here between 9.25 am and 9.35 am?"

You think and answer, "90% of the time".
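
The quantity in this conversation can be sketched in code as the share of past arrivals that fall inside the 9.25 am to 9.35 am window. The arrival log below is simulated, and both the normal shape and the 3-minute spread are assumptions made purely for illustration. This is only the empirical share behind the intuition, not a formal confidence-interval calculation:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical log of arrival times, in minutes relative to 9.30 am.
# The normal shape and the 3-minute spread are assumptions for illustration.
arrivals = [random.gauss(0, 3) for _ in range(10_000)]

# Window from 9.25 am to 9.35 am, i.e. 5 minutes either side of 9.30 am.
in_window = [t for t in arrivals if -5 <= t <= 5]
share = len(in_window) / len(arrivals)

print(f"Share of arrivals between 9.25 am and 9.35 am: {share:.1%}")
```

With these assumed settings the share lands close to the "90% of the time" answer given above.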

Standard Deviation vs Standard Error: Clearing up the Confusion with Visual Examples

Standard deviation and standard error are two statistical concepts that are often confused with each other. Though these two measures are related to variability in the data, they are different.

Standard deviation measures the variability in the dataset. The formula for standard deviation is given below.
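
As a rough sketch of how the two measures differ in practice, the snippet below assumes the sample form of the standard deviation (n - 1 divisor; the population form divides by n instead) and the usual standard error of the mean, SD / sqrt(n). The data values are made up for illustration:

```python
import math
import statistics

# Assumed illustrative sample; not data from the original post.
sample = [4, 8, 6, 5, 3, 7, 9, 6]

n = len(sample)
sd = statistics.stdev(sample)   # sample standard deviation (n - 1 divisor)
se = sd / math.sqrt(n)          # standard error of the mean

print(f"n = {n}, SD = {sd:.3f}, SE = {se:.3f}")
```

For this sample the standard deviation is 2.0 and the standard error is about 0.707. The standard deviation describes the spread of the individual values, while the standard error shrinks as n grows because it describes the uncertainty of the sample mean.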

Mastering Central Limit Theorem (CLT) with Intuitive Examples

To understand the Central Limit Theorem (CLT), let's use the example of rolling two dice repeatedly (say, 30 times). In each round, we calculate the sample mean (the mean of the two dice values), and then we plot the distribution of these sample means.

Round 1:
We got 2 and 5. The sample mean of 2 and 5 is 3.5.
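
The rounds can be simulated with a short sketch like the one below. The seed, the variable names and the text histogram are assumptions made for illustration; any plotting tool could be used instead:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

ROUNDS = 30  # number of rounds, as in the example

sample_means = []
for _ in range(ROUNDS):
    die_1 = random.randint(1, 6)
    die_2 = random.randint(1, 6)
    sample_means.append((die_1 + die_2) / 2)  # e.g. 2 and 5 give 3.5

# Text histogram of the sample means.
counts = Counter(sample_means)
for value in sorted(counts):
    print(f"{value:>3}: {'#' * counts[value]}")
```

With only 30 rounds the histogram is still rough. Raising ROUNDS should make the pile-up around 3.5 much easier to see.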

Demystifying Degrees of Freedom with Visual Examples: A Beginner's Guide

The concept of degrees of freedom (df) is fundamental and commonly used in statistical analysis.

You have probably seen the degrees of freedom for a particular test given as, say, (n-1) or (n-2). In this blog post, let us thoroughly understand this concept with examples.

A) Without any restriction

Suppose I give you three boxes as shown below. You are free to fill all these three boxes with any values of your choice. Hence, in this case, the degrees of freedom are 3. In other words, df=n here.
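
The idea can be sketched in a few lines. The first part mirrors the three unrestricted boxes. The second part adds one restriction (the boxes must sum to 10), which is an assumption introduced here only for contrast, to hint at where values such as (n-1) come from:

```python
# Case A: three boxes, no restriction. All three values are free choices.
boxes = [7, -2, 40]          # arbitrary values, chosen for illustration
df_unrestricted = len(boxes)

# Assumed contrast (not part of case A): require the boxes to sum to 10.
# Two values can still be chosen freely, but the third is then forced.
required_sum = 10
free_choices = [7, -2]
forced_value = required_sum - sum(free_choices)
df_restricted = len(free_choices)

print(df_unrestricted, df_restricted, forced_value)
```

Here the unrestricted case has df = 3, while the single restriction leaves only 2 free choices and fixes the last box at 5.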

The Chi-Square Test Explained with Examples: A Beginner's Guide

Imagine that, in a large gathering, people were given the option to pick any one of two products for free. You want to test whether there is any relation between gender and buying patterns.

When variables are independent

You take a random sample of 40 persons, in which there were 10 men and 10 women. You ask them what they bought: a pen or a pencil?

The cross-tabulated data is shown in the following table: