Statistics 101: Analyzing and Visualizing Numerical Data

Feb 15, 2025

Understanding numerical data is crucial in statistics. This blog post covers key topics discussed in the Week 3 statistics-1 lectures of the IITM BS Degree in Data Science and Applications.

📚

In statistics, we broadly classify data into two types: numerical and categorical. This blog post focuses on numerical data and its analysis methods.

Types of Numerical Data

| Type | Description | Example |
|---|---|---|
| Discrete | Countable values with gaps | Number of children |
| Continuous | Any value within a range | Height, Weight |

Frequency Tables for Numerical Data

Frequency tables help organize data by counting occurrences. Let's look at some examples:

Example: Household Sizes

Consider survey data from 15 individuals:

| Household Size | Frequency | Relative Frequency |
|---|---|---|
| 1 | 2 | 0.13 |
| 2 | 3 | 0.20 |
| 3 | 5 | 0.33 |
| 4 | 4 | 0.27 |
| 5 | 1 | 0.07 |
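Here is a minimal Python sketch of how such a table could be built. The raw response list `household_sizes` is hypothetical: it is simply one set of 15 answers that reproduces the counts in the table, and `frequency_table` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical raw survey responses, chosen to match the table above.
household_sizes = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5]


def frequency_table(values):
    """Return rows of (value, frequency, relative frequency), sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], counts[v] / n) for v in sorted(counts)]


for value, freq, rel in frequency_table(household_sizes):
    # Relative frequency = frequency / total number of observations.
    print(f"{value}  {freq}  {rel:.2f}")
```

The relative frequencies always add up to 1, which is a quick sanity check on any frequency table.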

Guidelines for Organizing Data into Classes

| Guideline | Description |
|---|---|
| Number of Classes | Choose between 5-20 classes |
| Mutual Exclusivity | Each observation belongs to one class only |
| Equal Length | Use class intervals of equal length |

Key Terms in Grouped Data

📝
  • Lower Class Limit: Smallest value within a class
  • Upper Class Limit: Largest value within a class
  • Class Width: Difference between upper and lower limits
  • Class Mark: Average of lower and upper limits

Example: Student Marks

| Class Interval | Frequency | Midpoint |
|---|---|---|
| 30-40 | 3 | 35 |
| 40-50 | 6 | 45 |
| 50-60 | 18 | 55 |
| 60-70 | 17 | 65 |
| 70-80 | 4 | 75 |
| 80-90 | 2 | 85 |
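As a small illustration of the key terms, the sketch below computes the class width and class mark for each interval in the student marks table. The helper names `class_width` and `class_mark` are made up for this example.

```python
# Class intervals from the student marks table, as (lower limit, upper limit).
classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90)]


def class_width(lower, upper):
    """Difference between the upper and lower class limits."""
    return upper - lower


def class_mark(lower, upper):
    """Average of the lower and upper class limits (the midpoint)."""
    return (lower + upper) / 2


widths = [class_width(lower, upper) for lower, upper in classes]
marks = [class_mark(lower, upper) for lower, upper in classes]

for (lower, upper), width, mark in zip(classes, widths, marks):
    print(f"{lower}-{upper}: width = {width}, mark = {mark}")
```

Every class has the same width here, which is what the equal-length guideline asks for, and the class marks match the Midpoint column.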

Graphical Summaries

Histogram

A histogram represents a frequency distribution visually: each class is drawn as a bar whose height is the class frequency.

[Figure: histogram of student test scores]
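To get a rough feel for the shape without a plotting library, here is a text-only sketch that draws one row of `#` symbols per class. It assumes the class intervals and frequencies from the student marks table; `text_histogram` is a hypothetical helper, not a library function.

```python
# Class labels and frequencies taken from the student marks table.
labels = ["30-40", "40-50", "50-60", "60-70", "70-80", "80-90"]
frequencies = [3, 6, 18, 17, 4, 2]


def text_histogram(labels, frequencies, symbol="#"):
    """Return one line per class; the bar length equals the class frequency."""
    width = max(len(label) for label in labels)
    return [
        f"{label.ljust(width)} | {symbol * freq}"
        for label, freq in zip(labels, frequencies)
    ]


for line in text_histogram(labels, frequencies):
    print(line)
```

The bars run sideways instead of upward, but the reading is the same: the longest bars sit in the 50-60 and 60-70 classes.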

Stem-and-Leaf Plot

A stem-and-leaf plot organizes data by separating stems (leading digits) from leaves (trailing digits).

🔢

Example data: 15, 22, 29, 36, 31, 23, 45, 10, 25, 28, 14

1 | 0, 4, 5
2 | 2, 3, 5, 8, 9
3 | 1, 6
4 | 5
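The plot above can be reproduced with a short Python sketch. It assumes non-negative integers where the stem is everything except the last digit and the leaf is the last digit; `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return dict(groups)


data = [15, 22, 29, 36, 31, 23, 45, 10, 25, 28, 14]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    leaves = ", ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Sorting the values first is what keeps the leaves in ascending order within each stem.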

Measures of Central Tendency

Mean

The arithmetic mean is calculated as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example:

| Data | Calculation | Result |
|---|---|---|
| [2, 12, 5, 7, 6, 7, 3] | (2 + 12 + 5 + 7 + 6 + 7 + 3) ÷ 7 | 6 |

For grouped data, we use:

$$\bar{x} = \frac{\sum f_i m_i}{\sum f_i}$$

where $f_i$ is the frequency and $m_i$ is the midpoint of each class.
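As a worked example, this sketch applies the grouped-data formula to the student marks table from earlier. The function name `grouped_mean` is an assumption for illustration.

```python
# Frequencies and midpoints from the student marks table.
frequencies = [3, 6, 18, 17, 4, 2]
midpoints = [35, 45, 55, 65, 75, 85]


def grouped_mean(frequencies, midpoints):
    """Approximate the mean of grouped data as sum(f_i * m_i) / sum(f_i)."""
    total = sum(frequencies)
    weighted_sum = sum(f * m for f, m in zip(frequencies, midpoints))
    return weighted_sum / total


print(grouped_mean(frequencies, midpoints))
```

Here the weighted sum is 2940 over 50 students, so the grouped mean works out to 58.8. This is an approximation, because every observation in a class is treated as if it sat exactly at the midpoint.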

Median

The median is the middle value in ordered data. With an even number of values, it is the average of the two middle values.

📊

Ordered data: [2, 3, 5, 6, 7, 7, 12]
Median = 6 (4th value)

Mode

The mode is the most frequent value:

🔢

In [2, 12, 5, 7, 6, 7, 3], the mode is 7 (appears twice)
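All three measures can be checked quickly with Python's standard `statistics` module. This is just one convenient way to verify the hand calculations above, using the same dataset.

```python
import statistics

data = [2, 12, 5, 7, 6, 7, 3]

mean_value = statistics.mean(data)      # sum of values divided by count
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequent value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

The results agree with the worked examples: mean 6, median 6, mode 7.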

Measures of Dispersion

Range

$$\text{Range} = \text{Maximum} - \text{Minimum}$$

Variance & Standard Deviation

The sample variance is calculated as:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

The standard deviation is:

$$s = \sqrt{s^2}$$
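To make the range, variance, and standard deviation formulas concrete, here is a from-scratch sketch applied to the dataset used in the mean example. The function names are illustrative, not part of any library.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 12, 5, 7, 6, 7, 3]
data_range = max(data) - min(data)

print("Range:", data_range)
print("Sample variance:", round(sample_variance(data), 2))
print("Sample standard deviation:", round(sample_std(data), 2))
```

For this data the mean is 6, the squared deviations add up to 64, and dividing by n - 1 = 6 gives a sample variance of about 10.67.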

Interquartile Range (IQR)

$$\text{IQR} = Q_3 - Q_1$$
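A quick way to compute the IQR in Python is `statistics.quantiles`. Note the assumption: quartiles can be defined in several ways, and this sketch uses the `"inclusive"` method, which interpolates between neighboring values. Other conventions can give slightly different quartiles.

```python
import statistics

data = [2, 12, 5, 7, 6, 7, 3]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1:", q1)
print("Q3:", q3)
print("IQR:", iqr)
```

With this convention, Q1 = 4.0 and Q3 = 7.0, so the IQR is 3.0.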

Advanced Statistical Concepts

Properties of Mean

🔢

Impact of Adding a Constant (a):

  • New Mean = Old Mean + a

Impact of Multiplication by Constant (k):

  • New Mean = k × Old Mean

Example of Mean Properties

| Original Data | Add 5 | Multiply by 0.4 |
|---|---|---|
| 68, 79, 38 | 73, 84, 43 | 27.2, 31.6, 15.2 |
| Mean: 61.67 | Mean: 66.67 | Mean: 24.67 |

Properties of Median and Mode

| Measure | Adding Constant (a) | Multiplying by Constant (k) |
|---|---|---|
| Median | New Median = Old Median + a | New Median = k × Old Median |
| Mode | New Mode = Old Mode + a | New Mode = k × Old Mode |
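These properties are easy to verify numerically. The sketch below reuses the dataset from the central tendency examples and picks a = 5 and k = 2 purely for illustration.

```python
import statistics

data = [2, 12, 5, 7, 6, 7, 3]
a, k = 5, 2  # illustrative constants, not from the lectures

shifted = [x + a for x in data]  # add a constant to every value
scaled = [k * x for x in data]   # multiply every value by a constant

measures = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}

for name, measure in measures.items():
    print(name, measure(data), measure(shifted), measure(scaled))
```

Each row shows the original measure, the measure after adding a, and the measure after multiplying by k. In every case the measure moves the same way the data does.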

Statistical Formulas

Population Parameters

$$\text{Population Mean } (\mu) = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$\text{Population Variance } (\sigma^2) = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample Statistics

$$\text{Sample Mean } (\bar{x}) = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\text{Sample Variance } (s^2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
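The only difference between the two variance formulas is the divisor: N for a population, n - 1 for a sample. Python's `statistics` module has a function for each, so a small sketch can show the gap on the same numbers (treating the earlier dataset first as a whole population, then as a sample).

```python
import statistics

data = [2, 12, 5, 7, 6, 7, 3]

# Treat the data as the entire population: divide by N.
population_variance = statistics.pvariance(data)

# Treat the data as a sample from a larger population: divide by n - 1.
sample_variance_value = statistics.variance(data)

print("Population variance:", round(population_variance, 2))
print("Sample variance:", round(sample_variance_value, 2))
```

The sample variance is always the larger of the two for the same data, since the same sum of squared deviations is divided by a smaller number.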

Five-Number Summary

| Statistic | Description |
|---|---|
| Minimum | Smallest value in dataset |
| Q1 | 25th percentile |
| Median (Q2) | 50th percentile |
| Q3 | 75th percentile |
| Maximum | Largest value in dataset |

Calculating Percentiles

📊
  1. Arrange data in ascending order
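  2. Calculate rank = p × (n - 1) + 1, where p is the percentile written as a fraction (0.25 for the 25th percentile) and n is the number of observations
  3. Split the rank into its integer and fractional parts
  4. Take the value at the integer rank (counting from 1); if the fractional part is non-zero, add that fraction of the gap to the next value (linear interpolation)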
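The steps above can be sketched in Python as follows. This is a minimal version under stated assumptions: `p` is passed as a fraction between 0 and 1, ranks count from 1, and `percentile` is an illustrative name. Other textbooks and software use different percentile conventions, so results may differ slightly elsewhere.

```python
def percentile(values, p):
    """Percentile via rank = p * (n - 1) + 1 with linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be a fraction between 0 and 1")

    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    rank = p * (n - 1) + 1            # step 2: 1-based rank
    whole = int(rank)                 # step 3: integer part
    fraction = rank - whole           #         fractional part

    lower = ordered[whole - 1]        # value at the integer rank
    if fraction == 0:
        return lower
    upper = ordered[whole]            # next value in the ordering
    return lower + fraction * (upper - lower)  # step 4: interpolate


data = [2, 12, 5, 7, 6, 7, 3]
five_number_summary = [percentile(data, p) for p in (0, 0.25, 0.5, 0.75, 1)]
print(five_number_summary)
```

For this dataset the 25th percentile has rank 0.25 × 6 + 1 = 2.5, which falls halfway between the 2nd value (3) and the 3rd value (5), giving Q1 = 4. The same function also yields the rest of the five-number summary: minimum 2, median 6, Q3 = 7, maximum 12.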