Understanding numerical data is crucial in statistics. This blog post covers key topics discussed in the Week 3 statistics-1 lectures of the IITM BS Degree in Data Science and Applications.
In statistics, we broadly classify data into two types: numerical and categorical. This blog post focuses on numerical data and its analysis methods.
Types of Numerical Data
Type | Description | Example |
---|---|---|
Discrete | Countable values with gaps | Number of children |
Continuous | Any value within a range | Height, Weight |
Frequency Tables for Numerical Data
Frequency tables help organize data by counting occurrences. Let's look at some examples:
Example: Household Sizes
Consider survey data from 15 individuals:
Household Size | Frequency | Relative Frequency |
---|---|---|
1 | 2 | 0.13 |
2 | 3 | 0.20 |
3 | 5 | 0.33 |
4 | 4 | 0.27 |
5 | 1 | 0.07 |
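The table above can be reproduced with a short script. This is a minimal sketch: the list `household_sizes` is rebuilt from the frequencies in the table (the order of the responses is an assumption), and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Household sizes rebuilt from the frequencies in the table above;
# the order of the 15 responses is an assumption.
household_sizes = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5]

def frequency_table(values):
    """Return (value, frequency, relative frequency) rows sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, counts[v], round(counts[v] / total, 2)) for v in sorted(counts)]

table = frequency_table(household_sizes)
for value, freq, rel in table:
    print(f"{value} | {freq} | {rel:.2f}")
```

Relative frequency is simply the frequency divided by the total number of observations, so the rounded values should add up to roughly 1.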
Guidelines for Organizing Data into Classes
Guideline | Description |
---|---|
Number of Classes | Choose between 5 and 20 classes |
Mutual Exclusivity | Each observation belongs to one class only |
Equal Length | Use class intervals of equal length |
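One way to satisfy these guidelines in code is to use equal-width classes that include the lower limit and exclude the upper limit, so no observation can land in two classes. The sketch below assumes that convention; `group_into_classes` is a hypothetical helper and the marks are invented purely for illustration.

```python
def group_into_classes(values, start, width, num_classes):
    """Count values in equal-width classes [lower, upper).

    Each class includes its lower limit and excludes its upper limit,
    so every observation falls into exactly one class.
    """
    counts = [0] * num_classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{v} lies outside the class range")
        counts[index] += 1
    return [
        ((start + i * width, start + (i + 1) * width), counts[i])
        for i in range(num_classes)
    ]

# Hypothetical marks, invented only to exercise the function.
marks = [32, 40, 47, 55, 59, 60, 61, 68, 74, 89]
classes = group_into_classes(marks, start=30, width=10, num_classes=6)
for (lower, upper), count in classes:
    print(f"{lower}-{upper} | {count}")
```

Note that a mark of exactly 40 is counted in the 40-50 class, not in 30-40, under this convention.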
Key Terms in Grouped Data
- Lower Class Limit: Smallest value within a class
- Upper Class Limit: Largest value within a class
- Class Width: Difference between upper and lower limits
- Class Mark: Average of lower and upper limits
Example: Student Marks
Class Interval | Frequency | Midpoint |
---|---|---|
30-40 | 3 | 35 |
40-50 | 6 | 45 |
50-60 | 18 | 55 |
60-70 | 17 | 65 |
70-80 | 4 | 75 |
80-90 | 2 | 85 |
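The midpoints in the table follow directly from the class limits. As a quick check, the sketch below computes the class width and class mark for each row; the function names are illustrative, not part of any library.

```python
# (lower limit, upper limit, frequency) taken from the table above
student_marks = [
    (30, 40, 3), (40, 50, 6), (50, 60, 18),
    (60, 70, 17), (70, 80, 4), (80, 90, 2),
]

def class_width(lower, upper):
    """Difference between the upper and lower class limits."""
    return upper - lower

def class_mark(lower, upper):
    """Average of the lower and upper class limits (the midpoint)."""
    return (lower + upper) / 2

widths = [class_width(lo, hi) for lo, hi, _ in student_marks]
class_marks = [class_mark(lo, hi) for lo, hi, _ in student_marks]
total_students = sum(freq for _, _, freq in student_marks)
```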
Graphical Summaries
Histogram
A histogram represents a frequency distribution visually: each class gets a bar whose height is proportional to its frequency.
[Figure: histogram of student test scores]
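Without a plotting library, a rough text histogram still conveys the shape of a distribution. The sketch below uses the class frequencies from the Student Marks table earlier in this post; `text_histogram` is a hypothetical helper, and drawing one `#` per observation is just one possible scaling.

```python
# Class intervals and frequencies from the Student Marks table
classes = ["30-40", "40-50", "50-60", "60-70", "70-80", "80-90"]
frequencies = [3, 6, 18, 17, 4, 2]

def text_histogram(labels, counts, symbol="#"):
    """Return one line per class; bar length equals the class frequency."""
    return [
        f"{label} | {symbol * count} ({count})"
        for label, count in zip(labels, counts)
    ]

histogram_lines = text_histogram(classes, frequencies)
print("\n".join(histogram_lines))
```

The longest bars fall in the 50-60 and 60-70 classes, which is where most of the marks are concentrated.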
Stem-and-Leaf Plot
A stem-and-leaf plot organizes data by separating stems (leading digits) from leaves (trailing digits).
Example data: 15, 22, 29, 36, 31, 23, 45, 10, 25, 28, 14
1 | 0, 4, 5
2 | 2, 3, 5, 8, 9
3 | 1, 6
4 | 5
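The same plot can be built programmatically by splitting each value into its tens digit (stem) and units digit (leaf). This is a minimal sketch that assumes non-negative integers below 100, as in the example; `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict

data = [15, 22, 29, 36, 31, 23, 45, 10, 25, 28, 14]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits).

    Assumes non-negative integers below 100, as in the example data.
    """
    plot = defaultdict(list)
    for v in sorted(values):          # sorting keeps stems and leaves in order
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {', '.join(map(str, leaves))}")
```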
Measures of Central Tendency
Mean
The arithmetic mean of $n$ observations $x_1, x_2, \dots, x_n$ is calculated as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Example:
Data | Calculation | Result |
---|---|---|
[2, 12, 5, 7, 6, 7, 3] | (2 + 12 + 5 + 7 + 6 + 7 + 3) ÷ 7 | 6 |
For grouped data, we use:
$$\bar{x} = \frac{\sum_{i} f_i m_i}{\sum_{i} f_i}$$
where $f_i$ is the frequency and $m_i$ is the midpoint of each class.
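Applied to the Student Marks table from earlier, the grouped mean is a frequency-weighted average of the class midpoints. The sketch below is an approximation of the mean of the raw marks, since every observation in a class is treated as if it sat at the midpoint; `grouped_mean` is an illustrative name.

```python
# Frequencies and midpoints from the Student Marks table
frequencies = [3, 6, 18, 17, 4, 2]
midpoints = [35, 45, 55, 65, 75, 85]

def grouped_mean(freqs, mids):
    """Frequency-weighted average of the class midpoints."""
    total = sum(freqs)
    weighted_sum = sum(f * m for f, m in zip(freqs, mids))
    return weighted_sum / total

approx_mean = grouped_mean(frequencies, midpoints)
print(approx_mean)  # 2940 / 50 = 58.8
```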
Median
The median is the middle value in ordered data:
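Ordered data: [2, 3, 5, 6, 7, 7, 12], so the median is 6 (the 4th of 7 values). With an even number of observations, the median is the average of the two middle values.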
Mode
The mode is the most frequent value:
In [2, 12, 5, 7, 6, 7, 3], the mode is 7 (appears twice)
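All three measures can be checked with Python's standard `statistics` module. This is only a quick verification of the worked examples above, not a replacement for computing them by hand.

```python
import statistics

data = [2, 12, 5, 7, 6, 7, 3]

mean_value = statistics.mean(data)       # sum of values divided by their count
median_value = statistics.median(data)   # middle value of the sorted data
mode_value = statistics.mode(data)       # most frequent value

print(mean_value, median_value, mode_value)  # 6 6 7
```

When several values tie for most frequent, `statistics.multimode` returns all of them as a list.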
Measures of Dispersion
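Range
The range is the difference between the largest and smallest values: Range = Maximum - Minimum. For [2, 12, 5, 7, 6, 7, 3], the range is 12 - 2 = 10.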
Variance & Standard Deviation
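The sample variance is calculated as:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
where $\bar{x}$ is the sample mean and $n$ is the number of observations.
The standard deviation is:
$$s = \sqrt{s^2}$$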
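As a worked check, the sketch below computes both quantities for the data used in the mean example, assuming the usual n - 1 divisor for a sample. `sample_variance` is an illustrative helper; the standard library's `statistics.variance` and `statistics.stdev` use the same divisor.

```python
import math
import statistics

data = [2, 12, 5, 7, 6, 7, 3]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

variance = sample_variance(data)   # squared deviations sum to 64, so 64 / 6
std_dev = math.sqrt(variance)

print(round(variance, 4), round(std_dev, 4))
```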
Interquartile Range (IQR)
The IQR is the difference between the third and first quartiles: IQR = Q3 - Q1.
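The IQR is the distance between the third and first quartiles. Quartile conventions differ between textbooks and tools, so the sketch below makes an explicit assumption: it uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between ordered values.

```python
import statistics

data = [2, 12, 5, 7, 6, 7, 3]

# Assumption: the "inclusive" method, which treats the data as including
# its own minimum and maximum and interpolates between ordered values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q3, iqr)  # 4.0 7.0 3.0
```

A different quartile convention may give slightly different values for Q1 and Q3 on small datasets.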
Advanced Statistical Concepts
Properties of Mean
Impact of Adding a Constant (a):
- New Mean = Old Mean + a
Impact of Multiplication by Constant (k):
- New Mean = k × Old Mean
Example of Mean Properties
Original Data | Add 5 | Multiply by 0.4 |
---|---|---|
68, 79, 38 | 73, 84, 43 | 27.2, 31.6, 15.2 |
Mean: 61.67 | Mean: 66.67 | Mean: 24.67 |
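Both properties are easy to confirm numerically. The sketch below applies the two transformations to the three values listed in the table and compares the recomputed means with the values the properties predict; `mean` is a small local helper.

```python
import math

data = [68, 79, 38]

def mean(values):
    return sum(values) / len(values)

shifted = [x + 5 for x in data]     # add a constant a = 5
scaled = [0.4 * x for x in data]    # multiply by a constant k = 0.4

old_mean = mean(data)
print(old_mean, mean(shifted), mean(scaled))

# Each recomputed mean should match the prediction from the property.
shift_ok = math.isclose(mean(shifted), old_mean + 5)
scale_ok = math.isclose(mean(scaled), 0.4 * old_mean)
```

`math.isclose` is used instead of `==` because multiplying by 0.4 introduces small floating-point rounding errors.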
Properties of Median and Mode
Measure | Adding Constant (a) | Multiplying by Constant (k) |
---|---|---|
Median | New Median = Old Median + a | New Median = k × Old Median |
Mode | New Mode = Old Mode + a | New Mode = k × Old Mode |
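The same check works for the median and the mode. This sketch reuses the data from the earlier examples with hypothetical constants a = 5 and k = 3, chosen as integers so the comparisons are exact.

```python
import statistics

data = [2, 12, 5, 7, 6, 7, 3]
a, k = 5, 3  # hypothetical constants chosen for illustration

shifted = [x + a for x in data]
scaled = [k * x for x in data]

old_median = statistics.median(data)   # 6
old_mode = statistics.mode(data)       # 7

print(statistics.median(shifted), statistics.mode(shifted))  # 11 12
print(statistics.median(scaled), statistics.mode(scaled))    # 18 21
```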
Statistical Formulas
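Population Parameters
- Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
- Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Sample Statistics
- Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$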
Five-Number Summary
Statistic | Description |
---|---|
Minimum | Smallest value in dataset |
Q1 | 25th percentile |
Median (Q2) | 50th percentile |
Q3 | 75th percentile |
Maximum | Largest value in dataset |
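A five-number summary can be assembled from the minimum, the three quartiles, and the maximum. The sketch below is one possible implementation: `five_number_summary` is a hypothetical helper, and the choice of `method="inclusive"` for the quartiles is an assumption, since other quartile conventions exist.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Quartiles use the "inclusive" method of statistics.quantiles,
    which is an assumption; other conventions can differ slightly.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

summary = five_number_summary([2, 12, 5, 7, 6, 7, 3])
print(summary)  # (2, 4.0, 6.0, 7.0, 12)
```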
Calculating Percentiles
- Arrange data in ascending order
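- Calculate the rank: rank = p × (n - 1) + 1, where p is the percentile expressed as a fraction (0.25 for the 25th percentile) and n is the number of observations
- Split the rank into its integer part i and fractional part f
- If f = 0, the percentile is the i-th ordered value; otherwise interpolate between the i-th and (i+1)-th ordered values: value = x(i) + f × (x(i+1) - x(i))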
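The steps above can be sketched as a small function. It assumes the percentile is passed as a fraction between 0 and 1 and that ranks are counted from 1; `percentile` is an illustrative name, not a library function.

```python
import math

def percentile(values, p):
    """Percentile by linear interpolation, with p given as a fraction in [0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)                   # step 1: ascending order
    rank = p * (len(ordered) - 1) + 1          # step 2: 1-based rank
    whole = math.floor(rank)                   # step 3: integer part
    fraction = rank - whole                    #         fractional part
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower                           # rank lands exactly on a value
    upper = ordered[whole]
    return lower + fraction * (upper - lower)  # step 4: interpolate

data = [2, 12, 5, 7, 6, 7, 3]
q1 = percentile(data, 0.25)   # rank 2.5: halfway between 3 and 5
q2 = percentile(data, 0.50)   # rank 4: the 4th ordered value
```

For the data above, the rank for the 25th percentile is 0.25 × 6 + 1 = 2.5, so the result lies halfway between the 2nd and 3rd ordered values.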