Statistics
Measures of Central Tendency
Mean
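The arithmetic mean of a data set $x_1, x_2, \ldots, x_n$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
For grouped data with frequencies $f_i$:
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$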
Median
The median is the middle value when data is arranged in order.
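- If $n$ is odd: median $= x_{\left(\frac{n+1}{2}\right)}$, the $\frac{n+1}{2}$-th ordered value
- If $n$ is even: median $= \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right)$, the mean of the two middle ordered values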
For grouped data, use linear interpolation within the median class.
Mode
The mode is the most frequently occurring value. A data set can be unimodal, bimodal, or have no mode.
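As a rough illustration of the three measures, the sketch below uses Python's standard `statistics` module on a small hypothetical data set (the values are invented for this example, not taken from the notes). The single large value pulls the mean up while the median and mode stay put.

```python
import statistics

# Hypothetical data set; 40 acts as an outlier.
data = [2, 4, 4, 5, 7, 9, 40]

mean = statistics.mean(data)        # sum of the values divided by n
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # list of all most frequent values

print(f"mean = {mean:.2f}, median = {median}, modes = {modes}")
```

`multimode` returns a list, so a bimodal data set gives two entries rather than an arbitrary single value.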
Comparing Measures
| Measure | Advantages | Disadvantages |
|---|---|---|
| Mean | Uses all data points, algebraic properties | Affected by outliers |
| Median | Robust to outliers | Does not use all data |
| Mode | Simple, useful for categorical data | May not exist or be unique |
Find the mean, median, and mode of: .
Mean:
Median: 5th value
Mode: (appears twice)
The mean (12.1) is significantly higher than the median (8) due to the outlier .
Measures of Spread
Range
$$\text{Range} = x_{\max} - x_{\min}$$
Interquartile Range (IQR)
$$\text{IQR} = Q_3 - Q_1$$
where $Q_1$ is the first quartile (25th percentile) and $Q_3$ is the third quartile (75th percentile).
Variance
The variance measures the average squared deviation from the mean.
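Population variance:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$
Sample variance (unbiased estimator):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard Deviation
The standard deviation is the square root of the variance:
$$\sigma = \sqrt{\sigma^2}, \qquad s = \sqrt{s^2}$$
Computational Formula
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2$$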
Calculate the standard deviation of: .
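Know whether to use the population formula ($\sigma$, dividing by $n$) or the sample formula ($s$, dividing by $n - 1$). In IB exams, when data is from a sample, use $s$ (dividing by $n - 1$). Most GDCs report both values, typically labelled $\sigma_x$ and $s_x$, so check which one you are reading.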
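The difference between dividing by $n$ and by $n - 1$ can be checked numerically. This is a minimal sketch using Python's standard `statistics` module; the data values are hypothetical and chosen so that the mean is exactly 5.

```python
import statistics

# Hypothetical data; the squared deviations from the mean (5) sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)  # divides by n      -> 32 / 8
samp_var = statistics.variance(data)  # divides by n - 1  -> 32 / 7
pop_sd = statistics.pstdev(data)      # square root of pop_var
samp_sd = statistics.stdev(data)      # square root of samp_var

print(pop_var, samp_var, pop_sd, samp_sd)
```

The sample values are always slightly larger than the population values, and the gap shrinks as $n$ grows.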
Grouped Data
Estimating the Mean from Grouped Data
Use the midpoint of each class interval:
$$\bar{x} \approx \frac{\sum f_i m_i}{\sum f_i}$$
where $m_i$ is the midpoint of class $i$.
Estimating the Median from Grouped Data
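Use linear interpolation within the median class:
$$\text{Median} \approx L + \left(\frac{\frac{n}{2} - F}{f}\right) \times w$$
where:
- $L$ = lower boundary of median class
- $n$ = total frequency
- $F$ = cumulative frequency before median class
- $f$ = frequency of median class
- $w$ = class width
| Mass $m$ (g) | Frequency |
|---|---|
| $0 \leq m < 20$ | 5 |
| $20 \leq m < 40$ | 12 |
| $40 \leq m < 60$ | 18 |
| $60 \leq m < 80$ | 10 |
| $80 \leq m < 100$ | 5 |
Total $n = 50$. Median position $= \frac{n}{2} = 25$.
Cumulative frequencies: $5, 17, 35, 45, 50$.
Median is in the class $40 \leq m < 60$ ($L = 40$, $F = 17$, $f = 18$, $w = 20$).
$$\text{Median} \approx 40 + \frac{25 - 17}{18} \times 20 \approx 48.9 \text{ g}$$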
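The grouped-data estimates above can be sketched in a few lines of Python. The function names are illustrative, and the lower boundary of 0 for the first class is an assumption (the upper boundaries 20 to 100 and the frequencies come from the example).

```python
# Class boundaries and frequencies; the first boundary (0) is assumed.
boundaries = [0, 20, 40, 60, 80, 100]
freqs = [5, 12, 18, 10, 5]

def grouped_mean(boundaries, freqs):
    """Estimate the mean using the midpoint of each class."""
    mids = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
    return sum(f * m for f, m in zip(freqs, mids)) / sum(freqs)

def grouped_median(boundaries, freqs):
    """Estimate the median by linear interpolation in the median class."""
    n = sum(freqs)
    cum_before = 0  # cumulative frequency before the current class
    for lo, hi, f in zip(boundaries, boundaries[1:], freqs):
        if cum_before + f >= n / 2:
            return lo + (n / 2 - cum_before) / f * (hi - lo)
        cum_before += f

mean_est = grouped_mean(boundaries, freqs)
median_est = grouped_median(boundaries, freqs)
print(round(mean_est, 1), round(median_est, 1))
```

Both results are estimates, because the individual values inside each class are unknown.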
Box-and-Whisker Plots
Components
A box-and-whisker plot displays the five-number summary:
- Minimum (or lower whisker)
- $Q_1$ (first quartile)
- Median ($Q_2$)
- $Q_3$ (third quartile)
- Maximum (or upper whisker)
Outliers
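A value $x$ is a potential outlier if it falls outside the fences:
$$x < Q_1 - 1.5 \times \text{IQR} \quad \text{or} \quad x > Q_3 + 1.5 \times \text{IQR}$$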
Interpreting Box Plots
- The box represents the middle 50% of data (IQR).
- The line inside the box is the median.
- Whiskers extend to the minimum and maximum (or to the most extreme non-outlier values).
- Skewness: if the median is closer to $Q_1$, the data is right-skewed (positively skewed). If it is closer to $Q_3$, the data is left-skewed (negatively skewed).
Cumulative Frequency
Cumulative Frequency Graph (Ogive)
Plot cumulative frequency against the upper class boundary. From this graph, you can read:
- Median: at $\frac{n}{2}$
- Quartiles: at $\frac{n}{4}$ and $\frac{3n}{4}$
- Percentiles: at the appropriate fraction of $n$
Using the grouped data from the previous example:
| Upper boundary | Cumulative frequency |
|---|---|
| 20 | 5 |
| 40 | 17 |
| 60 | 35 |
| 80 | 45 |
| 100 | 50 |
To find (at ): interpolate between and .
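Reading a value off a cumulative frequency graph amounts to linear interpolation between two plotted points. The sketch below is one way to do this in Python using the table above; the function name `read_ogive` is illustrative, and the starting point $(0, 0)$ is an assumption about the lower boundary of the first class.

```python
def read_ogive(points, target_cf):
    """Return the x-value at which the cumulative frequency reaches target_cf,
    interpolating linearly between neighbouring (boundary, cf) points."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target_cf <= c1 and c1 > c0:
            return x0 + (target_cf - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target cumulative frequency is outside the graph")

# (upper boundary, cumulative frequency); the first point (0, 0) is assumed.
points = [(0, 0), (20, 5), (40, 17), (60, 35), (80, 45), (100, 50)]
n = points[-1][1]

q1 = read_ogive(points, n / 4)      # cumulative frequency 12.5
q3 = read_ogive(points, 3 * n / 4)  # cumulative frequency 37.5
print(q1, q3, q3 - q1)
```

The same function gives any percentile by passing the appropriate fraction of $n$.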
Correlation
Scatter Diagrams
A scatter diagram plots two variables to visually assess the relationship between them.
Pearson's Correlation Coefficient ($r$)
Measures the strength and direction of the linear relationship between two variables.
Properties of $r$
| Value | Interpretation |
|---|---|
| $r = 1$ | Perfect positive linear correlation |
| $r = -1$ | Perfect negative linear correlation |
| $r = 0$ | No linear correlation |
| $0 < r < 1$ | Positive linear correlation |
| $-1 < r < 0$ | Negative linear correlation |
Strength Guidelines
| $\lvert r \rvert$ | Strength |
|---|---|
| -- | Weak |
| -- | Moderate |
| -- | Strong |
Computational Formula
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Correlation does NOT imply causation. Two variables may be strongly correlated without one causing the other (they may both be influenced by a third variable).
Linear Regression
Least Squares Regression Line
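The line of best fit $y = ax + b$ minimises the sum of squared residuals:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2$$
where:
$$a = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b = \bar{y} - a\bar{x}$$
Regression Line of $x$ on $y$
If predicting $x$ from $y$:
$$x = cy + d, \qquad c = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2}, \qquad d = \bar{x} - c\bar{y}$$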
Coefficient of Determination ($r^2$)
$r^2$ represents the proportion of variance in $y$ explained by the linear relationship with $x$.
- $r^2 = 1$: the line explains all the variation.
- $r^2 = 0$: the line explains none of the variation.
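Given the data:
| $x$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $y$ | 2.1 | 3.9 | 6.2 | 7.8 | 10.1 |
Find the regression line of $y$ on $x$.
$\bar{x} = 3$, $\bar{y} = 6.02$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = 19.9$, $\sum (x_i - \bar{x})^2 = 10$.
Gradient $= \frac{19.9}{10} = 1.99$; intercept $= 6.02 - 1.99 \times 3 = 0.05$.
Regression line: $y = 1.99x + 0.05$.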
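The least squares calculation for this example can be reproduced with a short Python sketch. The function name `least_squares` is illustrative; it computes the gradient as the sum of products of deviations divided by the sum of squared $x$-deviations, then the intercept from the means.

```python
def least_squares(xs, ys):
    """Return (gradient, intercept) of the regression line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    gradient = s_xy / s_xx
    intercept = y_bar - gradient * x_bar
    return gradient, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
gradient, intercept = least_squares(xs, ys)
print(f"y = {gradient:.2f}x + {intercept:.2f}")
```

Swapping the two arguments gives the regression line of $x$ on $y$, which is generally a different line.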
Extrapolation and Interpolation
- Interpolation: predicting within the range of data (generally reliable).
- Extrapolation: predicting outside the range of data (unreliable and potentially misleading).
Never extrapolate beyond the data range without acknowledging the uncertainty. IB exam questions often ask you to comment on the reliability of a prediction.
Hypothesis Testing
Key Concepts
- Null hypothesis ($H_0$): The statement being tested (usually "no effect" or "no difference").
- Alternative hypothesis ($H_1$): The statement we suspect might be true.
- Significance level ($\alpha$): The threshold for rejecting $H_0$ (commonly 0.05 or 0.01).
- Test statistic: A value computed from the sample data.
- $p$-value: The probability of observing a test statistic at least as extreme as the one obtained, assuming $H_0$ is true.
- Critical value: The boundary value(s) that define the rejection region.
Decision Rule
- If $p$-value $< \alpha$: reject $H_0$.
- If $p$-value $\geq \alpha$: do not reject $H_0$.
Types of Errors
| Error Type | Description |
|---|---|
| Type I | Rejecting $H_0$ when it is true (false positive). Probability $= \alpha$. |
| Type II | Failing to reject $H_0$ when it is false (false negative). Probability $= \beta$. |
One-Tailed vs Two-Tailed Tests
| Test | $H_1$ | Rejection Region |
|---|---|---|
| Two-tailed | Parameter $\neq$ value | Both tails |
| Right-tailed | Parameter $>$ value | Upper tail |
| Left-tailed | Parameter $<$ value | Lower tail |
Hypothesis Test for Correlation
To test whether the population correlation coefficient is zero:
- State $H_0: \rho = 0$ and $H_1: \rho \neq 0$ (or $\rho > 0$ or $\rho < 0$).
- Compute $r$ from the sample.
- Compare $r$ with the critical value (or compute the $p$-value).
- Make a conclusion.
The test statistic for large samples ():
$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$
which follows a $t$-distribution with $n - 2$ degrees of freedom.
A sample of 12 students gives a correlation coefficient of between hours studied and exam score. Test at the 5% significance level whether there is a positive correlation.
$H_0: \rho = 0$ vs $H_1: \rho > 0$.
This is a one-tailed test. The critical value for at 5% level is approximately .
Since , we reject .
There is sufficient evidence at the 5% level to conclude a positive correlation between hours studied and exam score.
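A test of this kind can also be carried out with the $t$ statistic, in its usual form $t = r\sqrt{(n-2)/(1-r^2)}$ with $n - 2$ degrees of freedom. The sketch below is illustrative only: the value `r_hypothetical = 0.6` is an invented sample correlation, not the one from the example, and the critical value 1.812 is the one-tailed 5% value of $t$ with 10 degrees of freedom taken from standard tables.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r_hypothetical = 0.6  # invented sample correlation, for illustration only
n = 12                # sample size, as in the example above
t = correlation_t_statistic(r_hypothetical, n)

t_critical = 1.812    # one-tailed, 5% level, 10 degrees of freedom (tables)
reject_h0 = t > t_critical
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

For a two-tailed test, compare $|t|$ with the two-tailed critical value instead.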
Chi-Squared Test for Independence
Used to determine whether two categorical variables are independent.
Test statistic:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where $O$ are observed frequencies and $E$ are expected frequencies.
Expected frequency:
$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$
Degrees of freedom: $\nu = (r - 1)(c - 1)$ where $r$ is the number of rows and $c$ is the number of columns.
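Test whether gender and favourite subject are independent:
| | Maths | Science | English | Total |
|---|---|---|---|---|
| Male | 30 | 25 | 15 | 70 |
| Female | 20 | 20 | 40 | 80 |
| Total | 50 | 45 | 55 | 150 |
$H_0$: gender and favourite subject are independent. $H_1$: they are not independent.
Expected frequencies:
| | Maths | Science | English |
|---|---|---|---|
| Male | 23.33 | 21.00 | 25.67 |
| Female | 26.67 | 24.00 | 29.33 |
$$\chi^2 = \frac{(30 - 23.33)^2}{23.33} + \frac{(25 - 21)^2}{21} + \frac{(15 - 25.67)^2}{25.67} + \frac{(20 - 26.67)^2}{26.67} + \frac{(20 - 24)^2}{24} + \frac{(40 - 29.33)^2}{29.33} \approx 13.31$$
Degrees of freedom: $(2 - 1)(3 - 1) = 2$.
Critical value at $\alpha = 0.05$ with $\nu = 2$: $5.991$.
Since $13.31 > 5.991$, we reject $H_0$. There is sufficient evidence that gender and favourite subject are not independent.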
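The expected frequencies and the test statistic for this table can be checked with a short Python sketch. The function name `chi_squared_independence` is illustrative; it builds each expected frequency as row total times column total over grand total, then sums $(O - E)^2 / E$ over all cells.

```python
def chi_squared_independence(observed):
    """Return (chi-squared statistic, degrees of freedom, expected table)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected

# Rows: Male, Female. Columns: Maths, Science, English.
observed = [[30, 25, 15], [20, 20, 40]]
chi2, dof, expected = chi_squared_independence(observed)
print(f"chi-squared = {chi2:.2f}, degrees of freedom = {dof}")
```

The returned `expected` table also makes it easy to check that every expected frequency is at least 5.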
For the chi-squared test, always check that all expected frequencies are at least 5. If any $E_{ij} < 5$, combine categories or note the limitation.
IB Exam-Style Questions
Question 1 (Paper 1 style)
The marks of 8 students in Maths and Physics are:
| Student | Maths ($x$) | Physics ($y$) |
|---|---|---|
| A | 72 | 68 |
| B | 85 | 82 |
| C | 60 | 58 |
| D | 90 | 88 |
| E | 78 | 74 |
| F | 65 | 70 |
| G | 88 | 85 |
| H | 76 | 72 |
(a) Calculate Pearson's correlation coefficient.
Using a GDC: $r \approx 0.961$.
(b) Interpret this value.
There is a very strong positive linear correlation between Maths and Physics marks.
(c) Find the equation of the regression line of $y$ on $x$.
Using a GDC: $y \approx 0.885x + 6.67$.
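Without a GDC, the correlation coefficient for this table can be computed directly from the deviations about the means. The Python sketch below is one way to do it; the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

maths = [72, 85, 60, 90, 78, 65, 88, 76]
physics = [68, 82, 58, 88, 74, 70, 85, 72]
r = pearson_r(maths, physics)
print(f"r = {r:.3f}")
```

Because the formula is symmetric in $x$ and $y$, swapping the two lists gives the same value of $r$.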
Question 2 (Paper 2 style)
The reaction times (in seconds) of 20 drivers were measured:
0.42, 0.55, 0.61, 0.48, 0.72, 0.38, 0.65, 0.51, 0.44, 0.59, 0.67, 0.53, 0.46, 0.71, 0.39, 0.58, 0.63, 0.50, 0.57, 0.66
(a) Find the mean and standard deviation.
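$\bar{x} \approx 0.553$ s. Standard deviation: $s \approx 0.104$ s using the sample formula, or $\sigma \approx 0.101$ s using the population formula.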
(b) Construct a box-and-whisker plot.
Ordered data: 0.38, 0.39, 0.42, 0.44, 0.46, 0.48, 0.50, 0.51, 0.53, 0.55, 0.57, 0.58, 0.59, 0.61, 0.63, 0.65, 0.66, 0.67, 0.71, 0.72.
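Median: $\frac{0.55 + 0.57}{2} = 0.56$.
$Q_1$: median of lower half $= \frac{0.46 + 0.48}{2} = 0.47$.
$Q_3$: median of upper half $= \frac{0.63 + 0.65}{2} = 0.64$.
$\text{IQR} = 0.64 - 0.47 = 0.17$.
Lower fence: $0.47 - 1.5 \times 0.17 = 0.215$. Minimum $= 0.38 > 0.215$.
Upper fence: $0.64 + 1.5 \times 0.17 = 0.895$. Maximum $= 0.72 < 0.895$.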
No outliers.
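The five-number summary and the outlier check for this question can be reproduced with a short Python sketch. It uses the median-of-halves method for the quartiles, as in the working above; GDCs and software packages may use slightly different quartile conventions. The function name `five_number_summary` is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For odd n the middle value is excluded from both halves.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

times = [0.42, 0.55, 0.61, 0.48, 0.72, 0.38, 0.65, 0.51, 0.44, 0.59,
         0.67, 0.53, 0.46, 0.71, 0.39, 0.58, 0.63, 0.50, 0.57, 0.66]

minimum, q1, median, q3, maximum = five_number_summary(times)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in times if t < lower_fence or t > upper_fence]
print(minimum, q1, median, q3, maximum, outliers)
```

An empty `outliers` list means the whiskers extend all the way to the minimum and maximum.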
Question 3 (Paper 1 style)
A researcher claims that the mean height of a population is . A sample of 25 people gives with . Test this claim at the 5% significance level.
vs .
Degrees of freedom $= n - 1 = 24$.
Critical value (two-tailed, 5%) $= t_{0.025,\,24} \approx 2.064$.
Since , we do not reject .
There is insufficient evidence at the 5% level to reject the claim that the mean height is .
Summary
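| Concept | Formula |
|---|---|
| Mean | $\bar{x} = \frac{\sum x_i}{n}$ |
| Sample variance | $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ |
| IQR | $Q_3 - Q_1$ |
| Correlation | $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$ |
| Regression slope | $\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ |
| Chi-squared | $\chi^2 = \sum \frac{(O - E)^2}{E}$ |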
For statistics questions in Paper 2, always show your working. State hypotheses clearly for hypothesis tests. When using your GDC, note what function you used and the inputs. Interpret results in context — never leave a numerical answer without explaining what it means.
Transformations of Data
Linear Transformations
If every data value is transformed by $y = ax + b$:
| Statistic | Original | Transformed |
|---|---|---|
| Mean | $\bar{x}$ | $a\bar{x} + b$ |
| Standard deviation | $\sigma$ | $\lvert a \rvert \sigma$ |
| Variance | $\sigma^2$ | $a^2\sigma^2$ |
| Median | $M$ | $aM + b$ |
| IQR | $\text{IQR}$ | $\lvert a \rvert \times \text{IQR}$ |
| Correlation coefficient | $r$ | $r$ (unchanged if $a > 0$, negated if $a < 0$) |
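The effect of a linear transformation $y = ax + b$ on the mean, median, and standard deviation can be checked numerically. In this sketch the data and the constants `a = -2`, `b = 3` are hypothetical; a negative multiplier is used to show that the spread scales by the absolute value of $a$.

```python
import statistics

# Hypothetical data and transformation constants.
x = [2, 4, 4, 4, 5, 5, 7, 9]
a, b = -2, 3
y = [a * value + b for value in x]

# Measures of centre follow the transformation directly.
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
median_x, median_y = statistics.median(x), statistics.median(y)

# Measures of spread ignore b and scale by |a|.
sd_x, sd_y = statistics.pstdev(x), statistics.pstdev(y)

print(mean_y, a * mean_x + b)
print(median_y, a * median_x + b)
print(sd_y, abs(a) * sd_x)
```

Adding the constant $b$ shifts every value equally, which is why it changes the mean and median but not the standard deviation.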
Standardised Scores (z-scores)
The z-score measures how many standard deviations a value is from the mean:
$$z = \frac{x - \mu}{\sigma}$$
In a test with mean 65 and standard deviation 8, a student scores 81. Find the z-score.
$$z = \frac{81 - 65}{8} = 2$$
The student scored 2 standard deviations above the mean.
Non-Linear Regression
Transformations for Non-Linear Data
When data shows a non-linear pattern, transform the variables to linearise:
| Relationship | Transformation | Linear Form |
|---|---|---|
| Plot vs | ||
| Plot vs | ||
| Plot vs | ||
| Plot vs |
Power Law
If $\ln y$ vs $\ln x$ gives a straight line, then $y = ax^n$ where:
- $n$ is the gradient
- $\ln a$ is the vertical intercept
Exponential Law
If $\ln y$ vs $x$ gives a straight line, then $y = ae^{bx}$ where:
- $b$ is the gradient
- $\ln a$ is the vertical intercept
Data suggests is related to by . A plot of vs has gradient and -intercept . Find the relationship.
Additional Exam-Style Questions
Question 4 (Paper 2 style)
Two groups of students took a test. The results are summarised below:
Group A: , ,
Group B: , ,
(a) Find the overall mean.
(b) Comment on the spread of the two groups.
Group B has a larger standard deviation (10 vs 8), meaning the scores are more spread out in Group B. Group A's scores are more tightly clustered around the mean.
Question 5 (Paper 2 style)
A scientist investigates the relationship between temperature ($T$, in °C) and reaction rate ($R$, in mol/L/s). The following data was collected:
| $T$ | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|
| $R$ | 0.4 | 1.1 | 2.5 | 5.2 | 10.8 | 22.0 |
(a) Explain why a linear regression model may not be appropriate.
The data appears to show exponential growth: as temperature increases, the rate increases by an increasing amount. A plot of $R$ vs $T$ would show a curve, not a straight line.
(b) By plotting $\ln R$ against $T$, determine whether the relationship is of the form $R = ae^{bT}$.
| $T$ | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|
| $\ln R$ | $-0.916$ | 0.095 | 0.916 | 1.649 | 2.380 | 3.091 |
A plot of $\ln R$ vs $T$ would show an approximately linear relationship with a positive gradient, confirming the exponential model is appropriate.
(c) Using a GDC, find the equation of the regression line of $\ln R$ on $T$.
The regression gives approximately $\ln R = 0.0789T - 1.56$.
So $R \approx e^{-1.56}e^{0.0789T} \approx 0.210e^{0.0789T}$.
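The linearisation in this question can be sketched in Python by taking natural logarithms of the rates and fitting a least squares line to the transformed data. The model form $R = ae^{bT}$ and the use of natural logarithms are assumptions made for this sketch; the variable names are illustrative.

```python
import math

T = [10, 20, 30, 40, 50, 60]
R = [0.4, 1.1, 2.5, 5.2, 10.8, 22.0]

# Linearise: if R = a * e^(b*T), then ln R = ln a + b*T.
ln_R = [math.log(rate) for rate in R]

n = len(T)
t_bar = sum(T) / n
ln_r_bar = sum(ln_R) / n
s_xy = sum((t - t_bar) * (v - ln_r_bar) for t, v in zip(T, ln_R))
s_xx = sum((t - t_bar) ** 2 for t in T)

b = s_xy / s_xx            # gradient of the ln R vs T line
ln_a = ln_r_bar - b * t_bar  # vertical intercept of that line
a = math.exp(ln_a)         # undo the logarithm to recover a

print(f"ln R = {b:.4f} T + {ln_a:.2f}, so R = {a:.3f} e^({b:.4f} T)")
```

Predictions from the fitted model are most trustworthy for temperatures between 10 and 60, the range of the data.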
Question 6 (Paper 1 style)
The correlation coefficient between study hours and exam scores for 50 students is .
(a) Calculate the coefficient of determination and interpret it.
About of the variation in exam scores is explained by the linear relationship with study hours.
(b) The -value for testing is . What conclusion can be drawn?
Since , we reject . There is strong evidence of a positive correlation between study hours and exam scores.
Measures of Shape: Skewness and Kurtosis
Skewness
Skewness measures the asymmetry of the distribution.
| Type | Description | Mean vs Median |
|---|---|---|
| Positive (right) skew | Long tail to the right | Mean $>$ Median |
| Negative (left) skew | Long tail to the left | Mean $<$ Median |
| Symmetric | No skew | Mean = Median |
The Pearson coefficient of skewness:
$$\text{Skewness} = \frac{3(\bar{x} - \text{median})}{s}$$
Kurtosis
Kurtosis measures the "tailedness" of the distribution compared to a normal distribution.
| Type | Description |
|---|---|
| Leptokurtic | Heavy tails, sharp peak (kurtosis $> 3$) |
| Mesokurtic | Same as normal (kurtosis $= 3$) |
| Platykurtic | Light tails, flat peak (kurtosis $< 3$) |
Standardised Scores and Normal Distribution Tables
Using z-Scores
Given a population with mean $\mu$ and standard deviation $\sigma$:
- Convert raw score to z-score: $z = \frac{x - \mu}{\sigma}$
- Use the standard normal table (or GDC) to find probabilities.
- Convert back: $x = \mu + z\sigma$
Finding Percentiles
The $k$-th percentile is the value below which $k\%$ of the data falls.
$$x_k = \mu + z_k\sigma$$
where $z_k$ is the z-score such that $P(Z < z_k) = \frac{k}{100}$.
Scores on a test are normally distributed with and . Find the 90th percentile.
A score of 82.26 is at the 90th percentile.
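A percentile calculation of this kind can be sketched with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The parameters `mu = 100` and `sigma = 15` below are hypothetical and are not the values from the example above.

```python
from statistics import NormalDist

# Hypothetical population parameters, for illustration only.
mu, sigma = 100, 15

# z-score with 90% of the standard normal distribution below it.
z_90 = NormalDist().inv_cdf(0.90)

# 90th percentile, found directly and by converting the z-score back.
p90_direct = NormalDist(mu, sigma).inv_cdf(0.90)
p90_from_z = mu + z_90 * sigma

print(f"z = {z_90:.4f}, 90th percentile = {p90_direct:.2f}")
```

Both routes give the same value, which mirrors the "convert back" step $x = \mu + z\sigma$.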
Data Collection Methods
Types of Data
| Type | Description | Examples |
|---|---|---|
| Qualitative (categorical) | Labels or names | Colour, gender, nationality |
| Quantitative | Numerical | Height, mass, temperature |
| Discrete | Integer values | Number of siblings, goals scored |
| Continuous | Any value in a range | Height, time, mass |
Sampling Methods
| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| Simple random | Every member has equal chance | Unbiased | Difficult for large populations |
| Systematic | Every $k$-th member selected | Easy to implement | May introduce periodic bias |
| Stratified | Population divided into strata | Ensures representation | Complex to organise |
| Quota | Fixed numbers from each group | Quick | Not random |
| Convenience | Easily accessible members | Easy | Likely biased |
Reliability and Validity
- Reliability: consistency of measurements (repeatable).
- Validity: measures what it claims to measure.
- A measurement can be reliable without being valid, but not valid without being reliable.