Statistics

Measures of Central Tendency

Mean

The arithmetic mean of a data set $x_1, x_2, \ldots, x_n$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{\sum x_i}{n}$$

For grouped data with frequencies $f_i$:

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

Median

The median is the middle value when data is arranged in order.

  • If $n$ is odd: median $= x_{\frac{n+1}{2}}$
  • If $n$ is even: median $= \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$

For grouped data, use linear interpolation within the median class.

Mode

The mode is the most frequently occurring value. A data set can be unimodal, bimodal, or have no mode.

Comparing Measures

| Measure | Advantages | Disadvantages |
| --- | --- | --- |
| Mean | Uses all data points, algebraic properties | Affected by outliers |
| Median | Robust to outliers | Does not use all data |
| Mode | Simple, useful for categorical data | May not exist or be unique |

Example

Find the mean, median, and mode of: $3, 5, 5, 7, 8, 9, 12, 15, 45$.

Mean: $\bar{x} = \dfrac{3+5+5+7+8+9+12+15+45}{9} = \dfrac{109}{9} \approx 12.1$

Median: 5th value $= 8$

Mode: $5$ (appears twice)

The mean (12.1) is significantly higher than the median (8) due to the outlier $45$.
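The three measures can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and the second part (removing the outlier) is an added illustration rather than part of the worked example.

```python
from statistics import mean, median, multimode

data = [3, 5, 5, 7, 8, 9, 12, 15, 45]

data_mean = mean(data)          # 109 / 9
data_median = median(data)      # middle (5th) value of the sorted data
data_modes = multimode(data)    # list of all most-frequent values

print(round(data_mean, 1), data_median, data_modes)

# Dropping the outlier 45 moves the mean far more than the median.
trimmed = [x for x in data if x != 45]
print(mean(trimmed), median(trimmed))
```

`multimode` returns a list, so a bimodal data set would show both modes.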


Measures of Spread

Range

$$\mathrm{Range} = x_{\max} - x_{\min}$$

Interquartile Range (IQR)

$$\mathrm{IQR} = Q_3 - Q_1$$

where $Q_1$ is the first quartile (25th percentile) and $Q_3$ is the third quartile (75th percentile).

Variance

The variance measures the average squared deviation from the mean.

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Sample variance (unbiased estimator):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Standard Deviation

$$\sigma = \sqrt{\sigma^2} \quad \text{or} \quad s = \sqrt{s^2}$$

Computational Formula

$$s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} = \frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}$$

Example

Calculate the standard deviation of: $4, 8, 6, 5, 3, 8, 9, 2, 7$.

$$n = 9, \quad \bar{x} = \frac{52}{9} \approx 5.778$$

$$\sum x_i^2 = 16 + 64 + 36 + 25 + 9 + 64 + 81 + 4 + 49 = 348$$

$$s^2 = \frac{348 - 9 \times (52/9)^2}{9 - 1} = \frac{348 - 300.44}{8} = \frac{47.56}{8} \approx 5.944$$

$$s \approx 2.438$$
Exam Tip

Know whether to use the population formula ($\div N$) or the sample formula ($\div (n-1)$). In IB exams, when data is from a sample, use $s^2$ (dividing by $n-1$). Most GDCs report both standard deviations (commonly labelled $\sigma_x$ and $s_x$), so check which one you are reading.
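To see how much the choice of divisor matters, the two variances can be computed side by side. The sketch below applies the computational formula to the data from the standard deviation example; the variable names are illustrative.

```python
from math import sqrt
from statistics import pvariance, variance

data = [4, 8, 6, 5, 3, 8, 9, 2, 7]
n = len(data)

sum_x = sum(data)                      # 52
sum_x2 = sum(x * x for x in data)      # 348

# Computational formula for the sample variance (divide by n - 1).
s2 = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
s = sqrt(s2)

# Population variance divides by n instead.
sigma2 = (sum_x2 - sum_x ** 2 / n) / n

print(round(s2, 3), round(s, 3), round(sigma2, 3))
```

The standard library functions `variance` and `pvariance` give the same two values and can serve as a cross-check.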


Grouped Data

Estimating the Mean from Grouped Data

Use the midpoint of each class interval:

$$\bar{x} \approx \frac{\sum f_i m_i}{\sum f_i}$$

where $m_i$ is the midpoint of class $i$.

Estimating the Median from Grouped Data

Use linear interpolation within the median class:

$$\mathrm{Median} \approx L + \left(\frac{\frac{n}{2} - F}{f}\right) \times w$$

where:

  • $L$ = lower boundary of median class
  • $n$ = total frequency
  • $F$ = cumulative frequency before median class
  • $f$ = frequency of median class
  • $w$ = class width
Example

| Mass (g) | Frequency |
| --- | --- |
| $0 \le m \lt 20$ | 5 |
| $20 \le m \lt 40$ | 12 |
| $40 \le m \lt 60$ | 18 |
| $60 \le m \lt 80$ | 10 |
| $80 \le m \lt 100$ | 5 |

Total $n = 50$. Median position $= \dfrac{50}{2} = 25$.

Cumulative frequencies: $5, 17, 35, 45, 50$.

Median is in the $40 \le m \lt 60$ class ($F = 17$, $f = 18$).

$$\mathrm{Median} \approx 40 + \left(\frac{25 - 17}{18}\right) \times 20 = 40 + \frac{8}{18} \times 20 \approx 40 + 8.89 = 48.89 \text{ g}$$
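Both grouped-data estimates can be sketched in a few lines. The code below uses the mass table from this example; the `(lower, upper, frequency)` tuple layout and the function name `grouped_median` are illustrative choices, not a standard API.

```python
# Class boundaries and frequencies from the mass table above.
classes = [(0, 20, 5), (20, 40, 12), (40, 60, 18), (60, 80, 10), (80, 100, 5)]

n = sum(f for _, _, f in classes)

# Estimated mean: weight each class midpoint by its frequency.
est_mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

def grouped_median(classes):
    """Linear interpolation inside the class that contains position n/2."""
    total = sum(f for _, _, f in classes)
    half = total / 2
    cumulative = 0  # F: cumulative frequency before the current class
    for lo, hi, f in classes:
        if cumulative + f >= half:
            return lo + (half - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("no classes supplied")

est_median = grouped_median(classes)
print(round(est_mean, 2), round(est_median, 2))
```

Both results are estimates, because the individual masses inside each class are unknown.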

Box-and-Whisker Plots

Components

A box-and-whisker plot displays the five-number summary:

  1. Minimum (or lower whisker)
  2. $Q_1$ (first quartile)
  3. Median ($Q_2$)
  4. $Q_3$ (third quartile)
  5. Maximum (or upper whisker)

Outliers

A value is a potential outlier if it lies below the lower fence or above the upper fence:

$$Q_1 - 1.5 \times \mathrm{IQR} \quad \text{or} \quad Q_3 + 1.5 \times \mathrm{IQR}$$

Interpreting Box Plots

  • The box represents the middle 50% of data (IQR).
  • The line inside the box is the median.
  • Whiskers extend to the minimum and maximum (or to the most extreme non-outlier values).
  • Skewness: if the median is closer to $Q_1$, the data is right-skewed (positively skewed). If closer to $Q_3$, left-skewed (negatively skewed).

Cumulative Frequency

Cumulative Frequency Graph (Ogive)

Plot cumulative frequency against the upper class boundary. From this graph, you can read:

  • Median: at $\dfrac{n}{2}$
  • Quartiles: at $\dfrac{n}{4}$ and $\dfrac{3n}{4}$
  • Percentiles: at the appropriate fraction of $n$

Example

Using the grouped data from the previous example:

| Upper boundary | Cumulative frequency |
| --- | --- |
| 20 | 5 |
| 40 | 17 |
| 60 | 35 |
| 80 | 45 |
| 100 | 50 |

To find $Q_1$ (at $12.5$): interpolate between $(20, 5)$ and $(40, 17)$.

$$Q_1 \approx 20 + \frac{12.5 - 5}{17 - 5} \times 20 = 20 + 12.5 = 32.5 \text{ g}$$
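Reading a value off the ogive is piecewise-linear interpolation between plotted points, which can be sketched as a small function. The name `read_ogive` is hypothetical, and the starting point $(0, 0)$ is an added assumption so that values in the first class can be read as well.

```python
def read_ogive(target, points):
    """Piecewise-linear interpolation on (upper boundary, cumulative frequency) points."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        # Skip flat segments (empty classes) to avoid dividing by zero.
        if c0 < c1 and c0 <= target <= c1:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target lies outside the cumulative frequency range")

# The point (0, 0) is added by assumption so the first class can be interpolated too.
ogive = [(0, 0), (20, 5), (40, 17), (60, 35), (80, 45), (100, 50)]
n = ogive[-1][1]

q1 = read_ogive(n / 4, ogive)
q3 = read_ogive(3 * n / 4, ogive)
print(q1, q3, q3 - q1)
```

The same function gives any percentile by passing the appropriate fraction of `n` as the target.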

Correlation

Scatter Diagrams

A scatter diagram plots two variables to visually assess the relationship between them.

Pearson's Correlation Coefficient ($r$)

Measures the strength and direction of the linear relationship between two variables.

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$

Properties of $r$

| Value | Interpretation |
| --- | --- |
| $r = 1$ | Perfect positive linear correlation |
| $r = -1$ | Perfect negative linear correlation |
| $r = 0$ | No linear correlation |
| $0 \lt r \lt 1$ | Positive linear correlation |
| $-1 \lt r \lt 0$ | Negative linear correlation |

Strength Guidelines

| $\lvert r \rvert$ | Strength |
| --- | --- |
| $0.0$–$0.3$ | Weak |
| $0.3$–$0.7$ | Moderate |
| $0.7$–$1.0$ | Strong |

Computational Formula

$$r = \frac{n\sum x_iy_i - \sum x_i \sum y_i}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}}$$

Exam Tip

Correlation does NOT imply causation. Two variables may be strongly correlated without one causing the other (they may both be influenced by a third variable).
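The computational formula for $r$ translates directly into code. The sketch below is an illustration with made-up points that lie exactly on a straight line, chosen so the two extreme cases in the properties table can be seen; `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via the computational (raw sums) formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

xs = [1, 2, 3, 4, 5]
rising = pearson_r(xs, [2 * x + 1 for x in xs])    # points on a rising line
falling = pearson_r(xs, [10 - 3 * x for x in xs])  # points on a falling line
print(rising, falling)
```

The function fails with a division by zero if either variable is constant, since $r$ is undefined in that case.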


Linear Regression

Least Squares Regression Line

The line of best fit minimises the sum of squared residuals:

$$y = a + bx$$

where:

$$b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{n\sum x_iy_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}$$

$$a = \bar{y} - b\bar{x}$$

Regression Line of $x$ on $y$

If predicting $x$ from $y$:

$$x = a' + b'y$$

$$b' = \frac{n\sum x_iy_i - \sum x_i \sum y_i}{n\sum y_i^2 - (\sum y_i)^2}$$

Coefficient of Determination ($r^2$)

$$r^2 = \frac{\text{explained variation}}{\text{total variation}}$$

$r^2$ represents the proportion of variance in $y$ explained by the linear relationship with $x$.

  • $r^2 = 1$: the line explains all the variation.
  • $r^2 = 0$: the line explains none of the variation.

Example

Given the data:

| $x$ | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $y$ | 2.1 | 3.9 | 6.2 | 7.8 | 10.1 |

Find the regression line of $y$ on $x$.

$$n = 5, \quad \sum x = 15, \quad \sum y = 30.1$$

$$\sum x^2 = 55, \quad \sum xy = 110.2$$

$$b = \frac{5(110.2) - 15(30.1)}{5(55) - 225} = \frac{551 - 451.5}{275 - 225} = \frac{99.5}{50} = 1.99$$

$$\bar{y} = 6.02, \quad \bar{x} = 3$$

$$a = 6.02 - 1.99(3) = 6.02 - 5.97 = 0.05$$

Regression line: $y = 0.05 + 1.99x$.
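The same calculation can be sketched in code, which also makes it easy to inspect the residuals that the least squares line minimises. Variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope and intercept of the regression line of y on x.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Residuals: observed y minus the value predicted by the fitted line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(round(b, 2), round(a, 2))
print([round(e, 2) for e in residuals])
```

The residuals of a least squares line sum to zero (up to floating-point error), which is a quick sanity check on the fit.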

Extrapolation and Interpolation

  • Interpolation: predicting within the range of data (generally reliable).
  • Extrapolation: predicting outside the range of data (unreliable and potentially misleading).
Exam Tip

Never extrapolate beyond the data range without acknowledging the uncertainty. IB exam questions often ask you to comment on the reliability of a prediction.


Hypothesis Testing

Key Concepts

  • Null hypothesis ($H_0$): The statement being tested (usually "no effect" or "no difference").
  • Alternative hypothesis ($H_1$): The statement we suspect might be true.
  • Significance level ($\alpha$): The threshold for rejecting $H_0$ (commonly 0.05 or 0.01).
  • Test statistic: A value computed from the sample data.
  • $p$-value: The probability of observing a test statistic at least as extreme as the one obtained, assuming $H_0$ is true.
  • Critical value: The boundary value(s) that define the rejection region.

Decision Rule

  • If $p$-value $\lt \alpha$: reject $H_0$.
  • If $p$-value $\ge \alpha$: do not reject $H_0$.

Types of Errors

| Error Type | Description |
| --- | --- |
| Type I | Rejecting $H_0$ when it is true (false positive). Probability $= \alpha$. |
| Type II | Failing to reject $H_0$ when it is false (false negative). Probability $= \beta$. |

One-Tailed vs Two-Tailed Tests

| Test | $H_1$ | Rejection Region |
| --- | --- | --- |
| Two-tailed | Parameter $\neq$ value | Both tails |
| Right-tailed | Parameter $\gt$ value | Upper tail |
| Left-tailed | Parameter $\lt$ value | Lower tail |

Hypothesis Test for Correlation

To test whether the population correlation coefficient $\rho$ is zero:

  1. $H_0: \rho = 0$ and $H_1: \rho \neq 0$ (or $\rho \gt 0$ or $\rho \lt 0$).
  2. Compute $r$ from the sample.
  3. Compare $|r|$ with the critical value (or compute the $p$-value).
  4. Make a conclusion.

The test statistic (valid for any $n \gt 2$ when the data are bivariate normal) is:

$$t = r\sqrt{\frac{n-2}{1-r^2}}$$

which follows a $t$-distribution with $n - 2$ degrees of freedom under $H_0$.

Example

A sample of 12 students gives a correlation coefficient of $r = 0.85$ between hours studied and exam score. Test at the 5% significance level whether there is a positive correlation.

$H_0: \rho = 0$ vs $H_1: \rho \gt 0$.

This is a one-tailed test. The critical value for $n = 12$ at the 5% level is approximately $0.497$.

Since $r = 0.85 \gt 0.497$, we reject $H_0$.

There is sufficient evidence at the 5% level to conclude a positive correlation between hours studied and exam score.
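The critical value for $r$ and the $t$ statistic are two views of the same test. As a sketch, the code below computes $t$ for this example and converts a tabulated critical $t$ into the equivalent critical $r$. The value `t_crit = 1.812` is an assumption read from a standard $t$-table (one-tailed, 5%, 10 degrees of freedom), not something computed here.

```python
from math import sqrt

r, n = 0.85, 12
df = n - 2

# Test statistic t = r * sqrt((n - 2) / (1 - r^2)).
t_stat = r * sqrt(df / (1 - r ** 2))

# Assumption: one-tailed 5% critical t for 10 degrees of freedom, read from a t-table.
t_crit = 1.812

# Equivalent critical value on the r scale, obtained by solving the formula for r.
r_crit = t_crit / sqrt(t_crit ** 2 + df)

print(round(t_stat, 2), round(r_crit, 3), t_stat > t_crit)
```

Comparing $t$ with the critical $t$ leads to the same decision as comparing $r$ with the critical $r$.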

Chi-Squared Test for Independence

Used to determine whether two categorical variables are independent.

Test statistic:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ are observed frequencies and $E_i$ are expected frequencies.

Expected frequency:

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$

Degrees of freedom: $\nu = (r-1)(c-1)$ where $r$ is the number of rows and $c$ is the number of columns.

Example

Test whether gender and favourite subject are independent:

| | Maths | Science | English | Total |
| --- | --- | --- | --- | --- |
| Male | 30 | 25 | 15 | 70 |
| Female | 20 | 20 | 40 | 80 |
| Total | 50 | 45 | 55 | 150 |

Expected frequencies:

$E(\text{Male, Maths}) = \dfrac{70 \times 50}{150} = 23.33$

$E(\text{Male, Science}) = \dfrac{70 \times 45}{150} = 21.00$

$E(\text{Male, English}) = \dfrac{70 \times 55}{150} = 25.67$

$E(\text{Female, Maths}) = \dfrac{80 \times 50}{150} = 26.67$

$E(\text{Female, Science}) = \dfrac{80 \times 45}{150} = 24.00$

$E(\text{Female, English}) = \dfrac{80 \times 55}{150} = 29.33$

$$\chi^2 = \frac{(30-23.33)^2}{23.33} + \frac{(25-21)^2}{21} + \frac{(15-25.67)^2}{25.67} + \frac{(20-26.67)^2}{26.67} + \frac{(20-24)^2}{24} + \frac{(40-29.33)^2}{29.33}$$

$$= 1.90 + 0.76 + 4.43 + 1.67 + 0.67 + 3.88 = 13.31$$

Degrees of freedom: $(2-1)(3-1) = 2$.

Critical value at $\alpha = 0.05$ with $\nu = 2$: $5.99$.

Since $13.31 \gt 5.99$, we reject $H_0$ (that gender and favourite subject are independent). Gender and favourite subject are not independent.
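The expected frequencies and the test statistic can be generated from the observed table alone. The sketch below uses plain Python lists; it carries full precision rather than rounding each expected frequency first, so its result may differ in the second decimal place from a hand calculation. The critical value 5.99 is taken from the example, not computed.

```python
observed = [
    [30, 25, 15],  # Male: Maths, Science, English
    [20, 20, 40],  # Female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumption: 5.99 is the 5% critical value for 2 degrees of freedom, as quoted above.
print(round(chi2, 2), dof, chi2 > 5.99, min(min(row) for row in expected) >= 5)
```

The last printed value checks the usual condition that every expected frequency is at least 5.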

Exam Tip

For the chi-squared test, always check that all expected frequencies are at least 5. If any $E_i \lt 5$, combine categories or note the limitation.


IB Exam-Style Questions

Question 1 (Paper 1 style)

The marks of 8 students in Maths and Physics are:

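| Student | Maths ($x$) | Physics ($y$) |
| --- | --- | --- |
| A | 72 | 68 |
| B | 85 | 82 |
| C | 60 | 58 |
| D | 90 | 88 |
| E | 78 | 74 |
| F | 65 | 70 |
| G | 88 | 85 |
| H | 76 | 72 |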

(a) Calculate Pearson's correlation coefficient.

Using a GDC: $r \approx 0.961$.

(b) Interpret this value.

There is a very strong positive linear correlation between Maths and Physics marks.

(c) Find the equation of the regression line of $y$ on $x$.

Using a GDC: $y \approx 6.67 + 0.885x$.

Question 2 (Paper 2 style)

The reaction times (in seconds) of 20 drivers were measured:

0.42, 0.55, 0.61, 0.48, 0.72, 0.38, 0.65, 0.51, 0.44, 0.59, 0.67, 0.53, 0.46, 0.71, 0.39, 0.58, 0.63, 0.50, 0.57, 0.66

(a) Find the mean and standard deviation.

$\bar{x} = 0.5525$, $s \approx 0.104$.

(b) Construct a box-and-whisker plot.

Ordered data: 0.38, 0.39, 0.42, 0.44, 0.46, 0.48, 0.50, 0.51, 0.53, 0.55, 0.57, 0.58, 0.59, 0.61, 0.63, 0.65, 0.66, 0.67, 0.71, 0.72.

Median: $\dfrac{0.55 + 0.57}{2} = 0.56$.

$Q_1$: median of lower half $= \dfrac{0.46 + 0.48}{2} = 0.47$.

$Q_3$: median of upper half $= \dfrac{0.63 + 0.65}{2} = 0.64$.

$\mathrm{IQR} = 0.64 - 0.47 = 0.17$.

Lower fence: $0.47 - 1.5(0.17) = 0.215$. Minimum $= 0.38$.

Upper fence: $0.64 + 1.5(0.17) = 0.895$. Maximum $= 0.72$.

No outliers.
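The five-number summary and the outlier fences can be reproduced with a short script. This sketch uses the "median of each half" quartile method from the working above; GDCs and software packages may use slightly different quartile conventions, so small differences are possible.

```python
from statistics import median

times = [0.42, 0.55, 0.61, 0.48, 0.72, 0.38, 0.65, 0.51, 0.44, 0.59,
         0.67, 0.53, 0.46, 0.71, 0.39, 0.58, 0.63, 0.50, 0.57, 0.66]

ordered = sorted(times)
half = len(ordered) // 2

# Quartiles by the "median of each half" method used in the worked solution.
# For an odd number of values the middle value is left out of both halves.
q1 = median(ordered[:half])
q2 = median(ordered)
q3 = median(ordered[len(ordered) - half:])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in ordered if t < lower_fence or t > upper_fence]

print(ordered[0], round(q1, 2), round(q2, 2), round(q3, 2), ordered[-1], outliers)
```

An empty `outliers` list means the whiskers extend all the way to the minimum and maximum.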

Question 3 (Paper 1 style)

A researcher claims that the mean height of a population is $170 \text{ cm}$. A sample of 25 people gives $\bar{x} = 173 \text{ cm}$ with $s = 8 \text{ cm}$. Test this claim at the 5% significance level.

$H_0: \mu = 170$ vs $H_1: \mu \neq 170$.

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{173 - 170}{8/\sqrt{25}} = \frac{3}{1.6} = 1.875$$

Degrees of freedom $= 24$.

Critical value (two-tailed, 5%) $\approx 2.064$.

Since $|1.875| \lt 2.064$, we do not reject $H_0$.

There is insufficient evidence at the 5% level to reject the claim that the mean height is $170 \text{ cm}$.
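The decision in this question depends only on the summary statistics, so it can be sketched without the raw data. The critical value is taken from the solution above rather than computed.

```python
from math import sqrt

mu_0 = 170          # claimed population mean (cm)
x_bar, s, n = 173, 8, 25

standard_error = s / sqrt(n)
t_stat = (x_bar - mu_0) / standard_error

# Assumption: two-tailed 5% critical value for 24 degrees of freedom, as quoted above.
t_crit = 2.064
reject_h0 = abs(t_stat) > t_crit

print(round(t_stat, 3), reject_h0)
```

With a one-tailed alternative the critical value would be smaller, so the conclusion can depend on how $H_1$ is stated.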


Summary

| Concept | Formula |
| --- | --- |
| Mean | $\bar{x} = \dfrac{\sum x_i}{n}$ |
| Sample variance | $s^2 = \dfrac{\sum(x_i - \bar{x})^2}{n-1}$ |
| IQR | $Q_3 - Q_1$ |
| Correlation | $r = \dfrac{n\sum x_iy_i - \sum x_i \sum y_i}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}}$ |
| Regression slope | $b = \dfrac{n\sum x_iy_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}$ |
| Chi-squared | $\chi^2 = \displaystyle\sum \dfrac{(O_i - E_i)^2}{E_i}$ |

Exam Strategy

For statistics questions in Paper 2, always show your working. State hypotheses clearly for hypothesis tests. When using your GDC, note what function you used and the inputs. Interpret results in context — never leave a numerical answer without explaining what it means.


Transformations of Data

Linear Transformations

If every data value is transformed by $y_i = ax_i + b$:

| Statistic | Original | Transformed |
| --- | --- | --- |
| Mean | $\bar{x}$ | $a\bar{x} + b$ |
| Standard deviation | $s_x$ | $\lvert a \rvert s_x$ |
| Variance | $s_x^2$ | $a^2 s_x^2$ |
| Median | $Q_2$ | $aQ_2 + b$ |
| IQR | $Q_3 - Q_1$ | $\lvert a \rvert (Q_3 - Q_1)$ |
| Correlation coefficient | $r$ | $r$ (unchanged if $a \gt 0$, negated if $a \lt 0$) |
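These rules are easy to confirm numerically. The sketch below reuses the data from the standard deviation example and applies a transformation with hypothetical constants $a = -2$ and $b = 3$ (a negative $a$ is chosen deliberately to show the absolute value at work).

```python
from statistics import mean, median, stdev

x = [4, 8, 6, 5, 3, 8, 9, 2, 7]   # data from the standard deviation example
a, b = -2, 3                      # hypothetical transformation constants
y = [a * xi + b for xi in x]

print(mean(y), a * mean(x) + b)          # means agree
print(stdev(y), abs(a) * stdev(x))       # spread scales by |a|, b has no effect
print(median(y), a * median(x) + b)      # median follows the same rule as the mean
```

Each printed pair should match, up to floating-point rounding.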

Standardised Scores (z-scores)

The z-score measures how many standard deviations a value is from the mean:

$$z = \frac{x - \bar{x}}{s}$$

Example

In a test with mean 65 and standard deviation 8, a student scores 81. Find the z-score.

$$z = \frac{81 - 65}{8} = 2.0$$

The student scored 2 standard deviations above the mean.


Non-Linear Regression

Transformations for Non-Linear Data

When data shows a non-linear pattern, transform the variables to linearise:

| Relationship | Transformation | Linear Form |
| --- | --- | --- |
| $y = a x^b$ | $\log y = \log a + b \log x$ | Plot $\log y$ vs $\log x$ |
| $y = a e^{bx}$ | $\ln y = \ln a + bx$ | Plot $\ln y$ vs $x$ |
| $y = a + b\ln x$ | $y = a + b\ln x$ | Plot $y$ vs $\ln x$ |
| $y = \frac{a}{x} + b$ | $y = a\left(\frac{1}{x}\right) + b$ | Plot $y$ vs $1/x$ |

Power Law

If $\log y$ vs $\log x$ gives a straight line, then $y = ax^b$ where:

  • $b$ is the gradient
  • $\log a$ is the $y$-intercept

Exponential Law

If $\ln y$ vs $x$ gives a straight line, then $y = ae^{bx}$ where:

  • $b$ is the gradient
  • $\ln a$ is the $y$-intercept

Example

Data suggests $y$ is related to $x$ by $y = ax^b$. A plot of $\log y$ vs $\log x$ has gradient $1.5$ and $y$-intercept $0.7$. Find the relationship.

$$b = 1.5, \quad \log a = 0.7 \implies a = 10^{0.7} \approx 5.01$$

$$y \approx 5.01x^{1.5}$$

Additional Exam-Style Questions

Question 4 (Paper 2 style)

Two groups of students took a test. The results are summarised below:

Group A: $n = 30$, $\bar{x} = 72$, $s = 8$

Group B: $n = 25$, $\bar{x} = 68$, $s = 10$

(a) Find the overall mean.

$$\bar{x}_{\mathrm{overall}} = \frac{30 \times 72 + 25 \times 68}{55} = \frac{2160 + 1700}{55} = \frac{3860}{55} \approx 70.2$$

(b) Comment on the spread of the two groups.

Group B has a larger standard deviation (10 vs 8), meaning the scores are more spread out in Group B. Group A's scores are more tightly clustered around the mean.

Question 5 (Paper 2 style)

A scientist investigates the relationship between temperature ($x$, in °C) and reaction rate ($y$, in mol/L/s). The following data was collected:

| $x$ | 10 | 20 | 30 | 40 | 50 | 60 |
| --- | --- | --- | --- | --- | --- | --- |
| $y$ | 0.4 | 1.1 | 2.5 | 5.2 | 10.8 | 22.0 |

(a) Explain why a linear regression model may not be appropriate.

The data appears to show exponential growth: as temperature increases, the rate increases by an increasing amount. A plot of $y$ vs $x$ would show a curve, not a straight line.

(b) By plotting $\ln y$ against $x$, determine whether the relationship is of the form $y = ae^{bx}$.

| $x$ | 10 | 20 | 30 | 40 | 50 | 60 |
| --- | --- | --- | --- | --- | --- | --- |
| $\ln y$ | $-0.916$ | $0.095$ | $0.916$ | $1.649$ | $2.380$ | $3.091$ |

A plot of $\ln y$ vs $x$ would show an approximately linear relationship with a positive gradient, confirming the exponential model is appropriate.

(c) Using a GDC, find the equation of the regression line of $\ln y$ on $x$.

The regression gives approximately $\ln y = -1.56 + 0.0789x$.

So $y = e^{-1.56}e^{0.0789x} \approx 0.210e^{0.0789x}$.
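The linearisation in parts (b) and (c) can be sketched without a GDC: take logarithms, fit a least squares line, then transform the intercept back. Variable names are illustrative, and the output can be compared directly with the coefficients quoted in part (c).

```python
from math import exp, log

temps = [10, 20, 30, 40, 50, 60]
rates = [0.4, 1.1, 2.5, 5.2, 10.8, 22.0]

ln_rates = [log(y) for y in rates]   # linearise: ln y = ln a + b x
n = len(temps)

sum_x, sum_v = sum(temps), sum(ln_rates)
sum_xv = sum(x * v for x, v in zip(temps, ln_rates))
sum_x2 = sum(x * x for x in temps)

# Least squares line of ln y on x.
b = (n * sum_xv - sum_x * sum_v) / (n * sum_x2 - sum_x ** 2)
ln_a = sum_v / n - b * sum_x / n
a = exp(ln_a)

print(round(ln_a, 2), round(b, 4), round(a, 3))
```

Note that this minimises squared residuals in $\ln y$, not in $y$, which is what the linearised GDC regression does as well.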

Question 6 (Paper 1 style)

The correlation coefficient between study hours and exam scores for 50 students is $r = 0.72$.

(a) Calculate the coefficient of determination and interpret it.

$$r^2 = 0.5184$$

About $51.8\%$ of the variation in exam scores is explained by the linear relationship with study hours.

(b) The $p$-value for testing $H_0: \rho = 0$ is $0.0001$. What conclusion can be drawn?

Since $p = 0.0001 \lt 0.05$, we reject $H_0$. There is strong evidence of a positive correlation between study hours and exam scores.


Measures of Shape: Skewness and Kurtosis

Skewness

Skewness measures the asymmetry of the distribution.

| Type | Description | Mean vs Median |
| --- | --- | --- |
| Positive (right) skew | Long tail to the right | Mean $\gt$ Median |
| Negative (left) skew | Long tail to the left | Mean $\lt$ Median |
| Symmetric | No skew | Mean = Median |

The Pearson coefficient of skewness:

$$\mathrm{Skewness} = \frac{3(\bar{x} - Q_2)}{s}$$
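As a quick illustration, the coefficient can be evaluated for the data set from the first worked example, where the outlier 45 pulls the mean above the median. This sketch assumes $s$ is the sample standard deviation (divisor $n - 1$).

```python
from statistics import mean, median, stdev

data = [3, 5, 5, 7, 8, 9, 12, 15, 45]   # data set from the first worked example

# Pearson's coefficient: 3 * (mean - median) / standard deviation.
# Assumption: s is taken as the sample standard deviation (n - 1 divisor).
skewness = 3 * (mean(data) - median(data)) / stdev(data)

print(round(skewness, 2))
```

A positive result is consistent with the long right tail in that data set.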

Kurtosis

Kurtosis measures the "tailedness" of the distribution compared to a normal distribution.

| Type | Description |
| --- | --- |
| Leptokurtic | Heavy tails, sharp peak (kurtosis $\gt 3$) |
| Mesokurtic | Same as normal (kurtosis $= 3$) |
| Platykurtic | Light tails, flat peak (kurtosis $\lt 3$) |

Standardised Scores and Normal Distribution Tables

Using z-Scores

Given a population with mean $\mu$ and standard deviation $\sigma$:

  1. Convert raw score to z-score: $z = \dfrac{x - \mu}{\sigma}$
  2. Use the standard normal table (or GDC) to find probabilities.
  3. Convert back: $x = \mu + z\sigma$

Finding Percentiles

The $p$-th percentile is the value below which $p\%$ of the data falls.

$$x_p = \mu + z_p \cdot \sigma$$

where $z_p$ is the z-score such that $P(Z \lt z_p) = p/100$.

Example

Scores on a test are normally distributed with $\mu = 72$ and $\sigma = 8$. Find the 90th percentile.

$$z_{90} = 1.282$$

$$x_{90} = 72 + 1.282 \times 8 = 72 + 10.26 = 82.26$$

A score of 82.26 is at the 90th percentile.
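The table lookup can be replaced by an inverse normal calculation, which is what a GDC does. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); because it does not round $z$ to three decimal places, its answer may differ slightly in the second decimal place from a table-based value.

```python
from statistics import NormalDist

scores = NormalDist(mu=72, sigma=8)

# Steps 1-3 in one call: inv_cdf returns x with P(X < x) = 0.90.
x_90 = scores.inv_cdf(0.90)

# The same value via the standard normal z-score, then converting back.
z_90 = NormalDist().inv_cdf(0.90)
x_90_via_z = 72 + z_90 * 8

print(round(z_90, 3), round(x_90, 2), round(scores.cdf(x_90), 2))
```

The final `cdf` call confirms that 90% of the distribution lies below the value found.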


Data Collection Methods

Types of Data

| Type | Description | Examples |
| --- | --- | --- |
| Qualitative (categorical) | Labels or names | Colour, gender, nationality |
| Quantitative | Numerical | Height, mass, temperature |
| Discrete | Integer values | Number of siblings, goals scored |
| Continuous | Any value in a range | Height, time, mass |

Sampling Methods

| Method | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Simple random | Every member has equal chance | Unbiased | Difficult for large populations |
| Systematic | Every $k$-th member selected | Easy to implement | May be periodic bias |
| Stratified | Population divided into strata | Ensures representation | Complex to organise |
| Quota | Fixed numbers from each group | Quick | Not random |
| Convenience | Easily accessible members | Easy | Likely biased |

Reliability and Validity

  • Reliability: consistency of measurements (repeatable).
  • Validity: measures what it claims to measure.
  • A measurement can be reliable without being valid, but not valid without being reliable.
