For grouped data in classes $[a_i, b_i)$ with frequency $f_i$, use the class midpoint
$x_i = \dfrac{a_i + b_i}{2}$:
$$\bar{x} \approx \frac{\sum f_i x_i}{\sum f_i}$$
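As a quick illustration of the midpoint estimate, the following Python sketch applies it to the class data from Problem 8 later in these notes. The function name `grouped_mean` is illustrative and not part of the original notes.

```python
def grouped_mean(classes, freqs):
    """Estimate the mean from class intervals [a, b) and their frequencies."""
    # Each class is represented by its midpoint (a + b) / 2.
    midpoints = [(a + b) / 2 for a, b in classes]
    total = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / total


classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 12, 18, 10, 5]
mean_estimate = grouped_mean(classes, freqs)
print(round(mean_estimate, 2))  # 24.6
```

The result is only an estimate, since every value in a class is treated as if it sat at the midpoint.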
Estimating the Median from a Grouped Frequency Distribution
Calculate cumulative frequencies to identify the median class.
Use linear interpolation within the median class:
$$\text{Median} \approx L + \left(\frac{\frac{n}{2} - F}{f}\right) \cdot w$$
where $n$ is the total frequency, $L$ is the lower boundary of the median class, $F$ is the
cumulative frequency before the median class, $f$ is the frequency of the median class, and $w$ is
the class width.
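The interpolation step can be sketched in Python as follows, again using the class data from Problem 8. The function name `grouped_median` is illustrative, and the sketch assumes the classes are listed in increasing order.

```python
def grouped_median(classes, freqs):
    """Estimate the median by linear interpolation within the median class."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    cumulative = 0  # F: cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        # The median class is the first class whose cumulative frequency reaches n/2.
        if f > 0 and cumulative + f >= n / 2:
            width = upper - lower
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("no median class found")


classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 12, 18, 10, 5]
median_estimate = grouped_median(classes, freqs)
print(round(median_estimate, 1))  # 24.4
```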
$R^2$ represents the proportion of variance in $y$ explained by the linear relationship with $x$.
For example, $R^2 = 0.81$ means 81% of the variation in $y$ is accounted for by the regression.
Addition principle. If task A can be done in m ways and task B in n ways, and the tasks are
mutually exclusive, then A or B can be done in m+n ways.
Multiplication principle. If task A can be done in m ways and for each way, task B can be done
in n ways, then A followed by B can be done in m×n ways.
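A small Python sketch can make the two principles concrete by listing the outcomes explicitly. The menu items are hypothetical and serve only as an illustration with $m = 3$ and $n = 2$.

```python
from itertools import product

starters = ["soup", "salad", "bruschetta"]  # m = 3 (hypothetical items)
desserts = ["cake", "fruit"]                # n = 2 (hypothetical items)

# Addition principle: pick ONE item that is either a starter or a dessert.
either = starters + desserts

# Multiplication principle: pick a starter and then, for each starter, a dessert.
both = list(product(starters, desserts))

print(len(either), len(both))  # 5 6
```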
Confusing n and n−1 in variance. Use n for population variance and n−1 for sample
variance (Bessel's correction). The IB generally uses n for frequency tables and n−1 for
sample standard deviation.
Correlation versus causation. A high correlation between two variables does not mean one
causes the other. Confounding variables may explain the association.
Extrapolation. Do not use a regression line to make predictions far outside the range of the
data. The linear model may not hold.
Base rate fallacy. When applying Bayes' theorem, always account for the prior probability. A
test with 99% accuracy can still yield mostly false positives when the condition is rare.
Ordering in permutations vs. combinations. Permutations count arrangements (order matters);
combinations count selections (order does not matter). Identify which is needed before computing.
Histograms with unequal class widths. The bar height is frequency density, not frequency.
Failing to divide by class width produces a misleading graph.
Regression of x on y vs. y on x. The regression line of y on x minimises vertical
residuals. The regression line of x on y minimises horizontal residuals. They are different
unless r=±1.
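To see that the two regression lines really differ, the sketch below fits both to the data set from Problem 2. Variable names are illustrative. The product of the two gradients equals $r^2$, which is why the lines coincide only when $r = \pm 1$.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 4, 7, 6, 9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Regression of y on x: y = a + b x (minimises vertical residuals).
b = sxy / sxx
a = y_bar - b * x_bar

# Regression of x on y: x = c + d y (minimises horizontal residuals).
d = sxy / syy
c = x_bar - d * y_bar

r = sxy / (sxx * syy) ** 0.5

# Gradient of each line when both are drawn on the usual (x, y) axes.
print(round(b, 3), round(1 / d, 3))  # 1.029 1.296
```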
Problem 1
Find the mean, median, mode, variance, and standard deviation of: 4, 7, 2, 9, 4, 6, 3, 4, 8, 5.
Problem 2
Given the bivariate data (x,y): (1,3),(2,5),(3,4),(4,7),(5,6),(6,9), find the
equation of the regression line of y on x and Pearson's correlation coefficient r.
Problem 3
A bag contains 5 red and 3 blue marbles. Two marbles are drawn without replacement. Find the
probability that both are red, and the probability that the second is red given that the first is
blue.
Problem 4
A disease affects 2% of a population. A test has $P(+ \mid D) = 0.99$ and
$P(- \mid D^c) = 0.97$. Find $P(D \mid +)$.
Problem 5
How many ways can 8 people be seated in a row if two particular people must sit together?
Problem 6
A committee of 5 is to be chosen from 7 men and 6 women. How many committees contain at least 2
women?
Problem 7
Find the number of distinct arrangements of the letters in "MISSISSIPPI".
Problem 8
A grouped frequency distribution has classes 0--10, 10--20, 20--30, 30--40,
40--50 with frequencies 5,12,18,10,5. Estimate the mean and median.
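Problem 3: $P(\text{both red}) = \frac{5}{8} \times \frac{4}{7} = \frac{20}{56} = \frac{5}{14}$.
$P(\text{2nd red} \mid \text{1st blue}) = \frac{5}{7}$ (after removing one blue, 5 red remain
out of 7).
Problem 4: $P(D \mid +) = \dfrac{0.99 \times 0.02}{0.99 \times 0.02 + 0.03 \times 0.98} = \dfrac{0.0198}{0.0198 + 0.0294} = \dfrac{0.0198}{0.0492} \approx 0.402$.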
Problem 5: Treat the two people as a single unit. We have 7 units to arrange: 7! ways. The two
people can swap within the unit: 2! ways. Total: 7!×2=10080.
Problem 6: Total committees: $\binom{13}{5} = 1287$. Committees with 0 or 1 woman:
$\binom{7}{5} + \binom{7}{4}\binom{6}{1} = 21 + 35 \times 6 = 231$. Committees with at least 2
women: $1287 - 231 = 1056$.
Problem 7: MISSISSIPPI has 11 letters: M(1), I(4), S(4), P(2). Arrangements:
$\dfrac{11!}{1!\,4!\,4!\,2!} = \dfrac{39916800}{24 \times 24 \times 2} = 34650$.
Problem 8: Midpoints: 5, 15, 25, 35, 45. $\sum f_i = 50$. Mean:
$(25 + 180 + 450 + 350 + 225)/50 = 1230/50 = 24.6$. Median position: $50/2 = 25$. Cumulative
frequencies: $5, 17, 35, \ldots$. Median class is 20--30. Median
$\approx 20 + \dfrac{25 - 17}{18} \times 10 = 20 + \dfrac{80}{18} \approx 24.4$.
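The counting answers to Problems 5 to 7 above can be cross-checked with Python's `math` module. This is only a verification sketch; the variable names are illustrative.

```python
from math import comb, factorial

# Problem 5: glue the pair into one unit (7 units), then order the pair (2!).
seated_together = factorial(7) * factorial(2)

# Problem 6: all committees minus those with 0 women or exactly 1 woman.
at_least_two_women = comb(13, 5) - comb(7, 5) - comb(7, 4) * comb(6, 1)

# Problem 7: arrangements of 11 letters with repeats M(1), I(4), S(4), P(2).
mississippi = factorial(11) // (
    factorial(1) * factorial(4) * factorial(4) * factorial(2)
)

print(seated_together, at_least_two_women, mississippi)  # 10080 1056 34650
```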
Worked Example: Full Regression Analysis with Prediction
A study records the temperature ($x$, in $^\circ\mathrm{C}$) and ice cream sales ($y$, in USD) for
7 days: (18,220), (22,310), (25,380), (28,440), (31,510), (34,580), (20,270).
Find the regression line of $y$ on $x$, the correlation coefficient, and predict sales at $30\,^\circ\mathrm{C}$.
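With $\bar{x} = 178/7 \approx 25.43$, $\bar{y} = 2710/7 \approx 387.14$, $S_{xx} = 1454/7 \approx 207.71$ and $S_{xy} = 32330/7 \approx 4618.57$, the gradient is $b = S_{xy}/S_{xx} \approx 22.235$ and the intercept is $a = \bar{y} - b\bar{x} \approx -178.27$.
Prediction at $x = 30$: $\hat{y} = -178.27 + 22.235(30) \approx 488.8$ USD.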
Since $30\,^\circ\mathrm{C}$ lies within the data range (18 to 34), this is interpolation and
reliable.
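One way to check the arithmetic in this worked example is to compute the sums directly. The sketch below is a minimal least-squares fit written by hand; the helper name `fit_line` is illustrative and not part of the original notes.

```python
def fit_line(xs, ys):
    """Return (a, b, r) for the least-squares line y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                  # gradient
    a = y_bar - b * x_bar          # intercept
    r = sxy / (sxx * syy) ** 0.5   # Pearson's correlation coefficient
    return a, b, r


temps = [18, 22, 25, 28, 31, 34, 20]
sales = [220, 310, 380, 440, 510, 580, 270]

a, b, r = fit_line(temps, sales)
prediction = a + b * 30  # predicted sales at 30 degrees Celsius
print(f"a = {a:.2f}, b = {b:.3f}, r = {r:.4f}, prediction = {prediction:.1f}")
```

Working from the unrounded coefficients avoids the small drift that appears when rounded values of $a$ and $b$ are substituted into the prediction.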
Worked Example: Bayes' Theorem with Multiple Categories
A factory has three machines producing bolts. Machine A produces 50% of output with 2% defect rate.
Machine B produces 30% with 3% defect rate. Machine C produces 20% with 5% defect rate. A bolt is
selected at random and found to be defective. What is the probability it came from Machine C?
Solution
Let D be the event "bolt is defective" and A,B,C be the events "bolt from Machine A, B, C"
respectively.
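With the events defined this way, the calculation follows the law of total probability and then Bayes' theorem. The Python sketch below carries it out with exact fractions; the dictionary names are illustrative.

```python
from fractions import Fraction

# Share of total output from each machine: P(A), P(B), P(C).
priors = {"A": Fraction(50, 100), "B": Fraction(30, 100), "C": Fraction(20, 100)}
# Defect rate for each machine: P(D | A), P(D | B), P(D | C).
defect_rates = {"A": Fraction(2, 100), "B": Fraction(3, 100), "C": Fraction(5, 100)}

# Law of total probability: P(D) = sum over machines of P(machine) * P(D | machine).
p_defective = sum(priors[m] * defect_rates[m] for m in priors)

# Bayes' theorem: P(machine | D) = P(machine) * P(D | machine) / P(D).
posteriors = {m: priors[m] * defect_rates[m] / p_defective for m in priors}

print(p_defective)               # 29/1000
print(posteriors["C"])           # 10/29
print(round(float(posteriors["C"]), 3))  # 0.345
```

Although Machine C produces only 20% of the output, its higher defect rate means it accounts for about 34.5% of the defective bolts.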
Eight students are ranked by two judges. The rank pairs are: (1,2), (2,1), (3,4), (4,3),
(5,6), (6,5), (7,8), (8,7). Calculate Spearman's rank correlation coefficient.
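Since there are no tied ranks here, the short formula $r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$ applies. A minimal Python sketch of the calculation, with illustrative variable names:

```python
rank_pairs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5), (7, 8), (8, 7)]

n = len(rank_pairs)
# d_i is the difference between the two judges' ranks for student i.
sum_d_squared = sum((first - second) ** 2 for first, second in rank_pairs)

r_s = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))
print(sum_d_squared, round(r_s, 3))  # 8 0.905
```

Every pair differs by exactly one rank, so $\sum d_i^2 = 8$ and the judges are in strong agreement.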
Sample vs. population variance. On the GDC, the sample standard deviation function (usually
denoted $s_x$ or $\sigma_{n-1}$) divides by $n-1$. The population standard deviation ($\sigma_x$
or $\sigma_n$) divides by $n$. Using the wrong one changes the answer significantly for small samples.
Spearman's formula requires distinct ranks. The formula $r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$
assumes no tied ranks. When ties exist, use the full Pearson formula on the ranks instead.
Confusing independent with mutually exclusive. Independent events can occur simultaneously
($P(A \cap B) = P(A)P(B) \neq 0$ in general). Mutually exclusive events cannot
($P(A \cap B) = 0$). Independent events with nonzero probability are never mutually exclusive.
Misidentifying favourable outcomes. When computing probabilities, carefully define what
constitutes a "favourable outcome" and ensure the counting is complete. Overcounting or
undercounting is a frequent source of error.
Regression interpretation beyond the data. The regression line of y on x gives y^
for a given x, but not x^ for a given y. Using y=a+bx to predict x from y
requires solving for x, which gives a different line than the regression of x on y.
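To see the size of the gap between the two standard deviation formulas mentioned in the pitfalls above, the sketch below uses Python's `statistics` module on the ten values 4, 7, 2, 9, 4, 6, 3, 4, 8, 5 from an earlier problem. `pstdev` divides by $n$ and `stdev` divides by $n-1$.

```python
import statistics

data = [4, 7, 2, 9, 4, 6, 3, 4, 8, 5]

pop_sd = statistics.pstdev(data)    # divides by n (population formula)
sample_sd = statistics.stdev(data)  # divides by n - 1 (Bessel's correction)

print(round(pop_sd, 3), round(sample_sd, 3))  # 2.135 2.251
```

With only ten values the two results already differ in the first decimal place, so it matters which one a question asks for.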
Problem 9
The lengths of 10 rods (in cm) are: 12.3, 12.7, 11.9, 12.5, 13.1, 12.0, 12.8, 11.8, 12.4, 12.6.
Calculate the mean, standard deviation, and the percentage of data within one standard deviation of
the mean.
Problem 10
In a bag there are 4 red, 5 blue, and 6 green marbles. Three marbles are drawn without replacement.
Find the probability that: (a) all three are the same colour; (b) exactly two are blue.
Problem 11
Two events A and B satisfy P(A)=0.4, P(B)=0.7, and P(A∩B)=0.25. Find:
(a) P(A∣B); (b) P(A∪B); (c) whether A and B are independent.
Problem 12
A fair coin is tossed 5 times. Find the probability of getting exactly 3 heads.
Problem 13
The bivariate data has $\sum x = 60$, $\sum y = 84$, $\sum x^2 = 440$, $\sum y^2 = 860$,
$\sum xy = 580$, and $n = 10$. Find the equation of the regression line of $y$ on $x$ and the
coefficient of determination.
Problem 14
How many 4-digit even numbers can be formed from the digits 0,1,2,3,4,5 if no digit may be
repeated?
Problem 15
A doctor gives a patient a test for a condition. The test has sensitivity 0.92 and specificity 0.95.
If the condition prevalence in the patient's demographic is 0.08, find the probability that the patient
actually has the condition given a positive test result.
Answers to Additional Problems
Problem 9: Ordered data: 11.8,11.9,12.0,12.3,12.4,12.5,12.6,12.7,12.8,13.1.
Mean: $\bar{x} = 124.1/10 = 12.41$.
$s^2 = \dfrac{\sum x_i^2 - 10(12.41)^2}{9}$.
$\sum x_i^2 = 139.24 + 141.61 + 144.00 + 151.29 + 153.76 + 156.25 + 158.76 + 161.29 + 163.84 + 171.61 = 1541.65$.
$s^2 = (1541.65 - 1540.08)/9 = 1.57/9 \approx 0.174$. $s = \sqrt{0.174} \approx 0.418$.
Range within one standard deviation: $12.41 \pm 0.42 = (11.99, 12.83)$.
Values in range: 12.0, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8 = 7 out of 10 = 70%.
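Problem 10: Total marbles: 15. Total ways to draw 3: $\binom{15}{3} = 455$.
(a) Same colour: $\binom{4}{3} + \binom{5}{3} + \binom{6}{3} = 4 + 10 + 20 = 34$. $P = 34/455 \approx 0.0747$.
(b) Exactly two blue: $\binom{5}{2}\binom{10}{1} = 10 \times 10 = 100$. $P = 100/455 \approx 0.2198$.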
Problem 11: (a) P(A∣B)=P(A∩B)/P(B)=0.25/0.70=5/14≈0.357.
(b) P(A∪B)=P(A)+P(B)−P(A∩B)=0.4+0.7−0.25=0.85.
(c) Independent iff $P(A \cap B) = P(A)P(B)$. Here $P(A)P(B) = 0.28 \neq 0.25 = P(A \cap B)$, so $A$ and $B$ are not independent.
Problem 12: $\binom{5}{3}(0.5)^3(0.5)^2 = 10 \times \frac{1}{32} = \frac{5}{16}$.
Problem 13: $S_{xx} = 440 - 10(6)^2 = 440 - 360 = 80$.
$S_{xy} = 580 - 10(6)(8.4) = 580 - 504 = 76$.
$b = 76/80 = 0.95$. $a = 8.4 - 0.95(6) = 8.4 - 5.7 = 2.7$.
Line: $y = 2.7 + 0.95x$.
$S_{yy} = 860 - 10(8.4)^2 = 860 - 705.6 = 154.4$.
$r = 76/\sqrt{80 \times 154.4} = 76/\sqrt{12352} = 76/111.1 \approx 0.684$.
$R^2 = r^2 \approx 0.468$, meaning about 46.8% of the variation in $y$ is explained by the linear model.
Problem 14: For a 4-digit number, the first digit cannot be 0. The last digit must be even (0, 2, 4).
Case 1: Last digit is 0. First digit: 5 choices (1--5), middle two: ${}^4P_2 = 12$. Total: $5 \times 12 = 60$.
Case 2: Last digit is 2 or 4 (2 choices). First digit: 4 choices (1--5, excluding the one used for the last digit). Middle two: ${}^4P_2 = 12$. Total: $2 \times 4 \times 12 = 96$.
Total: 60+96=156.
Problem 15: $P(D \mid +) = \dfrac{0.92 \times 0.08}{0.92 \times 0.08 + 0.05 \times 0.92} = \dfrac{0.0736}{0.0736 + 0.046} = \dfrac{0.0736}{0.1196} \approx 0.615$.
Despite a positive test, there is only about a 61.5% chance the patient has the condition.
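Case-based counts such as the one in Problem 14 are easy to get wrong, so a brute-force check can be reassuring. The sketch below simply enumerates every ordered selection of four distinct digits and keeps those that form a 4-digit even number. It is a verification aid, not an exam method.

```python
from itertools import permutations

digits = "012345"

# Keep arrangements with a nonzero leading digit and an even final digit.
count = sum(
    1
    for p in permutations(digits, 4)
    if p[0] != "0" and p[-1] in "024"
)
print(count)  # 156
```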