Statistics

Descriptive Statistics

Measures of Central Tendency

Mean. The arithmetic mean of $x_1, x_2, \ldots, x_n$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For grouped data with class midpoints $x_i$ and frequencies $f_i$:

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

Median. The middle value of ordered data. If $n$ is odd, the median is $x_{(n+1)/2}$. If $n$ is even, it is $\dfrac{x_{n/2} + x_{n/2 + 1}}{2}$.

Mode. The most frequently occurring value. A distribution may be unimodal, bimodal, or have no mode.

Measures of Spread

Range. $\mathrm{Range} = x_{\max} - x_{\min}$

Interquartile range (IQR). $\mathrm{IQR} = Q_3 - Q_1$

where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile.

Variance. The population variance is:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

The sample variance (unbiased estimator) is:

$$s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Computational formula:

$$s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} = \frac{n\sum x_i^2 - (\sum x_i)^2}{n(n - 1)}$$

Standard deviation. $s = \sqrt{s^2}$, measured in the same units as the data.
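The definitional and computational forms of the sample variance can be checked against each other numerically. The following is a minimal Python sketch, not a prescribed method: the data values and function names are made up for illustration, and the standard-library `statistics` functions serve only as a cross-check.

```python
import math
import statistics


def sample_variance(xs):
    """Sample variance from the definition: squared deviations over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_computational(xs):
    """Same quantity via (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


# Illustrative data (an assumption, not taken from the text).
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)                    # divides by n - 1
s2_alt = sample_variance_computational(data)  # should agree with s2
s = math.sqrt(s2)                             # sample standard deviation
sigma2 = statistics.pvariance(data)           # population variance, divides by n
```

For this data set the population variance is 4 while the sample variance is 32/7, which shows how much the choice of divisor matters for small $n$.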

Box-and-Whisker Plots

A box plot displays five key statistics: minimum, $Q_1$, median, $Q_3$, maximum. Outliers are points more than $1.5 \times \mathrm{IQR}$ below $Q_1$ or above $Q_3$.
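A short Python sketch of the five-number summary and the $1.5 \times \mathrm{IQR}$ outlier rule follows. It assumes the "inclusive" quartile method of the standard-library `statistics.quantiles`; textbooks and calculators use slightly different quartile conventions, so $Q_1$ and $Q_3$ may differ a little from a GDC. The data and function names are illustrative.

```python
import statistics


def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, median, q3, max(xs)


def outliers(xs):
    """Points more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in xs if x < low_fence or x > high_fence]


# Illustrative data with one extreme value (an assumption, not from the text).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(data)
flagged = outliers(data)
```

Here the upper fence is $7.75 + 1.5 \times 4.5 = 14.5$, so only the value 100 is flagged.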


Grouped Data

Estimating the Mean and Standard Deviation

For grouped data in classes $[a_i, b_i)$ with frequency $f_i$, use the class midpoint $x_i = \dfrac{a_i + b_i}{2}$:

$$\bar{x} \approx \frac{\sum f_i x_i}{\sum f_i}$$

Estimating the Median from a Grouped Frequency Distribution

  1. Calculate cumulative frequencies to identify the median class.
  2. Use linear interpolation within the median class:

$$\mathrm{Median} \approx L + \left(\frac{\frac{n}{2} - F}{f}\right) \cdot w$$

where $L$ is the lower boundary of the median class, $F$ is the cumulative frequency before the median class, $f$ is the frequency of the median class, and $w$ is the class width.
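The midpoint estimate of the mean and the interpolation estimate of the median can be sketched in Python as below. The function names and the small frequency table are assumptions chosen for illustration.

```python
def grouped_mean(classes, freqs):
    """Estimate the mean using class midpoints weighted by frequency."""
    total = sum(freqs)
    return sum(f * (lower + upper) / 2 for (lower, upper), f in zip(classes, freqs)) / total


def grouped_median(classes, freqs):
    """Estimate the median by linear interpolation within the median class."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative = 0  # F: cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= half:
            # L + ((n/2 - F) / f) * w
            return lower + (half - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("median class not found")


# Illustrative table (an assumption, not from the text).
classes = [(0, 10), (10, 20), (20, 30)]
freqs = [3, 12, 5]

mean_estimate = grouped_mean(classes, freqs)
median_estimate = grouped_median(classes, freqs)
```

With these values the median class is 10 to 20, giving $10 + \frac{10 - 3}{12} \times 10 \approx 15.8$, while the estimated mean is 16.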

Histograms

In a histogram, the area of each bar, not its height, represents the frequency. The height is the frequency density:

$$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$$
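As a quick illustration, the sketch below computes frequency densities for three classes of unequal width that all have the same frequency. The table is invented for the example.

```python
def frequency_densities(classes, freqs):
    """Bar heights for a histogram: frequency divided by class width."""
    return [f / (upper - lower) for (lower, upper), f in zip(classes, freqs)]


# Illustrative classes with unequal widths (an assumption, not from the text).
classes = [(0, 10), (10, 30), (30, 35)]
freqs = [20, 20, 20]

heights = frequency_densities(classes, freqs)

# Multiplying each height by its class width recovers the frequency (the bar area).
areas = [h * (upper - lower) for h, (lower, upper) in zip(heights, classes)]
```

Although the frequencies are equal, the bar heights are 2, 1, and 4, so plotting raw frequencies would misrepresent the narrow class.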


Correlation

Scatter Diagrams

A scatter diagram plots bivariate data points $(x_i, y_i)$. Visual inspection reveals the direction and strength of association.

Pearson's Product-Moment Correlation Coefficient

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \cdot \sum(y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}}$$

where $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$, $S_{xx} = \sum x_i^2 - n\bar{x}^2$, and $S_{yy} = \sum y_i^2 - n\bar{y}^2$.

Properties:

  • $-1 \le r \le 1$
  • $r = 1$: perfect positive linear correlation
  • $r = -1$: perfect negative linear correlation
  • $r = 0$: no linear correlation (nonlinear relationships may exist)
  • Correlation does not imply causation
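The properties above can be explored with a small Python sketch that computes $r$ from $S_{xy}$, $S_{xx}$, and $S_{yy}$. The function name and the three data sets are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_yy = sum(y * y for y in ys) - n * y_bar ** 2
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [-2, -1, 0, 1, 2]

r_increasing = pearson_r(xs, [2 * x + 1 for x in xs])   # exact increasing line
r_decreasing = pearson_r(xs, [-3 * x + 4 for x in xs])  # exact decreasing line
r_parabola = pearson_r(xs, [x ** 2 for x in xs])        # perfect nonlinear relationship
```

The parabola case gives $r = 0$ even though $y$ is completely determined by $x$, which is why $r = 0$ only rules out a linear relationship.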

Spearman's Rank Correlation Coefficient

For ordinal data or when the relationship is monotonic but not linear:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference in ranks of the $i$-th pair. This formula applies when ranks are distinct (no ties).
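A minimal Python sketch of this formula is given below. It assumes distinct values (no ties); the helper names and the scores are hypothetical.

```python
def ranks(values):
    """Rank distinct values from 1 (smallest) to n (largest)."""
    ordered = sorted(values)
    if len(set(ordered)) != len(ordered):
        raise ValueError("tied values: the shortcut formula does not apply")
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(rank_x, rank_y):
    """Spearman's coefficient from rank differences, assuming no ties."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Illustrative scores from two assessors (an assumption, not from the text).
scores_x = [55, 62, 70, 81, 90]
scores_y = [48, 40, 66, 60, 75]

rs = spearman_rs(ranks(scores_x), ranks(scores_y))
```

The ranks are $(1, 2, 3, 4, 5)$ and $(2, 1, 4, 3, 5)$, so $\sum d_i^2 = 4$ and $r_s = 1 - \frac{24}{120} = 0.8$.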


Linear Regression

Least Squares Method

The regression line of $y$ on $x$ minimises $\sum(y_i - \hat{y}_i)^2$:

$$y = a + bx$$

where:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}, \qquad a = \bar{y} - b\bar{x}$$

The line always passes through $(\bar{x}, \bar{y})$.
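The formulas for $a$ and $b$ translate directly into code. The Python sketch below uses a made-up data set and a hypothetical function name, and it checks that the fitted line passes through $(\bar{x}, \bar{y})$.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the least-squares line y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Illustrative data (an assumption, not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_y_on_x(xs, ys)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
value_at_mean = a + b * x_bar  # should equal y_bar
```

For this data $S_{xy} = 6$ and $S_{xx} = 10$, so $b = 0.6$ and $a = 4 - 0.6 \times 3 = 2.2$.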

Interpolation and Extrapolation

  • Interpolation: predicting within the range of observed data (generally reliable)
  • Extrapolation: predicting outside the observed range (unreliable; the linear model may not hold)

Coefficient of Determination

$$R^2 = r^2$$

$R^2$ represents the proportion of variance in $y$ explained by the linear relationship with $x$. For example, $R^2 = 0.81$ means 81% of the variation in $y$ is accounted for by the regression.
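To see why $R^2$ can be read as a proportion of explained variation, the sketch below computes it in two ways for a simple linear fit: as $r^2$, and as $1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}}$, where $\mathrm{SS}_{\text{res}}$ is the sum of squared residuals and $\mathrm{SS}_{\text{tot}} = S_{yy}$. These two symbols and the data are introduced here for illustration only.

```python
def r_squared_two_ways(xs, ys):
    """Return (1 - SS_res / SS_tot, r^2) for the least-squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    return 1 - ss_res / s_yy, s_xy ** 2 / (s_xx * s_yy)


# Illustrative data (an assumption, not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

from_residuals, from_r = r_squared_two_ways(xs, ys)
```

Both routes give 0.6 here: the residuals account for 2.4 of the total variation of 6, leaving 60% explained by the line.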


Conditional Probability

Definition

The conditional probability of $A$ given $B$ is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$

Multiplication Rule

$$P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$

For independent events: $P(A \cap B) = P(A) \cdot P(B)$, and $P(A \mid B) = P(A)$.
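These definitions can be checked by enumerating a small sample space. The Python sketch below uses two fair dice, an example chosen here for illustration, with $A$ = "the total is 8" and $B$ = "the first die is even", and exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes for two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


def is_a(outcome):
    return outcome[0] + outcome[1] == 8  # A: total is 8


def is_b(outcome):
    return outcome[0] % 2 == 0  # B: first die is even


p_a = prob(is_a)
p_b = prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))

p_a_given_b = p_a_and_b / p_b          # definition of conditional probability
independent = p_a_and_b == p_a * p_b   # independence test
```

Here $P(A \mid B) = \frac{1}{6}$ while $P(A) = \frac{5}{36}$, so the events are not independent.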

Total Probability

If $B_1, B_2, \ldots, B_n$ partition the sample space (the $B_i$ are mutually exclusive and exhaustive):

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i) \cdot P(B_i)$$

Bayes' Theorem

$$P(B_j \mid A) = \frac{P(A \mid B_j) \cdot P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i) \cdot P(B_i)}$$

Bayes' theorem inverts a conditional probability. It is foundational to Bayesian inference, medical testing, spam filtering, and machine learning.

Example. A test for a disease has sensitivity $P(+ \mid D) = 0.95$ and specificity $P(- \mid D^c) = 0.98$. If the disease prevalence is $P(D) = 0.01$, find $P(D \mid +)$.

The false-positive rate is $P(+ \mid D^c) = 1 - 0.98 = 0.02$, so

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid D^c)\,P(D^c)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.02 \times 0.99} = \frac{0.0095}{0.0293} \approx 0.324$$

Despite a positive test, the probability of actually having the disease is only about 32%, because at such a low prevalence false positives outnumber true positives. Ignoring the prevalence when interpreting such a result is the base rate fallacy.
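The calculation in this example can be wrapped in a small function to see how strongly the answer depends on prevalence. This is a sketch with a hypothetical function name; the second prevalence value (10%) is an added illustration, not part of the example.

```python
def posterior_given_positive(sensitivity, specificity, prevalence):
    """P(D | +) by Bayes' theorem for a two-outcome diagnostic test."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Values from the example: sensitivity 0.95, specificity 0.98, prevalence 0.01.
rare = posterior_given_positive(0.95, 0.98, 0.01)

# Same test applied where the condition is more common (assumed 10% prevalence).
common = posterior_given_positive(0.95, 0.98, 0.10)
```

With the same test, raising the prevalence from 1% to 10% lifts $P(D \mid +)$ from about 0.32 to about 0.84.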


Combinatorics

Factorial

$$n! = n \times (n-1) \times \cdots \times 2 \times 1, \quad 0! = 1$$

Permutations

The number of ways to arrange $r$ objects chosen from $n$ distinct objects in order:

$${}^n P_r = \frac{n!}{(n - r)!}$$

When all $n$ objects are arranged: ${}^n P_n = n!$.

Combinations

The number of ways to choose $r$ objects from $n$ without regard to order:

$${n \choose r} = \frac{n!}{r!\,(n - r)!}$$

Key identity: ${n \choose r} = {n \choose n - r}$
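Python's standard library provides `math.perm` and `math.comb` (Python 3.8 and later), which makes it easy to check these formulas for small cases. The values of $n$ and $r$ below are arbitrary.

```python
import math

n, r = 5, 2

# Ordered arrangements of r objects chosen from n: n! / (n - r)!
permutation_count = math.perm(n, r)

# Unordered selections of r objects from n: n! / (r! (n - r)!)
combination_count = math.comb(n, r)

# Each selection of r objects can be ordered in r! ways.
relation_holds = permutation_count == combination_count * math.factorial(r)

# Symmetry: choosing r to include is the same as choosing n - r to leave out.
symmetry_holds = math.comb(n, r) == math.comb(n, n - r)
```

For $n = 5$ and $r = 2$ there are 20 ordered arrangements but only 10 selections, since each pair is counted twice when order matters.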

Counting Principles

Addition principle. If task A can be done in $m$ ways and task B in $n$ ways, and the tasks are mutually exclusive, then A or B can be done in $m + n$ ways.

Multiplication principle. If task A can be done in $m$ ways and for each way, task B can be done in $n$ ways, then A followed by B can be done in $m \times n$ ways.

Arrangements with Repetition

The number of distinct arrangements of $n$ objects where there are $n_1$ identical of type 1, $n_2$ identical of type 2, and so on up to $n_k$ of type $k$ (so that $n_1 + n_2 + \cdots + n_k = n$):

$$\frac{n!}{n_1!\, n_2!\, \cdots\, n_k!}$$
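For a short word the formula can be compared with a brute-force count. The sketch below uses the word "BANANA", chosen here only as an illustration, and a hypothetical helper name.

```python
import math
from collections import Counter
from itertools import permutations


def distinct_arrangements(items):
    """n! divided by the factorial of each repeated item's count."""
    total = math.factorial(len(items))
    for count in Counter(items).values():
        total //= math.factorial(count)
    return total


word = "BANANA"  # A appears 3 times, N twice, B once

formula_count = distinct_arrangements(word)

# Brute force: generate all 6! orderings and keep only the distinct ones.
brute_force_count = len(set(permutations(word)))
```

Both counts are $\frac{6!}{3!\,2!\,1!} = 60$. The brute-force check is only practical for short inputs because it enumerates all $n!$ orderings.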

Applications to Probability

When outcomes are equally likely:

$$P(\text{event}) = \frac{\text{number of favourable outcomes}}{\text{total number of outcomes}}$$


Common Pitfalls

  1. Confusing $n$ and $n - 1$ in variance. Use $n$ for population variance and $n - 1$ for sample variance (Bessel's correction). The IB generally uses $n$ for frequency tables and $n - 1$ for sample standard deviation.

  2. Correlation versus causation. A high correlation between two variables does not mean one causes the other. Confounding variables may explain the association.

  3. Extrapolation. Do not use a regression line to make predictions far outside the range of the data. The linear model may not hold.

  4. Base rate fallacy. When applying Bayes' theorem, always account for the prior probability. A test with 99% accuracy can still yield mostly false positives when the condition is rare.

  5. Ordering in permutations vs. combinations. Permutations count arrangements (order matters); combinations count selections (order does not matter). Identify which is needed before computing.

  6. Histograms with unequal class widths. The bar height is frequency density, not frequency. Failing to divide by class width produces a misleading graph.

  7. Regression of $x$ on $y$ vs. $y$ on $x$. The regression line of $y$ on $x$ minimises vertical residuals. The regression line of $x$ on $y$ minimises horizontal residuals. They are different unless $r = \pm 1$.


Practice Problems

Problem 1

Find the mean, median, mode, sample variance, and sample standard deviation of: $4, 7, 2, 9, 4, 6, 3, 4, 8, 5$.

Problem 2

Given the bivariate data $(x, y)$: $(1, 3), (2, 5), (3, 4), (4, 7), (5, 6), (6, 9)$, find the equation of the regression line of $y$ on $x$ and Pearson's correlation coefficient $r$.

Problem 3

A bag contains 5 red and 3 blue marbles. Two marbles are drawn without replacement. Find the probability that both are red, and the probability that the second is red given that the first is blue.

Problem 4

A disease affects 2% of a population. A test has $P(+ \mid D) = 0.99$ and $P(- \mid D^c) = 0.97$. Find $P(D \mid +)$.

Problem 5

How many ways can 8 people be seated in a row if two particular people must sit together?

Problem 6

A committee of 5 is to be chosen from 7 men and 6 women. How many committees contain at least 2 women?

Problem 7

Find the number of distinct arrangements of the letters in "MISSISSIPPI".

Problem 8

A grouped frequency distribution has classes 0–10, 10–20, 20–30, 30–40, 40–50 with frequencies $5, 12, 18, 10, 5$. Estimate the mean and median.

Answers to Selected Problems

Problem 1: Ordered data: $2, 3, 4, 4, 4, 5, 6, 7, 8, 9$. Mean: $\bar{x} = 52/10 = 5.2$. Median: $(4 + 5)/2 = 4.5$. Mode: $4$. $\sum x_i^2 = 4 + 9 + 16 + 16 + 16 + 25 + 36 + 49 + 64 + 81 = 316$, so $s^2 = (316 - 10 \times 27.04)/9 = (316 - 270.4)/9 = 45.6/9 \approx 5.07$ and $s = \sqrt{5.07} \approx 2.25$.

Problem 2: $n = 6$, $\sum x = 21$, $\sum y = 34$, $\sum x^2 = 91$, $\sum y^2 = 216$, $\sum xy = 137$. $\bar{x} = 3.5$, $\bar{y} \approx 5.667$. $S_{xy} = 137 - 6 \times 3.5 \times 5.667 = 137 - 119 = 18$, $S_{xx} = 91 - 6 \times 12.25 = 91 - 73.5 = 17.5$, and $S_{yy} = 216 - 6 \times 5.667^2 \approx 216 - 192.67 = 23.33$. $b = 18/17.5 \approx 1.029$. $a = 5.667 - 1.029 \times 3.5 \approx 2.07$. Line: $y = 2.07 + 1.03x$. $r = 18/\sqrt{17.5 \times 23.33} = 18/\sqrt{408.3} \approx 18/20.21 \approx 0.891$.

Problem 3: $P(\text{both red}) = \dfrac{5}{8} \times \dfrac{4}{7} = \dfrac{20}{56} = \dfrac{5}{14}$. $P(\text{2nd red} \mid \text{1st blue}) = \dfrac{5}{7}$ (after removing one blue, 5 red remain out of 7).

Problem 4: $P(D \mid +) = \dfrac{0.99 \times 0.02}{0.99 \times 0.02 + 0.03 \times 0.98} = \dfrac{0.0198}{0.0198 + 0.0294} = \dfrac{0.0198}{0.0492} \approx 0.402$.

Problem 5: Treat the two people as a single unit. We have 7 units to arrange: $7!$ ways. The two people can swap within the unit: $2!$ ways. Total: $7! \times 2 = 10080$.

Problem 6: Total committees: ${13 \choose 5} = 1287$. Committees with 0 or 1 woman: ${7 \choose 5} + {7 \choose 4}{6 \choose 1} = 21 + 35 \times 6 = 231$. Committees with at least 2 women: $1287 - 231 = 1056$.

Problem 7: MISSISSIPPI has 11 letters: M(1), I(4), S(4), P(2). Arrangements: $\dfrac{11!}{1!\,4!\,4!\,2!} = \dfrac{39916800}{24 \times 24 \times 2} = 34650$.

Problem 8: Midpoints: $5, 15, 25, 35, 45$. $\sum f_i = 50$. Mean: $(25 + 180 + 450 + 350 + 225)/50 = 1230/50 = 24.6$. Median position: $50/2 = 25$. Cumulative frequencies: $5, 17, 35, \ldots$. Median class is 20–30. Median $\approx 20 + \dfrac{25 - 17}{18} \times 10 = 20 + \dfrac{80}{18} \approx 24.4$.


Worked Examples

Worked Example: Full Regression Analysis with Prediction

A study records the temperature ($x$, in $^\circ\mathrm{C}$) and ice cream sales ($y$, in USD) for 7 days: $(18, 220)$, $(22, 310)$, $(25, 380)$, $(28, 440)$, $(31, 510)$, $(34, 580)$, $(20, 270)$. Find the regression line of $y$ on $x$, the correlation coefficient, and predict sales at $30\,^\circ\mathrm{C}$.

Solution

Computing: $n = 7$, $\sum x = 178$, $\sum y = 2710$, $\sum x^2 = 4734$, $\sum y^2 = 1151900$, $\sum xy = 73530$.

$\bar{x} = 178/7 \approx 25.429$, $\bar{y} = 2710/7 \approx 387.143$.

To limit rounding error, the sums below use the exact means rather than their rounded values.

$$S_{xx} = 4734 - \frac{178^2}{7} \approx 4734 - 4526.286 = 207.714$$

$$S_{yy} = 1151900 - \frac{2710^2}{7} \approx 1151900 - 1049157.143 = 102742.857$$

$$S_{xy} = 73530 - \frac{178 \times 2710}{7} \approx 73530 - 68911.429 = 4618.571$$

$$b = \frac{S_{xy}}{S_{xx}} = \frac{4618.571}{207.714} \approx 22.235$$

$$a = \bar{y} - b\bar{x} \approx 387.143 - 22.235 \times 25.429 \approx 387.143 - 565.41 \approx -178.27$$

Regression line: $y = -178.27 + 22.235x$.

$$r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = \frac{4618.571}{\sqrt{207.714 \times 102742.857}} \approx \frac{4618.571}{4619.65} \approx 0.9998$$

There is a very strong positive linear correlation.

Prediction at $x = 30$: $\hat{y} = -178.27 + 22.235(30) = -178.27 + 667.05 \approx 488.8$ USD.

Since $30\,^\circ\mathrm{C}$ lies within the data range ($18$ to $34$), this is interpolation and the prediction should be reliable.

Worked Example: Bayes' Theorem with Multiple Categories

A factory has three machines producing bolts. Machine A produces 50% of output with 2% defect rate. Machine B produces 30% with 3% defect rate. Machine C produces 20% with 5% defect rate. A bolt is selected at random and found to be defective. What is the probability it came from Machine C?

Solution

Let $D$ be the event "bolt is defective" and $A, B, C$ be the events "bolt from Machine A, B, C" respectively.

Prior probabilities: $P(A) = 0.50$, $P(B) = 0.30$, $P(C) = 0.20$.

Likelihoods: $P(D \mid A) = 0.02$, $P(D \mid B) = 0.03$, $P(D \mid C) = 0.05$.

Total probability of defect:

$$P(D) = 0.02(0.50) + 0.03(0.30) + 0.05(0.20) = 0.010 + 0.009 + 0.010 = 0.029$$

By Bayes' theorem:

$$P(C \mid D) = \frac{P(D \mid C) \cdot P(C)}{P(D)} = \frac{0.05 \times 0.20}{0.029} = \frac{0.010}{0.029} \approx 0.345$$

Despite producing only 20% of output, Machine C is responsible for about 34.5% of defects due to its higher defect rate.
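The same calculation gives the posterior probability for every machine at once. A minimal Python sketch, using the priors and defect rates stated in the problem and a hypothetical function name:

```python
def posteriors(priors, likelihoods):
    """Return (posterior for each category, total probability of the evidence)."""
    joint = {name: priors[name] * likelihoods[name] for name in priors}
    total = sum(joint.values())  # law of total probability
    return {name: value / total for name, value in joint.items()}, total


priors = {"A": 0.50, "B": 0.30, "C": 0.20}        # share of output
defect_rates = {"A": 0.02, "B": 0.03, "C": 0.05}  # P(D | machine)

posterior, p_defect = posteriors(priors, defect_rates)
```

Machines A and C each account for about 34.5% of defective bolts and Machine B for about 31.0%, and the three posteriors sum to 1.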

Worked Example: Combinatorial Counting with Restrictions

A panel of 4 people is to be selected from 6 men and 5 women. How many panels contain at least one man and at least one woman?

Solution

Total ways to choose 4 from 11: ${11 \choose 4} = 330$.

Subtract the all-men panels: ${6 \choose 4} = 15$.

Subtract the all-women panels: ${5 \choose 4} = 5$.

Panels with at least one of each: $330 - 15 - 5 = 310$.

Alternatively, count directly: ${6 \choose 1}{5 \choose 3} + {6 \choose 2}{5 \choose 2} + {6 \choose 3}{5 \choose 1} = 60 + 150 + 100 = 310$.
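A count this small can also be confirmed by listing every possible panel. The Python sketch below labels the people arbitrarily (the labels are an assumption) and compares the brute-force count with both methods above.

```python
from itertools import combinations
from math import comb

# Label each person by group and index: 6 men and 5 women.
people = [("M", i) for i in range(6)] + [("W", i) for i in range(5)]

# Enumerate every panel of 4 and keep those containing both groups.
mixed_panels = sum(
    1
    for panel in combinations(people, 4)
    if {group for group, _ in panel} == {"M", "W"}
)

# Complement method: all panels minus single-gender panels.
by_complement = comb(11, 4) - comb(6, 4) - comb(5, 4)

# Direct method: sum over the number of men on the panel (1, 2, or 3).
by_cases = sum(comb(6, men) * comb(5, 4 - men) for men in range(1, 4))
```

All three counts agree at 310.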

Worked Example: Spearman's Rank Correlation

Eight students are ranked by two judges. The rank pairs are: $(1, 2)$, $(2, 1)$, $(3, 4)$, $(4, 3)$, $(5, 6)$, $(6, 5)$, $(7, 8)$, $(8, 7)$. Calculate Spearman's rank correlation coefficient.

Solution

Differences (first judge's rank minus second judge's rank): $d_i = -1, 1, -1, 1, -1, 1, -1, 1$.

$d_i^2 = 1, 1, 1, 1, 1, 1, 1, 1$.

$\sum d_i^2 = 8$. $n = 8$.

$$r_s = 1 - \frac{6 \times 8}{8(64 - 1)} = 1 - \frac{48}{504} \approx 1 - 0.0952 = 0.905$$

There is a strong positive monotonic relationship between the two judges' rankings.


Additional Common Pitfalls

  • Sample vs. population variance. On the GDC, the sample standard deviation function (usually denoted $s_x$ or $\sigma_{n-1}$) divides by $n - 1$. The population standard deviation ($\sigma_x$ or $\sigma_n$) divides by $n$. Using the wrong one changes the answer significantly for small samples.

  • Spearman's formula requires distinct ranks. The formula $r_s = 1 - \dfrac{6\sum d_i^2}{n(n^2-1)}$ assumes no tied ranks. When ties exist, use the full Pearson formula on the ranks instead.

  • Confusing independent with mutually exclusive. Independent events can occur simultaneously ($P(A \cap B) = P(A)P(B) \ne 0$ in general). Mutually exclusive events cannot ($P(A \cap B) = 0$). Independent events with nonzero probability are never mutually exclusive.

  • Misidentifying favourable outcomes. When computing probabilities, carefully define what constitutes a "favourable outcome" and ensure the counting is complete. Overcounting or undercounting is a frequent source of error.

  • Regression interpretation beyond the data. The regression line of $y$ on $x$ gives $\hat{y}$ for a given $x$, but not $\hat{x}$ for a given $y$. Using $y = a + bx$ to predict $x$ from $y$ requires solving for $x$, which gives a different line than the regression of $x$ on $y$.


Exam-Style Problems

Problem 9

The lengths of 10 rods (in cm) are: $12.3, 12.7, 11.9, 12.5, 13.1, 12.0, 12.8, 11.8, 12.4, 12.6$. Calculate the mean, standard deviation, and the percentage of data within one standard deviation of the mean.

Problem 10

In a bag there are 4 red, 5 blue, and 6 green marbles. Three marbles are drawn without replacement. Find the probability that: (a) all three are the same colour; (b) exactly two are blue.

Problem 11

Two events $A$ and $B$ satisfy $P(A) = 0.4$, $P(B) = 0.7$, and $P(A \cap B) = 0.25$. Find: (a) $P(A \mid B)$; (b) $P(A \cup B)$; (c) whether $A$ and $B$ are independent.

Problem 12

A fair coin is tossed 5 times. Find the probability of getting exactly 3 heads.

Problem 13

The bivariate data has $\sum x = 60$, $\sum y = 84$, $\sum x^2 = 440$, $\sum y^2 = 860$, $\sum xy = 580$, and $n = 10$. Find the equation of the regression line of $y$ on $x$ and the coefficient of determination.

Problem 14

How many 4-digit even numbers can be formed from the digits $0, 1, 2, 3, 4, 5$ if no digit may be repeated?

Problem 15

A doctor gives a patient a test for a condition. The test has sensitivity 0.92 and specificity 0.95. If the condition prevalence in the patient's demographic is 0.08, find the probability that the patient actually has the condition given a positive test result.

Answers to Additional Problems

Problem 9: Ordered data: $11.8, 11.9, 12.0, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 13.1$. Mean: $\bar{x} = 124.1/10 = 12.41$. $s^2 = \dfrac{\sum x_i^2 - 10(12.41)^2}{9}$. $\sum x_i^2 = 139.24 + 141.61 + 144.00 + 151.29 + 153.76 + 156.25 + 158.76 + 161.29 + 163.84 + 171.61 = 1541.65$. $s^2 = (1541.65 - 1540.081)/9 = 1.569/9 \approx 0.1743$. $s = \sqrt{0.1743} \approx 0.4175$. Interval within one standard deviation: $12.41 \pm 0.4175 \approx (11.99, 12.83)$. Values in this interval: $12.0, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8$, which is 7 out of 10, or 70%.

Problem 10: Total marbles: 15. Total ways to draw 3: ${15 \choose 3} = 455$. (a) Same colour: ${4 \choose 3} + {5 \choose 3} + {6 \choose 3} = 4 + 10 + 20 = 34$. $P = 34/455 \approx 0.0747$. (b) Exactly two blue: ${5 \choose 2}{10 \choose 1} = 10 \times 10 = 100$. $P = 100/455 \approx 0.2198$.

Problem 11: (a) $P(A \mid B) = P(A \cap B)/P(B) = 0.25/0.70 = 5/14 \approx 0.357$. (b) $P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.4 + 0.7 - 0.25 = 0.85$. (c) Independent iff $P(A \cap B) = P(A)P(B)$. $P(A)P(B) = 0.28 \ne 0.25 = P(A \cap B)$. Not independent.

Problem 12: ${5 \choose 3}(0.5)^3(0.5)^2 = 10 \times \dfrac{1}{32} = \dfrac{5}{16}$.

Problem 13: $S_{xx} = 440 - 10(6)^2 = 440 - 360 = 80$. $S_{xy} = 580 - 10(6)(8.4) = 580 - 504 = 76$. $b = 76/80 = 0.95$. $a = 8.4 - 0.95(6) = 8.4 - 5.7 = 2.7$. Line: $y = 2.7 + 0.95x$. $S_{yy} = 860 - 10(8.4)^2 = 860 - 705.6 = 154.4$. $r = 76/\sqrt{80 \times 154.4} = 76/\sqrt{12352} = 76/111.1 \approx 0.684$. $R^2 = r^2 \approx 0.468$, meaning about 46.8% of the variation in $y$ is explained by the linear model.

Problem 14: For a 4-digit number, the first digit cannot be 0. The last digit must be even (0, 2, 4). Case 1: Last digit is 0. First digit: 5 choices (1–5), middle two: ${}^4P_2 = 12$. Total: $5 \times 12 = 60$. Case 2: Last digit is 2 or 4 (2 choices). First digit: 4 choices (1–5, excluding the one used for the last digit). Middle two: ${}^4P_2 = 12$. Total: $2 \times 4 \times 12 = 96$. Total: $60 + 96 = 156$.

Problem 15: $P(D \mid +) = \dfrac{0.92 \times 0.08}{0.92 \times 0.08 + 0.05 \times 0.92} = \dfrac{0.0736}{0.0736 + 0.046} = \dfrac{0.0736}{0.1196} \approx 0.615$. Despite a positive test, there is only about a 61.5% chance the patient has the condition.


If You Get These Wrong, Revise:

  • Probability fundamentals → Review conditional probability and the laws of probability
  • Algebraic manipulation for summation formulas → Review ./calculus (sections on summation and sigma notation)
  • Quadratic equations and simultaneous equations → Review algebra fundamentals
  • Logarithms for regression transformation → Review exponential and logarithmic functions
  • Set theory and Venn diagrams → Review logic and set theory fundamentals