Statistics — Diagnostic Tests

Unit Tests

Tests edge cases, boundary conditions, and common misconceptions for statistics.

UT-1: Identifying Skew from Quartile Positions

Question:

For a dataset, the quartiles are $Q_1 = 42$, $Q_2 = 55$, and $Q_3 = 70$.

(a) Determine whether the data is positively skewed, negatively skewed, or symmetric.

(b) A student argues: "Since $Q_2 - Q_1 = 13$ and $Q_3 - Q_2 = 15$, the data is positively skewed because $Q_3 - Q_2 > Q_2 - Q_1$." Is this reasoning correct?

(c) If the interquartile range is $IQR = 28$, state the outlier boundaries using the $1.5 \times IQR$ rule.

[Difficulty: hard. Tests interpretation of quartile positions to identify skewness and outlier detection.]

Solution:

(a) The distances from the median are:

  • $Q_2 - Q_1 = 55 - 42 = 13$
  • $Q_3 - Q_2 = 70 - 55 = 15$

Since $Q_3 - Q_2 > Q_2 - Q_1$, the upper half of the central 50% of the data is more spread out than the lower half, which suggests a longer right tail and therefore positive skew.

(b) The student's reasoning is correct in principle: positive skew means the right tail is longer. However, the student should note that this is a heuristic; formal skewness is measured by the moment coefficient $\frac{1}{n}\sum\left(\frac{x_i - \bar{x}}{s}\right)^3$, not just quartile differences. The quartile-based test is a quick check, not definitive proof.

(c) Lower fence: $Q_1 - 1.5 \times IQR = 42 - 42 = 0$. Upper fence: $Q_3 + 1.5 \times IQR = 70 + 42 = 112$.

Outliers are values below $0$ or above $112$.
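The quartile check and the fence calculation above can be reproduced with a minimal Python sketch. The function name `quartile_summary` and its dictionary keys are illustrative assumptions, not part of the original question, and the skew label is only the quartile heuristic described in part (b).

```python
def quartile_summary(q1, q2, q3):
    """Quartile-based skew heuristic and 1.5 x IQR fences (hypothetical helper)."""
    lower_gap = q2 - q1   # distance from Q1 up to the median
    upper_gap = q3 - q2   # distance from the median up to Q3
    iqr = q3 - q1

    # Heuristic only: compares the two halves of the central 50% of the data.
    if upper_gap > lower_gap:
        shape = "positive skew"
    elif upper_gap < lower_gap:
        shape = "negative skew"
    else:
        shape = "symmetric"

    return {
        "shape": shape,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


summary = quartile_summary(42, 55, 70)
print(summary)
```

With the quartiles from the question, the sketch reports an IQR of 28 and fences at 0 and 112, matching the working in part (c).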


UT-2: Mean and Standard Deviation with Coded Data

Question:

A dataset has the following coded values. The coding is $y = \frac{x - 50}{10}$:

$$\sum y = 45, \quad \sum y^2 = 285, \quad n = 9$$

(a) Find $\bar{x}$ and $s_x$ (the standard deviation of $x$).

(b) A student computes $s_y = \sqrt{\frac{285}{9} - 25} = \sqrt{31.67 - 25} = \sqrt{6.67}$ and concludes $s_x = s_y$. Explain why this is wrong.

[Difficulty: hard. Tests coded data transformations and the effect on mean and standard deviation.]

Solution:

(a) $\bar{y} = \frac{45}{9} = 5$.

Since $y = \frac{x - 50}{10}$, we have $x = 10y + 50$:

$$\bar{x} = 10\bar{y} + 50 = 10(5) + 50 = 100$$

For the standard deviation: $s_x = 10s_y$.

$$s_y = \sqrt{\frac{\sum y^2}{n} - \bar{y}^2} = \sqrt{\frac{285}{9} - 25} = \sqrt{31.67 - 25} = \sqrt{6.67} \approx 2.58$$

$$s_x = 10 \times 2.58 = 25.8$$

(b) The student's error is concluding $s_x = s_y$. The coding $y = \frac{x - 50}{10}$ subtracts $50$ and then divides by $10$. Shifting by a constant leaves the standard deviation unchanged, but scaling by $c$ multiplies it by $|c|$, so $s_y = \frac{1}{10}s_x$ and therefore $s_x = 10s_y$, not $s_y$. The student's arithmetic, $\frac{285}{9} \approx 31.67$ minus $25 = 5^2$, is correct for computing $s_y$; the mistake is applying that result to $s_x$ without undoing the scaling.
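The decoding steps can be checked numerically with a short Python sketch. The helper `decode_stats` and its parameter names `shift` and `scale` are assumptions made for illustration; it assumes a coding of the form $y = \frac{x - \text{shift}}{\text{scale}}$ and the population form of the standard deviation used in the solution.

```python
import math


def decode_stats(sum_y, sum_y2, n, shift=50, scale=10):
    """Recover the mean and SD of x from coded sums, assuming y = (x - shift) / scale."""
    mean_y = sum_y / n
    # Population variance of y: (sum of y^2) / n minus the squared mean.
    sd_y = math.sqrt(sum_y2 / n - mean_y ** 2)

    mean_x = scale * mean_y + shift   # the mean is both scaled and shifted
    sd_x = abs(scale) * sd_y          # the SD is scaled only; the shift has no effect
    return mean_x, sd_x, sd_y


mean_x, sd_x, sd_y = decode_stats(45, 285, 9)
print(f"mean_x = {mean_x:.1f}, s_y = {sd_y:.2f}, s_x = {sd_x:.1f}")
```

Keeping the shift out of the `sd_x` line makes the student's mistake in part (b) easy to see: only the factor of 10 separates $s_x$ from $s_y$.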


Integration Tests

Tests synthesis of statistics with other topics.

IT-1: Least Squares Regression and Summation (with Algebra)

Question:

Given five data points $(x_i, y_i)$ with $\sum x_i = 15$, $\sum y_i = 20$, $\sum x_i^2 = 55$, $\sum x_i y_i = 68$, and $\sum y_i^2 = 90$:

(a) Find the equation of the least squares regression line of $y$ on $x$ in the form $y = a + bx$.

(b) Find the PMCC (Pearson product-moment correlation coefficient).

(c) Predict the value of $y$ when $x = 5$.

[Difficulty: hard. Combines regression computation with correlation analysis.]

Solution:

(a)

$$b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2} = \frac{5(68) - 15(20)}{5(55) - 225} = \frac{340 - 300}{275 - 225} = \frac{40}{50} = 0.8$$

$$a = \bar{y} - b\bar{x} = \frac{20}{5} - 0.8 \times \frac{15}{5} = 4 - 2.4 = 1.6$$

Regression line: $y = 1.6 + 0.8x$.

(b)

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\big[n\sum x_i^2 - (\sum x_i)^2\big]\big[n\sum y_i^2 - (\sum y_i)^2\big]}}$$

$$= \frac{340 - 300}{\sqrt{(275 - 225)(450 - 400)}} = \frac{40}{\sqrt{50 \times 50}} = \frac{40}{50} = 0.8$$

(c) When $x = 5$: $y = 1.6 + 0.8(5) = 1.6 + 4 = 5.6$.
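The three parts of this solution can be verified from the summary statistics alone. The following Python sketch is one possible check; the function name `regression_from_sums` and its argument names are hypothetical, chosen only to mirror the sums given in the question.

```python
import math


def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy, sum_y2):
    """Least squares line y = a + b*x and PMCC r from summary sums (hypothetical helper)."""
    sxx = n * sum_x2 - sum_x ** 2      # n * sum(x^2) - (sum x)^2
    syy = n * sum_y2 - sum_y ** 2      # n * sum(y^2) - (sum y)^2
    sxy = n * sum_xy - sum_x * sum_y   # n * sum(xy) - (sum x)(sum y)

    b = sxy / sxx                        # gradient
    a = sum_y / n - b * sum_x / n        # intercept: y-bar minus b times x-bar
    r = sxy / math.sqrt(sxx * syy)       # product-moment correlation coefficient
    return a, b, r


a, b, r = regression_from_sums(5, 15, 20, 55, 68, 90)
prediction = a + b * 5
print(f"a = {a:.2f}, b = {b:.2f}, r = {r:.2f}, y(5) = {prediction:.2f}")
```

Here $b$ and $r$ coincide because the two bracketed terms in the denominator of $r$ are both $50$ for this dataset; in general they differ.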