# A Review of Statistics using R {#arosur}
This section reviews important statistical concepts:
- Estimation of unknown population parameters
- Hypothesis testing
- Confidence intervals
These methods are heavily used in econometrics. We will discuss them in the simple context of inference about an unknown population mean and present several applications in `r ttcode("R")`. These `r ttcode("R")` applications rely on the following packages, which are not part of the base version of `r ttcode("R")`:
+ `r ttcode("readxl")` - allows to import data from *Excel* to `r ttcode("R")`.
+ `r ttcode("dplyr")` - provides a flexible grammar for data manipulation.
+ `r ttcode("MASS")` - a collection of functions for applied statistics.
Make sure these are installed before you go ahead and try to replicate the examples. The safest way to do so is by checking whether the following code chunk executes without any errors.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
library(dplyr)
library(MASS)
library(readxl)
```
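If any of these calls fails, the corresponding package is not yet installed. A minimal way to install all three from CRAN (not evaluated here) is the following sketch:
```{r, eval=FALSE}
# install the required packages (only needed once per R installation)
install.packages(c("dplyr", "MASS", "readxl"))
```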
## Estimation of the Population Mean
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC3.1">
<h3 class = "right"> Key Concept 3.1 </h3>
<h3 class = "left"> Estimators and Estimates </h3>
*Estimators* are functions of sample data drawn from an unknown population. *Estimates* are numeric values computed by estimators based on the sample data. Estimators are random variables because they are functions of *random* data. Estimates are nonrandom numbers.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Estimators and Estimates]{3.1}
\\textit{Estimators} are functions of sample data that are drawn randomly from an unknown population. \\textit{Estimates} are numeric values computed by estimators based on the sample data. Estimators are random variables because they are functions of \\textit{random} data. Estimates are nonrandom numbers.
\\end{keyconcepts}')
```
Think of some economic variable, for example hourly earnings of college graduates, denoted by $Y$. Suppose we are interested in $\mu_Y$, the mean of $Y$. In order to calculate $\mu_Y$ exactly we would have to interview every working graduate in the economy. We simply cannot do this due to time and cost constraints. However, we can draw a random sample of $n$ i.i.d. observations $Y_1, \dots, Y_n$ and estimate $\mu_Y$ using one of the simplest estimators (in the sense of Key Concept 3.1) one can think of, namely
$$ \overline{Y} = \frac{1}{n} \sum_{i=1}^n Y_i, $$
the sample mean of $Y$. Then again, we could use an even simpler estimator for $\mu_Y$: the very first observation in the sample, $Y_1$. Is $Y_1$ a good estimator? For now, assume that
$$ Y \sim \chi_{12}^2, $$
which is not too unreasonable as hourly income is non-negative and we expect many hourly earnings to be in a range of $5$ Euro to $15$ Euro. Moreover, it is common for income distributions to be skewed to the right --- a property of the $\chi^2_{12}$ distribution.
```{r, fig.align='center'}
# plot the chi_12^2 distribution
curve(dchisq(x, df=12),
from = 0,
to = 40,
ylab = "Density",
xlab = "Hourly earnings in Euro")
```
We now draw a sample of $n=100$ observations and take the first observation $Y_1$ as an estimate for $\mu_Y$.
```{r, fig.align='center'}
# set seed for reproducibility
set.seed(1)
# sample from the chi_12^2 distribution, use only the first observation
rsamp <- rchisq(n = 100, df = 12)
rsamp[1]
```
The estimate $8.26$ is not too far away from $\mu_Y = 12$ but it is somewhat intuitive that we could do better: the estimator $Y_1$ discards a lot of information and its variance is the population variance:
$$ \text{Var}(Y_1) = \text{Var}(Y) = 2 \cdot 12 = 24. $$
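A small simulation sketch (assuming the $\chi_{12}^2$ model from above; the number of replications and the seed are arbitrary) illustrates the point: across repeated samples, the first observation varies with variance close to $24$, while the sample mean of $n=100$ observations varies far less. We save and restore the state of the random number generator so that the subsequent chunks are unaffected.
```{r}
# sketch: variability of Y_1 vs. the sample mean across repeated samples
rng_state <- .Random.seed   # preserve the RNG state for the chunks below
set.seed(123)
first_obs   <- replicate(10000, rchisq(n = 100, df = 12)[1])
sample_mean <- replicate(10000, mean(rchisq(n = 100, df = 12)))
var(first_obs)    # close to the population variance 2 * 12 = 24
var(sample_mean)  # much smaller: roughly 24 / 100
.Random.seed <- rng_state   # restore the RNG state
```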
This brings us to the following question: What is a *good* estimator of an unknown parameter in the first place? This question is tackled in Key Concepts 3.2 and 3.3.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC3.2">
<h3 class = "right"> Key Concept 3.2 </h3>
<h3 class = "left"> Bias, Consistency and Efficiency </h3>
Desirable characteristics of an estimator include unbiasedness, consistency and efficiency.
**Unbiasedness:**
If the mean of the sampling distribution of some estimator $\\hat\\mu_Y$ for the population mean $\\mu_Y$ equals $\\mu_Y$,
$$ E(\\hat\\mu_Y) = \\mu_Y, $$
the estimator is unbiased for $\\mu_Y$. The *bias* of $\\hat\\mu_Y$ then is $0$:
$$ E(\\hat\\mu_Y) - \\mu_Y = 0. $$
**Consistency:**
We want the uncertainty of the estimator $\\hat\\mu_Y$ to decrease as the number of observations in the sample grows. More precisely, we want the probability that the estimate $\\hat\\mu_Y$ falls within a small interval around the true value $\\mu_Y$ to get increasingly closer to $1$ as $n$ grows. We write this as
$$ \\hat\\mu_Y \\xrightarrow{p} \\mu_Y. $$
**Variance and efficiency:**
We want the estimator to be efficient. Suppose we have two estimators, $\\hat\\mu_Y$ and $\\overset{\\sim}{\\mu}_Y$ and for some given sample size $n$ it holds that
$$ E(\\hat\\mu_Y) = E(\\overset{\\sim}{\\mu}_Y) = \\mu_Y,$$
but
$$\\text{Var}(\\hat\\mu_Y) < \\text{Var}(\\overset{\\sim}{\\mu}_Y).$$
We then prefer to use $\\hat\\mu_Y$ as it has a lower variance than $\\overset{\\sim}{\\mu}_Y$, meaning that $\\hat\\mu_Y$ is more *efficient* in using the information provided by the observations in the sample.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Bias\\comma Consistency and Efficiency]{3.2}
Desirable characteristics of an estimator include unbiasedness, consistency and efficiency.\\newline
\\textbf{Unbiasedness:}
If the mean of the sampling distribution of some estimator $\\hat\\mu_Y$ for the population mean $\\mu_Y$ equals $\\mu_Y$,
$$ E(\\hat\\mu_Y) = \\mu_Y, $$
the estimator is unbiased for $\\mu_Y$. The \\textit{bias} of $\\hat\\mu_Y$ then is $0$:
$$ E(\\hat\\mu_Y) - \\mu_Y = 0.$$
\\textbf{Consistency:}
We want the uncertainty of the estimator $\\hat\\mu_Y$ to decrease as the number of observations in the sample grows. More precisely, we want the probability that the estimate $\\hat\\mu_Y$ falls within a small interval around the true value $\\mu_Y$ to get increasingly closer to $1$ as $n$ grows. We write this as
$$ \\hat\\mu_Y \\xrightarrow{p} \\mu_Y. $$
\\textbf{Variance and efficiency:}
We want the estimator to be efficient. Suppose we have two estimators, $\\hat\\mu_Y$ and $\\overset{\\sim}{\\mu}_Y$ and for some given sample size $n$ it holds that
$$ E(\\hat\\mu_Y) = E(\\overset{\\sim}{\\mu}_Y) = \\mu_Y ,$$
but
$$\\text{Var}(\\hat\\mu_Y) < \\text{Var}(\\overset{\\sim}{\\mu}_Y).$$
We then prefer to use $\\hat\\mu_Y$ as it has a lower variance than $\\overset{\\sim}{\\mu}_Y$, meaning that $\\hat\\mu_Y$ is more \\textit{efficient} in using the information provided by the observations in the sample.
\\end{keyconcepts}')
```
## Properties of the Sample Mean {#potsm}
```{block2, consistency, type='rmdknit'}
A more precise way to express consistency of an estimator $\hat\mu$ for a parameter $\mu$ is
$$ P(|\hat{\mu} - \mu|<\epsilon) \xrightarrow[n \rightarrow \infty]{p} 1 \quad \text{for any}\quad\epsilon>0.$$
This expression says that the probability of observing a deviation from the true value $\mu$ that is smaller than some arbitrary $\epsilon > 0$ converges to $1$ as $n$ grows. Consistency does not require unbiasedness.
```
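To make this definition concrete, the following sketch (with an arbitrary $\epsilon = 0.1$ and draws from a $\mathcal{N}(10,1)$ distribution, matching the population simulated below) approximates $P(|\overline{Y} - \mu_Y| < \epsilon)$ for growing sample sizes. Again, the RNG state is saved and restored so that later chunks are unaffected.
```{r}
# sketch: P(|Ybar - mu| < eps) approaches 1 as n increases
rng_state <- .Random.seed
set.seed(123)
eps <- 0.1
for (n in c(5, 25, 100, 1000)) {
  dev <- replicate(5000, abs(mean(rnorm(n, mean = 10, sd = 1)) - 10))
  cat("n =", n, ": share of estimates within eps of mu =", mean(dev < eps), "\n")
}
.Random.seed <- rng_state
```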
To examine properties of the sample mean as an estimator for the corresponding population mean, consider the following `r ttcode("R")` example.
We generate a population, denoted as `r ttcode("pop")`, consisting of observations $Y_i$, where $i=1,\dots,10000$. These observations are generated from a normal distribution with mean $\mu = 10$ and variance $\sigma^2 = 1$.
To investigate the behavior of the estimator $\hat{\mu} = \bar{Y}$, we can draw random samples from this population and calculate $\bar{Y}$ for each of them. This is easily done by making use of the function `r ttcode("replicate()")`. The argument `r ttcode("expr")` is evaluated `r ttcode("n")` times. In this case we draw samples of sizes $n=5$ and $n=25$, compute the sample means and repeat this exactly $N=25000$ times.
For comparison purposes we store results for the estimator $Y_1$, the first observation in a sample of size $5$, separately.
```{r, echo = T, eval = T, message = F, warning = F}
# generate a fictitious population
pop <- rnorm(10000, 10, 1)
# sample from the population and estimate the mean
est1 <- replicate(expr = mean(sample(x = pop, size = 5)), n = 25000)
est2 <- replicate(expr = mean(sample(x = pop, size = 25)), n = 25000)
fo <- replicate(expr = sample(x = pop, size = 5)[1], n = 25000)
```
Check that `r ttcode("est1")` and `r ttcode("est2")` are vectors of length $25000$:
```{r}
# check if object type is vector
is.vector(est1)
is.vector(est2)
# check length
length(est1)
length(est2)
```
The code chunk below produces a plot of the sampling distributions of the estimators $\bar{Y}$ and $Y_1$ on the basis of the $25000$ samples in each case. We also plot the density function of the $\mathcal{N}(10,1)$ distribution.
```{r, echo = T, eval = T, message = F, warning = F, fig.align='center'}
# plot density estimate Y_1
plot(density(fo),
col = "green",
lwd = 2,
ylim = c(0, 2),
xlab = "Estimates",
main = "Sampling Distributions of Unbiased Estimators")
# add density estimate for the distribution of the sample mean with n=5 to the plot
lines(density(est1),
col = "steelblue",
lwd = 2)
# add density estimate for the distribution of the sample mean with n=25 to the plot
lines(density(est2),
col = "red2",
lwd = 2)
# add a vertical line at the true parameter
abline(v = 10, lty = 2)
# add N(10,1) density to the plot
curve(dnorm(x, mean = 10),
lwd = 2,
lty = 2,
add = T)
# add a legend
legend("topleft",
legend = c("N(10,1)",
expression(Y[n == 1]),
expression(bar(Y)[n == 5]),
expression(bar(Y)[n == 25])
),
lty = c(2, 1, 1, 1),
col = c("black","green", "steelblue", "red2"),
lwd = 2)
```
First, *all* sampling distributions (represented by the solid lines) are centered around $\mu = 10$. This is evidence for the *unbiasedness* of $Y_1$, $\overline{Y}_{5}$ and $\overline{Y}_{25}$. Of course, the theoretical density $\mathcal{N}(10,1)$ is also centered at $10$.
Next, have a look at the spread of the sampling distributions. Several things are noteworthy:
- The sampling distribution of $Y_1$ (green curve) tracks the density of the $\mathcal{N}(10,1)$ distribution (black dashed line) pretty closely. In fact, the sampling distribution of $Y_1$ *is* the $\mathcal{N}(10,1)$ distribution. This is less surprising if you keep in mind that the $Y_1$ estimator does nothing but report an observation that is randomly selected from a population with $\mathcal{N}(10,1)$ distribution. Hence, $Y_1 \sim \mathcal{N}(10,1)$. Note that this result does not depend on the sample size $n$: the sampling distribution of $Y_1$ *is always* the population distribution, no matter how large the sample is. $Y_1$ is a good estimator of $\mu_Y$, but we can do better.
- Both sampling distributions of $\overline{Y}$ show less dispersion than the sampling distribution of $Y_1$. This means that $\overline{Y}$ has a lower variance than $Y_1$. In view of Key Concepts 3.2 and 3.3, we find that $\overline{Y}$ is a more efficient estimator than $Y_1$. In fact, this holds for all $n>1$.
- $\overline{Y}$ shows a behavior illustrating consistency (see Key Concept 3.2). The blue and the red densities are much more concentrated around $\mu=10$ than the green one. As the number of observations is increased from $1$ to $5$, the sampling distribution tightens around the true parameter. By increasing the sample size to $25$, this effect becomes more apparent. This implies that the probability of obtaining estimates that are close to the true value increases with $n$. This is also reflected by the estimated values of the density function close to 10: the larger the sample size, the larger the value of the density.
We encourage you to go ahead and modify the code. Try out different values for the sample size and see how the sampling distribution of $\overline{Y}$ changes!
#### $\overline{Y}$ is the Least Squares Estimator of $\mu_Y$ {-}
Assume you have some observations $Y_1,\dots,Y_n$ on $Y \sim \mathcal{N}(10,1)$ (which is unknown) and would like to find an estimator $m$ that predicts the observations as well as possible. Here, we aim to find an estimator $m$ that results in a small total squared deviation between the predicted and observed values. Mathematically, this means we want to find an $m$ that minimizes
\begin{equation}
\sum_{i=1}^n (Y_i - m)^2. (\#eq:sqm)
\end{equation}
Think of $Y_i - m$ as the mistake made when predicting $Y_i$ by $m$. We could also minimize the sum of absolute deviations from $m$ but minimizing the sum of squared deviations is mathematically more convenient (and will lead to a different result). That is why the estimator we are looking for is called the *least squares estimator*. $m = \overline{Y}$, the sample mean, is this estimator.
We can show this by generating a random sample and plotting \@ref(eq:sqm) as a function of $m$.
```{r, fig.align='center'}
# define the function and vectorize it
sqm <- function(m) {
sum((y-m)^2)
}
sqm <- Vectorize(sqm)
# draw random sample and compute the mean
y <- rnorm(100, 10, 1)
mean(y)
```
```{r, fig.align='center'}
# plot the objective function
curve(sqm(x),
from = -50,
to = 70,
xlab = "m",
ylab = "sqm(m)")
# add vertical line at mean(y)
abline(v = mean(y),
lty = 2,
col = "darkred")
# add annotation at mean(y)
text(x = mean(y),
y = 0,
labels = paste(round(mean(y), 2)))
```
Notice that \@ref(eq:sqm) is a quadratic function so that there is only one minimum. The plot shows that this minimum lies exactly at the sample mean of the sample data.
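As a cross-check, a numerical optimizer should locate (virtually) the same minimizer. A brief sketch using `r ttcode("optimize()")` over an arbitrary search interval:
```{r}
# numerically minimize sqm() and compare the result to the sample mean
optimize(sqm, interval = c(-50, 70))$minimum
mean(y)
```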
```{block2, vecfunction, type='rmdknit'}
Some <tt>R</tt> functions can only interact with functions that take a vector as an input and evaluate the function body on every entry of the vector, for example <tt>curve()</tt>. We call such functions vectorized functions, and it is often a good idea to write vectorized functions yourself, although this is cumbersome in some cases. Having a vectorized function in <tt>R</tt> is never a drawback since these functions work on both single values and vectors.
Let us look at the function <tt>sqm()</tt>, which is non-vectorized:
<tt>
sqm <- function(m) {
sum((y-m)^2) #body of the function
}
</tt>
Providing, e.g., <tt>c(1,2,3)</tt> as the argument <tt>m</tt> would not produce the element-wise results we are after: since the vectors <tt>y</tt> and <tt>m</tt> have incompatible lengths, <tt>y-m</tt> recycles <tt>m</tt> (with a warning) and <tt>sqm()</tt> returns a single number instead of one value per entry of <tt>m</tt>. This is why we cannot use <tt>sqm()</tt> in conjunction with <tt>curve()</tt>, which expects one function value for each point it evaluates.
Here <tt>Vectorize()</tt> comes into play. It generates a vectorized version of a non-vectorized function.
```
#### Why is Random Sampling Important? {-}
So far, we assumed (sometimes implicitly) that the observed data $Y_1, \dots, Y_n$ are the result of a sampling process that satisfies the assumption of simple random sampling. This assumption often is fulfilled when estimating a population mean using $\overline{Y}$. If this is not the case, estimates may be biased.
Let us fall back to `r ttcode("pop")`, the fictitious population of $10000$ observations and compute the population mean $\mu_{\texttt{pop}}$:
```{r}
# compute the population mean of pop
mean(pop)
```
Next we sample $25$ observations from `r ttcode("pop")` with `r ttcode("sample()")` and estimate $\mu_{\texttt{pop}}$ using $\overline{Y}$ repeatedly. However, now we use a sampling scheme that deviates from simple random sampling: instead of ensuring that each member of the population has the same chance to end up in a sample, we assign a higher probability of being sampled to the $2500$ smallest observations of the population by setting the argument `r ttcode("prob")` to a suitable vector of probability weights:
```{r}
# simulate outcomes for the sample mean when the i.i.d. assumption fails
est3 <- replicate(n = 25000,
expr = mean(sample(x = sort(pop),
size = 25,
prob = c(rep(4, 2500), rep(1, 7500)))))
# compute the sample mean of the outcomes
mean(est3)
```
Next we plot the sampling distribution of $\overline{Y}$ for this non-i.i.d. case and compare it to the sampling distribution when the i.i.d. assumption holds.
```{r, fig.align='center'}
# sampling distribution of sample mean, i.i.d. holds, n=25
plot(density(est2),
col = "steelblue",
lwd = 2,
xlim = c(8, 11),
xlab = "Estimates",
main = "When the i.i.d. Assumption Fails")
# sampling distribution of sample mean, i.i.d. fails, n=25
lines(density(est3),
col = "red2",
lwd = 2)
# add a legend
legend("topleft",
legend = c(expression(bar(Y)[n == 25]~", i.i.d. fails"),
expression(bar(Y)[n == 25]~", i.i.d. holds")
),
lty = c(1, 1),
col = c("red2", "steelblue"),
lwd = 2)
```
Here, the failure of the i.i.d. assumption implies that, on average, we *underestimate* $\mu_Y$ using $\overline{Y}$: the corresponding distribution of $\overline{Y}$ is shifted to the left. In other words, $\overline{Y}$ is a *biased* estimator for $\mu_Y$ if the i.i.d. assumption does not hold.
## Hypothesis Tests concerning the Population Mean
In this section we briefly review concepts in hypothesis testing and discuss how to conduct hypothesis tests in `r ttcode("R")`. We focus on drawing inferences about an unknown population mean.
#### About Hypotheses and Hypothesis Testing {-}
In a significance test we want to exploit the information contained in a sample as evidence in favor of or against a hypothesis. Essentially, hypotheses are simple questions that can be answered by 'yes' or 'no'. In a hypothesis test we typically deal with two different hypotheses:
- The *null hypothesis*, denoted by $H_0$, is the hypothesis we are interested in testing.
- There must be an *alternative hypothesis*, denoted by $H_1$, the hypothesis that is thought to hold if the null hypothesis is rejected.
The null hypothesis that the population mean of $Y$ equals the value $\mu_{Y,0}$ is written as
$$ H_0: E(Y) = \mu_{Y,0}. $$
Often the alternative hypothesis is chosen such that it is the most general one,
$$ H_1: E(Y) \neq \mu_{Y,0}, $$
meaning that $E(Y)$ may be anything but the value under the null hypothesis. This is called a *two-sided* alternative.
For the sake of brevity, we only consider two-sided alternatives in the subsequent sections of this chapter.
### The p-Value {-}
Assume that the null hypothesis is *true*. The $p$-value is the probability of drawing data and observing a corresponding test statistic that is at least as adverse to what is stated under the null hypothesis as the test statistic actually computed using the sample data.
In the context of the population mean and the sample mean, this definition can be stated mathematically in the following way:
\begin{equation}
p \text{-value} = P_{H_0}\left[ \lvert \overline{Y} - \mu_{Y,0} \rvert > \lvert \overline{Y}^{act} - \mu_{Y,0} \rvert \right]. (\#eq:pvalue)
\end{equation}
In \@ref(eq:pvalue), $\overline{Y}^{act}$ is the sample mean for the data at hand (a value). In order to compute the $p$-value as in \@ref(eq:pvalue), knowledge about the sampling distribution of $\overline{Y}$ (a random variable) when the null hypothesis is true (the *null distribution*) is required. However, in most cases the sampling distribution and thus the null distribution of $\overline{Y}$ are unknown. Fortunately, the CLT (see Key Concept 2.7) allows for the large-sample approximation $$ \overline{Y} \approx \mathcal{N}(\mu_{Y,0}, \, \sigma^2_{\overline{Y}}) \ \ , \ \ \sigma^2_{\overline{Y}} = \frac{\sigma_Y^2}{n}, $$ assuming the null hypothesis $H_0: E(Y) = \mu_{Y, 0}$ is true. With some algebra it follows for large $n$ that
$$ \frac{\overline{Y} - \mu_{Y,0}}{\sigma_Y/\sqrt{n}} \sim \mathcal{N}(0,1). $$
So in large samples, the $p$-value can be computed *without* knowledge of the exact sampling distribution of $\overline{Y}$ using the above normal approximation.
### Calculating the p-Value when the Standard Deviation is Known {-}
For now, let us assume that $\sigma_{\overline{Y}}$ is known. Then, we can rewrite \@ref(eq:pvalue) as
\begin{align}
p \text{-value} =& \, P_{H_0}\left[ \left\lvert \frac{\overline{Y} - \mu_{Y,0}}{\sigma_{\overline{Y}}} \right\rvert > \left\lvert \frac{\overline{Y}^{act} - \mu_{Y,0}}{\sigma_{\overline{Y}}} \right\rvert \right] \\
=& \, 2 \cdot \Phi \left[ - \left\lvert \frac{\overline{Y}^{act} - \mu_{Y,0}}{\sigma_{\overline{Y}}} \right\rvert\right]. (\#eq:pvaluenorm1)
\end{align}
The $p$-value is the area in the tails of the $\mathcal{N}(0,1)$ distribution that lies beyond
\begin{equation}
\pm \left\lvert \frac{\overline{Y}^{act} - \mu_{Y,0}}{\sigma_{\overline{Y}}} \right\rvert. (\#eq:pvaluenorm2)
\end{equation}
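For instance, if the standardized deviation $\left\lvert (\overline{Y}^{act} - \mu_{Y,0})/\sigma_{\overline{Y}} \right\rvert$ happened to equal $1.5$ (a hypothetical value, matching the tick marks in the figure below), the $p$-value would be the combined area in both tails:
```{r}
# p-value for a hypothetical standardized deviation of 1.5
2 * pnorm(-1.5)
```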
We now use `r ttcode("R")` to visualize what is stated in \@ref(eq:pvaluenorm1) and \@ref(eq:pvaluenorm2). The next code chunk replicates Figure 3.1 of the book.
```{r, fig.align='center'}
# plot the standard normal density on the interval [-4,4]
curve(dnorm(x),
xlim = c(-4, 4),
main = "Calculating a p-Value",
yaxs = "i",
xlab = "z",
ylab = "",
lwd = 2,
axes = "F")
# add x-axis
axis(1,
at = c(-1.5, 0, 1.5),
padj = 0.75,
labels = c(expression(-frac(bar(Y)^"act"~-~mu["Y,0"], sigma[bar(Y)])),
0,
expression(frac(bar(Y)^"act"~-~mu["Y,0"], sigma[bar(Y)]))))
# shade p-value/2 region in left tail
polygon(x = c(-6, seq(-6, -1.5, 0.01), -1.5),
y = c(0, dnorm(seq(-6, -1.5, 0.01)),0),
col = "steelblue")
# shade p-value/2 region in right tail
polygon(x = c(1.5, seq(1.5, 6, 0.01), 6),
y = c(0, dnorm(seq(1.5, 6, 0.01)), 0),
col = "steelblue")
```
### Sample Variance, Sample Standard Deviation and Standard Error {- #SVSSDASE}
If $\sigma^2_Y$ is unknown, it must be estimated. This can be done using the sample variance
\begin{equation}
s_Y^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \overline{Y})^2.
\end{equation}
Furthermore
\begin{equation}
s_Y = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (Y_i - \overline{Y})^2}
\end{equation}
is a suitable estimator for the standard deviation of $Y$. In `r ttcode("R")`, $s_Y$ is implemented in the function `r ttcode("sd()")`, see `?sd`.
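As a quick sanity check, `r ttcode("sd()")` indeed matches the formula above. The sketch below reuses the vector `r ttcode("y")` generated earlier:
```{r}
# compare sd() with the explicit formula for s_Y
sd(y)
sqrt(sum((y - mean(y))^2) / (length(y) - 1))
```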
Using `r ttcode("R")` we can illustrate that $s_Y$ is a consistent estimator for $\sigma_Y$, that is
$$ s_Y \overset{p}{\longrightarrow} \sigma_Y. $$
The idea here is to generate a large number of samples $Y_1,\dots,Y_n$ where $Y\sim \mathcal{N}(10, 9)$, say, estimate $\sigma_Y$ using $s_Y$, and investigate how the distribution of $s_Y$ changes as $n$ gets larger.
```{r, fig.align='center'}
# vector of sample sizes
n <- c(10000, 5000, 2000, 1000, 500)
# sample observations, estimate using 'sd()' and plot the estimated distributions
sq_y <- replicate(n = 10000, expr = sd(rnorm(n[1], 10, 3)))
plot(density(sq_y),
main = expression("Sampling Distributions of" ~ s[Y]),
xlab = expression(s[y]),
lwd = 2)
for (i in 2:length(n)) {
sq_y <- replicate(n = 10000, expr = sd(rnorm(n[i], 10, 3)))
lines(density(sq_y),
col = i,
lwd = 2)
}
# add a legend
legend("topleft",
legend = c(expression(n == 10000),
expression(n == 5000),
expression(n == 2000),
expression(n == 1000),
expression(n == 500)),
col = 1:5,
lwd = 2)
```
The plot shows that the distribution of $s_Y$ tightens around the true value $\sigma_Y = 3$ as $n$ increases.
An estimator of the standard deviation of an estimator is called the *standard error of the estimator*. Key Concept 3.4 summarizes the terminology in the context of the sample mean.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC3.4">
<h3 class = "right"> Key Concept 3.4 </h3>
<h3 class = "left"> The Standard Error of $\\overline{Y}$ </h3>
Take an i.i.d. sample $Y_1, \\dots, Y_n$. The mean of $Y$ is consistently estimated by $\\overline{Y}$, the sample mean of the $Y_i$. Since $\\overline{Y}$ is a random variable, it has a sampling distribution with variance $\\frac{\\sigma_Y^2}{n}$.
The standard error of $\\overline{Y}$, denoted by $SE(\\overline{Y})$, is an estimator of the standard deviation of $\\overline{Y}$:
$$ SE(\\overline{Y}) = \\hat\\sigma_{\\overline{Y}} = \\frac{s_Y}{\\sqrt{n}}. $$
The caret (^) over $\\sigma$ indicates that $\\hat\\sigma_{\\overline{Y}}$ is an estimator for $\\sigma_{\\overline{Y}}$.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The Standard Error of $\\overline{Y}$]{3.4}
Take an i.i.d. sample $Y_1, \\dots, Y_n$. The mean of $Y$ is consistently estimated by $\\overline{Y}$, the sample mean of the $Y_i$. Since $\\overline{Y}$ is a random variable, it has a sampling distribution with variance $\\frac{\\sigma_Y^2}{n}$.
The standard error of $\\overline{Y}$, denoted by $SE(\\overline{Y})$, is an estimator of the standard deviation of $\\overline{Y}$:
$$ SE(\\overline{Y}) = \\hat\\sigma_{\\overline{Y}} = \\frac{s_Y}{\\sqrt{n}}. $$
The caret (\\string^) over $\\sigma$ indicates that $\\hat\\sigma_{\\overline{Y}}$ is an estimator for $\\sigma_{\\overline{Y}}$.
\\end{keyconcepts}
')
```
As an example to underpin Key Concept 3.4, consider a sample of $n=100$ i.i.d. observations of the Bernoulli distributed variable $Y$ with success probability $p=0.1$. Thus $E(Y)=p=0.1$ and $\text{Var}(Y)=p(1-p)$. $E(Y)$ can be estimated by $\overline{Y}$, which then has variance
$$ \sigma^2_{\overline{Y}} = p(1-p)/n = 0.0009 $$
and standard deviation
$$ \sigma_{\overline{Y}} = \sqrt{p(1-p)/n} = 0.03. $$
In this case the standard error of $\overline{Y}$ can be estimated by
$$ SE(\overline{Y}) = \sqrt{\overline{Y}(1-\overline{Y})/n}. $$
Let us check whether $\overline{Y}$ and $SE(\overline{Y})$ estimate the respective true values, on average.
```{r}
# draw 10000 samples of size 100 and estimate the mean of Y and
# estimate the standard error of the sample mean
mean_estimates <- numeric(10000)
se_estimates <- numeric(10000)
for (i in 1:10000) {
s <- sample(0:1,
size = 100,
prob = c(0.9, 0.1),
replace = T)
mean_estimates[i] <- mean(s)
se_estimates[i] <- sqrt(mean(s) * (1 - mean(s)) / 100)
}
mean(mean_estimates)
mean(se_estimates)
```
Both estimators seem to be unbiased for the true parameters. In fact, this is true for the sample mean, but not for $SE(\overline{Y})$. However, both estimators are *consistent* for the true parameters.
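The following sketch (same Bernoulli setup, arbitrary replication count) illustrates the consistency of $SE(\overline{Y})$: its average moves closer to the true $\sigma_{\overline{Y}} = \sqrt{p(1-p)/n}$ as $n$ grows. The RNG state is saved and restored so that the next chunks are unaffected.
```{r}
# sketch: the average estimated SE approaches the true SE as n grows
rng_state <- .Random.seed
set.seed(123)
for (n_obs in c(25, 100, 1000)) {
  se_hat <- replicate(5000, {
    s <- sample(0:1, size = n_obs, prob = c(0.9, 0.1), replace = TRUE)
    sqrt(mean(s) * (1 - mean(s)) / n_obs)
  })
  cat("n =", n_obs,
      ": mean SE estimate =", round(mean(se_hat), 5),
      ", true SE =", round(sqrt(0.1 * 0.9 / n_obs), 5), "\n")
}
.Random.seed <- rng_state
```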
### Calculating the p-value When the Standard Deviation is Unknown {-}
When $\sigma_Y$ is unknown, the $p$-value for a hypothesis test concerning $\mu_Y$ using $\overline{Y}$ can be computed by replacing $\sigma_{\overline{Y}}$ in \@ref(eq:pvaluenorm1) by the standard error $SE(\overline{Y}) = \hat\sigma_{\overline{Y}}$. Then,
$$ p\text{-value} = 2\cdot\Phi\left(-\left\lvert \frac{\overline{Y}^{act}-\mu_{Y,0}}{SE(\overline{Y})} \right\rvert \right). $$
This is easily done in `r ttcode("R")`:
```{r}
# sample and estimate, compute standard error
samplemean_act <- mean(
sample(0:1,
prob = c(0.9, 0.1),
replace = T,
size = 100))
SE_samplemean <- sqrt(samplemean_act * (1 - samplemean_act) / 100)
# null hypothesis
mean_h0 <- 0.1
# compute the p-value
pvalue <- 2 * pnorm(- abs(samplemean_act - mean_h0) / SE_samplemean)
pvalue
```
Later in the book, we will encounter more convenient approaches to obtain $t$-statistics and $p$-values using `r ttcode("R")`.
### The t-statistic {-}
In hypothesis testing, the standardized sample average
\begin{equation}
t = \frac{\overline{Y} - \mu_{Y,0}}{SE(\overline{Y})} (\#eq:tstat)
\end{equation}
is called a $t$-statistic. This $t$-statistic plays an important role in testing hypotheses about $\mu_Y$. It is a prominent example of a test statistic.
Implicitly, we already have computed a $t$-statistic for $\overline{Y}$ in the previous code chunk.
```{r}
# compute a t-statistic for the sample mean
tstatistic <- (samplemean_act - mean_h0) / SE_samplemean
tstatistic
```
Using `r ttcode("R")` we can illustrate that if $\mu_{Y,0}$ equal the true value, that is, if the null hypothesis is true, \@ref(eq:tstat) is approximately $\mathcal{N}(0,1)$ distributed when $n$ is large.
```{r}
# prepare empty vector for t-statistics
tstatistics <- numeric(10000)
# set sample size
n <- 300
# simulate 10000 t-statistics
for (i in 1:10000) {
s <- sample(0:1,
size = n,
prob = c(0.9, 0.1),
replace = T)
tstatistics[i] <- (mean(s)-0.1)/sqrt(var(s)/n)
}
```
In the simulation above we estimate the variance of the $Y_i$ using `r ttcode("var(s)")`. This is more general than `r ttcode("mean(s)*(1-mean(s))")` since the latter requires that the data are Bernoulli distributed and that we know this.
```{r, fig.align='center'}
# plot density and compare to N(0,1) density
plot(density(tstatistics),
xlab = "t-statistic",
main = "Estimated Distribution of the t-statistic when n=300",
lwd = 2,
xlim = c(-4, 4),
col = "steelblue")
# N(0,1) density (dashed)
curve(dnorm(x),
add = T,
lty = 2,
lwd = 2)
```
Judging from the plot, the normal approximation works reasonably well for the chosen sample size. This normal approximation has already been used in the definition of the $p$-value, see \@ref(eq:tstat).
### Hypothesis Testing with a Prespecified Significance Level {-}
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC3.5">
<h3 class = "right"> Key Concept 3.5 </h3>
<h3 class = "left"> The Terminology of Hypothesis Testing </h3>
In hypothesis testing, two types of mistakes are possible:
1. The null hypothesis *is* rejected although it is true (type-I-error)
2. The null hypothesis *is not* rejected although it is false (type-II-error)
The **significance level** of the test is the probability of committing a type-I-error that we are willing to accept in advance. E.g., using a prespecified significance level of $0.05$, we reject the null hypothesis if and only if the $p$-value is less than $0.05$. The significance level is chosen before the test is conducted.
An equivalent procedure is to reject the null hypothesis if the observed test statistic is, in absolute value terms, larger than the **critical value** of the test statistic. The critical value is determined by the significance level chosen and defines two disjoint sets of values which are called **acceptance region** and **rejection region**. The acceptance region contains all values of the test statistic for which the test does not reject while the rejection region contains all the values for which the test does reject.
The **$p$-value** is the probability that, in repeated sampling under the same conditions, a test statistic is observed that provides just as much evidence against the null hypothesis as the test statistic actually observed.
The actual probability that the test rejects the true null hypothesis is called the **size of the test**. In an ideal setting, the size equals the significance level.
The probability that the test correctly rejects a false null hypothesis is called **power**.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The Terminology of Hypothesis Testing]{3.5}
In hypothesis testing, two types of mistakes are possible:\\newline
\\begin{enumerate}
\\item The null hypothesis \\textit{is} rejected although it is true (type-I-error)
\\item The null hypothesis \\textit{is not} rejected although it is false (type-II-error)
\\end{enumerate}\\vspace{0.5cm}
The \\textit{significance level} of the test is the probability of committing a type-I-error that we are willing to accept in advance. E.g., using a prespecified significance level of $0.05$, we reject the null hypothesis if and only if the $p$-value is less than $0.05$. The significance level is chosen before the test is conducted.\\newline
An equivalent procedure is to reject the null hypothesis if the observed test statistic is, in absolute value terms, larger than the \\textit{critical value} of the test statistic. The critical value is determined by the significance level chosen and defines two disjoint sets of values which are called \\textit{acceptance region} and \\textit{rejection region}. The acceptance region contains all values of the test statistic for which the test does not reject while the rejection region contains all the values for which the test does reject.\\newline
The \\textit{$p$-value} is the probability that, in repeated sampling under the same conditions, a test statistic is observed that provides just as much evidence against the null hypothesis as the test statistic actually observed.\\newline
The actual probability that the test rejects the true null hypothesis is called the \\textit{size of the test}. In an ideal setting, the size equals the significance level.\\newline
The probability that the test correctly rejects a false null hypothesis is called \\textit{power}.
\\end{keyconcepts}
')
```
Reconsider the `r ttcode("pvalue")` computed further above:
```{r}
# check whether p-value < 0.05
pvalue < 0.05
```
The condition is not fulfilled, so we do not reject the null hypothesis. This is the correct decision here, since the null hypothesis is true by construction.
When working with a $t$-statistic instead, it is equivalent to apply the following rule:
$$ \text{Reject } H_0 \text{ if } \lvert t^{act} \rvert > 1.96. $$
We reject the null hypothesis at the significance level of $5\%$ if the computed $t$-statistic lies beyond the critical value of 1.96 in absolute value terms. $1.96$ is the $0.975$-quantile of the standard normal distribution.
```{r}
# check the critical value
qnorm(p = 0.975)
# check whether the null is rejected using the t-statistic computed further above
abs(tstatistic) > 1.96
```
Just as with the $p$-value, we cannot reject the null hypothesis using the corresponding $t$-statistic. Key Concept 3.6 summarizes the procedure of performing a two-sided hypothesis test about the population mean $E(Y)$.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC3.6">
<h3 class = "right"> Key Concept 3.6 </h3>
<h3 class = "left"> Testing the Hypothesis $E(Y) = \\mu_{Y,0}$ Against the Alternative $E(Y) \\neq \\mu_{Y,0}$ </h3>
1. Estimate $\\mu_{Y}$ using $\\overline{Y}$ and compute the standard error of $\\overline{Y}$, $SE(\\overline{Y})$.
2. Compute the $t$-statistic.
3. Compute the $p$-value and reject the null hypothesis at the $5\\%$ level of significance if the $p$-value is smaller than $0.05$ or, equivalently, if
$$ \\left\\lvert t^{act} \\right\\rvert > 1.96. $$
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Testing the Hypothesis $E(Y) = \\mu_{Y,0}$ Against the Alternative $E(Y) \\neq \\mu_{Y,0}$]{3.6}
\\begin{enumerate}
\\item Estimate $\\mu_{Y}$ using $\\overline{Y}$ and compute the standard error of $\\overline{Y}$, $SE(\\overline{Y})$.
\\item Compute the $t$-statistic.
\\item Compute the $p$-value and reject the null hypothesis at the $5\\%$ level of significance if the $p$-value is smaller than $0.05$ or, equivalently, if $$\\left\\lvert t^{act} \\right\\rvert > 1.96.$$
\\end{enumerate}
\\end{keyconcepts}
')
```
### One-sided Alternatives {-}
Sometimes we are interested in testing if the mean is bigger or smaller than some value hypothesized under the null. To stick to the book, take the presumed wage gap between well and less educated working individuals. Since we anticipate that such a differential exists, a relevant alternative (to the null hypothesis that there is no wage differential) is that well educated individuals earn more, i.e., that the average hourly wage for this group, $\mu_Y$, is *bigger* than $\mu_{Y,0}$, the average wage of less educated workers, which we assume to be known here for simplicity (Section \@ref(cmfdp) discusses how to test the equality of two unknown population means).
This is an example of a *right-sided test* and the hypotheses pair is chosen to be
$$ H_0: \mu_Y = \mu_{Y,0} \ \ \text{vs} \ \ H_1: \mu_Y > \mu_{Y,0}. $$
We reject the null hypothesis if the computed test-statistic is larger than the critical value $1.64$, the $0.95$-quantile of the $\mathcal{N}(0,1)$ distribution. This ensures that $1-0.95=5\%$ probability mass remains in the area to the right of the critical value. As before, we can visualize this in `r ttcode("R")` using the function `r ttcode("polygon()")`.
```{r, fig.align='center'}
# plot the standard normal density on the domain [-4,4]
curve(dnorm(x),
xlim = c(-4, 4),
main = "Rejection Region of a Right-Sided Test",
yaxs = "i",
xlab = "t-statistic",
ylab = "",
lwd = 2,
axes = "F")
# add the x-axis
axis(1,
at = c(-4, 0, 1.64, 4),
padj = 0.5,
labels = c("", 0, expression(Phi^-1~(.95)==1.64), ""))
# shade the rejection region in the right tail
polygon(x = c(1.64, seq(1.64, 4, 0.01), 4),
y = c(0, dnorm(seq(1.64, 4, 0.01)), 0),
col = "darkred")
```
Analogously, for the left-sided test we have $$H_0: \mu_Y = \mu_{Y,0} \ \ \text{vs.} \ \ H_1: \mu_Y < \mu_{Y,0}.$$ The null is rejected if the observed test statistic falls short of the critical value which, for a test at the $0.05$ level of significance, is given by $-1.64$, the $0.05$-quantile of the $\mathcal{N}(0,1)$ distribution. $5\%$ probability mass lies to the left of the critical value.
It is straightforward to adapt the code chunk above to the case of a left-sided test. We only have to adjust the color shading and the tick marks.
```{r, fig.align='center'}
# plot the standard normal density on the domain [-4,4]
curve(dnorm(x),
xlim = c(-4, 4),
main = "Rejection Region of a Left-Sided Test",
yaxs = "i",
xlab = "t-statistic",
ylab = "",
lwd = 2,
axes = "F")
# add x-axis
axis(1,
at = c(-4, 0, -1.64, 4),
padj = 0.5,
labels = c("", 0, expression(Phi^-1~(.05)==-1.64), ""))
# shade rejection region in left tail
polygon(x = c(-4, seq(-4, -1.64, 0.01), -1.64),
y = c(0, dnorm(seq(-4, -1.64, 0.01)), 0),
col = "darkred")
```
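Correspondingly, one-sided $p$-values use only a single tail of the standard normal distribution. A short sketch, reusing the `r ttcode("tstatistic")` computed further above for the Bernoulli example:
```{r}
# one-sided p-values based on the t-statistic computed above
pnorm(tstatistic, lower.tail = FALSE)  # right-sided alternative H_1: mu_Y > mu_Y0
pnorm(tstatistic)                      # left-sided alternative  H_1: mu_Y < mu_Y0
```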
## Confidence Intervals for the Population Mean
As mentioned before, we will never estimate the *exact* value of the population mean of $Y$ using a random sample. However, we can compute confidence intervals for the population mean. In general, a confidence interval for an unknown parameter is a recipe that, in repeated samples, yields intervals that contain the true parameter with a prespecified probability, the *confidence level*. Confidence intervals are computed using the information available in the sample. Since this information is the result of a random process, confidence intervals are random variables themselves.
Key Concept 3.7 shows how to compute confidence intervals for the unknown population mean $E(Y)$.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC3.7">
<h3 class = "right"> Key Concept 3.7 </h3>
<h3 class = "left"> Confidence Intervals for the Population Mean </h3>
A $95\\%$ confidence interval for $\\mu_Y$ is a random variable that contains the true $\\mu_Y$ in $95\\%$ of all possible random samples. When $n$ is large we can use the normal approximation. Then, $99\\%$, $95\\%$, $90\\%$ confidence intervals are
\\begin{align}
&99\\%\\text{ confidence interval for } \\mu_Y = \\left[ \\overline{Y} \\pm 2.58 \\times SE(\\overline{Y}) \\right], \\\\
&95\\%\\text{ confidence interval for } \\mu_Y = \\left[\\overline{Y} \\pm 1.96 \\times SE(\\overline{Y}) \\right], \\\\
&90\\%\\text{ confidence interval for } \\mu_Y = \\left[ \\overline{Y} \\pm 1.64 \\times SE(\\overline{Y}) \\right].
\\end{align}
These confidence intervals are sets of null hypotheses we cannot reject in a two-sided hypothesis test at the given level of confidence.
Now consider the following statements.
1. In repeated sampling, the interval
$$ \\left[ \\overline{Y} \\pm 1.96 \\times SE(\\overline{Y}) \\right] $$
covers the true value of $\\mu_Y$ with a probability of $95\\%$.
2. We have computed $\\overline{Y} = 5.1$ and $SE(\\overline{Y})=2.5$ so the interval
$$ \\left[ 5.1 \\pm 1.96 \\times 2.5 \\right] = \\left[0.2,10\\right] $$ covers the true value of $\\mu_Y$ with a probability of $95\\%$.
While 1. is right (this is in line with the definition above), 2. is wrong and none of your lecturers wants to read such a sentence in a term paper, written exam or similar, believe us.
The difference is that, while 1. is the definition of a random variable, 2. is one possible *outcome* of this random variable so there is no meaning in making any probabilistic statement about it. Either the computed interval does cover $\\mu_Y$ *or* it does not!
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Confidence Intervals for the Population Mean]{3.7}
A $95\\%$ confidence interval for $\\mu_Y$ is a \\texttt{random variable} that contains the true $\\mu_Y$ in $95\\%$ of all possible random samples. When $n$ is large we can use the normal approximation. Then, $99\\%$, $95\\%$, $90\\%$ confidence intervals are
\\begin{align}
&99\\%\\text{ confidence interval for } \\mu_Y = \\left[ \\overline{Y} \\pm 2.58 \\times SE(\\overline{Y}) \\right], \\\\
&95\\%\\text{ confidence interval for } \\mu_Y = \\left[\\overline{Y} \\pm 1.96 \\times SE(\\overline{Y}) \\right], \\\\
&90\\%\\text{ confidence interval for } \\mu_Y = \\left[ \\overline{Y} \\pm 1.64 \\times SE(\\overline{Y}) \\right].
\\end{align}
These confidence intervals are sets of null hypotheses we cannot reject in a two-sided hypothesis test at the given level of confidence.\\newline
Now consider the following statements.\\newline
\\begin{enumerate}
\\item In repeated sampling, the interval
$$ \\left[ \\overline{Y} \\pm 1.96 \\times SE(\\overline{Y}) \\right] $$
covers the true value of $\\mu_Y$ with a probability of $95\\%$.
\\item We have computed $\\overline{Y} = 5.1$ and $SE(\\overline{Y})=2.5$ so the interval
$$ \\left[5.1 \\pm 1.96 \\times 2.5 \\right] = \\left[0.2,10\\right] $$ covers the true value of $\\mu_Y$ with a probability of $95\\%$.
\\end{enumerate}\\vspace{0.5cm}
While 1. is right (this is in line with the definition above), 2. is wrong and none of your lecturers wants to read such a sentence in a term paper, written exam or similar, believe us.
The difference is that, while 1. is the definition of a random variable, 2. is one possible \\textit{outcome} of this random variable so there is no meaning in making any probabilistic statement about it. Either the computed interval \\textit{does cover} $\\mu_Y$ or it \\textit{does not}!
\\end{keyconcepts}
')
```
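Before turning to `r ttcode("t.test()")`, it is instructive to compute a large-sample $95\%$ confidence interval by hand, as in Key Concept 3.7, using the sample mean and standard error for the Bernoulli example obtained further above:
```{r}
# a large-sample 95% confidence interval for the Bernoulli mean from above
c(samplemean_act - 1.96 * SE_samplemean,
  samplemean_act + 1.96 * SE_samplemean)
```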
In `r ttcode("R")`, testing of hypotheses about the mean of a population on the basis of a random sample is very easy due to functions like `r ttcode("t.test()")` from the `r ttcode("stats")` package. It produces an object of type `r ttcode("list")`. Luckily, one of the most simple ways to use `r ttcode("t.test()")` is when you want to obtain a $95\%$ confidence interval for some population mean. We start by generating some random data and calling `r ttcode("t.test()")` in conjunction with `r ttcode("ls()")` to obtain a breakdown of the output components.
```{r}
# set seed
set.seed(1)
# generate some sample data
sampledata <- rnorm(100, 10, 10)
# check the type of the outcome produced by t.test
typeof(t.test(sampledata))
# display the list elements produced by t.test
ls(t.test(sampledata))
```
Though we find that many items are reported, at the moment we are only interested in computing a $95\%$ confidence set for the mean.
```{r}
t.test(sampledata)$"conf.int"
```
This tells us that the $95\%$ confidence interval is
$$ \left[9.31, 12.87\right]. $$
In this example, the computed interval obviously does cover the true $\mu_Y$ which we know to be $10$.
Let us have a look at the whole standard output produced by `r ttcode("t.test()")`.
```{r}
t.test(sampledata)
```
We see that `r ttcode("t.test()")` does not only compute a $95\%$ confidence interval but automatically conducts a two-sided significance test of the hypothesis $H_0: \mu_Y = 0$ at the level of $5\%$ and reports relevant parameters thereof: the alternative hypothesis, the estimated mean, the resulting $t$-statistic, the degrees of freedom of the underlying $t$-distribution (note that `r ttcode("t.test()")` performs the normal approximation) and the corresponding $p$-value. This is very convenient!
In this example, we come to the conclusion that the population mean is significantly different from $0$ (which is correct) at the level of $5\%$, since $\mu_Y = 0$ is not an element of the $95\%$ confidence interval
$$ 0 \not\in \left[9.31,12.87\right]. $$
We come to an equivalent result when using the $p$-value rejection rule, since the reported
$$ p\text{-value} < 2.2\cdot 10^{-16} \ll 0.05. $$
## Comparing Means from Different Populations {#cmfdp}
Suppose you are interested in the means of two different populations, denote them $\mu_1$ and $\mu_2$. More specifically, you are interested in whether these population means are different from each other and plan to use a hypothesis test to verify this on the basis of independent sample data from both populations. A suitable pair of hypotheses is
\begin{equation}
H_0: \mu_1 - \mu_2 = d_0 \ \ \text{vs.} \ \ H_1: \mu_1 - \mu_2 \neq d_0, (\#eq:hypmeans)
\end{equation}
where $d_0$ denotes the hypothesized difference in means (so $d_0=0$ when the means are equal, under the null hypothesis). The book teaches us that $H_0$ can be tested with the $t$-statistic
\begin{equation}
t=\frac{(\overline{Y}_1 - \overline{Y}_2) - d_0}{SE(\overline{Y}_1 - \overline{Y}_2)} (\#eq:tstatmeans)
\end{equation}
where
\begin{equation}
SE(\overline{Y}_1 - \overline{Y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}.
\end{equation}
This is called a two-sample $t$-test. For large $n_1$ and $n_2$, \@ref(eq:tstatmeans) is approximately standard normal under the null hypothesis. Analogously to the simple $t$-test, we can compute confidence intervals for the true difference in population means:
$$ (\overline{Y}_1 - \overline{Y}_2) \pm 1.96 \times SE(\overline{Y}_1 - \overline{Y}_2) $$
is a $95\%$ confidence interval for the true difference in means $d = \mu_1 - \mu_2$. <br> In `r ttcode("R")`, hypotheses as in \@ref(eq:hypmeans) can be tested with `r ttcode("t.test()")`, too. Note that `r ttcode("t.test()")` chooses $d_0 = 0$ by default. This can be changed by setting the argument `r ttcode("mu")` accordingly.
The subsequent code chunk demonstrates how to perform a two sample $t$-test in `r ttcode("R")` using simulated data.
```{r}
# set random seed
set.seed(1)
# draw data from two different populations with equal mean
sample_pop1 <- rnorm(100, 10, 10)
sample_pop2 <- rnorm(100, 10, 20)
# perform a two sample t-test
t.test(sample_pop1, sample_pop2)
```
We find that the two-sample $t$-test does not reject the (true) null hypothesis that $d_0 = 0$.
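The reported $t$-statistic can be reproduced by hand using \@ref(eq:tstatmeans); a short sketch based on the simulated samples:
```{r}
# compute the two-sample t-statistic manually
SE_diff <- sqrt(var(sample_pop1) / length(sample_pop1) +
                var(sample_pop2) / length(sample_pop2))
(mean(sample_pop1) - mean(sample_pop2)) / SE_diff
```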
## An Application to the Gender Gap of Earnings {#aattggoe}
This section discusses how to reproduce the results presented in the box *The Gender Gap of Earnings of College Graduates in the United States* of the book.
In order to reproduce Table 3.1 of the book you need to download the replication data which are hosted by Pearson and can be downloaded [here](https://github.com/mca91/EconometricsWithR/raw/master/data/cps_ch3.xlsx). This file contains data that range from $1992$ to $2008$ and earnings are reported in prices of $2008$.
There are several ways to import the `r ttcode(".xlsx")`-files into `r ttcode("R")`. Our suggestion is the function `r ttcode("read_excel()")` from the `r ttcode("readxl")` package [@R-readxl]. The package is not a part of `r ttcode("R")`'s base version and has to be installed manually.
```{r, message=F, warning=F}
# load the 'readxl' package
library(readxl)
```
You are now ready to import the dataset. Make sure you use the correct path to import the downloaded file! In our example, the file is saved in a subfolder of the working directory named `r ttcode("data")`. If you are not sure what your current working directory is, use `r ttcode("getwd()")`, see also `r ttcode("?getwd")`. This will give you the path that points to the place `r ttcode("R")` is currently looking for files to work with.
```{r, message = F, warning = F,}
# import the data into R
cps <- read_excel(path = "data/cps_ch3.xlsx")
```
Next, install and load the package `r ttcode("dplyr")` [@R-dplyr]. This package provides some handy functions that simplify data wrangling a lot. It makes use of the `r ttcode("%>%")` operator.
```{r, message=F, warning=F}
# load the 'dplyr' package
library(dplyr)
```
First, get an overview of the dataset. Next, use `r ttcode("%>%")` and some functions from the `r ttcode("dplyr")` package to group the observations by gender and year and compute descriptive statistics for both groups.
```{r, echo=T, eval=T, fig.align='center'}