output | title | abstract | author | date | geometry | linkcolor | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Basic inferential data analysis of ToothGrowth dataset |
***This document provides the assignment 'Course Project Part 2' for Coursera's Statistical Inference Class in the Coursera Data Science series. Replication files are available on the author's Github account (https://github.com/tomfischersz).*** |
Thomas Fischer |
`r format(Sys.time(), '%B %d, %Y')` |
margin=1in |
blue |
knitr::opts_chunk$set(fig.pos= "h", out.extra = '')
In this report we conduct a basic inferential analysis of the ToothGrowth dataset from the R package 'datasets'. We aim to answer the question of whether the dosage and/or delivery method of vitamin C affects tooth growth in guinea pigs. To that end we observe patterns in the data, formulate hypotheses, and then use statistical tools such as confidence intervals and Student's t-test to test these hypotheses.
The data consist of 60 observations of 3 variables; here are the first few observations:
The help page for the data set ToothGrowth (see footnote 1) gives the following description:
The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
Our data are the results of a study performed on guinea pigs to determine the effect of vitamin C on tooth growth. The data contain 3 variables:
- len: The response (dependent) variable of the experiment, the tooth length measured for each of the 60 guinea pigs.
- supp and dose: The two factors (independent variables), namely the delivery method of the vitamin C (supplement type) and the dose level of vitamin C in mg/day. We are interested in the effect of these two factors on the response.
Table \ref{tab:data_summary} shows an aggregated summary of our data. There are 6 factor-level combinations, and each combination was applied to 10 guinea pigs. We hereafter call each combination a treatment (and have added a corresponding column to the data); e.g. "OJ_0.5" denotes the treatment with the delivery method 'Orange Juice' at a dose level of 0.5 mg/day.
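The balanced design described above can be checked directly with a cross-tabulation. The following is a minimal sketch in base R; the name `cell_counts` is introduced here for illustration only and is not part of the report's own code.

```r
library(datasets)
data(ToothGrowth)

# Cross-tabulate delivery method (rows) against dose level (columns);
# each cell counts the guinea pigs that received that combination.
cell_counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(cell_counts)

# TRUE if every one of the 6 combinations was applied to exactly 10 animals
all(cell_counts == 10)
```

A balanced design like this keeps the group sizes equal, which makes the group comparisons that follow easier to interpret.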
We now visualize the center and spread of tooth length for our six distinct treatment groups (Code):
plot(fig_1)
Figure \ref{fig:boxplot_1} suggests that both the dose and the delivery method have some effect on tooth growth. Tooth length appears to increase with the dose level, and orange juice appears to produce greater tooth length than ascorbic acid (VC), except at the dose level of 2 mg/day.
We now test several hypotheses. Our significance level (i.e. the probability of a Type I error) for all tests will be $\alpha = 0.05$.
Before proceeding with our analysis, it is important to check the assumptions required for Student's t-test. We must be reasonably sure that the following assumptions are not violated:
- Independent and identically distributed: We assume that the 60 guinea pigs were chosen for the experiment independently and that they are drawn from the same population. Otherwise our results would not be reliable; e.g. if the guinea pigs originate from two different breeders, or if male and female animals differ, our conclusions could be flawed.
- Normality: The probability distribution of the measured tooth length is normal within each treatment. Judging from Figure \ref{fig:fig_2}, this assumption appears to be reasonably satisfied.
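As a complement to the visual check, the normality assumption could also be probed with a formal test. The sketch below applies the Shapiro-Wilk test to each treatment group; this test is not part of the original analysis, and the names `tg` and `shapiro_p` are hypothetical. With only 10 observations per group the test has little power, so a large p-value is weak reassurance at best.

```r
library(datasets)
data(ToothGrowth)

# Work on a copy and rebuild the treatment label (supplement_dose)
tg <- ToothGrowth
tg$treatment <- interaction(tg$supp, tg$dose, sep = "_")

# Shapiro-Wilk p-value for tooth length within each of the 6 treatments;
# small p-values would indicate a departure from normality
shapiro_p <- sapply(split(tg$len, tg$treatment),
                    function(x) shapiro.test(x)$p.value)
round(shapiro_p, 3)
```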
We want to test the null hypothesis that the mean tooth lengths for the two delivery methods are equal against the alternative hypothesis that they differ: $H_0: \mu_{OJ} = \mu_{VC}$ versus $H_a: \mu_{OJ} \neq \mu_{VC}$.
Having stated the relevant null and alternative hypotheses, we conduct a two-tailed t-test (Code):
print(t_01)
As the obtained p-value of `r round(t_01$p.value, 3)` is greater than the significance level of 0.05 (and the 95% confidence interval contains 0), we cannot reject our null hypothesis. Looking at Figure \ref{fig:boxplot_1} again, the failure to reject the null hypothesis is likely due to the similar tooth lengths for both delivery methods at a vitamin C dose of 2 mg/day.
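To make the test less of a black box, the Welch statistic, its degrees of freedom, and the two-sided p-value can be recomputed by hand and compared with the output of `t.test`. This is an illustrative sketch; the variable names (`oj`, `vc`, `t_stat`, `df_welch`, `p_manual`, `ref`) are introduced here and do not appear in the report's code.

```r
library(datasets)
data(ToothGrowth)

# Tooth lengths for the two delivery methods
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard errors of the two group means
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch t statistic (unequal variances) for mean(OJ) - mean(VC)
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value: probability mass in both tails beyond |t|
p_manual <- 2 * pt(-abs(t_stat), df = df_welch)

# Reference result from the built-in test
ref <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)
all.equal(p_manual, ref$p.value)
```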
Our next hypothesis tests examine whether, for orange juice only, higher doses of vitamin C are associated with significantly greater tooth length. We conduct two one-tailed t-tests and therefore need to adjust for multiple comparisons. Using the Bonferroni correction, we divide the significance level by the number of tests, which raises the confidence level of each test from 95% to $1 - 0.05/2 = 97.5\%$.
Having conducted the relevant t-tests (Code), we get the following results:
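The arithmetic behind the Bonferroni adjustment is short enough to spell out. The sketch below assumes a family of two tests at an overall significance level of 0.05; the names `alpha`, `n_tests`, `alpha_adj`, and `conf_level_adj` are illustrative.

```r
alpha   <- 0.05  # overall (family-wise) significance level
n_tests <- 2     # number of one-tailed t-tests in the family

# Bonferroni: each individual test is run at alpha / n_tests
alpha_adj <- alpha / n_tests

# Matching per-test confidence level, usable as conf.level in t.test()
conf_level_adj <- 1 - alpha_adj
conf_level_adj
```

Equivalently, one could leave the confidence level untouched and adjust the p-values instead with `p.adjust(p, method = "bonferroni")`, then compare the adjusted p-values against the original 0.05.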
kable(sum_ttests,
format = 'latex',
booktabs = TRUE,
digits = 2,
caption = 'Summary of t-tests for different levels of doses (Orange Juice)\\label{tab:sum_ttests}',
col.names = c('Sample Groups', 'p-values',
'Lower Conf.Interval', 'Upper Conf.Interval' )) %>%
kable_styling(latex_options = c("striped", "hold_position"))
As we can see, both p-values are below our Bonferroni-adjusted significance level of 0.025, so we reject both null hypotheses. To summarize our findings:
- No significant evidence, at the 5% level, that mean tooth length differs between the two delivery methods.
- Strong evidence that, for the delivery method orange juice, mean tooth length increases with the dose.
\newpage
kable(df_summary,
format = 'latex',
booktabs = TRUE,
digits = 2,
caption = 'Summary of the different treatments for the guinea pigs with
their associated average tooth length and the corresponding standard
deviation\\label{tab:data_summary}',
col.names = c('Supplement', 'Dose (mg/day)',
'Treatment', 'N (number of pigs)',
'Mean',
'Standard Deviation')) %>%
kable_styling(latex_options = c("striped", "hold_position"))
plot(fig_2)
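# library() stops with an error if a package is missing,
# whereas require() only warns and returns FALSE
library(knitr)
library(kableExtra)
library(datasets)
library(ggplot2)
library(dplyr)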
data(ToothGrowth)
# names(ToothGrowth) <- c('length', 'supplement', 'dose')
ToothGrowth$treatment <- with(ToothGrowth, interaction(supp, dose, sep = '_'))
kable(head(ToothGrowth[, 1:3], n=3),
format = 'latex',
booktabs = TRUE,
caption = "The first few observations of the data set
ToothGrowth\\label{tab:show_obs}") %>%
kable_styling(latex_options = c("striped", "hold_position"))
df_summary <-
ToothGrowth %>%
group_by(supp, dose, treatment) %>%
summarise(N = n(),
mean_len = mean(len),
sd_len = sd(len)) %>%
as.data.frame()
fig_1 <- ggplot(ToothGrowth, aes(x=factor(dose), y=len)) +
facet_grid(.~supp) +
geom_boxplot(aes(fill = supp), show.legend = FALSE) +
labs(title = "Guinea pig Tooth Length by Dosage for different treatments",
x = "Dose (mg/day)",
y = "Tooth Length")
fig_2 <- ggplot(ToothGrowth, aes(x = len)) +
geom_density(adjust = 1.5) +
facet_wrap(~ treatment)
t_01 <- t.test(len ~ supp, data = ToothGrowth,
paired = FALSE, var.equal = FALSE,
alternative = 'two.sided')
t_02_1 <-
t.test(len~dose,
data = ToothGrowth[ToothGrowth$treatment %in% c('OJ_0.5', 'OJ_1'),],
paired = FALSE, var.equal = FALSE,
alternative = 'less', conf.level = 0.975)
t_02_2 <-
t.test(len~dose,
data = ToothGrowth[ToothGrowth$treatment %in% c('OJ_1', 'OJ_2'),],
paired = FALSE, var.equal = FALSE,
alternative = 'less', conf.level = 0.975)
sum_ttests <-
data.frame(sample_group = c('OJ_0.5 versus OJ_1', 'OJ_1 versus OJ_2'),
p_value = c(round(t_02_1$p.value,4), round(t_02_2$p.value,4)),
confint_lower = c(t_02_1$conf.int[[1]], t_02_2$conf.int[[1]]),
confint_upper = c(t_02_1$conf.int[[2]], t_02_2$conf.int[[2]]))
Footnotes
1. Use the R command help(ToothGrowth) to get further information.