---
title: "Visualising the performance of 4 countries"
output:
  pdf_document:
    toc: true
    number_sections: true
    df_print: kable
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This report reads data from a CSV file and compares two variables, `var_a` and `var_b`, across four countries: England, Ireland, Scotland and Wales. The data consists of two continuous variables, `var_a` and `var_b`, and one categorical variable, `country`, which is represented as a factor.
```{r}
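library(knitr)
library(kableExtra)

# Since R 4.0, read.csv() no longer converts strings to factors by default.
data <- read.csv("week2_data_4cat.csv", stringsAsFactors = TRUE)
str(data)

# Preview the first rows as a LaTeX table.
kable(head(data), "latex", caption = "Sample of the dataset in consideration", booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped", "hold_position"))

# Relabel the factor levels by position (existing levels are sorted alphabetically).
levels(data$country) <- c("England", "Ireland", "Scotland", "Wales")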
```
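The `levels<-` assignment above relabels the factor levels by position, so it relies on the existing levels already being in the order England, Ireland, Scotland, Wales. The following is a minimal sketch of that behaviour on made-up data, next to a more explicit alternative that uses `factor()` with `levels` and `labels`. The data frame `toy` and the codes `"E"`, `"I"`, `"S"`, `"W"` are hypothetical and are not taken from `week2_data_4cat.csv`.

```r
# Hypothetical country codes; the codes used in the real file are not shown here.
toy <- data.frame(country = c("W", "E", "S", "I", "E"), stringsAsFactors = TRUE)

# Factor levels are sorted alphabetically by default: "E" "I" "S" "W".
levels(toy$country)

# Positional relabelling: the i-th new name replaces the i-th existing level.
by_position <- toy$country
levels(by_position) <- c("England", "Ireland", "Scotland", "Wales")

# Explicit mapping: each code is paired with its label, so the result does
# not depend on the sort order of the original levels.
by_label <- factor(toy$country,
                   levels = c("E", "I", "S", "W"),
                   labels = c("England", "Ireland", "Scotland", "Wales"))

# Both approaches agree here because the codes happen to sort in the same order.
identical(as.character(by_position), as.character(by_label))
```

If the codes in the file sorted differently from the intended labels, only the explicit mapping would still be correct, which is why it may be the safer choice.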
# Scatterplot
## Plot
The following plot shows `var_a` against `var_b` for England, Ireland, Scotland and Wales. Each country is drawn with its own shape and color, as indicated by the legend next to the plot.
```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(ggplot2)

ggplot(data, aes(x = var_a, y = var_b, color = country, shape = country, size = country)) +
  geom_point() +
  scale_color_manual(values = c("orange", "blue", "green", "red"), name = "country") +
  scale_shape_manual(values = c(0, 3, 5, 1), name = "country") +
  scale_size_manual(values = c(2, 2, 3, 3), name = "country") +
  scale_x_continuous(breaks = c(4, 6, 8, 10, 12, 14, 16, 18, 20)) +
  labs(caption = "Fig 1: Plot of var_a vs var_b by country") +
  xlab("var_a") +
  ylab("var_b") +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 10),
        legend.title = element_text(face = "bold", size = 9),
        legend.text = element_text(face = "italic", size = 10),
        plot.caption = element_text(hjust = 0.6))
```
## Some information regarding the above plot

The scatter plot above is created with the ggplot2 library. The aesthetics used are position, color, shape and size. Orange, blue, green and red were chosen so that the countries can be easily distinguished. Shapes without a fill were chosen because some data points overlap. Two point sizes are used (2 for England and Ireland, 3 for Scotland and Wales) because all the countries overlap at the point (6, 6). Figures typically have their captions below them, so a caption was added with `labs()` and moved towards the horizontal center of the figure with `hjust = 0.6`. The theme `theme_bw()` gives a white background with light grey grid lines, which in turn helps to highlight the data points.
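The overlap mentioned above can also be checked numerically rather than by eye. The following is a minimal base R sketch that counts how many distinct countries share each (`var_a`, `var_b`) coordinate. The data frame `toy` and its values are made up for illustration and are not the report's data.

```r
# Made-up data in the same shape as the report's dataset.
toy <- data.frame(
  country = c("England", "Ireland", "Scotland", "Wales", "England", "Wales"),
  var_a   = c(6, 6, 6, 6, 8, 10),
  var_b   = c(6, 6, 6, 6, 7, 5)
)

# Build one key per coordinate, e.g. "6,6".
key <- paste(toy$var_a, toy$var_b, sep = ",")

# For every coordinate, count the distinct countries plotted there.
n_countries <- vapply(split(toy$country, key),
                      function(x) length(unique(x)),
                      integer(1))

# Keep only coordinates shared by more than one country.
shared <- n_countries[n_countries > 1]
shared
```

Coordinates listed in `shared` are the ones where hollow shapes and different point sizes matter most.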
# Statistical summary
## Table
For each country, the table below summarizes the mean and standard deviation of `var_a` and `var_b`, and the correlation between them.
```{r}
library(dplyr)

data_table <- data %>%
  group_by(country) %>%
  summarize(mean_a = mean(var_a),
            sd_a = sd(var_a),
            mean_b = mean(var_b),
            sd_b = sd(var_b),
            corr_ab = cor(var_a, var_b))
data_table
```
```{r }
kable(data_table, "latex",
      caption = "Summarized statistics table of all the 4 countries in consideration",
      col.names = c("Country", "Mean", "SD", "Mean", "SD", "Correlation(a,b)"),
      booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped", "hold_position")) %>%
  add_header_above(c(" " = 1, "A" = 2, "B" = 2, " " = 1))
```
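For reference, `sd()` uses the sample (n - 1) denominator and `cor()` computes the Pearson coefficient by default. The following is a minimal sketch that recomputes both by hand for a single group. The vectors `a` and `b` are made-up values, not the report's data.

```r
# Made-up values for a single group; not taken from week2_data_4cat.csv.
a <- c(4, 6, 8, 10, 12)
b <- c(5, 6, 9, 10, 13)
n <- length(a)

# Sample standard deviation: squared deviations divided by n - 1.
sd_a_manual <- sqrt(sum((a - mean(a))^2) / (n - 1))

# Pearson correlation: sample covariance over the product of the two SDs.
cov_ab      <- sum((a - mean(a)) * (b - mean(b))) / (n - 1)
corr_manual <- cov_ab / (sd(a) * sd(b))

# Both comparisons against the built-in functions should be TRUE.
isTRUE(all.equal(sd_a_manual, sd(a)))
isTRUE(all.equal(corr_manual, cor(a, b)))
```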
## Insights from the above visualisation and summary statistics

* Even though a few outliers are present, the correlation between `var_a` and `var_b` is high for every country except Wales, which indicates a relationship between the two variables.
* England and Scotland follow a linear relationship between `var_a` and `var_b`, whereas Ireland follows a non-linear relationship.
* England and Scotland appear to share a similar distribution, since a high number of their values overlap.
* From the visualisation, we can also read off the minimum and maximum values for each country.
* `var_a` and `var_b` are continuous variables, whereas `country` is a categorical variable.
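
A high correlation and a non-linear relationship are not contradictory: the Pearson coefficient returned by `cor()` can stay high for a curved but increasing pattern, and a single outlier can pull it down noticeably. The following is a minimal sketch on made-up data. The Spearman rank correlation is included only for comparison and is not used in this report.

```r
# Made-up data; none of these values come from the report's dataset.
x <- 1:20

# A curved but strictly increasing relationship.
y_curved <- x^2
pearson  <- cor(x, y_curved)                       # Pearson (the default)
spearman <- cor(x, y_curved, method = "spearman")  # rank-based

# A perfectly linear relationship, then the same data with one outlier.
y_linear  <- 2 * x + 1
y_outlier <- y_linear
y_outlier[20] <- -50
pearson_linear  <- cor(x, y_linear)
pearson_outlier <- cor(x, y_outlier)

round(c(pearson = pearson, spearman = spearman,
        linear = pearson_linear, outlier = pearson_outlier), 3)
```

Looking at the scatter plot alongside the coefficient, as this report does, is what separates these cases.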
# References
- Week2 Tutorials
- https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_pdf.pdf
- https://bookdown.org/yihui/rmarkdown/pdf-document.html
# Appendix
```{r, eval=FALSE}
library(knitr)
library(kableExtra)
library(dplyr)
library(ggplot2)

# Since R 4.0, read.csv() no longer converts strings to factors by default.
data <- read.csv("week2_data_4cat.csv", stringsAsFactors = TRUE)
str(data)

kable(head(data), "latex", caption = "Sample of the dataset in consideration", booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped", "hold_position"))

# Relabel the factor levels by position (existing levels are sorted alphabetically).
levels(data$country) <- c("England", "Ireland", "Scotland", "Wales")

ggplot(data, aes(x = var_a, y = var_b, color = country, shape = country, size = country)) +
  geom_point() +
  scale_color_manual(values = c("orange", "blue", "green", "red"), name = "country") +
  scale_shape_manual(values = c(0, 3, 5, 1), name = "country") +
  scale_size_manual(values = c(2, 2, 3, 3), name = "country") +
  scale_x_continuous(breaks = c(4, 6, 8, 10, 12, 14, 16, 18, 20)) +
  labs(caption = "Fig 1: Plot of var_a vs var_b by country") +
  xlab("var_a") +
  ylab("var_b") +
  theme_bw() +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 10),
        legend.title = element_text(face = "bold", size = 9),
        legend.text = element_text(face = "italic", size = 10),
        plot.caption = element_text(hjust = 0.6))

data_table <- data %>%
  group_by(country) %>%
  summarize(mean_a = mean(var_a),
            sd_a = sd(var_a),
            mean_b = mean(var_b),
            sd_b = sd(var_b),
            corr_ab = cor(var_a, var_b))
data_table

kable(data_table, "latex",
      caption = "Summarized statistics table of all the 4 countries in consideration",
      col.names = c("Country", "Mean", "SD", "Mean", "SD", "Correlation(a,b)"),
      booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped", "hold_position")) %>%
  add_header_above(c(" " = 1, "A" = 2, "B" = 2, " " = 1))
```