---
title: "LM312- R programming for Data Science"
date: "2023-06-01"
output:
  word_document: default
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# 1. Upload your dataset (Make sure you are only dealing with the dataset assigned to you).
```{r, message=FALSE, warning=FALSE}
#Import the data
library(readr)
data <- read_csv("dataset_2_ILM312 2.csv")
```
# 2. Generate your hypothesis for regression analysis.
Hypothesis 1:
Research Question: Do the lot area, number of bedrooms above ground, and garage area have a significant impact on the sale price of the houses?
Null Hypothesis (H0): There is no relationship between the lot area, number of bedrooms above ground, garage area, and the sale price of the houses.
Alternative Hypothesis (H1): There is a relationship between the lot area, number of bedrooms above ground, garage area, and the sale price of the houses.
Hypothesis 2:
Research Question: Are any of the predictors (lot area, number of bedrooms above ground, garage area) individually associated with the sale price of the houses?
Null Hypothesis (H0): The coefficients of the lot area, number of bedrooms above ground, and garage area in the regression model are all zero (i.e., no effect on the sale price).
Alternative Hypothesis (H1): At least one of the coefficients of the lot area, number of bedrooms above ground, and garage area in the regression model is not zero (i.e., there is an effect on the sale price).
# 3. Check whether your regression assumptions are met
Assumption 1: Linearity: The relationship between the predictors and the response variable is linear.
```{r,message=FALSE, warning=FALSE}
library(ggplot2)
# Scatterplot of the response variable against the observation index
data$Index <- seq_len(nrow(data)) # Create an index column
ggplot(data, aes(x = Index, y = SalePrice * 0.001)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(
    title = "Scatterplot of response variable against index",
    x = "Index",
    y = "Sale Price (thousands)"
  ) +
  theme_bw()
```
• The points in the scatter plot above lie along the smoothed line, suggesting a consistent relationship between the response variable and the observation index. This plot only relates the sale price to the row order, however; linearity between the predictors and the response is better judged from a residuals-versus-fitted plot or from scatter plots of SalePrice against each predictor. On the evidence shown here, the linearity assumption is taken as met.
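
A residuals-versus-fitted plot is a common way to look at linearity directly. The chunk below is a minimal sketch on simulated stand-in data, not the assigned dataset: `sim` and `sim_model` are hypothetical names, and the coefficients used to generate the data are arbitrary assumptions. Points scattered evenly around the zero line, with a roughly flat smoother, would be consistent with a linear relationship.

```{r}
# Simulated stand-in data (hypothetical; NOT the assigned dataset)
set.seed(1)
n <- 200
sim <- data.frame(
  LotArea = runif(n, 2000, 15000),
  BedroomAbvGr = sample(1:5, n, replace = TRUE),
  GarageArea = runif(n, 0, 900)
)
sim$SalePrice <- 50000 + 2 * sim$LotArea + 10000 * sim$BedroomAbvGr +
  100 * sim$GarageArea + rnorm(n, sd = 20000)
sim_model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = sim)

# Residuals against fitted values, with a zero line and a lowess smoother
plot(fitted(sim_model), resid(sim_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fitted(sim_model), resid(sim_model)), col = "blue")
```
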
Assumption 2: Homoscedasticity: The variability of the residuals is constant across all levels of the predictors.
```{r,message=FALSE, warning=FALSE}
# regression model
model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = data)
# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(model)
```
• The Breusch-Pagan test statistic (BP) is 130.52 with 3 degrees of freedom. The p-value is reported as "< 2.2e-16", which is essentially 0 and far below any conventional significance level. This is strong evidence against homoscedasticity, so the constant-variance assumption is not met.
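
To show where the BP statistic comes from, the chunk below sketches the studentized (Koenker) form of the test by hand: regress the squared residuals on the predictors and multiply the resulting R-squared by the sample size. It uses simulated stand-in data whose error spread deliberately grows with `LotArea`; `sim`, `sim_model` and `aux` are hypothetical names, and none of the numbers come from the assigned dataset.

```{r}
# Simulated stand-in data with non-constant error variance (hypothetical)
set.seed(1)
n <- 200
sim <- data.frame(
  LotArea = runif(n, 2000, 15000),
  BedroomAbvGr = sample(1:5, n, replace = TRUE),
  GarageArea = runif(n, 0, 900)
)
sim$SalePrice <- 50000 + 2 * sim$LotArea + 10000 * sim$BedroomAbvGr +
  100 * sim$GarageArea + rnorm(n, sd = 3 * sim$LotArea)
sim_model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = sim)

# Auxiliary regression of the squared residuals on the same predictors
sim$sq_resid <- resid(sim_model)^2
aux <- lm(sq_resid ~ LotArea + BedroomAbvGr + GarageArea, data = sim)

# Studentized BP statistic: n * R-squared, chi-squared with 3 df
bp_stat <- n * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 3, lower.tail = FALSE)
c(BP = bp_stat, p.value = p_value)
```
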
Assumption 3: Normality: The residuals should follow a normal distribution.
```{r}
library(ggplot2)
# Histogram of residuals with a kernel density overlay
resid_df <- data.frame(residuals = residuals(model))
ggplot(resid_df, aes(x = residuals)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1000,
                 fill = "lightblue", color = "cornflowerblue") +
  geom_density(color = "red", linewidth = 1) +
  labs(x = "Residuals", y = "Density (x 1e-4)") +
  # Rescale the tick labels so the small density values are readable
  scale_y_continuous(labels = function(x) x * 10000)
# Q-Q plot of residuals
qqnorm(residuals(model), yaxt = "n") # Disable automatic y-axis labels
qqline(residuals(model))
# Relabel the y-axis in thousands
y_ticks <- axTicks(2)
axis(2, at = y_ticks, labels = y_ticks * 0.001, las = 1)
```
• From the histogram and Q-Q plot above, the residuals appear to follow an approximately normal distribution, so the normality assumption is taken as met.
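
A formal test can complement the visual check. The chunk below is a minimal sketch of the Shapiro-Wilk test from base R, run on the residuals of a model fitted to simulated stand-in data (`sim` and `sim_model` are hypothetical names; the data are not the assigned dataset). A small p-value would count against normality. Note that `shapiro.test()` accepts between 3 and 5000 observations.

```{r}
# Simulated stand-in data (hypothetical; NOT the assigned dataset)
set.seed(1)
n <- 200
sim <- data.frame(
  LotArea = runif(n, 2000, 15000),
  BedroomAbvGr = sample(1:5, n, replace = TRUE),
  GarageArea = runif(n, 0, 900)
)
sim$SalePrice <- 50000 + 2 * sim$LotArea + 10000 * sim$BedroomAbvGr +
  100 * sim$GarageArea + rnorm(n, sd = 20000)
sim_model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = sim)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(sim_model))
sw
```
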
Assumption 4: No multicollinearity: The predictors should not be highly correlated with each other.
```{r,message=FALSE, warning=FALSE}
# Correlation matrix
cor(data[, c("LotArea", "BedroomAbvGr", "GarageArea")])
# Variance Inflation Factors (VIF)
library(car)
vif(model)
```
• No strong evidence of multicollinearity: the pairwise correlation coefficients between the predictor variables are low, which indicates that no predictor is strongly associated with another.
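
The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. The chunk below sketches this calculation by hand on simulated stand-in data (`sim`, `predictors` and `vif_manual` are hypothetical names; the data are not the assigned dataset). Values close to 1 indicate little collinearity.

```{r}
# Simulated stand-in predictors (hypothetical; NOT the assigned dataset)
set.seed(1)
n <- 200
sim <- data.frame(
  LotArea = runif(n, 2000, 15000),
  BedroomAbvGr = sample(1:5, n, replace = TRUE),
  GarageArea = runif(n, 0, 900)
)
predictors <- c("LotArea", "BedroomAbvGr", "GarageArea")

# Regress each predictor on the others and convert R-squared to a VIF
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = sim))$r.squared
  1 / (1 - r2)
})
vif_manual
```
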
# 4. Apply regression analysis in R with SalePrice as the response variable, and explain your model and the outcomes generated by R in detail.
The linear regression model is:
SalePrice ~ LotArea + BedroomAbvGr + GarageArea
This specifies that "SalePrice" is the response variable, and "LotArea", "BedroomAbvGr," and "GarageArea" are the predictor variables. The data argument is used to specify the dataset containing the variables.
```{r}
# Fit linear regression model
model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = data)
# Summary of the regression model
summary(model)
```
• Impact of Predictors: The coefficients in the model summary give the estimated change in SalePrice for a one-unit increase in each predictor, holding the other predictors constant. For instance:
- LotArea: SalePrice is predicted to rise by roughly 1145 units for every one-unit increase in LotArea.
- BedroomAbvGr: SalePrice is predicted to rise by roughly 10950 units for every additional bedroom above ground.
- GarageArea: SalePrice is predicted to rise by about 219 units for every one-unit increase in GarageArea.
• Statistical Significance: All of the predictor variables (LotArea, BedroomAbvGr, and GarageArea) have p-values below 0.001, indicating that each is a statistically significant predictor of SalePrice. This implies that these variables offer valuable information for forecasting SalePrice.
• Model Fit: According to the R-squared value of 0.4248, the predictor variables in the model account for about 42.48% of the variation in the sale price. The model therefore captures a moderate amount of the variation in SalePrice.
• Overall Model Significance: The F-statistic has a p-value below 2.2e-16, which is exceptionally low and indicates that the overall model is statistically significant: at least one of the predictor variables is linearly related to SalePrice.
• Residuals: The residuals measure how far the observed SalePrice values deviate from the values predicted by the model. The range of the residuals (from -290734 to 471257) shows the spread of the model's prediction errors.
# 5. State if you reject your null hypothesis or not.
Hypothesis 1:
Research Question: Do the lot area, number of bedrooms above ground, and garage area have a significant impact on the sale price of the houses?
• The lot area, number of bedrooms above ground, and garage area significantly impact the sale price of houses, because each has a p-value below 0.05. Increases in lot area, bedrooms, and garage area are associated with higher sale prices. Therefore, we reject the null hypothesis.
Hypothesis 2:
Research Question: Are any of the predictors (lot area, number of bedrooms above ground, garage area) individually associated with the sale price of the houses?
• We reject the null hypothesis. This is because all of the predictors (lot area, number of bedrooms above ground, and garage area) are individually associated with the sale price of the houses (p < 0.001), i.e. each predictor has a significant effect on the sale price.
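
The decisions above rest on p-values that can be pulled out of a fitted model programmatically rather than read off the printed summary. The chunk below is a minimal sketch on simulated stand-in data (`sim`, `sim_model` and the 0.05 threshold `alpha` are assumptions for illustration; the data are not the assigned dataset). The overall p-value is computed from the F-statistic, and the per-predictor p-values come from the coefficient table.

```{r}
# Simulated stand-in data (hypothetical; NOT the assigned dataset)
set.seed(1)
n <- 200
sim <- data.frame(
  LotArea = runif(n, 2000, 15000),
  BedroomAbvGr = sample(1:5, n, replace = TRUE),
  GarageArea = runif(n, 0, 900)
)
sim$SalePrice <- 50000 + 2 * sim$LotArea + 10000 * sim$BedroomAbvGr +
  100 * sim$GarageArea + rnorm(n, sd = 20000)
sim_model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = sim)

s <- summary(sim_model)
alpha <- 0.05

# Overall F-test: fstatistic holds the value, numerator df, denominator df
f <- s$fstatistic
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
reject_overall <- unname(p_overall < alpha)

# Individual t-tests: drop the intercept row of the coefficient table
p_coef <- coef(s)[-1, "Pr(>|t|)"]
reject_each <- p_coef < alpha

reject_overall
reject_each
```
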
# 6. According to the linear regression model that you have generated, predict the sale price of a house if the lot area is 9000 square meters, the number of bedrooms above ground is 3, and the garage area is 700 square meters.
```{r}
# Predicting sale price
new_data <- data.frame(LotArea = 9000, BedroomAbvGr = 3, GarageArea = 700)
predicted_price <- predict(model, newdata = new_data)
# Printing the predicted sale price
cat("The predicted sale price of a house with a Lot Area of 9000 sq. meters, 3 bedrooms above ground, and a garage area of 700 sq. meters is:", predicted_price, "\n")
```
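
A point prediction is the intercept plus each coefficient multiplied by the corresponding input, and `predict()` can also return a prediction interval around it. The chunk below is a sketch of both on simulated stand-in data (`sim`, `sim_model`, `new_house` and `by_hand` are hypothetical names; the resulting numbers are not those of the assigned dataset).

```{r}
# Simulated stand-in data (hypothetical; NOT the assigned dataset)
set.seed(1)
n <- 200
sim <- data.frame(
  LotArea = runif(n, 2000, 15000),
  BedroomAbvGr = sample(1:5, n, replace = TRUE),
  GarageArea = runif(n, 0, 900)
)
sim$SalePrice <- 50000 + 2 * sim$LotArea + 10000 * sim$BedroomAbvGr +
  100 * sim$GarageArea + rnorm(n, sd = 20000)
sim_model <- lm(SalePrice ~ LotArea + BedroomAbvGr + GarageArea, data = sim)

new_house <- data.frame(LotArea = 9000, BedroomAbvGr = 3, GarageArea = 700)

# By hand: intercept + b1 * LotArea + b2 * BedroomAbvGr + b3 * GarageArea
by_hand <- sum(coef(sim_model) * c(1, 9000, 3, 700))

# Same point prediction, plus a 95% prediction interval (fit, lwr, upr)
pred <- predict(sim_model, newdata = new_house,
                interval = "prediction", level = 0.95)
by_hand
pred
```
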