---
title: "Streaming Data Analytics"
author: "Group"
date: '2023-02-22'
output:
  word_document: default
  pdf_document: default
---
```{r, include=FALSE, message=FALSE, warning=FALSE}
# Load required packages
library(MASS)
library(matrixcalc)
library(mvtnorm)
library(data.table)
library(numDeriv)
library(ggplot2)
library(abind)
library(bbmle)
library(emmeans)
library(glm2)
library(MatrixModels)
library(mnormt)
library(Matrix)
library(mlogit)
library(reshape2)
library(AER)
library(car)
library(mclust)
```
a) Simulation study
```{r}
# Set parameters for the simulation
n_list <- c(25, 50, 100, 500) # Sample sizes
num_simulations <- 2000 # Number of simulations for each sample size
true_mean <- c(0, 0, 0, 0) # True mean vector
# True covariance matrix: unit variances and 0.5 covariances,
# sized 4 x 4 so that it matches the length of true_mean
true_cov <- matrix(0.5, nrow = 4, ncol = 4)
diag(true_cov) <- 1
true_df <- 5 # True degrees of freedom parameter
print(true_mean)
print(true_cov)
```
```{r}
set.seed(123)
# Simulation parameters
n_sim <- 2000
n_vec <- c(25, 50, 100, 500)
Sigma <- matrix(c(1,1/3,1/3,2), nrow=2)
Sigma
```
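The parameters above only set up the study. As a minimal sketch of what the simulation loop could look like, the chunk below draws repeated samples with `rmvt()` for each sample size and records the mean squared error of the sample mean as an estimator of the location. The names prefixed `sketch_`, the zero location vector, the 5 degrees of freedom paired with the 2 x 2 matrix, and the reduced number of replications are all assumptions made for illustration, not part of the original analysis.

```{r}
library(mvtnorm)

set.seed(123)
sketch_n_vec <- c(25, 50, 100, 500)                  # sample sizes, as above
sketch_n_sim <- 200                                  # assumption: fewer than 2000 replications to keep the sketch fast
sketch_df    <- 5                                    # assumption: degrees of freedom
sketch_scale <- matrix(c(1, 1/3, 1/3, 2), nrow = 2)  # scale matrix, same values as Sigma above
sketch_mu    <- c(0, 0)                              # assumption: zero location vector

# For each sample size, simulate sketch_n_sim data sets, estimate the
# location by the column means, and average the squared estimation error
sketch_mse <- sapply(sketch_n_vec, function(n) {
  sq_err <- replicate(sketch_n_sim, {
    x <- rmvt(n, sigma = sketch_scale, df = sketch_df, delta = sketch_mu)
    sum((colMeans(x) - sketch_mu)^2)
  })
  mean(sq_err)
})
names(sketch_mse) <- sketch_n_vec
print(sketch_mse)
```

The mean squared error should shrink as the sample size grows, which is the behaviour a consistency check of this kind is meant to show.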
To begin, we can import the "streaming_popul_data.csv" file into R and load the necessary libraries for our analysis.
```{r, include=FALSE, echo=FALSE, message=FALSE, warning=FALSE}
#Importing the dataset
library(readr)
data <- read_csv("C:/Users/Admins/Desktop/R/data/streaming_popul_data.csv")
```
To estimate the parameters of the multivariate t distribution, we first extract the popularity scores for each genre from the imported data:
```{r}
# create separate datasets for each genre
rap_data <- data$rap
pop_data <- data$pop
metal_data <- data$metal
rock_data <- data$rock
```
Next, we can create a matrix of the data for each genre:
```{r}
library(mvtnorm)
# Create matrices of the data for each genre
rap_data <- as.matrix(data$rap)
pop_data <- as.matrix(data$pop)
metal_data <- as.matrix(data$metal)
rock_data <- as.matrix(data$rock)
```
```{r}
# Calculate the mean vector and covariance matrix for the data
mu <- c(mean(rap_data), mean(pop_data), mean(metal_data), mean(rock_data))
mu
sigma <- cov(cbind(rap_data, pop_data, metal_data, rock_data))
sigma
```
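Finally, we can use the rmvt() function from the mvtnorm package to generate samples from the multivariate t distribution, using the estimated mean vector and covariance matrix. Note that the sigma argument of rmvt() is the scale matrix, not the covariance matrix: for df > 2 degrees of freedom, the covariance equals df / (df - 2) times the scale matrix. From the samples we can calculate the sample means and variances for each genre and use these to make inferences about the streaming success of each genre.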
```{r}
# Generate 1000 samples from the multivariate t distribution
set.seed(123) # Set a seed for reproducibility
n_samples <- 1000
nu <- 5 # Degrees of freedom
# rmvt() expects the scale matrix; the covariance of a multivariate t is
# nu / (nu - 2) times the scale, so rescale the estimated covariance
scale_matrix <- sigma * (nu - 2) / nu
samples <- rmvt(n = n_samples, sigma = scale_matrix, df = nu, delta = mu)
# Calculate the sample means and variances for each genre
sample_means <- colMeans(samples)
sample_vars <- apply(samples, 2, var)
# Print the sample means and standard errors for each genre
genres <- c("Rap", "Pop", "Metal", "Rock")
for (i in seq_along(genres)) {
  cat(paste0(genres[i], ":"), "mean =", round(sample_means[i], 2),
      "SE =", round(sqrt(sample_vars[i] / n_samples), 2), "\n")
}
```
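The relationship between the scale matrix passed to `rmvt()` and the covariance of the resulting draws can be checked numerically. The chunk below is a small sketch on a made-up 2 x 2 scale matrix; `toy_scale`, `toy_df`, and `toy_draws` are hypothetical names and values chosen only for this illustration.

```{r}
library(mvtnorm)

set.seed(1)
toy_scale <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)  # assumption: illustrative scale matrix
toy_df    <- 5                                    # assumption: illustrative degrees of freedom

# Draw a large sample so the empirical covariance is close to its limit
toy_draws <- rmvt(n = 100000, sigma = toy_scale, df = toy_df)

# Covariance implied by the multivariate t distribution for df > 2
theoretical_cov <- toy_scale * toy_df / (toy_df - 2)

print(round(cov(toy_draws), 2))
print(round(theoretical_cov, 2))
```

The two printed matrices should agree closely, while both differ from `toy_scale` itself by the factor df / (df - 2).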
c) To test the hypothesis that the expected popularity is the same in every genre, we can use a one-way ANOVA (Analysis of Variance) test with a significance level of 0.05.
Here's the R code to perform the test:
```{r}
# Create a data frame with the relevant columns
genres_data <- data.frame(
  rap = data$rap,
  pop = data$pop,
  metal = data$metal,
  rock = data$rock
)
```
```{r}
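# Perform a one-way ANOVA test.
# stack() reshapes the genre columns into long format:
# `values` holds the popularity scores and `ind` the genre label.
genres_long <- stack(genres_data)
anova_result <- anova(lm(values ~ ind, data = genres_long))
print(anova_result)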
```
The one-way ANOVA can also be run with aov() on the stacked popularity scores, labelling each observation with its genre:
```{r}
# Extract the columns for each genre
rap_data <- data$rap
pop_data <- data$pop
metal_data <- data$metal
rock_data <- data$rock
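# Perform ANOVA test on the stacked scores, with one genre label per observation
popularity <- c(rap_data, pop_data, metal_data, rock_data)
genre <- factor(rep(
  c("rap", "pop", "metal", "rock"),
  c(length(rap_data), length(pop_data), length(metal_data), length(rock_data))
))
result <- aov(popularity ~ genre)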
summary(result)
```
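To make explicit what the ANOVA table reports, the F statistic can be computed by hand from the between-group and within-group sums of squares. The chunk below is a sketch on synthetic data (three made-up groups of 30 observations); the data frame `toy` and its values are assumptions for illustration and are unrelated to the streaming data.

```{r}
set.seed(42)
# Synthetic scores for three hypothetical groups
toy <- data.frame(
  score = c(rnorm(30, 50, 10), rnorm(30, 55, 10), rnorm(30, 45, 10)),
  genre = factor(rep(c("a", "b", "c"), each = 30))
)

grand_mean  <- mean(toy$score)
group_means <- tapply(toy$score, toy$genre, mean)
group_sizes <- tapply(toy$score, toy$genre, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((toy$score - ave(toy$score, toy$genre))^2)

df_between <- nlevels(toy$genre) - 1
df_within  <- nrow(toy) - nlevels(toy$genre)

f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# Compare with the value reported by aov()
f_aov <- summary(aov(score ~ genre, data = toy))[[1]][["F value"]][1]
cat("Manual F:", f_manual, " aov F:", f_aov, " p-value:", p_manual, "\n")
```

A large F statistic means the group means vary more than the within-group noise would explain, which is the evidence against equal expected values.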
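d) To test whether the expected popularity of pop songs equals 40, we can use a two-sided one-sample t-test with a significance level of 0.05.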
```{r}
# Extract the pop data
pop_data <- data$pop
# Calculate the sample mean and standard deviation
pop_mean <- mean(pop_data)
pop_sd <- sd(pop_data)
# Set the null hypothesis mean
null_mean <- 40
# Calculate the t-value and the two-sided p-value
# (abs() keeps the p-value valid when the t-value is negative)
t_value <- (pop_mean - null_mean) / (pop_sd / sqrt(length(pop_data)))
p_value <- 2 * pt(abs(t_value), df = length(pop_data) - 1, lower.tail = FALSE)
# Print the results
cat("t-value:", t_value, "\n")
cat("p-value:", p_value, "\n")
# Check if the null hypothesis is rejected or not
if (p_value < 0.05) {
  cat("Reject the null hypothesis. The expected popularity for pop songs is not equal to 40.\n")
} else {
  cat("Fail to reject the null hypothesis. There is insufficient evidence that the expected popularity for pop songs differs from 40.\n")
}
```
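The manual calculation can be cross-checked against R's built-in `t.test()`. The chunk below is a sketch on synthetic data; `toy_pop` and its parameters are made up for illustration, and only the null value of 40 is taken from the analysis above.

```{r}
set.seed(7)
toy_pop <- rnorm(50, mean = 42, sd = 8)  # assumption: synthetic popularity scores

# Manual two-sided one-sample t-test against a null mean of 40
manual_t <- (mean(toy_pop) - 40) / (sd(toy_pop) / sqrt(length(toy_pop)))
manual_p <- 2 * pt(abs(manual_t), df = length(toy_pop) - 1, lower.tail = FALSE)

# Built-in equivalent
builtin <- t.test(toy_pop, mu = 40)

cat("Manual:  t =", manual_t, " p =", manual_p, "\n")
cat("t.test:  t =", unname(builtin$statistic), " p =", builtin$p.value, "\n")
```

Both lines should print the same statistic and p-value; `t.test()` additionally reports a confidence interval for the mean.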