Spotify.Rmd

---
title: "Streaming  Data Analytics"
author: "Group"
date: '2023-02-22'
output:
  word_document: default
  pdf_document: default
---

```{r, include=FALSE, message=FALSE, warning=FALSE}
# Load required packages
library(MASS)
library(matrixcalc)
library(mvtnorm)
library(data.table)
library(numDeriv)
library(ggplot2)
library(abind)
library(bbmle)
library(emmeans)
library(glm2)
library(MatrixModels)
library(mnormt)
library(Matrix)
library(mlogit)
library(reshape2)
library(reshape2)
library(AER)
library(car)
library(mclust)
library(MASS)
```

a)Simulation study

```{r}
# Set parameters for the simulation
n_list <- c(25, 50, 100, 500) # Sample sizes
num_simulations <- 2000 # Number of simulations for each sample size
true_mean <- c(0, 0, 0, 0) # True mean vector
true_cov <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 1), nrow = 3) # True covariance matrix
true_df <- 5 # True degrees of freedom parameter
print(true_mean)
print(true_cov)
```
```{r}
set.seed(123)

# Simulation parameters
n_sim <- 2000
n_vec <- c(25, 50, 100, 500)
Sigma <- matrix(c(1,1/3,1/3,2), nrow=2)
Sigma

```

To begin, we can import the "streaming_popul_data.csv" file into R and load the necessary libraries for our analysis.

```{r, include=FALSE, echo=FALSE, message=FALSE, warning=FALSE}
#Importing the dataset
library(readr)
data <- read_csv("C:/Users/Admins/Desktop/R/data/streaming_popul_data.csv")
```

To estimate the parameters of the Multivariate T distribution, we first need to load the data and install and load the mvtnorm package in R:

```{r}
# create separate datasets for each genre
rap_data <- data$rap
pop_data <- data$pop
metal_data <- data$metal
rock_data <- data$rock
```

Next, we can create a matrix of the data for each genre:

```{r}
library(mvtnorm)
# Create matrices of the data for each genre
rap_data <- as.matrix(data$rap)
pop_data <- as.matrix(data$pop)
metal_data <- as.matrix(data$metal)
rock_data <- as.matrix(data$rock)

```

```{r}
# Calculate the mean vector and covariance matrix for the data
mu <- c(mean(rap_data), mean(pop_data), mean(metal_data), mean(rock_data))
mu
sigma <- cov(cbind(rap_data, pop_data, metal_data, rock_data))
sigma
```
Finally, we can use the rmvt() function from the mvtnorm package to generate samples from the Multivariate T distribution, with the estimated mean vector and covariance matrix. We can also calculate the sample means and variances for each genre from the samples, and use these to make inferences about the streaming success of each genre.

```{r}
# Generate 1000 samples from the Multivariate T distribution
set.seed(123) # Set a seed for reproducibility
samples <- rmvt(n = 1000, sigma = sigma, df = 5, delta = mu)

# Calculate the sample means and variances for each genre
sample_means <- apply(samples, 2, mean)
sample_vars <- apply(samples, 2, var)

# Print the sample means and standard errors for each genre
cat("Rap: mean =", round(sample_means[1], 2), "SE =", round(sqrt(sample_vars[1]/1000), 2), "\n")
cat("Pop: mean =", round(sample_means[2], 2), "SE =", round(sqrt(sample_vars[2]/1000), 2), "\n")
cat("Metal: mean =", round(sample_means[3], 2), "SE =", round(sqrt(sample_vars[3]/1000), 2), "\n")
cat("Rock: mean =", round(sample_means[4], 2), "SE =", round(sqrt(sample_vars[4]/1000), 2), "\n")


```
c)To test the hypothesis that the expected popularity is the same in every genre, we can use a one-way ANOVA (Analysis of Variance) test with a significance level of 0.05.

Here's the R code to perform the test:

```{r}

# Create a data frame with the relevant columns
genres_data <- data.frame(
  rap = data$rap,
  pop = data$pop,
  metal = data$metal,
  rock = data$rock
)

```

```{r}
# Perform a ANOVA test 
anova_result <- anova(lm(data$...1 ~ data$rap+data$metal+data$rock+data$pop, data = data))
print(anova_result)
```

To test the hypothesis that the expected popularity is the same in every genre, we can use an analysis of variance (ANOVA) test in R. Here is the code:

```{r}

# Extract the columns for each genre
rap_data <- data$rap
pop_data <- data$pop
metal_data <- data$metal
rock_data <- data$rock

# Perform ANOVA test
result <- aov(c(rap_data, pop_data, metal_data, rock_data) ~ rep(c("rap", "pop", "metal", "rock"), c(length(rap_data), length(pop_data), length(metal_data), length(rock_data))))
summary(result)

```
d)
```{r}


# Extract the pop data
pop_data <- data$pop

# Calculate the sample mean and standard deviation
pop_mean <- mean(pop_data)
pop_sd <- sd(pop_data)

# Set the null hypothesis mean
null_mean <- 40

# Calculate the t-value and the p-value
t_value <- (pop_mean - null_mean) / (pop_sd / sqrt(length(pop_data)))
p_value <- pt(t_value, df = length(pop_data) - 1, lower.tail = FALSE) * 2

# Print the results
cat("t-value:", t_value, "\n")
cat("p-value:", p_value, "\n")

# Check if the null hypothesis is rejected or not
if (p_value < 0.05) {
  cat("Reject the null hypothesis. The expected popularity for pop songs is not equal to 40.\n")
} else {
  cat("Fail to reject the null hypothesis. The expected popularity for pop songs is equal to 40.\n")
}



```