[STAT 35000] A statistical analysis of the data pulled from the Google Play app store.

JeroSik/Google-Play-Store-App-Data-Analysis

Google-Play-Store-App-Data-Analysis

A statistical analysis of data scraped from the Google Play app store. During the Fall 2018 semester, this honors project for STAT 35000 set out to find trends and patterns in the app market that could help developers. See also the presentation for this project.

Analysis Questions

  1. Which Google Play store app category has the highest ratings?

  2. Is there a significant difference in ratings between Google Play store apps with a restricted content rating and those with a non-restricted content rating?

  3. Is there a significant difference in ratings between paid and free Google Play store apps?

Dataset

The dataset for this project was found on Kaggle; it was collected by web scraping around 10,000 Play Store apps. Scraping was made more challenging by Google Play's use of dynamic page loading with jQuery. Each app has values for category, rating, size, installs, and other app specifications. Although Google Play also acts as a digital media store, the data covers only mobile applications.

The analysis was carried out in R. To address the first question, an ANOVA test was used to determine whether there was a statistically significant difference in mean rating among the categories. To address the second and third questions, two Welch two-sample t-tests were used to determine whether mean rating differed significantly between the two content-rating groups (restricted and non-restricted) and between paid and free apps. The assumptions of all tests were met by the data.
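As a rough illustration of what the two tests compute, the sketch below derives the one-way ANOVA F statistic and the Welch t statistic with its Welch-Satterthwaite degrees of freedom. It is written in Python using only the standard library and is not the project's R code; the function names `one_way_anova_f` and `welch_t` and the sample ratings are hypothetical, chosen only for illustration.

```python
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of ratings per category.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test (unequal variances)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t_stat = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df


# Made-up ratings for three hypothetical categories (not project data).
cat_a = [4.1, 4.3, 4.5]
cat_b = [3.9, 4.0, 4.1]
cat_c = [4.4, 4.6, 4.8]

f_stat, df_between, df_within = one_way_anova_f([cat_a, cat_b, cat_c])
t_stat, df = welch_t(cat_a, cat_b)
print(f_stat, df_between, df_within)
print(t_stat, df)
```

Turning these statistics into p-values requires the F and t distributions, which the Python standard library does not provide; in R, `aov` and `t.test` (which applies the Welch correction by default) report the p-values directly.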

Results

The ANOVA test gave statistically significant evidence that the true mean ratings are not equal across all categories. Not every pair of categories differed, however: “UTILITY” and “EDUCATION” were not statistically different from each other, and neither were “LIFESTYLE” and “BUSINESS”. A pair was judged not statistically different when the confidence interval for the difference between the two true means contained zero. For each Welch two-sample t-test, the null hypothesis was that the true mean ratings of the two groups are equal. For content rating, the p-value was less than 2.2e-16, which gives strong evidence that the true mean ratings of the two content-rating groups are not equal. For cost, the p-value was 0.1251, so there is not sufficient evidence to conclude that the true mean ratings of paid and free apps differ.
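The "does the interval contain zero" rule can be sketched as follows. This is a simplified illustration in Python, not the project's R code: it assumes a large-sample normal approximation and makes no adjustment for multiple comparisons, whereas the README does not say which interval procedure was used in R. The names `diff_ci` and `contains_zero` and the sample ratings are hypothetical.

```python
from statistics import NormalDist, mean, variance


def diff_ci(a, b, level=0.95):
    """Approximate confidence interval for mean(a) - mean(b).

    Uses a normal critical value, which is reasonable only for large samples.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    d = mean(a) - mean(b)
    return d - z * se, d + z * se


def contains_zero(interval):
    """True when zero lies inside the interval, i.e. no detectable difference."""
    lo, hi = interval
    return lo <= 0.0 <= hi


# Made-up ratings for illustration only (not project data).
group_x = [4.1, 4.3, 4.5]
group_y = [3.9, 4.0, 4.1]
group_z = [4.2, 4.3, 4.6]

print(contains_zero(diff_ci(group_x, group_y)))
print(contains_zero(diff_ci(group_x, group_z)))
```

When many category pairs are compared at once, a procedure that adjusts for multiple comparisons (for example Tukey's HSD, available in R as `TukeyHSD`) gives wider intervals than this unadjusted sketch.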

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgments
