CS 30122 Final Project
Title: Survey Search
Team: Pawsitive
Members: Bethany Bailey (The Data Cleaner Cat)
Ruixue Li (The Interior Designer Cat)
Leoson Hoay (The Django Janitor Cat)
Goal: Create an easy-to-use open-source social science survey data repository. Unique selling points:
- Search multiple surveys from different sources on one site.
- Search surveys on a granular level (for individual questions) as well as for entire surveys
- Researchers can upload and share their own survey data
The code for this project consists of two parts:
- (1) code for data cleaning
- (2) code for building the Django site
Both parts are documented below.
The data cleaning was done entirely in Jupyter notebooks. The notebooks took survey documentation from the web in four formats (pdf, csv, rtf, and docx), extracted the variables, and wrote them to CSV files containing the variable names and text. During this process, the actual survey data was converted to CSV and stored in the data folder for each survey as well as in the Survey Detail folder (which is what we provide to the Django site for links).
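As a rough illustration of the extraction step, the sketch below pulls variable names and question text out of plain-text documentation and writes them to a two-column CSV. The `VARNAME: question text` line layout, the regular expression, the sample codebook, and the `extract_variables` and `write_variables_csv` helpers are all assumptions made for this example; the actual notebooks are not reproduced here.

```python
import csv
import io
import re

# Hypothetical layout: one variable per line, e.g. "AGE: What is your age?"
VARIABLE_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9_]*)\s*:\s*(.+?)\s*$")


def extract_variables(text):
    """Return (variable_name, question_text) pairs found in the text."""
    pairs = []
    for line in text.splitlines():
        match = VARIABLE_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def write_variables_csv(pairs, out_file):
    """Write the pairs to an open text file object as a two-column CSV."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["variable", "text"])
    writer.writerows(pairs)


# Made-up codebook text standing in for text extracted from a pdf/rtf/docx file.
sample = """Survey codebook
AGE: What is your age?
EDUC: What is the highest degree you have completed?
"""

buffer = io.StringIO()
write_variables_csv(extract_variables(sample), buffer)
print(buffer.getvalue())
```

Writing to a file object rather than a path keeps the sketch easy to test; passing `open("variables.csv", "w", newline="")` instead of the `StringIO` buffer would produce a file on disk.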
The Django portion is structured as follows (the site is available in the Final Site/surveysearch folder on GitHub):
├── documents *where user uploaded survey question files are stored
├── manage.py
├── search
│   ├── __init__.py
│   ├── __pycache__
│   ├── admin.py
│   ├── apps.py
│   ├── forms.py *define a form class to display a form so that users can upload survey files
│   ├── migrations
│   ├── models.py *django models
│   ├── static *static files, including Bootstrap and customized css/javascript, image files, generated wordcloud image
│   ├── templates *html templates
│   ├── urls.py
│   └── views.py *where django views are defined and a lot of the processing happens
├── search_beta.sqlite3 *the database generated by the model
└── surveysearch
    ├── __init__.py
    ├── __pycache__
    ├── settings.py *app-wide settings
    ├── urls.py
    └── views.py
Packages required for the data cleaning:
- PyPDF2 (sudo pip3 install PyPDF2)
- docx (sudo pip3 install python-docx)
- docx2txt (sudo pip3 install docx2txt)
- The regular expression library (re) is part of the Python standard library, so it should already be available on the VMs.
To rerun the data cleaning, go into each data folder, open the Jupyter notebook (run "jupyter notebook" from the command line and navigate to the notebook file), and rerun the script. The data used to find the questions/variable names and descriptions is in each individual folder. All of these data sources were downloaded directly from online sources.
Packages required for the Django site:
- wordcloud (sudo pip3 install wordcloud)
- sklearn (sudo pip3 install scikit-learn)
- Pillow and Matplotlib should already be installed on the VMs.
To run our site, please clone the repository, go into the
Final Site/surveysearch
folder and run
python3 manage.py runserver --insecure --nothreading
"--insecure" makes the local development server serve static files while "DEBUG" is set to "False", and "--nothreading" is required for compatibility with matplotlib.
Once you have run this, go to
http://127.0.0.1:8000/search/
in your browser to access the homepage and browse the website from there. We've provided the following use cases to facilitate your testing.
Go to homepage -> click the "Find relevant questions" button -> input keyword(s) separated by spaces -> click the search button. You'll be redirected to a results page listing all the relevant questions from all the surveys in the database. You can then view the survey that contains a particular question by clicking on the title. While viewing a survey, you can browse all the questions in that survey through the "See all variables and questions" link in the last cell of the form.
Go to homepage -> click the "Find relevant surveys" button -> input keyword(s) separated by spaces -> click the search button. You'll be redirected to a results page listing all the relevant surveys in the database. You can then view the details of any survey by clicking on the title. Again, while viewing a survey, you can browse all the questions in that survey through the "See all variables and questions" link in the last cell of the form.
Go to homepage -> click the "Browse" button. You'll be redirected to a page listing all the surveys in the database. You can then view the details of any survey by clicking on the title. Once again, while viewing a survey, you can browse all the questions in that survey through the "See all variables and questions" link in the last cell of the form.
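The keyword searches in the use cases above can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the in-memory `QUESTIONS` list, the survey names, the `tokenize` and `search_questions` helpers, and ranking by the number of matched keywords are all assumptions (the site keeps its data in a Django model and uses its own ranking algorithm).

```python
import re

# Hypothetical stand-in for the question records stored in the database.
QUESTIONS = [
    {"survey": "Student Survey", "text": "What was your final grade this semester?"},
    {"survey": "Health Survey", "text": "How many hours do you sleep per night?"},
    {"survey": "Student Survey", "text": "How many courses did you take this semester?"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_questions(query, questions):
    """Return questions matching any keyword, best matches first."""
    keywords = set(query.lower().split())  # keywords are separated by spaces
    scored = []
    for question in questions:
        words = set(tokenize(question["text"]))
        score = len(keywords & words)  # number of distinct keywords matched
        if score > 0:
            scored.append((score, question))
    # list.sort() is stable, so ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [question for _, question in scored]


for hit in search_questions("final semester", QUESTIONS):
    print(hit["survey"], "-", hit["text"])
```

Searching by survey instead of by question would follow the same pattern, with the match computed against survey-level text rather than individual questions.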
Test the upload portion of the site using either a handmade file or, if you are so inclined, the sample upload on our GitHub in Sample Upload/student.csv. You can use any values for the upload parameters. Once you have uploaded this survey, you should be able to see it on the "Browse" page (which you can reach from the homepage or from the navigation bar), and it will show up in keyword searches for questions and surveys (try searching "final" or "semester" before and after the upload). After you upload a new survey, the wordcloud on the homepage background should update to reflect the newly added data. If the image does not automatically regenerate, clear the cache and refresh the webpage using Ctrl-Shift-R in the VM (or any Linux machine) or Cmd-Shift-R on Mac.
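To illustrate why an upload changes the wordcloud, the sketch below recomputes word frequencies from question text before and after new questions are added, using only the standard library. The `STOPWORDS` set, the `word_frequencies` helper, and the sample questions are assumptions for this example; how the site actually builds its image with the wordcloud package is not shown here.

```python
import re
from collections import Counter

# Small hypothetical stopword list; a real one would be much longer.
STOPWORDS = {
    "the", "a", "an", "of", "in", "is", "are", "do", "you", "your",
    "what", "how", "was", "this", "did",
}


def word_frequencies(question_texts):
    """Count how often each non-stopword appears across the questions."""
    counts = Counter()
    for text in question_texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts


# Made-up question text: what is already in the database vs. a new upload.
existing = ["How many hours do you sleep per night?"]
uploaded = [
    "What was your final grade this semester?",
    "How many courses did you take this semester?",
]

before = word_frequencies(existing)
after = word_frequencies(existing + uploaded)
print(before["semester"], after["semester"])
```

A frequency mapping like `after` is the kind of input that the wordcloud package's `WordCloud.generate_from_frequencies` method accepts, so regenerating the image after each upload amounts to recomputing these counts.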
The surveys were collected by all three members. The data that was in SPSS was converted to CSV by Leoson, and then Bethany completed all of the data cleaning and extraction. Ruixue contributed and formatted a survey from her undergraduate research project as the sample file you can use for testing.
Bethany created the upload form, and Ruixue handled the uploaded file and its integration with other parts of the site. Together, all three team members worked on file validation and processing.
Leoson completed the search by question function, and Bethany completed the search by survey function. Ruixue completed the ranking algorithm.
Ruixue completed the wordcloud.
All three team members participated in the view and template creation. All templates were written by the team with the help of html/css and Bootstrap documentation. Ruixue applied and standardized the styling of all pages with Bootstrap and designed the homepage (index.html).
Please see comments starting with "Code Ownership : " in files. The code ownership categories follow the example provided by Professor Rogers and are outlined below.
- "Direct copy" : Generated by installed package or online source (Django or other) and few edits made
- "Modified" : Generated by installed package or online source (Django or other) and meaningful edits made OR heavily utilized template(s) provided by tutorial sessions (TA- or Django-generated)
- "Original" : Original code or heavily modified given structure