bethbailey/survey-search

survey-search

CS 30122 Final Project
Title: Survey Search
Team: Pawsitive

Members:
Bethany Bailey (The Data Cleaner Cat)
Ruixue Li (The Interior Designer Cat)
Leoson Hoay (The Django Janitor Cat)

Project description

Goal: Create an easy-to-use open-source social science survey data repository. Unique selling points:

  1. Search multiple surveys from different sources on one site.
  2. Search surveys at a granular level (individual questions) as well as entire surveys.
  3. Researchers can upload and share their own survey data.

Overall Code Structure

The code of this project consists of two parts:

  • (1) code for data cleaning
  • (2) code for building the Django site

Both parts are documented below.

Data Cleaning

The data cleaning was done in Jupyter notebooks. The notebooks took survey documentation from the web in four different formats (pdf, csv, rtf, and docx), extracted the variable names and descriptions, and wrote that information to csv files. During this process, the actual survey data was converted to csv and stored in each survey's data folder as well as in the Survey Detail folder (which holds the files the Django site links to). We split one survey (the OLS Animal survey) into three datasets because we wanted to see whether the subdivisions would rank similarly in search results.
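
As a rough illustration of the extraction step, here is a minimal sketch that pulls variable names and descriptions out of plain codebook text and writes them to csv. It is not the code from our notebooks: the "NAME: description" layout, the regular expression, and the function names extract_variables and write_variables_csv are all assumptions made for the example, and real survey documentation usually needs a pattern tailored to each source.

```python
import csv
import io
import re

# Hypothetical codebook layout: "VARNAME: description" or "VARNAME - description".
# Real survey documentation varies widely, so this pattern is only an example.
VARIABLE_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9_]*)\s*[:\-]\s*(.+?)\s*$")


def extract_variables(codebook_text):
    """Return (variable name, description) pairs found in the text."""
    pairs = []
    for line in codebook_text.splitlines():
        match = VARIABLE_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def write_variables_csv(pairs, handle):
    """Write the pairs to an open text handle as a two-column csv."""
    writer = csv.writer(handle)
    writer.writerow(["variable", "description"])
    writer.writerows(pairs)


# Made-up sample text, not taken from any of the surveys in the repository.
sample = """Codebook for an example survey
AGE: Age of respondent in years
SEX - Sex of respondent
"""
pairs = extract_variables(sample)
buffer = io.StringIO()
write_variables_csv(pairs, buffer)
csv_text = buffer.getvalue()
```

Lines that do not fit the assumed layout (such as the title line in the sample) are skipped. In practice the text would first have to be pulled out of the pdf, rtf, or docx file with the libraries listed under the installation instructions.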

Django Site

The Django portion is structured as follows (the site is available in the Final Site/surveysearch folder on GitHub):

├── documents *where user-uploaded survey question files are stored
├── manage.py 
├── search
│   ├── __init__.py
│   ├── __pycache__
│   ├── admin.py
│   ├── apps.py 
│   ├── forms.py *defines the form class that lets users upload survey files
│   ├── migrations
│   ├── models.py  *django models
│   ├── static  *static files, including Bootstrap and customized css/javascript, image files, generated wordcloud image
│   ├── templates  *html templates 
│   ├── urls.py  
│   └── views.py  *where django views are defined and a lot of the processing happens
├── search_beta.sqlite3  *the database generated by the model
└── surveysearch
    ├── __init__.py
    ├── __pycache__
    ├── settings.py  *app-wide settings
    ├── urls.py
    └── views.py

Instructions for Running the Project

Running the Data Cleaning code

Libraries to Install for Data Cleaning:

  • PyPDF2 (sudo pip3 install PyPDF2)
  • docx (sudo pip3 install python-docx)
  • docx2txt (sudo pip3 install docx2txt)
  • The regular expression library (re) is part of the Python standard library, so it needs no separate installation on the VMs.

To rerun the data cleaning, go into each data folder, open the jupyter notebook (run "jupyter notebook" from the command line and go to the .ipynb document), and rerun the script. The data that was used to find the questions/variable names and descriptions is in each individual folder. All of these data sources were downloaded directly from online sources.

Running the Django Site

Libraries to Install:

  • wordcloud (sudo pip3 install wordcloud)
  • sklearn (sudo pip3 install scikit-learn)
  • Pillow and Matplotlib should already be installed on the VMs.

To run our site, please clone the repository, go into the

Final Site/surveysearch

folder and run

python3 manage.py runserver --insecure --nothreading

"--insecure" makes the local development server serve static files even when "DEBUG" is set to "False", and "--nothreading" is required for compatibility with matplotlib.

Once you have run this, go to

http://127.0.0.1:8000/search/ 

in your browser to access the homepage and browse the website from there. We've provided the following use cases to facilitate your testing.

Use Cases

Use Case 1: Testing Question Search

Go to homepage -> click "Find relevant questions" button -> input keyword(s) separated by spaces -> click search button

You will be redirected to a result page listing all the relevant questions from all the surveys in the database. You can then view the survey that contains a particular question by clicking on the title. While viewing a survey, you can browse all the questions in that survey through the "See all variables and questions" link in the last cell of the form.
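
To give a feel for how keyword results like these could be ordered, here is a minimal TF-IDF-style ranking sketch written with only the Python standard library. It is an illustration under assumptions, not the ranking code used on the site: the function names tokenize and rank_questions, the smoothing in the idf term, and the sample questions are all made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_questions(query, questions):
    """Return (score, question) pairs for matching questions, best first."""
    docs = [Counter(tokenize(q)) for q in questions]
    n_docs = len(docs)
    terms = set(tokenize(query))
    # Document frequency: how many questions contain each query term.
    doc_freq = {t: sum(1 for d in docs if t in d) for t in terms}
    ranked = []
    for question, counts in zip(questions, docs):
        length = sum(counts.values())
        score = 0.0
        for term in terms:
            if counts[term]:
                tf = counts[term] / length
                # Smoothed inverse document frequency (an assumed choice).
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += tf * idf
        if score > 0:
            ranked.append((score, question))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# Made-up questions for illustration only.
questions = [
    "How many hours do you study each semester?",
    "What is your final grade this semester?",
    "Do you own a pet?",
]
results = rank_questions("final semester", questions)
ranked_questions = [question for _, question in results]
```

With these sample questions, the one containing both keywords ranks first, the one containing only "semester" ranks second, and the question with no matching keyword is left out.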

Use Case 2: Testing Survey Search

Go to homepage -> click "Find relevant surveys" button -> input keyword(s) separated by spaces -> click search button

You will be redirected to a result page listing all the relevant surveys in the database. You can then view the details of any survey by clicking on the title. Again, while viewing a survey, you can browse all the questions in that survey through the "See all variables and questions" link in the last cell of the form.

Use Case 3: Browse Surveys

Go to homepage -> click "Browse" button

You will be redirected to a page listing all the surveys in the database. You can then view the details of any survey by clicking on the title. Once again, while viewing a survey, you can browse all the questions in that survey through the "See all variables and questions" link in the last cell of the form.

Use Case 4: Testing Upload

Test the upload portion of the site using either a handmade file or, if you are so inclined, the sample upload on our GitHub in Sample Upload/student.csv. You can use any values for the upload parameters. However, if you enter a non-numeric value for the number of participants or number of questions, the upload button will not submit the form and will return you to that portion of the page. Additionally, you can test the checks we built into the system as follows:

  • We check that the file the user is trying to upload is in csv format. To test this, try uploading a different type of document (e.g. a pdf).
  • To prevent the same survey from being added twice, we check the survey name. To test this, try providing the name of a survey that is already in the database (e.g. "General Social Survey 2016").
  • The form also asks the user to enter 'NA' in the links fields if a particular link is not available. The pages displaying survey details use this value to check whether a particular external link exists.
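
The checks listed above can be sketched in plain Python as follows. This is a simplified illustration, not the site's Django form or view code: the function names validate_upload and available_links are hypothetical, the csv check here only looks at the file extension, and comparing survey names case-insensitively is an assumption.

```python
def validate_upload(filename, survey_name, existing_names):
    """Return a list of error messages; an empty list means the upload passes."""
    errors = []
    # Assumed csv check: look only at the file extension.
    if not filename.lower().endswith(".csv"):
        errors.append("The uploaded file must be a csv.")
    # Assumed duplicate check: ignore case and surrounding whitespace.
    known = {name.strip().lower() for name in existing_names}
    if survey_name.strip().lower() in known:
        errors.append("A survey with this name already exists.")
    return errors


def available_links(links):
    """Keep only the links that were not entered as 'NA'."""
    return {
        label: url
        for label, url in links.items()
        if url.strip().upper() != "NA"
    }


existing = ["General Social Survey 2016"]
ok = validate_upload("student.csv", "Student Survey", existing)
rejected = validate_upload("student.pdf", "General Social Survey 2016", existing)
# The link labels and URL below are placeholders for the example.
shown = available_links({"data": "NA", "codebook": "http://example.com/codebook"})
```

Here ok is an empty list, rejected holds one message for the wrong file type and one for the duplicate name, and shown keeps only the codebook link because the data link was entered as 'NA'.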

Once you have uploaded this survey, you should be able to see it on the "Browse" page (which you can reach from the homepage or from the navigation bar), and it will show up in keyword searches for questions and surveys (try searching "final" or "semester" before and after the upload). After you upload a new survey, the wordcloud on the homepage background should update to reflect the newly added data. If the image does not automatically regenerate, clear the cache and refresh the webpage using Ctrl-Shift-R in the VM (or any Linux machine) or Cmd-Shift-R on Mac.

Coding Contribution Breakdown

Data Collection

The surveys were collected by all three members. The data in SPSS format was converted to CSV by Leoson, and Bethany completed all of the data conversion, cleaning, and extraction. Leoson created the databases, and Bethany input the data into the databases. Ruixue formatted a survey from her undergraduate research project as the sample file you can use for testing.

Django Site

Forms

Bethany created the upload form, and Ruixue and Leoson handled the uploaded file and its integration with other parts of the site. All three team members worked together on file validation and processing.

Search

Leoson completed the search by question function, and Bethany completed the search by survey function. Ruixue completed the ranking algorithm.

WordCloud

Ruixue completed the wordcloud.

Templates/Formatting and HTML

All three team members participated in the formatting and template creation. All templates were written by the team with the help of html/css and Bootstrap documentation. Ruixue applied and standardized the styling of all pages with Bootstrap and designed the homepage (index.html).

Documentation of Code Ownership

Please see the comments starting with "Code Ownership : " in the files. The code ownership categories follow the example provided by Professor Rogers, outlined below.

  • "Direct copy" : Generated by installed package or online source (Django or other) and few edits made
  • "Modified" : Generated by installed package or online source (Django or other) and meaningful edits made OR heavily utilized template(s) provided by tutorial sessions (TA- or Django-generated)
  • "Original" : Original code or heavily modified given structure