-
Notifications
You must be signed in to change notification settings - Fork 4
/
lesson_02_dataCleaning.Rmd
159 lines (80 loc) · 7.72 KB
/
lesson_02_dataCleaning.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
title: "Exploring Data, Cleaning, Assessment"
---
## Exploring this test dataset
During class we will discuss and identify some of the issues with this dataset - click on link to download the data in CSV (comma separated format) format [Dataset_02_fixq2.csv](https://github.com/melindahiggins2000/N736Homework01/raw/master/Dataset_02_fixq2.csv).
NOTE: CSV files are basically TEXT files where each value is entered on rows separated by columns. Each column is assumed to be a different VARIABLE (a different FIELD). Each row is assumed to be a different RECORD - think of a patient's medical file/folder.
## Project Organization
1. Create a folder on your computer for this project. For example "C:\\N736\\exercise01". Click on New Folder and type in the name of the folder you want. Repeat to create the subfolders.
**NOTE**: On Windows the folder separator is the left-tilt "back-slash" \\. However on a MAC the folder separator is the right-tilt "forward-slash" /. Be sure to check how these folder paths need to be input for (A) your operating system and (B) software. For example, I am on a Windows operating system which expects "\\", but when I am using `R` I have to type my folder paths using the "/".
#### Create Folder - using File Explorer on Windows operating system
![exfolder1](images/exercise01_createfolder1.png)
Type in name of folder
![exfolder2](images/exercise01_createfolder2.png)
Repeat again for subfolder.
On Windows, click at top to get full path name.
![exfolder3](images/exercise01_createfolder3.png)
2. Now put the data file you just downloaded [Dataset_02_fixq2.csv](https://github.com/melindahiggins2000/N736Homework01/raw/master/Dataset_02_fixq2.csv) into this folder. This will be the folder you use for ALL files associated with this exercise.
![exfolder4](images/exercise01_createfolder4.png)
3. Follow this process for ALL of your exercises, homeworks and project(s). This will help you stay organized and avoid file location problems with software. Most software assumes that the files you are inoutting, using and saving are in this project folder.
-----
* R projects (should) always start with defining your project folder [File/New Project]
![rnewproj](images/Rnewproject.png)
![rnewproj2](images/Rnewproject2.png)
-----
* SAS begins by defining a library using a `libname` statement
![sasnewproj](images/SASnewproject.png)
-----
* SPSS is the most flexible. It typically defaults to remembering which folder you were in when you last exited the software.
* Go to EDIT/OPTIONS and choose "File Locations" TAB
![spssnewproj1](images/SPSSeditoptions.png)
You can either keep the "Last Folder Used" default setting or go ahead and override this setting and put in your project folder path in the "Specified folder" for Data files and Other files. You need to do this AT THE BEGINNING and REDO for every new project.
![spssnewproj2](images/SPSSfilelocations.png)
![spssnewproj3](images/SPSSfilelocations2.png)
## IMPORT CSV Data into your software
Details are provided below for importing the data into each stats software: `SPSS`, `SAS` and `R`.
-----
### SPSS
<p style="background-color:powderblue;font-size:30px;">SPSS - Import CSV Data File</p>
![spssimportdata](images/SPSSimportcsv.png)
Follow the steps in the wizard. Click "PASTE" to save the SYNTAX for importing this datafile.
![spssimportdata2](images/SPSSimportcsv2.png)
-----
### SAS
<p style="background-color:powderblue;font-size:30px;">SAS - Import CSV Data File</p>
![sasimportdata](images/SASimportdata.png)
Follow the steps in the wizard. At the end you have the option to save the SAS (program) which will save the code for importing this dataset.
![sasimportdata2](images/SASimportdata2.png)
-----
### R
<p style="background-color:powderblue;font-size:30px;">R - Import CSV Data File</p>
![rimportdata](images/Rimportdata.png)
You can choose either the base or readr options:
`base` R import:
![rimportdata3](images/Rimportdata3.png)
After clicking IMPORT you can see the R code run for this import in the CONSOLE window. You can save this code for future use as needed.
![rimportdata4](images/Rimportdata4.png)
`readr` package (tidyverse) import:
The R code for this import is shown in the CODE PREVIEW at the bottom right - you can cut and paste this code to save for future use.
![rimportdata2](images/Rimportdata2.png)
## Previous Notes - Reviewing the dataset - for Homework 01
In today's class we'll get started exploring and finding the issues and problems with the dataset you'll be working with for Homework 01. See Homework 1 instructions.
**NOTES from Class Today** [homework1_notes.txt](homework1_notes.txt)
## ALL Class discussions and videos
All classes will be recorded and the video posted at the **EchoALP** link on Canvas for NRSG 736.
Weblinks to be discussed during class:
* Journal of Biostatistics - author guidelines for "Reproducible Research" [https://academic.oup.com/biostatistics/pages/General_Instructions](https://academic.oup.com/biostatistics/pages/General_Instructions).
* Washington Post Article on "An Alarming Number of Scientific Papers Contain Excel Errors" [https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/?utm_term=.8ec47ce8bc16](https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/?utm_term=.8ec47ce8bc16).
* Genome Biology 2016 Paper on "Gene name error are widespread in the scientific literature" [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7).
* "The Excel Error heard Around the World" [https://newrepublic.com/article/112951/rogoff-reinhart-and-world-excel-error-research](https://newrepublic.com/article/112951/rogoff-reinhart-and-world-excel-error-research) & more at [http://nymag.com/daily/intelligencer/2013/04/grad-student-who-shook-global-austerity-movement.html](http://nymag.com/daily/intelligencer/2013/04/grad-student-who-shook-global-austerity-movement.html).
* "How Bright Promise in Cancer Testing Fell Apart" - Duke cancer trials controversy [http://www.nytimes.com/2011/07/08/health/research/08genes.html?mcubz=1](http://www.nytimes.com/2011/07/08/health/research/08genes.html?mcubz=1) & "An array of errors" [http://www.economist.com/node/21528593](http://www.economist.com/node/21528593).
* "A Manifesto for Reproducible Science" [https://www.nature.com/articles/s41562-016-0021](https://www.nature.com/articles/s41562-016-0021).
* Center for Open Science [https://cos.io/](https://cos.io/) & COS History [https://cos.io/about/brief-history-cos-2013-2017/](https://cos.io/about/brief-history-cos-2013-2017/).
* Science "One in Five Genetics Papers Contains Errors Thanks to Microsoft Excel" [http://www.sciencemag.org/news/sifter/one-five-genetics-papers-contains-errors-thanks-microsoft-excel](http://www.sciencemag.org/news/sifter/one-five-genetics-papers-contains-errors-thanks-microsoft-excel).
## Other links you might want to explore
* Gitbook [https://www.gitbook.com/](https://www.gitbook.com/).
* Bookdown [https://bookdown.org](https://bookdown.org).
* Yihui Xie's book on the `bookdown` package [https://bookdown.org/yihui/bookdown/](https://bookdown.org/yihui/bookdown/).
* Github repo for Yihui Xie's "Dynamic Documents with R and knitr" - 1st 3 chapters available online [https://github.com/yihui/knitr-book](https://github.com/yihui/knitr-book).
* Garrett Grolemund and Hadley Wickham's book "R for Data Science" - also online at [http://r4ds.had.co.nz/](http://r4ds.had.co.nz/) & if you're interested, here is the Github repo for their book [https://github.com/hadley/r4ds](https://github.com/hadley/r4ds).
* Tidyverse [https://www.tidyverse.org/](https://www.tidyverse.org/).