-
Notifications
You must be signed in to change notification settings - Fork 30
/
milestone4.Rmd
177 lines (138 loc) · 8.38 KB
/
milestone4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
---
title: "Milestone 4"
descriptions: |
Abstract project functions to their own software package,
and address any feedback received in earlier milestones.
output:
distill::distill_article:
toc: true
---
## Overall project summary
In this course you will work in assigned teams of three or four
(see group assignments in [Canvas](https://canvas.ubc.ca))
to answer a predictive question using a publicly available data set that will allow you to answer that question.
To answer this question,
you will perform a complete data analysis in R and/or Python,
from data import to communication of results,
while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook,
to a fully reproducible and robust data data analysis project, comprised of:
- a well documented and modularized software package and scripts written in R and/or Python,
- a data analysis pipeline automated with GNU Make,
- a reproducible report powered by Quarto,
- a containerized computational environment created and made shareable by Docker, and
- remote version control repositories on GitHub for project collaboration and sharing,
as well as automation of test suite execution and documentation and software deployment.
An example final project from another course (where the project is similar) can be seen here:
[Breast Cancer Predictor](https://github.com/ttimbers/breast_cancer_predictor_py)
## Milestone 4 summary
In this final project milestone, you will:
- Abstract project functions to their own software package
(this means the code for your project will be split across two repositories)
- Address any feedback received in earlier milestones or during the peer review
## Final project specifics
### Abstract project functions to their own software package
To further refine your project, you are going to split the code
for your analysis into two version control repositories.
One will be for your analysis,
the other will be another separate repository,
that will serve as a place to package up the functions you wrote in milestone 3.
Decoupling these two code bases
will allow for those functions to more easily be reused in future projects by
you and others who might find them useful!
Your new version control repository for your packaged functions should:
- be given a name relevant to the functions it contains
(just like the data science packages you know and love are).
- be in the DSCI-310-2024 organization: https://github.com/DSCI-310-2024
- contain at least 3 or 4 useful functions that handle errors gracefully
(you already wrote these - you need to now
copy them from your analysis repository to the software package repository)
- be well documented
(use the three level of package documentation we learned about in class:
function/reference docs, vignette/tutorials docs, and high-level package website docs)
- have unit and integration tests
(you already wrote these - you need to now
copy them from your analysis repository to the software package repository)
- use GitHub actions for continuous integration to run your tests,
and continuous deployment to render your package documentation
(if you are creating a Python package,
you can even use continuous deployment to publish your package to PyPI!)
- have a badge for a code coverage
- have a `README` that describes your package from a high-level,
including where it sits in the package ecosystem
for the language you chose to write it in chose
(i.e., what other similar packages exist and how do they differ)
- have a Software Licence
that describes how you will permit others to use your work
- have a code of conduct document
that describes behavioural expectations of team interactions
- have a contributions document
that outlines how the team will work together
(expected workflow practices)
You should update your the analysis code in your data analysis repository
so that:
- you load your R (or Python) package to access your functions,
instead of sourcing or importing files in your analysis repository
- remove the files that contain your functions
(you no longer need these since you are loading them from your package),
and the `tests` directory files that holds the tests and the helper data for those tests
(again, these should now exist in your package repository)
- update your projects computational environment so that it contains your R (or Python) package.
*Don't forget to include the version when doing this!*
### 2. Address any feedback received in earlier milestones and during peer review
Revise your data analysis project to address feedback received
from the DSCI 310 teaching team from past milestones,
as well as feedback received from the peer review.
50% of your final grade for this milestone will be assessing
whether you have addressed this feedback to improve your project.
To help us easily, and correctly, assess this,
please create a GitHub issue in your analysis project's GitHub repository
with the title "Feedback addressed".
In this issue describe any improvements you made to the project based on feedback,
and point to evidence of these improvements.
You can point us to evidence of addressing it
by providing URLs to reference
specific lines of code, commit messages, pull requests, etc.
Be sure to add some narration when sharing these URLs
so that it is easy for us to identify which changes to your work
addressed which pieces of feedback.
You will be graded on a sliding scale for this,
the more improvements you make,
the higher your grade for this part of the milestone will be.
The improvements should be at least one per team member,
and they should significantly improve the project.
This minimum could earn at most 37.5/50.
To earn more, you need to exceed these minimum improvements.
## Submission Instructions
Just before you submit the milestone 4, create a release on your analysis project repository on GitHub and name it something like `3.0.0` ([how to create a release](https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository#creating-a-release)). This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes, while you continue to work on your project for the next milestone.
Also create a release on your software package repository
(do this manually if you are not using an automated tool for this
in your continuous deployment).
You will submit a PDF to Gradescope for milestone 4 that includes:
1. the URL of your analysis project's GitHub.com repository
2. the URL of a GitHub release of your analysis project's GitHub.com repository
3. the URL to the "Feedback addressed" issue that outlines the feedback you have addressed
3. the URL of your software package's GitHub.com repository
5. the URL of a GitHub release of your software package's GitHub.com repository
## Expectations
- **Everyone should contribute equally to all aspects of the project**
**(e.g., code, writing, project management).**
**This should be evidenced by a roughly equal number of commits,**
**pull request reviews and participation in communication via GitHub issues.**
- Each group member should work in a
[GitHub flow workflow](https://docs.github.com/en/get-started/quickstart/github-flow);
where they create branches for each feature or fix,
which are reviewed and critiqued by at least one other teammate
before the the pull request is accepted.
- You should be committing to git and pushing to GitHub.com every time you work on this project.
- Git commit messages should be meaningful. These will be marked.
It's OK if one or two are less meaningful, but most should be.
- Use GitHub issues to communicate to their team mate (as opposed to email or Slack).
- Your question, analysis and visualization should make sense. It doesn't have to be complicated.
- **Your analysis should be correct, and run reproducibly given the instructions provided in the README.**
- You should use proper grammar and full sentences in your README.
Point form may occur, but should be less than 10% of your written documents.
- R code should follow the [tidyverse style guide](https://style.tidyverse.org/),
and Python code should follow the
[black style guide](https://black.readthedocs.io/en/stable/the_black_code_style/index.html) for Python)
- You should not have extraneous files in your repository that should be ignored.