---
title: "Bayesian Data Analysis course - Project work"
date: "Page updated: `r format.Date(file.mtime('project.Rmd'),'%Y-%m-%d')`"
params:
  schedule_file: "schedule2024.csv"
  project_dl: "1.12."
  project_feedback_dl: "6.12."
  project_presentations: "9.-13.12."
  exam_week_lecture_date: "16.10."
---

```{r, echo=FALSE, results=FALSE}
Sys.setlocale("LC_TIME", "C")

schedule <- read.csv2(params$schedule_file, sep = "\t", header = FALSE, row.names = 1, col.names = paste("Week", 0:13, sep = ""))

# Parse a date from the schedule (Google Sheets format DD/MM/YYYY) and
# format it; cells that are not parsable dates are returned unchanged.
sdate <- function(row, col, out_format = "%-d.%-m.", offset = 0) {
  parsed <- as.Date(schedule[row, col], format = "%d/%m/%Y") + offset
  tidyr::replace_na(format(parsed, format = out_format), schedule[row, col])
}

# return the string representation of the weekday of the scheduled day
sday <- function(row, col) {
  weekdays(as.Date(schedule[row, col], format = "%d/%m/%Y"))
}
```

## Project work details

Project work involves choosing a data set and carrying out a complete analysis
covering all the parts of the Bayesian workflow studied during the course.

  - The project work is meant to be done in period II.
  - At the beginning of period II:
      - Form a group. We prefer groups of 2--3, but the project can be done alone.
      - Select a topic. You can ask in the course chat channel #project for opinions on whether the topic and the dataset are good choices. You can change the topic later.
      - Start planning.
  - The main work for the project and the presentation will be done in the second half of period II, after all the workflow parts have been discussed in the course.
  - The in-person presentations will be given in the evaluation week after period II.
  - Use of AI is allowed on the course, but most of the work needs to be done by the students, and you need to report whether you used AI and how you used it (see [points 5 and 6 in Aalto guidelines for use of AI in teaching](https://www.aalto.fi/en/services/guidance-for-the-use-of-artificial-intelligence-in-teaching-and-learning-at-aalto-university)). We have tested some AI tools on the course topics and assignments, and the output can be a copy of existing text without attribution (i.e. plagiarism), vague, or contain mistakes, so you need to be careful when using such output.
  - All suspected plagiarism will be reported and investigated. See more about the [Aalto University Code of Academic Integrity and Handling Violations Thereof](https://into.aalto.fi/display/ensaannot/Aalto+University+Code+of+Academic+Integrity+and+Handling+Violations+Thereof).


## Project schedule

- Form a group and pick a topic. Register the group before the end of 7th Nov, 2024. The registration will open soon.
- Groups of 3 can reserve a presentation slot starting 7th Nov, 2024. 
- Groups of 1-2 can reserve a presentation slot starting 8th Nov, 2024.
- Groups that register late can reserve a presentation slot starting 9th Nov, 2024.
<!--
- Presentation slots 8th Dec, 2023, are meant only for those who are leaving the country before 11th Dec.
-->
- Work on the project. TA session queue is also for project questions.
- Project report deadline `r params$project_dl`. Submit in FeedbackFruits (the link will be added to MyCourses).
- Project report peer grading `r substr(params$project_dl, 1, 2)` - `r params$project_feedback_dl` (so that you'll get feedback for the report before the presentations).
- Project presentations `r params$project_presentations`.

## Groups

Project work is done in groups of 1-3 persons. Preferred group size is
3, because you learn more when you talk about the project with
someone else. 

If you don't have a group, you can ask other students in the group
chat channel **#project**. Tell them what kind of data you are interested
in (e.g. medicine, health, biological, engineering, political,
business), whether you prefer R or Python, and whether you already
have a more concrete idea for the topic.

Groups of 3 students can choose their presentation time slot before
groups of 1-2 students. A 3-person group is expected to do a bit more
work than a 1-2 person group.

You can do the project alone, but the amount of work is expected to
be the same as for a 2-person group.

## TA sessions

The groups will get help with the project work in [TA
sessions](assignments.html#TA_sessions). When there are no weekly
assignments, the TA sessions are still held to help with the
project work.

## Evaluation

The project work's evaluation consists of

- peer-graded project report (30%) (within FeedbackFruits: submission 90% and feedback 10%)
- presentation and oral exam graded by the course staff (70%)
    - clarity of slides + use of figures
    - clarity of oral presentation + flow of the presentation
    - all required parts included (not necessarily all in main slides, but it needs to be clear that all required steps were performed)
    - accuracy of use of terms (oral exam)
    - responses to questions (oral exam)

## Project report

In the project report you practice presenting the problem and the data
analysis results, which means that a minimal listing of code and figures
is not a good report. A data analysis project can be reported at
different levels of detail. This report should be more than a summary
of results without workflow steps: describe the steps and the
decisions made during the workflow. To keep the report readable, some
of the **diagnostic outputs and code** can be put in the **appendix**. If you
are uncertain, you can ask the TAs in TA sessions whether your level of
detail is appropriate.

The report **should not be over 20 pages**. The report can be much
shorter than 20 pages, but as figures can take a lot of space, the
upper limit is quite high. There is an upper limit so that you need to
do at least some selection of what to show; specifically, don't
include all possible output from the inference. If you are uncertain
whether your report contains sufficient information, ask the TAs.

The report **should include**

  1. Introduction describing
      - the motivation
      - the problem
      - the main modeling idea
      - In addition, showing an illustrative figure is recommended.
  1. Description of the data and the analysis problem. State where the data was obtained and, if it has been used previously in an online case study, how your analysis differs from the existing analyses.
  1. Description of at least two models, for example:
      - non-hierarchical and hierarchical,
      - linear and nonlinear,
      - different plausible observation models,
      - variable selection with many models.
  1. Informative or weakly informative priors, and justification of the choice of these priors.
  1. brms, rstanarm, or Stan code.
  1. How the MCMC inference was run, that is, what options were used. A good option is to show the command you ran, together with a textual explanation of the choice of options.
  1. Convergence diagnostic values ($\widehat{R}$, ESS, divergences) and what can be interpreted from them. What was done if the convergence was not good on the first try. <u>This should be reported for all models.</u>
  1. Posterior predictive checks and what can be interpreted from them. What was done to improve the model if the checks indicated misspecification. <u>This should be reported for all models.</u>
  1. **Optional/Bonus**: Predictive performance assessment if applicable (e.g. classification
    accuracy) and evaluation of practical usefulness of the accuracy. This should be reported for all models as well.
  1. Sensitivity analysis with respect to prior choices (i.e. checking whether the result changes a lot if prior is changed). <u>This should be reported for all models.</u>
  1. Model comparison (e.g. with LOO-CV).
  1. Discussion of issues and potential improvements.
  1. Conclusion: what was learned from the data analysis.
  1. Self-reflection on what the group learned while making the project.

You can also check the [FeedbackFruits rubric for the project report](project_work/rubric.pdf).

For guidance on how many digits to report, see [Digits notebook](https://users.aalto.fi/~ave/casestudies/Digits/digits.html).
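
Several of the required report items (two models, proper priors, sampling options, convergence diagnostics, posterior predictive checks, prior sensitivity, and model comparison) can be sketched in one short `brms` script. The following is a minimal sketch, not a template for your report. The data are simulated, the object names (`sim_data`, `fit_pooled`, `fit_hier`, and so on) are hypothetical, and the prior values and sampler settings are illustrative assumptions that you would need to justify for your own problem.

```r
library(brms)

# Simulated data, used only so that the sketch is self-contained
set.seed(1)
n_groups <- 8
n_per_group <- 20
group <- factor(rep(seq_len(n_groups), each = n_per_group))
x <- rnorm(n_groups * n_per_group)
group_effect <- rnorm(n_groups, mean = 0, sd = 0.7)
y <- 1 + 0.5 * x + group_effect[as.integer(group)] + rnorm(length(x))
sim_data <- data.frame(y = y, x = x, group = group)

# Illustrative proper priors (assumed values; justify your own choices)
priors_pooled <- c(
  prior(normal(0, 5), class = "Intercept"),
  prior(normal(0, 1), class = "b"),
  prior(exponential(1), class = "sigma")
)
priors_hier <- c(priors_pooled, prior(exponential(1), class = "sd"))

# Two models: non-hierarchical and hierarchical (varying intercept).
# The sampling options are written out so they can be reported.
fit_pooled <- brm(
  y ~ x, data = sim_data, family = gaussian(), prior = priors_pooled,
  chains = 4, iter = 2000, warmup = 1000, seed = 1, refresh = 0
)
fit_hier <- brm(
  y ~ x + (1 | group), data = sim_data, family = gaussian(),
  prior = priors_hier,
  chains = 4, iter = 2000, warmup = 1000, seed = 1, refresh = 0
)

# Convergence diagnostics for every model: the summary shows Rhat and ESS,
# and the sampler parameters give the number of divergent transitions
for (fit in list(fit_pooled, fit_hier)) {
  print(summary(fit))
  np <- nuts_params(fit)
  cat("Divergent transitions:",
      sum(np$Value[np$Parameter == "divergent__"]), "\n")
}

# Posterior predictive checks for every model
pp_check(fit_pooled)
pp_check(fit_hier)

# Simple prior sensitivity check: refit with wider priors and compare
# the posterior summaries (shown for one model; repeat for the others)
priors_pooled_wide <- c(
  prior(normal(0, 10), class = "Intercept"),
  prior(normal(0, 5), class = "b"),
  prior(exponential(0.2), class = "sigma")
)
fit_pooled_wide <- brm(
  y ~ x, data = sim_data, family = gaussian(), prior = priors_pooled_wide,
  chains = 4, iter = 2000, warmup = 1000, seed = 1, refresh = 0
)
print(fixef(fit_pooled))
print(fixef(fit_pooled_wide))

# Model comparison with LOO-CV
loo_pooled <- loo(fit_pooled)
loo_hier <- loo(fit_hier)
loo_comparison <- loo::loo_compare(loo_pooled, loo_hier)
print(loo_comparison)
```

In a report, each of these outputs would be accompanied by text explaining what it shows and what was done about any problems, and the bulkier diagnostic output could go in the appendix.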

## Project presentation

In addition to the submitted report, each project must be presented by the
authoring group, according to the following guidelines:

  - The presentation should be high level, but sufficiently detailed information
    should be readily available to help answer questions from the audience.
  - The duration of the presentation should be 10 minutes (groups of 1-2 students) or 15 minutes (groups of 3 students).
  - At the end of the presentation there will be an extra 5-10 minutes of
    questions from anyone in the audience or from the two members of the course staff who are present. The questions from the lecturer/TAs can be considered oral exam questions, and if the answers to these questions reveal weak knowledge of the methods and workflow steps that should be part of the project, that can reduce the grade.
  - Grading will be done by the two members of the course staff using standardized grading instructions.
    
Specific recommendations for the presentations include:

  - The first slide should include the project's title and the group members' names.
  - The chosen statistical model(s), including the observation model and priors, must be explained and justified.
  - Make sure the font is big enough to be easily readable by the
    audience. This includes figure captions, legends and axis
    information.
  - The last slide should be a summary or take-home messages and include contact information or a link to further information. (The grade will be reduced by one if the last slide has only something like "Thank you" or "Questions?")
  - In general, the best presentations are often given by teams that have
    frequently attended TA sessions and gotten feedback, so we strongly
    recommend attending these sessions.

More details on the presentation sessions in 2024:

- Presentations will be given on campus
<!--
- In Zoom you need to have video on.
- If you can't come to campus (e.g., if you have such symptoms that you can't come to campus), and you don't have microphone or video camera (e.g. in your laptop or mobile phone) or you can't have video on for other reasons, you can present on campus in period III.
-->
- If you reserved a presentation slot but need to cancel, do it ASAP via the reservation link on MyCourses. If you cancel after the reservation link has closed, send a message to David Kohns (TA) on Zulip.
- As we have many presentations in each slot, join the session on time. Late arrivals will lower the grade. Very late arrivals will fail the presentation and can present in period III.
- It is easiest if just one from the group shares the slides, but it is expected that all group members present some part of the presentation orally.
- Presentation time is 10 min for 1-2 person groups and 15 min for 3-person groups.
- The time limit is strict. It's a good idea to practice the talk so that you get the timing right. Staff will announce when 2 min and 1 min are left and when the time has ended. Going overtime reduces the grade.
- After the presentation there will be 5 min for questions, answers, and feedback.
- Each student has to come up with at least one question during the session. Students can ask more questions.
- Staff will ask further questions (kind of oral exam).
- Grading of the project presentation takes into account the following things, in order of importance. You can get feedback for all parts.
    - Students show understanding of essential concepts covered in the course
        - TAs will ask clarifying questions to check these
    - Inclusion of required workflow steps
    - Clarity of the problem, model and prior description
    - Focus of the presentation
        - focus on the most relevant points in the limited time
    - Overall clarity and coherence
        - how well the slides and presentation flow
        - effective presentation of the results
    - Slide design and visual aspects
        - e.g. use of figures, font size, number of digits
        - effective use of figures
    - Lack of typos, grammatical and spelling errors
    - Oral
        - clarity of oral presentation, rhythm and timing, rate of speech
        - clarity of voice projection and appropriate volume
        - presenting style created interest and was engaging
        - presence (e.g. speaking to the audience / camera)
        - possible distracting mannerisms
            - note: this has a minor effect, and involuntary aspects (e.g. tics) do not affect grading
    - Timing
        - point reduction if overtime
        - fair amount of time for all group members
    - Bonus: going beyond the expectations
      - you may get a bonus point for going beyond the explicit requirements
- Students will also self-evaluate their project. Instructions for this will be provided.

## Data sets

As some data sets have been overused for these particular goals, note that the
following ones are forbidden in this work (more can be added to this list so
make sure to check it regularly):

  - extremely common data sets like titanic, mtcars, iris, penguins (Palmer Archipelago, n=344)
  - Baseball batting (used by Bob Carpenter's StanCon case study).
  - Data sets used in the course demos, such as bodyfat and diabetes
  - Data sets used in other CS courses

It's best to use a dataset for which there is no ready-made analysis on the internet, but if you choose a dataset already used in an online case study, provide links to the previous studies and report how your analysis differs from them (for example, if someone has done a non-Bayesian analysis and you do the full Bayesian analysis).

Depending on the model and the structure of the data, a good data set would have more than 100 observations but fewer than 1 million. If you know an interesting big data set, you can use a smaller subset of the data to keep the computation times feasible. Ideally the data has some structure, so that it is sensible to use multilevel/hierarchical models. If you are uncertain, ask the TAs.
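
If you subsample a big data set, it can help to sample within groups so that the grouping structure needed for a hierarchical model is kept. The following base R sketch shows one way to do this. The data frame `big_data`, its columns, and the helper `subsample_by_group()` are hypothetical names invented for this illustration, and the data are simulated.

```r
set.seed(2024)

# Hypothetical "big" data set with a grouping variable (simulated)
n_big <- 50000
big_data <- data.frame(
  region = sample(paste0("region_", 1:10), n_big, replace = TRUE),
  outcome = rnorm(n_big)
)

# Keep at most `max_per_group` randomly chosen rows from each group,
# so that every group is still present in the smaller data set
subsample_by_group <- function(data, group_col, max_per_group) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group_col]])
  keep <- lapply(rows_by_group, function(rows) {
    if (length(rows) <= max_per_group) {
      rows
    } else {
      rows[sample.int(length(rows), max_per_group)]
    }
  })
  data[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

small_data <- subsample_by_group(big_data, "region", max_per_group = 500)

# Number of rows kept per group
print(table(small_data$region))
```

A plain random subsample of rows is simpler, but it can leave small groups with very few or no observations. Whichever approach you use, describe it in the report.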

Here are some public datasets with plenty of choices:

  - [Medical datasets](https://higgi13425.github.io/medicaldata/)
  - [Vanderbilt Biostatistics datasets](https://hbiostat.org/data/)
  - [EU data](https://data.europa.eu/data/datasets?locale=en&minScoring=0)
  - [Inter-university Consortium for Political and Social Research](https://www.icpsr.umich.edu/web/pages/ICPSR/index.html)
  - [WHO mortality data](https://platform.who.int/mortality/themes/theme-details/topics/indicator-groups/indicator-group-details/MDB/road-traffic-accidents)
  - [The World Bank Data](https://data.worldbank.org/)
  - [A long list of public datasets grouped by topic](https://github.com/awesomedata/awesome-public-datasets)
  - [R datasets](https://vincentarelbundock.github.io/Rdatasets/datasets.html)

## Model requirements

  - Every parameter needs to have an explicit proper prior. Improper flat priors are not allowed.
  - A hierarchical model is a model where the prior of a certain parameter
    contains other parameters that are also estimated in the model. For instance,
    `b ~ normal(mu, sigma)`, `mu ~ normal(0, 1)`, `sigma ~ exponential(1)`.
  - Do not impose hard constraints on a parameter unless they are natural to
    it. `uniform(a, b)` should not be used unless the boundaries are really logical boundaries and values beyond the boundaries are completely impossible.
  - At least some models should include covariates. Modelling the outcome without predictors
    is likely too simple for the project.
  - `brms` or `rstanarm` is recommended (less time spent on coding the model). `brms` uses improper flat priors for many parameters, so you need to define proper priors. Report also the proper default priors which you didn't change.
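
To find out which default priors `brms` would use, and to check the full set of priors after replacing the flat ones, you can inspect the priors before fitting anything. The sketch below assumes a simple varying-intercept model on simulated data. The names `sim_data`, `model_formula`, and `proper_priors` are hypothetical, and the prior values are placeholders, not recommendations.

```r
library(brms)

# Simulated data, used only so that the sketch is self-contained
set.seed(1)
sim_data <- data.frame(
  y = rnorm(60),
  x = rnorm(60),
  group = factor(rep(1:6, each = 10))
)
model_formula <- y ~ x + (1 | group)

# List the default priors for this model. Population-level coefficients
# (class "b") get an improper flat prior by default.
default_priors <- get_prior(model_formula, data = sim_data, family = gaussian())
print(default_priors)

# Replace the defaults with explicit proper priors (placeholder values)
proper_priors <- c(
  prior(normal(0, 5), class = "Intercept"),
  prior(normal(0, 1), class = "b"),
  prior(exponential(1), class = "sd"),
  prior(exponential(1), class = "sigma")
)

# Combine the explicit priors with any remaining defaults, without fitting,
# to see the complete prior specification that would be used
all_priors <- validate_prior(
  proper_priors, model_formula, data = sim_data, family = gaussian()
)
print(all_priors)
```

After fitting, `prior_summary()` on the fitted model lists the priors that were actually used, which is a convenient source for the prior table in the report.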

## Some examples

The following case study examples demonstrate how text, equations, figures, code, and inference results can be included in one report. These examples don't necessarily have all the workflow steps required in your report, but different steps are illustrated in different case studies and you can get good ideas for your report just by browsing through them.

  - [BDA R and Python demos](demos.html) are quite minimal in description of the data and discussion of the results, but show many diagnostics and basic plots.
  - Some [Stan case studies](https://mc-stan.org/users/documentation/case-studies) focus on some specific methods, but there are many case studies that are excellent examples for this course. They don't include all the steps required in this course, but are good examples of writing. Some of them are longer or use more advanced models than required in this course.
      - [Bayesian workflow for disease transmission modeling in Stan](https://mc-stan.org/users/documentation/case-studies/boarding_school_case_study.html)
      - [Model-based Inference for Causal Effects in Completely Randomized Experiments](https://mc-stan.org/users/documentation/case-studies/model-based_causal_inference_for_RCT.html)
      - [Tagging Basketball Events with HMM in Stan](https://mc-stan.org/users/documentation/case-studies/bball-hmm.html)
      - [Model building and expansion for golf putting](https://mc-stan.org/users/documentation/case-studies/golf.html)
      - [A Dyadic Item Response Theory Model](https://mc-stan.org/users/documentation/case-studies/dyadic_irt_model.html)
      - [Predator-Prey Population Dynamics: the Lotka-Volterra model in Stan](https://mc-stan.org/users/documentation/case-studies/lotka-volterra-predator-prey.html)
      - [Hierarchical model for motivational shifts in aging monkeys](https://dansblog.netlify.app/posts/2022-09-04-everybodys-got-something-to-hide-except-me-and-my-monkey/everybodys-got-something-to-hide-except-me-and-my-monkey.html)
  - Some [StanCon case studies](https://github.com/stan-dev/stancon_talks) (scroll down) can also provide good project ideas.