<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<title>index</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<style>
html,body{color:black}*:not('#mkdbuttons'){margin:0;padding:0}#wrapper{font:15px helvetica,arial,freesans,clean,sans-serif;-webkit-font-smoothing:antialiased;line-height:1.7;padding:3px;background:#fff;border-radius:3px;-moz-border-radius:3px;-webkit-border-radius:3px}p{margin:1em 0}a{color:#4183c4;text-decoration:none}#wrapper{background-color:#fff;padding:30px;margin:15px;font-size:15px;line-height:1.6}#wrapper>*:first-child{margin-top:0 !important}#wrapper>*:last-child{margin-bottom:0 !important}@media screen{#wrapper{box-shadow:0 0 0 1px #cacaca, 0 0 0 4px #eee}}h1,h2,h3,h4,h5,h6{font-weight:700;line-height:1.7;cursor:text;position:relative;margin:1em 0 15px;padding:0}h1{font-size:2.5em;border-bottom:1px solid #ddd}h2{font-size:2em;border-bottom:1px solid #eee}h3{font-size:1.5em}h4{font-size:1.2em}h5{font-size:1em}h6{color:#777;font-size:1em}p,blockquote,table,pre{margin:15px 0}ul{padding-left:30px}ol{padding-left:30px}ol li ul:first-of-type{margin-top:0px}hr{background:transparent url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAYAAAAECAYAAACtBE5DAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAyJpVFh0WE1MOmNvbS5hZG9iZS54bXAAAAAAADw/eHBhY2tldCBiZWdpbj0i77u/IiBpZD0iVzVNME1wQ2VoaUh6cmVTek5UY3prYzlkIj8+IDx4OnhtcG1ldGEgeG1sbnM6eD0iYWRvYmU6bnM6bWV0YS8iIHg6eG1wdGs9IkFkb2JlIFhNUCBDb3JlIDUuMC1jMDYwIDYxLjEzNDc3NywgMjAxMC8wMi8xMi0xNzozMjowMCAgICAgICAgIj4gPHJkZjpSREYgeG1sbnM6cmRmPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjIj4gPHJkZjpEZXNjcmlwdGlvbiByZGY6YWJvdXQ9IiIgeG1sbnM6eG1wPSJodHRwOi8vbnMuYWRvYmUuY29tL3hhcC8xLjAvIiB4bWxuczp4bXBNTT0iaHR0cDovL25zLmFkb2JlLmNvbS94YXAvMS4wL21tLyIgeG1sbnM6c3RSZWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20veGFwLzEuMC9zVHlwZS9SZXNvdXJjZVJlZiMiIHhtcDpDcmVhdG9yVG9vbD0iQWRvYmUgUGhvdG9zaG9wIENTNSBNYWNpbnRvc2giIHhtcE1NOkluc3RhbmNlSUQ9InhtcC5paWQ6OENDRjNBN0E2NTZBMTFFMEI3QjRBODM4NzJDMjlGNDgiIHhtcE1NOkRvY3VtZW50SUQ9InhtcC5kaWQ6OENDRjNBN0I2NTZBMTFFMEI3QjRBODM4NzJDMjlGNDgiPiA8eG1wTU06RGVyaXZlZEZyb20gc3RSZWY6aW5zdGFuY2VJRD0ieG1wLmlpZDo4Q0NGM0E3O
DY1NkExMUUwQjdCNEE4Mzg3MkMyOUY0OCIgc3RSZWY6ZG9jdW1lbnRJRD0ieG1wLmRpZDo4Q0NGM0E3OTY1NkExMUUwQjdCNEE4Mzg3MkMyOUY0OCIvPiA8L3JkZjpEZXNjcmlwdGlvbj4gPC9yZGY6UkRGPiA8L3g6eG1wbWV0YT4gPD94cGFja2V0IGVuZD0iciI/PqqezsUAAAAfSURBVHjaYmRABcYwBiM2QSA4y4hNEKYDQxAEAAIMAHNGAzhkPOlYAAAAAElFTkSuQmCC) repeat-x 0 0;border:0 none;color:#ccc;height:4px;margin:15px 0;padding:0}#wrapper>h2:first-child{margin-top:0;padding-top:0}#wrapper>h1:first-child{margin-top:0;padding-top:0}#wrapper>h1:first-child+h2{margin-top:0;padding-top:0}#wrapper>h3:first-child,#wrapper>h4:first-child,#wrapper>h5:first-child,#wrapper>h6:first-child{margin-top:0;padding-top:0}a:first-child h1,a:first-child h2,a:first-child h3,a:first-child h4,a:first-child h5,a:first-child h6{margin-top:0;padding-top:0}h1+p,h2+p,h3+p,h4+p,h5+p,h6+p,ul li>:first-child,ol li>:first-child{margin-top:0}dl{padding:0}dl dt{font-size:14px;font-weight:bold;font-style:italic;padding:0;margin:15px 0 5px}dl dt:first-child{padding:0}dl dt>:first-child{margin-top:0}dl dt>:last-child{margin-bottom:0}dl dd{margin:0 0 15px;padding:0 15px}dl dd>:first-child{margin-top:0}dl dd>:last-child{margin-bottom:0}blockquote{border-left:4px solid #DDD;padding:0 15px;color:#777}blockquote>:first-child{margin-top:0}blockquote>:last-child{margin-bottom:0}table{border-collapse:collapse;border-spacing:0;font-size:100%;font:inherit}table th{font-weight:bold;border:1px solid #ccc;padding:6px 13px}table td{border:1px solid #ccc;padding:6px 13px}table tr{border-top:1px solid #ccc;background-color:#fff}table tr:nth-child(2n){background-color:#f8f8f8}img{max-width:100%}code,tt{margin:0 2px;padding:0 5px;white-space:nowrap;border:1px solid #eaeaea;background-color:#f8f8f8;border-radius:3px;font-family:Consolas, 'Liberation Mono', Courier, monospace;font-size:12px;color:#333}pre>code{margin:0;padding:0;white-space:pre;border:none;background:transparent}.highlight pre{background-color:#f8f8f8;border:1px solid #ccc;font-size:13px;line-height:19px;overflow:auto;padding:6px 
10px;border-radius:3px}pre{background-color:#f8f8f8;border:1px solid #ccc;font-size:14px;line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px;margin:26px 0}pre code,pre tt{background-color:transparent;border:none}.poetry pre{font-family:Georgia, Garamond, serif !important;font-style:italic;font-size:110% !important;line-height:1.6em;display:block;margin-left:1em}.poetry pre code{font-family:Georgia, Garamond, serif !important;word-break:break-all;word-break:break-word;-webkit-hyphens:auto;-moz-hyphens:auto;hyphens:auto;white-space:pre-wrap}sup,sub,a.footnote{font-size:1.4ex;height:0;line-height:1;vertical-align:super;position:relative}sub{vertical-align:sub;top:-1px}@media print{body{background:#fff}img,pre,table,figure{page-break-inside:avoid}#wrapper{background:#fff;border:none}pre code{overflow:visible}}@media screen{body.inverted{color:#eee !important;border-color:#555;box-shadow:none}.inverted #wrapper,.inverted hr,.inverted p,.inverted td,.inverted li,.inverted h1,.inverted h2,.inverted h3,.inverted h4,.inverted h5,.inverted h6,.inverted th,.inverted .math,.inverted caption,.inverted dd,.inverted dt,.inverted blockquote{color:#eee !important;border-color:#555;box-shadow:none}.inverted td,.inverted th{background:#333}.inverted pre,.inverted code,.inverted tt{background:#eeeeee !important;color:#111}.inverted h2{border-color:#555555}.inverted hr{border-color:#777;border-width:1px !important}::selection{background:rgba(157,193,200,0.5)}h1::selection{background-color:rgba(45,156,208,0.3)}h2::selection{background-color:rgba(90,182,224,0.3)}h3::selection,h4::selection,h5::selection,h6::selection,li::selection,ol::selection{background-color:rgba(133,201,232,0.3)}code::selection{background-color:rgba(0,0,0,0.7);color:#eeeeee}code span::selection{background-color:rgba(0,0,0,0.7) !important;color:#eeeeee !important}a::selection{background-color:rgba(255,230,102,0.2)}.inverted 
a::selection{background-color:rgba(255,230,102,0.6)}td::selection,th::selection,caption::selection{background-color:rgba(180,237,95,0.5)}.inverted{background:#0b2531;background:#252a2a}.inverted #wrapper{background:#252a2a}.inverted a{color:#acd1d5}}.highlight .c{color:#998;font-style:italic}.highlight .err{color:#a61717;background-color:#e3d2d2}.highlight .k,.highlight .o{font-weight:bold}.highlight .cm{color:#998;font-style:italic}.highlight .cp{color:#999;font-weight:bold}.highlight .c1{color:#998;font-style:italic}.highlight .cs{color:#999;font-weight:bold;font-style:italic}.highlight .gd{color:#000;background-color:#fdd}.highlight .gd .x{color:#000;background-color:#faa}.highlight .ge{font-style:italic}.highlight .gr{color:#a00}.highlight .gh{color:#999}.highlight .gi{color:#000;background-color:#dfd}.highlight .gi .x{color:#000;background-color:#afa}.highlight .go{color:#888}.highlight .gp{color:#555}.highlight .gs{font-weight:bold}.highlight .gu{color:#800080;font-weight:bold}.highlight .gt{color:#a00}.highlight .kc,.highlight .kd,.highlight .kn,.highlight .kp,.highlight .kr{font-weight:bold}.highlight .kt{color:#458;font-weight:bold}.highlight .m{color:#099}.highlight .s{color:#d14}.highlight .na{color:#008080}.highlight .nb{color:#0086B3}.highlight .nc{color:#458;font-weight:bold}.highlight .no{color:#008080}.highlight .ni{color:#800080}.highlight .ne,.highlight .nf{color:#900;font-weight:bold}.highlight .nn{color:#555}.highlight .nt{color:#000080}.highlight .nv{color:#008080}.highlight .ow{font-weight:bold}.highlight .w{color:#bbb}.highlight .mf,.highlight .mh,.highlight .mi,.highlight .mo{color:#099}.highlight .sb,.highlight .sc,.highlight .sd,.highlight .s2,.highlight .se,.highlight .sh,.highlight .si,.highlight .sx{color:#d14}.highlight .sr{color:#009926}.highlight .s1{color:#d14}.highlight .ss{color:#990073}.highlight .bp{color:#999}.highlight .vc,.highlight .vg,.highlight .vi{color:#008080}.highlight .il{color:#099}.highlight 
.gc{color:#999;background-color:#EAF2F5}.type-csharp .highlight .k,.type-csharp .highlight .kt{color:#00F}.type-csharp .highlight .nf{color:#000;font-weight:normal}.type-csharp .highlight .nc{color:#2B91AF}.type-csharp .highlight .nn{color:#000}.type-csharp .highlight .s,.type-csharp .highlight .sc{color:#A31515}body.dark #wrapper{background:transparent !important;box-shadow:none !important}
@media print{
#generated-toc-clone,#generated-toc{display:none!important}hr{border:none!important;page-break-after:always!important}
}
body { font-size: 14px }
#wrapper * { font-size: 100%!important; }
</style>
</head>
<body class="normal">
<div id="wrapper">
<h1 id="courseorganization">Course organization</h1>
<p>Most of science, including statistics, revolves around the design of models to represent data of some sort. In statistics, these models are typically probability distributions with parameters to estimate. The typical stages of an analysis go something like this: </p>
<ol>
<li>Data management
<ol>
<li>Receive possibly messy and bad data</li>
<li>Clean and filter the data</li>
</ol></li>
<li>Exploratory data analysis
<ol>
<li>Describe and visualize the data to come up with possible models</li>
</ol></li>
<li>Statistical inference
<ol>
<li>Estimate model parameters given data (Model estimation and selection)</li>
<li>Make predictions with model (Model prediction)</li>
</ol></li>
<li>Reporting</li>
</ol>
<p>In today’s context, statisticians will be doing all of these stages in front of a computer. We believe that modern analysis of massive data sets is best achieved using a combination of complementary software tools, and the course will cover what we consider to be an essential toolkit comprising <code>bash</code>, <code>git</code>, <code>make</code>, <code>sqlite3</code>, <code>python</code>, <code>R</code>, <code>C</code> and <code>LaTeX</code>. We use Python as the computational “glue” that integrates this collection of tools, and also in its capacity as an efficient high-level language for scientific and statistical computing. </p>
<p>To deal with the massive data sets that you will encounter in your career, the course will emphasize reproducible analysis, code optimization, high-performance computing and cloud computing. Examples will be drawn from the core topics in computational statistics of optimization (e.g. smoothing, interpolation, maximum likelihood, constrained and unconstrained methods) and simulation (e.g. jackknife, bootstrap, permutation, Monte Carlo integrals, MCMC). </p>
<p>By the end of the course, every student should have learned the following practical skills: </p>
<ol>
<li>How to set up a reproducible analysis pipeline using bash, git, make and LaTeX</li>
<li>How to clean, manage and manipulate huge data sets using text processing and relational databases</li>
<li>How to explore data sets interactively using the IPython notebook and visualization packages</li>
<li>How to code statistical routines efficiently in high level languages</li>
<li>How to optimize statistical routines by compiling to native code</li>
<li>How to compute in parallel on multi-core machines, clusters and GPUs</li>
</ol>
<p>In particular, students should be able to write readable, well-documented, efficient (and if necessary parallel) code to implement a statistical method described in a textbook or paper, and apply it to real-world, possibly messy, data sets. </p>
<h2 id="unit1settingupareproducibleanalysispipeline5hours">Unit 1: Setting up a reproducible analysis pipeline (5 hours)</h2>
<ul>
<li>Setting up workspace and introduction to bash</li>
<li>Version control with git</li>
<li>Document generation with LaTeX</li>
<li>Automating the pipeline with make</li>
<li>Introduction to Python and the IPython notebook</li>
<li>Testing and debugging</li>
</ul>
<p><strong>Exercise</strong>: Create a git repository on GitHub for this project. Write a makefile to automate generation of a LaTeX report with embedded R and Python results. Ensure that git commits are made regularly and are well documented. </p>
<p><strong>Exercise</strong>: You are given a Python program that has errors. Fix it. </p>
<p><strong>Exercise</strong>: Write functions to calculate the mean, median and variance of a set of numbers using test-driven development. </p>
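<p>A minimal sketch of how the test-driven exercise above might look in Python. The function names and test values are illustrative assumptions, not part of the course material. In a test-driven workflow the <code>test_*</code> functions would be written first and the implementations filled in until they pass.</p>

```python
import math


def mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if not xs:
        raise ValueError("mean requires at least one value")
    return math.fsum(xs) / len(xs)


def median(xs):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    if not xs:
        raise ValueError("median requires at least one value")
    s = sorted(xs)
    mid = len(s) // 2
    if len(s) % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    if len(xs) < 2:
        raise ValueError("variance requires at least two values")
    m = mean(xs)
    return math.fsum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# Tests: each one pins down the expected behaviour on a small hand-checked case.
def test_mean():
    assert mean([1, 2, 3, 4]) == 2.5


def test_median():
    assert median([3, 1, 2]) == 2        # odd length
    assert median([4, 1, 3, 2]) == 2.5   # even length


def test_variance():
    # Mean is 5 and the squared deviations sum to 32, so the variance is 32 / 7.
    assert math.isclose(variance([2, 4, 4, 4, 5, 5, 7, 9]), 32 / 7)


if __name__ == "__main__":
    test_mean()
    test_median()
    test_variance()
```

<p>A test runner such as pytest would discover the <code>test_*</code> functions automatically; calling them directly, as above, works without any extra tooling.</p>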
<h2 id="unit2datamanipulationandmunging5hours">Unit 2: Data manipulation and munging (5 hours)</h2>
<ul>
<li>Reading and writing data</li>
<li>Text processing and regular expressions</li>
<li>Querying a relational database with SQL</li>
<li>Database design for statisticians</li>
</ul>
<p><strong>Exercise</strong>: You are given a data set that needs to be cleaned and reformatted into a data frame. </p>
<p><strong>Exercise</strong>: Given an SQLite3 database, use SQL to answer some questions and extract data subsets. </p>
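<p>As a rough illustration of this kind of task, the sketch below builds a small in-memory SQLite3 database with Python's standard <code>sqlite3</code> module and answers one question with a join and an aggregate. The tables, columns and rows are hypothetical and invented purely for the example.</p>

```python
import sqlite3

# Hypothetical schema and data, used only to illustrate the query pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, sex TEXT);
    CREATE TABLE visits (
        patient_id INTEGER REFERENCES patients(id),
        visit_date TEXT,
        sbp REAL
    );
""")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, "Ann", "F"), (2, "Bob", "M"), (3, "Cat", "F")],
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "2015-01-05", 120.0),
        (1, "2015-02-05", 130.0),
        (2, "2015-01-07", 140.0),
        (3, "2015-01-09", 110.0),
    ],
)

# Question: how many visits, and what mean systolic blood pressure, per sex?
by_sex = conn.execute("""
    SELECT p.sex, COUNT(*) AS n_visits, AVG(v.sbp) AS mean_sbp
    FROM visits AS v
    JOIN patients AS p ON p.id = v.patient_id
    GROUP BY p.sex
    ORDER BY p.sex
""").fetchall()

# Subset extraction: visits at or above a threshold, passed as a bound parameter.
high = conn.execute(
    "SELECT patient_id, visit_date FROM visits WHERE sbp >= ? ORDER BY visit_date",
    (130.0,),
).fetchall()

conn.close()
```

<p>Passing the threshold as a bound parameter (the <code>?</code> placeholder) rather than formatting it into the SQL string keeps the query safe to reuse with untrusted input.</p>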
<p><strong>Exercise</strong>: Given a spreadsheet, design a normalized database to manage the data. Transfer the data from the spreadsheet into the database. </p>
<h2 id="unit3exploratorydataanalysisandvisualization5hours">Unit 3: Exploratory data analysis and visualization (5 hours)</h2>
<ul>
<li>Manipulating data in a DataFrame</li>
<li>Visualizing data with matplotlib</li>
<li>Grammar of graphics with ggplot and bokeh</li>
<li>Animation - Metropolis, Gibbs and Hamiltonian sampling</li>
</ul>
<p><strong>Exercise</strong>: Load a dataset into a DataFrame and use joins and the split-apply-combine pattern to answer some questions. </p>
<p><strong>Exercise</strong>: Plot and annotate the given dataset to illustrate its key features. </p>
<h2 id="unit4efficientstatisticalroutines15hours">Unit 4: Efficient statistical routines (15 hours)</h2>
<ul>
<li>Broadcasting and vectorization in numpy and pandas</li>
<li>Functional programming</li>
<li>Representation of numbers and linear algebra</li>
<li>Introducing BLAS and LAPACK</li>
<li>Quadrature (Numerical integration)</li>
<li>Constrained and unconstrained optimization</li>
<li>Resampling methods</li>
<li>Monte Carlo simulations</li>
<li>Markov chain Monte Carlo</li>
</ul>
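<p>To give a flavour of the resampling methods listed above, here is a minimal sketch of the nonparametric bootstrap using only the Python standard library. The function name, its defaults and the sample values are assumptions made for illustration.</p>

```python
import random
import statistics


def bootstrap_se(data, stat=statistics.median, n_boot=2000, seed=0):
    """Bootstrap estimate of the standard error of stat(data).

    Resamples the data with replacement n_boot times, recomputes the
    statistic on each resample, and returns the standard deviation of
    those replicates.
    """
    rng = random.Random(seed)          # local generator, so results are reproducible
    n = len(data)
    replicates = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(replicates)


# Hypothetical sample; any list of numbers would do.
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.9]
se_median = bootstrap_se(sample)
se_mean = bootstrap_se(sample, stat=statistics.mean)
```

<p>For the sample mean the bootstrap standard error can be checked against the usual formula, which makes the mean a convenient sanity test before applying the same function to statistics with no closed-form standard error.</p>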
<p><strong>Exercise</strong>: You are given a slow and buggy simulation script. Fix the errors and speed it up using vectorization. </p>
<p><strong>Exercise</strong>: Write an efficient function to calculate Cook’s distance for the influence of data points on a regression. </p>
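<p>One possible approach, sketched under the assumption that NumPy is available and that the regression is ordinary least squares: Cook's distance for observation <em>i</em> can be written as e<sub>i</sub><sup>2</sup> / (p s<sup>2</sup>) times h<sub>ii</sub> / (1 - h<sub>ii</sub>)<sup>2</sup>, where e<sub>i</sub> is the residual, h<sub>ii</sub> the leverage, p the number of coefficients and s<sup>2</sup> the residual variance. The function name and the simulated data are illustrative only.</p>

```python
import numpy as np


def cooks_distance(X, y):
    """Cook's distance for every observation in an OLS fit of y on X.

    X is the n x p design matrix (include a column of ones for an intercept).
    The leverages come from a reduced QR decomposition, so the n x n hat
    matrix is never formed and nothing is refitted n times.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Q, R = np.linalg.qr(X)                    # X = QR, with Q of shape n x p
    beta = np.linalg.solve(R, Q.T @ y)        # least-squares coefficients
    resid = y - X @ beta
    leverage = np.sum(Q ** 2, axis=1)         # diagonal of the hat matrix
    s2 = resid @ resid / (n - p)              # residual variance estimate
    return resid ** 2 / (p * s2) * leverage / (1.0 - leverage) ** 2


# Simulated data with one shifted response, purely for demonstration.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
y[0] += 10.0
X = np.column_stack([np.ones_like(x), x])
d = cooks_distance(X, y)
```

<p>The closed form avoids the leave-one-out refits implied by the definition, which is where most of the speed-up comes from. It assumes no observation has leverage exactly 1.</p>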
<p><strong>Exercise</strong>: Use Newton’s method to fit a logistic regression model (aka iteratively reweighted least squares). </p>
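<p>A minimal sketch of the Newton iteration for logistic regression, assuming NumPy is available. Each step solves (X'WX) delta = X'(y - p) with W = diag(p(1 - p)), which is the weighted least-squares problem that gives the method its other name. The function name, defaults and simulated data are assumptions for illustration, and the sketch does not guard against perfectly separable data.</p>

```python
import numpy as np


def fit_logistic_newton(X, y, tol=1e-10, max_iter=50):
    """Maximum-likelihood logistic regression coefficients via Newton's method."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # fitted probabilities
        w = p * (1.0 - p)                        # diagonal of the weight matrix W
        score = X.T @ (y - p)                    # gradient of the log-likelihood
        info = X.T @ (X * w[:, None])            # X'WX, the negative Hessian
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated data from a known model, purely for demonstration.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 1.0])
prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.random(500) < prob).astype(float)
beta_hat = fit_logistic_newton(X, y)
```

<p>At convergence the score vector is numerically zero, which gives a simple check that does not depend on any other fitting library.</p>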
<p><strong>Exercise</strong>: Implement the non-parametric and parametric bootstrap for phylogenetic trees described in <a href="http://www.pnas.org/content/93/23/13429.full.pdf">http://www.pnas.org/content/93/23/13429.full.pdf</a> </p>
<p><strong>Exercise</strong>: Use simulation to evaluate the power of a statistical test. </p>
<p><strong>Exercise</strong>: Use symbolic integration, numerical integration and Monte Carlo integration to evaluate a definite double integral. </p>
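<p>As a rough standard-library illustration, the sketch below evaluates the integral of exp(-(x<sup>2</sup> + y<sup>2</sup>)) over the unit square in three ways. The integrand is an arbitrary choice made for this example. Its closed form, written with the error function, stands in for the symbolic step; a computer algebra system would be needed for genuine symbolic integration.</p>

```python
import math
import random


def integrand(x, y):
    return math.exp(-(x * x + y * y))


def midpoint_rule(f, n=200):
    """Composite midpoint rule on the unit square using an n x n grid of cells."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n):
            total += f(x, (j + 0.5) * h)
    return total * h * h


def monte_carlo(f, n=100_000, seed=0):
    """Average of f at n uniform random points, with its standard error."""
    rng = random.Random(seed)
    vals = [f(rng.random(), rng.random()) for _ in range(n)]
    est = math.fsum(vals) / n
    var = math.fsum((v - est) ** 2 for v in vals) / (n - 1)
    return est, math.sqrt(var / n)


# Closed form: the integrand factorizes, and each factor integrates to
# sqrt(pi) / 2 * erf(1) on [0, 1].
exact = (math.sqrt(math.pi) / 2 * math.erf(1.0)) ** 2
numeric = midpoint_rule(integrand)
mc_estimate, mc_se = monte_carlo(integrand)
```

<p>The Monte Carlo standard error shrinks like 1/sqrt(n), so it converges far more slowly than the grid rule in two dimensions; its advantage only appears as the dimension grows.</p>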
<p><strong>Exercise</strong>: Use regular Python, PyMC3 and PyStan to find the posterior distribution for a two-level model. </p>
<h2 id="unit5codeoptimizationandnativecode5hours">Unit 5: Code optimization and native code (5 hours)</h2>
<ul>
<li>Complexity and performance of algorithms and data structures</li>
<li>C crash course for statisticians</li>
<li>Using numexpr, numba and cython</li>
<li>Using functions from C/C++ libraries</li>
<li>Writing functions in C/C++ and wrapping for Python/R</li>
</ul>
<p><strong>Exercise</strong>: You are given some slow code. Speed it up by using a better algorithm or data structure. </p>
<p><strong>Exercise</strong>: Implement the Newton-Raphson method in C. The function should take the following arguments: a function pointer <code>f</code>, a function pointer <code>fprime</code>, an initial point <code>x0</code>, and a tolerance. </p>
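<p>Before writing the C version, it can help to prototype the interface described above in Python, where callables play the role of function pointers. This is only a sketch: the <code>max_iter</code> argument and the error handling are additions not mentioned in the exercise.</p>

```python
def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=100):
    """Find a root of f starting from x0, using the derivative fprime.

    Iterates x = x - f(x) / fprime(x) until the step is smaller than tol.
    """
    x = x0
    for _ in range(max_iter):
        slope = fprime(x)
        if slope == 0:
            raise ZeroDivisionError("derivative vanished at x = %r" % x)
        step = f(x) / slope
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("no convergence within max_iter iterations")


# Example: the positive root of x^2 - 2, starting from x0 = 1.
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```

<p>In C the two callables would become parameters of type <code>double (*)(double)</code>, and the failure cases would need to be signalled through a return code or an output parameter rather than exceptions.</p>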
<p><strong>Exercise</strong>: You are given some slow code in Python. Speed it up using Cython. </p>
<h2 id="unit6high-performancecomputing10hours">Unit 6: High-performance computing (10 hours)</h2>
<ul>
<li>Parallel programming patterns</li>
<li>Multiprocessing and IPython.Parallel</li>
<li>Processing big data with MapReduce</li>
<li>Multi-CPU computing with MPI</li>
<li>GPU computing with CUDA</li>
</ul>
<p><strong>Exercise</strong>: Rewrite the function to calculate Cook’s distance using multiprocessing. </p>
<p><strong>Exercise</strong>: Use Elastic MapReduce to do some massive genomic data manipulation. </p>
<p><strong>Exercise</strong>: Use MPI to run an MCMC with parallel tempering (aka <span class="math">\(MC^3\)</span>) for a long time with some defined swap interval between chains. </p>
<p><strong>Exercise</strong>: Write a matrix multiplication kernel with and without use of shared memory using CUDA. Test it out using square matrices initialized with random numbers from CURAND. </p>
<!-- ##END MARKED WRAPPER## -->
</div>
</body>
</html>