-
Notifications
You must be signed in to change notification settings - Fork 4
/
Lecture04.Rmd
609 lines (487 loc) · 18.1 KB
/
Lecture04.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
---
title: "MBB 243 - Lecture 4"
author: "Ryan Morin"
date: 2022/02/02
output:
powerpoint_presentation: default
slidy_presentation: default
---
```{r setup, include=FALSE}
require(reticulate)
knitr::opts_chunk$set(echo = FALSE)
# if necessary, use this command to install required non-standard command-line tools:
# conda install -c bioconda samtools seqtk
```
# Learning goals for the week
- Apply and implement the most common type of loops in Python
- Understand why you should avoid loops in R
- Understand conditionals, boolean logic and their common applications in programming
- Put conditionals to use in Python and R using various flavours of if/else
- Put concepts into practical applications relating to sequence data
# Comparing values with Conditions
- Conditions are used in programs to make decisions
- Most common type of comparison used in decision making is to determine how one numeric value compares to the other in magnitude
- Another common comparison is between strings (identity or partial identity)
# If this, then...
- The `if` statement is shared across most programming languages with similar syntax
- Used to check whether the evaluation of code results in one of two outcomes:
- True vs False or 0 vs a non-zero value
```{python,eval=T,echo=T}
if 1+1:
print("it's not zero!")
if 1-1:
print("we can't get here!")
```
```{python,echo=F,eval=T}
gene_expression = {}
gene_expression["ACTB"]= 12
gene_expression["TERT"]= 5
```
# Putting `if` to use
```{python,eval=T,echo=T}
highly_expressed = [] #empty array to store the genes that make the cut
expression_threshold = 6 # some criterion we want to apply
this_gene = "ACTB"
#assume we have a dictionary that stores the expression value for every gene
if gene_expression[this_gene] > expression_threshold:
highly_expressed.append(this_gene) #any indented code following the expression will ONLY be run if the expression evaluates to true or a non-zero value
print(highly_expressed)
```
# Putting `if` to use
- Why does this code not change the contents of the highly_expressed array?
```{python,eval=T,echo=T}
this_gene = "TERT" # Do it for another gene and see what changes
if gene_expression[this_gene] > expression_threshold:
highly_expressed.append(this_gene)
print(highly_expressed)
```
# Else is the counterpart to if
:::::::::::::: {.columns}
::: {.column}
- Instead of executing specific code when a condition is met, you may want to use some other code instead
- Combining if/else allows *something* to happen in any condition
- Creates a non-linear path in your program
- After the end of an if/else the code is once again followed linearly
:::
::: {.column}
![](images/ifelse.png)
:::
::::::::::::::
# Putting `else` to use
```{python,eval=T,echo=T}
not_highly_expressed = [] #new array to store some more genes
this_gene = "TERT" # Same gene as previous example
if gene_expression[this_gene] > expression_threshold:
highly_expressed.append(this_gene)
else:
not_highly_expressed.append(this_gene)
print("Highly expressed:",highly_expressed)
print("The rest:",not_highly_expressed)
```
# R syntax
```{R,echo=T,eval=F}
if (some_condition) {
Code that runs when that condition is met
Indentation level of this code is unimportant
The braces define where the code block ends
} else {
Code that runs when the condition is NOT met
}
Code that always runs afterward, regardless of the condition
```
# Python syntax is trickier
```{python,eval=F,echo=T}
# the lines that start with if/else must end with a colon
if some_condition:
some code
some more code
# lines you want run when the condition is met are all at the next indentation level
# This can be one tab or four spaces but don't mix and match!
#the else must be at the same indentation level as the if
else:
some alternative code #etc
more code that always runs
```
# Testing for equal values
- To test if two variables have an *Equal value* we have to use `==`
- Using `=` (the assignment operator) is one of the most common programming errors
```{python,echo=T,eval=T}
print(gene_expression) #our expression values used earlier
if gene_expression["TERT"] == gene_expression["ACTB"]:
print("they are equal")
else:
print("they are not equal")
```
# All the tests
- Using >= instead of > allows you to handle the special condition of the two values being equal
```{python,echo=T,eval=T}
a = 1 #assign the variable
a == 1 # check if it equals 1
a >= 1 # is it greater than or equal to 1?
a > 1 # is it greater than 1?
a <= 1 # etc
a < 1
a != 1 # is it NOT equal?
```
# Boolean logic and variables
- A form of algebra centered around three operators: OR, AND and NOT
- Useful for binary data or logicals (e.g. True vs False, 0 vs 1)
- In Python, a boolean is a simple type of variable for this
```{python,echo=T,eval=T}
a = True #no quotes should be used here
b = False # if we instead put b = "False" this would be a string
type(a)
type(b)
print("a:",a,"b:",b,sep=" ")
```
# Overview of logic and operators
![](images/logic_table.png)
# Boolean logic in Python
```{python,eval=T,echo=T}
a & b # this means logical AND
a and b # Python allows this
```
# Boolean logic in Python
```{python,eval=T,echo=T}
a & b # this means logical AND
a and b # Python allows this
a | b # this means logical OR
a or b # and this
```
# Boolean logic in Python
```{python,eval=T,echo=T}
a and b # are both true?
a or b # Is either true?
a and not b # Is a true but not b?
not b # is b not true (i.e. is it False)?
```
# Logicals in R
- R has a logical variable type that can be either TRUE or FALSE
- R doesn't allow you to use the words `and` and `or`
```{R, eval=T,echo=T}
a = TRUE
b = FALSE
a && b
a || b
```
# What's with the doubling up?
- The `&&` is equivalent to `and` in Python whereas `||` is equivalent to `or`
- Using a single `&` or `|` allows R to perform the evaluation across an entire vector
```{R,echo=T,eval=T}
some_stuff = c(TRUE,FALSE,TRUE)
more_stuff = c(FALSE,FALSE,TRUE)
some_stuff & more_stuff # The right thing to use for vectors with more than one value
some_stuff && more_stuff # Not useful here
some_stuff || more_stuff # NYET!
some_stuff | more_stuff # Yes
```
# Why are logicals/booleans useful?
- Conditionals are actually always using booleans
- You can evaluate multiple conditions at once and logically combine them
```{python,echo=T,eval=T}
a = 1
b = 2
c = 100
b > a
b > a and c > b
if b > a and c > b:
print("c is the boss")
```
# Getting booleans from functions
- Many functions return a boolean based on their input and are somewhat intuitive to understand
```{python,echo=T,eval=T}
seq = "GATTACA"
seq.startswith("G") # does it start with a G?
seq.endswith("G") # does it end with a G?
seq.islower() # is it all lowercase
seq.isupper() # is it all uppercase?
```
# For Loops
- What if you want the same code run on every value in a list or dictionary?
- What if you wanted to process every line of a file in a consistent way (e.g. similar to earlier lectures using the command line)?
- Two types of loops in Python: `for` and `while`
```{python,eval=F,echo=T}
my_seq=list("AAAGCGCCGGGA")
#split sequence and scores into arrays
scores=list("012345543210")
for base in my_seq:
print(base)
#note the formatting, including indentation, is similar to if/else
```
# For Loops
```{python,eval=T,echo=T}
my_seq=list("AAAGCGCCGGGA")
scores=list("012345543210")
for base in my_seq:
print(base)
```
# Loops with counters
```{python,eval=T,echo=T}
i = 0 # a counter variable to track where we are
for base in my_seq:
print("Nucleotide", base, "at index",i,"has score:",scores[i],sep=" ")
i+=1 # same as i = i+1. Increases i with each iteration of the loop
```
```{python,eval=T,echo=F}
genetic_code = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
}
```
# Loops and Dictionaries
```{python,eval=T,echo=T}
# Using the genetic code dictionary from last class
printed = 0
for codon, amino in genetic_code.items():
print(codon,amino)
# use a counter here just for fun
printed += 1
# use "break" to terminate the loop
if printed == 5:
break
```
# R has loops too
- `for` loops are available in R but there are often better alternatives
- Syntax is clunkier and more explicit
- Importantly, the existence of `vectorized` functions make loops unnecessary in many contexts
```{R,echo=T,eval=T}
bignums <- c(10,20,30) # NOTE: R
smallnums <- c() #empty vector for results
for(i in c(1:length(bignums))){ #specify the range explicitly
smallnums[i] <- bignums[i]/10
}
smallnums
```
# Vectorization in R
- Many functions in R automatically work across every element in a vector
- This will be even more relevant once we encounter higher dimensional data structures in R (e.g. matrices and data frames)
```{R,echo=T,eval=T}
smallnums <- bignums / 10 # NOTE: R
smallnums # all elements were affected by the same operation!
biggernums = bignums * c(10,100,1000)
biggernums
# If we operate on two vectors of the same length the behaviour is different
```
# Recap
- A `for` loop will give you access to every value in a list in increasing order from the first index (0 or 1, depending on language) to the last index
- If you want to use the index at any iteration, you can count from 0 upwards using an integer variable
- This is the main option in R
- You can use the dictionary method `items` to iterate over the key and value together (in no particular order!)
- When using R, you can avoid many of the common scenarios that would require a loop in Python
# Reading in data from files
- Programs need access to data in order to be useful
- Reading from a text file is straightforward, especially in conjunction with a loop
```{python,eval=F,echo=T}
input_fasta = "data/human_genes_chr7/ENST00000305119.fa"
fasta_file_handle = open(input_fasta,"r")
# open the file and store file handle
for text_line in fasta_file_handle:
text_line = text_line.rstrip("\n")
#remove newline
print(text_line)
```
# Reading from a FASTA file
```{python,eval=T,echo=T}
input_fasta = "data/human_genes_chr7/ENST00000305119.fa"
fasta_file_handle = open(input_fasta,"r")
for text_line in fasta_file_handle:
text_line = text_line.rstrip("\n")
print(text_line)
```
# Parsing a FASTA file
- Reading data and storing it in a useful way is one form of "parsing"
- Involves deconstructing the relevant components of a file and storing them as variables for use later in the code
- Implementation will depend on the format/structure of the file being read and your needs (i.e. what the program should do)
- e.g. information from header is stored separately from the associated sequence data but linked in some way
```{python,eval=T,echo=F}
fasta_file_handle = open("data/some_human_genes.fa","r")
```
# Reminder: FASTA format
`>ENSG00000001626|ENSG00000001626|ENST00000003084|1|1|CFTR`
`GTAGTAGGTCTTTGGCATTAGGAGCTTGAGCCCAGACGGCCCTAGCAGGGACCCCAG`
`CCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTT`
`CAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTGGAATTGTCAGA`
`>ENSG00000241644|ENSG00000241644|ENST00000013222|1|1|INMT`
`ACATTTCAGGGACACCATGAAGGGTGGCTTCACTGGGGGTGATGAGTACCAGAAGCA`
`CCTGCCCAGGGACTACTTGGCTACTTACTACAGCTTCGATGGCAGCCCCTCACCCGA`
`CGAGATGCTGAAGTTTAACTTGGAATGTCTCCACAAGACCTTCGGCCCTGGAGGCCT`
# Demo FASTA parser
```{python,eval=T,echo=T}
sequences = {}
for text_line in fasta_file_handle:
text_line = text_line.rstrip("\n")
if text_line.startswith(">"):
text_line = text_line.lstrip(">")
header_info = text_line
header_vals = header_info.split("|")
id = header_vals[2]
else:
if id in sequences:
sequences[id] = sequences[id] + text_line
else:
sequences[id] = text_line
```
# Deconstructing the parser
```{python,eval=T,echo=F}
fasta_file_handle = open("data/some_human_genes.fa","r")
```
```{python,eval=T,echo=T}
sequences = {} #dictionary to store sequence and unique identifier
for text_line in fasta_file_handle:
text_line = text_line.rstrip("\n")
print(text_line[0:45]) # for display
# this code just removes the newline and stores the rest as a string
# NOTE: every line is treated the same way here
```
# Deconstructing the parser
```{python,eval=T,echo=F}
fasta_file_handle = open("data/some_human_genes.fa","r")
```
```{python,eval=T,echo=T}
for text_line in fasta_file_handle:
text_line = text_line.rstrip("\n")
if text_line.startswith(">"):
text_line = text_line.lstrip(">")
print(text_line[0:45]) # for display
# NOTE: why don't we see any sequences?
```
# Deconstructing the parser
```{python,eval=T,echo=F}
fasta_file_handle = open("data/some_human_genes.fa","r")
```
```{python,eval=T,echo=T}
for text_line in fasta_file_handle:
text_line = text_line.rstrip("\n")
if text_line.startswith(">"):
text_line = text_line.lstrip(">")
header_info = text_line
header_vals = header_info.split("|")
id = header_vals[2]
print(header_vals[2]) # for display
# NOTE: what is split("|") accomplishing?
```
# Deconstructing the parser
```{python,eval=T,echo=F}
fasta_file_handle = open("data/some_human_genes.fa","r")
```
```{python,eval=T,echo=T}
sequences = {}
for text_line in fasta_file_handle:
text_line = text_line.rstrip("\n")
if text_line.startswith(">"):
text_line = text_line.lstrip(">")
header_info = text_line
header_vals = header_info.split("|")
id = header_vals[2]
else:
# here is where it gets tricky
if id in sequences:
sequences[id] = sequences[id] + text_line
# concatenate to existing sequence
else:
# add a new key to the dictionary
# and put the first line of sequence
sequences[id] = text_line
```
# Concepts and Gotchas: if/else and dictionaries
- Parser used two nested if/else statements
- The second `if` was only ever encountered when the first condition evaluated to False (i.e. line didn't start with ">")
- The second if/else is required because our code attempted to modify a value in the dictionary before the key existed
```{python,eval=F,echo=T}
a_dict = {}
a_dict["gorilla"] += 1 #this won't work because the key doesn't exist
## KeyError: 'gorilla'
a_dict["aardvark"] = 1
# this works because we're not relying on the existence of a value in the dictionary
```
# Concepts and Gotchas: Loops
- In Python, variables that are declared within the loop survive after the last iteration and retain their final value
- This may be convenient but if you ever reuse a variable that gets modified in a loop you need to remember to set it to the starting value you need
- Best practice is to just not recycle variable names
```{python,echo=T,eval=T}
some_number = 0
for i in (0,1,2):
some_number = i * 5
other_number = i * 10
print(i,some_number,other_number)
print(i,some_number,other_number)
some_number = 0 #put it back explicitly to zero if you use it again later
```
# Things to consider
- FASTA parsers exist in BioPython so you shouldn't have to implement one, but you may need to extract information from a complex header
- Our example made assumptions about a consistent format in the header
- What would happen if the ENST identifiers were not unique in the file?
- How might we store all the data in the header in a convenient way?
# Load then process or load/process?
- Parsing or loading the contents of a file is the first step in many workflows
- Depending on the scale of your data, you may not actually want to store everything in memory
- How much RAM do you think the parser example would use if run on the human genome sequence?
- A preferable strategy in such cases is to run any subsequent analyses on-the-fly then reuse the variable for your next task
# Sidebar: defining a function
- Put reusable code into your own functions to reduce duplicated code and enhance readability of complex code
```{python,echo=T,eval=T}
def count_nucleotides(seq):
nucl_dict = {}
bases = list(seq)
for base in bases:
if base in nucl_dict.keys():
nucl_dict[base] += 1
else:
nucl_dict[base] = 1
return(nucl_dict)
```
# A better version
```{python,eval=T,echo=F}
fasta_file_handle = open("data/some_human_genes.fa","r")
```
```{python,eval=T,echo=T}
sequence = "" # for the "current" sequence
# NOTE: why make an empty variable here?
for text_line in fasta_file_handle:
text_line = text_line.rstrip("\n")
if text_line.startswith(">"):
if sequence: # NOTE: this is only false at the first header!
nuc_count = count_nucleotides(sequence)
print(id,sequence[0:10],nuc_count)
sequence = ""
text_line = text_line.lstrip(">")
header_info = text_line
header_vals = header_info.split("|")
id = header_vals[2]
else:
sequence += text_line
nuc_count = count_nucleotides(sequence)
print(id,sequence[0:10],nuc_count)
# NOTE: Why do I need these last two lines?
```
# Lab assignment
- You have been provided a Python script with the demo FASTA parser code `scripts/fasta_parse.py` and access to the reference genome for S. cerevesiae `data/yeast_genome.fa`
- Make a copy of that script and save it as `scripts/assignment-04.py` then modify it to:
- Parse the contents of the yeast genome fasta
- Calculate the length of each chromosome and the number of each nucleotide
- Calculate the %GC in each chromosome
- Report this information along with how many non-ACTG characters were found on each chromosome
# Details
- Your script should print its output to the terminal
- The number of lines printed out should be the number of yeast chromosomes plus a header
- The format of the printed output should look resemble the example below
- The percent GC should be reported to two decimal places
- A zero should be printed in the last column for any chromosome with zero non-ACTG bases
```
chrom size num_A num_C num_G num_T pc_GC not_ACTG
chrI 240542 77836 45740 46965 70001 38.54 3
```