-
Notifications
You must be signed in to change notification settings - Fork 0
/
AssemblyLog.txt
501 lines (395 loc) · 24.2 KB
/
AssemblyLog.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
## Content
# Aim and reference genome
# 1. Initial steps
# 2. spades de novo assembly
# 3. possible deduplication and normalization steps before assembly
# 4. read mapping on reference genomes
# 5. contig mapping on reference genome
# 6. blast against ref genome and use contigs as query
# 7. spades assemlby using reference genome as trusted contigs
##############################
## Aim and reference genomes ##
"Obtain a full Pseudoalteromonas rubra genome"
Reference genome based on 16S rRNA gene blast result: Pseudoalteromonas rubra strain SCSIO 6842
All 5,913,780
Chromosome 1 4,532,889
Chromosome 2 1,311,023
Plasmid pMBL6842 69,868
https://www.ncbi.nlm.nih.gov/assembly/GCA_001482385.1
*NEW* Reference genome (12.01.2019): Pseudoalteromonas sp. R3
All 5,824,782
Chromosome 1 4,496,250
Chromosome 2 1,328,532
https://www.ncbi.nlm.nih.gov/assembly/GCA_004014715.1
######################
## 1. Initial steps ##
# Quality control
fastqc *.gz
multiqc .
# Sample Name % Dups % GC M Seqs
# NG-19286_1783_0s_lib326994_6367_2_1 22.0% 47% 4.5
# NG-19286_1783_0s_lib326994_6367_2_2 22.0% 47% 4.5
# NG-19286_1783_0s_lib326994_6377_2_1 31.4% 47% 3.8
# NG-19286_1783_0s_lib326994_6377_2_2 30.9% 47% 3.8
# merge libraries
cat *_1.fastq.gz > NG_1783_R1.fastq.gz
cat *_2.fastq.gz > NG_1783_R2.fastq.gz
################################
## 2. spades de novo assembly ##
spades.py -1 NG_1783_R1.fastq.gz -2 NG_1783_R2.fastq.gz -o spades_NG_1783_denovo_cf_oa_v1 -t 2 -k 21,33,55,77 --careful --only-assembler
A C G T N IUPAC Other GC GC_stdev
0.2651 0.2433 0.2318 0.2598 0.0000 0.0000 0.0000 0.4751 0.0794
Main genome scaffold total: 531
Main genome contig total: 531
Main genome scaffold sequence total: 5.975 MB
Main genome contig sequence total: 5.975 MB 0.000% gap
Main genome scaffold N/L50: 8/222.155 KB
Main genome contig N/L50: 8/222.155 KB
Main genome scaffold N/L90: 27/48.635 KB
Main genome contig N/L90: 27/48.635 KB
Max scaffold length: 861.495 KB
Max contig length: 861.495 KB
Number of scaffolds > 50 KB: 25
% main genome in scaffolds > 50 KB: 88.77%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 531 531 5,974,904 5,974,904 100.00%
50 531 531 5,974,904 5,974,904 100.00%
100 456 456 5,968,435 5,968,435 100.00%
250 156 156 5,914,072 5,914,072 100.00%
500 79 79 5,890,926 5,890,926 100.00%
1 KB 61 61 5,879,752 5,879,752 100.00%
2.5 KB 46 46 5,856,032 5,856,032 100.00%
5 KB 45 45 5,852,679 5,852,679 100.00%
10 KB 43 43 5,835,401 5,835,401 100.00%
25 KB 35 35 5,690,421 5,690,421 100.00%
50 KB 25 25 5,303,816 5,303,816 100.00%
100 KB 21 21 5,020,661 5,020,661 100.00%
250 KB 6 6 2,522,577 2,522,577 100.00%
500 KB 1 1 861,495 861,495 100.00%
# 2.1 remove short contigs < 500 or < 1000
# 2.2 can I bin the contigs into the 3 SCISO genome parts? cocacola binning?
# 2.3 Normalized and error correction
spades.py -1 NG_1783_R1_bbduk_bbnorm.fastq.gz -2 NG_1783_R2_bbduk_bbnorm.fastq.gz -o spades_NG_1783_bbduk_norm_denovo_cf_v1 -t 8 -k 21,33,55,77 --careful --only-assembler -t 8
# contigs.fasta
A C G T N IUPAC Other GC GC_stdev
0.2618 0.2356 0.2397 0.2630 0.0000 0.0000 0.0000 0.4753 0.0750
Main genome scaffold total: 275
Main genome contig total: 275
Main genome scaffold sequence total: 5.921 MB
Main genome contig sequence total: 5.921 MB 0.000% gap
Main genome scaffold N/L50: 9/217.637 KB
Main genome contig N/L50: 9/217.637 KB
Main genome scaffold N/L90: 27/64.374 KB
Main genome contig N/L90: 27/64.374 KB
Max scaffold length: 861.162 KB
Max contig length: 861.162 KB
Number of scaffolds > 50 KB: 28
% main genome in scaffolds > 50 KB: 91.31%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 275 275 5,921,002 5,921,002 100.00%
50 275 275 5,921,002 5,921,002 100.00%
100 256 256 5,919,449 5,919,449 100.00%
250 137 137 5,897,892 5,897,892 100.00%
500 89 89 5,882,304 5,882,304 100.00%
1 KB 67 67 5,867,484 5,867,484 100.00%
2.5 KB 47 47 5,836,786 5,836,786 100.00%
5 KB 43 43 5,821,648 5,821,648 100.00%
10 KB 41 41 5,807,276 5,807,276 100.00%
25 KB 35 35 5,677,780 5,677,780 100.00%
50 KB 28 28 5,406,418 5,406,418 100.00%
100 KB 20 20 4,837,585 4,837,585 100.00%
250 KB 8 8 2,893,114 2,893,114 100.00%
500 KB 1 1 861,162 861,162 100.00%
# scaffolds.fasta
A C G T N IUPAC Other GC GC_stdev
0.2618 0.2356 0.2397 0.2630 0.0002 0.0000 0.0000 0.4753 0.0767
Main genome scaffold total: 261
Main genome contig total: 275
Main genome scaffold sequence total: 5.922 MB
Main genome contig sequence total: 5.921 MB 0.023% gap
Main genome scaffold N/L50: 5/286.09 KB
Main genome contig N/L50: 9/217.64 KB
Main genome scaffold N/L90: 18/75.54 KB
Main genome contig N/L90: 27/64.375 KB
Max scaffold length: 1.079 MB
Max contig length: 861.162 KB
Number of scaffolds > 50 KB: 22
% main genome in scaffolds > 50 KB: 95.36%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 261 275 5,922,228 5,920,853 99.98%
50 261 275 5,922,228 5,920,853 99.98%
100 242 256 5,920,675 5,919,300 99.98%
250 123 137 5,899,118 5,897,743 99.98%
500 75 89 5,883,530 5,882,155 99.98%
1 KB 54 68 5,869,437 5,868,062 99.98%
2.5 KB 34 48 5,838,739 5,837,364 99.98%
5 KB 30 44 5,823,601 5,822,226 99.98%
10 KB 28 42 5,809,229 5,807,854 99.98%
25 KB 25 39 5,740,947 5,739,572 99.98%
50 KB 22 35 5,647,263 5,645,985 99.98%
100 KB 16 27 5,225,129 5,224,050 99.98%
250 KB 6 13 3,487,853 3,487,165 99.98%
500 KB 2 7 2,104,224 2,103,734 99.98%
1 MB 2 7 2,104,224 2,103,734 99.98%
spades.py -1 NG_1783_R1_bbduk_bbnorm.fastq.gz -2 NG_1783_R2_bbduk_bbnorm.fastq.gz -o spades_NG_1783_bbduk_norm_denovo_cf_v2 -t 8 -k 21,33,55,77,99,127 --careful --only-assembler -t 8
A C G T N IUPAC Other GC GC_stdev
0.2615 0.2340 0.2413 0.2633 0.0001 0.0000 0.0000 0.4753 0.0719
Main genome scaffold total: 135
Main genome contig total: 140
Main genome scaffold sequence total: 5.940 MB
Main genome contig sequence total: 5.940 MB 0.008% gap
Main genome scaffold N/L50: 5/291.264 KB
Main genome contig N/L50: 6/288.762 KB
Main genome scaffold N/L90: 16/141.784 KB
Main genome contig N/L90: 18/120.405 KB
Max scaffold length: 1.081 MB
Max contig length: 861.675 KB
Number of scaffolds > 50 KB: 21
% main genome in scaffolds > 50 KB: 97.87%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 135 140 5,940,390 5,939,899 99.99%
100 135 140 5,940,390 5,939,899 99.99%
250 121 126 5,937,916 5,937,425 99.99%
500 57 62 5,916,971 5,916,480 99.99%
1 KB 37 42 5,904,263 5,903,772 99.99%
2.5 KB 30 35 5,893,488 5,892,997 99.99%
5 KB 24 28 5,873,720 5,873,326 99.99%
10 KB 24 28 5,873,720 5,873,326 99.99%
25 KB 22 26 5,843,200 5,842,806 99.99%
50 KB 21 25 5,813,874 5,813,480 99.99%
100 KB 18 21 5,607,634 5,607,338 99.99%
250 KB 8 9 3,842,700 3,842,602 100.00%
500 KB 2 3 1,897,797 1,897,699 99.99%
1 MB 1 2 1,081,451 1,081,353 99.99%
spades.py -1 NG_1783_R1_bbduk_bbnorm.v2.fastq.gz -2 NG_1783_R2_bbduk_bbnorm.v2.fastq.gz -o spades_NG_1783_bbduk_norm_denovo_cf_v3 -t 8 -k 21,33,55,77,99,127 --careful --only-assembler -t 8
A C G T N IUPAC Other GC GC_stdev
0.2615 0.2350 0.2403 0.2632 0.0001 0.0000 0.0000 0.4753 0.0745
Main genome scaffold total: 125
Main genome contig total: 130
Main genome scaffold sequence total: 5.937 MB
Main genome contig sequence total: 5.936 MB 0.008% gap
Main genome scaffold N/L50: 5/291.264 KB
Main genome contig N/L50: 7/277.876 KB
Main genome scaffold N/L90: 16/141.784 KB
Main genome contig N/L90: 19/120.405 KB
Max scaffold length: 1.085 MB
Max contig length: 862.609 KB
Number of scaffolds > 50 KB: 21
% main genome in scaffolds > 50 KB: 97.79%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 125 130 5,936,839 5,936,348 99.99%
100 125 130 5,936,839 5,936,348 99.99%
250 107 112 5,933,601 5,933,110 99.99%
500 57 62 5,918,071 5,917,580 99.99%
1 KB 38 43 5,905,932 5,905,441 99.99%
2.5 KB 31 36 5,894,265 5,893,774 99.99%
5 KB 26 30 5,875,660 5,875,266 99.99%
10 KB 24 28 5,865,361 5,864,967 99.99%
25 KB 22 26 5,834,843 5,834,449 99.99%
50 KB 21 25 5,805,517 5,805,123 99.99%
100 KB 18 21 5,599,238 5,598,942 99.99%
250 KB 8 10 3,846,716 3,846,518 99.99%
500 KB 2 4 1,901,739 1,901,541 99.99%
1 MB 1 2 1,085,062 1,084,962 99.99%
# 2.4 map reads on contigs - or create an anvio file
#######################################################################
## 3. possible deduplication and normalization steps before assembly ##
# adapter and contamination removal
~/phylo/bbmap/bbduk.sh in1=NG_1783_R1.fastq.gz in2=NG_1783_R2.fastq.gz out1=NG_1783_R1_bbduk.fastq.gz out2=NG_1783_R2_bbduk.fastq.gz ref=~/phylo/bbmap/resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
BBDuk version 37.66
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=1407m, free=1376m, used=31m
Added 216529 kmers; time: 0.372 seconds.
Memory: max=1407m, free=1310m, used=97m
Input is being processed as paired
Started output streams: 0.362 seconds.
Processing time: 500.187 seconds.
Input: 16560750 reads 2500673250 bases.
KTrimmed: 109614 reads (0.66%) 2898976 bases (0.12%)
Trimmed by overlap: 52638 reads (0.32%) 495056 bases (0.02%)
Total Removed: 50 reads (0.00%) 3394032 bases (0.14%)
Result: 16560700 reads (100.00%) 2497279218 bases (99.86%)
Time: 500.947 seconds.
Reads Processed: 16560k 33.06k reads/sec
Bases Processed: 2500m 4.99m bases/sec
# normalization
~/phylo/bbmap/bbnorm.sh in1=NG_1783_R1_bbduk.fastq.gz in2=NG_1783_R2_bbduk.fastq.gz out1=NG_1783_R1_bbduk_bbnorm.fastq.gz out2=NG_1783_R2_bbduk_bbnorm.fastq.gz target=100 min=3
Total reads in: 14506466 36.101% Kept
Total bases in: 2187508126 36.103% Kept
Error reads in: 493769 3.404%
Error pairs in: 429288 5.919%
Error type 1: 26345 0.182%
Error type 2: 466599 3.216%
Error type 3: 471033 3.247%
Total kmers counted: 1752271855
Total unique kmer count: 32740694
Includes forward kmers only.
The unique kmer estimate can be more accurate than the unique count, if the tables are very full.
The most accurate value is the greater of the two.
Percent unique: 1.87%
Depth average: 53.52 (unique kmers)
Depth median: 1 (unique kmers)
Depth standard deviation: 111.46 (unique kmers)
Corrected depth average: 53.52
Depth average: 285.63 (all kmers)
Depth median: 286 (all kmers)
Depth standard deviation: 75.83 (all kmers)
Approx. read depth median: 357.03
Removing temp files.
Total time: 509.698 seconds. 9191.30 kb/sec
~/phylo/bbmap/bbnorm.sh in1=NG_1783_R1_bbduk.fastq.gz in2=NG_1783_R2_bbduk.fastq.gz out1=NG_1783_R1_bbduk_bbnorm.v2.fastq.gz out2=NG_1783_R2_bbduk_bbnorm.v2.fastq.gz target=100 min=5
# deduplication
~/phylo/bbmap/dedupe.sh in1=NG_1783_R1_bbduk_bbnorm.fastq.gz in2=NG_1783_R2_bbduk_bbnorm.fastq.gz out=NG_1783_bbduk_bbnorm_dedup.fastq.gz ac=f
reformat.sh in=NG_1783_bbduk_bbnorm_dedup.fastq.gz out1=NG_1783_R1_bbduk_bbnorm_dedup.fastq.gz out2=NG_1783_R2_bbduk_bbnorm_dedup.fastq.gz
Pair Limitations:
Dedupe supports paired reads, but it was not really designed for them. When processing paired reads, some parts of Dedupe are restricted to a single thread due to a complication that causes non-deterministic output. As such, processing paired reads is slower than unpaired reads. Also, pair support is limited to exact matches and overlaps, not containments.
# alternatively
# clumpify and deduplication
clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=0
Paired reads:
Clumpify supports paired reads, in which case it will clump based on read 1 only. However, it’s much more effective to treat reads as unpaired. For example, merge the reads with BBMerge, then concatenate the merged reads with the unmerged pairs, and clump them all together as unpaired.
# PE merge - optional might need concatenation of unmerged reads
bbmerge.sh in1=NG_1783_R1.fastq.gz in2=NG_1783_R2.fastq.gz out=NG_1783_merged.fastq.gz outu=NG_1783_unmerged.fastq.gz ihist=ihist.txt
(((NOTE - maybe just bbduk and bbnorm - too many issues with dedupe and clumpify and bbmerge - first spades assembly already looked really good)))
#########################################
## 4. read mapping on reference genome ##
bowtie2-build mapping/GCF_001482385.1_ASM148238v1_genomic.fna mapping/GCF_001482385.1_ASM148238v1_genomic
bowtie2 --threads 2 --very-sensitive -x mapping/GCF_001482385.1_ASM148238v1_genomic -1 NG_1783_R1_bbduk.fastq -2 NG_1783_R2_bbduk.fastq -S mapping/NG_1783.sam --un-conc-gz mapping/unmapped.fastq.gz --al-conc-gz mapping/mapped.fastq.gz
8280350 reads; of these:
8280350 (100.00%) were paired; of these:
4437287 (53.59%) aligned concordantly 0 times
3738078 (45.14%) aligned concordantly exactly 1 time
104985 (1.27%) aligned concordantly >1 times
----
4437287 pairs aligned concordantly 0 times; of these:
913846 (20.59%) aligned discordantly 1 time
----
3523441 pairs aligned 0 times concordantly or discordantly; of these:
7046882 mates make up the pairs; of these:
6016678 (85.38%) aligned 0 times
945395 (13.42%) aligned exactly 1 time
84809 (1.20%) aligned >1 times
63.67% overall alignment rate
samtools view -F 4 -bS mapping/NG_1783.sam > mapping/NG_1783.bam
###########################################
## 5. contig mapping on reference genome ##
bowtie2 --sensitive-local --threads 8 -f -x ./mapping/GCF_001482385.1_ASM148238v1_genomic -U ./spades_NG_1783_bbduk_norm_denovo_cf_v1/contigs.fasta -S ./mapping/NG_1783_contigs.sam --un ./mapping/unmapped_contigs.fasta --al mapping/mapped_contigs.fasta
(((NOTE - does not work - probably not a good idea to align large contigs against 2 chromosomes and 1 plasmid)))
# try mummer?
##########################################################
## 6. blast against ref genome and use contigs as query ##
# 6.1 SCSIO 6842 BLAST
makeblastdb -in GCF_001482385.1_ASM148238v1_genomic.fna -out GCF_001482385.1_ASM148238v1_genomic -dbtype 'nucl' -hash_index
blastn -query fastq/spades_NG_1783_denovo_cf_oa_v1/contigs.fasta -max_target_seqs 3 -evalue 100 -outfmt 6 -out spades_NG_1783_denovo_cf_oa_v1_contigs.txt -db SCSIO6842blast/GCF_001482385.1_ASM148238v1_genomic
(((NOTE - repeat with contigs > 1000 and more strict parameters)))
blastn -query spades_NG_1783_bbduk_norm_denovo_cf_v1/contigs.1000.fasta -max_target_seqs 1 -max_hsps 1 -evalue 10 -outfmt 6 -out spades_NG_1783_bbduk_norm_denovo_cf_v1_contigs.txt -db SCSIO6842blast/GCF_001482385.1_ASM148238v1_genomic -num_threads 8
NODE_2_length_327220_cov_63.750956 NZ_CP013611.1 327220
NODE_3_length_316322_cov_63.479498 NZ_CP013611.1 316322
NODE_4_length_297949_cov_63.535154 NZ_CP013611.1 297949
NODE_5_length_286090_cov_63.590326 NZ_CP013611.1 286090
NODE_6_length_277140_cov_63.813418 NZ_CP013611.1 277140
NODE_7_length_266901_cov_63.819840 NZ_CP013611.1 266901
NODE_8_length_260330_cov_63.529200 NZ_CP013611.1 260330
NODE_10_length_216936_cov_63.406799 NZ_CP013611.1 216936
NODE_12_length_200388_cov_63.302819 NZ_CP013611.1 200388
NODE_13_length_184034_cov_63.957930 NZ_CP013611.1 184034
NODE_14_length_159356_cov_63.602703 NZ_CP013611.1 159356
NODE_15_length_141684_cov_63.284696 NZ_CP013611.1 141684
NODE_16_length_134502_cov_63.500911 NZ_CP013611.1 134502
NODE_17_length_129528_cov_63.669767 NZ_CP013611.1 129528
NODE_18_length_125958_cov_63.280106 NZ_CP013611.1 125958
NODE_19_length_120306_cov_63.348768 NZ_CP013611.1 120306
NODE_20_length_112311_cov_63.463897 NZ_CP013611.1 112311
NODE_21_length_77902_cov_63.242313 NZ_CP013611.1 77902
NODE_22_length_76233_cov_63.488038 NZ_CP013611.1 76233
NODE_23_length_75540_cov_63.537495 NZ_CP013611.1 75540
NODE_24_length_75504_cov_64.217256 NZ_CP013611.1 75504
NODE_25_length_71373_cov_63.538039 NZ_CP013611.1 71373
NODE_26_length_70217_cov_63.288238 NZ_CP013611.1 70217
NODE_27_length_64374_cov_63.131515 NZ_CP013611.1 64374
NODE_28_length_57690_cov_63.738080 NZ_CP013611.1 57690
NODE_29_length_49133_cov_63.147750 NZ_CP013611.1 49133
NODE_30_length_46418_cov_63.072657 NZ_CP013611.1 46418
NODE_31_length_41267_cov_63.989536 NZ_CP013611.1 41267
NODE_32_length_39271_cov_63.085217 NZ_CP013611.1 39271
NODE_33_length_38901_cov_64.398645 NZ_CP013611.1 38901
NODE_34_length_28998_cov_62.107534 NZ_CP013611.1 28998
NODE_35_length_27374_cov_64.186248 NZ_CP013611.1 27374
NODE_36_length_24923_cov_63.533406 NZ_CP013611.1 24923
NODE_37_length_24753_cov_63.117685 NZ_CP013611.1 24753
NODE_38_length_23927_cov_63.261342 NZ_CP013611.1 23927
NODE_39_length_18799_cov_63.748478 NZ_CP013611.1 18799
NODE_41_length_18488_cov_63.921243 NZ_CP013611.1 18488
NODE_42_length_8884_cov_63.743045 NZ_CP013611.1 8884
NODE_46_length_3769_cov_62.202059 NZ_CP013611.1 3769
NODE_50_length_2389_cov_61.981834 NZ_CP013611.1 2389
NODE_51_length_2108_cov_49.321024 NZ_CP013611.1 2108
NODE_52_length_2101_cov_66.837945 NZ_CP013611.1 2101
NODE_55_length_1604_cov_69.470203 NZ_CP013611.1 1604
NODE_56_length_1436_cov_58.327447 NZ_CP013611.1 1436
NODE_57_length_1404_cov_55.879427 NZ_CP013611.1 1404
NODE_58_length_1344_cov_62.865354 NZ_CP013611.1 1344
NODE_59_length_1296_cov_59.514356 NZ_CP013611.1 1296
NODE_62_length_1109_cov_67.309109 NZ_CP013611.1 1109
NODE_63_length_1052_cov_58.640000 NZ_CP013611.1 1052
NODE_64_length_1051_cov_107.898357 NZ_CP013611.1 1051
NODE_65_length_1012_cov_63.378610 NZ_CP013611.1 1012
NODE_66_length_1010_cov_64.316184 NZ_CP013611.1 1010
4539609
NODE_1_length_861162_cov_63.446345 NZ_CP013612.1 861162
NODE_9_length_217637_cov_63.358880 NZ_CP013612.1 217637
NODE_11_length_201831_cov_63.706063 NZ_CP013612.1 201831
NODE_40_length_18606_cov_63.958497 NZ_CP013612.1 18606
NODE_43_length_5488_cov_62.840325 NZ_CP013612.1 5488
NODE_44_length_4298_cov_60.231936 NZ_CP013612.1 4298
NODE_45_length_4075_cov_61.164082 NZ_CP013612.1 4075
NODE_48_length_2489_cov_67.313433 NZ_CP013612.1 2489
NODE_49_length_2437_cov_61.465678 NZ_CP013612.1 2437
NODE_53_length_1756_cov_64.752829 NZ_CP013612.1 1756
NODE_61_length_1115_cov_60.167630 NZ_CP013612.1 1115
NODE_67_length_1000_cov_60.972914 NZ_CP013612.1 1000
1321894
Missing
NODE_47_length_2996_cov_62.363823 (Vibrio blast nr/nt hit - low similarity)
NODE_54_length_1729_cov_54.748789 (no blast nr/nt hit)
NODE_60_length_1256_cov_57.432570 (no blast nr/nt hit)
# 6.2 R3 BLAST
makeblastdb -in GCF_004014715.1_ASM401471v1_genomic.fna -out GCF_004014715.1_ASM401471v1_genomic -dbtype 'nucl' -hash_index
blastn -query spades_NG_1783_bbduk_norm_denovo_cf_v1/contigs.1000.fasta -max_target_seqs 1 -max_hsps 1 -evalue 10 -outfmt 6 -out spades_NG_1783_bbduk_norm_denovo_cf_v1_contigs.txt -db R3blast/GCF_004014715.1_ASM401471v1_genomic -num_threads 8
##################################################################
## 7. spades assemlby using reference genome as trusted contigs ##
# not a good idea - spades says "Note, that SPAdes does not perform assembly using genomes of closely-related species. Only contigs of the same genome should be specified."
####################################
## 8. Cocacola binning of spades_NG_1783_bbduk_norm_denovo_cf_v1 - why not? ##
bowtie2-build scaffolds.2500.fasta scaffolds.2500
bowtie2 --threads 8 -x mapping/scaffolds.2500 -1 NG_1783_R1_bbduk.fastq -2 NG_1783_R2_bbduk.fastq -S mapping/scaffolds.2500.sam
samtools view -F 4 -bS mapping/NG_1783.sam > mapping/NG_1783.bam
~/work/CONCOCT-0.4.2/scripts/gen_input_table.py Cr15_CrambeBetaBac-only-trusted_contigs.1000.fasta Cr15_only.init.bam > Cr15_only_concoct_coverage.tsv
~/work/CONCOCT-0.4.2/scripts/fasta_to_features.py Cr15_CrambeBetaBac-only-trusted_contigs.1000.fasta 63359 4 Cr15_only_concoct_kmer_4.csv
python cocacola.py \
--contig_file ~/work/CrMg/TethyBinning/BinsOnly/cocacola/Cr15_only.1500.fasta \
--abundance_profiles ~/work/CrMg/TethyBinning/BinsOnly/cocacola/Cr15_1500_only_concoct_coverage.tsv \
--composition_profiles ~/work/CrMg/TethyBinning/BinsOnly/cocacola/Cr15_1500_only_concoct_kmer_4.csv \
--output ~/work/CrMg/TethyBinning/BinsOnly/cocacola/1500/Cr15only_cocacola_result.csv
replace , with |
awk -F\| '{print>$1}' file1
do some manual work then
for f in $PWD/1500/*;
do ./faSomeRecords Cr15_only.1500.fasta $f $f.fasta;
done