1. FASTQ data quality control
Introduction
fastp
fastp is a tool designed to provide fast all-in-one preprocessing for FASTQ files. It is written in C++ and supports multithreading for high performance.
FASTQ file format
@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
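The four lines of a record are the read identifier, the base sequence, a `+` separator, and one quality character per base. The sketch below shows how such a record could be split and its quality string decoded. It assumes Phred+33 encoding (Q = ASCII code minus 33), which is the usual encoding for recent Illumina data but is not stated above. The function name `parse_fastq_record` is hypothetical, and the record is truncated to its first 10 bases for brevity.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (read_id, sequence, quality_scores)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so Q = ord(character) - 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# Example record, truncated to the first 10 bases of the read shown above.
record = [
    "@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA",
    "AAAAAAAAGC",
    "+",
    "6AAAAAEEEE",
]
read_id, sequence, scores = parse_fastq_record(record)
print(read_id)
print(scores)
```

Under this encoding, `6` corresponds to Q21, `A` to Q32, `E` to Q36, and `/` to Q14.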
Common parameters
qualified_quality_phred the quality value at or above which a base counts as qualified. The default of 15 means that bases with Phred quality >=Q15 are qualified. (int [=15])
length_required reads shorter than length_required are discarded. Default is 15. (int [=15])
n_base_limit if a read contains more than n_base_limit N bases, the read/pair is discarded. Default is 5. (int [=5])
cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])
cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
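As an illustration of how these parameters fit together, the sketch below assembles a paired-end fastp command line in Python. The helper name `build_fastp_command` and all file names are hypothetical. The long option names match the parameter names listed above; `--in1`, `--in2`, `--out1`, `--out2`, `--json`, and `--html` are the fastp options for input, output, and report files. `cut_front` and `cut_tail` are switches that are off unless requested.

```python
import shlex


def build_fastp_command(in1, in2, out1, out2, json_report, html_report,
                        qualified_quality_phred=15, length_required=15,
                        n_base_limit=5, cut_mean_quality=20,
                        cut_window_size=4, cut_front=False, cut_tail=False):
    """Return a fastp argument list for one paired-end sample."""
    cmd = [
        "fastp",
        "--in1", in1, "--in2", in2,
        "--out1", out1, "--out2", out2,
        "--json", json_report, "--html", html_report,
        "--qualified_quality_phred", str(qualified_quality_phred),
        "--length_required", str(length_required),
        "--n_base_limit", str(n_base_limit),
        "--cut_mean_quality", str(cut_mean_quality),
        "--cut_window_size", str(cut_window_size),
    ]
    # Sliding-window trimming is only performed when explicitly enabled.
    if cut_front:
        cmd.append("--cut_front")
    if cut_tail:
        cmd.append("--cut_tail")
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_fastp_command(
    "sample_1.fq.gz", "sample_2.fq.gz",
    "sample_1.clean.fq.gz", "sample_2.clean.fq.gz",
    "sample.json", "sample.html",
    cut_front=True, cut_tail=True,
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where fastp is installed.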
Results
1. Quality-controlled FASTQ sequence files
2. QC statistics reports: *.json or *.html
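The JSON report can be read programmatically to check how many reads survived filtering. The sketch below assumes the report layout produced by recent fastp versions, with `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads`; verify these keys against your own report before relying on them. The function name and the numbers in the example are made up for illustration.

```python
import json


def summarize_fastp_report(report):
    """Return read counts before/after filtering and the fraction kept."""
    # Assumed keys: summary -> before_filtering / after_filtering -> total_reads.
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    fraction_kept = after / before if before else 0.0
    return {"reads_before": before, "reads_after": after,
            "fraction_kept": fraction_kept}


# Synthetic report fragment; a real one would be loaded with:
#   with open("sample.json") as handle:
#       report = json.load(handle)
report = json.loads(
    '{"summary": {"before_filtering": {"total_reads": 1000},'
    ' "after_filtering": {"total_reads": 900}}}'
)
print(summarize_fastp_report(report))
```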
Citing in publications
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560
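2. SOAPdenovo
Introduction
SOAPdenovo2 is a short-read assembler. It can be used to assemble the genomes of large plants and animals as well as bacteria and fungi.
Input
fastq1 file:
FASTQ file containing the left-end reads (read 1 of each pair)
fastq2 file:
FASTQ file containing the right-end reads (read 2 of each pair)
FASTQ file format, for example: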
@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
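K-mer size: kmer (minimum 13, maximum 63 or 127 depending on the binary used): default [23]; the value must be odd
D value: edges (formed by connected k-mers) whose coverage is no greater than this value (edgeCovCutoff) are removed; default [1]
d value: k-mers whose frequency is no greater than this value are removed; default [0]
reads length: the maximum read length. Any read longer than this is truncated to this length. It is usually set slightly shorter than the actual read length.
insert length: the average insert size of the library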
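The inputs above map onto a SOAPdenovo2 library configuration file plus a command line. The sketch below generates both in Python. The helper names and file paths are hypothetical. `reverse_seq=0` (forward-reverse paired-end library) and `asm_flags=3` (use the reads for both contig and scaffold construction) are assumptions about a typical short-insert library, not settings taken from the text above. The `-K`, `-D`, and `-d` options correspond to the K-mer size, D value, and d value.

```python
def make_soapdenovo_config(fastq1, fastq2, max_rd_len, avg_ins):
    """Return the text of a single-library SOAPdenovo2 config file."""
    lines = [
        f"max_rd_len={max_rd_len}",  # reads longer than this are truncated
        "[LIB]",
        f"avg_ins={avg_ins}",        # average insert size of the library
        "reverse_seq=0",             # assumption: forward-reverse paired-end library
        "asm_flags=3",               # assumption: use for contigs and scaffolds
        "rank=1",
        f"q1={fastq1}",              # left-end reads
        f"q2={fastq2}",              # right-end reads
    ]
    return "\n".join(lines) + "\n"


def build_soapdenovo_command(config_path, out_prefix, kmer=23,
                             edge_cov_cutoff=1, kmer_freq_cutoff=0):
    """Return a SOAPdenovo2 'all' argument list for the given settings."""
    if kmer % 2 == 0 or not 13 <= kmer <= 127:
        raise ValueError("kmer must be odd and between 13 and 127")
    # The 63mer binary handles K <= 63; larger K needs the 127mer binary.
    binary = "SOAPdenovo-63mer" if kmer <= 63 else "SOAPdenovo-127mer"
    return [binary, "all", "-s", config_path, "-K", str(kmer),
            "-D", str(edge_cov_cutoff), "-d", str(kmer_freq_cutoff),
            "-o", out_prefix]


# Hypothetical paths and library values, for illustration only.
config_text = make_soapdenovo_config("sample_1.clean.fq", "sample_2.clean.fq",
                                     max_rd_len=145, avg_ins=350)
print(config_text)
print(" ".join(build_soapdenovo_command("soap.config", "asm_k23")))
```

Writing `config_text` to `soap.config` and running the printed command would produce output files with the `asm_k23` prefix.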
Results
*.contig file: contig sequence file
*.scafSeq file: scaffold sequence file
Both files are in FASTA format
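Because both outputs are plain FASTA, their sequence lengths can be summarized with a few lines of Python. The sketch below reads sequence lengths and computes N50, a commonly used assembly contiguity statistic. N50 is an added illustration rather than something prescribed above, and the function names and the toy input are hypothetical.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy FASTA content; a real file would be read with open("asm.scafSeq").
toy_fasta = [">a", "ACGTACGTAC", ">b", "ACGT", "ACGT", ">c", "ACGTAC"]
lengths = read_fasta_lengths(toy_fasta)
print(lengths, n50(lengths))
```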
3. Comparison of assembly results