1. FASTQ data quality control
Introduction
fastp
fastp is a tool designed to provide fast all-in-one preprocessing for FASTQ files. It is written in C++ with multithreading support for high performance.
FASTQ file format
@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
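Each record in the format above spans four lines: a header starting with `@`, the bases, a `+` separator, and one quality character per base. The following is a minimal sketch of reading such records and decoding the quality string, assuming the Phred+33 encoding used by current Illumina output. The function names `parse_fastq` and `phred_scores` and the short sample record are illustrative and not part of fastp.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) tuples from 4-line FASTQ records."""
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


# Hypothetical short record, using the same quality characters as the example above.
sample = ["@read1 1:N:0:TATAGCCT", "ACGT", "+", "6AE/"]
records = list(parse_fastq(sample))
name, seq, qual = records[0]
scores = phred_scores(qual)
```

With this encoding, the characters `6`, `A`, `E` and `/` seen in the example correspond to Q21, Q32, Q36 and Q14.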
Common parameters
qualified_quality_phred the minimum quality value for a base to count as qualified. The default 15 means bases with phred quality >=Q15 are qualified. (int [=15])
length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
n_base_limit if a read contains more than n_base_limit N bases, the read/pair is discarded. Default is 5 (int [=5])
cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])
cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
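The sliding-window options can be illustrated with a small sketch. This is a simplified reading of the descriptions above, not fastp's actual implementation: it assumes the window advances one base at a time and that a read with no passing window is trimmed completely. The function names and the sample quality values are hypothetical.

```python
from typing import List


def cut_front(quals: List[int], window: int = 4, threshold: int = 20) -> int:
    """Return the index of the first base kept when trimming from the 5' end."""
    start = 0
    while start + window <= len(quals):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q >= threshold:
            return start          # window passes: stop trimming here
        start += 1                # window fails: drop one base and slide on
    return len(quals)             # no window passed: nothing is kept


def cut_tail(quals: List[int], window: int = 4, threshold: int = 20) -> int:
    """Return the end index (exclusive) kept when trimming from the 3' end."""
    end = len(quals)
    while end - window >= 0:
        mean_q = sum(quals[end - window:end]) / window
        if mean_q >= threshold:
            return end
        end -= 1
    return 0


# Hypothetical Phred scores: low quality at both ends, high in the middle.
quals = [2, 2, 30, 30, 30, 30, 30, 2, 2]
kept = quals[cut_front(quals):cut_tail(quals)]
```

Because the decision is based on the window mean, a single low-quality base can survive next to high-quality neighbours, which is why `kept` here still starts and ends with a Q2 base.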
Results
1. Quality-controlled FASTQ sequence files
2. QC statistics report files, *.json or *.html
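As a rough sketch of how the parameters and outputs above fit together, the helper below assembles a paired-end fastp command line as an argument list. The helper name `build_fastp_command`, the file names, and the report prefix are assumptions made for illustration; the long option names follow fastp's documented options, but check `fastp --help` for the installed version before relying on them.

```python
from typing import List


def build_fastp_command(
    in1: str,
    in2: str,
    out1: str,
    out2: str,
    report_prefix: str,
    qualified_quality_phred: int = 15,
    length_required: int = 15,
    n_base_limit: int = 5,
    cut_front: bool = False,
    cut_tail: bool = False,
    cut_window_size: int = 4,
    cut_mean_quality: int = 20,
) -> List[str]:
    """Build a fastp argument list for one paired-end sample."""
    cmd = [
        "fastp",
        "--in1", in1, "--in2", in2,
        "--out1", out1, "--out2", out2,
        "--qualified_quality_phred", str(qualified_quality_phred),
        "--length_required", str(length_required),
        "--n_base_limit", str(n_base_limit),
        "--json", f"{report_prefix}.json",
        "--html", f"{report_prefix}.html",
    ]
    if cut_front or cut_tail:
        # The window size and mean quality are shared by both cut options.
        cmd += ["--cut_window_size", str(cut_window_size),
                "--cut_mean_quality", str(cut_mean_quality)]
    if cut_front:
        cmd.append("--cut_front")
    if cut_tail:
        cmd.append("--cut_tail")
    return cmd


# Hypothetical file names.
cmd = build_fastp_command(
    "sample_1.fq.gz", "sample_2.fq.gz",
    "clean_1.fq.gz", "clean_2.fq.gz",
    "sample_qc", cut_tail=True,
)
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where fastp is installed; keeping it as a list avoids shell quoting problems with file names.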
Citing in publications
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560
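2. SOAPdenovo2 assembly
Introduction
SOAPdenovo2 is a short-read assembler. It can be used to assemble large plant and animal genomes as well as bacterial and fungal genomes.
Input
fastq1 file:
FASTQ file containing the left-end reads
fastq2 file:
FASTQ file containing the right-end reads
FASTQ file format, for example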
@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
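K-mer value: k-mer size (minimum 13, maximum 63/127), default [23]; the value must be odd
D value: remove edges formed by connected k-mers whose frequency is no greater than this value (edgeCovCutoff), default [1]
d value: remove k-mers whose frequency is no greater than this value, default [0]
reads length: maximum read length; any read longer than this is trimmed to this length. This value is usually set slightly shorter than the actual read length
insert length: average insert size of the library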
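The inputs above map onto a SOAPdenovo2 library configuration file plus a handful of command-line options. The sketch below shows one possible mapping for a single paired-end library. The function names are hypothetical; the values `reverse_seq=0`, `asm_flags=3` and `rank=1` are assumptions that suit a typical short-insert paired-end library; and the binary names `SOAPdenovo-63mer` and `SOAPdenovo-127mer` may differ depending on how the software was installed.

```python
from typing import List


def soapdenovo_config(fastq1: str, fastq2: str,
                      max_read_len: int, avg_insert: int) -> str:
    """Return the text of a one-library SOAPdenovo2 config file."""
    lines = [
        f"max_rd_len={max_read_len}",  # reads longer than this are cut
        "[LIB]",
        f"avg_ins={avg_insert}",       # average insert size of the library
        "reverse_seq=0",               # assumption: forward-reverse paired-end
        "asm_flags=3",                 # assumption: use reads for contigs and scaffolds
        "rank=1",
        f"q1={fastq1}",
        f"q2={fastq2}",
    ]
    return "\n".join(lines) + "\n"


def soapdenovo_command(config_path: str, out_prefix: str, kmer: int = 23,
                       kmer_freq_cutoff: int = 0,
                       edge_cov_cutoff: int = 1) -> List[str]:
    """Build the argument list for a full 'all' assembly run."""
    if kmer % 2 == 0 or not 13 <= kmer <= 127:
        raise ValueError("k-mer size must be odd and between 13 and 127")
    # The 63mer build handles k <= 63; larger k needs the 127mer build.
    binary = "SOAPdenovo-63mer" if kmer <= 63 else "SOAPdenovo-127mer"
    return [
        binary, "all",
        "-s", config_path,
        "-K", str(kmer),
        "-d", str(kmer_freq_cutoff),   # the d value
        "-D", str(edge_cov_cutoff),    # the D value (edgeCovCutoff)
        "-o", out_prefix,
    ]


# Hypothetical paths and library values.
config_text = soapdenovo_config("clean_1.fq", "clean_2.fq", 100, 300)
cmd = soapdenovo_command("lib.cfg", "asm", kmer=23)
```

Writing `config_text` to `lib.cfg` and running `cmd` would be expected to produce `asm.contig` and `asm.scafSeq` in the working directory.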
Results
*.contig file: contig sequence file
*.scafSeq file: scaffold sequence file
Both files are in FASTA format
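3. Gap closing for assembled sequences
Introduction
Assembled sequences contain gaps. GapCloser fills these gaps using the reads, making the assembly more complete.
GapCloser is commonly used to fill gaps after assembly with SOAPdenovo2.
Input
fasta file: the sequences obtained from assembly
For example,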
>seqname
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTC
fastq1 file:
FASTQ file containing the left-end reads
fastq2 file:
FASTQ file containing the right-end reads
FASTQ file format, for example
@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
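reads length: maximum read length; any read longer than this is trimmed to this length. This value is usually set slightly shorter than the actual read length
insert length: average insert size of the library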
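A GapCloser run can be sketched as an argument list in the same way. This sketch assumes the standalone `GapCloser` binary distributed with SOAPdenovo2, which takes the scaffold FASTA with `-a`, a library config file in the SOAPdenovo2 format (carrying the FASTQ paths and insert size) with `-b`, the output file with `-o`, and the maximum read length with `-l`. The helper name and file names are hypothetical, and the options should be checked against the installed version.

```python
from typing import List


def gapcloser_command(scaffold_fasta: str, config_path: str,
                      output_fasta: str, max_read_len: int = 100,
                      threads: int = 1) -> List[str]:
    """Build the argument list for one GapCloser run."""
    if max_read_len <= 0 or threads <= 0:
        raise ValueError("max_read_len and threads must be positive")
    return [
        "GapCloser",
        "-a", scaffold_fasta,      # assembled sequences containing gaps
        "-b", config_path,         # library config: FASTQ paths, insert size
        "-o", output_fasta,        # gap-filled FASTA output
        "-l", str(max_read_len),   # the reads length input
        "-t", str(threads),
    ]


# Hypothetical file names.
cmd = gapcloser_command("asm.scafSeq", "lib.cfg", "asm.gapclosed.fa",
                        max_read_len=95, threads=4)
```

Note that the insert length is not a command-line option in this sketch: it is read from the `avg_ins` entry of the config file.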
Results
A new FASTA file
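One simple way to check what gap closing achieved is to count the runs of `N` in the sequences before and after. The sketch below does this for a single sequence string; the function name `gap_summary` and the two sample sequences are made up for illustration.

```python
import re
from typing import Tuple


def gap_summary(sequence: str) -> Tuple[int, int]:
    """Return (number of gaps, total gap bases), where a gap is a run of N."""
    runs = re.findall(r"[Nn]+", sequence)
    return len(runs), sum(len(run) for run in runs)


# Hypothetical scaffold before and after gap closing.
before = "ACGTNNNNACGTNNAC"
after = "ACGTTTGAACGTNNAC"
gaps_before = gap_summary(before)
gaps_after = gap_summary(after)
```

In this made-up example the first gap has been filled and the second remains, so the gap count drops from 2 to 1.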
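4. Assembly sequence statistics
Introduction
Reports the number of sequences, number of bases, number of Large sequences, number of bases in Large sequences, maximum length, GC content, N50, N90, N rate, and so on
Input
fasta sequence file:
Format, for example,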
>seqname1
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTC
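Results
Generates a sequence statistics table
All_Num : number of sequences
All_Bases : total number of bases
Large_Num : number of Large sequences
Large_Bases : total number of bases in Large sequences
Largest_Bases : number of bases in the longest sequence
N50 : N50 sequence length
N90 : N90 sequence length
G+C : GC content
N_Rate : N rate (proportion of unknown bases)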
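The columns above can be computed directly from a FASTA file. The following is a minimal sketch, not the tool's own implementation. It makes two assumptions that the text does not state: a "Large" sequence is one of at least `large_cutoff` bases (a hypothetical parameter, set to 1000 here), and both GC content and N rate are fractions of all bases, including N.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into {name: sequence}, joining wrapped lines."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def nx(lengths: List[int], fraction: float) -> int:
    """Length L such that sequences of length >= L hold >= fraction of all bases."""
    target = sum(lengths) * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def assembly_stats(seqs: Dict[str, str], large_cutoff: int = 1000) -> Dict[str, float]:
    """Compute the summary table for a set of assembled sequences."""
    lengths = [len(s) for s in seqs.values()]
    total = sum(lengths)
    large = [n for n in lengths if n >= large_cutoff]  # assumed definition of Large
    upper = [s.upper() for s in seqs.values()]
    gc = sum(s.count("G") + s.count("C") for s in upper)
    n_bases = sum(s.count("N") for s in upper)
    return {
        "All_Num": len(lengths),
        "All_Bases": total,
        "Large_Num": len(large),
        "Large_Bases": sum(large),
        "Largest_Bases": max(lengths, default=0),
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
        "G+C": gc / total if total else 0.0,
        "N_Rate": n_bases / total if total else 0.0,
    }


# Hypothetical three-sequence assembly; the first sequence is wrapped over two lines.
fasta_lines = [">a contig", "ACGTA", "CGTAC", ">b", "GGCCNNAA", ">c", "AT"]
stats = assembly_stats(read_fasta(fasta_lines), large_cutoff=5)
```

For the three sample sequences of 10, 8 and 2 bases, the longest sequence alone covers half of the 20 bases, so N50 is 10, and the two longest together cover 90%, so N90 is 8.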