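Input file: info.csv
Tab-delimited. Column 1 is the sample name, column 2 the fq1 path, and column 3 the fq2 path. Paths are relative to the location of info.csv.
For example:
sample1 sample1.1.fq sample1.2.fq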
sample2 sample2.1.fq sample2.2.fq
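A minimal sketch of how such a sample sheet could be read, assuming plain tab-delimited text with no header row. The function name `read_sample_sheet` is hypothetical and not part of the pipeline; it only illustrates resolving the fq1/fq2 paths against the directory that holds info.csv.

```python
import csv
from pathlib import Path


def read_sample_sheet(info_path):
    """Return a list of (sample, fq1, fq2) with paths resolved against info.csv."""
    info_path = Path(info_path)
    base = info_path.parent  # paths in the sheet are relative to this directory
    samples = []
    with info_path.open(newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or not row[0].strip():
                continue  # skip blank lines
            if len(row) < 3:
                raise ValueError(f"expected 3 tab-separated columns, got: {row}")
            name, fq1, fq2 = (field.strip() for field in row[:3])
            samples.append((name, base / fq1, base / fq2))
    return samples
```

Calling `read_sample_sheet("data/info.csv")` would yield entries such as `("sample1", Path("data/sample1.1.fq"), Path("data/sample1.2.fq"))`.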
1. Data quality control
Introduction
fastp
A tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance.
FASTQ file format
@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
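Each FASTQ record spans four lines: a header starting with `@`, the bases, a `+` separator, and one quality character per base. The sketch below decodes such records, assuming the Phred+33 encoding (an assumption; the document does not state the encoding). The helper names `read_fastq` and `qualified_fraction` are illustrative only.

```python
def read_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4:
        raise ValueError("FASTQ records must have four lines each")
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near: {header}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in: {header}")
        # Phred+33 (assumed): quality score = ASCII code of the character - 33
        yield header[1:], seq, [ord(char) - 33 for char in qual]


def qualified_fraction(qualities, threshold=15):
    """Fraction of bases whose quality is at least `threshold`."""
    return sum(q >= threshold for q in qualities) / len(qualities)
```

Under this encoding the characters `6`, `A`, `/` and `E` in the example record decode to 21, 32, 14 and 36, so only the `/` positions fall under a Q15 threshold.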
Common parameters
qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])
cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
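The sliding-window and filtering parameters above can be illustrated with a simplified sketch. It follows the wording of the parameter descriptions only and is not fastp's actual implementation: here the window advances one base at a time and trimming stops at the first window whose mean quality reaches the threshold. All function names are hypothetical.

```python
def cut_front(quals, window_size=4, mean_quality=20):
    """Return how many leading bases to drop (simplified sliding window)."""
    start = 0
    while start + window_size <= len(quals):
        window = quals[start:start + window_size]
        if sum(window) / window_size >= mean_quality:
            return start  # first window that passes: stop trimming here
        start += 1  # window failed: slide one base toward the tail
    return len(quals)  # no full window passed, so nothing is kept


def cut_tail(quals, window_size=4, mean_quality=20):
    """Return how many trailing bases to drop; mirror image of cut_front."""
    return cut_front(quals[::-1], window_size, mean_quality)


def passes_filters(seq, length_required=15, n_base_limit=5):
    """Apply the length_required and n_base_limit rules to one read."""
    long_enough = len(seq) >= length_required
    few_enough_n = seq.count("N") <= n_base_limit
    return long_enough and few_enough_n
```

For a read with qualities `[2, 2, 2, 30, 30, 30, 30, 30]`, this version of `cut_front` drops the first two bases, because the window starting at the third base is the first one with a mean of at least 20.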
Results
1. Quality-controlled FASTQ sequence files
2. QC statistics files (*.json or *.html)
Citing in publications
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560
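2. Sequence assembly
Introduction
MEGAHIT assembles metagenomic sequencing data. It is fast and uses few computing resources.
Input
fq1 file: FASTQ data for the left-end reads
Format example: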
@A00151:255:HNMLKDSXY:4:1101:8314:7467 1:N:0:TGAGGC
GTCACGCCGTCTCCTCATCTCGGCTCTCTCACCATGCAGTGGTCGAGGGCCGCGCTTTCTTACACCCGGGGAGAGGGGATTCCGGGCGGCGGGGTGCCCGGGACGAGGGAGGCCGGTGCCGCCGCGTTGCCGGCCGCGGGACGCGGTTGCC
+
FFFFFFFFFFFFF,:,FFFFFFFFFF:FFFFFFFF,FF:F,,FFFFFF,FFF::FF,:FF::F,FF,,FFFFF,,::FFFFFFFFFFFF::FFFFFFF:FF:FFFFF:FFFFFF::FF:FFFF:FFFFF:F:FFFFF,:,:F,FFFF,,:F
fq2 file: FASTQ data for the right-end reads
Format example:
@A00151:255:HNMLKDSXY:4:1101:8314:7467 2:N:0:TGAGGC
GGACGTCCCCATGGAGCTCCTGAGCTTACGCAGCGCCGCACGGCAACCGCGTCCGGCGTCGGCAACCGCGTCCGGTGCCCAACCGCGTCCAACGGCCGGCAACCGCGTCCCGCGGCCGGCACCGCGGCGGCACCGGCCTCCCTCGTCCCGG
+
F::F:F:FFFF:FFFFFFFF:FFFFF:FFFFFFFFFF,FFFFF:FFFFFFF,:FF,FFFFFFFF,FFFFF:FF:F::FF,FF:F,FFFFFFF,F::FF,FFFFFFFFFFF,FFF:FF:FFF,FFFFFFFFFFFFF::FF:FF:FF:FFFF,
min contig length: minimum length of an assembled contig; shorter contigs are discarded
k-min: minimum k-mer length
k-max: maximum k-mer length
k step: increment between successive k-mer lengths
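The three k-mer parameters together define a series of k-mer lengths. A minimal sketch of that series follows; the function name `kmer_sizes` is hypothetical and the values 21, 141 and 12 in the usage note are purely illustrative. Whether MEGAHIT also appends k-max when the step does not land on it exactly is not covered here.

```python
def kmer_sizes(k_min, k_max, k_step):
    """List the k-mer lengths from k_min up to k_max in steps of k_step."""
    if k_step <= 0:
        raise ValueError("k_step must be positive")
    if k_min > k_max:
        raise ValueError("k_min must not exceed k_max")
    return list(range(k_min, k_max + 1, k_step))
```

For example, `kmer_sizes(21, 141, 12)` gives 21, 33, 45 and so on up to 141, eleven values in total.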
Results
final.contigs.fa
For example:
>k97_872 flag=0 multi=67.7803 len=320
GCCTGCGCCTCGATCGGATCACCCAGCCTCGTCCCCGTCCCATGCGCCTCCACCACATCCACCTCGGACGCCGACACCCCCGCGTTCTCCAACGCCCGCCGGATCACCCGCTGCTGCGACGGACCATTCGGCGCCATCAACCCATTCGACGCACCATCCTGATTCACCGCCGAACCACGCACCACCGCCAACACCCGATGCCCAAAACGACGAGCATCCGACAAACGCTCCACCACCAACACACCCACACCCTCACCCCAACCCGTCCCATCAGC
CCCCTCGGCAAACGACCTGCACCGACCATCAACCGACAACCCGCG
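The contig header carries `key=value` attributes after the contig name. A small sketch of splitting such a header, with a check against a minimum contig length, is shown here; `parse_contig_header` and `long_enough` are hypothetical helper names, and the attribute values are kept as strings because their exact meaning is not described in this document.

```python
def parse_contig_header(header):
    """Split '>k97_872 flag=0 multi=67.7803 len=320' into name and attributes."""
    fields = header.lstrip(">").split()
    name, attrs = fields[0], {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        attrs[key] = value
    return name, attrs


def long_enough(header, min_contig_length):
    """True if the 'len' attribute is at least min_contig_length."""
    _, attrs = parse_contig_header(header)
    return int(attrs["len"]) >= min_contig_length
```

With the example header, `parse_contig_header` returns the name `k97_872` and the attributes `flag`, `multi` and `len`.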
3. Gene prediction
Introduction
Uses the metagene software for fast gene prediction on metagenomes
Input
Genome sequences (FASTA format)
For example:
>scaffold3
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTC...
>scaffold4
GGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGAC...
Results
Example file:
# scaffold4 61.7
# gc = 0.698128
# bacteria
1 4513 - 0 539.403 partial (lack 3'-end)
4450 5127 - 0 37.5313 complete
# scaffold3 61.3
# gc = 0.738649
# bacteria
1 85 - 0 10.6447 partial (lack 3'-end)
30 2327 + 0 433.438 complete
2354 9268 + 0 1287.43 complete
9184 9295 + 0 20.3604 partial (lack 3'-end)
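A sketch of reading this result table into Python follows. It assumes, from the example alone, that every sequence is introduced by exactly three `#` lines (name, gc, domain) and that the gene columns are start, end, strand, frame, score and completeness; these column meanings are inferred and should be checked against the metagene documentation. `parse_metagene` is a hypothetical name.

```python
def parse_metagene(lines):
    """Return {sequence_name: [(start, end, strand, frame, score, status), ...]}."""
    genes = {}
    current = None
    comment_count = 0  # position inside the assumed 3-line comment block
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if comment_count % 3 == 0:
                # first comment line of a block carries the sequence name
                current = line.lstrip("#").split()[0]
                genes[current] = []
            comment_count += 1
            continue
        comment_count = 0
        if current is None:
            raise ValueError("gene line found before any sequence header")
        start, end, strand, frame, score, status = line.split(None, 5)
        genes[current].append(
            (int(start), int(end), strand, int(frame), float(score), status)
        )
    return genes
```

Applied to the example, the result has two keys, `scaffold4` with two genes and `scaffold3` with four.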
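4. Gene sequences
Introduction
The metagene gene prediction software produces its results as a table that records the position of each gene within the long sequences.
Using this table together with the long sequences used for prediction, the nucleotide sequences of the genes are extracted.
Input
metagene prediction result table:
For example: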
# scaffold4 61.7
# gc = 0.698128
# bacteria
1 4513 - 0 539.403 partial (lack 3'-end)
4450 5127 - 0 37.5313 complete
# scaffold3 61.3
# gc = 0.738649
# bacteria
1 85 - 0 10.6447 partial (lack 3'-end)
30 2327 + 0 433.438 complete
2354 9268 + 0 1287.43 complete
9184 9295 + 0 20.3604 partial (lack 3'-end)
Long sequence file used for prediction (FASTA format)
For example:
>scaffold4
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA.....
>scaffold3
TGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCGAGTACCAATAATAAAGTGA......
Results
Nucleotide sequence file of the genes (FASTA format)
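The extraction step can be sketched as follows, assuming the table coordinates are 1-based, inclusive, and given on the forward strand, so that genes on the `-` strand need to be reverse-complemented. These are assumptions inferred from the example table, and the frame column is ignored. The helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Return {name: sequence}; the name is the first word after '>'."""
    sequences, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {key: "".join(parts) for key, parts in sequences.items()}


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_gene(sequence, start, end, strand):
    """Cut [start, end] (assumed 1-based, inclusive); flip minus-strand genes."""
    sub = sequence[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub
```

For a gene row such as `30 2327 + ...` under `# scaffold3`, the call would be `extract_gene(sequences["scaffold3"], 30, 2327, "+")`.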
5. Non-redundant gene set
Introduction
Uses the cd-hit software to remove redundant sequences from a FASTA file
Input
FASTA sequence file:
For example:
>seqname1
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTC
>seqname2
GGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGAC
identity: sequence identity threshold, default 0.9
Results
A new FASTA sequence file with redundancy removed
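To illustrate what an identity threshold does, here is a toy greedy clustering in the spirit of redundancy removal: sequences are visited from longest to shortest, and a sequence is dropped when it is at least `threshold` identical to an already kept representative. This is not cd-hit's algorithm. The identity measure is a stand-in built on Python's `difflib` (matched characters divided by the length of the shorter sequence), and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def identity(shorter, longer):
    """Stand-in identity: matched characters / length of the shorter sequence."""
    matcher = SequenceMatcher(None, shorter, longer, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(shorter)


def greedy_dereplicate(records, threshold=0.9):
    """records: {name: sequence}. Return the names kept as representatives."""
    representatives = []
    by_length = sorted(records.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in by_length:
        if not seq:
            continue  # ignore empty sequences
        redundant = any(
            identity(seq, records[rep]) >= threshold for rep in representatives
        )
        if not redundant:
            representatives.append(name)
    return representatives
```

In the example input, seqname2 is fully contained in seqname1, so with a threshold of 0.9 only seqname1 would be kept by this sketch.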
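6. Gene quantification
Introduction
Salmon: fast, alignment-free quantification of metagenomic genes
Input
1. Reference sequence index: path to the index built from the reference sequences with salmon index
2. fq1: left-end read sequence file
3. fq2: right-end read sequence file
Results
The output file is quant.sf, for example: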
Name Length EffectiveLength TPM NumReads
g1 256 36.763 9545.604194 5.000
g2 298 61.085 4595.839603 4.000
g3 299 61.650 5195.523516 4.564
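A sketch of loading quant.sf and of how the TPM column relates to the other two numeric columns: each gene's rate is NumReads divided by EffectiveLength, and TPM is that rate scaled so that all genes sum to one million. The function names are hypothetical. Recomputing TPM over only the three example rows renormalizes them, so the absolute values differ from the table while the ratios between genes stay the same.

```python
def read_quant(lines):
    """Parse quant.sf lines into {gene: {column: float}}."""
    rows = iter(lines)
    header = next(rows).split()
    table = {}
    for line in rows:
        if not line.strip():
            continue
        record = dict(zip(header, line.split()))
        name = record.pop("Name")
        table[name] = {column: float(value) for column, value in record.items()}
    return table


def tpm(table):
    """Recompute TPM from NumReads and EffectiveLength over the given genes."""
    rates = {
        name: row["NumReads"] / row["EffectiveLength"] for name, row in table.items()
    }
    total = sum(rates.values())
    return {name: rate / total * 1e6 for name, rate in rates.items()}
```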
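7. Gene amino acid sequences
Introduction
Translates gene nucleotide sequences into amino acid sequences
Uses the transeq program from EMBOSS-6.5.7
Input
Gene nucleotide sequence file (FASTA format)
code table [0] Code to use (Values: 0 (Standard); 1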
(Standard (with alternative initiation
codons)); 2 (Vertebrate Mitochondrial); 3
(Yeast Mitochondrial); 4 (Mold, Protozoan,
Coelenterate Mitochondrial and
Mycoplasma/Spiroplasma); 5 (Invertebrate
Mitochondrial); 6 (Ciliate Macronuclear and
Dasycladacean); 9 (Echinoderm
Mitochondrial); 10 (Euplotid Nuclear); 11
(Bacterial); 12 (Alternative Yeast Nuclear);
13 (Ascidian Mitochondrial); 14 (Flatworm
Mitochondrial); 15 (Blepharisma
Macronuclear); 16 (Chlorophycean
Mitochondrial); 21 (Trematode
Mitochondrial); 22 (Scenedesmus obliquus);
23 (Thraustochytrium Mitochondrial))
trim parameter [N] This removes all 'X' and '*' characters
from the right end of the translation. The
trimming process starts at the end and
continues until the next character is not a
'X' or a '*'
clean parameter [N] This changes all STOP codon positions
from the '*' character to 'X' (an unknown
residue). This is useful because some
programs will not accept protein sequences
with '*' characters in them.
Results
Amino acid file in FASTA format
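The translation step and the effect of the trim and clean options can be sketched in plain Python. This is not transeq itself: it translates reading frame 1 only, uses the standard codon assignments (which table 11, Bacterial, shares, differing only in start codons), and applies trim before clean, an ordering chosen here for illustration. The function name `translate` is hypothetical.

```python
from itertools import product

_BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each of the three positions
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(nucleotides, trim=False, clean=False):
    """Translate frame 1; unknown codons become 'X', stop codons '*'."""
    seq = nucleotides.upper().replace("U", "T")
    protein = "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )
    if trim:
        protein = protein.rstrip("X*")  # drop trailing 'X' and '*' characters
    if clean:
        protein = protein.replace("*", "X")  # mask stop codons as unknown
    return protein
```

For the input `ATGAAATAA` this gives `MK*` by default, `MK` with `trim=True`, and `MKX` with `clean=True`.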
8. Gene annotation
Introduction
**EggNOG-mapper** is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database (http://eggnog5.embl.de) to transfer functional information from fine-grained orthologs only.
Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.
The use of orthology predictions for functional annotation permits a higher precision than traditional homology searches (i.e. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).
Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan [can be found here](https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb).
EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de
# Documentation
https://github.com/jhcepas/eggnog-mapper/wiki
If you use this software, please cite:
[1] eggNOG-mapper v2: functional annotation, orthology assignments, and domain
prediction at the metagenomic scale. Carlos P. Cantalapiedra,
Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021.
Molecular Biology and Evolution, msab293, https://doi.org/10.1093/molbev/msab293
[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated
orthology resource based on 5090 organisms and 2502 viruses. Jaime
Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia
K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars
J Jensen, Christian von Mering, Peer Bork. Nucleic Acids Res. 2019 Jan 8;
47(Database issue): D309–D314. doi: 10.1093/nar/gky1085
Input
Protein sequence file of the genes (FASTA format)
For example:
>geneName1
MKLLAHILCLSLALAWAQSQDHALAVLDRCEGLEMDAVAVNEEGIPYFFKGDHLFKGFHG
>geneName2
MWVGEERFEGSRLVVVTRGAVSVGGEGVEDVGGGAVWGLVRSAQSEHPGRFVLVDADVDA
DVDTGVVPDVVGLGESQVAVRGGRVWVPRLVGVNSGGGVRAGGGVVRRGLGSGVALVTGG
TGLLGGLVARHLVSAYGVGELVLVSRRGPGAPGVGALVGELEELGAGVRVVACDVADRGA
VAELVGSIEGLRVVVHAAGAVDDGVIGSLDGGRLRGVMGPKAWGAWHLHELTSGLDLS
Results
Annotation result table file
Format example:
#query	seed_ortholog	evalue	score	eggNOG_OGs	max_annot_lvl	COG_category	Description	Preferred_name	GOs	EC	KEGG_ko	KEGG_Pathway	KEGG_Module	KEGG_Reaction	KEGG_rclass	BRITE	KEGG_TC	CAZy	BiGG_Reaction	PFAMs
geneName3	494419.ALPM01000100_gene1074	4.15e-05	48.9	COG0747@1|root,COG0747@2|Bacteria,2GM5G@201174|Actinobacteria	201174|Actinobacteria	E	ABC transporter substrate-binding protein	-	-	-	ko:K02035	ko02024,map02024	M00239	-	-	ko00000,ko00001,ko00002,ko02000	3.A.1.5	-	-	SBP_bac_5
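A sketch of reading the annotation table, assuming it is tab-separated, that the column header is the line starting with `#query`, that any lines starting with `##` are comments, and that `-` marks an empty field. These conventions are inferred from the example and should be checked against the actual output. Both function names are hypothetical.

```python
import csv


def read_annotations(lines):
    """Return one dict per annotated gene, keyed by the header column names."""
    header = None
    rows = []
    for fields in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields or fields[0].startswith("##"):
            continue  # blank line or comment line
        if fields[0].startswith("#"):
            header = [fields[0].lstrip("#")] + fields[1:]
            continue
        if header is None:
            raise ValueError("data row found before the '#query' header line")
        rows.append(dict(zip(header, fields)))
    return rows


def kegg_orthologs(row):
    """Split the KEGG_ko field into K numbers; '-' is treated as empty."""
    value = row.get("KEGG_ko", "-")
    if value == "-":
        return []
    return [ko[3:] if ko.startswith("ko:") else ko for ko in value.split(",")]
```

For the example row, `kegg_orthologs` returns a single entry, `K02035`.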
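9. Taxonomic annotation
Introduction
MetaPhlAn is a taxonomic profiling tool for next-generation sequencing data. It produces a list of the taxa found in a metagenome together with their relative abundances.
It can take FASTQ data directly.
Input
FASTQ format file:
For example: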
@A00151:255:HNMLKDSXY:4:1101:8314:7467 1:N:0:TGAGGC
GTCACGCCGTCTCCTCATCTCGGCTCTCTCACCATGCAGTGGTCGAGGGCCGCGCTTTCTTACACCCGGGGAGAGGGGATTCCGGGCGGCGGGGTGCCCGGGACGAGGGAGGCCGGTGCCGCCGCGTTGCCGGCCGCGGGACGCGGTTGCC
+
FFFFFFFFFFFFF,:,FFFFFFFFFF:FFFFFFFF,FF:F,,FFFFFF,FFF::FF,:FF::F,FF,,FFFFF,,::FFFFFFFFFFFF::FFFFFFF:FF:FFFFF:FFFFFF::FF:FFFF:FFFFF:F:FFFFF,:,:F,FFFF,,:F
Results
profiled_metagenome.txt
For example:
#SampleID Metaphlan_Analysis
#clade_name NCBI_tax_id relative_abundance
k__Bacteria 2 100.0
k__Bacteria|p__Actinobacteria 2|201174 100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria 2|201174|1760 100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales 2|201174|1760|85011 100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales|f__Streptomycetaceae 2|201174|1760|85011|2062 100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales|f__Streptomycetaceae|g__Streptomyces 2|201174|1760|85011|2062|1883 100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales|f__Streptomycetaceae|g__Streptomyces|s__Streptomyces_violaceusniger 2|201174|1760|85011|2062|1883|68280 60.69895
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales|f__Streptomycetaceae|g__Streptomyces|s__Streptomyces_melanosporofaciens 2|201174|1760|85011|2062|1883|67327 34.98288
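A sketch of pulling the species-level rows out of such a profile. It assumes whitespace-separated columns in the order shown in the example (clade name, NCBI taxonomy ids, relative abundance) and treats a row as species-level when the last element of the clade name starts with `s__`. The function name `species_abundance` is hypothetical.

```python
def species_abundance(lines):
    """Return {species_name: relative_abundance} for species-level rows."""
    result = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        clade, abundance = fields[0], float(fields[2])
        last_rank = clade.split("|")[-1]
        if last_rank.startswith("s__"):
            result[last_rank[3:]] = abundance
    return result
```

On the example profile this keeps two entries, Streptomyces_violaceusniger and Streptomyces_melanosporofaciens, with their relative abundances in percent.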