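NewMer Help Documentation

Input file: info.csv

The file is tab-delimited (despite the .csv extension): column 1 is the sample name, column 2 is the fq1 path, and column 3 is the fq2 path. Paths are relative to the location of info.csv. For example (columns separated by a single tab):

    sample1    sample1.1.fq    sample1.2.fq
    sample2    sample2.1.fq    sample2.2.fq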
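A minimal sketch of how such a sample sheet could be read, assuming exactly the three tab-separated columns described above. The function name `read_sample_sheet` is hypothetical and not part of NewMer.

```python
import csv
from pathlib import Path


def read_sample_sheet(info_path):
    """Return (sample, fq1, fq2) tuples with paths resolved against info.csv."""
    info_path = Path(info_path)
    base_dir = info_path.parent  # paths in the sheet are relative to this directory
    samples = []
    with info_path.open(newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or not row[0].strip():
                continue  # skip blank lines
            if len(row) < 3:
                raise ValueError(f"expected 3 tab-separated columns, got: {row!r}")
            name, fq1, fq2 = (field.strip() for field in row[:3])
            samples.append((name, base_dir / fq1, base_dir / fq2))
    return samples
```

Calling `read_sample_sheet("project/info.csv")` would resolve `sample1.1.fq` to `project/sample1.1.fq`.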

1. Data quality control

Introduction

fastp is a tool designed to provide fast all-in-one preprocessing for FASTQ files. It is developed in C++ with multithreading support to achieve high performance.

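FASTQ file format

Each read occupies four lines: an identifier line starting with "@", the base sequence, a separator line starting with "+", and a quality string with one character per base. For example:

    @NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
    AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
    +
    6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA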
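As an illustration of the four-line record layout, the following sketch reads FASTQ records from an open text handle. It assumes unwrapped records (one line each for sequence and quality); `FastqRecord` and `read_fastq` are hypothetical names used only here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


example = io.StringIO("@read1 1:N:0:TATAGCCT+GACCCCCA\nACGTN\n+\n6AE/E\n")
for record in read_fastq(example):
    print(record.name, len(record.sequence))
```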

Common parameters

qualified_quality_phred: the quality value at which a base counts as qualified. The default of 15 means that a Phred quality >= Q15 is qualified. (int [=15])

length_required: reads shorter than length_required are discarded. Default is 15. (int [=15])

n_base_limit: if the number of N bases in a read is > n_base_limit, the read/pair is discarded. Default is 5. (int [=5])

cut_mean_quality: the mean quality requirement shared by cut_front, cut_tail and cut_sliding. Range: 1~36, default: 20 (Q20). (int [=20])

cut_window_size: the window size shared by cut_front, cut_tail and cut_sliding. Range: 1~1000, default: 4. (int [=4])

cut_front: move a sliding window from front (5') to tail; drop the bases in the window if its mean quality < threshold, stop otherwise.

cut_tail: move a sliding window from tail (3') to front; drop the bases in the window if its mean quality < threshold, stop otherwise.
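The sketch below illustrates the idea behind cut_tail, length_required and n_base_limit. It is a simplified reading of the parameter descriptions above, not fastp's actual implementation: it assumes Phred+33 quality encoding, slides the window one base at a time, and leaves reads shorter than the window untouched. All function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def cut_tail(sequence, quality, window_size=4, mean_quality=20):
    """Trim low-quality bases from the 3' end with a sliding window (simplified)."""
    scores = phred_scores(quality)
    if len(scores) < window_size:
        return sequence, quality  # too short to evaluate a window; left unchanged here
    end = len(scores)
    while end >= window_size:
        window = scores[end - window_size:end]
        if sum(window) / window_size >= mean_quality:
            break  # first window that meets the threshold: stop trimming
        end -= 1  # drop the 3'-most base and slide the window toward the front
    else:
        end = 0  # no window met the threshold, so nothing is kept
    return sequence[:end], quality[:end]


def passes_filters(sequence, length_required=15, n_base_limit=5):
    """Keep a read only if it is long enough and has at most n_base_limit N bases."""
    return len(sequence) >= length_required and sequence.upper().count("N") <= n_base_limit
```

For instance, with the default window of 4 and threshold of 20, a read whose last four qualities are all `#` (Q2) loses bases from the 3' end until the window mean reaches Q20.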

Results

1. Quality-controlled FASTQ sequence files
2. QC statistics files (*.json or *.html)

Citation

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560

2. Sequence assembly

Introduction

MEGAHIT assembles metagenomic sequencing data. Assembly is relatively fast and resource consumption is low.

Input

fq1 file: the left-end (read 1) reads in FASTQ format, for example:

    @A00151:255:HNMLKDSXY:4:1101:8314:7467 1:N:0:TGAGGC
    GTCACGCCGTCTCCTCATCTCGGCTCTCTCACCATGCAGTGGTCGAGGGCCGCGCTTTCTTACACCCGGGGAGAGGGGATTCCGGGCGGCGGGGTGCCCGGGACGAGGGAGGCCGGTGCCGCCGCGTTGCCGGCCGCGGGACGCGGTTGCC
    +
    FFFFFFFFFFFFF,:,FFFFFFFFFF:FFFFFFFF,FF:F,,FFFFFF,FFF::FF,:FF::F,FF,,FFFFF,,::FFFFFFFFFFFF::FFFFFFF:FF:FFFFF:FFFFFF::FF:FFFF:FFFFF:F:FFFFF,:,:F,FFFF,,:F

fq2 file: the right-end (read 2) reads in FASTQ format, for example:

    @A00151:255:HNMLKDSXY:4:1101:8314:7467 2:N:0:TGAGGC
    GGACGTCCCCATGGAGCTCCTGAGCTTACGCAGCGCCGCACGGCAACCGCGTCCGGCGTCGGCAACCGCGTCCGGTGCCCAACCGCGTCCAACGGCCGGCAACCGCGTCCCGCGGCCGGCACCGCGGCGGCACCGGCCTCCCTCGTCCCGG
    +
    F::F:F:FFFF:FFFFFFFF:FFFFF:FFFFFFFFFF,FFFFF:FFFFFFF,:FF,FFFFFFFF,FFFFF:FF:F::FF,FF:F,FFFFFFF,F::FF,FFFFFFFFFFF,FFF:FF:FFF,FFFFFFFFFFFFF::FF:FF:FF:FFFF,

min contig length: the minimum length of an assembled contig; shorter contigs are discarded.

k-min: the minimum k-mer length.

k-max: the maximum k-mer length.

k-step: the increment in k-mer length between successive iterations.
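To show how these inputs map onto a MEGAHIT invocation, here is a hedged sketch that builds the command line as an argument list. The helper names are hypothetical. The default values are the ones I believe `megahit --help` reports and should be checked against the installed version. `kmer_sizes` assumes a plain arithmetic progression, which may differ from how MEGAHIT finalizes its k-mer list.

```python
def kmer_sizes(k_min=21, k_max=141, k_step=12):
    """List the k-mer sizes implied by k-min, k-max and k-step (simple progression)."""
    if k_min % 2 == 0 or k_max % 2 == 0:
        raise ValueError("k-min and k-max must be odd")
    if k_step % 2 != 0:
        raise ValueError("k-step must be even")
    if k_min > k_max:
        raise ValueError("k-min must not exceed k-max")
    return list(range(k_min, k_max + 1, k_step))


def build_megahit_command(fq1, fq2, out_dir, min_contig_len=200,
                          k_min=21, k_max=141, k_step=12):
    """Return the MEGAHIT command as a list suitable for subprocess.run."""
    kmer_sizes(k_min, k_max, k_step)  # validates the k-mer settings
    return [
        "megahit",
        "-1", str(fq1),
        "-2", str(fq2),
        "--min-contig-len", str(min_contig_len),
        "--k-min", str(k_min),
        "--k-max", str(k_max),
        "--k-step", str(k_step),
        "-o", str(out_dir),
    ]


print(" ".join(build_megahit_command("sample1.1.fq", "sample1.2.fq", "sample1_assembly")))
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where MEGAHIT is installed.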

Results

final.contigs.fa, for example:

    >k97_872 flag=0 multi=67.7803 len=320
    GCCTGCGCCTCGATCGGATCACCCAGCCTCGTCCCCGTCCCATGCGCCTCCACCACATCCACCTCGGACGCCGACACCCCCGCGTTCTCCAACGCCCGCCGGATCACCCGCTGCTGCGACGGACCATTCGGCGCCATCAACCCATTCGACGCACCATCCTGATTCACCGCCGAACCACGCACCACCGCCAACACCCGATGCCCAAAACGACGAGCATCCGACAAACGCTCCACCACCAACACACCCACACCCTCACCCCAACCCGTCCCATCAGC
    CCCCTCGGCAAACGACCTGCACCGACCATCAACCGACAACCCGCG
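The contig header carries `key=value` fields. A small sketch, with the hypothetical helper `parse_contig_header`, shows how they could be pulled apart, for example to filter contigs by `len`. It assumes headers shaped like the example above.

```python
def parse_contig_header(header):
    """Split a header such as '>k97_872 flag=0 multi=67.7803 len=320' into fields."""
    fields = header.lstrip(">").split()
    record = {"name": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        record[key] = value
    # Convert the fields seen in the example to numbers when present.
    for key, caster in (("flag", int), ("multi", float), ("len", int)):
        if key in record:
            record[key] = caster(record[key])
    return record


print(parse_contig_header(">k97_872 flag=0 multi=67.7803 len=320"))
```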

3. Gene prediction

Introduction

Uses the MetaGene software for fast gene prediction on metagenomic sequences.

Input

Genome sequences (FASTA format), for example:

    >scaffold3
    AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTC...
    >scaffold4
    GGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGAC...

Results

The output file, for example:

    # scaffold4 61.7
    # gc = 0.698128
    # bacteria
    1 4513 - 0 539.403 partial (lack 3'-end)
    4450 5127 - 0 37.5313 complete
    # scaffold3 61.3
    # gc = 0.738649
    # bacteria
    1 85 - 0 10.6447 partial (lack 3'-end)
    30 2327 + 0 433.438 complete
    2354 9268 + 0 1287.43 complete
    9184 9295 + 0 20.3604 partial (lack 3'-end)
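A sketch of a parser for this table follows. The column meanings (start, end, strand, frame, score, completeness) are my reading of the example and should be treated as an assumption, as should the rule that the first `#` line of each block names the sequence. The sketch also assumes every block contains at least one gene line. `Gene` and `parse_metagene` are hypothetical names.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    seq_id: str
    start: int
    end: int
    strand: str
    frame: int
    score: float
    status: str


def parse_metagene(text: str) -> List[Gene]:
    """Parse MetaGene-style output into a flat list of gene records."""
    genes = []
    seq_id = None
    prev_was_comment = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if not prev_was_comment:
                # First '#' line after gene lines (or at file start): sequence name.
                seq_id = line[1:].split()[0]
            prev_was_comment = True
            continue
        prev_was_comment = False
        parts = line.split(None, 5)  # keep 'partial (lack 3'-end)' as one field
        status = parts[5] if len(parts) > 5 else ""
        genes.append(Gene(seq_id, int(parts[0]), int(parts[1]), parts[2],
                          int(parts[3]), float(parts[4]), status))
    return genes
```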

4. Gene sequences

Introduction

The MetaGene gene prediction software outputs its results as a table that gives the position of each gene within the long sequences. Using this table together with the long sequences used for prediction, the nucleotide sequence of each gene is extracted.

Input

The MetaGene prediction table, for example:

    # scaffold4 61.7
    # gc = 0.698128
    # bacteria
    1 4513 - 0 539.403 partial (lack 3'-end)
    4450 5127 - 0 37.5313 complete
    # scaffold3 61.3
    # gc = 0.738649
    # bacteria
    1 85 - 0 10.6447 partial (lack 3'-end)
    30 2327 + 0 433.438 complete
    2354 9268 + 0 1287.43 complete
    9184 9295 + 0 20.3604 partial (lack 3'-end)

The long-sequence file used for prediction (FASTA format), for example:

    >scaffold4
    AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA.....
    >scaffold3
    TGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCGAGTACCAATAATAAAGTGA......

Results

A file of gene nucleotide sequences (FASTA format)
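The extraction step can be sketched as below. It assumes coordinates are 1-based and inclusive, and that genes on the "-" strand should be reported as the reverse complement. Both points are assumptions about the pipeline rather than documented behavior, and the helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped sequence lines."""
    parts = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}


def extract_gene(sequences, seq_id, start, end, strand):
    """Cut one gene out of its parent sequence (1-based, inclusive coordinates)."""
    gene = sequences[seq_id][start - 1:end]
    return reverse_complement(gene) if strand == "-" else gene
```

For example, `extract_gene({"s1": "ATGCCCTAA"}, "s1", 4, 9, "-")` yields `TTAGGG`.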

5. Non-redundant gene set

Introduction

Uses the CD-HIT software to remove redundant sequences from a FASTA file.

Input

FASTA sequence file, for example:

    >seqname1
    AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTC
    >seqname2
    GGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGAC

identity: sequence identity threshold, default 0.9
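As a rough illustration of how the identity threshold reaches CD-HIT, the sketch below builds a command line. It assumes nucleotide input and therefore defaults to the `cd-hit-est` program from the CD-HIT suite. Which program NewMer actually calls is not stated here, and `build_cdhit_command` is a hypothetical helper.

```python
def build_cdhit_command(input_fasta, output_fasta, identity=0.9, program="cd-hit-est"):
    """Return a CD-HIT command as a list; -c carries the identity threshold."""
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return [program, "-i", str(input_fasta), "-o", str(output_fasta), "-c", str(identity)]


print(" ".join(build_cdhit_command("genes.fa", "genes.nr.fa")))
```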

Results

A new FASTA sequence file with the redundant sequences removed

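6. Gene quantification

Introduction

Salmon performs fast, alignment-free quantification of metagenomic genes.

Input

1. Reference index: the path to the index built from the reference sequences with salmon index
2. fq1: the left-end (read 1) sequence file
3. fq2: the right-end (read 2) sequence file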

Results

The output file is quant.sf, for example:

    Name  Length  EffectiveLength  TPM          NumReads
    g1    256     36.763           9545.604194  5.000
    g2    298     61.085           4595.839603  4.000
    g3    299     61.650           5195.523516  4.564
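The sketch below reads a quant.sf table and recomputes TPM from NumReads and EffectiveLength using the usual definition, TPM_i = 10^6 × (NumReads_i / EffectiveLength_i) / Σ_j (NumReads_j / EffectiveLength_j). It assumes a tab-separated file with the five columns shown. The function names are hypothetical. Because the three example rows are only an excerpt, the recomputed values match the reported ones in ratio, not in absolute value.

```python
import csv
import io


def read_quant(handle):
    """Read a tab-separated quant.sf stream into a list of typed rows."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "Name": row["Name"],
            "Length": int(row["Length"]),
            "EffectiveLength": float(row["EffectiveLength"]),
            "TPM": float(row["TPM"]),
            "NumReads": float(row["NumReads"]),
        })
    return rows


def tpm_from_counts(rows):
    """Recompute TPM from read counts and effective lengths."""
    rates = [row["NumReads"] / row["EffectiveLength"] for row in rows]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "g1\t256\t36.763\t9545.604194\t5.000\n"
    "g2\t298\t61.085\t4595.839603\t4.000\n"
)
rows = read_quant(example)
print(tpm_from_counts(rows))
```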

7. Gene amino acid sequences

Introduction

Gene nucleotide sequences are translated into amino acid sequences with the transeq program from EMBOSS-6.5.7.

Input

Gene nucleotide sequence file (FASTA format)

Code table [0]: the genetic code to use. Values:

    0 (Standard)
    1 (Standard (with alternative initiation codons))
    2 (Vertebrate Mitochondrial)
    3 (Yeast Mitochondrial)
    4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
    5 (Invertebrate Mitochondrial)
    6 (Ciliate Macronuclear and Dasycladacean)
    9 (Echinoderm Mitochondrial)
    10 (Euplotid Nuclear)
    11 (Bacterial)
    12 (Alternative Yeast Nuclear)
    13 (Ascidian Mitochondrial)
    14 (Flatworm Mitochondrial)
    15 (Blepharisma Macronuclear)
    16 (Chlorophycean Mitochondrial)
    21 (Trematode Mitochondrial)
    22 (Scenedesmus obliquus)
    23 (Thraustochytrium Mitochondrial)

trim parameter [N]: removes all 'X' and '*' characters from the right end of the translation. Trimming starts at the end and continues until the next character is not an 'X' or a '*'.

clean parameter [N]: changes all STOP codon positions from the '*' character to 'X' (an unknown residue). This is useful because some programs will not accept protein sequences with '*' characters in them.
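To make the effect of the trim and clean options concrete, here is a minimal sketch that translates frame 1 with the standard genetic code and then applies each option. It is an illustration of the behavior described above, not transeq itself. Codons containing ambiguous bases are rendered as 'X' here, which is an assumption, and all names are hypothetical.

```python
from itertools import product

_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, third base varying fastest.
STANDARD_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate frame 1; incomplete trailing codons are ignored."""
    sequence = nucleotides.upper().replace("U", "T")
    usable = len(sequence) - len(sequence) % 3
    return "".join(
        STANDARD_TABLE.get(sequence[i:i + 3], "X")  # unknown codon -> 'X' (assumption)
        for i in range(0, usable, 3)
    )


def trim_translation(protein):
    """Mimic trim: strip trailing 'X' and '*' characters."""
    return protein.rstrip("X*")


def clean_translation(protein):
    """Mimic clean: replace every '*' (stop) with 'X'."""
    return protein.replace("*", "X")


protein = translate("ATGAAATAA")
print(protein, trim_translation(protein), clean_translation(protein))
```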

Results

An amino acid sequence file (FASTA format)

8. Gene annotation

Introduction

**EggNOG-mapper** is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database (http://eggnog5.embl.de) to transfer functional information from fine-grained orthologs only.

Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.

The use of orthology predictions for functional annotation permits a higher precision than traditional homology searches (e.g. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).

Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan [can be found here](https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb).

EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de

Documentation

https://github.com/jhcepas/eggnog-mapper/wiki

If you use this software, please cite:

[1] eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution, msab293, https://doi.org/10.1093/molbev/msab293

[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork. Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085

Input

Gene protein sequence file (FASTA format), for example:

    >geneName1
    MKLLAHILCLSLALAWAQSQDHALAVLDRCEGLEMDAVAVNEEGIPYFFKGDHLFKGFHG
    >geneName2
    MWVGEERFEGSRLVVVTRGAVSVGGEGVEDVGGGAVWGLVRSAQSEHPGRFVLVDADVDA
    DVDTGVVPDVVGLGESQVAVRGGRVWVPRLVGVNSGGGVRAGGGVVRRGLGSGVALVTGG
    TGLLGGLVARHLVSAYGVGELVLVSRRGPGAPGVGALVGELEELGAGVRVVACDVADRGA
    VAELVGSIEGLRVVVHAAGAVDDGVIGSLDGGRLRGVMGPKAWGAWHLHELTSGLDLS

Results

The annotation result is a tab-separated table. Its columns, shown with the values from one example row, are:

    #query          geneName3
    seed_ortholog   494419.ALPM01000100_gene1074
    evalue          4.15e-05
    score           48.9
    eggNOG_OGs      COG0747@1|root,COG0747@2|Bacteria,2GM5G@201174|Actinobacteria
    max_annot_lvl   201174|Actinobacteria
    COG_category    E
    Description     ABC transporter substrate-binding protein
    Preferred_name  -
    GOs             -
    EC              -
    KEGG_ko         ko:K02035
    KEGG_Pathway    ko02024,map02024
    KEGG_Module     M00239
    KEGG_Reaction   -
    KEGG_rclass     -
    BRITE           ko00000,ko00001,ko00002,ko02000
    KEGG_TC         3.A.1.5
    CAZy            -
    BiGG_Reaction   -
    PFAMs           SBP_bac_5
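A short sketch of reading this table follows. It assumes a tab-separated file whose header line begins with `#query`, that other lines starting with `#` are comments, and that "-" marks an empty field, as in the example. `read_annotations` and `kegg_kos` are hypothetical helper names.

```python
import io


def read_annotations(handle):
    """Yield one dict per annotated gene, keyed by the column names."""
    header = None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#query"):
            header = line[1:].split("\t")  # drop the leading '#': first key is 'query'
            continue
        if line.startswith("#"):
            continue  # other comment or metadata lines
        if header is None:
            raise ValueError("data row found before the '#query' header line")
        yield dict(zip(header, line.split("\t")))


def kegg_kos(row):
    """Return the KEGG orthology IDs of a row without the 'ko:' prefix."""
    value = row.get("KEGG_ko", "-")
    if value == "-":
        return []
    return [ko[3:] if ko.startswith("ko:") else ko for ko in value.split(",")]


example = io.StringIO("#query\tseed_ortholog\tKEGG_ko\ngeneName3\t494419.ALPM01000100_gene1074\tko:K02035\n")
for row in read_annotations(example):
    print(row["query"], kegg_kos(row))
```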

9. Taxonomic annotation

Introduction

MetaPhlAn is a taxonomic profiling tool for next-generation sequencing data. It produces a list of the taxa present in a metagenome together with their relative abundances, and it can take FASTQ data directly.

Input

FASTQ format file, for example:

    @A00151:255:HNMLKDSXY:4:1101:8314:7467 1:N:0:TGAGGC
    GTCACGCCGTCTCCTCATCTCGGCTCTCTCACCATGCAGTGGTCGAGGGCCGCGCTTTCTTACACCCGGGGAGAGGGGATTCCGGGCGGCGGGGTGCCCGGGACGAGGGAGGCCGGTGCCGCCGCGTTGCCGGCCGCGGGACGCGGTTGCC
    +
    FFFFFFFFFFFFF,:,FFFFFFFFFF:FFFFFFFF,FF:F,,FFFFFF,FFF::FF,:FF::F,FF,,FFFFF,,::FFFFFFFFFFFF::FFFFFFF:FF:FFFFF:FFFFFF::FF:FFFF:FFFFF:F:FFFFF,:,:F,FFFF,,:F

Results