
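Input file: info.csv

A tab-separated file: column 1 is the sample name, column 2 the fq1 path, and column 3 the fq2 path. Paths are relative to the location of info.csv. For example:
sample1	sample1.1.fq	sample1.2.fq
sample2	sample2.1.fq	sample2.2.fq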
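The layout above can be read with a few lines of Python. This is a minimal sketch, assuming the file has no header row; the function name read_sample_sheet is hypothetical and not part of the pipeline.

```python
import csv
from pathlib import Path


def read_sample_sheet(info_path):
    """Return (sample, fq1, fq2) tuples with paths resolved against the folder of info.csv."""
    info_path = Path(info_path)
    base = info_path.parent
    samples = []
    with info_path.open(newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or not row[0].strip():
                continue  # skip blank lines
            if len(row) < 3:
                raise ValueError(f"expected 3 tab-separated columns, got {row!r}")
            name, fq1, fq2 = (field.strip() for field in row[:3])
            samples.append((name, base / fq1, base / fq2))
    return samples
```

Resolving against the parent folder of info.csv keeps the sheet portable: the whole data directory can be moved without editing the paths.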

1. Data quality control

Introduction

fastp: a tool designed to provide fast all-in-one preprocessing for FastQ files. It is developed in C++ with multithreading support to afford high performance.

FASTQ file format

@NS500713:64:HFKJJBGXY:1:11101:1675:1101 1:N:0:TATAGCCT+GACCCCCA
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA
+
6AAAAAEEEEE/E/EA/E/AEA6EE//AEE66/AAE//EEE/E//E/AA/EEE/A/AEE/EEA//EEEEEEEE6EEAA
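Each FASTQ record spans four lines: an identifier starting with @, the bases, a separator starting with +, and one quality character per base. The sketch below splits a record and decodes the quality string, assuming the Phred+33 encoding used by current Illumina instruments; the function names are illustrative only.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (read_id, sequence, quality_string)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert quality characters to integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


read_id, seq, qual = parse_fastq_record(["@r1\n", "ACGT\n", "+\n", "6A/E\n"])
print(phred_scores(qual))  # [21, 32, 14, 36]
```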

Common parameters

qualified_quality_phred: the quality value at which a base is considered qualified. The default of 15 means phred quality >=Q15 is qualified. (int [=15])
length_required: reads shorter than length_required will be discarded; default is 15. (int [=15])
n_base_limit: if the number of N bases in a read is >n_base_limit, the read/pair is discarded. Default is 5. (int [=5])
cut_mean_quality: the mean quality requirement shared by cut_front, cut_tail and cut_sliding. Range: 1~36, default: 20 (Q20). (int [=20])
cut_window_size: the window size shared by cut_front, cut_tail and cut_sliding. Range: 1~1000, default: 4. (int [=4])
cut_front: move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
cut_tail: move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
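As an illustration of how these parameters map onto a fastp call, the sketch below assembles an argument list for paired-end input. The helper fastp_command and the file names in the usage line are hypothetical; the long option names are fastp's own, and the exact call made by this pipeline is not documented here.

```python
def fastp_command(fq1, fq2, out1, out2, json_report, html_report,
                  qualified_quality_phred=15, length_required=15, n_base_limit=5,
                  cut_front=False, cut_tail=False,
                  cut_mean_quality=20, cut_window_size=4):
    """Build the argument list for a paired-end fastp run (nothing is executed)."""
    cmd = [
        "fastp",
        "--in1", str(fq1), "--in2", str(fq2),
        "--out1", str(out1), "--out2", str(out2),
        "--json", str(json_report), "--html", str(html_report),
        "--qualified_quality_phred", str(qualified_quality_phred),
        "--length_required", str(length_required),
        "--n_base_limit", str(n_base_limit),
    ]
    if cut_front:
        cmd.append("--cut_front")
    if cut_tail:
        cmd.append("--cut_tail")
    if cut_front or cut_tail:
        # The window settings only matter when a cutting mode is switched on.
        cmd += ["--cut_mean_quality", str(cut_mean_quality),
                "--cut_window_size", str(cut_window_size)]
    return cmd


print(" ".join(fastp_command("s.1.fq", "s.2.fq", "clean.1.fq", "clean.2.fq",
                             "s.json", "s.html", cut_tail=True)))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where fastp is installed.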

Results

1. Quality-controlled FASTQ sequence files
2. QC statistics files: *.json or *.html

Citing in publications

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560

2. Sequence assembly

Introduction

megahit assembles metagenomic sequencing data. It assembles quickly and uses few resources.

Input

fq1 file: FASTQ data for the left-end reads, for example:
@A00151:255:HNMLKDSXY:4:1101:8314:7467 1:N:0:TGAGGC
GTCACGCCGTCTCCTCATCTCGGCTCTCTCACCATGCAGTGGTCGAGGGCCGCGCTTTCTTACACCCGGGGAGAGGGGATTCCGGGCGGCGGGGTGCCCGGGACGAGGGAGGCCGGTGCCGCCGCGTTGCCGGCCGCGGGACGCGGTTGCC
+
FFFFFFFFFFFFF,:,FFFFFFFFFF:FFFFFFFF,FF:F,,FFFFFF,FFF::FF,:FF::F,FF,,FFFFF,,::FFFFFFFFFFFF::FFFFFFF:FF:FFFFF:FFFFFF::FF:FFFF:FFFFF:F:FFFFF,:,:F,FFFF,,:F
fq2 file: FASTQ data for the right-end reads, for example:
@A00151:255:HNMLKDSXY:4:1101:8314:7467 2:N:0:TGAGGC
GGACGTCCCCATGGAGCTCCTGAGCTTACGCAGCGCCGCACGGCAACCGCGTCCGGCGTCGGCAACCGCGTCCGGTGCCCAACCGCGTCCAACGGCCGGCAACCGCGTCCCGCGGCCGGCACCGCGGCGGCACCGGCCTCCCTCGTCCCGG
+
F::F:F:FFFF:FFFFFFFF:FFFFF:FFFFFFFFFF,FFFFF:FFFFFFF,:FF,FFFFFFFF,FFFFF:FF:F::FF,FF:F,FFFFFFF,F::FF,FFFFFFFFFFF,FFF:FF:FFF,FFFFFFFFFFFFF::FF:FF:FF:FFFF,
min contig length: minimum length of an assembled contig; shorter contigs are discarded
k-min: minimum k-mer length
k-max: maximum k-mer length
k step: increment between successive k-mer lengths
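The four assembly parameters correspond to megahit's --min-contig-len, --k-min, --k-max and --k-step options. The sketch below builds such a command line; megahit_command is a hypothetical helper, and the odd/even checks reflect my reading of megahit's help text (k-mer sizes odd, step even) rather than anything stated on this page.

```python
def megahit_command(fq1, fq2, out_dir, min_contig_len, k_min, k_max, k_step):
    """Build the argument list for a paired-end megahit run (nothing is executed)."""
    if k_min % 2 == 0 or k_max % 2 == 0:
        raise ValueError("k-min and k-max are expected to be odd")  # assumption, see lead-in
    if k_step % 2 != 0:
        raise ValueError("k-step is expected to be even")  # assumption, see lead-in
    if k_min > k_max:
        raise ValueError("k-min must not exceed k-max")
    return [
        "megahit",
        "-1", str(fq1), "-2", str(fq2),
        "--min-contig-len", str(min_contig_len),
        "--k-min", str(k_min), "--k-max", str(k_max), "--k-step", str(k_step),
        "-o", str(out_dir),
    ]


# Illustrative values only; they are not the pipeline's defaults.
print(" ".join(megahit_command("s.1.fq", "s.2.fq", "assembly", 500, 21, 99, 20)))
```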

Results

final.contigs.fa, for example:
>k97_872 flag=0 multi=67.7803 len=320
GCCTGCGCCTCGATCGGATCACCCAGCCTCGTCCCCGTCCCATGCGCCTCCACCACATCCACCTCGGACGCCGACACCCCCGCGTTCTCCAACGCCCGCCGGATCACCCGCTGCTGCGACGGACCATTCGGCGCCATCAACCCATTCGACGCACCATCCTGATTCACCGCCGAACCACGCACCACCGCCAACACCCGATGCCCAAAACGACGAGCATCCGACAAACGCTCCACCACCAACACACCCACACCCTCACCCCAACCCGTCCCATCAGC
CCCCTCGGCAAACGACCTGCACCGACCATCAACCGACAACCCGCG
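The contig header carries key=value annotations after the contig name. A small sketch for pulling them apart, assuming every header follows the flag/multi/len pattern shown in the example:

```python
def parse_contig_header(header):
    """Parse a header such as '>k97_872 flag=0 multi=67.7803 len=320'."""
    name, *pairs = header.lstrip(">").split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {
        "name": name,
        "flag": int(fields["flag"]),
        "multi": float(fields["multi"]),
        "len": int(fields["len"]),
    }


info = parse_contig_header(">k97_872 flag=0 multi=67.7803 len=320")
print(info["name"], info["len"])  # k97_872 320
```

The len field makes it possible to filter contigs by length without reading the sequence itself.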

3. Gene prediction

Introduction

Fast gene prediction for metagenomes using the metagene software.

Input

Genome sequences (FASTA format), for example:
>scaffold3
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTC...
>scaffold4
GGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGAC...

Results

Example file:
# scaffold4 61.7
# gc = 0.698128
# bacteria
1 4513 - 0 539.403 partial (lack 3'-end)
4450 5127 - 0 37.5313 complete
# scaffold3 61.3
# gc = 0.738649
# bacteria
1 85 - 0 10.6447 partial (lack 3'-end)
30 2327 + 0 433.438 complete
2354 9268 + 0 1287.43 complete
9184 9295 + 0 20.3604 partial (lack 3'-end)
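A sketch of reading this table follows. It assumes that the comment line directly before each "# gc = ..." line names the sequence, and that gene rows hold start, end, strand, frame, score and completeness in that order. These column meanings are inferred from the example, not taken from metagene's documentation.

```python
def parse_metagene(lines):
    """Return one dict per predicted gene from a metagene-style result table."""
    genes = []
    seq_name = None
    prev_comment = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            body = line[1:].strip()
            # Assumption: the comment preceding "gc = ..." carries the sequence name.
            if body.startswith("gc") and prev_comment:
                seq_name = prev_comment.split()[0]
            prev_comment = body
            continue
        start, end, strand, frame, score, status = line.split(None, 5)
        genes.append({
            "seq": seq_name,
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "frame": int(frame),
            "score": float(score),
            "status": status,
        })
    return genes
```

Keying on the "gc" line rather than on the first comment after a gene row means a sequence with no predicted genes does not shift the names of the sequences that follow it.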

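4. Gene sequences

Introduction

The metagene gene prediction software reports its results as a table containing the position of each gene within the long sequences. Using this table together with the long sequences used for prediction, the nucleotide sequence of each gene is extracted.

Input

metagene prediction result table, for example:
# scaffold4 61.7
# gc = 0.698128
# bacteria
1 4513 - 0 539.403 partial (lack 3'-end)
4450 5127 - 0 37.5313 complete
# scaffold3 61.3
# gc = 0.738649
# bacteria
1 85 - 0 10.6447 partial (lack 3'-end)
30 2327 + 0 433.438 complete
2354 9268 + 0 1287.43 complete
9184 9295 + 0 20.3604 partial (lack 3'-end)
Long sequence file used for prediction (FASTA format), for example:
>scaffold4
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCAGGAGGTCGGGAAA.....
>scaffold3
TGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTCGAGTACCAATAATAAAGTGA......

Results

Gene nucleotide sequence file (FASTA format)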
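The extraction step can be sketched as slicing the long sequence by the table coordinates and reverse-complementing genes on the minus strand. The sketch assumes coordinates are 1-based and inclusive, which is a guess based on rows that start at position 1; the function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def extract_gene(sequence, start, end, strand):
    """Cut out positions start..end (1-based, inclusive; assumed) and orient by strand."""
    fragment = sequence[start - 1:end]
    return reverse_complement(fragment) if strand == "-" else fragment


print(extract_gene("AACGTT", 2, 4, "+"))  # ACG
print(extract_gene("AACGTT", 2, 4, "-"))  # CGT
```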

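5. Non-redundant gene set

Introduction

Uses the cd-hit software to remove redundant sequences from a FASTA file.

Input

FASTA sequence file, for example:
>seqname1
AAAAAAAAGCTACTTGGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGACATTTC
>seqname2
GGAGTACCAATAATAAAGTGAGCCCACCTTCCTGGTACCCAGAC
identity: sequence identity threshold, default 0.9

Results

A new FASTA sequence file with the redundant sequences removed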
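For orientation, a clustering call with the identity threshold could be assembled as below. The helper cd_hit_command is hypothetical, and choosing cd-hit-est (the nucleotide program in the CD-HIT package) as the default is an assumption: this page only says "cd-hit" and does not state which program the pipeline runs.

```python
def cd_hit_command(fasta_in, fasta_out, identity=0.9, program="cd-hit-est"):
    """Build the argument list for a CD-HIT run (nothing is executed).

    program: "cd-hit-est" for nucleotide input is assumed here; use "cd-hit" for proteins.
    """
    if not 0 < identity <= 1:
        raise ValueError("identity must be in the interval (0, 1]")
    return [program, "-i", str(fasta_in), "-o", str(fasta_out), "-c", str(identity)]


print(" ".join(cd_hit_command("genes.fa", "genes.nr.fa")))
# cd-hit-est -i genes.fa -o genes.nr.fa -c 0.9
```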

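6. Gene quantification

Introduction

Salmon: fast, alignment-free quantification of metagenomic genes.

Input

1. Reference sequence index: path to the index built from the reference sequences with salmon index
2. fq1: left-end read sequence file
3. fq2: right-end read sequence file

Results

The output file quant.sf, for example:
Name	Length	EffectiveLength	TPM	NumReads
g1	256	36.763	9545.604194	5.000
g2	298	61.085	4595.839603	4.000
g3	299	61.650	5195.523516	4.564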
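The quant.sf columns can be loaded and related to each other as sketched below. TPM is computed here with the standard definition (reads per effective base, scaled so that all genes sum to one million); the function names are illustrative. Recomputing TPM on a three-gene excerpt will not reproduce the file's values, because the file was normalized over every gene, but the ratios between genes should agree.

```python
import csv


def read_quant(handle):
    """Parse a tab-separated quant.sf into a list of typed rows."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "Name": record["Name"],
            "Length": int(record["Length"]),
            "EffectiveLength": float(record["EffectiveLength"]),
            "TPM": float(record["TPM"]),
            "NumReads": float(record["NumReads"]),
        })
    return rows


def tpm_from_counts(rows):
    """TPM_i = (NumReads_i / EffectiveLength_i) / sum_j(NumReads_j / EffectiveLength_j) * 1e6."""
    rates = {row["Name"]: row["NumReads"] / row["EffectiveLength"] for row in rows}
    total = sum(rates.values())
    return {name: rate / total * 1e6 for name, rate in rates.items()}
```

A typical use is `with open("quant.sf") as handle: rows = read_quant(handle)`, where the path is whatever output directory was given to salmon.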

7. Gene amino acid sequences

Introduction

Translates gene nucleotide sequences into amino acid sequences using the transeq program from EMBOSS-6.5.7.

Input

Gene nucleotide sequence file (FASTA format)
Code table [0]: code to use. Values:
0 (Standard)
1 (Standard (with alternative initiation codons))
2 (Vertebrate Mitochondrial)
3 (Yeast Mitochondrial)
4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5 (Invertebrate Mitochondrial)
6 (Ciliate Macronuclear and Dasycladacean)
9 (Echinoderm Mitochondrial)
10 (Euplotid Nuclear)
11 (Bacterial)
12 (Alternative Yeast Nuclear)
13 (Ascidian Mitochondrial)
14 (Flatworm Mitochondrial)
15 (Blepharisma Macronuclear)
16 (Chlorophycean Mitochondrial)
21 (Trematode Mitochondrial)
22 (Scenedesmus obliquus)
23 (Thraustochytrium Mitochondrial)
trim parameter [N]: removes all 'X' and '*' characters from the right end of the translation. The trimming process starts at the end and continues until the next character is not an 'X' or a '*'.
clean parameter [N]: changes all STOP codon positions from the '*' character to 'X' (an unknown residue). This is useful because some programs will not accept protein sequences with '*' characters in them.
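The effect of the trim and clean options on a translated sequence can be mimicked in a few lines. This is only an illustration of the behavior described above, not EMBOSS code, and the function names are made up for the example.

```python
def trim_translation(protein):
    """Drop every trailing 'X' or '*' until another character is reached."""
    return protein.rstrip("X*")


def clean_translation(protein):
    """Replace each stop symbol '*' with 'X' (an unknown residue)."""
    return protein.replace("*", "X")


print(trim_translation("MKL*X*"))  # MKL
print(trim_translation("MK*L*"))   # MK*L  (internal stops are untouched)
print(clean_translation("MK*L*"))  # MKXLX
```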

Results

Amino acid sequence file (FASTA format)

8. Gene annotation

Introduction

EggNOG-mapper is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database (http://eggnog5.embl.de) to transfer functional information from fine-grained orthologs only.

Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.

The use of orthology predictions for functional annotation permits a higher precision than traditional homology searches (i.e. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).

Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan can be found here: https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb

EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de

Documentation: https://github.com/jhcepas/eggnog-mapper/wiki

If you use this software, please cite:

[1] eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution, msab293, https://doi.org/10.1093/molbev/msab293

[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork. Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085

Input

Gene protein sequence file (FASTA format), for example:
>geneName1
MKLLAHILCLSLALAWAQSQDHALAVLDRCEGLEMDAVAVNEEGIPYFFKGDHLFKGFHG
>geneName2
MWVGEERFEGSRLVVVTRGAVSVGGEGVEDVGGGAVWGLVRSAQSEHPGRFVLVDADVDA
DVDTGVVPDVVGLGESQVAVRGGRVWVPRLVGVNSGGGVRAGGGVVRRGLGSGVALVTGG
TGLLGGLVARHLVSAYGVGELVLVSRRGPGAPGVGALVGELEELGAGVRVVACDVADRGA
VAELVGSIEGLRVVVHAAGAVDDGVIGSLDGGRLRGVMGPKAWGAWHLHELTSGLDLS

Results

Annotation result table (tab-separated), for example:
#query	seed_ortholog	evalue	score	eggNOG_OGs	max_annot_lvl	COG_category	Description	Preferred_name	GOs	EC	KEGG_ko	KEGG_Pathway	KEGG_Module	KEGG_Reaction	KEGG_rclass	BRITE	KEGG_TC	CAZy	BiGG_Reaction	PFAMs
geneName3	494419.ALPM01000100_gene1074	4.15e-05	48.9	COG0747@1|root,COG0747@2|Bacteria,2GM5G@201174|Actinobacteria	201174|Actinobacteria	E	ABC transporter substrate-binding protein	-	-	-	ko:K02035	ko02024,map02024	M00239	-	-	ko00000,ko00001,ko00002,ko02000	3.A.1.5	-	-	SBP_bac_5
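To work with this table, the "#query" line can serve as the column header and "-" can be treated as an empty value. The sketch below assumes tab-separated fields and that other lines starting with "#" are comments; the function names are illustrative.

```python
def parse_annotations(lines):
    """Return one dict per annotated gene, keyed by the '#query' header columns."""
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#query"):
                header = line[1:].split("\t")
            continue  # any other '#' line is treated as a comment
        if header is None:
            raise ValueError("no '#query' header line found before the data rows")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def kegg_orthologs(row):
    """Return the KO identifiers of a row without the 'ko:' prefix ('-' means none)."""
    value = row.get("KEGG_ko", "-")
    if value == "-":
        return []
    return [item[3:] if item.startswith("ko:") else item for item in value.split(",")]
```

For the example row above, kegg_orthologs returns ["K02035"].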

9. Taxonomic annotation

Introduction

MetaPhlAn is a taxonomic classification tool for next-generation sequencing data. It produces a list of the taxa present in a metagenome together with their relative abundances, and it can work directly on FASTQ data.

Input

FASTQ file, for example:
@A00151:255:HNMLKDSXY:4:1101:8314:7467 1:N:0:TGAGGC
GTCACGCCGTCTCCTCATCTCGGCTCTCTCACCATGCAGTGGTCGAGGGCCGCGCTTTCTTACACCCGGGGAGAGGGGATTCCGGGCGGCGGGGTGCCCGGGACGAGGGAGGCCGGTGCCGCCGCGTTGCCGGCCGCGGGACGCGGTTGCC
+
FFFFFFFFFFFFF,:,FFFFFFFFFF:FFFFFFFF,FF:F,,FFFFFF,FFF::FF,:FF::F,FF,,FFFFF,,::FFFFFFFFFFFF::FFFFFFF:FF:FFFFF:FFFFFF::FF:FFFF:FFFFF:F:FFFFF,:,:F,FFFF,,:F

Results

profiled_metagenome.txt, for example:
#SampleID	Metaphlan_Analysis
#clade_name	NCBI_tax_id	relative_abundance
k__Bacteria	2	100.0
k__Bacteria|p__Actinobacteria	2|201174	100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria	2|201174|1760	100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales	2|201174|1760|85011	100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales|f__Streptomycetaceae	2|201174|1760|85011|2062	100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales|f__Streptomycetaceae|g__Streptomyces	2|201174|1760|85011|2062|1883	100.0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales|f__Streptomycetaceae|g__Streptomyces|s__Streptomyces_violaceusniger	2|201174|1760|85011|2062|1883|68280	60.69895
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Streptomycetales|f__Streptomycetaceae|g__Streptomyces|s__Streptomyces_melanosporofaciens	2|201174|1760|85011|2062|1883|67327	34.98288
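Every row of the profile repeats the full lineage, so species-level abundances can be picked out by looking at the last rank of clade_name. The sketch assumes the tab-separated three-column layout of the example (clade_name, NCBI_tax_id, relative_abundance); the function name is illustrative.

```python
def species_abundance(lines):
    """Map species name to relative abundance for rows whose last rank is 's__'."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.split("\t")
        last_rank = fields[0].split("|")[-1]
        if last_rank.startswith("s__"):
            result[last_rank[3:]] = float(fields[2])
    return result
```

Applied to the example profile, this keeps only the two Streptomyces species rows (60.69895 and 34.98288).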

Copyright © 2021-2025 上海牛马人生物科技有限公司 沪ICP备 2022007390号-2