New insights from uncultivated genomes of the global human gut microbiome

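We downloaded 11,523 sequencing runs from the NCBI SRA (ref. 39) for publicly available human gut metagenomes. These data corresponded to 3,810 samples from 15 studies (refs. 9, 21, 40, 42–51; and https://olive.broadinstitute.org/progects/projects/infant_gut_gut_gut_flora_antibibiotics) (Supplementary Tables 1, 2). Sequencing metadata were obtained from the SRAdb relational database, and host metadata were obtained from the NCBI BioSample database (ref. 53) or from supplementary datasets linked to the publications (Supplementary Table 3). Metadata for the Fijian cohort were not available online or on request; these individuals were treated as healthy adults from a rural population.

We used MEGAHIT v.1.1.1 (ref. 54) with default parameters to co-assemble the 11,523 sequencing runs for each of the 3,810 biological samples. This resulted in 333,661,332 contigs longer than 200 bp, totalling 453.5 × 10⁹ bp, with an average N50 of 12,460 bp (Supplementary Table 2). Human gut MAGs were generated for each sample using three different tools with default options: MaxBin v.2.2.4 (ref. 55), MetaBAT v.2.12.1 (ref. 56) and CONCOCT v.0.4.0 (ref. 10), all of which use a combination of sequence composition and coverage information. DAS Tool v.1.1.0 (ref. 57) was used with the option '--score_threshold 0' to integrate the MAGs produced by the three tools. Contigs shorter than 1 kb were discarded. This process resulted in 152,591 MAGs longer than 100 kb, comprising 73,632,219 contigs (22% of the total assembly) and 310.7 × 10⁹ bp (69% of the total assembly). All MAGs were screened for contamination against the human genome (build 38) and the PhiX genome using BLASTN v.2.6.0 (ref. 58).

To refine the MAGs of the HGM dataset, we performed pairwise alignment of contigs between each MAG and other closely related, near-complete MAGs and reference genomes (Supplementary Table 6). Our rationale was that strains of the same species should share homology across most contigs, and that contigs failing to do so (that is, present in one genome but absent from all the others) probably represent contamination. For each input MAG, we used Mash v.2.0 (ref. 59) to find at least five closely related, near-complete genomes in the database (>95% estimated completeness, <5% estimated contamination, Mash distance ≤ 0.05, P ≤ 0.001), and then used BLASTN to align contigs between each MAG and all target genomes. Contigs in the MAG that failed to align at ≥70% nucleotide identity over ≥25% length to any of the closely related genomes were flagged for removal.

We identified and removed taxonomically discordant contigs from MAGs using two complementary approaches (Supplementary Table 6). The first approach performs taxonomic annotation on the basis of universal single-copy marker genes. Hidden Markov models for marker-gene families were downloaded from the PhyEco database (ref. 60), and searched against MAGs with HMMER v.3.1b2 (ref. 61). A subset of 100 (for Archaea) or 88 (for Bacteria) gene families was used. Marker genes found in MAGs were then aligned against a reference database of taxonomically annotated marker genes from reference genomes using BLASTP.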
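For each gene, we transferred the taxonomy of the best hit in the reference database at the appropriate rank, on the basis of percentage amino acid identity cut-offs specific to each gene family at each rank. We then taxonomically annotated each MAG on the basis of the consensus taxonomy of marker genes at the lowest rank such that >70% of marker genes were annotated. Contigs were flagged for removal if they (1) contained a taxonomically discordant marker gene and (2) lacked any concordant marker gene. The second approach to taxonomic refinement was similar to the first, except that 855,764 clade-specific prokaryotic marker genes from the MetaPhlAn2 database (ref. 27) were used for taxonomic annotation, after excluding 'pseudo-markers' that are not unique to a clade.

Using an approach similar to a previously published method (ref. 11), we identified and removed contigs from MAGs with (1) outlier GC content, (2) outlier tetranucleotide frequency or (3) outlier sequencing read depth (Supplementary Table 6). We used principal component analysis to reduce the dimensionality of the tetranucleotide frequencies to the first principal component (tetranucleotide frequency PC1). For each MAG, we then measured the mean GC content, mean tetranucleotide frequency PC1 and mean sequencing read depth. Contigs were flagged for removal if they deviated from these means by more than cut-off values chosen to limit the reduction in completeness (Supplementary Table 6).

We simulated 1,000 human gut MAGs to validate our overall MAG refinement strategy (Supplementary Table 7). Each simulated MAG contained two genomes: a 'host' genome (representing the target genome) and a 'donor' genome (representing the contaminating genome). All 102 genomes used in the simulation were isolated from the human gut and were estimated to have >95% completeness, <1% contamination and <25 contigs. MAGs were simulated with completeness (mean = 61.9%), contamination (mean = 10.0%) and N50 (mean = 35.8 kb) on the basis of randomly sampled MAGs from the HGM dataset. MAGs were dropped in cases in which contamination exceeded completeness, and thus the host genome was in the minority. The refinement pipeline was applied to each simulated MAG and, to evaluate the pipeline, we quantified the overall reduction in completeness and contamination (Extended Data Fig. 1a, b).

We applied each of the refinement approaches described above to the MAGs (Extended Data Fig. 2c, Supplementary Table 6). In rare cases, these approaches may erroneously flag a large proportion of a MAG. To avoid this, we applied a particular approach to a MAG only if it resulted in ≤25% reduction in total length. The five approaches combined removed 5,251,859 contigs (7.13% of total) and 20,821.2 Mb (6.70% of total) from the MAGs. After removing potential contaminants, we were left with 152,279 MAGs with a total length ≥100 kb and 10,036 individual contigs longer than 100 kb that were either unbinned or removed during decontamination.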
These long contigs were included with other MAGs, which brought the total number to 162,315.

CheckM v.1.0.7 (ref. 13) was used to estimate completeness and contamination of the 162,315 recovered MAGs (Supplementary Table 6); CheckM is based on the copy number of lineage-specific single-copy genes. Additional statistics were obtained for each genome, including the contig N50, number of contigs, average contig length, contig read depth, and number of tRNA and rRNA genes. tRNAs were identified using tRNAscan-SE v.1.3.1 (ref. 62) and rRNA genes using Barrnap v.0.9-dev (ref. 63) with options '--reject 0.01 --evalue 1e-3'. We identified 60,664 MAGs that met the MIMAG medium-quality criteria of ≥50% completeness with ≤10% contamination (ref. 14). For analyses that required near-complete genomes, we used a subset of 24,345 high-quality MAGs that were ≥90% complete, ≤5% contaminated, with an N50 ≥ 10 kb, an average contig length ≥5 kb, ≤500 contigs and ≥90% of contigs with ≥5× read depth.

Read mapping and SNP calling were performed to assess the genetic diversity of each MAG (Supplementary Table 5). Bowtie 2 v.2.3.4 (ref. 64) was used to construct a database of MAGs for each sample, and to align metagenomic reads. Reads with low mapping and sequence quality were discarded (quality scores <20 and <30, respectively), and we counted the occurrence of nucleotides with quality ≥30 across each MAG. To compare SNPs between MAGs sequenced to different depths, we down-sampled each MAG to 40 mapped reads per site. MAGs with at least 200,000 sites of ≥40× depth were retained for analysis. A SNP was called if at least two reads matched the alternative allele at a genomic site. SNP density was calculated as the number of SNPs per kilobase.
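The down-sampling and SNP-calling rule described above can be illustrated with a short sketch. It is not the authors' pipeline: the function names, the representation of a site as a list of base calls, and the choice to treat the second most common base as the alternative allele are all assumptions made for this example.

```python
import random
from collections import Counter


def downsample_site(bases, depth=40, rng=random):
    """Return `depth` randomly chosen base calls, or None if the site is too shallow."""
    if len(bases) < depth:
        return None
    return rng.sample(list(bases), depth)


def is_snp(bases, min_alt_reads=2):
    """True if the second most common base is supported by >= min_alt_reads reads.

    Assumption: the most common base stands in for the reference allele.
    """
    ranked = Counter(bases).most_common(2)
    return len(ranked) == 2 and ranked[1][1] >= min_alt_reads


def snp_density(sites, depth=40, min_alt_reads=2, seed=0):
    """SNPs per kilobase, counted over the sites that reach `depth` reads."""
    rng = random.Random(seed)
    sampled = (downsample_site(bases, depth, rng) for bases in sites)
    kept = [s for s in sampled if s is not None]
    if not kept:
        return 0.0
    n_snps = sum(is_snp(s, min_alt_reads) for s in kept)
    return 1000.0 * n_snps / len(kept)


if __name__ == "__main__":
    # Four toy sites: invariant, two alternative reads, one alternative read, too shallow.
    toy_sites = [list("A" * 40), list("A" * 38 + "CC"), list("A" * 39 + "C"), list("A" * 10)]
    print(snp_density(toy_sites))
```

In a real analysis, the retention filter from the text (at least 200,000 sites at ≥40× depth) would be checked before a density is reported for a MAG.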
We downloaded 201,102 publicly available bacterial and archaeal reference genomes from the Integrated Microbial Genomes (IMG; https://img.jgi.doe.gov/) (ref. 65) (n = 61,713) and Pathosystems Resource Integration Center (PATRIC; https://www.patricbrc.org/) (ref. 66) (n = 139,389) databases, on 16 January 2018. These included genomes from 2 human gut culturomics studies (refs. 6, 7) and 16,525 previously published MAGs, including a previous MAG study from the human gut (ref. 20) and nearly 8,000 MAGs assembled from SRA metagenomes (ref. 11). To remove redundancy within and between databases, we used Mash (ref. 59) with default parameters to cluster genomes with a Mash distance of 0.0, which are expected to be identical. This resulted in 153,900 non-redundant reference genomes, of which 127,419 were classified as high quality, 18,498 as medium quality and another 7,983 as low quality (Supplementary Table 9).

Using an approach similar to a previously published method (ref. 67), we clustered the 60,664 MAGs and 145,917 reference genomes meeting or exceeding the MIMAG medium-quality standard into species-level OTUs on the basis of 95% whole-genome ANI (Supplementary Table 10). We first performed single-linkage clustering of genomes on the basis of a Mash ANI of 99%, which resulted in 79,675 clusters that can be confidently assigned to the same species-level OTU. Mash is extremely fast, although it can underestimate ANI for incomplete genomes (ref. 67). To address this, we used the ANIcalculator v.1.0 (ref. 68) to compute whole-genome-based ANI (gANI) between the 99%-identity clusters, and required that at least 20% of genes were aligned. The 20% cut-off was chosen to minimize the negative effects of incomplete genomes, and to avoid the formation of spurious OTUs (Extended Data Fig. 5a). To increase computational efficiency, we calculated gANI only for genome pairs with >90% Mash ANI.
Genomes were clustered into OTUs using average-linkage hierarchical clustering with a 95% gANI cut-off using the package MC-UPGMA v.1.0.0 (ref. 69), which yielded 23,790 OTUs.

All OTUs were taxonomically annotated using the tool GTDBTk v.0.0.6 (release 80, www.github.com/Ecogenomics/GtdbTk), which produces standardized taxonomic labels that are based on those used in the Genome Taxonomy Database (ref. 26). Additionally, we constructed pan-genomes on the basis of clustering all genes within each OTU, using VSEARCH v.2.4.3 (ref. 70) with 90% DNA identity and 50% alignment cut-offs (maximum 500 genomes per OTU). Human gut OTUs were identified from the set of 23,790 OTUs on the basis of (1) containing a MAG from the HGM dataset, (2) being detected by IGGsearch (see 'Development of IGGsearch for metagenomic profiling of species-level OTUs') in at least 1 of 3,810 metagenomes used for MAG recovery or (3) containing a genome isolated from the human gut (Extended Data Fig. 6a, b, Supplementary Table 10). A total of 4,558 species-level OTUs were annotated as being found in the human gut, on the basis of a combination of the three criteria.

We constructed phylogenetic trees of MAGs and reference genomes using concatenated alignments of conserved, single-copy marker-gene families from the PhyEco database (ref. 60) for Bacteria (n = 88 genes) and Archaea (n = 100 genes). Individual marker genes were identified using HMMER v.3.1b2 with bit-score cut-offs that are specific to each gene family. For computational efficiency, genomes were collapsed down to species-level OTUs, which were represented as individual leaves in the phylogenetic tree. To reduce the effect of contamination, taxonomically discordant marker genes were removed, as described in 'Refinement of MAGs on the basis of taxonomic annotation of contigs'. FAMSA v.1.2.5 (ref. 71) was used to construct protein-based multiple sequence alignments for each gene family.
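Columns with >15% gaps were removed, the alignments were concatenated, and sequences with >70% gaps were removed (n = 39). FastTree2 v.2.1.10 (ref. 72) was used with default options to build maximum-likelihood phylogenies for Bacteria and Archaea. All trees were visualized using iTOL v.3 (ref. 73). To quantify the gain in phylogenetic diversity contributed by the HGM dataset, we calculated the total branch length of two subtrees: the tree of all 4,558 gut OTUs (PD_GUT) and the tree of the 2,500 gut OTUs with a reference genome (PD_REFGUT). The percentage gain in phylogenetic diversity was calculated as 100 × (PD_GUT − PD_REFGUT)/PD_REFGUT. To identify higher-rank groups of OTUs, we performed average-linkage hierarchical clustering of phylogenetic distances, as implemented in R (Supplementary Table 10). Rank-specific cut-offs were determined by maximizing similarity to the Genome Taxonomy Database taxonomy of reference genomes (Extended Data Fig. 3d, e).

Using an approach similar to MetaPhlAn2 (ref. 27), we developed an accurate and efficient tool for quantifying species-level OTUs in unassembled metagenomes. First, we identified marker genes for each OTU (Supplementary Fig. 1a). From the pan-genome of each OTU, we selected the 300 genes with the highest intra-OTU frequency and the lowest inter-OTU frequency. Intra-OTU frequency was calculated as the proportion of genomes in the OTU in which the gene was found at 90% DNA identity. Inter-OTU frequency was determined on the basis of DNA alignments (using HS-BLASTN v.0.5 (ref. 74)) between each gene and the pan-genomes of other OTUs, and accounted for (1) the number of other pan-genomes in which the gene was found, (2) the frequency of the gene in each pan-genome and (3) the identity and coverage of each alignment. For computational reasons, genes were first aligned against the genes from each phylum, and the top 300 candidates for each OTU were subsequently checked for uniqueness between phyla. In total, 6,198,663 marker genes were identified for the 23,790 OTUs.

A large number of OTUs contained only a single genome, which made it difficult to accurately predict conserved genes. To refine our marker-gene sets, we used information on the co-variation of abundance across samples, which is a common strategy for identifying genetic regions from the same species and has been applied previously (refs. 3, 10, 20, 21, 56). Specifically, we used Bowtie 2 v.2.3.4 to map reads from the 3,810 metagenomic samples, and quantified the read depth of each gene in each sample. We used average-linkage clustering to group the genes of each OTU into co-varying groups on the basis of the correlation of their read depths across samples (Supplementary Fig. 1b). After applying a correlation threshold of 0.90, we selected the largest gene cluster for the final marker-gene set. This procedure removed 55,132 genes from 1,402 OTUs that had ≥1× coverage in ≥10 samples.

IGGsearch is a command-line tool that uses Bowtie 2 to map metagenomic reads to the marker-gene database and quantify species-level OTUs. Read alignments with low percentage identity (minimum = 95%), alignment coverage (minimum = 70% of the read) or base quality (minimum = 20) are removed. For each metagenomic sample, OTU relative abundance is estimated by averaging the read depth across marker genes and normalizing these values to sum to 1.0. The presence of a species is determined on the basis of the percentage of marker genes with at least one mapped read.

The sensitivity and specificity of IGGsearch were evaluated on two benchmark datasets. First, we benchmarked IGGsearch on the CAMI challenge datasets (https://data.cami-challenge.org/participate) (Supplementary Tables 11, 12, Supplementary Fig. 2a). Second, we benchmarked IGGsearch on simulated gut metagenomes, which contained 500,000 to 50,000,000 paired-end reads with a read length of 100 bp, Illumina-type sequencing errors and one genome from each of a set of randomly selected gut species-level OTUs (Supplementary Fig. 2b). On the basis of these benchmarks, we called an OTU present when at least 15% of its marker genes were detected, which gave a good balance between sensitivity and specificity.

We used IGGsearch species profiles to identify species-level OTUs associated with disease across ten previously published studies, including colorectal cancer (ref. 43), type 2 diabetes (refs. 21, 44), rheumatoid arthritis (ref. 42), Parkinson's disease (ref. 75), atherosclerotic cardiovascular disease (ref. 76), obesity (ref. 80) and other conditions (ref. 77) (Extended Data Table 1, Supplementary Tables 15, 16). To identify species–disease associations, we used the Wilcoxon rank-sum test to compare the relative abundance of the 4,558 human gut species-level OTUs between cases and healthy controls. Low-prevalence OTUs were excluded to reduce the effect of multiple hypothesis testing. For each disease, P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure. We performed the same statistical procedure using species profiles from three other tools: MIDAS v.1.3.0 (ref. 4), MetaPhlAn2 v.2.7.7 (ref. 27) and mOTU v.1.1.1 (ref. 3). All tools were run with default parameters and their distributed reference data. To prevent confounding signals arising from disease treatment, we excluded 100 individuals who were taking medications that affect microbiome composition, including metformin in patients with type 2 diabetes (refs. 21, 44); acarbose, atorvastatin, fondaparinux and metoprolol in patients with atherosclerotic cardiovascular disease (ref. 76); and anti-rheumatic drugs in patients with rheumatoid arthritis (ref. 42).

We built random forest models, as implemented in the scikit-learn Python package (https://scikit-learn.org), to predict disease status from the species abundance profiles produced by IGGsearch, MIDAS, mOTU and MetaPhlAn2 (Extended Data Table 1). For IGGsearch, we included all 23,790 OTUs and allowed the random forest model to select the most predictive OTUs. Random forest models were implemented in scikit-learn v.0.19.1 (ref. 81) and optimized for each of the tools on each of the ten diseases. Specifically, we tested 1,000 random combinations of parameter values, including (1) the number of trees in the forest, (2) the number of features to consider at each split, (3) the maximum number of levels in each tree, (4) the minimum number of samples required to split a node and (5) the minimum number of samples required at each leaf. To avoid overfitting, each model was evaluated using tenfold cross-validation, and we selected the combination of parameters that yielded the best area under the receiver operating characteristic curve (ROC AUC). To obtain a robust estimate of model performance, all models were rerun 100 times and the ROC AUC values were averaged across runs.

We selected a subset of 504 human gut species-level OTUs from the Bacteria for comparative genomic analysis between cultivated and uncultivated organisms (Supplementary Table 17). OTUs with <5% prevalence in human gut metagenomes were excluded, because rare organisms may be amenable to cultivation but not yet sampled. Uncultivated OTUs were defined as those that contain only MAGs (either from the current study or previous studies, n = 271) and cultivated OTUs as those that contain at least one isolate genome (n = 233). We based all comparative analyses between OTUs on the 24,345 high-quality MAGs from the HGM dataset, which was done (1) to avoid biases that result from a comparison of MAGs to isolate genomes (which differ in assembly quality) and (2) to avoid issues arising from low completeness among MAGs in the medium-quality tier.

We compared several broad genomic features between groups, including estimated genome size, GC content, coding density and estimated replication rate. Estimated genome size was corrected for completeness and contamination using: Ĝ = G × 100/Ĉ − (G × K̂/100), in which Ĝ is the estimated genome size of a MAG, G is the observed genome size, Ĉ is the estimated percentage completeness and K̂ is the estimated percentage contamination.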
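As a worked example of the genome-size correction, the sketch below reads the formula as: estimated size = G × 100/completeness − G × contamination/100, with both estimates given as percentages. The function name and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def corrected_genome_size(observed_bp, completeness_pct, contamination_pct):
    """Scale the observed size up for missing sequence, then subtract the contaminated share."""
    if not 0 < completeness_pct <= 100:
        raise ValueError("completeness must be in (0, 100]")
    if not 0 <= contamination_pct <= 100:
        raise ValueError("contamination must be in [0, 100]")
    scaled_up = observed_bp * 100.0 / completeness_pct
    contaminant = observed_bp * contamination_pct / 100.0
    return scaled_up - contaminant


if __name__ == "__main__":
    # A 2.0 Mb MAG that is 80% complete and 5% contaminated:
    # 2,000,000 * 100 / 80 = 2,500,000; minus 2,000,000 * 5 / 100 = 100,000.
    print(corrected_genome_size(2_000_000, 80, 5))
```

With these toy inputs the corrected estimate is 2.4 Mb, which is larger than the observed 2.0 Mb because the completeness term outweighs the contamination term.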
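Replication rate was estimated with iRep v.1.10 (ref. 28) for MAGs with >5× read depth, on the basis of the difference in sequencing depth between the origin and terminus of replication. Genomic features were averaged across all high-quality MAGs for each OTU, and then compared between OTUs using the Wilcoxon rank-sum test (Supplementary Table 18).

To identify potential auxotrophies, we compared the prevalence of genes, modules and pathways from the KEGG database (v.77.1) (ref. 82) between cultivated and uncultivated OTUs. Proteins from high-quality MAGs were annotated on the basis of amino acid alignments using LAST v.828 (ref. 83) and assigned to the KEGG Orthology group with the lowest E value (E < 1 × 10⁻⁵). Next, we calculated the proportion of MAGs in each OTU that contained each KEGG Orthology group, and compared these proportions between cultivated and uncultivated OTUs using the Ives–Garland test as implemented in the phylolm R package v.2.6 (ref. 84). The Ives–Garland test performs logistic regression while controlling for phylogenetic differences between groups, and has previously been applied to microbiome data (ref. 85). The analysis was repeated for modules and pathways from the KEGG database. P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg procedure (Supplementary Table 19). The same analysis was performed using the TIGRFAM database (release 15.0) (ref. 86) (Extended Data Fig. 9a).

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.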
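The Benjamini–Hochberg correction applied to the association and enrichment tests above can be sketched in a few lines. This is a generic, dependency-free illustration of the standard step-up procedure, not the code used in the study; the function name and the example P values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"P = {p:.3f}  adjusted = {q:.3f}")
```

Each adjusted value is the smallest false discovery rate at which that test would be declared significant, so a feature is reported at a 5% false discovery rate when its adjusted value is at most 0.05.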
