Discovering Novel Proteoforms Using Proteogenomic Workflows Within the Galaxy Bioinformatics Platform.

Praveen Kumar, James E Johnson, Thomas McGowan, Matthew C Chambers, Mohammad Heydarian, Subina Mehta, Caleb Easterly, Timothy J Griffin, Pratik D Jagtap

Methods in molecular biology (Clifton, NJ) (2025)

Data Sciences & Quantitative Biology | Minnesota Supercomputing Institute | Bioinformatics and Proteome Informatics Consulting | R & D - Contract Testing Services - Millipore Sigma | Carolina Population Center

Abstract

Proteogenomics is a growing "multi-omics" research area that combines mass spectrometry-based proteomics and high-throughput nucleotide sequencing technologies. Proteogenomics has helped in genomic annotation for organisms whose complete genome sequences became available through high-throughput DNA sequencing technologies. Apart from genome annotation, this multi-omics approach has also helped researchers confirm expression of variant proteins belonging to unique proteoforms that could have resulted from single-nucleotide polymorphisms (SNPs), insertions and deletions (indels), splice isoforms, or other genome or transcriptome variations. A proteogenomic study depends on a multistep informatics workflow, requiring different software at each step. These integrated steps include creating an appropriate protein sequence database, matching spectral data against these sequences, and finally identifying peptide sequences corresponding to novel proteoforms, followed by variant classification and functional analysis. The disparate software required for a proteogenomic study is difficult for most researchers to access and use, especially those lacking computational expertise. Furthermore, using these tools separately can be error-prone, as it requires setting up individual parameters for each one. Consequently, reproducibility suffers. Managing output files from each tool is an additional challenge. One solution for these challenges in proteogenomics is the open-source, Web-based computational platform Galaxy. Its capability to create and manage workflows composed of disparate software while recording and saving all important parameters promotes both usability and reproducibility. Here, we describe a workflow that can perform proteogenomic analysis on a Galaxy-based platform. This Galaxy workflow facilitates matching of spectral data with a customized protein sequence database, identifying novel protein variants, assessing quality of results, and classifying variants along with visualization against the genome.
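
The step of "identifying peptide sequences corresponding to novel proteoforms" can be illustrated with a minimal Python sketch. It is not the Galaxy workflow or any of its tools; the function names (`tryptic_digest`, `novel_peptides`), the protein sequences, and the list of "identified" peptides are all hypothetical and chosen only for illustration. The sketch assumes simple trypsin rules (cleave after K or R unless followed by P, no missed cleavages) and treats a peptide as a candidate novel peptide when it does not occur anywhere in the reference proteins.

```python
def tryptic_digest(protein, min_len=6):
    """In-silico trypsin digest: cleave after K/R unless the next residue is P.

    Returns the set of fully cleaved peptides with at least min_len residues.
    """
    peptides, start = set(), 0
    for i, aa in enumerate(protein):
        at_end = i == len(protein) - 1
        cleave = aa in "KR" and not at_end and protein[i + 1] != "P"
        if cleave or at_end:
            pep = protein[start:i + 1]
            if len(pep) >= min_len:
                peptides.add(pep)
            start = i + 1
    return peptides


def novel_peptides(identified, reference_proteins):
    """Keep identified peptides that are not a substring of any reference protein."""
    return sorted(
        pep for pep in identified
        if not any(pep in seq for seq in reference_proteins.values())
    )


# Hypothetical reference protein and a variant carrying one substitution (R -> G).
reference = {"REF1": "MKTAYIAKQRQISFVKSHFSRQLEER"}
variant = {"VAR1": "MKTAYIAKQRQISFVKSHFSGQLEER"}

# Customized database = reference plus variant sequences.
custom_db = {**reference, **variant}

# Stand-in for peptides returned by a spectral search against custom_db.
identified = ["TAYIAK", "QISFVK", "SHFSGQLEER"]

# Sanity check: every identified peptide is a tryptic peptide of the custom database.
db_peptides = set().union(*(tryptic_digest(s) for s in custom_db.values()))
assert all(p in db_peptides for p in identified)

candidates = novel_peptides(identified, reference)
print(candidates)
```

In practice this filtering step also has to account for details the sketch ignores, such as missed cleavages, isobaric residues (I/L), and false discovery rate control.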

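[Key points]: This paper presents a method for discovering novel proteoforms using proteogenomic workflows on the Galaxy bioinformatics platform, with the aim of improving the accessibility and reproducibility of proteogenomic research.

[Methods]: The study combines mass spectrometry-based proteomics with high-throughput nucleotide sequencing. It builds a customized protein sequence database, matches spectral data against it, identifies novel protein variants, and performs variant classification and functional analysis.

[Experiments]: The study developed a workflow on the Galaxy platform. Its outputs include matches between spectral data and the customized protein sequence database, identification of novel protein variants, and classification of variants. The dataset used is not explicitly stated in the text.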