DDA - TMT¶
This tutorial demonstrates how to analyze Tandem Mass Tag (TMT) labeled data from data-dependent acquisition (DDA) using the msmu package.
Data Preparation¶
The original dataset is from PXD013361 (Magnusson et al., 2019), and the search was performed with Sage v0.14.7.
For demonstration purposes, the example dataset has been trimmed to include only 5,000 PSMs.
base_dir = "https://raw.githubusercontent.com/bertis-informatics/msmu/refs/heads/main/data/sage_tmt"
sage_idents = f"{base_dir}/sage/results.sage.tsv"
sage_quants = f"{base_dir}/sage/tmt.tsv"
meta = f"{base_dir}/meta.csv"
Load Required Packages¶
If you haven't installed the msmu package yet, please follow the installation guide.
import msmu as mm
import pandas as pd
import plotly.io as pio
pio.renderers.default = "png"
Read Data¶
You can read data from various proteomics software outputs. Below are examples for Sage, MaxQuant, and FragPipe formats.
For this tutorial, we will use the Sage output as an example.
The read_sage() function reads the Sage output files (results.sage.tsv, tmt.tsv) and stores them as modalities in a MuData object.
# Sage format
mdata = mm.read_sage(identification_file=sage_idents, quantification_file=sage_quants, label="tmt")
# MaxQuant format
# mdata = mm.read_maxquant(identification_file="path_to_maxquant_output", label="tmt", acquisition="dda")
# FragPipe format
# mdata = mm.read_fragpipe(identification_file="path_to_fragpipe_output", label="tmt", acquisition="dda")
INFO - Identification file loaded: (5000, 40)
INFO - Quantification file loaded: (5000, 9)
INFO - Decoy entries separated: (1195, 13)
mdata
MuData object with n_obs × n_vars = 6 × 3805
uns: '_cmd'
1 modality
psm: 6 x 3805
var: 'proteins', 'peptide', 'stripped_peptide', 'filename', 'scan_num', 'charge', 'peptide_length', 'missed_cleavages', 'semi_enzymatic', 'contaminant', 'PEP', 'q_value'
uns: 'level', 'search_engine', 'quantification', 'label', 'acquisition', 'identification_file', 'quantification_file', 'decoy'
varm: 'search_result'
Adding Metadata¶
Optionally, you can add sample metadata to the mdata.obs DataFrame. Make sure that the index of the metadata DataFrame matches the sample names in mdata.obs.
meta_df = pd.read_csv(meta, dtype=str)
meta_df = meta_df.set_index("tag") # set the index to match sample id in mdata.obs
mdata.obs = mdata.obs.join(meta_df)
mdata.push_obs() # update all modalities with the new obs data
mdata.obs
| tag | set | sample_id | sample_name | condition |
|---|---|---|---|---|
| 126 | S1 | t0h | t0h | t0 |
| 127 | S1 | t1h | t1h | t1 |
| 128 | S1 | t2h | t2h | t2 |
| 129 | S1 | t6h | t6h | t6 |
| 130 | S1 | t24h | t24h | t24 |
| 131 | S1 | t120h | t120h | t120 |
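The join above aligns rows by index, which is why the metadata index must match the TMT channel names in mdata.obs. A toy pandas sketch (with hypothetical channel and condition values) illustrates the alignment:

```python
import pandas as pd

# obs indexed by TMT channel, as in mdata.obs (values are illustrative)
obs = pd.DataFrame(index=pd.Index(["126", "127", "128"], name="tag"))

# metadata read as strings so channel names match, then indexed by "tag"
meta_df = pd.DataFrame(
    {"tag": ["126", "127", "128"], "condition": ["t0", "t1", "t2"]}
).set_index("tag")

# rows align by index; a tag missing from meta_df would produce NaN values
joined = obs.join(meta_df)
```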
Removing Blank Channels¶
If your TMT data contains blank channels, you can remove them by filtering the MuData object on obs.
mdata = mm.pp.add_filter(mdata, modality="psm", column="condition", keep="not_contains", value="BLANK", on="obs")
mdata = mm.pp.apply_filter(mdata, modality="psm", on="obs")
[Filter] Applying obs filters: ['condition_not_contains_BLANK']
Saving Raw MuData Object¶
After reading the search output, saving the MuData object as an h5mu file is recommended for future use.
mdata.write_h5mu("dda_tmt_PXD013361_raw.h5mu")
Handling PSM level¶
Filtering - PSM¶
You can filter the data based on the column values, such as q-value. You can also filter the data based on string containment, which can be useful for removing contaminants or decoys.
Filtering is split into two steps: first, you mark a filter condition using mm.pp.add_filter(), and then you apply the filter using mm.pp.apply_filter().
Here, we keep PSMs with q-value < 0.01 and remove contaminants (protein IDs containing "contam_").
mdata = mm.pp.add_filter(mdata, modality="psm", column="q_value", keep="lt", value=0.01)
mdata = mm.pp.add_filter(mdata, modality="psm", column="proteins", keep="not_contains", value="contam_")
mdata = mm.pp.apply_filter(mdata, modality="psm", on="var")
mdata
[Filter] Applying var filters: ['q_value_lt_0.01', 'proteins_not_contains_contam_']
MuData object with n_obs × n_vars = 6 × 2264
obs: 'set', 'sample_id', 'sample_name', 'condition'
uns: '_cmd'
1 modality
psm: 6 x 2264
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'proteins', 'peptide', 'stripped_peptide', 'filename', 'scan_num', 'charge', 'peptide_length', 'missed_cleavages', 'semi_enzymatic', 'contaminant', 'PEP', 'q_value'
uns: 'level', 'search_engine', 'quantification', 'label', 'acquisition', 'identification_file', 'quantification_file', 'decoy', 'filter', 'decoy_filter'
obsm: 'filter'
varm: 'search_result', 'filter'
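Conceptually, the two filter conditions are combined with a logical AND into a single boolean mask over the feature table, roughly equivalent to this pandas sketch (toy data, not the msmu internals):

```python
import pandas as pd

# toy PSM feature table
var = pd.DataFrame({
    "proteins": ["sp|P1", "contam_K1", "sp|P2"],
    "q_value": [0.001, 0.0005, 0.05],
})

# keep q_value < 0.01 AND protein IDs not containing "contam_"
mask = (var["q_value"] < 0.01) & ~var["proteins"].str.contains("contam_")
kept = var[mask]
```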
Normalization¶
Here, we log2 transform and normalize the data at the PSM level.
Median centering normalization is applied using mm.pp.normalize() function.
Optionally, if you need to correct batch effects using Global Internal Standard (GIS) channels, you can scale the data using the mm.pp.scale_feature() function. Make sure each TMT batch contains GIS channels. The example code below assumes the GIS channel names start with "POOLED_".
mdata = mm.pp.log2_transform(mdata, modality="psm")
mdata = mm.pp.normalize(mdata, modality="psm", method="median", fraction=True)
# GIS-based scaling (if needed)
# mdata = mm.pp.correct_batch_effect(mdata=mdata, modality="psm", method="gis", gis_prefix="POOLED_")
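Median centering shifts each sample so that all samples share the same median log2 intensity, removing sample-level loading differences. A minimal numpy sketch of the idea (the actual msmu implementation, including the fraction option, may differ):

```python
import numpy as np

# samples x features matrix of log2 intensities (toy values)
x = np.array([[10.0, 12.0, 14.0],
              [11.0, 13.0, 15.0]])

# subtract each sample's median, then re-center on the global median
sample_medians = np.median(x, axis=1, keepdims=True)
centered = x - sample_medians + np.median(x)
```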
Computing precursor isolation purity¶
For TMT data, it is recommended to compute precursor isolation purity so that low-purity features can be excluded when aggregating quantification. The compute_precursor_isolation_purity() function calculates precursor isolation purity from mzML files.
For demonstration purposes, we will skip this step in the tutorial.
# Computing precursor isolation purity
# mdata = mm.pp.compute_precursor_isolation_purity(mdata, mzml_paths=["path/to/sample1.mzML", "path/to/sample2.mzML", ...])
# purity_threshold only takes effect if precursor purity has been computed;
# otherwise it is ignored with a warning
mdata = mm.pp.to_peptide(mdata, purity_threshold=0.7)
WARNING - Purity column not found in psm modality for TMT data. Skipping purity filtering.
INFO - Peptide-level identifications: 2204 (2151 at 1% FDR)
Building new peptide quantification data.
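Rolling up to the peptide level groups all PSMs that share the same peptide sequence and combines their reporter-channel values. A toy pandas sketch of the idea (using the median of log2 values as an illustration; msmu's actual aggregation may differ):

```python
import pandas as pd

# toy PSM table: two PSMs for the same peptide, one for another (log2 values)
psms = pd.DataFrame({
    "peptide": ["PEPTIDEK", "PEPTIDEK", "ANOTHERK"],
    "126": [10.0, 12.0, 9.0],
    "127": [11.0, 13.0, 8.0],
})

# one row per peptide, combining its PSMs channel by channel
peptide_mat = psms.groupby("peptide")[["126", "127"]].median()
```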
Filtering - peptide¶
mdata = mm.pp.add_filter(mdata, modality="peptide", column="q_value", keep="lt", value=0.01)
mdata = mm.pp.apply_filter(mdata, modality="peptide", on="var")
mdata
[Filter] Applying var filters: ['q_value_lt_0.01']
MuData object with n_obs × n_vars = 6 × 4415
obs: 'set', 'sample_id', 'sample_name', 'condition'
uns: '_cmd'
2 modalities
psm: 6 x 2264
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'proteins', 'peptide', 'stripped_peptide', 'filename', 'scan_num', 'charge', 'peptide_length', 'missed_cleavages', 'semi_enzymatic', 'contaminant', 'PEP', 'q_value'
uns: 'level', 'search_engine', 'quantification', 'label', 'acquisition', 'identification_file', 'quantification_file', 'decoy', 'filter', 'decoy_filter'
obsm: 'filter'
varm: 'search_result', 'filter'
peptide: 6 x 2151
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'peptide', 'proteins', 'stripped_peptide', 'count_psm', 'PEP', 'q_value'
uns: 'level', 'decoy', 'filter', 'decoy_filter'
varm: 'filter'
Protein inference¶
You can infer protein-level data from peptide-level data using the mm.pp.infer_protein() function.
mdata = mm.pp.infer_protein(mdata)
INFO - Starting protein inference
INFO - Initial proteins: 1722
INFO - Removed indistinguishable: 175
INFO - Removed subsettable: 62
INFO - Removed subsumable: 0
INFO - Total protein groups: 1485
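The log above reports merging indistinguishable proteins and removing subset proteins. A toy pure-Python sketch of that parsimony idea (hypothetical proteins and peptides, not the msmu algorithm):

```python
# toy peptide sets per protein (hypothetical data)
peptides = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepA", "pepB", "pepC"},  # indistinguishable from P1
    "P3": {"pepA"},                  # subset of P1/P2 -> removed
    "P4": {"pepD"},
}

# 1) merge proteins with identical peptide sets into one group
groups = {}
for prot, peps in peptides.items():
    groups.setdefault(frozenset(peps), []).append(prot)

# 2) drop groups whose peptide set is a strict subset of another group's
kept = {
    peps: prots
    for peps, prots in groups.items()
    if not any(peps < other for other in groups if other != peps)
}

protein_groups = sorted(";".join(sorted(p)) for p in kept.values())
```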
mdata = mm.pp.to_protein(mdata, top_n=3, rank_method="median_intensity")
[Filter] Applying var filters: ['q_value_lt_0.01', 'peptide_type_eq_unique']
INFO - Ranking features by 'median_intensity' to select top 3 features.
INFO - Protein-level identifications : 1465 (1432 at 1% FDR)
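With top_n=3 and rank_method="median_intensity", each protein is quantified from its three highest-ranked unique peptides. A toy pandas sketch of that selection (using a mean over the selected peptides as an illustration; msmu's actual aggregation may differ):

```python
import pandas as pd

# toy peptide table: log2 intensities per sample, with protein assignment
df = pd.DataFrame({
    "protein": ["P1", "P1", "P1", "P1", "P2"],
    "s1": [10.0, 12.0, 11.0, 5.0, 9.0],
    "s2": [10.0, 12.0, 11.0, 5.0, 9.0],
})
samples = ["s1", "s2"]

# rank peptides within each protein by median intensity and keep the top 3
df["median_intensity"] = df[samples].median(axis=1)
top = (df.sort_values("median_intensity", ascending=False)
         .groupby("protein", sort=True)
         .head(3))

# combine the selected peptides into one protein-level value per sample
protein_mat = top.groupby("protein")[samples].mean()
```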
Filtering - protein¶
mdata = mm.pp.add_filter(mdata, modality="protein", column="q_value", keep="lt", value=0.01)
mdata = mm.pp.apply_filter(mdata, modality="protein", on="var")
mdata
[Filter] Applying var filters: ['q_value_lt_0.01']
MuData object with n_obs × n_vars = 6 × 5847
obs: 'set', 'sample_id', 'sample_name', 'condition'
uns: '_cmd'
3 modalities
psm: 6 x 2264
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'proteins', 'peptide', 'stripped_peptide', 'filename', 'scan_num', 'charge', 'peptide_length', 'missed_cleavages', 'semi_enzymatic', 'contaminant', 'PEP', 'q_value'
uns: 'level', 'search_engine', 'quantification', 'label', 'acquisition', 'identification_file', 'quantification_file', 'decoy', 'filter', 'decoy_filter'
obsm: 'filter'
varm: 'search_result', 'filter'
peptide: 6 x 2151
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'peptide', 'proteins', 'stripped_peptide', 'count_psm', 'PEP', 'q_value', 'protein_group', 'peptide_type'
uns: 'level', 'decoy', 'filter', 'decoy_filter'
varm: 'filter'
protein: 6 x 1432
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'count_psm', 'count_stripped_peptide', 'PEP', 'q_value'
uns: 'level', 'decoy', 'filter', 'decoy_filter'
varm: 'filter'
Identification count plot¶
mm.pl.plot_id(mdata, modality="protein", obs_column="sample_name")
Intensity distribution plot¶
mm.pl.plot_intensity(mdata, modality="protein", groupby="condition", obs_column="sample_name")
Saving Processed MuData Object¶
You can save the MuData object to an h5mu file.
This allows you to easily reload the processed data in future sessions without repeating the entire analysis pipeline.
mdata.write_h5mu("dda_tmt_PXD013361.h5mu")
Citation¶
Magnusson, R., Rundquist, O., Kim, M. J., Hellberg, S., Na, C. H., Benson, M., ... & Gustafsson, M. (2019). A validated strategy to infer protein biomarkers from RNA-Seq by combining multiple mRNA splice variants and time-delay. bioRxiv, 599373.
Lazear, M. R. (2023). Sage: an open-source tool for fast proteomics searching and quantification at scale. Journal of Proteome Research, 22(11), 3652-3659.