DDA - TMT¶
This tutorial demonstrates how to analyze Tandem Mass Tag (TMT) labeled data from data-dependent acquisition (DDA) using the msmu package.
Data Preparation¶
The original dataset is from PXD013361 (Magnusson et al., 2019), and the search was performed with Sage v0.14.7.
For demonstration purposes, the example dataset has been trimmed to include only 5,000 PSMs.
base_dir = "https://raw.githubusercontent.com/bertis-informatics/msmu/refs/heads/main/data/sage_tmt"
sage_idents = f"{base_dir}/sage/results.sage.tsv"
sage_quants = f"{base_dir}/sage/tmt.tsv"
meta = f"{base_dir}/meta.csv"
Load Required Packages¶
If you haven't installed the msmu package yet, please follow the installation guide.
import msmu as mm
import pandas as pd
import plotly.io as pio
pio.renderers.default = "png"
Read Data¶
You can read data from various proteomics software outputs. Below are examples for Sage, MaxQuant, and FragPipe formats.
For this tutorial, we will use the Sage output as an example.
The read_sage() function reads the Sage output files (results.sage.tsv, tmt.tsv) and stores them as modalities in a MuData object.
# Sage format
mdata = mm.read_sage(identification_file=sage_idents, quantification_file=sage_quants, label="tmt")
# MaxQuant format
# mdata = mm.read_maxquant(identification_file="path_to_maxquant_output", label="tmt", acquisition="dda")
# FragPipe format
# mdata = mm.read_fragpipe(identification_file="path_to_fragpipe_output", label="tmt", acquisition="dda")
INFO - Identification file loaded: (5000, 40)
INFO - Quantification file loaded: (5000, 9)
INFO - Decoy entries separated: (1195, 13)
mdata
MuData object with n_obs × n_vars = 6 × 3805
uns: '_cmd'
1 modality
psm: 6 x 3805
var: 'proteins', 'peptide', 'stripped_peptide', 'filename', 'scan_num', 'charge', 'peptide_length', 'missed_cleavages', 'semi_enzymatic', 'contaminant', 'PEP', 'q_value'
uns: 'level', 'search_engine', 'quantification', 'label', 'acquisition', 'identification_file', 'quantification_file', 'decoy'
varm: 'search_result'
Adding Metadata¶
Optionally, you can add sample metadata to the mdata.obs DataFrame. Make sure that the index of the metadata DataFrame matches the sample names in mdata.obs.
meta_df = pd.read_csv(meta, dtype=str)
meta_df = meta_df.set_index("tag") # set the index to match sample id in mdata.obs
mdata.obs = mdata.obs.join(meta_df)
mdata.push_obs() # update all modalities with the new obs data
mdata.obs
| tag | set | sample_id | sample_name | condition |
|---|---|---|---|---|
| 126 | S1 | t0h | t0h | t0 |
| 127 | S1 | t1h | t1h | t1 |
| 128 | S1 | t2h | t2h | t2 |
| 129 | S1 | t6h | t6h | t6 |
| 130 | S1 | t24h | t24h | t24 |
| 131 | S1 | t120h | t120h | t120 |
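The join above aligns rows by index, which is why the metadata index must match the TMT channel names in mdata.obs. A toy pandas sketch (with hypothetical channel and condition values) illustrates the alignment:

```python
import pandas as pd

# obs indexed by TMT channel, as in mdata.obs (values are illustrative)
obs = pd.DataFrame(index=pd.Index(["126", "127", "128"], name="tag"))

# metadata read as strings so channel names match, then indexed by "tag"
meta_df = pd.DataFrame(
    {"tag": ["126", "127", "128"], "condition": ["t0", "t1", "t2"]}
).set_index("tag")

# rows align by index; a tag missing from meta_df would produce NaN values
joined = obs.join(meta_df)
```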
Removing Blank Channels¶
If your TMT data contains blank channels, you can remove them by filtering the MuData object on obs.
mdata = mm.pp.add_filter(mdata, modality="psm", column="condition", keep="not_contains", value="BLANK", on="obs")
mdata = mm.pp.apply_filter(mdata, modality="psm", on="obs")
[Filter] Applying obs filters: ['condition_not_contains_BLANK']
Saving Raw MuData Object¶
After reading the search output, saving the MuData object as an h5mu file is recommended for future use.
mdata.write_h5mu("dda_tmt_PXD013361_raw.h5mu")
Handling PSM level¶
Filtering - PSM¶
You can filter the data based on the column values, such as q-value. You can also filter the data based on string containment, which can be useful for removing contaminants or decoys.
Filtering is split into two steps: first, you mark a filter condition using mm.pp.add_filter(), and then you apply the filter using mm.pp.apply_filter().
Here, we keep PSMs with q-value < 0.01 and remove contaminants (protein IDs containing "contam_").
mdata = mm.pp.add_filter(mdata, modality="psm", column="q_value", keep="lt", value=0.01)
mdata = mm.pp.add_filter(mdata, modality="psm", column="proteins", keep="not_contains", value="contam_")
mdata = mm.pp.apply_filter(mdata, modality="psm", on="var")
mdata
[Filter] Applying var filters: ['q_value_lt_0.01', 'proteins_not_contains_contam_']
MuData object with n_obs × n_vars = 6 × 2264
obs: 'set', 'sample_id', 'sample_name', 'condition'
uns: '_cmd'
1 modality
psm: 6 x 2264
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'proteins', 'peptide', 'stripped_peptide', 'filename', 'scan_num', 'charge', 'peptide_length', 'missed_cleavages', 'semi_enzymatic', 'contaminant', 'PEP', 'q_value'
uns: 'level', 'search_engine', 'quantification', 'label', 'acquisition', 'identification_file', 'quantification_file', 'decoy', 'filter', 'decoy_filter'
obsm: 'filter'
varm: 'search_result', 'filter'
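Conceptually, the two filter conditions are combined with a logical AND into a single boolean mask over the feature table, roughly equivalent to this pandas sketch (toy data, not the msmu internals):

```python
import pandas as pd

# toy PSM feature table
var = pd.DataFrame({
    "proteins": ["sp|P1", "contam_K1", "sp|P2"],
    "q_value": [0.001, 0.0005, 0.05],
})

# keep q_value < 0.01 AND protein IDs not containing "contam_"
mask = (var["q_value"] < 0.01) & ~var["proteins"].str.contains("contam_")
kept = var[mask]
```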
Normalization¶
Here, we log2 transform and normalize the data at the PSM level.
Median centering normalization is applied using mm.pp.normalize() function.
Optionally, if you need to correct batch effects using Global Internal Standard (GIS) channels, you can scale the data using the mm.pp.scale_feature() function. Make sure each TMT batch contains GIS channels. The example code below assumes the GIS channel names start with "POOLED_".
mdata = mm.pp.log2_transform(mdata, modality="psm")
mdata = mm.pp.normalize(mdata, modality="psm", method="median", fraction=True)
# GIS-based scaling (if needed)
# mdata = mm.pp.correct_batch_effect(mdata=mdata, modality="psm", method="gis", gis_prefix="POOLED_")
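Median centering shifts each sample so that all samples share the same median log2 intensity, removing sample-level loading differences. A minimal numpy sketch of the idea (the actual msmu implementation, including the fraction option, may differ):

```python
import numpy as np

# samples x features matrix of log2 intensities (toy values)
x = np.array([[10.0, 12.0, 14.0],
              [11.0, 13.0, 15.0]])

# subtract each sample's median, then re-center on the global median
sample_medians = np.median(x, axis=1, keepdims=True)
centered = x - sample_medians + np.median(x)
```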
Computing precursor isolation purity¶
For TMT data, it is recommended to compute precursor isolation purity so that low-purity features can be excluded when aggregating quantification. The compute_precursor_isolation_purity() function calculates precursor isolation purity from mzML files.
For demonstration purposes, we will skip this step in the tutorial.
# Computing precursor isolation purity
# mdata = mm.pp.compute_precursor_isolation_purity(mdata, mzml_paths=["path/to/sample1.mzML", "path/to/sample2.mzML", ...])
# purity_threshold only takes effect if precursor purity has been computed;
# otherwise it is ignored with a warning
mdata = mm.pp.to_peptide(mdata, purity_threshold=0.7)
WARNING - Purity column not found in psm modality for TMT data. Skipping purity filtering.
INFO - Peptide-level identifications: 2204 (2151 at 1% FDR)
Building new peptide quantification data.
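Rolling up to the peptide level groups all PSMs that share the same peptide sequence and combines their reporter-channel values. A toy pandas sketch of the idea (using the median of log2 values as an illustration; msmu's actual aggregation may differ):

```python
import pandas as pd

# toy PSM table: two PSMs for the same peptide, one for another (log2 values)
psms = pd.DataFrame({
    "peptide": ["PEPTIDEK", "PEPTIDEK", "ANOTHERK"],
    "126": [10.0, 12.0, 9.0],
    "127": [11.0, 13.0, 8.0],
})

# one row per peptide, combining its PSMs channel by channel
peptide_mat = psms.groupby("peptide")[["126", "127"]].median()
```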
Filtering - peptide¶
mdata = mm.pp.add_filter(mdata, modality="peptide", column="q_value", keep="lt", value=0.01)
mdata = mm.pp.apply_filter(mdata, modality="peptide", on="var")
mdata
[Filter] Applying var filters: ['q_value_lt_0.01']
MuData object with n_obs × n_vars = 6 × 4415
obs: 'set', 'sample_id', 'sample_name', 'condition'
uns: '_cmd'
2 modalities
psm: 6 x 2264
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'proteins', 'peptide', 'stripped_peptide', 'filename', 'scan_num', 'charge', 'peptide_length', 'missed_cleavages', 'semi_enzymatic', 'contaminant', 'PEP', 'q_value'
uns: 'level', 'search_engine', 'quantification', 'label', 'acquisition', 'identification_file', 'quantification_file', 'decoy', 'filter', 'decoy_filter'
obsm: 'filter'
varm: 'search_result', 'filter'
peptide: 6 x 2151
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'peptide', 'proteins', 'stripped_peptide', 'count_psm', 'PEP', 'q_value'
uns: 'level', 'decoy', 'filter', 'decoy_filter'
varm: 'filter'
Protein inference¶
You can infer protein-level data from peptide-level data using the mm.pp.infer_protein() function.
mdata = mm.pp.infer_protein(mdata)
INFO - Starting protein inference
INFO - Initial proteins: 1722
INFO - Removed indistinguishable: 175
INFO - Removed subsettable: 62
INFO - Removed subsumable: 0
INFO - Total protein groups: 1485
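The log above reports merging indistinguishable proteins and removing subset proteins. A toy pure-Python sketch of that parsimony idea (hypothetical proteins and peptides, not the msmu algorithm):

```python
# toy peptide sets per protein (hypothetical data)
peptides = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepA", "pepB", "pepC"},  # indistinguishable from P1
    "P3": {"pepA"},                  # subset of P1/P2 -> removed
    "P4": {"pepD"},
}

# 1) merge proteins with identical peptide sets into one group
groups = {}
for prot, peps in peptides.items():
    groups.setdefault(frozenset(peps), []).append(prot)

# 2) drop groups whose peptide set is a strict subset of another group's
kept = {
    peps: prots
    for peps, prots in groups.items()
    if not any(peps < other for other in groups if other != peps)
}

protein_groups = sorted(";".join(sorted(p)) for p in kept.values())
```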
mdata = mm.pp.to_protein(mdata, top_n=3, rank_method="median_intensity")
[Filter] Applying var filters: ['q_value_lt_0.01', 'peptide_type_eq_unique']
INFO - Ranking features by 'median_intensity' to select top 3 features.
INFO - Protein-level identifications : 1465 (1432 at 1% FDR)
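With top_n=3 and rank_method="median_intensity", each protein is quantified from its three highest-ranked unique peptides. A toy pandas sketch of that selection (using a mean over the selected peptides as an illustration; msmu's actual aggregation may differ):

```python
import pandas as pd

# toy peptide table: log2 intensities per sample, with protein assignment
df = pd.DataFrame({
    "protein": ["P1", "P1", "P1", "P1", "P2"],
    "s1": [10.0, 12.0, 11.0, 5.0, 9.0],
    "s2": [10.0, 12.0, 11.0, 5.0, 9.0],
})
samples = ["s1", "s2"]

# rank peptides within each protein by median intensity and keep the top 3
df["median_intensity"] = df[samples].median(axis=1)
top = (df.sort_values("median_intensity", ascending=False)
         .groupby("protein", sort=True)
         .head(3))

# combine the selected peptides into one protein-level value per sample
protein_mat = top.groupby("protein")[samples].mean()
```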
Filtering - protein¶
mdata = mm.pp.add_filter(mdata, modality="protein", column="q_value", keep="lt", value=0.01)
mdata = mm.pp.apply_filter(mdata, modality="protein", on="var")
mdata
[Filter] Applying var filters: ['q_value_lt_0.01']
MuData object with n_obs × n_vars = 6 × 5847
obs: 'set', 'sample_id', 'sample_name', 'condition'
uns: '_cmd'
3 modalities
psm: 6 x 2264
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'proteins', 'peptide', 'stripped_peptide', 'filename', 'scan_num', 'charge', 'peptide_length', 'missed_cleavages', 'semi_enzymatic', 'contaminant', 'PEP', 'q_value'
uns: 'level', 'search_engine', 'quantification', 'label', 'acquisition', 'identification_file', 'quantification_file', 'decoy', 'filter', 'decoy_filter'
obsm: 'filter'
varm: 'search_result', 'filter'
peptide: 6 x 2151
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'peptide', 'proteins', 'stripped_peptide', 'count_psm', 'PEP', 'q_value', 'protein_group', 'peptide_type'
uns: 'level', 'decoy', 'filter', 'decoy_filter'
varm: 'filter'
protein: 6 x 1432
obs: 'set', 'sample_id', 'sample_name', 'condition'
var: 'count_psm', 'count_stripped_peptide', 'PEP', 'q_value'
uns: 'level', 'decoy', 'filter', 'decoy_filter'
varm: 'filter'
Identification count plot¶
mm.pl.plot_id(mdata, modality="protein", obs_column="sample_name")
Intensity distribution plot¶
mm.pl.plot_intensity(mdata, modality="protein", groupby="condition", obs_column="sample_name")
Saving Processed MuData Object¶
You can save the MuData object to an h5mu file.
This allows you to easily reload the processed data in future sessions without repeating the entire analysis pipeline.
mdata.write_h5mu("dda_tmt_PXD013361.h5mu")
Citation¶
Magnusson, R., Rundquist, O., Kim, M. J., Hellberg, S., Na, C. H., Benson, M., ... & Gustafsson, M. (2019). A validated strategy to infer protein biomarkers from RNA-Seq by combining multiple mRNA splice variants and time-delay. bioRxiv, 599373.
Lazear, M. R. (2023). Sage: an open-source tool for fast proteomics searching and quantification at scale. Journal of Proteome Research, 22(11), 3652-3659.