Summarization

Overview

The term Summarization refers to aggregating identification features and quantitative values as data move from one hierarchical level to the next (i.e., PSM/precursor -> peptide -> protein).

Summarization functions are provided as to_* methods, such as to_peptide, to_protein, and to_ptm.

The Summarization process generally involves:

Feature selection
Selecting features to include in the aggregation based on criteria such as peptide type (unique/shared), precursor isolation purity, or abundance.
Intensity aggregation
Aggregating quantification values using methods such as median, mean, or sum.
Identification confidence scoring
Computing PEP and q-values for the aggregated features at the new level when possible, using appropriate methods.

`to_peptide()`

The to_peptide() function takes:

MuData containing a PSM-level modality

and returns:

MuData with a peptide-level modality

This step aggregates PSMs and their quantification values by peptide (non-redundant modified peptide). Peptide-level PEP is calculated with the best_pep method by default, and peptide-level q-values are computed using a conservative approach when decoy information is available.
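The identification side of this step can be illustrated with a minimal sketch in plain Python. It assumes that best_pep keeps the lowest PEP among the PSMs of each peptide, and it uses a standard target-decoy estimate (decoys / targets, made monotone) for q-values. Both are assumptions made for illustration and may differ from the library's conservative approach; the record layout and function names are hypothetical.

```python
# Hypothetical PSM records: (modified_peptide, pep, is_decoy)
psms = [
    ("PEPTIDEK", 0.01, False),
    ("PEPTIDEK", 0.20, False),
    ("AC[+57.0215]DEFK", 0.05, False),
    ("KEDITPEP", 0.03, True),
    ("LMNPQR", 0.30, False),
]


def best_pep_per_peptide(records):
    """Keep the lowest PEP (and its decoy flag) among the PSMs of each peptide."""
    best = {}
    for peptide, pep, is_decoy in records:
        if peptide not in best or pep < best[peptide][0]:
            best[peptide] = (pep, is_decoy)
    return best


def decoy_q_values(best):
    """Assumed target-decoy q-values: running decoys/targets, then a reverse running minimum."""
    ranked = sorted(best.items(), key=lambda item: item[1][0])
    targets = decoys = 0
    fdrs = []
    for _, (_, is_decoy) in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))
    # q-value = smallest FDR estimate at this rank or any later rank
    q_values = {}
    running_min = float("inf")
    for (peptide, _), fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        q_values[peptide] = running_min
    return q_values


best = best_pep_per_peptide(psms)
q = decoy_q_values(best)
print(best["PEPTIDEK"])  # (0.01, False)
print(q["PEPTIDEK"])     # 0.0
```

Without decoy entries, no q-value can be estimated this way, which matches the "when decoy information is available" condition above.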

For quantification aggregation, the default method is median. An optional top_n argument restricts aggregation to the top N (e.g., top 3) features within each peptide. Features are ranked by median_intensity unless specified otherwise.
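To make the interaction between ranking and aggregation concrete, here is a small standalone sketch. It is not the library's implementation: the function name aggregate_top_n and the dictionary layout are hypothetical, and missing values are not handled.

```python
from statistics import median

# Hypothetical intensities for the PSMs of one peptide: one list entry per sample
psm_intensities = {
    "psm_1": [100.0, 120.0, 110.0],
    "psm_2": [10.0, 12.0, 11.0],
    "psm_3": [50.0, 60.0, 55.0],
    "psm_4": [80.0, 90.0, 85.0],
}


def aggregate_top_n(features, top_n=None):
    """Rank features by median intensity across samples, keep the top N, take per-sample medians."""
    ranked = sorted(features, key=lambda name: median(features[name]), reverse=True)
    selected = ranked if top_n is None else ranked[:top_n]
    n_samples = len(next(iter(features.values())))
    return [median(features[name][i] for name in selected) for i in range(n_samples)]


print(aggregate_top_n(psm_intensities, top_n=3))     # [80.0, 90.0, 85.0]
print(aggregate_top_n(psm_intensities, top_n=None))  # [65.0, 75.0, 70.0]
```

With top_n=3 the weakest feature (psm_2) is dropped before aggregation, so the summarized intensities are less affected by low-abundance measurements.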

In TMT studies, PSMs with low precursor isolation purity may be excluded prior to quantification aggregation to remove spectra with low quantitative accuracy. Precursor isolation purity should be computed with mm.pp.compute_precursor_isolation_purity() before calling to_peptide(). A purity_threshold (commonly 0.7) can be applied during aggregation.

Note that filtering by top_n or purity_threshold affects quantification aggregation only and does not modify identification feature aggregation.
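The separation between identification and quantification can be sketched as follows. This is an illustration only: the record fields and the summarize_peptide helper are hypothetical, and treating the threshold as inclusive (purity >= threshold) is an assumption.

```python
from statistics import median

# Hypothetical PSMs of a single peptide in one TMT channel
psms = [
    {"id": "psm_1", "purity": 0.95, "pep": 0.02, "intensity": 200.0},
    {"id": "psm_2", "purity": 0.40, "pep": 0.001, "intensity": 900.0},
    {"id": "psm_3", "purity": 0.80, "pep": 0.05, "intensity": 240.0},
]


def summarize_peptide(records, purity_threshold=None):
    # Identification features: every PSM contributes, regardless of purity
    best_pep = min(r["pep"] for r in records)
    # Quantification: only PSMs that pass the purity threshold contribute
    quant = [
        r for r in records
        if purity_threshold is None or r["purity"] >= purity_threshold
    ]
    intensity = median(r["intensity"] for r in quant) if quant else None
    return {"n_psm": len(records), "best_pep": best_pep, "intensity": intensity}


result = summarize_peptide(psms, purity_threshold=0.7)
print(result)  # {'n_psm': 3, 'best_pep': 0.001, 'intensity': 220.0}
```

Here psm_2 is excluded from the intensity median because of its low purity, yet it still supplies the best PEP for the peptide.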

mdata = mm.pp.to_peptide(
    mdata,
    agg_method="median",            # default
    purity_threshold=0.7,           # for TMT data
    top_n=None,                     # default
    rank_method="median_intensity",  # default
    )

`to_protein()`

The to_protein() function takes:

MuData containing a peptide modality with inferred protein_group and peptide_type columns

and returns:

MuData with a protein-level modality

Protein-level summarization requires the protein_group and peptide_type columns, which are generated by mm.pp.infer_protein() from peptide-level data. Details are provided in the Protein Inference section. Briefly:

protein_group contains the inferred proteins for each peptide.
peptide_type indicates whether a peptide is "unique" or "shared".

Only "unique" peptides are used for protein group intensity aggregation; "shared" peptides are excluded.

As in peptide-level aggregation, protein-group-level PEP and q-values are computed when possible.

The default settings use top_n=3 with ranking by median_intensity, so only the top three peptides per protein group contribute to quantification.

# Infer protein group from mdata (containing peptide modality)
mdata = mm.pp.infer_protein(mdata)

# Summarize peptides to protein group
mdata = mm.pp.to_protein(
    mdata,
    agg_method="median",            # default
    top_n=3,                        # default
    rank_method="median_intensity",  # default
    )
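The quantification rules described in this section (unique peptides only, top three peptides ranked by median intensity, median aggregation) can be sketched in plain Python. The tuple layout, the ";" separator in the shared group name, and the to_protein_sketch function are hypothetical and are not part of the library.

```python
from collections import defaultdict
from statistics import median

# Hypothetical peptide rows: (peptide, protein_group, peptide_type, intensities per sample)
peptides = [
    ("pep_a", "P1", "unique", [10.0, 12.0]),
    ("pep_b", "P1", "unique", [40.0, 44.0]),
    ("pep_c", "P1", "unique", [20.0, 22.0]),
    ("pep_d", "P1", "unique", [30.0, 36.0]),
    ("pep_e", "P1;P2", "shared", [500.0, 500.0]),
    ("pep_f", "P2", "unique", [7.0, 9.0]),
]


def to_protein_sketch(rows, top_n=3):
    # Collect intensities of unique peptides per protein group; shared peptides are skipped
    grouped = defaultdict(list)
    for _, group, peptide_type, values in rows:
        if peptide_type == "unique":
            grouped[group].append(values)
    result = {}
    for group, values in grouped.items():
        # Rank peptides by median intensity across samples and keep the top N
        top = sorted(values, key=median, reverse=True)[:top_n]
        # Per-sample median over the selected peptides
        result[group] = [median(column) for column in zip(*top)]
    return result


protein_quant = to_protein_sketch(peptides)
print(protein_quant)  # {'P1': [30.0, 36.0], 'P2': [7.0, 9.0]}
```

The shared peptide pep_e has by far the highest intensity but does not influence either protein group, and pep_a is dropped from P1 because it ranks fourth.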

`to_ptm()`

To summarize modified peptides into post-translational modification (PTM) sites, to_ptm() uses the subset of peptides that contain the specified modification and then performs several steps to assign PTM positions at the protein level.

Internally, the function performs:

Filtering the data to keep only modified peptides that match modi_identifier
Extracting modified sites from peptide
Assigning peptide-level site labels
Exploding peptides to single proteins for per-protein site labeling
Mapping the site to the corresponding position in each protein
Merging single-protein results back into protein groups
Grouping by modified peptide and peptide-site combination
Merging site metadata with peptide-level quantification
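The site extraction and position mapping steps above can be illustrated with a short sketch. It assumes modifications are written as bracketed mass tags after the modified residue (as in "[+79.9663]") and that a peptide maps to the first exact match in each protein sequence. The helper names, the site label format (residue plus 1-based position), and the toy sequences are hypothetical.

```python
def extract_sites(modified_peptide, modification):
    """Return the bare sequence and 0-based offsets of residues carrying the modification."""
    residues = []
    offsets = []
    i = 0
    while i < len(modified_peptide):
        if modified_peptide[i] == "[":
            close = modified_peptide.index("]", i)
            tag = modified_peptide[i:close + 1]
            # Record the residue just before the tag; other modifications are only stripped
            if tag == modification and residues:
                offsets.append(len(residues) - 1)
            i = close + 1
        else:
            residues.append(modified_peptide[i])
            i += 1
    return "".join(residues), offsets


def map_sites(modified_peptide, modification, protein_ids, fasta):
    """Map peptide-level sites to 1-based positions in each protein of the group."""
    sequence, offsets = extract_sites(modified_peptide, modification)
    labels = {}
    for protein_id in protein_ids:
        start = fasta[protein_id].find(sequence)  # first occurrence only
        if start == -1 or not offsets:
            continue
        labels[protein_id] = [f"{sequence[o]}{start + o + 1}" for o in offsets]
    return labels


# Toy protein sequences standing in for an attached FASTA file
fasta = {"P1": "MAGSPEKTLSDEYR", "P2": "MKTLSDEYRAA"}
sites = map_sites("TLS[+79.9663]DEY[+79.9663]R", "[+79.9663]", ["P1", "P2"], fasta)
print(sites)  # {'P1': ['S10', 'Y13'], 'P2': ['S5', 'Y8']}
```

The same modified peptide yields different site positions in each protein, which is why peptides are first handled per protein and only then merged back into protein groups.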

The to_ptm() function takes:

MuData containing a peptide modality and an attached FASTA file

and returns:

MuData with a ptm_site-level modality

A FASTA file is required because PTM sites must be mapped to protein-sequence coordinates. The FASTA file can be attached using mm.utils.attach_fasta().

The modi_name argument determines the modality name (e.g., "phospho" -> "phospho_site"), and the modification string (e.g., "[+79.9663]") is used to identify modified peptides.

agg_method accepts the same aggregation methods as the other summarization functions (e.g., median, mean, or sum).

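# Attach the FASTA file to the existing MuData (passing mdata as the first argument is assumed)
mdata = mm.utils.attach_fasta(mdata, "fasta/file/path.fasta")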

mdata = mm.pp.to_ptm(
    mdata,
    modi_name="phospho",
    modification="[+79.9663]",
    agg_method="median",        # default
    top_n=None                  # default
    )