Summarization
Overview
The term Summarization refers to aggregating identification features and quantitative values as data move from one hierarchical level to the next (i.e., PSM/precursor -> peptide -> protein).
Summarization functions are provided as to_* methods, such as to_peptide, to_protein, and to_ptm.
The summarization process generally involves:
- Feature selection
  Selecting features to include in the aggregation based on criteria such as peptide type (unique/shared), precursor isolation purity, or abundance.
- Intensity aggregation
  Aggregating quantification values using methods such as median, mean, or sum.
- Computing identification confidence scores (PEP, q-value) at the new level when possible
  Calculating PEP and q-values for the aggregated features using appropriate methods.
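The three steps can be illustrated with a minimal, library-independent sketch. The record layout, the summarize helper, and the min_intensity filter are hypothetical and only stand in for the general idea; they are not the package's implementation.

```python
from collections import defaultdict
from statistics import median

# Hypothetical PSM records: (peptide, intensity, PEP). Values are made up.
psms = [
    ("PEPTIDEK", 100.0, 0.010),
    ("PEPTIDEK", 300.0, 0.002),
    ("PEPTIDEK", 200.0, 0.050),
    ("ANOTHERR", 50.0, 0.200),
]


def summarize(records, min_intensity=0.0):
    """Aggregate PSM-level records to the peptide level (illustrative only)."""
    grouped = defaultdict(list)
    for peptide, intensity, pep in records:
        grouped[peptide].append((intensity, pep))

    summary = {}
    for peptide, rows in grouped.items():
        # 1. Feature selection: keep features passing an abundance filter.
        selected = [i for i, _ in rows if i > min_intensity]
        # 2. Intensity aggregation: median of the selected features.
        intensity = median(selected) if selected else None
        # 3. Identification confidence: best (lowest) PEP among all PSMs.
        best_pep = min(p for _, p in rows)
        summary[peptide] = {"intensity": intensity, "pep": best_pep}
    return summary


peptide_level = summarize(psms)
```

Here the three PEPTIDEK PSMs collapse into one peptide-level entry carrying the median intensity and the lowest PEP.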
to_peptide()
The to_peptide() function takes:
MuData containing psm level modality
and returns:
MuData with peptide level modality
This step aggregates PSMs and their quantification values by peptide (non-redundant modified peptide).
Peptide-level PEP is calculated with the best_pep method by default, and peptide-level q-values are computed using a conservative approach when decoy information is available.
For quantification aggregation, the default method is median. An optional top_n argument restricts aggregation to the top N (e.g., top 3) features within each peptide. Features are ranked by median_intensity unless specified otherwise.
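For intuition, q-values can be derived from decoy labels roughly as follows. This is a generic target-decoy sketch, not necessarily the estimator the package uses: the target_decoy_qvalues helper is hypothetical, and the "+1" decoy correction is just one common way to make the estimate conservative.

```python
def target_decoy_qvalues(peps, is_decoy):
    """Return q-values (in input order) from PEPs and decoy flags."""
    # Rank features from most to least confident (lowest PEP first).
    order = sorted(range(len(peps)), key=lambda i: peps[i])

    fdr = []
    n_target = n_decoy = 0
    for i in order:
        if is_decoy[i]:
            n_decoy += 1
        else:
            n_target += 1
        # Assumed conservative estimate: add one to the decoy count.
        fdr.append(min(1.0, (n_decoy + 1) / max(n_target, 1)))

    # q-value: the lowest FDR at which a feature would still be accepted,
    # i.e. a running minimum taken from the least confident feature upward.
    qvalues = [0.0] * len(peps)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr[rank])
        qvalues[order[rank]] = running_min
    return qvalues


# Made-up peptide-level PEPs and decoy flags.
peps = [0.001, 0.002, 0.01, 0.2, 0.5]
is_decoy = [False, False, False, True, False]
qvalues = target_decoy_qvalues(peps, is_decoy)
```

The resulting q-values never decrease as PEP increases, which is what makes them usable as an acceptance threshold.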
In TMT studies, PSMs with low precursor isolation purity may be excluded prior to quantification aggregation to remove spectra with low quantitative accuracy. Precursor isolation purity should be computed with mm.pp.compute_precursor_isolation_purity() before calling to_peptide(). A purity_threshold (commonly 0.7) can be applied during aggregation.
Note that filtering by top_n or purity_threshold affects quantification aggregation only and does not modify identification feature aggregation.
mdata = mm.pp.to_peptide(
    mdata,
    agg_method="median",  # default
    purity_threshold=0.7,  # for TMT data
    top_n=None,  # default
    rank_method="median_intensity",  # default
)
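The following sketch illustrates why purity_threshold and top_n change only the quantification result. The aggregate_peptide helper and the data are hypothetical; the ">=" purity comparison is an assumption, and with a single sample the ranking simply reduces to ranking by intensity.

```python
from statistics import median

# Hypothetical PSMs of one peptide: (intensity, purity, PEP). Values are made up.
psms = [
    (400.0, 0.95, 0.020),
    (300.0, 0.50, 0.001),  # low purity: excluded from quantification only
    (200.0, 0.80, 0.010),
    (100.0, 0.90, 0.030),
]


def aggregate_peptide(rows, purity_threshold=None, top_n=None):
    """Aggregate the PSMs of one peptide (illustrative only)."""
    # Identification uses every PSM, regardless of the quantification filters.
    best_pep = min(pep for _, _, pep in rows)

    # Quantification: apply the purity filter, then keep the top N intensities.
    quant = [
        intensity
        for intensity, purity, _ in rows
        if purity_threshold is None or purity >= purity_threshold
    ]
    quant.sort(reverse=True)
    if top_n is not None:
        quant = quant[:top_n]

    return {
        "intensity": median(quant) if quant else None,
        "pep": best_pep,
        "n_psm": len(rows),
    }


unfiltered = aggregate_peptide(psms)
filtered = aggregate_peptide(psms, purity_threshold=0.7, top_n=2)
```

In this toy case the filtered intensity differs from the unfiltered one, while the best PEP still comes from the low-purity PSM and the PSM count is unchanged.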
to_protein()
The to_protein() function takes:
MuData containing peptide modality with inferred protein_group and peptide_type
and returns:
MuData with protein level modality
Protein-level summarization requires the protein_group and peptide_type columns, which are generated by mm.pp.infer_protein() from peptide-level data.
Details are provided in the Protein Inference section. Briefly:
- protein_group contains the inferred proteins for each peptide.
- peptide_type indicates whether a peptide is "unique" or "shared".
Only "unique" peptides are used for protein group intensity aggregation; "shared" peptides are excluded.
As in peptide-level aggregation, protein group level PEP and q-value are computed when possible.
The default settings use top_n=3 with ranking by median_intensity, so only the top three peptides per protein group contribute to quantification.
# Infer protein group from mdata (containing peptide modality)
mdata = mm.pp.infer_protein(mdata)
# Summarize peptides to protein group
mdata = mm.pp.to_protein(
    mdata,
    agg_method="median",  # default
    top_n=3,  # default
    rank_method="median_intensity",  # default
)
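To make the default behaviour concrete, the sketch below mimics unique-peptide selection followed by top-3 ranking on a toy table with two samples. The table layout and the summarize_protein helper are hypothetical, and the ranking (median across samples, highest first) is one reading of median_intensity.

```python
from statistics import median

# Hypothetical peptide table:
# peptide -> (protein_group, peptide_type, intensities per sample)
peptides = {
    "pepA": ("P1", "unique", [10.0, 12.0]),
    "pepB": ("P1", "unique", [40.0, 44.0]),
    "pepC": ("P1", "unique", [20.0, 26.0]),
    "pepD": ("P1", "unique", [30.0, 34.0]),
    "pepE": ("P1;P2", "shared", [90.0, 95.0]),
}


def summarize_protein(table, protein_group, top_n=3):
    """Return per-sample intensities for one protein group (illustrative only)."""
    # Keep only unique peptides assigned to this protein group.
    rows = [
        values
        for group, peptide_type, values in table.values()
        if group == protein_group and peptide_type == "unique"
    ]
    # Rank peptides by their median intensity across samples, highest first.
    rows.sort(key=median, reverse=True)
    rows = rows[:top_n]
    # Aggregate per sample: median over the retained peptides in each column.
    return [median(column) for column in zip(*rows)]


p1_intensity = summarize_protein(peptides, "P1")
```

The shared peptide pepE is ignored despite its high intensity, and pepA drops out because it ranks fourth among the unique peptides.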
to_ptm()
To summarize modified peptides into post-translational modification (PTM) sites, to_ptm() uses the subset of peptides that contain the specified modification and then performs several steps to assign PTM positions at the protein level.
Internally, the function performs:
- Filtering the data to keep only modified peptides that match modi_identifier
- Extracting modified sites from each peptide
- Assigning peptide-level site labels
- Exploding peptides to single proteins for per-protein site labeling
- Mapping the site to the corresponding position in each protein
- Merging single-protein results back into protein groups
- Grouping by modified peptide and peptide-site combination
- Merging site metadata with peptide-level quantification
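The site-mapping steps above can be sketched in plain Python. Everything here is an assumption made for illustration: the map_sites helper, the convention that a modification tag directly follows the modified residue, the ";" separator for protein groups, and the toy sequences. Only the first occurrence of the peptide in each protein is considered.

```python
def map_sites(modified_peptide, protein_seq, modification="[+79.9663]"):
    """Return protein-level site labels such as 'S4' (illustrative only)."""
    residues = []     # unmodified peptide sequence
    mod_offsets = []  # 1-based positions of modified residues in the peptide
    i = 0
    while i < len(modified_peptide):
        if modified_peptide.startswith(modification, i):
            if residues:  # the tag refers to the residue just before it
                mod_offsets.append(len(residues))
            i += len(modification)
        elif modified_peptide[i] == "[":
            # Skip any other bracketed modification.
            i = modified_peptide.index("]", i) + 1
        else:
            residues.append(modified_peptide[i])
            i += 1

    peptide = "".join(residues)
    start = protein_seq.find(peptide)  # 0-based start, -1 if not found
    if start < 0:
        return []
    # Peptide offset + protein start gives the 1-based protein position.
    return [f"{peptide[o - 1]}{start + o}" for o in mod_offsets]


# Toy FASTA content and a peptide assigned to a two-protein group.
fasta = {"P1": "MAGSPEPTIDEKR", "P2": "GSPEPTIDEKAA"}
protein_group = "P1;P2"
peptide = "GS[+79.9663]PEPT[+79.9663]IDEK"

# Explode the group into single proteins and label the sites per protein.
sites = {acc: map_sites(peptide, fasta[acc]) for acc in protein_group.split(";")}
```

The same peptide yields different site positions in each protein, which is why sites are labeled per protein before the results are merged back into protein groups.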
The to_ptm() function takes:
MuData containing peptide modality and attached FASTA file
and returns:
MuData with ptm_site level modality
A FASTA file is required because PTM sites must be mapped to protein-sequence coordinates. The FASTA file can be attached using mm.utils.attach_fasta().
The argument modi_name determines the modality name (e.g., "phospho" -> "phospho_site"), and the modification string is used to identify modified peptides.
agg_method accepts the same aggregation methods as the other summarization functions.
mdata = mm.utils.attach_fasta("fasta/file/path.fasta")
mdata = mm.pp.to_ptm(
    mdata,
    modi_name="phospho",
    modification="[+79.9663]",
    agg_method="median",  # default
    top_n=None,  # default
)