Redesigning error control in cross-linking mass spectrometry enables more robust and sensitive protein-protein interaction studies
Mol Syst Biol
(2024)
21: 90 - 106
Abstract
Cross-linking mass spectrometry (XL-MS) allows characterizing protein-protein interactions (PPIs) in native biological systems by capturing cross-links between different proteins (inter-links). However, inter-link identification remains challenging, requiring dedicated data filtering schemes and thorough error control. Here, we benchmark existing data filtering schemes combined with error rate estimation strategies utilizing concatenated target-decoy protein sequence databases. These workflows show shortcomings either in sensitivity (many false negatives) or specificity (many false positives). To ameliorate the limited sensitivity without compromising specificity, we develop an alternative target-decoy search strategy using fused target-decoy databases. Furthermore, we devise a different data filtering scheme that takes the inter-link context of the XL-MS dataset into account. Combining both approaches maintains low error rates and minimizes false negatives, as we show by mathematical simulations, analysis of experimental ground-truth data, and application to various biological datasets. In human cells, inter-link identifications increase by 75% and we confirm their structural accuracy through proteome-wide comparisons to AlphaFold2-derived models. Taken together, target-decoy fusion and context-sensitive data filtering deepen and fine-tune XL-MS-based interactomics.
Synopsis
The authors introduce a fused target-decoy strategy and a context-sensitive data subgrouping scheme for cross-linking MS. This strategy allows for the identification of inter-protein links with higher sensitivity while maintaining low error rates.
• Fused target-decoy databases are introduced as an alternative target-decoy search strategy for cross-linking MS.
• A data subgrouping strategy that is sensitive to the context of the cross-linking MS dataset was developed.
• By combining context-sensitive subgrouping and target-decoy fusion, this approach achieves up to 75% increase of identified inter-protein connections from human, mitochondrial and viral datasets.
Introduction
Cross-linking mass spectrometry (XL-MS) can reveal protein-protein interactions (PPIs) and the structural information of their binding interfaces in native biological systems (Liu et al, 2017; O’Reilly and Rappsilber, 2018). In XL-MS, protein contacts are captured using a cross-linker, a small organic molecule composed of a spacer arm and two reactive groups that typically target specific amino acid side chains, and are thereafter identified by mass spectrometry. Such experiments can yield intra-links (cross-links between residues within the same protein sequence), inter-links (cross-links between residues in different protein sequences), and mono-links (peptides that are modified by a partially hydrolyzed cross-linker). Only inter-links give insight into the structural configuration of PPIs.
The quality of XL-MS datasets critically depends on a balance between maximizing sensitivity (minimizing false negatives) and specificity (minimizing false positives). The same challenge is prevalent in standard bottom-up proteomics and is primarily addressed by estimating false-discovery rates (FDR) using a concatenated target-decoy search strategy (Elias and Gygi, 2007). In this approach, mass spectra are searched against a database consisting of the target protein sequences and a concatenated list of reversed, shuffled, or randomized versions of them, called decoys (Elias and Gygi, 2007). Any spectral matches to the decoys are assumed to mimic false positives, which allows statistical control of error rates via the FDR. FDR filtering can be applied separately for different subsets of spectral matches when the error rates between the subgroups are reasonably different. Such or similar strategies are used, for example, for modified (Marx et al, 2013; Fu and Qian, 2014), miscleaved, or differently sized peptides in shotgun proteomics (Cox and Mann, 2008).
The XL-MS field has widely adopted the concatenated target-decoy strategy for FDR filtering (Yang et al, 2012; Liu et al, 2015; Maiolica et al, 2007; Walzthoeni et al, 2012; Lenz et al, 2021; Fischer and Rappsilber, 2017; Crowder et al, 2023; Zhou et al, 2023). Initial FDR filtering approaches merged inter-links and intra-links into one category (inter-intra combined FDR). Nowadays, separate FDR filtering for intra- and inter-links (inter-intra separate) has become the prevailing strategy because inter-links have a higher error probability (Walzthoeni et al, 2012; Lenz et al, 2021). This subgroup-specific FDR filtering enables more accurate FDR estimation for inter-links and PPIs. However, to what extent this approach impacts sensitivity in identifying inter-links has not been addressed.
Additional subgrouping strategies for FDR filtering proposed in recent studies increase the sensitivity toward inter-links by incorporating different levels of information present in the XL-MS dataset (e.g., about the specific cross-link, the individual protein, and the PPI). For instance, subgrouping of inter-links based on whether the connected proteins are additionally supported by intra-links or mono-links was shown to reduce the frequency of decoys when searching a concatenated target-decoy database (Chen et al, 2022; Sailer et al, 2022). Related concepts are employed by recently developed XL-MS search engines that enable more sensitive inter-link identification owing to aggregate scoring functions that integrate protein, PPI and cross-link level information (Crowder et al, 2023; Zhou et al, 2023). While such contextual information appears to be highly useful, it has not been systematically assessed whether the commonly used FDR control strategies in XL-MS (i.e., target-decoy concatenation) are sufficiently robust when performing context-sensitive strategies.
Here, we find that concatenated target-decoy searches can drastically underestimate the FDR when applying context-sensitive filtering. Instead, we demonstrate that the theoretical error is better approximated by a target-decoy fusion strategy. We leverage this observation to devise alternative inter-link subgrouping strategies that boost inter-link identifications by 29–76% in deep XL-MS datasets while simultaneously maintaining low error rates and increasing coverage of AlphaFold2 (AF2)-Multimer models across the proteome.
Results
FDR filtering with traditional data grouping schemes limits either the specificity or sensitivity of inter-link identification
In order to evaluate the sensitivity and specificity of different data grouping schemes used for FDR filtering, we made use of our recently developed XL-MS benchmarking standard. This dataset consists of 256 purified proteins that have been mixed and cross-linked according to a defined scheme (Clasen et al, 2024) prescribing which inter-protein connections will occur and which ones cannot occur (Fig. 1A). Following database search, we can accordingly assign cross-link (that is, unique residue pair) identifications as true or false, and assess the proportion of false-positive and false-negative identifications at any given target-decoy based FDR cutoff (Fig. 1B,C).
We initially tested the two most established strategies in XL-MS: a concatenated target-decoy FDR filtering strategy considering intra-links and inter-links either as one combined (intra-inter combined) or two separate (intra-inter separate) groups (Fig. 1D). In accordance with previous reports (Walzthoeni et al, 2012; Lenz et al, 2021), we find that the combined strategy fails to appropriately control error among inter-links, leading to many false-positive identifications (Fig. 1E). While this problem can be adequately controlled by the intra-inter separate approach, this strategy results in a considerable loss of sensitivity signified by an increase in false-negative identifications (Fig. 1F). For example, approximately half of the available true matches remain unidentified at 1% FDR. These shortcomings of the traditional data grouping schemes in XL-MS highlight the need for a workflow that improves sensitivity toward inter-links while retaining accurate error estimates.
Context-sensitive inter-link subgrouping schemes are not fully compatible with concatenated target-decoy searches
Previous work in shotgun proteomics found that peptide identifications can be increased by subgrouping strategies (Marx et al, 2013; Cox and Mann, 2008; Fu and Qian, 2014). This concept is based on the assumption that different subsets of matches have different error probabilities. Adapted to XL-MS, subgroups may be defined based on various properties of the detected inter-linked peptides, including peptide length, charge state, missed cleavages or the presence of other cross-links supporting the inter-link. We tested the latter possibility by separating inter-links depending on the extent to which other cross-links support the existence of the protein or PPI (context-rich vs context-poor inter-links). We evaluated two definitions of context-richness for their suitability as subgrouping criteria: the first depends on the presence of contextual intra-links (“intra-dependent”) and has been proposed previously in a similar manner (Chen et al, 2022; Sailer et al, 2022); the second depends on the presence of contextual inter-links (“inter-dependent”) and was devised in this study (Fig. 2A). In both subgrouping procedures, inter-links are separated into the two subgroups prior to FDR analysis, which is then performed separately on each subset before aggregating them into a combined list (see “Methods”).
We tested intra-dependent and inter-dependent subgrouping combined with concatenated target-decoy searches on our XL-MS benchmarking data (Fig. 2B,C). Compared to the established intra-inter separate strategy, our inter-dependent subgrouping yields approximately the same fraction of false positives while substantially lowering the fraction of false negatives. Thus, the inter-dependent strategy appears to be an effective way to increase identification sensitivity without compromising specificity. An increase in sensitivity (i.e., reduction of false negatives) is also observed upon intra-dependent subgrouping (Fig. 2C). However, intra-dependent subgrouping leads to a massive increase in false positives, suggesting that error rate cannot be effectively controlled when using this subgrouping approach together with concatenated target-decoy databases.
To assess whether these observations hold true in a real-world scenario, we analyzed a deep XL-MS dataset from HEK293T cells that was recently generated in our lab (Clasen et al, 2024). This dataset was created by cross-linking with the enrichable cross-linker DSBSO (Wheat et al, 2021; Matzinger et al, 2020) (“HEK293T”), which yielded 12,216 target and 2003 decoy inter-links when no FDR cutoff was applied. When intra-dependent or inter-dependent subgrouping was applied, we observed divergent score distributions between context-rich and context-poor matches (Fig. EV1A,B), which was more pronounced when inter-dependent grouping was applied. This reinforces our initial assumption of different error probabilities between context-poor and context-rich subgroups and suggests that inter-link identification in the HEK293T dataset can benefit from context-sensitive subgrouping.
We first separated inter-links according to the grouping criteria and assessed the overall fraction of targets and decoys prior to imposing any FDR cutoff (Fig. 3A). We observed that both the inter-dependent and the intra-dependent subgrouping strategies result in very few decoys among context-rich subgroups. For inter-dependent subgrouping, this observation agrees with the results from our ground-truth dataset. However, for intra-dependent subgrouping, the complete lack of decoys among the context-rich inter-links is in stark contrast to the explosion of false positives we observed in the benchmarking dataset. Taken at face value, FDR filtering with the concatenated target-decoy database would add more than 10,000 inter-links to the identifications without accepting any apparent false match. This inconsistency with our benchmarking dataset indicates that combining intra-dependent subgrouping and concatenated target-decoy searches might fail to recognize the false positives among the context-rich inter-links.
Since the true number of false positives in the HEK293T data cannot be known, we simulated the theoretically expected distribution of true and false positives in the context-rich and context-poor subgroups based on a set of simple assumptions and controlled parameters (see “Methods”). We assume that wrong matches are randomly assigned to any protein in the database, while true matches have the tendency to frequently match to a restricted set of proteins. Following the placement of correct and incorrect matches on the proteins, we grouped context-poor and context-rich inter-links and evaluated the fraction of correct and incorrect matches in these subgroups. This is advantageous, as a simulation allows direct counting of the number of correct and incorrect matches (Fig. EV2). For intra-dependent subgrouping, the simulation gives ~1300 false positives representing ~11.9% of the identifications in the context-rich subgroup (Fig. 3B), which is in stark contrast to the experimentally observed numbers of decoys in this subgroup (see Fig. 3A). Interestingly, the simulated distribution of false positives for the inter-dependent strategy closely resembles the distribution of decoys. Altogether, this suggests that the error rate can be properly controlled for inter-dependent subgrouping, but not for intra-dependent subgrouping.
The problem with intra-dependent subgrouping is that a cross-link group with a low error rate (intra-links) is used to categorize a cross-link group with a substantially higher error rate (inter-links). This becomes intuitively apparent when considering how intra-dependent filtering is impacted by high numbers of intra-links in settings of restricted database size: when more and more intra-links are identified, many (if not most) proteins in a database will eventually contain an intra-link. Randomly assigned false-positive inter-links will then more frequently involve proteins supported by these intra-links (Fig. 3C). This issue is avoided in inter-dependent subgrouping because this strategy categorizes inter-links based on information from the same cross-link group. Inter-dependent subgrouping builds on the fact that inter-links belonging to PPIs that are additionally supported by different inter-links are very unlikely to occur by chance.
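The argument above can be made concrete with a back-of-the-envelope calculation. This is only a sketch under simplifying assumptions that are not part of the original analysis: false inter-links choose both proteins uniformly and independently, a fraction f of database proteins carries at least one intra-link, and m false inter-links are spread over a database of N proteins. The numbers m = 2003 (decoy inter-links) and N = 4860 (target entries) are borrowed from the HEK293T dataset purely to give a sense of scale.

```latex
% Intra-dependent subgrouping: chance that a random (false) inter-link
% lands in the context-rich subgroup, i.e. both proteins carry an intra-link.
P(\text{context-rich} \mid \text{false}) \approx f^{2},
\qquad \text{e.g. } f = 0.8 \;\Rightarrow\; f^{2} = 0.64 .

% Inter-dependent subgrouping: expected number of pairs of false inter-links
% that coincide on the same protein pair, with m false links spread
% uniformly over the \binom{N}{2} possible protein pairs.
E[\text{coinciding pairs}]
  \approx \binom{m}{2} \Big/ \binom{N}{2}
  = \frac{m(m-1)}{N(N-1)}
  = \frac{2003 \cdot 2002}{4860 \cdot 4859}
  \approx 0.17 .
```

Under these assumptions, a saturated intra-link set pushes most random inter-links into the intra-dependent context-rich subgroup, whereas coincidental support by a second random inter-link remains rare and becomes rarer as N grows.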
Target-decoy fusion enables accurate error control upon context-sensitive inter-link subgrouping
Based on the explosion of false-positive identifications in the benchmarking and simulated datasets, we reason that the robustness of target-decoy concatenation strategies can be compromised whenever higher-level information (e.g., proteins supported by intra-links) is used to classify lower-level matches (e.g., specific inter-links). Integration of different information levels can also occur in conventional bottom-up proteomics and has been addressed by the introduction of fused target-decoy databases (Savitski et al, 2015; Zhang et al, 2012). In this setup, each decoy entry is fused to its respective target entry, creating one fused protein sequence (Fig. 3D). In an ungrouped setting, the distribution of target and decoy hits is not different when using a concatenated or fused decoy database. However, upon context-sensitive subgrouping, fused databases ensure that once a target protein is assigned to a context-rich subgroup its corresponding decoy part is automatically assigned to the same subgroup.
We hypothesized that a fused decoy strategy will also be beneficial for context-sensitive subgrouping in XL-MS. Repeating our HEK293T dataset analysis with this strategy resulted in a realistic decoy count for all subgrouping strategies including the intra-dependent approach (Fig. 3E). Similarly, reanalysis of our benchmarking data shows that target-decoy fusion can reduce the false-positive identifications for intra-dependent subgrouping to similar levels as those observed for inter-dependent subgrouping and the inter-intra-separate strategy (Fig. 4A). Comparing all strategies, we find that inter-dependent subgrouping is best suited for minimizing false positives (Fig. 4A) and false negatives (Fig. 4B).
Combining inter-dependent subgrouping and target-decoy fusion boosts interactome coverage in various biological systems
Next, we sought to explore the benefits of context-sensitive subgrouping and fused target-decoy searches when studying biologically relevant interactomes. To this end, we analyzed interactome coverage in intact human cells using the HEK293T dataset described above, intact human mitochondria (“mito”) (Zhu et al, 2024), and intact herpesviral particles (“virion”) (Bogdanow et al, 2023). We first used the simulation approach described above to confirm the utility of target-decoy fusion databases and context-sensitive subgrouping also for the mito and virion datasets (Fig. EV3A–G). In all cases, only target-decoy fusion accurately reflects simulated error rates. Context-rich subgroups, particularly the inter-dependent subgroup, display low error rates in all systems, which also holds true at the level of cross-link spectrum matches (Fig. EV3H–J). The inter-dependent context-rich subgroup further contained relatively few unique proteins overall, suggesting that the majority of the cross-links within this subgroup match to a restricted set of confident proteins (Fig. EV4A–C). As discussed above, this can be explained by the low probability of two different random (i.e., false) inter-links coincidentally matching to the same two protein sequences, which becomes exceedingly rare as the search space expands.
To assess intra-dependent and inter-dependent subgrouping for their potential to increase sensitivity (and thereby interactome coverage), we counted the number of identified inter-links that could be assigned at any FDR. We compared these results to those obtained when grouping all inter-links together (intra-inter-separate). In all tested datasets, we observed a pronounced increase in statistical power when inter-dependent grouping is applied (Fig. 4C and EV5A,B). The improvement was smaller (for “mito” and “virion” datasets) or non-existent (for “HEK293T”) upon intra-dependent grouping. In all cases, inter-dependent grouping performed best, particularly for stringent FDR cutoffs (29–76% increase in sensitivity at FDR = 1%) (Fig. 4D). These data show that context-sensitive subgrouping on a fused decoy database is a universally applicable strategy to increase sensitivity while maintaining accurate FDR estimates, whereby inter-dependent subgrouping increases sensitivity more robustly and effectively than intra-dependent subgrouping.
Increased coverage of PPI contact sites in proteome-wide AF2 models
Finally, we assessed the utility of our optimized inter-link identification strategy for structural biology. In the HEK293T dataset, compared to the standard inter-intra separate strategy, our inter-dependent filtering strategy increased inter-link counts for 38% of PPIs (Fig. 5A). We then mapped inter-links from both strategies onto proteome-wide predictions of complexes by AF2-Multimer (Jumper et al, 2021). While both strategies resulted in a similar fraction of inter-links satisfying the computational model, we find that inter-dependent grouping increased the number of inter-links in models of good (pDockQ > 0.5), acceptable (0.23 < pDockQ < 0.5) and poor (pDockQ < 0.23) quality (Fig. 5B). Focusing only on models of acceptable and good docking quality, we observed particularly robust increases of coverage in regions within the models that are likely disordered (pLDDT < 50) (Fig. 5C,D), indicating that increased coverage is particularly prominent for regions of higher structural complexity. Overall, compared to the standard approach, inter-dependent subgrouping resulted in a similar distribution of Cα-Cα distances in good, moderately and poorly satisfied models (Fig. 5E). The newly identified inter-links provide additional supporting structural information for AF2-Multimer models, such as the SMC2-SMC4 dimer (Fig. 5F, 10 additional cross-links), and for large cryoEM structures, such as the proteasome (Fig. 5G,H, 17 additional cross-links). Thus, inter-dependent subgrouping increases the coverage of PPI contact sites, providing additional cross-linking data to support computational and experimental structural models across the proteome.
Discussion
In this study, we benchmarked existing FDR filtering strategies on a ground-truth dataset. Our analysis confirms previous findings that inter-links should be separately analyzed from intra-links to limit the extent of false positives among the biologically more valuable inter-links (Fig. 1E). While effectively controlling error among inter-links, we find that this strategy leads to a critical reduction of sensitivity in inter-link identifications (Fig. 1F) resulting in low coverage of PPI sites. Importantly, we find that this loss can be overcome by subgrouping inter-links that have different likelihoods of being wrong (Fig. 4B).
We grouped inter-links into either a context-rich or a context-poor subgroup based on the existence of other matches supporting the protein or the PPI. To appropriately apply context-sensitive subgrouping, it is essential to ensure that target-decoy symmetry is not violated. With target-decoy concatenated databases, such a violation occurred upon intra-dependent subgrouping (Chen et al, 2022; Sailer et al, 2022), leading to the deceptive case of decoys being absent in the context-rich subgroup (Fig. 3A,B). In intra-dependent subgrouping, the context-rich subgroup consists of all inter-links where the corresponding proteins are additionally matched by intra-links. As intra-links have comparably low error probability (Lenz et al, 2021), they rarely match to decoys. In contrast, inter-links have much higher error probability and are more frequently present on decoys. Thus, while incorrect inter-links on target proteins coincide with correct intra-links to the same target proteins, incorrect inter-links on decoys rarely coincide with intra-links on the same decoy proteins. This leads to an underrepresentation of decoys in context-rich subgroups, violating target-decoy symmetry. To solve this issue, we adapted the decoy fusion database search design, which was first introduced in shotgun proteomics workflows (Savitski et al, 2015; Zhang et al, 2012), and demonstrated its capability to restore target-decoy symmetry and allow better control of error rates (Fig. 3D,E). This is consistent with a similar strategy by the Rappsilber lab to restore error control of the mi-filter (Fischer and Rappsilber, 2024). The use of decoy fusion databases upon context-sensitive subgrouping is critical because it models the error rate more faithfully. Decoy fusion ensures that the target and decoy parts of an entry are placed in the same subgroup, whereas in concatenated strategies they might end up in different subgroups.
While intra-dependent subgrouping on a fused database led only to modest improvements in inter-link detection sensitivity, we observed that our newly proposed subgrouping strategy, inter-dependent subgrouping, almost completely eliminated false negatives in the ground-truth dataset (Fig. 4B) and retained stringent error control (Fig. 4A). Furthermore, we demonstrated substantial improvements in PPI site coverage at stringent FDRs (Fig. 4C,D), allowing the generation of more comprehensive and highly reliable XL-MS-based interactomes. The reasons behind this excellent performance are the divergent error rates (Fig. 3E) and score distributions (Fig. EV1B) of inter-links between context-rich and context-poor subgroups. While the context-poor subgroup contains many “one hit wonders” where inter-links are the sole representative of the respective PPIs, context-rich inter-links consist of protein pairs that are found with different lysine connections. The latter type is unlikely to occur by chance as random cross-links are unlikely to coincidentally connect different lysines of the same protein pair.
XL-MS data contains more levels (e.g., cross-link spectrum match, residue pair, and PPI) and context-based (e.g., inter-link and intra-link) information compared to the data of shotgun proteomics. Because of its inherent and complex structure, we expect that our fused target-decoy database design will augment the emergence of different context-sensitive strategies in XL-MS identification (Crowder et al, 2023; Chen et al, 2022; Zhou et al, 2023). Fused target-decoy strategies may also be used to consider other non-XL-based protein information, such as protein abundance to increase sensitivity. Future work may be directed towards exploring other subgrouping criteria or aggregating scoring functions that integrate multiple levels of information to continue pushing the boundaries of system-wide structural PPI profiling.
Methods
Reagent/resource | Reference or source | Identifier or catalog number |
---|---|---|
Software | | |
R software, v.4.1.2 | | |
AlphaFold multimer v.2.2 | | |
RStudio, v.2024.9.0.375 | | |
PyMOL™ Molecular Graphics System, Version 2.4.0a0 | | |
XlinkX v2.0 | Liu et al, 2017, https://doi.org/10.1038/ncomms15473 | |
Proteome Discoverer v.2.5 | Thermo Fisher Scientific Inc. | |
Reanalysis of datasets
We reanalyzed raw files (.raw) from Clasen et al (2024) containing HCD-MS2 spectra of the HEK293T and ground-truth datasets (ProteomeXchange identifier PXD042173). First, we converted the raw files into peak lists (.mgf files) in Proteome Discoverer v2.5 (Thermo Fisher). The .mgf files were used as input to identify cross-linked peptides with a stand-alone search engine based on XlinkX v2.0 (Liu et al, 2017). The following XlinkX settings were used: MS ion mass tolerance, 10 parts per million (ppm); MS2 ion mass tolerance, 20 ppm; fixed modification, Cys carbamidomethylation; variable modification, Met oxidation; enzymatic digestion, trypsin; allowed number of missed cleavages, 3; cross-linker, DSSO, 158.0038 Da (short arm, 54.0106 Da; long arm, 85.9824 Da); reaction sites, lysine and protein N-termini.
The search was performed using a database containing all 256 mixed, 264 additional entrapment, and 520 reversed decoy proteins. All cross-links matching to a database of 123 potential contaminant entries were removed prior to data analysis. True and false assignments were made at the peptide level to account for the partial homology of proteins in different groups. Matches were assigned as true when in accordance with the mixing scheme and as false when in violation of it. It is important to note that, in principle, a random (incorrect) match may connect two proteins within the same group, in which case it is wrongly assigned as true. However, the frequency of this is negligible, as the within-group protein count (n = 8) is far lower than that of outgroup proteins (n = 512).
Raw files from the HEK293T dataset (downloaded from ProteomeXchange, identifier: PXD042173) containing HCD-MS2 data were converted into peak lists (.mgf) and searched as described above with DSBSO specificity (DSBSO cross-linker, 308.0388276 Da; short arm, 54.01056 Da; long arm, 236.01770 Da; reaction sites, lysine and protein N-termini). MS2 spectra were searched against a concatenated target-decoy database (randomized decoys) generated based on the corresponding proteome determined by bottom-up proteomics, containing 4860 target sequence entries. The table listing target and decoy residue-level cross-links without any FDR threshold was used for subgrouping, for FDR calculations, and as input for simulations. Cross-links refer to unique residue pairs, and the top-scoring cross-link spectrum match (CSM) was taken as the representative for each residue pair.
Target-decoy fusion strategy
In order to evaluate the performance of a fused decoy database search strategy, we analyzed our datasets as if they were searched using a fused database. To this end, we performed the database search in a concatenated design but then replaced the entries containing the gene names for the cross-linked proteins with their target annotations in the cross-link results table. For example, a cross-link matching the proteins “RPL12-RPS6(decoy)” was re-annotated as “RPL12-RPS6”. Thus, all targets and decoys originating from the same .fasta entry are fused into one entry. Importantly, we kept an additional identifier indicating whether the cross-link originated from the decoy or target part of the fused entry.
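The re-annotation step can be illustrated with a minimal Python sketch. The "(decoy)" suffix follows the “RPL12-RPS6(decoy)” example above; the function names and the dictionary layout are hypothetical and are not taken from the original analysis scripts.

```python
DECOY_TAG = "(decoy)"  # assumed suffix, as in the "RPS6(decoy)" example


def fuse_protein(name):
    """Map a protein annotation to its fused entry name and a decoy flag."""
    is_decoy = name.endswith(DECOY_TAG)
    fused = name[: -len(DECOY_TAG)] if is_decoy else name
    return fused, is_decoy


def fuse_crosslink(protein_a, protein_b):
    """Re-annotate one cross-link as if it came from a fused database."""
    fused_a, decoy_a = fuse_protein(protein_a)
    fused_b, decoy_b = fuse_protein(protein_b)
    return {
        "proteins": (fused_a, fused_b),      # used for subgrouping
        "decoy_sides": (decoy_a, decoy_b),   # kept for FDR estimation
        "is_decoy": decoy_a or decoy_b,
    }


example = fuse_crosslink("RPL12", "RPS6(decoy)")
print(example["proteins"], example["is_decoy"])
```

The two proteins are passed separately rather than as one hyphenated string because gene names may themselves contain hyphens.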
Subgrouping and FDR calculations
We defined only two subgroups each for intra- and inter-dependent filtering to ensure that subgroups are sufficiently large to enable target-decoy competition (Fu and Qian, 2014) and to avoid overfitting to a specific dataset. For intra-dependent subgrouping, the context-rich subgroup contains all inter-links where both cross-linked proteins are additionally supported by intra-links, and the context-poor subgroup contains inter-links where at most one of the proteins contained intra-links. For inter-dependent subgrouping, the context-rich subgroup contains the inter-links where each of the proteins in a PPI is supported by at least two inter-linked lysines involving different reactive sites on both ends, and the context-poor subgroup contains all other inter-link matches. Groupings were performed at the level of residue–residue pairs, and all residue–residue identifications without FDR control were used for subgrouping. Following subgrouping, we used previously established approaches for calculating posterior error probabilities (PEP) to estimate the probability distributions for correct and incorrect matches within individual subgroups (Cox and Mann, 2008). Briefly, we made two lists per group: one containing the matches to the decoy part and one containing the matches to the target part. We then generated a histogram for each list by Gaussian kernel smoothing of the score distributions. As the overall score for a residue pair, we used the lower (worse) of the negative decadic logarithmic XlinkX search scores reported for the two sides of the residue pair. The resulting distributions approximate the probability densities for group-dependent correct and incorrect hits, p(s | X = true, G) and p(s | X = false, G), where X signifies whether a match is correct or incorrect, s is the negative decadic logarithmic XlinkX score, and G is the placement of the identification into a context-rich or context-poor subgroup.
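The two subgrouping rules can be sketched in a few lines of Python. This is an illustration, not the original implementation: inter-links are assumed to be tuples of (protein A, site A, protein B, site B) with fused protein names, and the inter-dependent rule is read here as "at least two distinct linked sites on each protein of the PPI", which is one interpretation of the definition above.

```python
from collections import defaultdict


def intra_dependent_groups(inter_links, proteins_with_intra):
    """Context-rich if both linked proteins carry at least one intra-link."""
    return [
        "rich" if pa in proteins_with_intra and pb in proteins_with_intra else "poor"
        for pa, _sa, pb, _sb in inter_links
    ]


def inter_dependent_groups(inter_links):
    """Context-rich if the PPI is linked through >= 2 distinct sites per protein."""
    sites = defaultdict(lambda: defaultdict(set))  # PPI -> protein -> linked sites
    for pa, sa, pb, sb in inter_links:
        ppi = frozenset((pa, pb))
        sites[ppi][pa].add(sa)
        sites[ppi][pb].add(sb)
    labels = []
    for pa, _sa, pb, _sb in inter_links:
        per_protein = sites[frozenset((pa, pb))]
        rich = all(len(s) >= 2 for s in per_protein.values())
        labels.append("rich" if rich else "poor")
    return labels


links = [("A", 10, "B", 5), ("A", 20, "B", 9), ("A", 10, "C", 3)]
print(intra_dependent_groups(links, {"A", "B"}))
print(inter_dependent_groups(links))
```

Both functions operate on all residue pairs before any FDR cutoff, mirroring the order of operations described above.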
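According to Bayes' theorem, we calculated the PEP for each residue pair as given in Eq. (1).
$$\mathrm{PEP}(s, G) = p(X = \mathrm{false} \mid s, G) = \frac{p(s \mid X = \mathrm{false}, G)\, p(X = \mathrm{false} \mid G)}{p(s, G)} \tag{1}$$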
Here, p(s | X = false, G) describes the probability of observing a given score s, considering an incorrect match in subgroup G. p(X = false | G) is the probability of an incorrect match in the subgroup G, calculated as the fraction of decoys in the subgroup, and p(s,G) is the probability of observing the score s in subgroup G, considering both correct and incorrect matches.
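A minimal sketch of this PEP calculation for a single subgroup is given below, using only the Python standard library. Several points are assumptions made for illustration: the kernel bandwidth is a fixed, hypothetical parameter; the prior is taken as decoys divided by all matches in the subgroup; and the denominator is expanded by the law of total probability, with the target list standing in for correct matches.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function obtained by Gaussian kernel smoothing."""
    if not samples:
        raise ValueError("need at least one score per list")
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(s):
        return sum(math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for x in samples) / norm

    return density


def make_pep(target_scores, decoy_scores, bandwidth=1.0):
    """Build PEP(s) for one subgroup G from its target and decoy score lists."""
    # p(X = false | G): fraction of decoys in the subgroup (assumed definition)
    prior_false = len(decoy_scores) / (len(target_scores) + len(decoy_scores))
    f_false = gaussian_kde(decoy_scores, bandwidth)   # ~ p(s | X = false, G)
    f_true = gaussian_kde(target_scores, bandwidth)   # ~ p(s | X = true, G)

    def pep(s):
        numerator = f_false(s) * prior_false
        denominator = numerator + f_true(s) * (1.0 - prior_false)
        # No density support at this score: fall back to the conservative value
        return 1.0 if denominator == 0.0 else numerator / denominator

    return pep


pep_rich = make_pep([8.0, 9.0, 10.0, 11.0, 12.0], [1.0, 2.0, 3.0])
print(pep_rich(2.0), pep_rich(10.0))
```

One `pep` function would be built per subgroup, so that the same score can receive a different PEP in the context-rich and context-poor subgroups.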
The FDR was calculated by combining the PEP values from the different groups and sorting the combined PEPs starting with the best (lowest). Residue pairs were then successively accepted until the desired FDR cutoff was reached according to Eq. (2), where T denotes a target match and D a decoy match on either side of a cross-link.
(2)
The standard inter-intra separate strategy does not require PEP calculation since it does not depend on subgrouping. Instead, all inter-links are assigned to the same group, which was sorted by decreasing search scores. Again, we used the lower (worse) negative decadic logarithmic XlinkX search scores reported for both sides of the residue pair as the overall score for the residue pair. The false-discovery rate for each entry was then calculated based on Eq. (2).
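The acceptance procedure (rank by PEP, then accept until the cutoff) can be sketched as follows. The running estimate used here, accepted decoys divided by accepted targets, is a common target-decoy estimator chosen as a stand-in; it is an assumption and should not be read as the exact form of Eq. (2). Matches are assumed to be (PEP, is_decoy) tuples pooled from all subgroups.

```python
def accept_at_fdr(matches, cutoff=0.01):
    """Return target matches in the longest PEP-ranked prefix with FDR <= cutoff.

    matches: iterable of (pep, is_decoy) tuples pooled across subgroups.
    The decoys/targets ratio is an assumed stand-in estimator.
    """
    ranked = sorted(matches, key=lambda m: m[0])  # best (lowest) PEP first
    targets = decoys = 0
    best_prefix = 0
    for i, (_pep, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= cutoff:
            best_prefix = i
    return [m for m in ranked[:best_prefix] if not m[1]]


pooled = [(0.001, False), (0.002, False), (0.01, False), (0.2, True), (0.3, False)]
print(len(accept_at_fdr(pooled, cutoff=0.01)))
```

Keeping the longest prefix that satisfies the cutoff, rather than stopping at the first violation, is a design choice of this sketch. For the ungrouped inter-intra separate strategy, the same routine could be applied with the sort key replaced by decreasing search score.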
Simulation
Two simulations were performed to estimate the proportion of false positives when employing two different subgrouping strategies (intra- and inter-dependent). In both cases, we simulated a distribution of false and correct cross-links based on parameters as obtained from the datasets (HEK293T, mito and virion). For intra-dependent subgrouping, we first simulated a distribution of correct matches. This was achieved by calculating the number of cross-links per target protein within the respective dataset, followed by fitting a Zipfian (Furusawa and Kaneko, 2003) distribution through the ranked target cross-link counts per protein. A Zipfian power-law distribution was used to model the correct cross-links, reflecting the nature of real-world protein interaction networks. To simulate false matches, we calculated the number of cross-links per decoy protein and fitted a Poisson distribution through the ranked decoy cross-link counts per protein. Using the Poisson distribution assumes that wrong matches occur randomly, without considering the possibility of strong systematic biases towards specific proteins, such as through misidentified modified peptides (Bogdanow et al, 2016). The fit parameters of both models were utilized to create probability distributions for true and false matches. Then we placed correct intra-links (corresponding in number to the target intra-link count), correct inter-links (corresponding in number to the target inter-link count) and incorrect inter-links (corresponding in number to the decoy inter-link count) according to the protein-dependent match probabilities for correct and false matches onto all proteins considered in the database search. Subsequently, the fraction of incorrect inter-links was evaluated in the subgroups of inter-links where (i) both linked proteins contained at least one intra-link and (ii) all others.
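A strongly simplified Python sketch of the intra-dependent simulation is given below. It does not fit any distribution to data: the Zipf exponent, the protein and link counts and all function names are placeholder assumptions, and incorrect matches are placed uniformly at random as a stand-in for the fitted Poisson model.

```python
import random


def zipf_weights(n, exponent):
    """Unnormalised Zipfian weights 1 / rank**exponent for ranks 1..n."""
    return [1.0 / (rank ** exponent) for rank in range(1, n + 1)]


def draw_pair(rng, proteins, weights):
    """Draw two different proteins according to `weights`."""
    while True:
        a, b = rng.choices(proteins, weights=weights, k=2)
        if a != b:
            return a, b


def simulate(n_proteins, n_intra, n_inter_correct, n_inter_false,
             exponent=1.0, seed=0):
    """Return the fraction of incorrect inter-links per subgroup."""
    rng = random.Random(seed)
    proteins = list(range(n_proteins))
    w_true = zipf_weights(n_proteins, exponent)   # correct matches: power law
    w_false = [1.0] * n_proteins                  # incorrect matches: uniform (assumed)
    intra = set(rng.choices(proteins, weights=w_true, k=n_intra))
    links = [(draw_pair(rng, proteins, w_true), True) for _ in range(n_inter_correct)]
    links += [(draw_pair(rng, proteins, w_false), False) for _ in range(n_inter_false)]
    tallies = {"rich": [0, 0], "poor": [0, 0]}    # [incorrect, total]
    for (a, b), correct in links:
        group = "rich" if a in intra and b in intra else "poor"
        tallies[group][1] += 1
        if not correct:
            tallies[group][0] += 1
    return {g: (bad / total if total else float("nan"))
            for g, (bad, total) in tallies.items()}
```

Under these assumptions, correct links concentrate on frequently linked proteins that also carry intra-links, so the context-rich subgroup is expected to contain a lower fraction of incorrect inter-links than the context-poor one.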
For simulating the frequency of false matches upon inter-dependent subgrouping, the simulation was adapted as follows: (i) intra-links were not considered, and (ii) inter-link matches were split into two parts, reflecting the possibility that each of the two cross-linked peptides could be wrongly matched (target-target, decoy-target, decoy-decoy). Again, a Poisson distribution was fitted to the empirically observed counts of decoy matches for all cross-links per PPI involving the most frequently linked decoy protein. To simulate correct matches, a Zipfian distribution was fitted for all cross-links per PPI involving the most frequently linked target protein. Incorrect and correct matches were then deposited on both sides of the inter-link according to the resulting probabilities for all possible entries in the database (as described in the previous paragraph). The frequency of incorrect matches was evaluated in the subgroups of inter-links where (i) both of the linked proteins contained at least one additional inter-link between the same proteins and (ii) all others.
AF2 prediction and structure mapping
Hetero-dimers were predicted by AF2 multimer v2.2 (Evans et al, 2022) using the default protocol and sequence search methods. For each model, the pDockQ score was calculated (Bryant et al, 2022). Different model qualities were considered based on the pDockQ score. Mapping of cross-links on AF2-Multimer predictions was performed using the bio3d R package. To this end, we extracted the C-alpha atom coordinates of inter-linked lysines in the predicted zero-rank dimeric structure and calculated their distance in three-dimensional space.
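The distance calculation itself is simple Euclidean geometry. The Python sketch below shows one way to do it from PDB-format ATOM records using the standard fixed-column layout. It is an illustration only and not the bio3d-based code used for the analysis; the function names and the `(chain, residue_number)` keys are hypothetical.

```python
import math


def read_ca_coords(pdb_lines):
    """Collect C-alpha coordinates keyed by (chain, residue number).

    Relies on the fixed-column PDB ATOM layout: atom name in columns 13-16,
    chain in column 22, residue number in 23-26, x/y/z in 31-54.
    """
    coords = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            coords[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords


def ca_distance(coords, residue_a, residue_b):
    """Euclidean C-alpha to C-alpha distance in the units of the input (Å)."""
    return math.dist(coords[residue_a], coords[residue_b])
```

For an inter-link between lysine 10 of chain A and lysine 20 of chain B, the call would be `ca_distance(coords, ("A", 10), ("B", 20))`.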
Data availability
Code related to FDR calculations and simulations and an R script for the context-sensitive FDR analysis of MSAnnika data are provided under https://github.com/Bogdanob/XLMS_decoyFusion/.
The source data of this paper are collected in the following database record: biostudies:S-SCDT-10_1038-S44320-024-00079-w.
Author contributions
Boris Bogdanow: Conceptualization; Data curation; Formal analysis; Validation; Investigation; Visualization; Methodology; Writing—original draft; Writing—review and editing. Max Ruwolt: Formal analysis. Julia Ruta: Formal analysis. Lars Mühlberg: Formal analysis. Cong Wang: Formal analysis. Wen-feng Zeng: Formal analysis; Investigation. Arne Elofsson: Formal analysis; Funding acquisition; Investigation; Methodology. Fan Liu: Conceptualization; Supervision; Funding acquisition; Investigation; Methodology; Writing—original draft; Project administration; Writing—review and editing.
Source data underlying figure panels in this paper may have individual authorship assigned. Where available, figure panel/source data authorship is listed in the following database record: biostudies:S-SCDT-10_1038-S44320-024-00079-w.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Disclosure and competing interests statement
FL is a shareholder and advisory board member of Absea Biotechnology Ltd. and VantAI. The remaining authors declare no competing interests.
Acknowledgements
The authors acknowledge Philip Lössl (Absea Biotechnology, Berlin) for editing and critically reviewing the manuscript. BB acknowledges funding from DFG grant BO 5917/1-1. CW, MR, and JR are supported by the European Research Council (ERC) Starting Grant (ERC-STG No. 949184). LM is funded by the Leibniz-Wettbewerb (K284/2019). The AF2 computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at NSC (Berzelius), partially funded by the Swedish Research Council through grant agreement no. Berzelius-2021-29 and SNIC 2022/5-282 (AE). The work was funded by Deutsche Forschungsgemeinschaft (DFG) project LI 3260/6-1.
Supporting Information
Expanded View Figures (PDF document, 380.14 KB)
References
Bogdanow B, Gruska I, Mühlberg L, Protze J, Hohensee S, Vetter B, Bosse JB, Lehmann M, Sadeghi M, Wiebusch L, et al. Spatially resolved protein map of intact human cytomegalovirus virions Nat Microbiol 2023 8 1732-1747
Bogdanow B, Zauber H, and Selbach M Systematic errors in peptide and protein identification and quantification by modified peptides Mol Cell Proteom 2016 15 2791-2801
Bryant P, Pozzati G, and Elofsson A Improved prediction of protein-protein interactions using AlphaFold2 Nat Commun 2022 13 1265
Chen X, Sailer C, Kammer KM, Fürsch J, Eisele MR, Sakata E, Pellarin R, and Stengel F Mono- and intralink filter (Mi-Filter) to reduce false identifications in cross-linking mass spectrometry data Anal Chem 2022 94 17751-17756
Clasen MA, Ruwolt M, Kurt LU, Gozzo FC, Wang S, Chen T, Carvalho PC, Lima DB, and Liu F Proteome-scale recombinant standards and a robust high-speed search engine to advance cross-linking MS-based interactomics Nat Methods 2024
Cox J and Mann M MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification Nat Biotechnol 2008 26 1367-1372
Crowder DA, Sarpe V, Amaral BC, Brodie NI, Michael ARM, and Schriemer DC High-sensitivity proteome-scale searches for crosslinked peptides using CRIMP 2.0 Anal Chem 2023 95 6425-6432
Elias JE and Gygi SP Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry Nat Methods 2007 4 207-214
Evans R, O’Neill M, Pritzel A, Antropova N, Senior A, Green T, Žídek A, Bates R, Blackwell S, Yim J, et al. Protein complex prediction with AlphaFold-Multimer Preprint at https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1 2022
Fischer L and Rappsilber J Quirks of error estimation in cross-linking/mass spectrometry Anal Chem 2017 89 3829-3833
Fischer L and Rappsilber J Rescuing error control in crosslinking mass spectrometry Mol Syst Biol 2024 20 1076-1084
Fu Y and Qian X Transferred subgroup false discovery rate for rare post-translational modifications detected by mass spectrometry Mol Cell Proteom 2014 13 1359-1368
Furusawa C and Kaneko K Zipf’s law in gene expression Phys Rev Lett 2003 90 088102
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold Nature 2021 596 583-589
Lenz S, Sinn LR, O’Reilly FJ, Fischer L, Wegner F, and Rappsilber J Reliable identification of protein-protein interactions by crosslinking mass spectrometry Nat Commun 2021 12 3564
Liu F, Lössl P, Scheltema R, Viner R, and Heck AJR Optimized fragmentation schemes and data analysis strategies for proteome-wide cross-link identification Nat Commun 2017 8 15473
Liu F, Rijkers DTS, and Heck AJR Proteome-wide profiling of protein assemblies by cross-linking mass spectrometry Nat Methods 2015 12 1179-1184
Maiolica A, Cittaro D, Borsotti D, Sennels L, Ciferri C, Tarricone C, Musacchio A, and Rappsilber J Structural analysis of multiprotein complexes by cross-linking, mass spectrometry, and database searching Mol Cell Proteom 2007 6 2200-2211
Marx H, Lemeer S, Schliep JE, Matheron L, Mohammed S, Cox J, Mann M, Heck AJR, and Kuster B A large synthetic peptide and phosphopeptide reference library for mass spectrometry-based proteomics Nat Biotechnol 2013 31 557-564
Matzinger M, Kandioller W, Doppler P, Heiss EH, and Mechtler K Fast and highly efficient affinity enrichment of Azide-A-DSBSO cross-linked peptides J Proteome Res 2020 19 2071-2079
O’Reilly FJ and Rappsilber J Cross-linking mass spectrometry: methods and applications in structural, molecular and systems biology Nat Struct Mol Biol 2018 25 1000-1008
Sailer C, Jansen J, Sekulski K, Cruz VE, Erzberger JP, and Stengel F A comprehensive landscape of 60S ribosome biogenesis factors Cell Rep 2022 38 110353
Savitski MM, Wilhelm M, Hahne H, Kuster B, and Bantscheff M A scalable approach for protein false discovery rate estimation in large proteomic data sets Mol Cell Proteom 2015 14 2394-2404
Walzthoeni T, Claassen M, Leitner A, Herzog F, Bohn S, Förster F, Beck M, and Aebersold R False discovery rate estimation for cross-linked peptides identified by mass spectrometry Nat Methods 2012 9 901-903
Wheat A, Yu C, Wang X, Burke AM, Chemmama IE, Kaake RM, Baker P, Rychnovsky SD, Yang J, and Huang L Protein interaction landscapes revealed by advanced in vivo cross-linking-mass spectrometry Proc Natl Acad Sci USA 2021 118 e2023360118
Yang B, Wu Y-J, Zhu M, Fan S-B, Lin J, Zhang K, Li S, Chi H, Li Y-X, Chen H-F, et al. Identification of cross-linked peptides from complex samples Nat Methods 2012 9 904-906
Zhang J, Xin L, Shan B, Chen W, Xie M, Yuen D, Zhang W, Zhang Z, Lajoie GA, and Ma B PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification Mol Cell Proteom 2012 11 M111.010587
Zhou C, Dai S, Lin Y, Lian S, Fan X, Li N, and Yu W Exhaustive cross-linking search with protein feedback J Proteome Res 2023 22 101-113
Zhu Y, Akkaya KC, Ruta J, Yokoyama N, Wang C, Ruwolt M, Lima DB, Lehmann M, and Liu F Cross-link assisted spatial proteomics to map sub-organelle proteomes and membrane protein topologies Nat Commun 2024 15 1-18
Published In
Molecular Systems Biology
Volume 21, Issue 1, Jan 2025
Submission history
Received: 27 May 2024
Revision received: 20 November 2024
Accepted: 21 November 2024
Published online: 9 December 2024
Published in issue: 2 January 2025
Copyright
© 2024 The Author(s). Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the data associated with this article, unless otherwise stated in a credit line to the data, but does not extend to the graphical or creative elements of illustrations, charts, or figures. This waiver removes legal barriers to the re-use and mining of research data. According to standard scholarly practice, it is recommended to provide appropriate citation and attribution whenever technically possible.
Research Funding
Deutsche Forschungsgemeinschaft (DFG): BO 5917/1-1, LI 3260/6-1
EC | ERC | HORIZON EUROPE European Research Council (ERC): ERC-STG No. 949184
Leibniz Publik: K284/2019
Vetenskapsrådet (VR): Berzelius-2021-29, SNIC 2022/5-282