Source code for maspy.inference

"""
The module "inference" is still in development. The interface of high and low
level functions is not yet stable!

This module allows a simple protein inference; reporting a minimal set of
proteins that can explain a set of peptides; defining shared und unique
peptides; defining proteins that share equal evidence, protein subgroups and
subsumable proteins. In addition it should provide an interface that allows
convenient selection of observed peptides that can be used for quantification.

:About protein groups:
    In shotgun proteomics proteins are not directly analyzed, but via peptides
    generated by proteolytic digestion. Some of these peptides can't be
    unambiguously mapped to one single protein because of sequence homology
    between proteins.

    As a consequence, it is often not possible to determine the exact set of
    proteins that have generated the observed peptides. The general aim of
    protein grouping is to group proteins with shared evidence together to
    reduce redundancy and simplify data analysis and interpretation. The result
    of protein grouping not only contains protein groups, but also information
    about the relation of proteins and the ambiguity of peptides. This also
    simplifies inspection of the observed evidence that indicate the presence of
    a protein.

    There is however no consensus in the proteomics community how a protein
    group is defined. One approach is to group proteins together which observed
    peptide sets have a subset or sameset relation. Another possibility is to
    generate protein groups in a way, that they reflect the minimal number of
    proteins that can completely account for all observed peptides.

:A mapping based protein grouping algorithm:
    Protein groups, as generated by the function mappingBasedGrouping(), can be
    defined as a set of proteins, with sameset and subset relations. Each of
    these protein sets is essential to explain the presence of at least one
    peptide. In addition, proteins that share information with multiple groups,
    but are not a subset of any protein present in one of the groups, are
    assosicated with these groups as subsumable proteins.

    Principle of the grouping algorithm:
        - All samset proteins are merged and subset proteins are removed
        - Each unique protein generates a new protein group. This group contains
          all sameset and subset proteins of the group leading proteins. All
          peptides mapped to these proteins are considered to be group peptides.
          (A unique protein is mapped to at least least one peptide that is
          mapped to only one protein.)
        - Remaining peptides that are not explained by any group yet are seeds
          for the generation of new groups. All these peptides are still mapped
          to at least two proteins. The function _findRedundantProteins()
          generates a set of proteins that is considered as redundant. After
          removing these redundant proteins, all remaining peptides become
          unique. Proteins mapped to these peptides generate groups as described
          before.
        - Proteins that share peptides with multiple groups, but are not a
          subset of the group leading proteins are associated as susumable
          proteins.

:Protein grouping example data:
    See unit test module ../tests/test_inference.py
    A grafical representation of the example data and the expected results,
    generated by the function mappingBasedGrouping(), can be found here:
    https://docs.google.com/drawings/d/1irbXDODwYsbrEw-Pa4rok5xguYrCbaqWl7F03qKGdC0/pub?w=3262&h=6566

:Example:
    import maspy.inference

    proteinToPeptides = {
        '01_P1': {'01_p1', '01_p2'}, '02_P1': {'02_p1', '02_p2'},
        '02_P2': {'02_p2'}, '03_P1': {'03_p1', '03_p2'},
        '03_P2': {'03_p1', '03_p2'}, '04_P1': {'04_p1', '04_p2'},
        '04_P2': {'04_p1', '04_p2'}, '04_P3': {'04_p2'},
        '05_P1': {'05_p1', '05_p2', '05_p3'}, '05_P2': {'05_p1', '05_p4'},
        '05_P3': {'05_p2', '05_p3', '05_p4'},
        '06_P1': {'06_p1', '06_p2', '06_p3'}, '06_P2': {'06_p2', '06_p3'},
        '06_P3': {'06_p2', '06_p4'}, '06_P4': {'06_p2', '06_p3', '06_p4'},
        '06_P5': {'06_p2', '06_p4'}, '07_P1': {'07_p1', '07_p2'},
        '07_P2': {'07_p1', '07_p3', '07_p4'}, '07_P3': {'07_p2', '07_p3'},
        '07_P4': {'07_p3', '07_p5'}, '08_P1': {'08_p1', '08_p2'},
        '08_P2': {'08_p3', '08_p4'}, '08_P3': {'08_p2', '08_p3'},
        '09_P1': {'09_p1', '09_p3', '09_p4', '09_p5', '09_p7'},
        '09_P2': {'09_p1', '09_p2'}, '09_P3': {'09_p2', '09_p3'},
        '09_P4': {'09_p5', '09_p6'}, '09_P5': {'09_p6', '09_p7'},
        '10_P1': {'10_p1', '10_p2'}, '10_P2': {'10_p3', '10_p4', '10_p5'},
        '10_P3': {'10_p2', '10_p3'}, '10_P4': {'10_p2', '10_p3', '10_p4'},
        }

    inference = maspy.inference.mappingBasedGrouping(proteinToPeptides)

:Todo:
    Hypothesis testing
        Add a private function that can be used to performs hypothesis tests.
        These tests comprise the evaluation of some general assumptions about
        the grouping results.
        - Group representatives should contain all group peptides (not included
          are peptides from subsumable proteins).
        - Peptides of a subsumable protein should not be fully covered by non-
          subsumable proteins of one single group.
        - Group peptides of all the groups of a subsumable protein should
          contain all peptides of the respective subsumable protein.

    function mappingBasedGrouping()
        Refactoring
            - Processing of individual clusters can be put into a function
            - Processing of groupInitiatingProteins, redundantProteins and
              subsetProteins can be put into function
            - The First part that defines various sets of proteins can be put
              into a function
            - Finally remove asserts
            - Maybe it is not necessary to work with merged proteins, after
              defining them...
        Unittest
            - Test if all clusters are properly stored:
              see inference.clusters and inference._proteinToClusterId

:Note:
    Whenever possible, we use the PSI-MS ontology terms to describe protein
    groups and protein relationships. However, in order to be less ambiguous
    some of the terms are more strictly defined in maspy or used in a slightly
    different context. The details of used terms are described below.

    Protein cluster (proteinCluster)
        A protein cluster comprises all proteins that are somehow linked by
        shared peptide evidence. (related to PSI-MS MS:1002407)

    Protein group (proteinGroup)
        A group of proteins that are essential to explain at least one peptide.
        A protein group consists of its leading, subset and subsumable proteins.
        However, subsumable proteins are more loosely associated with the group.
        The common peptide evidence of a group is the sum of all its leading and
        subset proteins. As a consequence, all leading and subset proteins must
        be a sameset or a subset of the protein groups common peptide evidence.

    Representative protein (representative)
        Each protein group must have exactle one representative protein. This
        protein must be amongst the leading proteins of the group. (identical to
        PSI-MS MS:1002403)

    Leading protein (leading)
        Each protein group must have at least one leading protein, to indicate
        proteins with the strongest evidence. (identical to PSI-MS MS:1002401)
        All proteins present in a protein group that are not leading proteins
        can be considered to be non-leading proteins (PSI-MS MS:1002402).

    Same-set protein (sameset)
        A protein that shares exactly the same set of peptide evidence with one
        or multiple other proteins. (identical to PSI-MS MS:1001594)

    Sub-set protein (subset)
        A protein with a sub-set of the peptide evidence of another protein. It
        can be a sub-set of multiple proteins, and those don't have to be in the
        same protein group. (identical to PSI-MS MS:1001596)

    Subsumable protein (subsumable)
        A protein which peptide evidence is distributed across multiple proteins
        in at least two different protein groups. This term is mutually exclusiv
        with the term sub-set protein, and hierarchically below sub-set. That
        means that a protein can only be a subsumable if it is not already a
        subset of another protein. Also the protein is not allowed to be a
        "leading" protein in any protein group. (is related to the PSI-MS term
        MS:1002570 "multiply subsumable protein", but defined more strictly)


Relevant terms from the PSI-MS ontology
---------------------------------------
Group representative (MS:1002403)
    An arbitrary and optional flag applied to exactly one protein per group to
    indicate it can serve as the representative of the group, amongst leading
    proteins, in effect serving as a tiebreaker for approaches that require
    exactly one group representative.

Leading protein (MS:1002401)
    At least one protein within each group should be annotated as a leading
    protein to indicate it has the strongest evidence, or approximately equal
    evidence as other group members.

Non-leading protein (MS:1002402)
    Zero to many proteins within each group should be annotated as non-leading
    to indicate that other proteins have stronger evidence.

Cluster identifier (MS:1002407)
    An identifier applied to protein groups to indicate that they are linked by
    shared peptides.

Same-set protein (MS:1001594)
    A protein which is indistinguishable or equivalent to another protein,
    having matches to an identical set of peptide sequences.

Sequence sub-set protein (MS:1001596)
    A protein with a sub-set of the peptide sequence matches for another
    protein, and no distinguishing peptide matches.

Sequence subsumable protein (MS:1001598)
    A sequence same-set or sequence sub-set protein where the matches are
    distributed across two or more proteins.

Sequence multiply subsumable protein (MS:1002570)
    A protein for which the matched peptide sequences are the same, or a
    subset of, the matched peptide sequences for two or more other proteins
    combined. These other proteins need not all be in the same group.

Peptide unique to one protein (MS:1001363)
    A peptide matching only one.

Peptide shared in multiple proteins (MS:1001175)
    A peptide matching multiple proteins.


Additional terms used in maspy
------------------------------

Unique protein (uniqueProtein)
    A protein mapped to at least one unique peptide.

Super-set protein:
    A protein with a set of peptide matches, that is a strict superset of other
    proteins (strict means that the peptide matches are not equal).
"""

#  Copyright 2015-2017 David M. Hollenstein, Jakob J. Hollenstein
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

######################### Python 2 and 3 compatibility #########################
from __future__ import absolute_import, division, print_function
from __future__ import unicode_literals
from future.utils import viewitems, viewkeys, viewvalues, listitems, listvalues

try:
    #python 2.7
    from itertools import izip as zip
except ImportError:
    #python 3 series
    pass
################################################################################

import warnings
warnings.warn('Module maspy.inference is currently in development.',
              ImportWarning)

from collections import Counter
from collections import defaultdict as ddict
import operator

import maspy.auxiliary as AUX


"""
Mapping objects
---------------

:param pepToProts: dict, for each peptide (=key) contains a set of parent
    proteins (=value). For Example {peptide: {protein, ...}, ...}
:param protToPeps: dict, for each protein (=key) contains a set of
    associated peptides (=value). For Example {protein: {peptide, ...}, ...}
"""

[docs]class ProteinInference(object): """Contains the result of a protein inference analysis. :ivar protToPeps: dict, for each protein (=key) contains a set of associated peptides (=value). For Example {protein: {peptide, ...}, ...} :ivar pepToProts: dict, for each peptide (=key) contains a set of parent proteins (=value). For Example {peptide: {protein, ...}, ...} :ivar proteins: dict, protein ids pointing to Protein objects :ivar groups: dict, protein group ids pointing to ProteinGroup objects :ivar clusters: dict, cluster ids pointing to a list of protein group ids that are part of the same cluster. See "Cluster identifier (MS:1002407)" :ivar _proteinToGroupIds: dict, protein ids containing a set of protein group ids that they are associated with. NOTE: Protein objects do not contain a reference to their protein groups :ivar _proteinToClusterId: dict, protein ids containing a set of cluster ids that they are associated with. NOTE: Protein objects do not contain a reference to their clusters :ivar _uniquePeptides: set, all unique peptides of all protein groups. I.e. peptides that are only mapped to one protein (after merging same-set proteins). :ivar _groupUniquePeptides: set, all group unique peptides of all protein groups. I.e. peptides that are not unique but only explained by proteins of one single protein group and not by any subsumable proteins. :ivar _groupSubsumablePeptides: set, all group subsumable peptides of all protein groups. I.e. peptides that are only explained by proteins of one single protein group but also of subsumable proteins. :ivar _sharedPeptides: set, all shared peptides of all protein groups. I.e. peptides that are explained by proteins of multiple protein groups, not considering subsumable proteins. """ def __init__(self, protToPeps): self.protToPeps = protToPeps self.pepToProts = _invertMapping(protToPeps) self.proteins = dict() self.groups = dict() self.clusters = dict() self._proteinToGroupIds = ddict(set) self._proteinToClusterId = dict() self._uniquePeptides = set() self._groupUniquePeptides = set() self._groupSubsumablePeptides = set() self._sharedPeptides = set() self._nextGroupId = 1
[docs] def getGroups(self, proteinId): """Return a list of protein groups a protein is associated with.""" return [self.groups[gId] for gId in self._proteinToGroupIds[proteinId]]
[docs] def addProteinGroup(self, groupRepresentative): """Adds a new protein group and returns the groupId. The groupId is defined using an internal counter, which is incremented every time a protein group is added. The groupRepresentative is added as a leading protein. :param groupRepresentative: the protein group representing protein :returns: the protein groups groupId """ groupId = self._getNextGroupId() self.groups[groupId] = ProteinGroup(groupId, groupRepresentative) self.addLeadingToGroups(groupRepresentative, groupId) return groupId
[docs] def addLeadingToGroups(self, proteinIds, groupIds): """Add one or multiple leading proteins to one or multiple protein groups. :param proteinIds: a proteinId or a list of proteinIds, a proteinId must be a string. :param groupIds: a groupId or a list of groupIds, a groupId must be a string. """ for groupId in AUX.toList(groupIds): self.groups[groupId].addLeadingProteins(proteinIds) self._addProteinIdsToGroupMapping(proteinIds, groupId)
[docs] def addSubsetToGroups(self, proteinIds, groupIds): """Add one or multiple subset proteins to one or multiple protein groups. :param proteinIds: a proteinId or a list of proteinIds, a proteinId must be a string. :param groupIds: a groupId or a list of groupIds, a groupId must be a string. """ for groupId in AUX.toList(groupIds): self.groups[groupId].addSubsetProteins(proteinIds) self._addProteinIdsToGroupMapping(proteinIds, groupId)
[docs] def addSubsumableToGroups(self, proteinIds, groupIds): """Add one or multiple subsumable proteins to one or multiple protein groups. :param proteinIds: a proteinId or a list of proteinIds, a proteinId must be a string. :param groupIds: a groupId or a list of groupIds, a groupId must be a string. """ for groupId in AUX.toList(groupIds): self.groups[groupId].addSubsumableProteins(proteinIds) self._addProteinIdsToGroupMapping(proteinIds, groupId)
def _getNextGroupId(self): """Returns the next internal groupId and increments the counter.""" groupId = self._nextGroupId self._nextGroupId += 1 return str(groupId) def _addProteinIdsToGroupMapping(self, proteinIds, groupId): """Add a groupId to one or multiple entries of the internal proteinToGroupId mapping. :param proteinIds: a proteinId or a list of proteinIds, a proteinId must be a string. :param groupId: str, a groupId """ for proteinId in AUX.toList(proteinIds): self._proteinToGroupIds[proteinId].add(groupId)
[docs]class ProteinGroup(object): """A protein ambiguity group. :ivar id: unique identifier of the protein, should allow mapping to a protein database :ivar representative: a single protein that represents the group :ivar proteins: set, all leading proteins, subset proteins and the representative protein. Subsumable proteins are not included here! :ivar leading: set, leading proteins (MS:1002401) :ivar subset: set, sub-set proteins :ivar subsumableProteins: set, group sub-sumable proteins """ def __init__(self, groupId, representative): self.id = groupId self.representative = representative self.proteins = set() self.leading = set() self.subset = set() self.subsumableProteins = set() def _addProteins(self, proteinIds, containerNames): """Add one or multiple proteinIds to the respective container. :param proteinIds: a proteinId or a list of proteinIds, a proteinId must be a string. :param containerNames: list, entries must be one or multiple of 'leading', 'subset', 'subsumableProteins' or 'proteins' :param addToProteins: bool, if True the proteinIds are added to the """ proteinIds = AUX.toList(proteinIds) for containerName in containerNames: proteinContainer = getattr(self, containerName) proteinContainer.update(proteinIds)
[docs] def addLeadingProteins(self, proteinIds): """Add one or multiple proteinIds as leading proteins. :param proteinIds: a proteinId or a list of proteinIds, a proteinId must be a string. """ self._addProteins(proteinIds, ['leading', 'proteins'])
[docs] def addSubsetProteins(self, proteinIds): """Add one or multiple proteinIds as subset proteins. :param proteinIds: a proteinId or a list of proteinIds, a proteinId must be a string. """ self._addProteins(proteinIds, ['subset', 'proteins'])
[docs] def addSubsumableProteins(self, proteinIds): """Add one or multiple proteinIds as subsumable proteins. :param proteinIds: a proteinId or a list of proteinIds, a proteinId must be a string. """ self._addProteins(proteinIds, ['subsumableProteins'])
[docs]class Protein(object): """Protein represents one entry from a protein database. It describes the protein relations and how a set of observed peptides are mapped to it. :ivar id: unique identifier of the protein, should allow mapping to a protein database :ivar peptides: a list of all peptides that are mapped onto a protein :ivar uniquePeptides: unique to one single protein (or one protein after merging sameset proteins) :ivar groupUniquePeptides: shared only with proteins of the same group :ivar groupSubsumablePeptides: shared only with subsumable proteins, but not with multiple other protein groups :ivar sharedPeptides: shared between proteins of multiple protein groups :ivar isSubset: set, protein is a subset of these proteins :ivar isSameset: set, protein set that shares identical evidence :ivar isLeading: bool, True if protein is a leading protein of its group. :ivar isSubsumable: bool, True if protein is associated with multiple protein groups as a subsumable protein. """ def __init__(self, proteinId, peptides): self.id = proteinId self.peptides = peptides self.uniquePeptides = set() self.groupUniquePeptides = set() self.groupSubsumablePeptides = set() self.sharedPeptides = set() self.isSubset = set() self.isSameset = set() self.isLeading = False self.isSubsumable = False
[docs]def mappingBasedGrouping(protToPeps): """Performs protein grouping based only on protein to peptide mappings. :param protToPeps: dict, for each protein (=key) contains a set of associated peptides (=value). For Example {protein: {peptide, ...}, ...} #TODO: REFACTORING!!! returns a ProteinInference object """ inference = ProteinInference(protToPeps) pepToProts = inference.pepToProts proteinClusters = _findProteinClusters(protToPeps, pepToProts) proteins = {} for clusterId, proteinCluster in enumerate(proteinClusters, 1): clusterProtToPeps = {p: protToPeps[p] for p in proteinCluster} #Find sameset proteins, define unique and non unique sameset proteins #NOTE: already unique proteins could be excluded to find sameset proteins samesetProteins = _findSamesetProteins(clusterProtToPeps) mergedProtToPeps = _mergeProteinEntries(samesetProteins, clusterProtToPeps) mergedPepToProts = _invertMapping(mergedProtToPeps) uniqueProteins = _findUniqueMappingValues(mergedPepToProts) remainingProteins = set(mergedProtToPeps).difference(uniqueProteins) # Remove subset proteins and check if remaining proteins become unique subsetProteinInfo = _findSubsetProteins(remainingProteins, mergedProtToPeps, mergedPepToProts) subsetProteins = [p for p, _ in subsetProteinInfo] subsetRemovedProtToPeps = _reducedProtToPeps(mergedProtToPeps, subsetProteins) subsetRemovedPepToProts = _invertMapping(subsetRemovedProtToPeps) uniqueSubsetRemoved = _findUniqueMappingValues(subsetRemovedPepToProts) remainingProteins = remainingProteins.difference(subsetProteins) remainingProteins = remainingProteins.difference(uniqueSubsetRemoved) # Find redundant proteins # subsumableProteins = _findRedundantProteins(subsetRemovedProtToPeps, subsetRemovedPepToProts) remainingNonRedundant = remainingProteins.difference(subsumableProteins) groupInitiatingProteins = uniqueSubsetRemoved.union(remainingNonRedundant) # - Generate protein groups and assign proteins to groups - # #Generate protein groups clusterGroupIds = set() for protein in groupInitiatingProteins: proteinIds = AUX.toList(protein) groupId = inference.addProteinGroup(proteinIds[0]) inference.addLeadingToGroups(proteinIds, groupId) clusterGroupIds.add(groupId) #Add redundant proteins here (must be subsumable I guess) for protein in subsumableProteins: proteinIds = AUX.toList(protein) connectedProteins = _mappingGetValueSet( mergedPepToProts, mergedProtToPeps[protein] ) flatConnectedProteins = _flattenMergedProteins(connectedProteins) groupIds = _mappingGetValueSet( inference._proteinToGroupIds, flatConnectedProteins ) inference.addSubsumableToGroups(proteinIds, groupIds) assert len(groupIds) > 1 #Add subgroup proteins to the respective groups #NOTE: proteins that are only a subset of subsumable proteins are not #to be added as subset proteins to a group but as subsumable proteins. for protein, supersetProteins in subsetProteinInfo: proteinIds = AUX.toList(protein) #If the protein is a subset of at least one protein, that is not a #subsumable protein, then it should be added to the group as subset. leadingSuperProteins = supersetProteins.intersection( groupInitiatingProteins) if leadingSuperProteins: flatSupersetProteins = _flattenMergedProteins( leadingSuperProteins) superGroupIds = _mappingGetValueSet( inference._proteinToGroupIds, flatSupersetProteins ) inference.addSubsetToGroups(proteinIds, superGroupIds) #However, if all its super proteins are subsumable, the protein #itself is a subsumable protein. else: flatSupersetProteins = _flattenMergedProteins(supersetProteins) superGroupIds = _mappingGetValueSet( inference._proteinToGroupIds, flatSupersetProteins ) inference.addSubsumableToGroups(proteinIds, superGroupIds) subsumableProteins.update(proteinIds) assert superGroupIds # - Define peptide properties - # groupToPeps = dict() allSubsumablePeps = set() for groupId in clusterGroupIds: group = inference.groups[groupId] if group.subsumableProteins: subsumablePeptides = _mappingGetValueSet( protToPeps, group.subsumableProteins ) allSubsumablePeps.update(subsumablePeptides) groupPeptides = _mappingGetValueSet(protToPeps, group.proteins) groupToPeps[groupId] = groupPeptides pepToGroups = _invertMapping(groupToPeps) #Get unique peptides from peptide to protein mapping uniquePeptides = _findUniqueMappingKeys(mergedPepToProts) #Shared peptides have a groupPeptideCount > 1 nonSharedPeptides = _findUniqueMappingKeys(pepToGroups) sharedPeptides = set(pepToGroups).difference(nonSharedPeptides) #Subsumable peptides are peptides from subsumable proteins that #are not shared peptides of multiple groups subsumablePeptides = allSubsumablePeps.difference(sharedPeptides) #groupUniquePeptides are the remaining ones (not shared with subsumable #proteins, groupPeptideCount == 1, not unique peptides) groupUniquePeptides = nonSharedPeptides.difference(subsumablePeptides) groupUniquePeptides = groupUniquePeptides.difference(uniquePeptides) inference._uniquePeptides.update(uniquePeptides) inference._groupUniquePeptides.update(groupUniquePeptides) inference._groupSubsumablePeptides.update(subsumablePeptides) inference._sharedPeptides.update(sharedPeptides) # - Generate protein entries and add them to the inference object - # subsetProteinInfoDict = dict(subsetProteinInfo) for protein, peptides in viewitems(mergedProtToPeps): _uniquePeptides = peptides.intersection(uniquePeptides) _groupUniquePeptides = peptides.intersection(groupUniquePeptides) _subsumablePeptides = peptides.intersection(subsumablePeptides) _sharedPeptides = peptides.intersection(sharedPeptides) proteinIds = AUX.toList(protein) for proteinId in proteinIds: proteinEntry = Protein(proteinId, peptides) if protein in groupInitiatingProteins: proteinEntry.isLeading = True elif protein in subsumableProteins: proteinEntry.isSubsumable = True if protein in subsetProteins: superset = subsetProteinInfoDict[protein] proteinEntry.isSubset = _flattenMergedProteins(superset) if len(proteinIds) > 1: proteinEntry.isSameset = set(proteinIds) inference.proteins[proteinId] = proteinEntry #Add peptides to protein entry proteinEntry.uniquePeptides = _uniquePeptides proteinEntry.groupUniquePeptides = _groupUniquePeptides proteinEntry.groupSubsumablePeptides = _subsumablePeptides proteinEntry.sharedPeptides = _sharedPeptides # - Save cluster information - # for proteinId in proteinCluster: inference._proteinToClusterId[proteinId] = clusterId inference.clusters[clusterId] = clusterGroupIds allProteins = set() for proteinGroup in viewvalues(inference.groups): allProteins.update(proteinGroup.proteins) allProteins.update(proteinGroup.subsumableProteins) assert len(allProteins) == len(protToPeps) return inference
def _findProteinClusters(protToPeps, pepToProts): """Find protein clusters in the specified protein to peptide mappings. A protein cluster is a group of proteins that are somehow directly or indirectly connected by shared peptides. :param protToPeps: dict, for each protein (=key) contains a set of associated peptides (=value). For Example {protein: {peptide, ...}, ...} :param pepToProts: dict, for each peptide (=key) contains a set of parent proteins (=value). For Example {peptide: {protein, ...}, ...} :returns: a list of protein clusters, each cluster is a set of proteins """ clusters = list() resolvingProteins = set(protToPeps) while resolvingProteins: protein = resolvingProteins.pop() proteinCluster = set([protein]) peptides = set(protToPeps[protein]) parsedPeptides = set() while len(peptides) != len(parsedPeptides): for peptide in peptides: proteinCluster.update(pepToProts[peptide]) parsedPeptides.update(peptides) for protein in proteinCluster: peptides.update(protToPeps[protein]) clusters.append(proteinCluster) resolvingProteins = resolvingProteins.difference(proteinCluster) return clusters def _findSamesetProteins(protToPeps, proteins=None): """Find proteins that are mapped to an identical set of peptides. :param protToPeps: dict, for each protein (=key) contains a set of associated peptides (=value). For Example {protein: {peptide, ...}, ...} :param proteins: iterable, proteins that are tested for having equal evidence. If not specified all proteins are tested :returns: a list of sorted protein tuples that share equal peptide evidence """ proteins = viewkeys(protToPeps) if proteins is None else proteins equalEvidence = ddict(set) for protein in proteins: peptides = protToPeps[protein] equalEvidence[tuple(sorted(peptides))].add(protein) equalProteins = list() for proteins in viewvalues(equalEvidence): if len(proteins) > 1: equalProteins.append(tuple(sorted(proteins))) return equalProteins def _findSubsetProteins(proteins, protToPeps, pepToProts): """Find proteins which peptides are a sub-set, but not a same-set to other proteins. :param proteins: iterable, proteins that are tested for being a subset :param pepToProts: dict, for each peptide (=key) contains a set of parent proteins (=value). For Example {peptide: {protein, ...}, ...} :param protToPeps: dict, for each protein (=key) contains a set of associated peptides (=value). For Example {protein: {peptide, ...}, ...} :returns: a list of pairs of protein and their superset proteins. [(protein, {superset protein, ...}), ...] """ proteinsEqual = lambda prot1, prot2: protToPeps[prot1] == protToPeps[prot2] subGroups = list() for protein in proteins: peptideCounts = Counter() for peptide in protToPeps[protein]: proteins = pepToProts[peptide] peptideCounts.update(proteins) peptideCount = peptideCounts.pop(protein) superGroups = set() for sharingProtein, sharedPeptides in peptideCounts.most_common(): if peptideCount == sharedPeptides: if not proteinsEqual(protein, sharingProtein): superGroups.add(sharingProtein) else: break if superGroups: subGroups.append((protein, superGroups)) return subGroups def _findRedundantProteins(protToPeps, pepToProts, proteins=None): """Returns a set of proteins with redundant peptide evidence. After removing the redundant proteins from the "protToPeps" and "pepToProts" mapping, all remaining proteins have at least one unique peptide. The remaining proteins are a "minimal" set of proteins that are able to explain all peptides. However, this is not guaranteed to be the optimal solution with the least number of proteins. In addition it is possible that multiple solutions with the same number of "minimal" proteins exist. Procedure for finding the redundant proteins: 1. Generate a list of proteins that do not contain any unique peptides, a unique peptide has exactly one protein entry in "pepToProts". 2. Proteins are first sorted in ascending order of the number of peptides. Proteins with an equal number of peptides are sorted in descending order of their sorted peptide frequencies (= proteins per peptide). If two proteins are still equal, they are sorted alpha numerical in descending order according to their protein names. For example in the case of a tie between proteins "A" and "B", protein "B" would be removed. 3. Parse this list of sorted non unique proteins; If all its peptides have a frequency value of greater 1; mark the protein as redundant; remove its peptides from the peptide frequency count, continue with the next entry. 4. Return the set of proteins marked as redundant. :param pepToProts: dict, for each peptide (=key) contains a set of parent proteins (=value). For Example {peptide: {protein, ...}, ...} :param protToPeps: dict, for each protein (=key) contains a set of associated peptides (=value). For Example {protein: {peptide, ...}, ...} :param proteins: iterable, proteins that are tested for being redundant. If None all proteins in "protToPeps" are parsed. :returns: a set of redundant proteins, i.e. proteins that are not necessary to explain all peptides """ if proteins is None: proteins = viewkeys(protToPeps) pepFrequency = _getValueCounts(pepToProts) protPepCounts = _getValueCounts(protToPeps) getCount = operator.itemgetter(1) getProt = operator.itemgetter(0) #TODO: quick and dirty solution #NOTE: add a test for merged proteins proteinTuples = list() for protein in proteins: if isinstance(protein, tuple): proteinTuples.append(protein) else: proteinTuples.append(tuple([protein])) sort = list() for protein in sorted(proteinTuples, reverse=True): if len(protein) == 1: protein = protein[0] protPepFreq = [pepFrequency[pep] for pep in protToPeps[protein]] if min(protPepFreq) > 1: sortValue = (len(protPepFreq)*-1, sorted(protPepFreq, reverse=True)) sort.append((protein, sortValue)) sortedProteins = map(getProt, sorted(sort, key=getCount, reverse=True)) redundantProteins = set() for protein in sortedProteins: for pep in protToPeps[protein]: if pepFrequency[pep] <= 1: break else: protPepFrequency = Counter(protToPeps[protein]) pepFrequency.subtract(protPepFrequency) redundantProteins.add(protein) return redundantProteins def _mergeProteinEntries(proteinLists, protToPeps): """Returns a new "protToPeps" dictionary with entries merged that are present in proteinLists. NOTE: The key of the merged entry is a tuple of the sorted protein keys. This behaviour might change in the future; the tuple might be replaced by simply one of the protein entries which is then representative for all. :param proteinLists: a list of protein groups that will be merged [{protein, ...}, ...] :param protToPeps: dict, for each protein (=key) contains a set of associated peptides (=value). For Example {protein: {peptide, ...}, ...} :returns: dict, {protein: set([peptid, ...])} """ mergedProtToPeps = dict(protToPeps) for proteins in proteinLists: for protein in proteins: peptides = mergedProtToPeps.pop(protein) mergedProtein = tuple(sorted(proteins)) mergedProtToPeps[mergedProtein] = peptides return mergedProtToPeps def _reducedProtToPeps(protToPeps, proteins): """Returns a new, reduced "protToPeps" dictionary that does not contain entries present in "proteins". :param protToPeps: dict, for each protein (=key) contains a set of associated peptides (=value). For Example {protein: {peptide, ...}, ...} :param proteins: a list of proteinSet :returns: dict, protToPeps not containing entries from "proteins" """ return {k: v for k, v in viewitems(protToPeps) if k not in proteins} def _findUniqueMappingValues(mapping): """Find mapping entries that are unique for one key (value length of 1). .. Note: This function can be used to find unique proteins by providing a peptide to protein mapping. :param mapping: dict, for each key contains a set of entries :returns: a set of unique mapping values """ uniqueMappingValues = set() for entries in viewvalues(mapping): if len(entries) == 1: uniqueMappingValues.update(entries) return uniqueMappingValues def _findUniqueMappingKeys(mapping): """Find mapping keys that only have one entry (value length of 1. :param mapping: dict, for each key contains a set of entries :returns: a set of unique mapping keys """ uniqueMappingKeys = set() for key, entries in viewitems(mapping): if len(entries) == 1: uniqueMappingKeys.add(key) return uniqueMappingKeys def _invertMapping(mapping): """Converts a protein to peptide or peptide to protein mapping. :param mapping: dict, for each key contains a set of entries :returns: an inverted mapping that each entry of the values points to a set of initial keys. """ invertedMapping = ddict(set) for key, values in viewitems(mapping): for value in values: invertedMapping[value].add(key) return invertedMapping def _getValueCounts(mapping): """Returns a counter object; contains for each key of the mapping the counts of the respective value element (= set length). :param mapping: dict, for each key contains a set of entries. :returns: a counter """ return Counter({k: len(v) for k, v in viewitems(mapping)}) def _mappingGetValueSet(mapping, keys): """Return a combined set of values from the mapping. :param mapping: dict, for each key contains a set of entries returns a set of combined entries """ setUnion = set() for k in keys: setUnion = setUnion.union(mapping[k]) return setUnion def _flattenMergedProteins(proteins): """Return a set where merged protein entries in proteins are flattened. :param proteins: an iterable of proteins, can contain merged protein entries in the form of tuple([protein1, protein2]). returns a set of protein entries, where all entries are strings """ proteinSet = set() for protein in proteins: if isinstance(protein, tuple): proteinSet.update(protein) else: proteinSet.add(protein) return proteinSet