Fasta files in MasPy

The FASTA file format

The fasta format is the standard file format for storing protein sequence information in the field of proteomics. A fasta file can store anywhere from a few to tens of thousands of protein entries, for example when it contains all known protein-coding genes of an organism.

Each protein entry begins with a header line, which is followed by lines describing the amino acid sequence in the single letter code. A header line always starts with the greater-than symbol >, which distinguishes it from the sequence part. Header lines can contain varying amounts of information, typically starting with a unique identifier of the entry, followed for example by the protein name and a protein description. However, there are no common rules for either the content or the syntax of the header line. As a consequence, each protein database might use its own fasta header format, which has to be known to parse its fasta files properly.
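The header and sequence structure described above can be illustrated with a minimal reader written with the Python standard library only. This is a sketch, not the parser used by MasPy or Pyteomics; the function name read_fasta and the two example entries (including their accession numbers) are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open fasta file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # A new header line closes the previous entry, if there is one.
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header line.
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


# Hypothetical fasta content; the entries do not refer to real proteins.
example = io.StringIO(
    ">sp|P00001|TEST_HUMAN Hypothetical test protein\n"
    "MKTAYIAKQR\n"
    "QISFVK\n"
    ">sp|P00002|TEST2_HUMAN Another hypothetical protein\n"
    "MADEEK\n"
)
entries = list(read_fasta(example))
for header, sequence in entries:
    print(header, sequence)
```

The reader returns the header line as one string; splitting it into an identifier, name and description is the database-specific step that a header parser has to perform.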

MasPy provides functions to import proteins from a fasta file and perform an in silico digestion thereof. For parsing of the fasta protein headers we rely on the Pyteomics library, which provides automatic recognition and parsing of the formats UniProtKB, UniRef, UniParc and UniMES (UniProt Metagenomic and Environmental Sequences), described at uniprot.org. In addition we have added a custom header parser for fasta files in the SGD format from yeastgenome.org, which can also serve as a template for additional parser functions, see maspy.proteindb.fastaParseSgd().

Protein database in MasPy

Upon import, parsed protein entries are represented as maspy.proteindb.ProteinSequence elements. They are stored in a maspy.proteindb.ProteinDatabase instance and can be accessed by their unique ids in ProteinDatabase.proteins. Each protein sequence is then digested in silico using the specified cleavage rules, yielding smaller maspy.proteindb.PeptideSequence elements, which are stored in ProteinDatabase.peptides and can be accessed by their amino acid sequence. The proteins attribute of a PeptideSequence contains a set() of protein ids referencing all proteins that generated this peptide sequence during digestion. In addition, the positions of the peptide within each protein sequence are stored in the attribute proteinPositions. If the peptide was generated from only one protein, its isUnique attribute is set to True. Each ProteinSequence keeps lists of the unique and shared peptides generated during its digestion in the attributes uniquePeptides and sharedPeptides, respectively. It also has an isUnique attribute, which is set to True if the digestion of that protein generated at least one unique peptide.
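The bookkeeping between peptides and proteins can be sketched with plain dictionaries. The snippet below does not use MasPy classes; the protein ids, sequences and variable names are hypothetical, the peptide lists are written out by hand, and the 0-based half-open position convention is an assumption that may differ from the one used in proteinPositions.

```python
from collections import defaultdict

# Two hypothetical proteins and the peptides a tryptic digestion without
# missed cleavages would produce from them (written out by hand).
proteins = {
    'protA': 'MKAAAKCCCR',
    'protB': 'GGGKCCCR',
}
digests = {
    'protA': ['MK', 'AAAK', 'CCCR'],
    'protB': ['GGGK', 'CCCR'],
}

peptide_proteins = defaultdict(set)    # peptide -> set of protein ids
peptide_positions = defaultdict(dict)  # peptide -> {protein id: (start, end)}
for protein_id, peptides in digests.items():
    sequence = proteins[protein_id]
    for peptide in peptides:
        start = sequence.find(peptide)
        peptide_proteins[peptide].add(protein_id)
        # Assumption: 0-based start, exclusive end.
        peptide_positions[peptide][protein_id] = (start, start + len(peptide))

# A peptide is unique if exactly one protein generated it.
peptide_is_unique = {
    peptide: len(ids) == 1 for peptide, ids in peptide_proteins.items()
}

# Per protein: split its peptides into unique and shared ones.
unique_peptides = {
    pid: [p for p in peps if peptide_is_unique[p]]
    for pid, peps in digests.items()
}
shared_peptides = {
    pid: [p for p in peps if not peptide_is_unique[p]]
    for pid, peps in digests.items()
}
# A protein is unique if it has at least one unique peptide.
protein_is_unique = {pid: bool(peps) for pid, peps in unique_peptides.items()}

print(dict(peptide_proteins))
print(unique_peptides, shared_peptides, protein_is_unique)
```

In this example the peptide CCCR is shared by both proteins, while every other peptide is unique, so both proteins count as unique.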

Basic code examples

Importing a fasta file

The function maspy.proteindb.importProteinDatabase() is used to import fasta files, perform an in silico digestion and return a ProteinDatabase instance. The parameters minLength and maxLength specify the minimum and maximum length of the peptides that are stored in the ProteinDatabase after digestion. The default digestion rule is a tryptic digestion, cutting C-terminally of lysine and arginine.

import maspy.proteindb
fastaPath = 'filelocation/human.fasta'
proteindb = maspy.proteindb.importProteinDatabase(fastaPath, minLength=7,
                                                  maxLength=50
                                                  )

By passing the headerParser argument it is possible to specify an alternative function which is called to interpret the fasta header line.

parser = maspy.proteindb.fastaParseSgd
proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
                                                  headerParser=parser
                                                  )

If a fasta file contains decoy proteins or contamination proteins that are marked by a tag at the start of their unique ids, the tags should be specified with the arguments decoyTag and contaminationTag, respectively, to allow proper parsing of the fasta headers.

proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
                                                  decoyTag='[decoy]',
                                                  contaminationTag='[cont]'
                                                  )
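What a tag at the start of an id means in practice can be shown with a small helper. This is only a sketch of the idea: the function split_tags, its return values and the example ids are hypothetical, and MasPy may handle tags differently internally.

```python
def split_tags(entry_id, decoy_tag='[decoy]', contamination_tag='[cont]'):
    """Strip leading tags from an id and report which ones were present.

    Tags are checked once, in the order decoy then contamination, so an id
    with the tags in the opposite order is only partially stripped.
    """
    is_decoy = False
    is_contaminant = False
    if decoy_tag and entry_id.startswith(decoy_tag):
        is_decoy = True
        entry_id = entry_id[len(decoy_tag):]
    if contamination_tag and entry_id.startswith(contamination_tag):
        is_contaminant = True
        entry_id = entry_id[len(contamination_tag):]
    return entry_id, is_decoy, is_contaminant


# Hypothetical ids as they might appear at the start of a header line.
for raw_id in ['P00001', '[decoy]P00001', '[cont]P00002']:
    print(split_tags(raw_id))
```

Removing the tag before the header is interpreted is what allows a database-specific header parser to work on decoy and contamination entries as well.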

Specifying parameters for in silico digestion

The protein cleavage rule can be changed by passing a regular expression with the argument cleavageRule. The dictionary maspy.constants.expasy_rules contains cleavage rules for the most popular proteolytic enzymes. The concept of finding cleavage positions with regular expressions was adapted from the Python library Pyteomics, and the expasy_rules collection of cleavage rules was copied from pyteomics.parser.expasy_rules. The asp-n cleavage rule provides an example of an enzyme that cuts N-terminally of a specific amino acid.

>>> import maspy.constants
>>> aspN = maspy.constants.expasy_rules['asp-n']
>>> proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
...                                                   cleavageRule=aspN
...                                                   )
>>> aspN
u'\\w(?=D)'
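How a regular expression defines cleavage positions can be sketched with the re module alone: the sequence is cut after the end of every match. The helper cleave below is hypothetical and not part of MasPy or Pyteomics, and the trypsin rule is deliberately simplified to [KR]; the rules in expasy_rules may contain additional exceptions.

```python
import re


def cleave(sequence, rule):
    """Split a sequence after every match of the regular expression rule."""
    cut_sites = [match.end() for match in re.finditer(rule, sequence)]
    bounds = [0] + cut_sites + [len(sequence)]
    # Pair consecutive boundaries and skip empty pieces, which occur when
    # the last match ends exactly at the end of the sequence.
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


trypsin_simple = r'[KR]'  # assumption: cut after every K or R
asp_n = r'\w(?=D)'        # any residue directly followed by D

sequence = 'MKAADEERGDK'  # made-up example sequence
print(cleave(sequence, trypsin_simple))  # ['MK', 'AADEER', 'GDK']
print(cleave(sequence, asp_n))           # ['MKAA', 'DEERG', 'DK']
```

For asp-n the lookahead (?=D) matches the residue in front of each aspartate without consuming the aspartate itself, so the cut is placed N-terminally of D and every resulting peptide except possibly the first starts with D.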

Besides defining the cleavage rule, it is possible to specify the number of allowed missed cleavage positions, whether protein N-terminal peptides should also be generated with the initial methionine removed, and whether leucine and isoleucine should be treated as indistinguishable when assigning peptide sequences to proteins.

proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
                                                  missedCleavage=2,
                                                  removeNtermM=True,
                                                  ignoreIsoleucine=True
                                                  )
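The effect of these three options can be illustrated with a small stand-alone digestion function. It is a sketch under stated assumptions, not MasPy's implementation: the names digest and il_key, the simplified trypsin rule [KR], and the choice to treat I and L as equal by replacing every I with L are all hypothetical.

```python
import re


def digest(sequence, rule=r'[KR]', missed_cleavage=0, remove_nterm_m=False,
           min_length=1, max_length=50):
    """Return peptides, allowing up to missed_cleavage skipped cut sites."""
    cut_sites = [0] + [match.end() for match in re.finditer(rule, sequence)]
    if cut_sites[-1] != len(sequence):
        cut_sites.append(len(sequence))

    peptides = []
    for i, start in enumerate(cut_sites[:-1]):
        # Each additional end position skips one more cleavage site.
        for end in cut_sites[i + 1:i + 2 + missed_cleavage]:
            candidates = [sequence[start:end]]
            # Also emit protein N-terminal peptides without the initial M.
            if remove_nterm_m and start == 0 and sequence.startswith('M'):
                candidates.append(sequence[1:end])
            for peptide in candidates:
                if min_length <= len(peptide) <= max_length:
                    peptides.append(peptide)
    return peptides


def il_key(peptide):
    """Lookup key under which isoleucine and leucine are indistinguishable."""
    return peptide.replace('I', 'L')


print(digest('MKAAKCCR', missed_cleavage=1, remove_nterm_m=True))
# ['MK', 'K', 'MKAAK', 'KAAK', 'AAK', 'AAKCCR', 'CCR']
print(il_key('AIK') == il_key('ALK'))  # True
```

Grouping peptides by such a key means that two peptides differing only in I versus L are assigned to the same set of proteins, which matches the fact that the two residues have identical masses.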

...

...

...

...

Deprecated

The fasta format has become the standard format for storing protein sequences, where each amino acid is represented in the single letter code. “A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. The word following the “>” symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence.” [https://en.wikipedia.org/wiki/FASTA_format]