Fasta files in MasPy¶
The FASTA file format¶
The fasta format is the standard file format for storing protein sequence
information in the field of proteomics. A fasta file can store anything from a
few up to tens of thousands of protein entries, for example when it contains
the collection of all known protein coding genes of an organism.
Each protein entry begins with a headerline, which is followed by lines
describing the amino acid sequence in the single letter code. A headerline
always starts with the greater than symbol >, which distinguishes it from
the sequence part. Headerlines can contain various amounts of information,
typically starting with a unique identifier of the entry followed for example
by the protein name and a protein description. However, there are no common
rules for either the content or the syntax of the headerline. As a
consequence each protein database might use its own fasta header format, which
has to be known to allow proper parsing of fasta files.
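The layout described above (a headerline starting with >, followed by one or more sequence lines) can be sketched with a few lines of plain Python. This is only an illustration of the file format and not the reader used by MasPy; the function name read_fasta and the two example entries are made up for this sketch.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open fasta text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new headerline closes the previous entry, if there is one.
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next headerline.
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


# Two made-up entries, the first with its sequence split over two lines.
example = io.StringIO(
    '>P1 example protein one\n'
    'MKTAYIAK\n'
    'QRQISFVK\n'
    '>P2 example protein two\n'
    'MSDNELR\n'
)
entries = list(read_fasta(example))
```

The header is returned as one unparsed string, because its content depends on the database that produced the file.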
MasPy provides functions to import proteins from a fasta file and perform an in
silico digestion thereof. For parsing of the fasta protein headers we rely on
the Pyteomics library, which provides automatic recognition and
parsing of the formats UniProtKB, UniRef, UniParc and UniMES (UniProt
Metagenomic and Environmental Sequences), described at uniprot.org. In addition we have added a
custom header parser for fasta files in the SGD format from yeastgenome.org, which can also serve as a template for
additional parser functions, see maspy.proteindb.fastaParseSgd().
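As a rough illustration of what a custom header parser can look like, the following sketch splits a header with a regular expression. The header layout, the function name parse_example_header and the keys of the returned dictionary are assumptions made for this example; the interface actually expected by MasPy should be taken from maspy.proteindb.fastaParseSgd().

```python
import re

# Hypothetical header layout, assumed for illustration only:
#   <identifier> <name> "<free text description>"
HEADER_PATTERN = re.compile(r'^(\S+)\s+(\S+)\s+"(.*)"\s*$')


def parse_example_header(header):
    """Split a headerline (without the leading '>') into named fields."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError('unrecognized header: %r' % header)
    identifier, name, description = match.groups()
    # The dictionary keys are an assumption of this sketch.
    return {'id': identifier, 'name': name, 'description': description}


info = parse_example_header('ID0001 GENE1 "example description"')
```

Raising an error for headers that do not match makes it obvious when a file uses a different format than the parser expects.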
Protein database in MasPy¶
Upon import, parsed protein entries are represented as
maspy.proteindb.ProteinSequence elements. They are stored in the
maspy.proteindb.ProteinDatabase class and can be accessed with their
unique ids in ProteinDatabase.proteins. Thereafter each protein sequence is
digested in silico by using the specified cleavage rules, yielding smaller
maspy.proteindb.PeptideSequence elements, which are stored in
ProteinDatabase.peptides and can be accessed with their amino acid sequence.
Their proteins attribute contains a set() of protein ids and references
all proteins that have generated the specific peptide sequence during digestion.
In addition the peptide positions within the protein sequence are stored in the
attribute proteinPositions. If the peptide was generated by only one
protein, its isUnique attribute is set to True. Each ProteinSequence
stores the peptides generated during its digestion in two lists, the
attributes uniquePeptides and sharedPeptides, which contain the unique
and the shared (not unique) peptides respectively. In addition it has an
isUnique attribute, which is set to True if the digestion of the protein
has generated at least one unique peptide.
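The relation between peptides and proteins described above can be illustrated with plain dictionaries and sets. This is a sketch of the bookkeeping only and does not use the MasPy classes; the protein ids, the peptide sequences and all variable names are made up for this example.

```python
from collections import defaultdict

# Toy digestion result: protein id -> peptides generated by that protein.
digest = {
    'protA': ['AAAK', 'CCCR'],
    'protB': ['CCCR', 'DDDK'],
    'protC': ['CCCR'],
}

# Map each peptide sequence to the set of protein ids that generated it.
peptide_to_proteins = defaultdict(set)
for protein_id, peptides in digest.items():
    for peptide in peptides:
        peptide_to_proteins[peptide].add(protein_id)

# A peptide is unique if exactly one protein generated it.
peptide_is_unique = {
    peptide: len(proteins) == 1
    for peptide, proteins in peptide_to_proteins.items()
}

# Split the peptides of each protein into unique and shared ones. A protein
# counts as unique if it has at least one unique peptide.
unique_peptides, shared_peptides, protein_is_unique = {}, {}, {}
for protein_id, peptides in digest.items():
    unique_peptides[protein_id] = [p for p in peptides if peptide_is_unique[p]]
    shared_peptides[protein_id] = [p for p in peptides if not peptide_is_unique[p]]
    protein_is_unique[protein_id] = bool(unique_peptides[protein_id])
```

In this toy example protC only generates a peptide that is shared with other proteins and is therefore not flagged as unique.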
Basic code examples¶
Importing a fasta file
The function maspy.proteindb.importProteinDatabase() is used to import
fasta files, perform an in silico digestion and return a ProteinDatabase
instance. The parameters minLength and maxLength specify the minimum and
maximum length of the peptides that are stored in the ProteinDatabase after
digestion. The default digestion rule is for a tryptic digestion, cutting
C-terminally of lysine and arginine.
import maspy.proteindb
fastaPath = 'filelocation/human.fasta'
proteindb = maspy.proteindb.importProteinDatabase(
    fastaPath, minLength=7, maxLength=50
)
By passing the headerParser argument it is possible to specify an
alternative function which is called to interpret the fasta header line.
parser = maspy.proteindb.fastaParseSgd
proteindb = maspy.proteindb.importProteinDatabase(
    fastaPath, headerParser=parser
)
If a fasta file contains decoy proteins or contamination proteins that are
marked by a specific tag at the start of their unique ids, these tags should be
specified with the arguments decoyTag and contaminationTag respectively to
allow proper parsing of the fasta headers.
proteindb = maspy.proteindb.importProteinDatabase(
    fastaPath, decoyTag='[decoy]', contaminationTag='[cont]'
)
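The idea of marking entries with a tag at the start of the id can be sketched as follows. How MasPy handles the tags internally is not shown here; the function classify_id, its return value and the order in which the tags are checked are assumptions of this example.

```python
def classify_id(protein_id, decoy_tag='[decoy]', contamination_tag='[cont]'):
    """Return (bare id, is_decoy, is_contaminant) for a tagged protein id."""
    # Assumption of this sketch: a decoy tag, if present, comes first.
    is_decoy = protein_id.startswith(decoy_tag)
    if is_decoy:
        protein_id = protein_id[len(decoy_tag):]
    is_contaminant = protein_id.startswith(contamination_tag)
    if is_contaminant:
        protein_id = protein_id[len(contamination_tag):]
    return protein_id, is_decoy, is_contaminant


# Made-up ids used only to demonstrate the function.
results = [classify_id(i) for i in ('P1', '[decoy]P1', '[cont]P2', '[decoy][cont]P2')]
```

Removing the tag before further processing leaves the bare id, so that a header parser sees the same text for tagged and untagged entries.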
Specifying parameters for in silico digestion
The protein cleavage rule can be changed by passing a regular expression with
the argument cleavageRule. The dictionary
maspy.constants.expasy_rules contains cleavage rules of the most popular
proteolytic enzymes. The concept for finding cleavage positions with regular
expressions was adapted from the Python library Pyteomics and the
expasy_rules collection of cleavage rules was copied from
pyteomics.parser.expasy_rules. The asp-n cleavage rule provides an
example of an enzyme that cuts N-terminally of a specific amino acid.
>>> import maspy.constants
>>> aspN = maspy.constants.expasy_rules['asp-n']
>>> proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
...                                                   cleavageRule=aspN
...                                                   )
>>> aspN
u'\\w(?=D)'
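How a regular expression defines cleavage positions can be illustrated without MasPy: the end of every match is taken as a position at which the sequence is cut. The function cleave and the example sequence are made up for this sketch, and the rule '[KR]' is a simplified tryptic rule assumed for illustration; the asp-n rule is the one shown above.

```python
import re


def cleave(sequence, rule):
    """Split a sequence after every match of the cleavage rule."""
    # The end of each regex match marks a cleavage position.
    positions = [match.end() for match in re.finditer(rule, sequence)]
    bounds = [0] + positions + [len(sequence)]
    # Skip empty pieces, e.g. when the sequence ends with a cleavage site.
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


tryptic = cleave('MKAADRLLDK', r'[KR]')     # cuts after K and R
asp_n = cleave('MKAADRLLDK', r'\w(?=D)')    # cuts before D
print(tryptic)  # ['MK', 'AADR', 'LLDK']
print(asp_n)    # ['MKAA', 'DRLL', 'DK']
```

The lookahead (?=D) in the asp-n rule matches the residue preceding an aspartate without consuming the aspartate itself, which is why the cut falls on the N-terminal side of D.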
Besides defining the cleavage rule, it is possible to specify the number of allowed missed cleavage positions, whether protein N-terminal peptides should also be generated with the initial methionine removed, and whether leucine and isoleucine should be treated as indistinguishable when assigning peptide sequences to proteins.
proteindb = maspy.proteindb.importProteinDatabase(
    fastaPath,
    missedCleavage=2,
    removeNtermM=True,
    ignoreIsoleucine=True
)
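The effect of these three options can be sketched in plain Python. This is not the MasPy implementation; the function names digest and il_key, the simplified rule '[KR]' and the choice to replace isoleucine by leucine are assumptions of this example.

```python
import re


def digest(sequence, rule=r'[KR]', missed_cleavage=0, remove_nterm_m=False):
    """Return the set of peptides, allowing up to missed_cleavage skipped sites."""
    sites = [match.end() for match in re.finditer(rule, sequence)]
    bounds = sorted(set([0] + sites + [len(sequence)]))
    peptides = set()
    for i, start in enumerate(bounds[:-1]):
        # Joining k + 1 neighbouring fragments corresponds to k missed cleavages.
        for end in bounds[i + 1:i + 2 + missed_cleavage]:
            peptides.add(sequence[start:end])
            # Optionally also emit the N-terminal peptide without its initial M.
            if remove_nterm_m and start == 0 and sequence.startswith('M') and end > 1:
                peptides.add(sequence[1:end])
    return peptides


def il_key(peptide):
    """Collapse I and L so that both residues compare as equal."""
    return peptide.replace('I', 'L')


strict = digest('MAKLIRGGK')
relaxed = digest('MAKLIRGGK', missed_cleavage=1, remove_nterm_m=True)
print(sorted(strict))   # ['GGK', 'LIR', 'MAK']
print(sorted(relaxed))  # ['AK', 'AKLIR', 'GGK', 'LIR', 'LIRGGK', 'MAK', 'MAKLIR']
```

Using il_key() as the lookup key for peptides would assign 'LIR' and 'ILR' to the same entry, which mirrors the fact that leucine and isoleucine have the same mass.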
...
...
...
...
Deprecated¶
The fasta format has become the standard format for storing protein sequences, where each amino acid is represented in the single letter code. “A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. The word following the “>” symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence.” [https://en.wikipedia.org/wiki/FASTA_format]