Fasta files in MasPy
The FASTA file format
The fasta format is the standard file format for storing protein sequence
information in the field of proteomics. A fasta file can store anything from a
few up to tens of thousands of protein entries, for example when it contains the
collection of all known protein-coding genes of an organism.
Each protein entry begins with a header line, which is followed by lines
describing the amino acid sequence in the single letter code. A header line
always starts with the greater-than symbol >, which distinguishes it from
the sequence part. Header lines can contain various amounts of information,
typically starting with a unique identifier of the entry, followed for example
by the protein name and a protein description. However, there are no common
rules for the content or the syntax of the header line. As a
consequence each protein database might use its own fasta header format, which
has to be known to allow proper parsing of fasta files.
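The entry structure described above can be illustrated with a minimal reader. This is only a sketch in plain Python, not the parser used by MasPy or Pyteomics; the function name read_fasta and the two example entries are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header line closes the previous entry, if there is one.
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header line.
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


# Hypothetical example content, not taken from a real database.
example = """>P1 first example protein
MKWVTFISLL
FLFSSAYSR
>P2 second example protein
MADEEK
""".splitlines()

entries = list(read_fasta(example))
```

Note that the reader returns the header as an uninterpreted string. Splitting it into an id, a name and a description is the database-specific step that requires a header parser.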
MasPy provides functions to import proteins from a fasta file and to perform an
in silico digestion thereof. For parsing of the fasta protein headers we rely on
the Pyteomics library, which provides automatic recognition and
parsing of the formats UniProtKB, UniRef, UniParc and UniMES (UniProt
Metagenomic and Environmental Sequences), described at uniprot.org. In addition
we have added a custom header parser for fasta files in the SGD format from
yeastgenome.org, which can also serve as a template for additional parser
functions, see maspy.proteindb.fastaParseSgd().
Protein database in MasPy
Upon import, parsed protein entries are represented as
maspy.proteindb.ProteinSequence elements. They are stored in the
maspy.proteindb.ProteinDatabase class and can be accessed with their
unique ids in ProteinDatabase.proteins. Thereafter each protein sequence is
digested in silico by using the specified cleavage rules, yielding smaller
maspy.proteindb.PeptideSequence elements, which are stored in
ProteinDatabase.peptides and can be accessed with their amino acid sequence.
The proteins attribute of a PeptideSequence contains a set() of protein ids and
references all proteins that have generated the specific peptide sequence during
digestion. In addition the peptide positions within the protein sequence are
stored in the attribute proteinPositions. If the peptide was generated by only
one protein, its isUnique attribute is set to True. Each ProteinSequence lists
the unique and the shared (not unique) peptides that were generated during its
digestion in the attributes uniquePeptides and sharedPeptides respectively. In
addition it also has an isUnique attribute, which is set to True if the
digestion has generated at least one unique peptide for that protein.
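The uniqueness rules described above can be sketched with plain dictionaries. This is not MasPy code: the protein ids, peptide sequences and variable names are hypothetical and only illustrate the logic behind the proteins and isUnique attributes.

```python
from collections import defaultdict

# Hypothetical digestion result: protein id -> peptides it generated.
digested = {
    'protA': ['PEPTIDEK', 'SHAREDPEPK'],
    'protB': ['SHAREDPEPK'],
}

# Peptide sequence -> set of ids of all proteins that generated it.
peptide_to_proteins = defaultdict(set)
for protein_id, peptides in digested.items():
    for peptide in peptides:
        peptide_to_proteins[peptide].add(protein_id)

# A peptide is unique if exactly one protein generated it.
peptide_is_unique = {
    peptide: len(ids) == 1 for peptide, ids in peptide_to_proteins.items()
}

# A protein is unique if it generated at least one unique peptide.
protein_is_unique = {
    protein_id: any(peptide_is_unique[peptide] for peptide in peptides)
    for protein_id, peptides in digested.items()
}
```

In this example protA is unique because of PEPTIDEK, while protB only generated a peptide that it shares with protA and is therefore not unique.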
Basic code examples
Importing a fasta file
The function maspy.proteindb.importProteinDatabase() is used to import
fasta files, perform an in silico digestion and return a ProteinDatabase
instance. The parameters minLength and maxLength specify the minimal and
maximal length of the peptides that are stored in the ProteinDatabase
after digestion. The default digestion rule is for a tryptic digestion,
cutting C-terminally to lysine and arginine.
import maspy.proteindb
fastaPath = 'filelocation/human.fasta'
proteindb = maspy.proteindb.importProteinDatabase(fastaPath, minLength=7,
                                                  maxLength=50
                                                  )
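To make the effect of the length limits more concrete, the following sketch digests a sequence with a simplified tryptic rule and then filters the peptides by length. It is an assumption-laden illustration, not the MasPy implementation: the function digest, its parameter names and the bare [KR] rule are hypothetical, and the actual tryptic rule may treat special cases such as a following proline differently.

```python
import re


def digest(sequence, cleavage_rule=r'[KR]', min_length=1, max_length=50):
    """Cleave after every regex match and keep peptides within the limits."""
    # The end of each match is used as a cleavage position.
    cut_sites = [match.end() for match in re.finditer(cleavage_rule, sequence)]
    boundaries = [0] + cut_sites
    if boundaries[-1] != len(sequence):
        # Keep the C-terminal remainder if the sequence does not end in K or R.
        boundaries.append(len(sequence))
    peptides = [sequence[a:b] for a, b in zip(boundaries, boundaries[1:])]
    return [p for p in peptides if min_length <= len(p) <= max_length]


# Made-up example sequence; 'MK' is discarded because it is shorter than 3.
peptides = digest('MKWVTFISLLRAAK', min_length=3)
```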
By passing the headerParser argument it is possible to specify an
alternative function which is called to interpret the fasta header line.
parser = maspy.proteindb.fastaParseSgd
proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
                                                  headerParser=parser
                                                  )
If a fasta file contains decoy proteins or contaminant proteins that are
marked by a certain tag at the start of their unique ids, these tags should be
specified with the arguments decoyTag and contaminationTag respectively, to
allow proper parsing of the fasta headers.
proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
                                                  decoyTag='[decoy]',
                                                  contaminationTag='[cont]'
                                                  )
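The reason why the tags have to be known can be sketched as follows: a tag in front of the id changes the start of the header, so a format-specific header parser may no longer recognize it unless the tag is split off first. The function split_tag and the example header are hypothetical and do not necessarily reflect how MasPy handles tags internally.

```python
def split_tag(header, tags=('[decoy]', '[cont]')):
    """Return (tag, remaining header); tag is '' if no known tag is present."""
    for tag in tags:
        if header.startswith(tag):
            return tag, header[len(tag):]
    return '', header


# Hypothetical tagged header in a UniProt-like layout.
tag, rest = split_tag('[decoy]sp|P12345|EXAMPLE_HUMAN Example protein')
```

After splitting, the remaining header can be passed to the regular header parser and the tag can be used to flag the entry as a decoy or contaminant.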
Specifying parameters for in silico digestion
The protein cleavage rule can be changed by passing a regular expression with
the argument cleavageRule. The dictionary maspy.constants.expasy_rules
contains cleavage rules of the most popular proteolytic enzymes. The concept
of finding cleavage positions with regular expressions was adapted from the
Python library Pyteomics, and the expasy_rules collection of cleavage rules was
copied from pyteomics.parser.expasy_rules. The asp-n cleavage rule provides an
example of an enzyme that cuts N-terminally of a specific amino acid.
>>> import maspy.constants
>>> aspN = maspy.constants.expasy_rules['asp-n']
>>> proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
...                                                   cleavageRule=aspN
...                                                   )
>>> aspN
u'\\w(?=D)'
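How such a regular expression translates into cleavage positions can be shown with the standard re module. The sketch below assumes that the end of each match is used as the cut position, as in the concept adapted from Pyteomics; the example sequence is made up.

```python
import re

# Matches any residue that is directly followed by an aspartate (D). The
# lookahead (?=D) does not consume the D, so each match ends right before it.
asp_n = r'\w(?=D)'
sequence = 'MADKLDEAR'

# The end of each match is taken as a cleavage position.
cut_sites = [match.end() for match in re.finditer(asp_n, sequence)]

# Split the sequence at the cleavage positions.
boundaries = [0] + cut_sites + [len(sequence)]
fragments = [sequence[a:b] for a, b in zip(boundaries, boundaries[1:])]
```

Every fragment after the first one starts with D, which corresponds to a cleavage on the N-terminal side of aspartate.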
Besides defining the cleavage rule it is possible to specify the number of allowed missed cleavage positions, whether protein N-terminal peptides should also be generated with the initial methionine removed, and whether leucine and isoleucine should be treated as indistinguishable when assigning peptide sequences to proteins.
proteindb = maspy.proteindb.importProteinDatabase(fastaPath,
                                                  missedCleavage=2,
                                                  removeNtermM=True,
                                                  ignoreIsoleucine=True
                                                  )
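The three options can be illustrated with a small sketch that works on a list of fully cleaved fragments. It is not the MasPy implementation: the functions combine_fragments and il_key are hypothetical, and replacing I by L (rather than the other way round) is an arbitrary choice made for this example.

```python
def combine_fragments(fragments, missed_cleavage=0, remove_nterm_m=False):
    """Join up to missed_cleavage + 1 consecutive fragments into peptides."""
    peptides = []
    for start in range(len(fragments)):
        last_stop = min(start + missed_cleavage + 1, len(fragments))
        for stop in range(start + 1, last_stop + 1):
            peptide = ''.join(fragments[start:stop])
            peptides.append(peptide)
            # Peptides that begin with the first fragment are protein
            # N-terminal; optionally add a variant without the initial M.
            if remove_nterm_m and start == 0 and peptide.startswith('M'):
                peptides.append(peptide[1:])
    return peptides


def il_key(peptide):
    """Return a lookup key in which I and L are indistinguishable."""
    return peptide.replace('I', 'L')


# Fully cleaved tryptic fragments of a made-up protein sequence.
fragments = ['MK', 'WVTFISLLR', 'AAK']
peptides = combine_fragments(fragments, missed_cleavage=1, remove_nterm_m=True)
key = il_key('WVTFISLLR')
```

With one allowed missed cleavage, each fragment is additionally joined with its direct successor, and the two N-terminal peptides each get a variant without the leading methionine.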
...
...
...
...
Deprecated
The fasta format has become the standard format for storing protein sequences, where each amino acid is represented in the single letter code. “A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. The word following the “>” symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence.” [https://en.wikipedia.org/wiki/FASTA_format]