MS spectra in MasPy

The mzML file format

Every vendor software produces mass spectrometer output files in a different proprietary format. It is a difficult and time consuming task for software developers to support all of these different formats and format versions. Therefore the file format mzML has been developed by the Proteomics Standards Initiative (PSI) as the community standard for representation of mass spectrometry results. mzML is an open, XML- based format that not only allows to store recorded mass spectrum information but also metadata of the instrument configuration, acquisition settings, software used for data processing and sample descriptions. Ultimately, it is desirable to universally use mzML for archiving, sharing, and processing of mass spectrometry data and thus for all software to support and use the mzML format.

Note

Refer to www.psidev.info for details on the XML schema definition and mzML file specifications, see also the publication Mass Spectrometer Output File Format mzML)

Note

We recommend using ProteoWizard for conversion of vendor format files to mzML. The software can be downloaded from their website, a detailed protocol how to use ProteoWizard can be found here.

The raw spectral data recorded by an instrument can be either stored as profile or centroid data. Meassured mass spectra are initially recorded in profile mode, where each mass peak is represented by a number of m/z and intensity values describing a peak shape. In centroid mode this information is reduced to the centroid of the peak shape, storing only one single pair of a dinstinct m/z value and an intensity. The process of converting profile data to centroid data is called peak picking and can be applied as a filter while converting vendor format files to mzML files using ProteoWizard, see the ProteoWizard protocol. The representation as centroid data is easier to work with, saves memory and is sufficient for most applications. Therefore we recommend the utilization of centroid data for MasPy.

MsrunContainer

Modern mass spectrometers can generate tens of thousands of spectra per hour resulting in huge mzML files. Opening and parsing such large XML files takes a lot of time. MzML files can contain a byte-offset index which allows directly reading certain spectra without parsing the whole file. This can increase performance when only one or a few specific spectra have to be accessed at a time.

The actual spectral information takes up to largest part of a typical mzML file. However, sometimes only a certain type of information needs to be accessed, for example the spectrum metadata. Therefore we split the information that is contained in mzML files into four data groups; run metadata (Rm), spectrum metadata items (Smi), spectrum array items (Sai) and chromatogram items (Ci). Each of these data groups is stored separately in MasPy and has its own file type, thus it can be accessed, saved and loaded independently of the others. All four data types are stored in the MasPy class MsrunContainer. Altough the data is split into multiple parts, all information originally contained in an mzML file is still present. This allows the conversion from MsrunContainer to mzML at any given time. #TODO: Why do we want to be able to export mzML files? (Preffered data format for archiving and sharing data and to use as input for other software packages)

See tutorial/docstrings xxx for details on the MsrunContainer file format. #TODO:

Fig.: MsrunContainer #TODO: make figure

  • run metadata
  • spectrum metadata items
  • spectrum array items
  • chromatogram items
  • spectrum items

Run metadata (Rm)

The run metadata element contains all information of an mzML file, which is not directly part of the acquired spectra and chromatograms. This covers, amongst others, a description of the instrument configuration, a list of software used for data processing and a list of applied data processing steps. In addition it is possible to add contact information and a description of the analyzed samples to the mzML file. In MasPy all of these mzML elements are converted to an lxml.etree.Element and stored in MsrunContainer.rmc (Rm container).

Note

Software which is used to process data of an mzML file should be listed in the mzML element “softwareList”, and all applied data processing steps should be documented in the “dataProcessingList” element.

Spectrum array item (Sai), spectrum metadata item (Smi)

An mzML spectrum element contains all information of an acquired MS spectrum, including numerical arrays containing at least recorded m/z and intensity values of the observed ions but also plenty of metadata describing for example details of the acquisition like base peak m/z and intensity, scan start time, ms level, MS2 isolation window or precursor information of MS2 scans. In MasPy this information is split into a metadata containing part and the spectrum array data and put into two separate data structures; spectrum metadata item (Smi) and spectrum array item (Sai), respectively. Smi elements are stored in MsrunContainer.smic (Smi container) and Sai elements in MsrunContainer.saic (Sai container). In order to recreate an mzML spectrum element the information of both MasPy data types (Smi and Sai) is necessary.

Chromatogram item (Ci)

An mzML chromatogram element is similar to a spectrum element, containing metadata and numerical arrays. Common chromatogram types are total ion current chromatogram, selected ion current chromatogram and basepeak chromatogram. All of them contain time and intensity data points, however, other chromatogram types can also contain absorption or emission values instead of intensities. In the current MasPy implementation chromatogram elements are not split into two data types but the metadata and array information is put into one single data structure called chromatogram item (Ci), which is stored in MsrunContainer.cic (Ci container).

Spectrum item (Si)

The mzML file serves as a data container for active data processing but also for data sharing and archiving. Thus the spectrum elements contain a lot of metadata information not needed for most data analysis applications. In addition all information stored in spectrum elements have to be in accordance with the mzML xml scheme definition and the Controlled Vocabularies (CV’s) of the PSI, see. Altough in principle this standardization is beneficial and perfectly reasonable, when actively working with the data it is not always required and can make things unnecessarily complicated.

To circumvent this problem MasPy provides a simpler data type for working with spectrum metadata, called spectrum item (Si). The Si class has a flat structure, meaning that attributes are not nested inside other elements but are stored directly as attributes of the class. Si attributes can be manipulated without restrictions and new attributes can simply be added. Specific functions can be used to selectively extract information from Smi. This allows import only the currently needed spectrum metadata attributes, like retention time, ms level or MS2 precursor information, thereby making the Si more memory efficient. In order to make lasting changes to the mzML file Si attributes have to be translated to the respective Smi elements. These changes however have to strictly follow the mzML specifications and syntax. Thus it is recommend to use existing functions or implement new ones that make changes to Smi elements in a controlled manner.

Each spectrum present in an mzML file is therefore represented threefold in MasPy. First the Smi contains a complete representation of all metadata information present in an mzML spectrum element. However, this data type is not intended to be used for standard data analysis and will normally only be accessed to make lasting, documented changes to spectrum metadata and for generating new mzML files. Second the Sai contains the actual ion information recorded by the mass spectrometer. This data type will be used whenever the ion spectra have to be analyzed or manipulated. In addition it is also required for generating new mzML files. And third the Si, which can be considered as the spectrum metadata workspace in MasPy, allowing convenient access to metadata and simple processing of this data without directly altering the original mzML information. This data type will be used for most data processing and analysis steps in MasPy.

MsrunContainer.info

MsrunContainer.info -> which specfiles are present, what is the current path (used for loading or saving) , which data types are currently imported

MasPy file formats

This section will contain information about how the data contained in an MsrunContainer is written to the hard drive. (one file type per data type: mrc_rm, mrc_si, mrc_sai, mrc_smi, mrc_ci)

Basic code examples

Importing an mzML file

mzML files can be imported by using the function maspy.reader.importMzml(), the imported specfile is then added to the MsrunContainer instance passed to the function.

import maspy.core
import maspy.reader

mzmlfilepath = 'filedirectory/specfile_name_1.mzML'
msrunContainer = maspy.core.MsrunContainer()
maspy.reader.importMzml(mzmlfilepath, msrunContainer)

Saving an MsrunContainer to the hard disk

An MsrunContainer can be saved to the hard disk by calling its .save() method.

msrunContainer.save()

By default all files are saved into the folder specified in .info. This can be altered by changing the path variable in .info or temporarely by passing the “path” parameter to .save().

msrunContainer.save(path='../an_alternative_location')

In addition, multiple parameters can be set to specify which part of the data should be written to the hard disk. The keywords “rm”, “ci”, “smi”, “sai” and “si” can be set to True or False and specify which container types are selected for saving. By default all of them are set to False which is however interpreted as selecting all of them. Setting at least one to True changes this behaviour and only the specified ones are selected. If multiple specfiles are present in an MsrunContainer it is possible to only select a subset for saving by passing the “specfiles” argument to .save(). The value of “specfiles” can either be the name of one single specfile or a list of specfile names. In the following example only the spectrum array item container (saic) and the spectrum metadata item container (smic) of the specfiles “specfile_name_1” and “specfile_name_3” are saved.

msrunContainer.save(specfiles=["specfile_name_1", "specfile_name_3"],
                    sai=True, smi=True
                    )

Loading an MsrunContainer from the hard disk

Before loading an MsrunContainer from the hard disk, a specfile entry has to be added to its .info attribute. This can be done by calling .addSpecfile() with the name of the specfile and the path to the filedirectory. Afterwards the files can be loaded by calling .load(), which will import all specfiles present in .info and update the status variable of .info.

>>> msrunContainer = maspy.core.MsrunContainer()
>>> msrunContainer.addSpecfile('specfile_name_1', 'filedirectory')
>>> msrunContainer.info
{u'specfile_name_1': {u'path': u'filedirectory',
                      u'status': {u'ci': False,
                                  u'rm': False,
                                  u'sai': False,
                                  u'si': False,
                                  u'smi': False}}}
>>> msrunContainer.load()
>>> msrunContainer.info
{u'specfile_name_1': {u'path': u'filedirectory',
                      u'status': {u'ci': True,
                                  u'rm': True,
                                  u'sai': True,
                                  u'si': True,
                                  u'smi': True}}}

Similar to saving only parts of an MsrunContainer it is also possible to only select a subset of specfiles present in .info and specify which data types are imported.

>>> msrunContainer = maspy.core.MsrunContainer()
>>> msrunContainer.addSpecfile('specfile_name_1', 'filedirectory')
>>> msrunContainer.info
{u'specfile_name_1': {u'path': u'filedirectory',
                      u'status': {u'ci': False,
                                  u'rm': False,
                                  u'sai': False,
                                  u'si': False,
                                  u'smi': False}}}
>>> msrunContainer.load(specfiles='specfile_name_1', sai=True, smi=True)
>>> msrunContainer.info
{u'specfile_name_1': {u'path': u'filedirectory',
                      u'status': {u'ci': False,
                                  u'rm': False,
                                  u'sai': True,
                                  u'si': False,
                                  u'smi': True}}}

Deleting data from an MsrunContainer

If specific data types are not needed anymore, they can be removed to free memory. This can be done by using .removeData() and parsing arguments to specify specfiles and which data types to remove. It is recommended to always use this method to remove data instead of manually deleting container entries, because using .removeData automatically updates the .info attribute of the MsrunContainer. The following command removes the Sai and Smi items of the specfile “specfile_name_1”.

>>> msrunContainer.info
{u'specfile_name_1': {u'path': u'filedirectory',
                      u'status': {u'ci': True,
                                  u'rm': True,
                                  u'sai': True,
                                  u'si': True,
                                  u'smi': True}}}
>>> msrunContainer.removeData('specfile_name_1', sai=True, smi=True)
>>> msrunContainer.info
{u'specfile_name_1': {u'path': u'filedirectory',
                      u'status': {u'ci': True,
                                  u'rm': True,
                                  u'sai': False,
                                  u'si': True,
                                  u'smi': False}}}

A specfile can be completely removed from an MsrunContainer by calling .removeSpecfile(), which deletes all data from the containers and in addition the entry from the .info attribute.

msrunContainer.removeSpecfile('specfile_name_1')

Exporting specfiles from MsrunContainer to mzML files

After working in MasPy it might be desirable to export the MsrunContainer back into an mzML file which can be used as input for another software or simply for archiving and sharing mass spectrometry data. An mzML file is generated by using the function maspy.writer.writeMzml() and passing at least the specfile name that should be exported, an MsrunContainer and the output directory. In order to write a valid and complete mzML file all data types except for Si have to be present in the MsrunContainer.

import maspy.writer
maspy.writer.writeMzml('specfile_name_1', msrunContainer, '/filedirectory')

Note

Optionally it is possible to supply a list of spectrumIds and chromatogramIds to only select a subset of spectra and chromatograms that should be written to the mzML file. The supplied lists of element ids have to be sorted in the order they should be written to the mzML file.

Accessing data from an MsrunContainer

#TODO: *examples of .getItem, .getArrays, ... *