# What is PI?

PI is a suite of tools for the analysis of tandem mass spectra. It aims to meet the need for in-depth analysis of tandem mass spectra, including fragmentation rules, cleavage preferences, and neutral losses. We believe that statistics plays an important role in mass spectrum analysis.

There are four modules in PI, which are listed as follows:

## A novel scoring function

An effective scoring function for evaluating matches between an experimental spectrum and a candidate peptide is key to interpreting a mass spectrum. Most scoring functions, such as those of Sequest and Sonar (http://65.219.84.5/ProteinId.html), are based on shared peaks. An alternative is to evaluate the probability of recognizing a set of fragments in a protein database, as implemented in MOWSE (http://www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/mowsedoc.html), Mascot, and ProbID. Characterizing ion types and their probabilities, Dancik et al. proposed a likelihood-based approach, which was generalized in SCOPE to incorporate more prior knowledge. An extension of Dancik's scoring approach into an intensity-based statistical scorer incorporated a variety of experimental observations and prior knowledge on peptide fragmentation. ProbID, a method based on a probabilistic model, adopted a Bayesian approach to interpret mass spectral data.

Random matches between experimental and theoretical masses may produce false positives, which raises another key problem in peptide identification: the criteria for evaluating the reliability of a match. The difference between the highest and second-highest scores and estimates are used to filter false positives. Jan Eriksson and Keller built a model to work out the distribution of scores from random matches, which allows significance testing under general database searching constraints. Filtering criteria for distinguishing a valid match from all matches should therefore rely on quantitative estimates rather than on experience.

We introduce a new, effective probabilistic scoring function. Adopting a statistical model similar to that of Dancik et al., we employ relative entropy (i.e., the Kullback-Leibler distance) to measure the similarity between hypothetical and experimentally observed spectra. We give a brief proof that relative entropy is indeed a simplified form of the conditional probability that the spectrum is generated from the peptide.
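As a minimal sketch of the relative entropy measure mentioned above, the following function computes the Kullback-Leibler distance between two discrete distributions. How a spectrum is turned into a distribution (for example, normalized intensities over shared m/z bins) is an assumption made here for illustration; the function name `relative_entropy` and the `eps` floor are hypothetical and are not taken from the PI package.

```python
import math


def relative_entropy(observed, expected, eps=1e-12):
    """Return D(observed || expected) in nats.

    Both arguments are sequences of non-negative weights over the same
    bins (an assumed representation of a spectrum). They are normalized
    to sum to 1 before the comparison.
    """
    if len(observed) != len(expected):
        raise ValueError("both spectra must use the same bins")
    p_total = sum(observed)
    q_total = sum(expected)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("each spectrum needs positive total weight")

    distance = 0.0
    for p_raw, q_raw in zip(observed, expected):
        p = p_raw / p_total
        # Floor the expected probability so that a peak predicted with
        # zero weight gives a large but finite penalty (an assumption).
        q = max(q_raw / q_total, eps)
        if p > 0:  # bins with p == 0 contribute nothing
            distance += p * math.log(p / q)
    return distance
```

A smaller value means the observed spectrum is closer to the hypothetical one; identical distributions give 0.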

Moreover, we present an EVD-based criterion to distinguish valid matches from random ones. Each spectrum takes the best score among its correlations with all candidate peptides. These best scores follow the extreme value distribution (EVD), which provides a quantitative threshold for significance testing.
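The idea of an EVD-based threshold can be sketched as follows. This example assumes the type I (Gumbel) form of the extreme value distribution and fits it by the method of moments to a set of best scores taken to represent random matches; both choices, and the names `fit_gumbel` and `p_value`, are illustrative assumptions rather than the procedure used in PI.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(best_scores):
    """Estimate Gumbel location and scale by the method of moments.

    For a Gumbel distribution, mean = loc + gamma * scale and
    variance = (pi * scale) ** 2 / 6.
    """
    mean = statistics.fmean(best_scores)
    sd = statistics.stdev(best_scores)  # needs at least two scores
    scale = sd * math.sqrt(6) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale


def p_value(score, loc, scale):
    """Probability that a random best score is at least `score`."""
    z = (score - loc) / scale
    if z < -700:  # avoid overflow in exp; the tail probability is ~1
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 for accuracy in the tail
    return -math.expm1(-math.exp(-z))
```

A match would then be accepted when its p-value falls below a chosen significance level, for example `p_value(score, loc, scale) < 0.01`.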

## Quantifying the factors influencing global fragmentation

Apart from the promising chemical kinetic model that simulates the fragmentation process, several studies have developed statistical models for predicting theoretical spectra. Dancik et al. introduced an automatic tool to learn ion-type tendencies and intensity thresholds from experimental spectra. J.R. Yates III et al. attempted to identify statistical trends in spectrum peak intensities and put them into chemical context. F. P. Roth applied a probabilistic decision tree approach to distinguish the important factors from a total of 63 peptide and fragmentation attributes. Another interesting method for determining the factors that influence fragmentation is a linear model proposed by F. Schutz. In this method, a linear model is fitted to the spectra, reflecting the influence of specific amino acids and their positions in the peptide. The linear model can also predict theoretical spectra accurately.

The linear model has some limitations. In this model, the cleavage preference at a peptide bond is represented as the sum of the influence of the C-terminal residue and that of the N-terminal residue. This assumption is restrictive, since it implies that cleavage at Xaa-Pro is enhanced relative to any Xaa-Yaa bond regardless of what Xaa is, which is inconsistent with the observation that cleavage at Xaa-Pro is hindered when Xaa is Gly or Pro. Hence, it is more reasonable to model cleavage preference per bond rather than as a sum of residue influences. In this paper, we present a novel model to overcome these difficulties.
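The limitation of the additive assumption can be illustrated with a small sketch. All numeric values below are hypothetical, chosen only to show the structure of the two models; they are not measured cleavage preferences, and the function names are illustrative.

```python
# Hypothetical residue effects for an additive model (not measured values).
N_SIDE = {"A": 0.0, "G": -0.25, "P": -0.25}  # residue N-terminal to the bond
C_SIDE = {"A": 0.0, "G": 0.0, "P": 1.0}      # residue C-terminal to the bond


def additive_preference(xaa, yaa):
    """Additive model: preference for the Xaa-Yaa bond is a sum of two effects."""
    return N_SIDE[xaa] + C_SIDE[yaa]


# Hypothetical per-bond table: Ala-Pro enhanced, Gly-Pro and Pro-Pro hindered.
BOND = {("A", "P"): 1.0, ("G", "P"): -0.5, ("P", "P"): -0.5}


def bond_preference(xaa, yaa, default=0.0):
    """Bond-wise model: each (Xaa, Yaa) pair has its own preference."""
    return BOND.get((xaa, yaa), default)


# In the additive model, replacing Yaa = Ala by Yaa = Pro changes the
# preference by the same amount for every Xaa, so Pro cannot enhance
# cleavage after Ala while hindering it after Gly.
for xaa in ("A", "G", "P"):
    gain = additive_preference(xaa, "P") - additive_preference(xaa, "A")
    print(xaa, "additive gain from Pro:", gain,
          "| bond-wise gain:", bond_preference(xaa, "P") - bond_preference(xaa, "A"))
```

The bond-wise table needs one parameter per residue pair instead of two per residue, which is the price of being able to express such interactions.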

## Deriving probabilities of neutral losses

In these studies, however, little attention was paid to deriving the probabilities of neutral losses for each amino acid or to predicting the intensities of ions with neutral losses. To date, widely used algorithms such as Sequest and Mascot adopt a simple fragmentation model to predict theoretical spectra, which assumes that cleavage occurs uniformly at peptide bonds, regardless of other influencing factors such as the position of amino acids, the type of bond, and especially neutral losses. Since ions often undergo neutral losses, which depend heavily on peptide composition, it is necessary to derive the probability of neutral losses for each amino acid. This paper addresses the problem of learning neutral loss probabilities and of incorporating them into a statistical model to predict theoretical spectra.
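For reference, the simple uniform fragmentation model described above can be sketched in a few lines. This is a generic illustration that assumes singly charged b and y ions only and standard monoisotopic masses; it is not the code of Sequest, Mascot, or PI, and the function name is hypothetical. Every peptide bond yields one b ion and one y ion with the same intensity, and no neutral-loss peaks are produced.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def uniform_theoretical_spectrum(peptide):
    """Return (label, m/z, intensity) for singly charged b and y ions.

    Every bond is assumed to cleave equally, so all intensities are 1.0.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    peaks = []
    for i in range(1, n):  # one b/y pair per peptide bond
        b_mz = sum(masses[:i]) + PROTON
        y_mz = sum(masses[n - i:]) + WATER + PROTON
        peaks.append((f"b{i}", b_mz, 1.0))
        peaks.append((f"y{i}", y_mz, 1.0))
    return peaks
```

A model that accounts for neutral losses would add peaks shifted by the mass of the lost molecule (for example, water) and weight them by residue-dependent probabilities, which is the quantity the text proposes to learn.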

## Predicting spectra from sequence accurately

We used this model to predict theoretical spectra for a test set and compared them with the experimental ones. Intensities were estimated for ions with neutral losses, so a more complete theoretical spectrum can be predicted for a peptide sequence. Experimental results show that this model predicts a more 'realistic' spectrum.

## A self-adaptive statistical model for peptide identification by mass spectrometry

Most peptide identification techniques are essentially classifiers with pre-defined parameters acquired from a training spectra dataset. However, spectra usually display different fragmentation rules under different types of spectrometers (e.g., ion trap, MALDI-qTOF) or experimental settings (e.g., collision energy level); thus, the parameters optimized on a specific training dataset might not be applicable to other spectra datasets. In addition, it is always time-consuming, and sometimes infeasible, to prepare a high-quality training spectra dataset through manual verification.

To overcome these difficulties, we propose a self-adaptive model for peptide identification. Unlike popular methods that treat each peptide-spectrum match individually, our model is ensemble-based, i.e., it focuses on the common characteristics shared by peptide-spectrum matches. Spectra from the same experiment are assumed to be produced by the same spectrum-generating rules. This implies that the common rules can be derived if correct peptide-spectrum pairs are given; conversely, if the common rules are already known, a peptide assignment can easily be labeled as confident if it is consistent with them. This mutual dependence enables an iterative process that derives the common rules and the high-confidence peptide-spectrum matches simultaneously. Besides distinguishing correct peptide assignments, this approach also obviates the laborious work of preparing a training set.
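The mutual dependence between rules and confident matches can be sketched as an alternating loop. The example below is a toy illustration under strong assumptions: each candidate match is reduced to a tuple of binary features (for example, whether a given ion type was observed), the "rules" are the smoothed frequencies of those features, and the best candidate per spectrum is re-selected until the assignments stop changing. The feature encoding and all function names are hypothetical and do not describe the actual PI implementation.

```python
import math


def estimate_rules(selected, n_features, pseudo=1.0):
    """Estimate P(feature present) from the currently selected matches.

    Laplace smoothing keeps every probability strictly between 0 and 1.
    """
    counts = [pseudo] * n_features
    for features in selected:
        for j, present in enumerate(features):
            counts[j] += present
    total = len(selected) + 2 * pseudo
    return [c / total for c in counts]


def log_likelihood(features, rules):
    """Score a candidate match by how well it agrees with the rules."""
    return sum(math.log(p if present else 1.0 - p)
               for present, p in zip(features, rules))


def self_adaptive(candidates, max_iter=20):
    """Alternate between rule estimation and match selection.

    `candidates` holds, for each spectrum, a list of binary feature
    tuples, one per candidate peptide. Returns the chosen candidate
    index per spectrum and the final rules.
    """
    n_features = len(candidates[0][0])
    choice = [0] * len(candidates)  # start from the first candidate
    rules = estimate_rules([c[k] for c, k in zip(candidates, choice)], n_features)
    for _ in range(max_iter):
        new_choice = [
            max(range(len(c)), key=lambda k: log_likelihood(c[k], rules))
            for c in candidates
        ]
        if new_choice == choice:
            break  # assignments are stable
        choice = new_choice
        rules = estimate_rules([c[k] for c, k in zip(candidates, choice)], n_features)
    return choice, rules
```

No labeled training set enters this loop; the rules are learned from the same spectra that are being identified, which is the property the approach relies on.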

We implemented our model and the related scoring scheme in a Java package (Download).