# What is Pi?

**Pi** is a powerful suite for the analysis of tandem mass spectra. Pi seeks to fill the need for in-depth analysis of tandem mass spectra, including fragmentation rules, cleavage preferences, neutral losses, etc. We believe that statistics plays an important role in mass spectrum analysis.

There are four modules in Pi, which are listed as follows:

## A novel scoring function

An effective scoring function for evaluating matches between an experimental spectrum and a candidate peptide is a key issue in the interpretation of mass spectra. Most scoring functions, such as SEQUEST [5] and Sonar (http://65.219.84.5/ProteinId.html), are based on shared peaks. An alternative is to evaluate the probability of recognizing a set of fragments in a protein database, as implemented in MOWSE (http://www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/mowsedoc.html), Mascot [7] and ProbID [8]. By characterizing ion types and their probabilities, Dancik et al. [9] proposed a likelihood-based approach, which was generalized in SCOPE [10] to involve more prior knowledge. An extension of Dancik's scoring approach into an intensity-based statistical scorer incorporated a variety of experimental observations and prior knowledge on peptide fragmentation [11]. ProbID [8], a method based on a probabilistic model, adopted a Bayesian approach to interpret mass spectral data.

Random matches between experimental and theoretical masses may produce false positives, which raises another key problem in peptide identification: the criteria for evaluating the reliability of a match. The difference between the highest and second-highest scores [5] and estimates [7] are used to filter false positives. Jan Eriksson [12] and Keller [13] built a model of the distribution of scores from random matches, which allows significance testing under general database searching constraints. Filtering criteria for distinguishing valid matches from all matches should therefore rest on quantitative estimates rather than on experience.

We introduce a new, effective probabilistic scoring function. Adopting a statistical model similar to that of Dancik et al. [9], we employ relative entropy (i.e., the K-L distance) to measure the similarity between hypothetical and experimentally observed spectra. We give a brief proof that relative entropy is indeed a simplified form of the conditional probability that the spectrum is generated from the peptide.
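As a rough illustration of the relative entropy idea, the sketch below compares two spectra that are assumed to be already binned on the same m/z grid. The function names, the smoothing constant `eps`, and the direction of the divergence (observed relative to predicted) are assumptions made for this example, not details taken from Pi.

```python
import math


def normalize(intensities, eps=1e-9):
    """Turn raw peak intensities into a probability distribution.

    eps is a hypothetical smoothing constant that keeps empty bins
    from producing log(0) or division by zero.
    """
    smoothed = [x + eps for x in intensities]
    total = sum(smoothed)
    return [x / total for x in smoothed]


def relative_entropy(observed, predicted):
    """K-L distance D(P || Q) = sum_i p_i * log(p_i / q_i), in nats.

    Both inputs are intensity vectors over the same m/z bins.
    """
    if len(observed) != len(predicted):
        raise ValueError("spectra must share the same m/z binning")
    p = normalize(observed)
    q = normalize(predicted)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A value of zero means the two normalized spectra are identical, and larger values mean a poorer match, so a candidate peptide whose predicted spectrum gives a smaller distance would rank higher under this reading.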

Moreover, we present a criterion based on the extreme value distribution (EVD) to distinguish valid matches from random ones. Each spectrum acquires a best score from its correlations with all candidate peptides. These best scores follow the extreme value distribution, which provides a quantitative threshold for significance testing.
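The significance test can be sketched as follows. This example assumes a Gumbel (type I extreme value) form, a simple method-of-moments fit, and a score where larger means better (a distance-like score would be negated first). None of these choices is confirmed by the source, and the function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(best_scores):
    """Method-of-moments fit of a Gumbel distribution to best scores.

    Uses mean = location + gamma * scale and
    variance = (pi * scale)^2 / 6.
    """
    mean = statistics.fmean(best_scores)
    stdev = statistics.stdev(best_scores)
    scale = stdev * math.sqrt(6) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def p_value(score, location, scale):
    """Probability that a random best score is at least `score`."""
    z = (score - location) / scale
    # 1 - exp(-exp(-z)), written with expm1 for accuracy in the tail
    return -math.expm1(-math.exp(-z))
```

Under these assumptions, a match would be accepted when `p_value` falls below a chosen significance level, which replaces an experience-based score cutoff with a quantitative one.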

## Quantifying the factors influencing global fragmentation

Besides the promising chemical kinetic model for simulating the fragmentation process \cite{ZhangZ}, several studies have developed statistical models for predicting theoretical spectra. Dancik et al. introduced an automatic tool, the *offset frequency function*, to learn ion-type tendencies and intensity thresholds from experimental spectra \cite{Dancik, Bafna}. J.R. Yates III et al. attempted to identify statistical trends in spectrum peak intensities and to place them in a chemical context. F. P. Roth applied a probabilistic decision tree approach to pick out the important factors from a total of 63 peptide and fragmentation attributes \cite{Roth}. Another interesting method for determining the factors that influence fragmentation is the linear model proposed by F. Schutz \cite{Schutz}. In this method, a linear model is fitted to the spectrum, reflecting the influence of specific amino acids and their positions in the peptide. The linear model can also predict theoretical spectra accurately.

The linear model has some difficulties. It represents the preference for cleavage at a bond as the sum of the influence of the residue on the N-terminal side and that of the residue on the C-terminal side. This assumption is strict, since it implies that cleavage at Xaa-Pro is enhanced relative to any other Xaa-Yaa bond regardless of what Xaa is, which is inconsistent with the observation that Xaa-Pro cleavage is hindered when Xaa is Gly or Pro \cite{Schutz}. Hence, it is more reasonable to model cleavage preference per bond rather than as a sum of residue influences. We present a novel model to overcome these difficulties.
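The source does not spell out the per-bond model, so the sketch below shows only one minimal reading of it: a lookup table keyed by the residue pair on either side of the bond, filled by averaging observed relative cleavage intensities. The function name and the toy numbers are hypothetical and are not measurements.

```python
from collections import defaultdict


def bond_preferences(observations):
    """Average relative cleavage intensity for each (Xaa, Yaa) bond.

    observations: iterable of (n_side_residue, c_side_residue, intensity).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for n_res, c_res, intensity in observations:
        totals[(n_res, c_res)] += intensity
        counts[(n_res, c_res)] += 1
    return {bond: totals[bond] / counts[bond] for bond in totals}


# Toy values for illustration only (not experimental data).
toy = [("A", "P", 0.9), ("A", "P", 0.7), ("G", "P", 0.1), ("A", "L", 0.4)]
prefs = bond_preferences(toy)
```

Because each pair has its own entry, such a table can hold an enhanced Ala-Pro value next to a hindered Gly-Pro value, which an additive model with a single term for Pro cannot express.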

## Deriving probabilities of neutral losses

In the studies above, however, little attention was paid to deriving the probabilities of neutral losses for each amino acid or to predicting the intensities of ions with neutral losses. To date, widely used algorithms such as Sequest \cite{sequest} and Mascot \cite{mascot} adopt a simple fragmentation model to predict theoretical spectra. This model assumes that cleavage occurs uniformly at all peptide bonds, regardless of influencing factors such as amino acid position, bond type and, especially, neutral losses. Since ions routinely undergo neutral losses, which depend heavily on peptide composition, it is necessary to derive the probability of neutral loss for each amino acid. This module addresses the problem of learning neutral-loss probabilities and of incorporating them into a statistical model for predicting theoretical spectra.
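For reference, the uniform fragmentation model described above can be sketched as a plain b/y ion ladder in which every peak gets the same intensity. The function name is hypothetical, only singly charged ions are generated, and the sketch is not how Sequest or Mascot is actually implemented. The masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def uniform_by_ions(peptide):
    """Return (label, m/z, intensity) for singly charged b and y ions.

    Every peptide bond is cleaved and every ion gets intensity 1.0,
    so position, bond type and neutral losses are all ignored.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    prefix = 0.0
    for i in range(1, len(masses)):  # bond between residues i and i + 1
        prefix += masses[i - 1]
        suffix = sum(masses[i:])
        peaks.append((f"b{i}", prefix + PROTON, 1.0))
        peaks.append((f"y{len(masses) - i}", suffix + WATER + PROTON, 1.0))
    return peaks
```

A model that accounts for neutral losses would add further peaks (for example, ions shifted by the mass of water) with intensities weighted by learned per-residue probabilities, instead of leaving them out.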

## Predicting spectrum from sequence accurately

We used this model to predict theoretical spectra for a test set and compared them with the experimental ones. Intensities were estimated for ions with neutral losses, so a more complete theoretical spectrum can be predicted for a peptide sequence. Experimental results show that this model predicts a more 'realistic' spectrum.

## To be done

A de novo method using the relative entropy scoring function

Identifying post-translational modifications

.....