MSPreprocess - a program that performs preprocessing steps on mass-spectrum data

Proteomics-MSPreprocess - Softberry Mass Spectra (SMS) processing tools. Preprocessing of the mass-spectrum data.

When processing a single calibrated spectrum, this program performs the following operations:
(1) Data resampling;
(2) Data smoothing;
(3) Detection of the baseline and its subtraction from intensity;
(4) Normalization.
The resulting data can then be used by the Proteomics-MSCreateTable or Proteomics-MSPredictLDA programs.

Step 1. Data resampling.
The first step in mass spectra processing is data resampling. It discards excessive data and brings the m_i values to a common scale. As a result, different spectra have the same number of m values and are therefore comparable. Reducing the number of spectrum points lowers the noise and eliminates excessive data while keeping the spectrum shape. The common data scale after conversion spans the interval between the minimal and maximal m values of the spectrum. The number of points resampled from the original set is determined by the 'Binning percent' parameter, which is the percentage of spectrum points that remain after conversion (default value is 25). An example of data resampling is shown in figure 1.


Figure 1. Result of data resampling for a small spectrum interval. Original data are shown as blue squares, resampled data as red circles. 'Binning percent' was set to 25 in this case.
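
The documentation does not state which resampling method the program uses, so the following is only a minimal sketch of the idea under stated assumptions: the m values are sorted in ascending order, the new grid is evenly spaced between the minimal and maximal m, and intensities are obtained by linear interpolation. The function name resample is hypothetical.

```python
def resample(mz, intensity, binning_percent=25.0):
    """Resample a spectrum onto an evenly spaced m grid (assumed method).

    Assumes mz is sorted in ascending order. The output keeps roughly
    binning_percent percent of the original number of points.
    """
    if len(mz) != len(intensity) or len(mz) < 2:
        raise ValueError("need at least two (m, I) pairs of equal length")
    n_out = max(2, round(len(mz) * binning_percent / 100.0))
    lo, hi = mz[0], mz[-1]
    step = (hi - lo) / (n_out - 1)

    new_mz, new_intensity = [], []
    j = 0  # index of the left neighbor in the original data
    for k in range(n_out):
        # Use the exact upper bound for the last point to avoid rounding drift.
        x = hi if k == n_out - 1 else lo + k * step
        while j < len(mz) - 2 and mz[j + 1] < x:
            j += 1
        x0, x1 = mz[j], mz[j + 1]
        y0, y1 = intensity[j], intensity[j + 1]
        t = 0.0 if x1 == x0 else (x - x0) / (x1 - x0)
        new_mz.append(x)
        new_intensity.append(y0 + t * (y1 - y0))
    return new_mz, new_intensity
```

With 20 input points and the default 'Binning percent' of 25, this sketch returns 5 points that cover the same m range as the input.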

Step 2. Smoothing.
The data smoothing procedure is intended to reduce noise in the data. During smoothing, the intensity value at each m_i point is averaged over several neighboring points. The number of such points is determined by the 'SmoothWindowSize' parameter (default value is 3). The smoothing procedure can be repeated several times; the number of iterations is determined by the 'SmoothReps' parameter (default value is 3). An example of data smoothing is shown in figure 2.


Figure 2. Result of data smoothing. Original data are shown as blue squares, smoothed data as red circles. SmoothWindowSize was set to 3 and SmoothReps was set to 3.
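
The smoothing step described above can be sketched as a repeated moving average. This is an illustration, not the program's actual code: it assumes that 'SmoothWindowSize' is the total number of points in a window centered on the current point, and that the window is simply truncated at the spectrum edges. The function name smooth is hypothetical.

```python
def smooth(intensity, window_size=3, reps=3):
    """Repeated centered moving average (assumed interpretation).

    window_size: total number of points averaged around each point.
    reps: how many times the averaging pass is applied.
    """
    if window_size < 1 or reps < 0:
        raise ValueError("window_size must be >= 1 and reps must be >= 0")
    half = window_size // 2
    values = list(intensity)
    n = len(values)
    for _ in range(reps):
        smoothed = []
        for i in range(n):
            # Truncate the window at the edges of the spectrum.
            lo = max(0, i - half)
            hi = min(n, i + half + 1)
            smoothed.append(sum(values[lo:hi]) / (hi - lo))
        values = smoothed
    return values
```

Each additional pass spreads a sharp spike over more neighboring points, so larger 'SmoothReps' values give a smoother but flatter signal.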

Step 3. Baseline detection and subtraction.
This step eliminates systematic artifacts that arise from the matrix and chemicals used in the experiments or from detector overload. These artifacts produce background noise that can be significant for some m values. The first step in background noise removal is the identification of peaks (local signal maxima that are located far enough from each other). The required distance between peaks is determined by the 'Baseline parameter' value (default = 0.005). This parameter defines the minimal relative m distance by which two neighboring peaks 1 and 2 must be separated, that is:
|m1-m2|/m1 > 'Baseline parameter'.

After the peaks are identified, the algorithm detects the signal minima located in the intervals between peaks. These are the base points used to calculate the background noise line. The baseline for all spectrum points is built by interpolation over the base points. If in some part of the spectrum the baseline value exceeds the original signal, new base points are selected from the neighboring points.

The baseline intensity values are subtracted from the original ones. If the resulting signal value falls below zero, it is set to zero. The result of background subtraction is shown in figure 3.


Figure 3. Result of background signal subtraction. Original data are shown as blue squares, modified data as red circles. The baseline is shown as a green line.
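
The baseline procedure can be sketched roughly as follows. Several details are assumptions that the text does not confirm: when two local maxima are closer than the 'Baseline parameter' allows, the taller one is kept; the interpolation is linear; the baseline is held constant outside the first and last base points; and the reselection of base points where the baseline exceeds the signal is omitted, with clipping at zero used instead. The function names are hypothetical, and m values are assumed to be strictly increasing.

```python
def find_peaks(mz, intensity, baseline_param=0.005):
    """Indices of local maxima separated by |m1-m2|/m1 > baseline_param."""
    peaks = []
    for i in range(1, len(mz) - 1):
        if intensity[i - 1] < intensity[i] >= intensity[i + 1]:
            if peaks:
                m1 = mz[peaks[-1]]
                too_close = m1 != 0 and abs(m1 - mz[i]) / abs(m1) <= baseline_param
            else:
                too_close = False
            if too_close:
                # Assumption: of two close maxima, keep the taller one.
                if intensity[i] > intensity[peaks[-1]]:
                    peaks[-1] = i
            else:
                peaks.append(i)
    return peaks


def subtract_baseline(mz, intensity, baseline_param=0.005):
    """Return (corrected intensities, baseline), both as lists."""
    peaks = find_peaks(mz, intensity, baseline_param)
    bounds = [0] + peaks + [len(mz) - 1]

    # Base points: the signal minimum in each interval between boundaries.
    base = []
    for a, b in zip(bounds, bounds[1:]):
        if b > a:
            i_min = min(range(a, b + 1), key=lambda i: intensity[i])
            if not base or i_min != base[-1]:
                base.append(i_min)

    baseline = []
    k = 0
    for i in range(len(mz)):
        if i <= base[0]:
            value = intensity[base[0]]
        elif i >= base[-1]:
            value = intensity[base[-1]]
        else:
            while base[k + 1] < i:
                k += 1
            i0, i1 = base[k], base[k + 1]
            t = (mz[i] - mz[i0]) / (mz[i1] - mz[i0])
            value = intensity[i0] + t * (intensity[i1] - intensity[i0])
        baseline.append(value)

    # Negative values after subtraction are set to zero.
    corrected = [max(0.0, y - b) for y, b in zip(intensity, baseline)]
    return corrected, baseline
```

For a spectrum with a constant offset and two well-separated peaks, this sketch recovers the offset as the baseline and leaves only the peak heights above it.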

Step 4. Normalization.
Normalization brings peak intensity values to a common scale, which makes it possible to compare data from different spectra. The only parameter of this procedure is 'NormalizationConstant' (default value is 10000). An example is shown in figure 4.


Figure 4. Result of the normalization procedure. Original data are shown as blue squares, modified data as red circles.
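
The text does not specify how 'NormalizationConstant' enters the calculation. One common choice, assumed here purely for illustration, is to scale the spectrum so that the sum of all intensities equals the constant; the program may instead normalize to the maximum or to some other statistic. The function name normalize is hypothetical.

```python
def normalize(intensity, normalization_constant=10000.0):
    """Scale intensities so that their sum equals normalization_constant.

    Assumption: total-intensity normalization. The actual rule used by
    the program is not documented in this section.
    """
    total = sum(intensity)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    scale = normalization_constant / total
    return [y * scale for y in intensity]
```

Under this assumption, two spectra recorded at different overall signal levels end up with the same total intensity and can be compared point by point.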

Input: m/z - Intensity data
Output: Preprocessed mass-spectrum data.

Parameter(s):
Binning percent - This parameter specifies the fraction of data, in percent, that remains after resampling. The default value is 25.
SmoothWindowSize - This parameter determines the window size for the smoothing operation. The default value is 3.
SmoothReps - This parameter specifies the number of times the smoothing operation is repeated. The default value is 3.
Baseline parameter - This parameter specifies the minimal relative mass difference, |m1-m2|/m1, by which two neighboring peaks 1 and 2 must be separated to be distinguished for baseline determination. The default value is 0.005.
NormalizationConstant - This parameter specifies the normalization constant. The default value is 10000.
File format type - This parameter specifies the file format: SSV (space separated values), CSV (comma separated values) or TSV (tab separated values).
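
For scripting around the tool, the parameters and defaults listed above can be collected in one place. The sketch below is hypothetical: the class and field names are not part of the program, and the validity ranges are assumptions rather than documented limits. No default is given for the file format because the documentation does not state one.

```python
from dataclasses import dataclass

ALLOWED_FORMATS = ("SSV", "CSV", "TSV")


@dataclass(frozen=True)
class PreprocessParams:
    """Hypothetical container for the documented parameters and defaults."""
    file_format: str                        # 'File format type' (no documented default)
    binning_percent: float = 25.0           # 'Binning percent'
    smooth_window_size: int = 3             # 'SmoothWindowSize'
    smooth_reps: int = 3                    # 'SmoothReps'
    baseline_parameter: float = 0.005       # 'Baseline parameter'
    normalization_constant: float = 10000.0 # 'NormalizationConstant'

    def __post_init__(self):
        # The ranges below are assumptions, not documented limits.
        if self.file_format not in ALLOWED_FORMATS:
            raise ValueError(f"file_format must be one of {ALLOWED_FORMATS}")
        if not 0 < self.binning_percent <= 100:
            raise ValueError("binning_percent must be in (0, 100]")
        if self.smooth_window_size < 1 or self.smooth_reps < 0:
            raise ValueError("invalid smoothing settings")
        if self.baseline_parameter <= 0 or self.normalization_constant <= 0:
            raise ValueError("baseline and normalization values must be positive")
```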

Data format.
Mass spectra data are sets of value pairs: the mass-to-charge ratio (m/z, referred to below as m, or mass, for convenience) and the corresponding signal intensity (I). On a spectrum plot, the mass corresponds to the X coordinate and the signal intensity to the Y coordinate. A typical spectrum consists of several thousand such value pairs (points). Data are stored as text files in which each pair (m_i, I_i) of mass-intensity values occupies one line, and the two values on a line are separated by a separator symbol. The SMS package supports several separator types: space (SSV, space separated values), comma (CSV, comma separated values) and tabulation (TSV, tab separated values). Data files may contain comment lines, which are skipped when the file is read. Comment lines must begin with the "#" symbol in the first position. An example of a data file in CSV format is shown in figure 5.

#M/Z,Intensity
-7.8602611e-005,4.1126194
2.1773576e-007,4.0764203
9.6021472e-005,4.0040221
0.00036601382,4.1186526
0.00081019477,4.0040221
0.0014285643,3.9617898
….
19742.941,4.077895
19745.564,4.0772248
19748.187,4.0772248

Figure 5. Example file with mass spectra data in CSV format.
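
A file in this format can be read with a few lines of code. The sketch below is an illustration rather than the SMS package's own reader; the function name read_spectrum is hypothetical. As a simplifying assumption, the SSV variant splits on any run of whitespace, so repeated spaces between the two values are tolerated.

```python
# Separator per format; None makes str.split() split on runs of whitespace.
SEPARATORS = {"SSV": None, "CSV": ",", "TSV": "\t"}


def read_spectrum(lines, file_format="CSV"):
    """Parse (m, I) pairs from an iterable of text lines.

    Lines starting with '#' in the first position are comments and are
    skipped, as are empty lines.
    """
    if file_format not in SEPARATORS:
        raise ValueError(f"unknown file format: {file_format}")
    sep = SEPARATORS[file_format]
    mz, intensity = [], []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue
        line = line.strip()
        if not line:
            continue
        fields = line.split(sep)
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 values, got {len(fields)}")
        mz.append(float(fields[0]))
        intensity.append(float(fields[1]))
    return mz, intensity
```

To read from disk, an open text file can be passed directly, for example read_spectrum(open(path), "CSV"), where path is whatever file name the user supplies.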