ISSN: 0974-276X
Research Article - (2015) Volume 8, Issue 5
Introduction: Preanalytical variations have a major impact on most biological assays. Specifically, MS-based multiparametric proteomic analyses of blood specimens are seriously affected by limited protein stability due to the high intrinsic proteolytic activity of serum and plasma. However, direct analysis of sample quality (DASQ) for serum specimens is not readily available. Here we propose mass spectrometry-based monitoring of peptide patterns that change ex vivo in a time-dependent manner to alleviate these constraints.
Materials and methods: Serum specimens from healthy controls (n=3) and patients with colorectal cancer were analyzed for a set of endogenous peptides (n=62). The respective proteolytic fragments were monitored with LC/MS at different preanalytical points in time ranging from 1 h to 48 h after blood withdrawal. An algorithm was constructed with a training set of serum specimens from colorectal cancer patients (n=30). An independent test set of patients (n=20) was used for further validation.
Results: The coefficient of determination (R2) for the linear regression of true and estimated points in time was 0.89. However, the classification accuracy for specimens with a preanalytical time span below 8 h was higher when compared to older specimens (>8 h).
Conclusion: Endogenous peptides are processed in blood specimens in a time dependent manner. This ‘proteomic degradation clock’ can be used to estimate the preanalytical quality of serum specimens. This is specifically relevant prior to in depth proteomic profiling approaches or other laborious analyses in research or diagnostic applications. Accordingly, specimens with low quality can be identified and subsequently be excluded from further analyses to avoid any unwanted preanalytical bias.
Keywords: Serum, Protease, Decay markers, Bioinformatics, Mass spectrometry, Biobanking, Quality control
LC/MS: Liquid Chromatography Mass Spectrometry; FTMS: Fourier Transform Mass Spectrometry; ITMS: Quadrupole Ion Trap Mass Spectrometry; ppm: Parts Per Million; Da: Dalton; CV: Coefficient of Variation; SD: Standard Deviation; HPLC: High Pressure Liquid Chromatography; m/z: Mass to Charge Ratio; XIC: Extracted Ion Chromatogram; aU: Arbitrary Units
The identification and validation of new biomarker candidates relies upon the availability of clinical specimens like serum and plasma that are most often provided by biobank repositories [1]. However, variable preanalytical sample handling and storage have a major impact on sample quality, which in turn affects biomarker discovery. Accordingly, preanalytical bias has been identified as one of the most serious threats to profiling experiments and can even abolish meaningful data interpretation completely [2]. An analysis of 125 biomarker discovery papers published in open-access journals between 2004 and 2009 found that more than half included no information about how specimens had been obtained, stored or processed [3]. Furthermore, a 2011 survey of more than 700 cancer researchers found that 47% had trouble finding samples of sufficient quality. Accordingly, either the scope of the study was limited (81%) or findings were questionable (61%) [4]. Because the issue of sample quality has been neglected, many recently described biomarkers could not be validated independently and therefore had to be classified as false positives [5]. To fully assess biospecimen quality, multiple quality control markers are needed that are not readily available up to now [6]. Accordingly, the analytical monitoring of sample quality is obligatory for improved biomarker discovery and validation studies. Recently we described an external decay marker for quality control monitoring of serum and plasma [7]. However, this approach is restricted to prospective quality control analyses, as the synthetic peptide substrate has to be added to serum and plasma tubes prior to blood withdrawal; existing collections of native samples are therefore not accessible to this approach.
Here we propose the MS-based analysis of patterns from endogenous peptides for quality assessment of serum specimens. Blood has inherent proteolytic activity that is related to various endoproteases e.g. from coagulation, fibrinolysis and the complement activation [8]. In addition, a multitude of exoproteases is also contributing to the proteolytic activity of serum and plasma [9]. Consequently, proteins and peptides are continuously processed in a time dependent manner. We hypothesized that the MS-based monitoring of the proteolytic processing of endogenous proteins/peptides might function as a ‘proteomic degradation clock’ to estimate the preanalytical quality of blood specimens. The proteolytic decay of many high abundant proteins is initiated during blood withdrawal. The cleavage of proteins by endoproteases is generating peptidic fragments that are further trimmed down by exoproteases causing ‘ladder like’ degradation patterns [10]. This results in the reduction of longer and accumulation of shorter fragments during prolonged incubation [7]. For the selection of appropriate endogenous decay markers we systematically investigated the time dependent changes in peptide profiles of serum specimens that were aged under controlled conditions (1 h to 48 h). Furthermore, an algorithm was constructed for matching peptide profiles to the respective points in time. For this sort of mathematical analysis two prerequisites are relevant: First, the data have to be reproducible i.e. the time dependent changes of peptide profiles have to be consistent and must not be related to interindividual differences or any state of disease. Second, a given number of replicate data from time-resolved measurements should enable the construction of an algorithm to calculate the preanalytical quality of a given sample with sufficient precision.
In summary, we identified 62 endogenous decay markers with time related changes of respective concentrations. A training set was used for the construction of a regression algorithm that subsequently was validated with an independent test set. Quality control analyses should be introduced for future biomarker discovery studies to avoid any unwanted bias that is related to insufficient sample quality.
Reagents and chemicals
HPLC-grade acetonitrile was purchased from Fisher Chemicals. Formic acid and trichloroacetic acid were purchased from Sigma. LiChrosolve water was purchased from Merck. All reagents and chemicals were at least of analytical grade.
Serum samples
Whole blood specimens for initial identification and selection of decay markers were taken from 3 healthy volunteers. Further blood withdrawal was performed from colorectal cancer patients of the oncology department at the University Hospital Mannheim. Patients’ specimens were assigned either to the training set (n=30) or to the test set (n=20). Blood collection was performed after we obtained institutional review board approval and written informed consent. The specimens were kept at room temperature for 30 min prior to centrifugation at 22°C for 10 min at 3000 x g. The specimens were further kept at room temperature for the scheduled time and 50 μl aliquots were taken 1 h, 2 h, 5 h, 8 h, 24 h, 30 h and 48 h after blood withdrawal. Specimens of the training set were kept at room temperature for 1 h-8 h, 24 h, 30 h and 48 h and processed in the same manner. All handling and processing of serum specimens from the training set and test set was performed in strictly randomized order. An overview of the workflow is given in supplemental Figure 1.
Sample preparation
For deproteinization of the samples 50 μl of ice-cold 10% (v/v) trichloroacetic acid (TCA) was added and mixed thoroughly. The resulting mixture was kept at 4°C for 30 min prior to centrifugation for 15 min at 4°C and 13,000 rpm in a microcentrifuge (Eppendorf, Germany). The supernatant was centrifuged for 10 min at 4°C and 13,000 rpm and 1.5 μl of the supernatant was used for LC/MS-analysis.
Liquid chromatography-mass spectrometry (LC-MS) analysis
LC-MS was performed using a nano HPLC system (UltiMate 3000 RSLC; Thermo Scientific Dionex) coupled to a linear ion trap - Orbitrap hybrid mass spectrometer (LTQ-Orbitrap XL, Thermo Scientific) with a TriVersa NanoMate chip interface (Advion). Liquid chromatography was performed on an Acclaim PepMap® RSLC 75 μm ID, 150 mm length, C-18 column with 2 μm particles (Thermo Scientific) with a flow rate of 300 nl/min and a gradient from 3% to 60% of buffer B in 37 min. The composition of buffer A was water with 0.1% formic acid and buffer B was 80% acetonitrile with 0.1% formic acid. Each LC run was preceded by a blank run to ensure lack of carryover. MS analysis was performed in positive ion mode, with a mass range from 340 to 1700 m/z. Each scan cycle consisted of one FTMS full scan and up to seven ITMS dependent MS/MS scans of the most intense ions. Dynamic exclusion (30 s), mass width (10 ppm) and monoisotopic precursor selection were enabled. The cleavage specificity was set to “trypsin”, allowing for a maximum of two missed cleavages. Cysteine alkylation due to iodoacetamide (+57.022) treatment was set as fixed modification. For peptide identification, MS/MS spectra were searched against the Uniprot/Swissprot human database using the PEAKS 6 search engine (Bioinformatics Solutions Inc.) accepting common variable modifications. The precursor mass tolerance was set to 10 ppm, fragment ion mass tolerance was set to 0.5 Da. The false discovery rate was below 1% and this resulted in a -log p-value of greater than 18.9. Extracted ion chromatograms (XIC) were analysed with Xcalibur software (Thermo Scientific). Exemplary screenshots of XIC and MS/MS scans are shown in supplemental Figure 2. Peak areas of selected peptides were determined for each point in time via the Xcalibur quantification method based on the XIC. The peak areas were normalized to the total ion count (TIC) of the respective MS spectra.
The dynamic range of label-free LC-MS quantification was at least three orders of magnitude and is in line with other reports [7,11].
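The two numerical settings used above, the 10 ppm precursor tolerance and the normalization of peak areas to the TIC, can be illustrated with a minimal sketch. The function names and the example values in the usage note are hypothetical and are not taken from the study; the sketch assumes a symmetric tolerance window and a simple division of each peak area by the TIC of the run.

```python
def ppm_window(mz, tol_ppm=10.0):
    """Return (low, high) m/z bounds for a symmetric tolerance given in ppm."""
    delta = mz * tol_ppm / 1e6
    return mz - delta, mz + delta


def normalize_to_tic(peak_areas, tic):
    """Divide each XIC peak area by the total ion count (TIC) of the run.

    peak_areas: dict mapping a peptide label to its raw XIC peak area.
    tic: total ion count of the same run (must be positive).
    """
    if tic <= 0:
        raise ValueError("TIC must be positive")
    return {peptide: area / tic for peptide, area in peak_areas.items()}
```

For example, `ppm_window(733.33)` gives a window of roughly ±0.0073 m/z around the precursor, and `normalize_to_tic({"peptide_a": 50.0}, 200.0)` maps the hypothetical peak area to 0.25.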
Reproducibility of reporter peptide spiking
Repeated instrument analyses of one randomly chosen sample were performed six times for selected decay markers (n=11) to monitor the reproducibility of label-free peptide quantification. The signal intensities of the selected decay markers varied over a range of approximately three orders of magnitude. As coefficients of variation are inversely correlated with signal intensities [12], it is essential to include decay markers with low signal intensity in the reproducibility testing.
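As an illustration of the reproducibility measure, the coefficient of variation of replicate measurements can be computed as sketched below. The use of the sample standard deviation (n-1 denominator) is an assumption, since the text does not state which estimator was used; the function name is hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation divided by the mean.

    Assumption: the sample standard deviation (n-1 denominator) is used.
    """
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return statistics.stdev(values) / mean * 100.0
```

For six replicate peak areas of one decay marker, the function would be called once per marker and the result compared against the acceptance limit chosen for the assay.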
Prediction algorithm
Our aim was to develop an approach that can determine the status of a single unknown serum sample from high-throughput experiments with regard to its peptide content. A detailed description of the mathematical calculations has been published previously [13]. Briefly, serum specimens were aged under controlled conditions and peptide profiles of the respective points in time were generated with LC-MS to build a set of specimens designated as training data. These were checked for consistency, and specimens with inconsistent behaviour were eliminated from further analyses. We used the robust rank correlation measure that is freely available as an R package named Rococo [14] and that has been shown to enable robust classification even when noisy numerical data and small sample sets are analyzed [15]. In theory, a high correlation between MS-peptide profile measurements of unrelated specimens is indicative of comparable preanalytical conditions and rather similar points in time. However, some of the identified peptides within the MS profile might behave in a way that is completely unrelated to the specific condition or process of interest (sample aging). They could therefore deteriorate the correlation coefficient between MS-peptide profiles at a given point in time. In order to reduce this effect, we removed a fixed small number of peptides whose removal led to the highest increase of the rank correlation coefficient. The fixed number of removed peptides was set to three, as we expect that 5% of the peptides used in the correlation computation may be outliers (three of 62 peptides, approximately 5%). The data comprised N MS-peptide profiles measured in R sample sets (training set) at TR different points in time. For a given single sample x, N peptides are measured at an unknown point in time tk. To assess the point in time tk, the following steps were carried out:
1) Measurement of the training set of samples for all N peptides and consistency check.
2) Removal of replicates that show suspicious behaviour by applying the data consistency check to the training set.
3) For the MS-peptide profile of a test set sample measured at the point in time tk to be estimated, we compute the robust gamma rank correlation coefficient with the training data at all available points in time.
4) For each replicate of the training data set, we choose the point in time t(r) that delivers the highest rank correlation coefficient with the profile measured at tk.
5) We compute the mean value of all obtained points in time to assess tk.
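Steps 3 to 5 can be sketched in a few lines of Python. This is a simplified illustration and not the published implementation [13]: the plain Goodman-Kruskal gamma is swapped in for the robust gamma of the Rococo package, the three peptides are removed greedily one at a time (the text does not specify the search strategy), and all function names and the data layout are hypothetical.

```python
from itertools import combinations
from statistics import mean


def gamma(x, y):
    """Plain Goodman-Kruskal gamma: (concordant - discordant) / (concordant + discordant).

    Assumption: stand-in for the robust gamma rank correlation of Rococo.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def trimmed_gamma(x, y, n_remove=3):
    """Greedily drop n_remove peptides, each time the one whose removal raises gamma most."""
    keep = list(range(len(x)))
    for _ in range(min(n_remove, max(len(keep) - 2, 0))):
        best = max(
            keep,
            key=lambda k: gamma([x[i] for i in keep if i != k],
                                [y[i] for i in keep if i != k]),
        )
        keep.remove(best)
    return gamma([x[i] for i in keep], [y[i] for i in keep])


def estimate_time(sample, training):
    """Estimate the preanalytical time of one sample profile.

    training: {patient_id: {time_in_hours: profile}}, with profiles given as
    lists of peptide intensities in the same peptide order as `sample`.
    Returns the mean of the best-matching time found for each training patient.
    """
    best_times = [
        max(profiles, key=lambda t: trimmed_gamma(sample, profiles[t]))
        for profiles in training.values()
    ]
    return mean(best_times)
```

Because only rank order enters the correlation, the sketch is insensitive to the absolute signal intensities of individual patients; replacing `mean` by `median` in the last line would give the alternative aggregation.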
Evaluation and testing
The training set comprised serum specimens (n=210 time points) from 30 colorectal cancer patients that were aged under controlled conditions (1-48 h). After testing for data consistency, one out of 30 patients was excluded from further analysis. The remaining 29 patients were taken as training set to estimate time courses of the selected 62 decay markers. A test set of independently collected serum specimens from a further 20 patients was blinded and used for validation. For each point-in-time estimation we computed the correlation coefficient with all available replicate points in time in the training set after removing the three peptides that allow the highest increase of the correlation coefficient during each calculation. We thereby obtained from each replicate the one point in time that yields the highest correlation coefficient. By computing the mean of the 29 resulting points in time, we determined the point in time at which the measurement of the peptides was performed. In fact, there were only minor differences between using the mean and using the median for this step.
To visualize our approach, we plotted all computed correlation coefficients with all available samples of the training set for each point in time, with each patient shown in a different color. The plotted correlation points show how the computed correlation coefficient depends on the point in time we intend to estimate. The average of the selected points in time with the highest correlation from each patient of the training set is displayed as a blue line representing the estimated point in time. Showing both the estimated and the real point in time allows an intuitive visual assessment of how well the presented method performs.
Peptides for analysis
Initial analysis started with 192 peptides identified in a pilot study. The number of peptides was subsequently reduced: peptides with low signal intensities, unstable ionisation, poor identification reliability or rapid loss of signal intensity over time were excluded. Finally, 62 peptides were selected as decay markers for further analyses. These peptides originate from common serum proteins like fibrinogen, coagulation factors, components of the complement system or inter-alpha-trypsin inhibitor heavy chain H4 and are in line with previously reported results [16]. A summary of the results is shown in Table 1. Details of endogenous reporter peptides are given in supplementary Table 1.
Protein | Symbol | Accession number | MW [kDa] | Number of selected peptides |
---|---|---|---|---|
Albumin | ALB | P02768 | 66 | 1 |
Complement C3 | C3 | P01024 | 187 | 2 |
Factor II | F2 | P00734 | 70 | 4 |
Factor XIII A chain precursor | F13 | P00488 | 83 | 2 |
Fibrinogen alpha chain | FGA | P02671 | 95 | 29 |
Fibrinogen beta chain | FGB | P02675 | 56 | 13 |
L-Fucose Kinase | FUK | Q8N0W3 | 118 | 3 |
Inter-alpha-trypsin inhibitor heavy chain H4 | ITIH4 | Q14624 | 103 | 3 |
Kininogen-1 | KNG1 | P01042 | 72 | 3 |
Thymosin beta-4 | TMSB4X | Q0P5T0 | 7 | 2 |
Table 1: List of protein precursors of endogenous peptides used as decay markers.
Reproducibility
To monitor the technical reproducibility of quantitative peptide analysis, one sample was measured 6 times consecutively. Eleven endogenous decay markers with different signal intensities that ranged over approximately three orders of magnitude were selected for reproducibility testing (Figure 1). The highest median signal intensity of 693,242,168 [a.U.] was observed for m/z 733.33, whereas the lowest median signal intensity of 1,538,601 [a.U.] was observed for m/z 629.63. In every case the coefficient of variation (CV) was smaller than 13%, underlining the good reproducibility of peptide quantification with LC/MS [7].
Figure 1: Exemplary results from measurements of 11 decay markers. Each measurement was performed in six-fold repetition. The open squares and lines inside each box represent the mean and median values; the limits of each box represent the 25th and 75th percentiles. The whiskers represent the minimum and maximum values.
Kinetics
Most often a fast decrease of signal intensity was detected. However, a few peptides increased over time. The time-dependent changes of 62 endogenous decay markers are shown for three randomly chosen patients in Figure 2A. Values were log transformed prior to heat map visualization using the Excel 2013 software. The “overall pattern” of the three patients is rather similar, although absolute signal intensities show great interindividual variability (Figure 2B).
Figure 2: A) Time-dependent changes of 62 endogenous decay markers, shown as examples for three patients. The heat map was generated with Excel 2013 software. The red color represents high signal intensity whereas blue color represents low signal intensity of the respective peptides. Green arrows indicate the subgroup of eleven decay markers that were selected for testing of analytical reproducibility (Figure 1) and interindividual variability (Figure 2B). B) Signal intensities of eleven decay markers were extracted from 6 randomly chosen patients at 3 h. The line inside each box represents the median. The square inside each box represents the mean value. The limits of each box represent the 25th and 75th percentiles, and the whiskers represent the minimum and maximum values. Outliers are marked with asterisks.
Mathematical modelling
The training set initially comprised 210 time points from 30 colorectal cancer patients. After consistency testing, one cancer patient had to be excluded due to irregular kinetics of peptide decay that markedly deviated from the specimens of the other 29 patients (Figure 3A). For a good correlation, a pairwise plot of two samples must show a continuously increasing curve, as shown by way of example in Figure 3B.
As illustrated in Figure 4, the presented method gives more reliable results at early points in time. The positions of the violet vertical line (true time) and the blue vertical line (estimated time) are nearly identical, which indicates a high prediction accuracy (Figure 4A). In contrast, the inaccuracy of the estimated points in time increases over time, as shown in Figure 4B. From each specimen we select the highest correlation value for determination of the respective point in time. An example of the distribution of the highest correlation values used to estimate the time point presented in Figure 4A is illustrated in the boxplot in Figure 5. The resulting approximate range of the highest correlation values used for the estimation is 0.83–0.97. More than 100 data points were generated, with the 29 patients serving as training set and data from a further 20 patients as test set. The data from the test set were blinded prior to further analysis. Estimated points in time were plotted against the real points in time, and the coefficient of determination (R2) had a high value of 0.89 (Figure 6A). However, the absolute error increased markedly for older specimens (Figure 6B).
Figure 4: Examples of the estimation of points in time. Plotted points represent the computed correlation coefficients of the point in time we intend to estimate with all available samples of the training set. Each color represents a different patient. The average of the selected points in time with the highest correlation from each patient of the training set is displayed as a blue line representing the resulting estimated time. A) The estimated point in time (blue line) is almost identical to the real point in time (1.8 h; violet line). B) The estimated point in time (blue line) deviates from the real point in time (30.2 h; violet line) [13].
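The two summary measures reported for the validation, the coefficient of determination for the linear regression of true and estimated points in time and the absolute error per specimen, can be computed as in the following minimal sketch. The function names are hypothetical, and R2 is taken here as the squared Pearson correlation, which equals the coefficient of determination of a simple linear regression with intercept.

```python
def r_squared(true_hours, estimated_hours):
    """Coefficient of determination of a simple linear regression (squared Pearson r)."""
    n = len(true_hours)
    mean_x = sum(true_hours) / n
    mean_y = sum(estimated_hours) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(true_hours, estimated_hours))
    sxx = sum((x - mean_x) ** 2 for x in true_hours)
    syy = sum((y - mean_y) ** 2 for y in estimated_hours)
    return sxy ** 2 / (sxx * syy)


def absolute_errors(true_hours, estimated_hours):
    """Absolute deviation between estimated and true time for each specimen, in hours."""
    return [abs(e - t) for t, e in zip(true_hours, estimated_hours)]
```

Splitting the output of `absolute_errors` by true age (for example below and above 8 h) would reproduce the kind of comparison between fresh and older specimens described in the text.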
Proteomics approaches are powerful tools for biomarker discovery and thereby can improve early detection, staging, therapeutic monitoring and prognosis of various diseases including cancer [17]. Despite much progress in the field, the introduction of new biomarkers for routine diagnostic applications is rare [18]. In contrast, the number of possible biomarker candidates from a multitude of proteomic profiling studies is rather high. However, most of these biomarkers could not be validated independently and consequently had to be classified as false positives [5]. Most importantly, the preanalytical bias in sample handling has been identified as one of the most serious threats to profiling experiments [2,19]. Specifically, mass spectrometry-based proteomic profiling approaches are prone to preanalytical interference factors that profoundly affect the peptide profiles of blood specimens [20,21].
Most attempts for quality assurance of proteomic specimens are focusing on the standardization of preanalytical steps that comprise blood collection, transport, centrifugation and storage. Multiple working groups have defined standard operating procedures for documentation of preanalytical sample handling. These include ‘Biospecimen Reporting for Improved Study Quality’ (BRISQ) [22], ‘Standard PREanalytical Code’ (SPREC) [23] and ‘Standardisation and Improvement of Generic Pre-analytical Tools and Procedures for In Vitro Diagnostics’ (SPIDIA) [http://www.spidia.eu/]. The documentation of sample collection, processing and storage is important. However, this might be insufficient as many factors that influence sample quality can hardly be controlled under routine laboratory conditions [24]. Furthermore, the in depth documentation of various preanalytical conditions as proposed by Lehmann et al. [23] is rather laborious and might be missing for specimens collected within a clinical biobanking setup.
Accordingly, direct analysis of sample quality (DASQ) seems to be an attractive alternative [25] and is mandatory for transcriptomic analyses [26]. However, DASQ prior to proteomics experiments is not feasible up to now [6].
We hypothesized that the preanalytical time span can be monitored by quantification of proteolytically derived fragments of endogenous proteins/peptides that are generated in serum in a time-dependent manner. However, there are two major hurdles that have to be considered. First, the proteolytic activity that is related to the processing of decay markers should have ‘housekeeping quality’, meaning that it should not be related to any state of disease including cancer and inflammation. Second, the interindividual variability of protein concentrations [27] and respective polymorphisms [28] should be taken into account.
Regarding the ‘housekeeping quality’ of exoproteolytic activity in serum and plasma specimens, we previously demonstrated that the processing of an exogenous decay marker is independent of the disease state when healthy controls and colorectal cancer patients were compared [7]. In this study, 62 peptides were selected as endogenous decay markers in healthy individuals and were subsequently validated in a set of colorectal cancer patients; the agreement of the results indicates a disease-independent pattern of decay markers. Peptides were quantified using a label-free approach that has clear limitations when compared to exact quantification with stable isotope labeled (SIL) peptides as internal standard. However, the analytical reproducibility of our method is fit for purpose, as the CV of analytical reproducibility (Figure 1) is clearly below the biological variability (Figure 2B). Generally, the great biological variability of proteins and their peptides in serum is a mathematical challenge. However, robust rank correlation measures showed sufficient accuracy for classification of specimens from the test set of serum specimens that were aged under controlled conditions but blinded prior to analysis. The training set data were taken at discrete times. This might be a limitation, as the predicted times of the test set were continuous. Further studies should clarify whether classification accuracy can be improved further by continuous data in the training set.
The kinetics of peptide degradation is complex, and most often the absolute changes of decay marker concentrations decrease as the preanalytical time span is prolonged (Figure 2A). Accordingly, changes in signal intensities from severely aged samples (>24 h) are rather small when compared to moderately aged specimens (<8 h). The algorithm was optimized for an incubation time <8 h (Figure 6B), as this is the time period of most interest. A given sample that has been aged for thirty or fifty hours at room temperature is old in any case and in most cases much too old for further analysis. However, discrimination between e.g. three and six hours might be of high relevance for some investigations. Consequently, the classification error for more aged samples seems to be acceptable.
Taken together, preanalytical variability is a long-known interference factor in laboratory testing [29]. There is rising awareness that research studies for biomarker identification are also critically affected by low sample quality [30,31]. Consequently, methods for direct analysis of sample quality are urgently needed [32]. The presented data from our proof-of-concept study are preliminary and have to be validated prospectively. Furthermore, a broader pattern of preanalytical variability including long-term storage [33] and freeze-thaw cycles [34] will have to be investigated in future studies.