DEVELOPMENT OF COMPUTATIONAL APPROACHES FOR THE ANALYSIS AND IDENTIFICATION OF POLYMORPHIC PEPTIDES
Polymorphism. Variant peptides. Custom database. Proteomics.
The proteomic approach enables large-scale studies of protein expression in different tissues and body fluids, aiming to identify and quantify the total protein content. Despite major advances in the field, protein identification remains a limiting step in proteomic analysis. Typically, a mass spectrometer is used to generate mass-to-charge (m/z) values for the samples, and a reference protein database (e.g., UniProt) is then searched to identify the proteins. However, relying on a reference database limits identification, because it does not contain the DNA variations that can alter the amino acid sequence, which leads to incorrect identifications or prevents identification altogether. Several custom databases incorporate such genetic variations. Although they yield good results, they considerably enlarge the search space, which becomes another problem in the identification process. This research therefore proposes the implementation of a database of polymorphic peptides that combines information from dbSNP and NCBI. A hypothetical sequence is then generated containing the mutated peptides of each protein, taking their allelic frequency into account. This process is complemented by an analysis of the peptides identified after the samples are submitted to the identification software. The reference database and the mutated-peptide database are searched in parallel, which reduces the search space and produces two outputs. The uniqueness of the database peptides is then checked and, where there is redundancy, the one with the best score is selected. The peptides identified using the mutated database are also classified according to mutation type, allelic frequency, and pathogenicity.
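The two core steps described above, deriving variant-only peptides and reconciling the two search outputs, can be illustrated with a minimal Python sketch. It is not the implementation used in this work. It assumes tryptic digestion (cleavage after K or R, except before P), a single amino acid substitution given as a 1-based position, search hits given as (spectrum id, peptide, score) tuples, redundancy meaning that the same spectrum is matched in both outputs, and higher scores being better. All function names and the example protein are hypothetical, and allelic frequency is not modelled.

```python
import re


def tryptic_digest(protein, min_len=6):
    """Split after K or R unless the next residue is P (assumed trypsin rule)."""
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    # Drop fragments that are too short to be identified reliably.
    return [p for p in peptides if len(p) >= min_len]


def variant_only_peptides(protein, position, alt, min_len=6):
    """Apply one substitution (1-based position) and keep only the peptides
    that are absent from the reference digest, so that the variant database
    adds as little search space as possible."""
    mutated = protein[:position - 1] + alt + protein[position:]
    reference = set(tryptic_digest(protein, min_len))
    return [p for p in tryptic_digest(mutated, min_len) if p not in reference]


def merge_best_hits(reference_hits, variant_hits):
    """Combine both search outputs, keeping the best-scoring hit per spectrum.

    Each hit is a (spectrum_id, peptide, score) tuple; higher score wins.
    Returns {spectrum_id: (peptide, score, source)}.
    """
    best = {}
    for source, hits in (("reference", reference_hits), ("variant", variant_hits)):
        for spectrum_id, peptide, score in hits:
            if spectrum_id not in best or score > best[spectrum_id][1]:
                best[spectrum_id] = (peptide, score, source)
    return best


if __name__ == "__main__":
    protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up sequence
    print(variant_only_peptides(protein, 5, "H"))
    ref_hits = [("s1", "TAYIAK", 30.0), ("s2", "QISFVK", 12.0)]
    var_hits = [("s1", "TAHIAK", 45.0)]
    print(merge_best_hits(ref_hits, var_hits))
```

Keeping only the peptides that differ from the reference digest is one simple way to limit the growth of the search space; a real pipeline would also need to handle INDELs, frameshifts, and missed cleavages.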
For the classification of the peptides, a machine learning approach was also developed that assigns them to the non-mutated, SNP, INDEL, and nonsense classes. Three datasets were used as test input: HapMap and samples of ovarian and colon cancer. As a result, 3,013 new peptides were identified using the polymorphic database, of which 82% were SAPs, 13% INDELs, 5% frameshifts, and less than 1% lost stop and UTR variations. Some of these mutations are related to nonsyndromic deafness, hypomyelination with brainstem and spinal cord involvement and leg spasticity, Gaucher’s disease, and breast cancer. For the ovarian cancer samples, 7,514 new peptides were identified: 72.9% SAPs, 21.8% frameshifts, 2.6% INDELs, and less than 1% lost stop and UTR variations. These mutations are also related to inflammatory bowel disease and focal segmental glomerulosclerosis. For the colon cancer samples, 3,965 new peptides were identified: 75.4% SAPs, 20.4% frameshifts, 3.3% INDELs, and less than 1% lost stop and UTR variations. These mutations are also associated with amyotrophic lateral sclerosis and acute fatty liver of pregnancy. Classification with the random forest algorithm achieved an accuracy above 89.7%. Our approach therefore appears promising with respect to the stated objective and applicable to analyses of new samples.
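The classification step can be sketched with scikit-learn's `RandomForestClassifier`. This is an illustration only: the three features (absolute length change, number of substituted residues, presence of a premature stop) and the toy training rows are invented for the example and are not the feature set or the data used in this work.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature vector per peptide:
# [absolute length change, substituted residues, premature stop (0/1)]
TRAINING = [
    ([0, 0, 0], "non-mutated"), ([0, 0, 0], "non-mutated"),
    ([0, 0, 0], "non-mutated"), ([0, 0, 0], "non-mutated"),
    ([0, 1, 0], "SNP"), ([0, 1, 0], "SNP"),
    ([0, 2, 0], "SNP"), ([0, 1, 0], "SNP"),
    ([1, 0, 0], "INDEL"), ([2, 0, 0], "INDEL"),
    ([3, 0, 0], "INDEL"), ([1, 0, 0], "INDEL"),
    ([4, 0, 1], "nonsense"), ([7, 0, 1], "nonsense"),
    ([2, 0, 1], "nonsense"), ([9, 0, 1], "nonsense"),
]

features = [row for row, _ in TRAINING]
labels = [label for _, label in TRAINING]

# A fixed random_state keeps the toy model reproducible between runs.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)


def classify(peptide_features):
    """Return the predicted class label for one feature vector."""
    return str(clf.predict([peptide_features])[0])


if __name__ == "__main__":
    print(classify([0, 1, 0]))
    print(classify([4, 0, 1]))
```

With real data, accuracy would be estimated on held-out peptides (for example by cross-validation) rather than on the training rows.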