Journal of Proteomics & Bioinformatics
Open Access

ISSN: 0974-276X

Research Article - (2024) Volume 17, Issue 3

A Mathematical Model for Protein Structure Prediction and Analysis

Zakaria Lamine1*, Mohammed Wadia Mansouri1 and My Ismail Mamouni2
 
*Correspondence: Zakaria Lamine, Department of Mathematics, Ibnou Tofail University, Kenitra, Morocco


Abstract

With the aim of presenting a learning approach derived from algebraic topology for protein structure prediction, we show how our quotient spaces can qualitatively give insight into how building good homomorphisms helps identify accurate neural networks. As an example of application, we use a model generated after extracting an algebraic invariant, in our case a persistent diagram, from biological data, by encoding the first two homologies, from H1 to H0, through a boundary operator; the algorithms originate from algebraic geometry. Two main algorithms are used: Buchberger's algorithm and Schreyer's algorithm.

Keywords

Neural networks; Persistent diagrams; Buchberger's algorithm; Schreyer's algorithm

Introduction

The main idea of this paper is to reconstruct molecular shapes using an alternative to the commonly used graph neural networks, which rely on geometric parameters to build artificial intelligence models; this gives us access to a theoretical justification of the topological signature from our previous work and lets us explore new topological models for application purposes [1]. We consider the matrix representation of a boundary operator, defined from the set of edges to the set of vertices, in the context of affine varieties, so that we can reconstruct the variety from an already defined algebraic topological space. Let us illustrate with a first example: the following is a filtered simplicial table.

A quantification of the boundary operator, obtained from the Gröbner and Buchberger algorithms using ideals as basis generators to solve a hidden system of polynomial equations, would be;

Equation
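To make the computation concrete, here is a minimal sketch, assuming a toy polynomial system (not the one behind the filtration above), of how a Gröbner basis can be computed with Buchberger's algorithm using the sympy library in Python:

from sympy import groebner
from sympy.abc import x, y

# hypothetical polynomial system standing in for the hidden one
I = [x**2 + y**2 - 1, x*y - 1]

# Buchberger's algorithm with the lexicographic monomial order
G = groebner(I, x, y, order='lex', method='buchberger')
print(G)  # generators of a Groebner basis of the ideal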

The two following theorems play a central role in road-mapping the inverse of the boundary operator, and also justify working in a commutative-algebraic setting.

Theorem 1. (Strong Nullstellensatz) If K is an algebraically closed field and I is an ideal in K[x1, ..., xn], then

I(V(I)) = √I

Theorem 2. (Ideal-Variety Correspondence) Let K be an arbitrary field; the maps

affine varieties → ideals, V ↦ I(V)

and

ideals → affine varieties, I ↦ V(I)

are inclusion-reversing, and

V(I(V)) = V

for all affine varieties V. If K is algebraically closed, then the maps

affine varieties → radical ideals

and

radical ideals → affine varieties

are inclusion-reversing bijections, inverse to each other. Our free resolution is guaranteed by the following theorem.

Theorem 3. The boundary of a boundary vanishes; that is,

∂i ∘ ∂i+1 = 0

Proof. For an (i+1)-simplex σ = [v0, ..., vi+1] we have

∂i+1 σ = Σk (−1)^k [v0, ..., v̂k, ..., vi+1]

where v̂k denotes omission of the vertex vk. Then

∂i ∂i+1 σ = Σ(j<k) (−1)^j (−1)^k [v0, ..., v̂j, ..., v̂k, ..., vi+1] + Σ(j>k) (−1)^(j−1) (−1)^k [v0, ..., v̂k, ..., v̂j, ..., vi+1]

and each term of the first sum cancels against the corresponding term of the second, hence

∂i ∂i+1 σ = 0

Let us now detail the computational part of the preceding discussion.

Materials and Methods

Persistent diagram with different methods of construction

Let us consider the following revision, from which we can derive a clear description of the class of linear statistical representations;

X × Y

as universal components in a set theoretical context.

Equation

It is now sufficient to consider the pushout of the preceding diagram, so the existence of our persistent diagram is guaranteed. To involve more components and complete the definition, and to exploit efficiently the theorems and proofs of the investigated theory, let us consider the functoriality of the main definition,

Equation

with ψ, ϕ well-defined vertex mappings between the vertex sets of filtered simplicial complexes. We should also mention that no theoretical or applied framework is given in the literature for a comparison between the kernel density estimation construction and the alpha complex construction of the persistent diagram in topological data analysis.

To visualize the filtration process, one needs to consider the pullback given by,

Equation

Then given a sequence of inclusions of topological spaces

X0 ⊆ X1 ⊆ · · · ⊆ Xn = X

and its homology groups, quotiented according to their tame structure, a persistent diagram up to isomorphism is given by the following;

Equation

The inclusions of topological spaces immediately induce an inclusion between the quotient spaces; we can now be sure of the greatest lower bound, which is

Equation

This gives a theoretical frame in which to construct our confidence sets. Before getting into the proposed probabilistic models, or the way they are written, we should mention that computer simulations nowadays make the theoretical frame quite flexible but less deliberate, especially when a new theory is proposed; this is the case with persistent diagrams. We should also mention that persistent diagrams are either derived from a learning process or treated as functional summaries within a larger Hilbert space; for that reason, one should investigate how replicated persistent diagrams can be generated and what makes them different from other traditional constructions, principal component analysis for example. To prove the existence and definition of a replicated persistent diagram, we solve the problem of replication by investigating the behaviour of a persistent diagram near its greatest lower bound, given by the following [2].

Equation

From the already defined inclusion of topological spaces, we derive the following commutative diagram;

Equation

then, using relative homology, we induce the following exact sequence;

Equation

This means the quotient can be defined for the whole inclusion, so a replicated persistent diagram is theoretically guaranteed. To complete the definition, we consider the following diagram.

Equation

The uniqueness of our persistent diagram, concluding the definition, depends on a factorization of the previous diagram through the functor h; going back to the previous relative sequence, we have:

p, q ∈ H∗ are well-defined projections, which implies H ∈ W. It is now sufficient to prove Im p ∈ W, or Im p ∈ Hl(Xa)/Fl b,b ⊂ H∗; the second inclusion holds by construction, since in Hl(Xa)/Fl b,b every map computes a homology within Fl b,b. We confirm that W has the same topological degree as P.D(Xa+b), which gives the commutativity of the diagram. We conclude the uniqueness of P.D(Xa+b), hence of P.D(X) for any topological space X with some degree p.

That being said, the immediate way to start is to build a confidence set for Equation,

with Equation an estimate of the persistent diagram constructed from a sample,

where W is the bottleneck distance. For that reason we consider the following theorem.

Theorem 4. Let f, g : K → R be monotone functions. Then;

W∞(D(f), D(g)) ≤ ‖f − g‖∞

and for a fixed homology dimension k we have;

W∞(Dk(f), Dk(g)) ≤ ‖f − g‖∞

We then bound

H(S, M)

such that H is the Hausdorff distance;

H(S, M) = max{ sup(x∈S) inf(y∈M) ‖x − y‖ , sup(y∈M) inf(x∈S) ‖x − y‖ }

to obtain a bound on

Equation

with ε = Z(VectM*)

We can now define a 1 − α confidence set for the bottleneck distance;

Equation

that is;

Equation

With pn an adequate statistical descriptor of Equation, the last step is to find α such that

Equation

Then the set of persistent diagrams is given by

Equation, such that Cn is the confidence set related to Equation.

That being said, we have a confirmed theoretical frame in which to start the statistical study involving point clouds of atoms that lie in a high-dimensional space with a hidden locally Euclidean manifold. The next step consists of presenting algorithms derived from the result mentioned in the introduction: the persistent homology of a filtered complex is nothing but the regular homology of a graded module over a polynomial ring [3-6].
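As an illustration of this frame, the following is a minimal sketch, under our own assumptions (a subsampling bootstrap, and the third-party ripser and persim libraries, rather than anything used in this paper), of estimating a 1 − α confidence radius for the bottleneck distance between the diagram computed from a point cloud of atoms and the underlying diagram:

import numpy as np
from ripser import ripser
from persim import bottleneck

def bottleneck_confidence(X, alpha=0.05, n_boot=100, m=None, dim=1, seed=0):
    rng = np.random.default_rng(seed)
    m = m or len(X) // 2                      # subsample size (assumed choice)
    dgm = ripser(X, maxdim=dim)['dgms'][dim]  # diagram of the full sample
    dists = []
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=m, replace=False)
        sub = ripser(X[idx], maxdim=dim)['dgms'][dim]
        dists.append(bottleneck(dgm, sub))    # bottleneck distance W
    return np.quantile(dists, 1 - alpha)      # candidate radius cn

# Usage: diagrams within distance cn of the estimate are not distinguishable
# from it at level alpha.
# cn = bottleneck_confidence(X, alpha=0.05)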

Polynomial solutions of boundary operators

Boundary and cycle modules: The concepts of boundary and cycle are formalized in the previous definition of persistent homology: homology gives a description of the set of cycles via the quotient over the set of boundaries, which also means that persistence preserves the cycles that are not boundaries.

Hi = Ker ∂i / Im ∂i+1

In our context, cycles are the significant topological signatures of all types, including loops and loops of loops, holes, cavities and so on. Let us now compute our homologies. As already mentioned in the introduction, the persistent homology of a filtered complex is nothing but the regular homology of a graded module over a polynomial ring; our module is defined over the n-graded polynomial ring;

Equation

with standard grading

Equation

then

R = An

Then our vector of polynomials is written as [a1, ..., am]^T, where each ai is a polynomial. The matrix Mi+1 for ∂i+1 has mi rows and mi+1 columns, where mj stands for the number of j-simplices in the complex, and ai is the i-th column in Mi+1; thus we can separate polynomials from the derived coefficients. Let

Equation

where ai is the i-th column in Mi+1. One can now write a polynomial vector a in a submodule in terms of some basis A, as in

Equation

to get a final result computing ∂i+1. Things seem easier for the cycle submodule, which is a submodule of the polynomial module. As previously, this time ∂i has mi−1 rows and mi columns,

Equation

where ai is the i-th column in the matrix. The set of all [q1, ..., qmi]^T such that

Equation is an R-submodule of Equation, which is the first Syzygy module of (a1, ..., ami). A set of generators of the latter would finish the task; then, finally, to compute our homologies it suffices to verify whether the generators of the Syzygy submodule are in the boundary submodule.

Solving the problem of the boundary within a variety would consist of solving all edges and vertices within a set of polynomial equations without losing topological significance. The inverse inclusion would give an exact sequence for the boundary operators. The problem then takes the form of a free resolution, so we have the following computation.

Computation of homologies and rank invariant: Let us consider the polynomial module Rm with the standard basis e1, ..., em, where ei is the standard basis vector with the constant polynomial 0 in all positions except 1 in position i; a monomial m in Rm is of the form x^u ei for some i, and we say m contains ei. For u, v ∈ N^n, u > v if the leftmost nonzero entry of u − v ∈ Z^n is positive; this gives a total order on N^n. As an example, (1, 4, 0) > (1, 3, 1), since (1, 4, 0) − (1, 3, 1) = (0, 1, 0) and the leftmost nonzero entry is 1. For two monomials x^u, x^v in R, x^u > x^v if u > v, which gives a monomial order on R. We then extend the order to Rm by setting x^u ei > x^v ej if i < j, or if i = j and x^u > x^v. Any r ∈ Rm can be written in a unique way as a k-linear combination of monomials mi;

r = c1 m1 + c2 m2 + · · · + cs ms

where ci ∈ K, ci ≠ 0, and the mi are ordered according to the monomial order. As an example, if we consider

Equation

Then we can write f in terms of the standard basis

Equation
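The order just described is easy to implement; the following is a small sketch of our own (not the authors' code) comparing exponent vectors, and module monomials x^u ei represented as pairs (i, u):

def exp_gt(u, v):
    # x^u > x^v iff the leftmost nonzero entry of u - v is positive
    for a, b in zip(u, v):
        if a != b:
            return a > b
    return False

def mod_gt(mono1, mono2):
    # compare module monomials (i, u) = x^u e_i: lower index i wins first
    (i, u), (j, v) = mono1, mono2
    return i < j or (i == j and exp_gt(u, v))

print(exp_gt((1, 4, 0), (1, 3, 1)))      # True, as in the example above
print(mod_gt((0, (1, 0)), (1, (5, 5))))  # True: e1 beats e2 regardless of exponents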

We then extend operations such as the least common multiple to monomials in R and Rm; we summarize them by writing

m / n = x^u / x^v = x^(u−v)

After a division, we get

Equation

So if r = 0 then a ∈ ⟨A⟩; the converse may fail, so division alone is not a sufficient test. For that reason we use a Gröbner basis: by forcing the leading terms to be equal we get a sufficient condition. For uniqueness and minimality, we reduce each polynomial in G by replacing g ∈ G by the remainder of the division g/(G − g); then im ∂i+1 is well computed.
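For illustration, here is a minimal sketch, with an assumed toy ideal (not the paper's data), of this membership test via division by a Gröbner basis, again using sympy:

from sympy import groebner, reduced
from sympy.abc import x, y

A = [x**2 - y, x*y - 1]             # hypothetical generators
G = groebner(A, x, y, order='lex')
f = x**3 - 1                        # equals x*(x**2 - y) + (x*y - 1), so f is in <A>
q, r = reduced(f, list(G), x, y, order='lex')
print(r)  # r == 0 certifies membership; with a Groebner basis this is also necessary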

It remains to compute generators for the Syzygy submodule; for this we compute a Gröbner basis

A = {a1,...,as}

for ⟨A⟩, where the ordering is the monomial one; we then follow the same process as for im ∂i+1 and get

Equation

with gk the elements of the Gröbner basis. We now need a Gröbner basis for

SYZ (a1,...,as)

which can be obtained by using Schreyer’s theorem, guaranteeing the existence of

Equation

and we then use this basis to find generators of

SYZ (g1,...,gs)

For a matrix representation, we consider the elements ai and gi of S as the columns of matrices MA and MG respectively; the two bases generate the same module, so there exist A, B such that MG = MA·A and MA = MG·B, where each column of MA is divisible by MG, since MG is a Gröbner basis for MA. We conclude that there is a column in B for each column ai of MA, obtained by dividing ai by MG. Let

S1,...,St

be the columns of the t × t matrix It − AB. Then;

SYZ(a1, ..., at) = ⟨A·Sij, S1, ..., St⟩

Then Ker ∂i is computed. Finally, we need to compute the quotient Hi, given im ∂i+1 = ⟨G⟩ and Ker ∂i = SYZ(a1, ..., at). We divide every column in Ker ∂i by im ∂i+1, using the same process as in computing im ∂i+1; if the remainder is nonzero, we add it both to im ∂i+1 and to Hi, so that we count only unique cycles. For the previous bifiltration we obtain the following homogeneous matrix for ∂1, so M11 is obtained by quotienting

Equation

and so on; the full matrix then has the form

Equation

To compute the rank invariant, we can use the multigraded approach; taking the previous bifiltration, the matrices for SYZ(G1) and the Gröbner basis of Z1 for ∂1 are obtained as previously.

Multi-filtered dataset: In topological data analysis, a multifiltered data set can be defined as;

Definition 1. (S, {fj} j), where S is a finite set of d-dimensional points with n – 1 real-valued functions.

fj : S → R

defined on it, for n > 1. We assume our data is a multifiltered dataset (S, {fj}j).

In the following definitions, the calculations are made in a commutative-algebraic setting; this induces an order on the multifiltration, which can be viewed as an action of a ring over a module, together with inclusion maps relating copies of vertices within complexes. We will use the ring of polynomials to relate the chain groups in the different grades of the module, as follows;

Equation

with

Equation

For that purpose, let us detail the definition.

Definition 2. A p-dimensional simplex, or p-simplex, σp = [e0, e1, ..., ep], is the smallest convex set in a Euclidean space Rm containing the p + 1 points e0, ..., ep;

σp = { Σi λi ei : λi ≥ 0, Σi λi = 1 }

Another interesting and explicit description of persistent homology, via the visualization of barcodes, can be found in [7]. We give here a concise, precise definition via the classification theorem.

Remark 1 (Persistence modules). We apply the “homology functor” to the filtered chain complexes, so we get our “homology groups” category [8,9]. This can be viewed as;

H∗(X0) → H∗(X1) → · · · → H∗(Xn)

where → denotes the map induced by the inclusion.

For a finite persistence module C with field F coefficients

Equation

that are the quantification of the filtration parameter over a field, with a clear description [10].

Definition 3. The p-persistence k-th homology group is

Hk^(l,p) = Zk^l / (Bk^(l+p) ∩ Zk^l)

well defined, since Bk^(l+p) and Zk^l are subgroups of Ck^(l+p).

Let us consider the previous bifiltration from the introduction; we assume the computations are in

Z ⊕Z

and u1=(0, 2), u2=(0, 1), u3=(0, 0), u4=(1, 2), u5=(1, 1), u6=(1, 0), u7=(2, 2), u8=(2, 1), u9=(2, 0), u10=(3, 2), u11=(3, 1), u12=(3, 0), to be read from top to bottom.

In this example, we have F4 in grade (0, 0),

F5 = x1·F4 in grade (1, 0),

F6 = x2·F5 = x1·x2·F4 in grade (1, 1), and so on; then ∂1, as from

Equation

can be computed

Equation
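As a sanity check on these boundary computations, here is a small numpy sketch of our own, with an assumed toy complex (a filled triangle, not the bifiltration above), verifying Theorem 3 numerically:

import numpy as np

# columns of d1: the edges ab, bc, ac written as (head - tail) over vertices a, b, c
d1 = np.array([[-1,  0, -1],
               [ 1, -1,  0],
               [ 0,  1,  1]])
# column of d2: the triangle abc, with boundary bc - ac + ab
d2 = np.array([[1],
               [1],
               [-1]])
print(d1 @ d2)                    # the zero vector: the boundary of a boundary vanishes
print(np.linalg.matrix_rank(d1))  # rank 2, so there is one independent 1-cycle, filled by d2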

Computation of homologies and rank invariant for an atomic point cloud: The foregoing gives a clear road map for hypothesizing over a real data set. For that reason, let us consider a folding protein that consists of N particles, with its corresponding spatiotemporal complexity. We assume that our system can be described as a set of N nonlinear oscillators of dimension R^(nN) × R+, where n is the dimensionality of a single nonlinear oscillator. We use data freely available from the Protein Data Bank (PDB); the molecule under consideration has ID 1cos. Our point cloud, of about 700 atoms, lies in R^3, and the coordinates of the atoms are the input of our multidimensional filtration (Figure 1).


Figure 1: The all-atom representation of an alpha helix.

At first sight we obtain the following topological signatures, which means our final result is a three-dimensional simplex (Figure 2).


Figure 2: Topological fingerprints of the molecule. Note: A) cois5 (dimension 0); B) cois5 (dimension 1); C) cois5 (dimension 2).

To simplify the task let us consider the alpha carbon atoms of our molecule (Figure 3).


Figure 3: Coarse-Grained (CG) representation of an alpha helix generated from a protein of Protein Data Bank (PDB) ID 1cos.

The topological signature is given in Figure 4.


Figure 4: Fingerprint of Coarse-Grained (CG) representation of an alpha helix generated from a protein of Protein Data Bank (PDB) ID 1cos. Note: A) 1cos (dimension 0); B) 1cos (dimension 1).

This means our final result at the end of the multifiltration is a one-dimensional simplex, with eighteen vertices at the beginning of the multifiltration; u1=(3, 20, 21), u2=(3, 19, 21), u3=(4, 21, 22), u4=(3, 21, 23), u5=(3, 19, 23), u6=(3, 20, 24), u7=(4, 21, 23), u8=(3, 22, 25), u9=(4, 20, 22), u10=(3, 21, 24), u11=(3, 22, 26), u12=(3, 23, 26), u13=(4, 25, 26), u14=(4, 24, 25), u15=(4, 19, 25), u16=(3, 23, 19), u17=(4, 23, 26), u18=(4, 22, 27).

We get after calculations the following matrix;

Equation

which means the final shape conserves only one type of homology, with four loops as generators. Let us now involve more parameters; we consider decreasing radial basis functions. The general form is;

Equation

where ωij is associated with the atomic types; a generalized exponential kernel then has the form;

Equation

with k > 0. One can then construct the following matrix.

Equation

This matrix can easily be obtained following the division algorithm mentioned in the previous section, by considering the xyz coordinates of atoms as the input of the multifiltration; the result can then be used as the input of the persistent homology calculations, following the same process. This clearly shows the path to an easier extraction of the shape of a protein, since traditional methods use many complicated parameters to build matrices meant to rebuild the geometric conformation, as in the case of molecular nonlinear dynamics and the flexibility-rigidity index involving exponential kernels with parameters. For more enlightenment, see the detailed investigations of the topology-function relationship paradigm of proteins [3,11].
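A minimal sketch, assuming the generalized exponential form Φ(r; η, k) = exp(−(r/η)^k) with k > 0 common in the flexibility-rigidity index literature (the paper's exact ωij are not reproduced here), of building such a matrix from atomic coordinates:

import numpy as np

def generalized_exponential_kernel(coords, eta=6.0, k=2.0):
    # eta would encode the atomic types omega_ij; here a single assumed scale
    diff = coords[:, None, :] - coords[None, :, :]
    r = np.linalg.norm(diff, axis=-1)   # pairwise distances r_ij
    return np.exp(-(r / eta) ** k)      # decreasing radial basis entries

# Usage on hypothetical alpha-carbon coordinates (an n x 3 array):
coords = np.random.rand(18, 3) * 10.0
M = generalized_exponential_kernel(coords)
print(M.shape)                          # (18, 18)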

Results

As we have already mentioned in the previous section, a full description of persistent homology can be obtained from the fact that the persistent homology of a filtered complex is nothing but the regular homology of a graded module over a polynomial ring. The computation is likewise straightforward: a division algorithm, then the Buchberger algorithm to seek generators and bases (ideals) for the modules. The final step of the statistical analysis is a quantification of the result of the second section, to figure out the so-called replicated persistent diagrams. We can observe the significant topological difference between tertiary structures and secondary structures in Figure 5; the interesting task would be a separation between the alpha helices and beta sheets. These can be literally expressed as the length of the coefficients lying in the matrix representation of our computed quotient Hi [2,12].


Figure 5: Persistent diagrams generated from xyz distributions of the alpha carbon atoms. Note: A) Protein Data Bank (PDB) id 1cos; B) PDB id 2jox; C) PDB id 6idd and D) PDB id 1dgv.

The total loss function incorporating homology into the learning process is given by (a code sketch follows the list of terms below);

Equation

Where

• L (θ) is the total loss function of the model.

• Loss (f (xi; θ), yi) is the standard loss function for the ith data point.

• f (xi; θ) is the model’s prediction for input xi with parameters θ .

• yi is the true label for the ith data point.

• hj is the homology coefficient for the jth feature or level.

• λ is the regularization parameter that controls the weight of the homology term.
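The following is a minimal sketch, under our own assumptions (precomputed coefficients hj and a binary cross-entropy base loss; none of this is the authors' code), of how such a homology-regularized loss could be wired into Keras:

import tensorflow as tf

def homology_regularized_loss(h, lam=0.01):
    # h: precomputed homology coefficients h_j; lam: the weight lambda
    h = tf.constant(h, dtype=tf.float32)
    base = tf.keras.losses.BinaryCrossentropy()
    def loss(y_true, y_pred):
        # standard loss term plus the weighted homology term; here the h_j are
        # fixed inputs, while in general they could depend on learned features
        return base(y_true, y_pred) + lam * tf.reduce_sum(h)
    return loss

# Usage: model.compile(optimizer='adam', loss=homology_regularized_loss([0.2, 0.5]))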

After running the model on our dataset, we get a folding process describing the behaviour of the different types of homologies under variation of our Gaussian probability distribution.

For a neural network with a single hidden layer, the learning function can be summarized as follows;

Equation

Where,

σ (z) is the activation function (e.g., sigmoid for binary classification, soft-max for multi-class classification).

L(y, ypred) is the loss function (e.g., binary cross-entropy or categorical cross-entropy).

The parameter updates using gradient descent are given by (a numpy sketch follows the list below);

Equation

Equation

Where,

• η is the learning rate.

• Equation are the gradients of the loss with respect to the weights and biases.
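A numpy sketch of our own (toy data, assumed layer sizes) of exactly this loop, with sigmoid activations, binary cross-entropy, and the updates W ← W − η ∂L/∂W, b ← b − η ∂L/∂b:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                      # toy inputs, e.g. xyz coordinates
y = rng.integers(0, 2, size=(32, 1)).astype(float)

W1, b1 = 0.1 * rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(8, 1)), np.zeros(1)
eta = 0.1                                         # learning rate

for _ in range(100):
    h = sigmoid(X @ W1 + b1)                      # forward pass, hidden layer
    y_pred = sigmoid(h @ W2 + b2)                 # output probabilities
    dz2 = (y_pred - y) / len(X)                   # cross-entropy + sigmoid gradient
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * h * (1 - h)              # backpropagate through sigmoid
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    W2 -= eta * dW2; b2 -= eta * db2              # gradient descent updates
    W1 -= eta * dW1; b1 -= eta * db1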

A statistical summary of the models is given (Table 1, Figures 5-7).

Metric | DeepCNF (2018) | DNN-Pred (2019)
Accuracy | ~80% | 75%-78%
Precision | High for residue-level prediction | Good for domain classification
Recall | High for secondary structure tasks | Good for protein interactions
Complexity | High (CNN+CRF) | Moderate (MLP)
Training data | Large PDB datasets | CASP and other benchmarks
Computational cost | High due to CRFs and CNNs | Moderate due to MLPs

Table 1: Comparison of DeepCNF and DNN-Pred models.


Figure 6: Comparison of Deep Convolutional Neural Fields (DeepCNF) and Deep Neural Network-based Prediction (DNN-Pred) models.


Figure 7: Behaviour of homologies through variation of the distribution.

Experimental procedures

In the all-atom model, all atoms are treated identically; each atom is associated with the same radius in the distance-based filtration. The stream is constructed for the point cloud data, that is, the xyz coordinates of the all-atom representation (Figure 8). The size is not too large enough to require a landmark selector, so we simply build a Vietoris-Rips stream. We could choose a finer filtration, but given our limited computational power we stick with the value of 8. In this case a Vietoris-Rips complex is largely sufficient to decipher the topological fingerprints (a small data set), so there is no need to use a landmark selector, as can be seen in the code shown below [13-16].


Figure 8: Example of structure used in our training data set with Protein Data Bank (PDB) Id 6IDD.

>> size(ecos)
ans = 696 3
>> max_dimension = 3;
>> max_filtration_value = 8;
>> num_divisions = 1000;
>> stream = api.Plex4.createVietorisRipsStream(ecos, max_dimension, ... max_filtration_value, num_divisions);
>> num_simplices = stream.getSize()
num_simplices = 3259289
>> persistence = api.Plex4.getModularSimplicialAlgorithm(max_dimension, 2);
>> intervals = persistence.computeIntervals(stream); % missing from the original transcript, needed by plot_barcodes below
>> options.filename = 'cois';
>> options.max_filtration_value = max_filtration_value;
>> options.max_dimension = max_dimension - 1;
>> options.side_by_side = true;
>> plot_barcodes(intervals, options);

We use the Coarse-Grained (CG) representation, with each amino acid represented by its Alpha Carbon (Cα) atom. The simplices thus constructed are helpful for the detection of the helix structure, so the corresponding barcode is simplified. For this last construction, a Vietoris-Rips stream is largely sufficient to decipher the topological features of our data, which consist of 18 points in 3-dimensional space. Part of the Matlab code is shown below.

>> load ecos1
>> size(ecos)
ans = 18 3
>> max_dimension = 2;
>> max_filtration_value = 23;
>> num_divisions = 1000;
>> stream = api.Plex4.createVietorisRipsStream(ecos, max_dimension, ... max_filtration_value, num_divisions);
>> persistence = api.Plex4.getModularSimplicialAlgorithm(max_dimension, 2);
>> intervals = persistence.computeIntervals(stream);
>> options.filename = 'coiis2';
>> options.max_filtration_value = max_filtration_value;
>> options.max_dimension = max_dimension - 1;
>> options.side_by_side = true;
>> plot_barcodes(intervals, options);

For the beta sheet construction, we use the following:

>> load finbeta
>> max_dimension = 2;
>> max_filtration_value = 5;
>> num_divisions = 1000
num_divisions = 1000
>> stream = api.Plex4.createVietorisRipsStream(beti001, max_dimension, ... max_filtration_value, num_divisions);
>> num_simplices = stream.getSize()
num_simplices = 149776
>> persistence = api.Plex4.getModularSimplicialAlgorithm(max_dimension, 2);
>> intervals = persistence.computeIntervals(stream);
>> options.filename = '1bet';
>> options.max_filtration_value = max_filtration_value;
>> options.max_dimension = max_dimension - 1;
>> options.side_by_side = true;
>> plot_barcodes(intervals, options);

Then we simplify, without losing topological significance, by using the following:

>> load bety
>> size(beti01)
ans = 24 3
>> max_dimension = 2;
>> max_filtration_value = 20;
>> num_divisions = 1000
num_divisions = 1000
>> stream = api.Plex4.createVietorisRipsStream(beti01, max_dimension, ... max_filtration_value, num_divisions);
>> num_simplices = stream.getSize()
num_simplices = 410
>> persistence = api.Plex4.getModularSimplicialAlgorithm(max_dimension, 2);
>> intervals = persistence.computeIntervals(stream);
>> options.filename = 'bet';
>> options.max_filtration_value = max_filtration_value;
>> options.max_dimension = max_dimension - 1;
>> options.side_by_side = true;
>> plot_barcodes(intervals, options);

# -*- coding: utf-8 -*-
"""Untitled1.ipynb"""

# pip install pdbreader

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import colors
from mpl_toolkits.mplot3d import Axes3D
from pylab import *
from array import array

import pdbreader

pdb = pdbreader.read_pdb("/content/2jox.pdb")

from google.colab import drive
drive.mount('/content/drive')

"""# New section"""

for key in pdb:
    print(key)

ATOM = pdb['ATOM']
type(ATOM)

for key in ATOM:
    print(key)

matrix = [ATOM['x'], ATOM['y'], ATOM['z']]
print(matrix)

table = np.empty((696, 3))
print(table)

# one row per atom: (x, y, z)
for i in range(0, 696):
    table[i] = [np.array(row[i]) for row in matrix]
    print(table[i])

a = np.asarray(table)
np.savetxt("/content/matrixNOTordered.csv", a, delimiter=",")

tabEX = table
print(tabEX)

a = np.asarray(tabEX)
np.savetxt("/content/cc.csv", a, delimiter=",")

def arrange(tabEX, n, m):
    # bubble-style pass ordering the rows lexicographically
    for i in range(0, m):
        line = tabEX[i] - tabEX[i + 1]
        for j in range(0, n):
            if line[j] < 0:
                temp = tabEX[i].copy()
                tabEX[i] = tabEX[i + 1]
                tabEX[i + 1] = temp
                break
            elif line[j] > 0:
                break
            elif i < 18:
                print(tabEX[i])
            else:
                continue
    return tabEX

print(tabEX)

a = np.asarray(tabEX)
np.savetxt("/content/kk.csv", a, delimiter=",")

# products of consecutive coordinates; the ranges (696, 3) in the original
# transcript are empty, so the loop below is the intended iteration
tableau = np.empty((696, 3))
for i in range(0, 696):
    for k in range(0, 2):
        tableau[i][k] = tabEX[i][k] * tabEX[i][k + 1]
print(tableau)

Grobner = np.empty((696, 3))

def Grobnera(tableau):
    # ratios of consecutive entries, feeding the Groebner-like reduction
    for j in range(0, 696):
        for k in range(0, 2):
            Grobner[j][k] = tableau[j][k] / tableau[j][k + 1]
    return Grobner

# call added; the original transcript saves Grobner without invoking Grobnera
Grobner = Grobnera(tableau)

a = np.asarray(Grobner)
np.savetxt("/content/kk.csv", a, delimiter=",")

tableau.shape

def gaussian_kernel_matrix(X, sigma):
    # pairwise squared distances between the rows of X
    distances = np.sum((X[:, np.newaxis] - X) ** 2, axis=-1)
    kernel_matrix = np.exp(-distances / (2 * sigma ** 2))
    return kernel_matrix

X = Grobner
sigma = 1
kernel_matrix = gaussian_kernel_matrix(X, sigma)
print(kernel_matrix)

a = np.asarray(kernel_matrix)
np.savetxt("/content/matrixNOTordered.csv", a, delimiter=",")

type(Grobner)

def heatmap2d(arr):
    plt.imshow(arr, cmap='viridis')
    plt.colorbar()
    plt.show()

test_array = np.arange(200 * 200).reshape(200, 200)
heatmap2d(test_array)

Linking the different types of homologies to figure out the full simulation of persistent homology will be done using a training parameter under a reinforcement learning approach. We build neural networks using the numpy library in Python; with only two hidden layers, our model seems to hide greater strategies in capturing alpha shapes from a data set of twenty matrices representing Hi as our training data (Figure 9). To figure out probabilities, we use the following softmax activation function [10].


Figure 9: Each row is a sample in our data set.

σ(z)i = e^(zi) / Σj e^(zj)
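A small numpy sketch of the softmax just written, with the usual max-subtraction for numerical stability:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shifting by max(z) leaves the result unchanged
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1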

# pip install tensorflow biopython numpy pandas

import numpy as np
import pandas as pd
from Bio import PDB

def load_pdb_data(file_path):
    parser = PDB.PDBParser()
    structure = parser.get_structure('protein', file_path)
    atom_data = []
    for model in structure:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    # the transcript is truncated here; we assume the three coordinates
                    atom_data.append([atom.get_name(), atom.coord[0],
                                      atom.coord[1], atom.coord[2]])
    atom_df = pd.DataFrame(atom_data, columns=['AtomName', 'X', 'Y', 'Z'])
    return atom_df

# Example usage
pdb_data = load_pdb_data('example.pdb')
print(pdb_data.head())

# Data preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def prepare_data(df, labels):
    features = df[['X', 'Y', 'Z']].values
    scaler = StandardScaler()
    features = scaler.fit_transform(features)
    # split ratio assumed; the original line is cut off
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
    return X_train, X_test, y_train, y_test

# Example labels (you would need actual labels for your dataset)
labels = np.random.randint(0, 2, len(pdb_data))  # Example: binary classes
X_train, X_test, y_train, y_test = prepare_data(pdb_data, labels)

# Define and train the model
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_model(input_shape):
    model = Sequential([
        Dense(64, activation='relu', input_shape=(input_shape,)),
        Dropout(0.5),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Example usage
model = build_model(X_train.shape[1])
model.summary()

# Train the model (validation fraction assumed; the original line is cut off)
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

Discussion

It was beyond the scope of this work to deal theoretically with the use of statistical tests on the set of barcodes, but the application clearly shows that the method can surpass a simple statistical approach: instead of conducting a molecular dynamics simulation, it is easier to use existing information from models to construct a quantified sequence of barcodes and then look for its convergence limit. We can find interesting productions in the literature, but none has fully exploited persistent homology beyond its use as a statistical tool. An interesting attempt using dynamical distances was made by Peter Bubenik and collaborators; it could not theoretically justify barcodes as statistical observations, but it instead gave birth to a new functional tool, persistence landscapes [17].

Conclusion

This work provides a complete road map for persistent homology and its application to protein structure design, prediction and analysis. Persistent homology is a powerful tool in the field of computational biology, allowing researchers to analyze complex biological structures with greater accuracy and efficiency. In order to make the fundamental concept fully understandable, the mathematical model is thoroughly explained; as already mentioned, the computational part gives stochastic support to our work, without entering into the details of the calculations and the theoretical justifications.

References

Author Info

Zakaria Lamine1*, Mohammed Wadia Mansouri1 and My Ismail Mamouni2
 
1Department of Mathematics, Ibnou Tofail University, Kenitra, Morocco
2Department of Mathematics, CRMEF, Rabat, Morocco
 

Citation: Lamine Z, Mansouri MW, Mamouni MI (2024). A Mathematical Model for Protein Structure Prediction and Analysis. J Proteomics Bioinform. 17:673.

Received: 03-Sep-2024, Manuscript No. JPB-24-33809; Editor assigned: 05-Sep-2024, Pre QC No. JPB-24-33809 (PQ); Reviewed: 19-Sep-2024, QC No. JPB-24-33809; Revised: 26-Sep-2024, Manuscript No. JPB-24-33809 (R); Published: 03-Oct-2024, DOI: 10.35248/0974-276X.24.17.673

Copyright: © 2024 Lamine Z, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.
