ISSN: 0974-276X
Research Article - (2024) Volume 17, Issue 3
With the aim of presenting a learning approach derived from algebraic topology for protein structure prediction, we show how our quotient spaces can qualitatively give insight into how building good homomorphisms can help identify accurate neural networks. As an example of application, we use a model generated after extracting an algebraic invariant, in our case a persistent diagram, on biological data, encoding the first two homologies, H1 to H0, using a boundary operator; the algorithms originate from algebraic geometry. Two main algorithms are used: Buchberger's algorithm and Schreyer's algorithm.
Keywords: Neural networks; Persistent diagrams; Buchberger's algorithm; Schreyer's algorithm
The main idea of this paper is to reconstruct molecular shapes using an alternative to the commonly interpreted graph neural networks, which rely on geometric parameters to build artificial intelligence models, so that we can access a theoretical justification of the topological signature from our previous work and explore new topological models for application purposes [1]. We consider the matrix representation of a boundary operator defined from the set of edges to the set of vertices in the context of affine varieties, so that we can reconstruct the variety from an already defined algebraic topological space. Let us illustrate with a first example: the following is a filtered simplicial complex.
A quantification of the boundary operator obtained from the Gröbner and Buchberger algorithms, using ideals as basis generators to solve a hidden system of polynomial equations, would be:
The following two theorems play a central role in road-mapping the inverse of the boundary operator and also justify working in a commutative algebraic setting.
Theorem 1. (Strong Nullstellensatz) If $K$ is an algebraically closed field and $I$ is an ideal in $K[x_1, \dots, x_n]$, then
$$I(V(I)) = \sqrt{I}.$$
Theorem 2. (Ideal-Variety Correspondence) Let $K$ be an arbitrary field; the maps
$$I : \text{affine varieties} \longrightarrow \text{ideals}$$
and
$$V : \text{ideals} \longrightarrow \text{affine varieties}$$
are inclusion-reversing, and
$$V(I(V)) = V$$
for all affine varieties $V$. If $K$ is algebraically closed, then
$$I : \text{affine varieties} \longrightarrow \text{radical ideals}$$
and
$$V : \text{radical ideals} \longrightarrow \text{affine varieties}$$
are inclusion-reversing bijections and inverses of each other.

Our free resolution is guaranteed by the following theorem.

Theorem 3. The boundary of a boundary vanishes; that is,
$$\partial_i \circ \partial_{i+1} = 0.$$

Proof. We have
$$\partial_i \partial_{i+1}[e_0, \dots, e_{i+1}] = \sum_{k<j}(-1)^j(-1)^k [e_0, \dots, \hat{e}_k, \dots, \hat{e}_j, \dots, e_{i+1}] + \sum_{k>j}(-1)^j(-1)^{k-1} [e_0, \dots, \hat{e}_j, \dots, \hat{e}_k, \dots, e_{i+1}].$$
Then each face appears twice with opposite signs, so the sum
$$= 0.$$
Let us now detail the computational part of the preceding.
Persistent diagrams with different methods of construction
Let us consider the following diagram, from which we can derive a clear description of the class of linear statistical representations, with the product
$$X \times Y$$
as a universal component in a set-theoretic context.
It is now sufficient to consider the pushout of the preceding diagram, so the existence of our persistent diagram is guaranteed. Let us now involve more components to fulfill the definition; for that reason, and to exploit efficiently the theorems and proofs of the investigated theory, let us consider the functoriality of the main definition,
where ψ and φ are well-defined vertex mappings between the vertex sets of filtered simplicial complexes. We should also mention that no theoretical or applied framework is given in the literature for comparing the kernel density estimation construction of the persistent diagram with the alpha complex construction in topological data analysis.
To be able to visualize the filtration process, one needs to consider the pullback given by,
Then given a sequence of inclusions of topological spaces
and its homology groups, assumed tame, a persistent diagram up to isomorphism is given by the following;
The inclusions of topological spaces immediately induce inclusions between the quotient spaces, and we can now be sure of the greatest lower bound, which is
This gives a theoretical frame for constructing our confidence set intervals. Before getting into the proposed probabilistic models or the way they are written, we should mention that computer simulations have nowadays made the theoretical frame quite flexible but less deliberate, especially when a new theory is proposed; this is the case with persistent diagrams. We should also mention that persistent diagrams are either derived from a learning process or treated as functional summaries within a larger Hilbert space. For that reason, one should investigate how replicated persistent diagrams can be generated and what makes them different from other traditional constructions, principal component analysis being an example. To prove the existence and definition of a replicated persistent diagram, we solve the problem of replication by investigating the behaviour of a persistent diagram near its greatest lower bound, given by the following [2].
From the already defined inclusion of topological spaces, we derive the following commutative diagram;
then, using relative homology, we induce the following exact sequence;
This means the quotient can be defined for the whole inclusion, so a replicated persistent diagram is theoretically guaranteed. To fulfill the definition we consider the following diagram.
The uniqueness of our persistent diagram, which concludes the definition, depends on a factorization of the previous diagram through the functor h; going back to the previous relative sequence, we have
$p, q \in H_*$ are well-defined projections, which implies $H \in W$. It is now sufficient to prove $\mathrm{Im}\,p \in W$, or $\mathrm{Im}\,p \in H_l(X_a)/F_{l\,b,b} \in H_*$; the second inclusion is given by construction, for in $H_l(X_a)/F_{l\,b,b}$ every map calculates a homology within $F_{l\,b,b}$. We confirm that $W$ has the same topological degree as $P.D(X_{a+b})$, which gives the commutativity of the diagram. We conclude the uniqueness of $P.D(X_{a+b})$, and then of $P.D(X)$ for any topological space $X$ with some degree $p$.
That being said, the immediate way to start is to build a confidence set interval for
$$W_\infty\big(\widehat{PD}, PD\big),$$
where $\widehat{PD}$ is an estimate of the persistent diagram constructed from a sample and $W_\infty$ is the bottleneck distance. We consider for that reason the following theorem.
Theorem 4. Let $f, g : K \to \mathbb{R}$ be monotone functions. Then, for a homology dimension $k$, we have
$$W_\infty\big(D_k(f), D_k(g)\big) \le \|f - g\|_\infty.$$
We then bound
$$H(S, M),$$
where $H$ is the Hausdorff distance between the sample $S$ and the underlying manifold $M$, to obtain a bound on the bottleneck distance, with ε = Z(VectM*).
We can now easily define a $1 - \alpha$ confidence set interval for the bottleneck distance; that is, with $p_n$ an adequate statistical descriptor, the last step is to find $\alpha$ such that the corresponding bound holds. Then the set of persistent diagrams is given such that $C_n$ is the related confidence set.
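As a purely illustrative computational sketch, such a confidence set can be approximated by subsampling; the gudhi library, the bootstrap scheme, and the quantile rule below are our assumptions, not constructions from the text.

# Sketch: subsampling-based (1 - alpha) confidence bound for the bottleneck
# distance; gudhi and all numeric choices here are assumptions.
import numpy as np
import gudhi

def diagram(points, max_edge=8.0, dim=1):
    # Persistence diagram (dimension dim) of a Vietoris-Rips filtration.
    st = gudhi.RipsComplex(points=points, max_edge_length=max_edge) \
              .create_simplex_tree(max_dimension=dim + 1)
    st.persistence()
    d = st.persistence_intervals_in_dimension(dim)
    return d[np.isfinite(d[:, 1])] if len(d) else d

def bottleneck_confidence(points, alpha=0.05, n_boot=100, seed=0):
    # (1 - alpha) quantile of bottleneck distances between the diagram of
    # the full sample and the diagrams of bootstrap resamples.
    rng = np.random.default_rng(seed)
    full = diagram(points)
    dists = [gudhi.bottleneck_distance(
                 full, diagram(points[rng.choice(len(points), len(points))]))
             for _ in range(n_boot)]
    return np.quantile(dists, 1 - alpha)

points = np.random.default_rng(1).normal(size=(50, 3))  # stand-in point cloud
print(bottleneck_confidence(points))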
That being said, we obtain a confirmed theoretical frame to start the statistical study, which involves point clouds representing atoms lying in a high-dimensional space with a hidden locally Euclidean manifold. The next step consists of presenting algorithms derived from the result mentioned in the introduction: the persistent homology of a filtered complex is nothing but the regular homology of a graded module over a polynomial ring [3-6].
Polynomial solutions of boundary operators
Boundary and cycle modules: The concepts of boundaries and cycles are theoretically formalized in the previous definition of persistent homology: homology gives a description of the set of cycles by taking the quotient over the set of boundaries, which also means, by persistence, preserving the cycles that are not boundaries.
In our context, cycles are the significant topological signatures of all types, including loops and loops of loops, holes, cavities and so on. Let us now compute our homologies. As already mentioned in the introduction, the persistent homology of a filtered complex is nothing but the regular homology of a graded module over a polynomial ring; our module is defined over the $n$-graded polynomial ring
$$A^n = K[x_1, \dots, x_n]$$
with the standard grading; then
$$R = A^n.$$
Our vector of polynomials is written as $[a_1, \dots, a_m]^T$, where each $a_i$ is a polynomial; the matrix $M_{i+1}$ for $\partial_{i+1}$ has $m_i$ rows and $m_{i+1}$ columns, where $m_j$ stands for the number of $j$-simplices in the complex, and $a_i$ is the $i$-th column in $M_{i+1}$. Thus we can separate polynomials from the derived coefficients; let
where $a_i$ is the $i$-th column in $M_{i+1}$. One can now write a polynomial vector $a$ in a submodule in terms of some basis $A$, as in
to get a final result computing $\partial_{i+1}$. Things seem easier for the cycle submodule, which is a submodule of the polynomial module. As previously, this time $\partial_i$ has $m_{i-1}$ rows and $m_i$ columns,
where $a_i$ is the $i$-th column in the matrix. The set of all $[q_1, \dots, q_{m_i}]^T$ such that
is an $R$-submodule, which is the first syzygy module of $(a_1, \dots, a_{m_i})$. A set of generators of the latter would finish the task; then, finally, to compute our homologies it suffices to verify whether the generators of the syzygy submodule are in the boundary submodule.
Solving the problem of the boundary within a variety consists of solving for all edges and vertices within a set of polynomial equations without losing topological significance. The inverse inclusion gives an exact sequence for the boundary operators. The problem then takes the form of a free resolution, so we have the following computation.
Computation of homologies and rank invariant: Let us consider the polynomial module $R^m$ with the standard basis $e_1, \dots, e_m$, where $e_i$ is the standard basis vector with the constant polynomial 0 in all positions except 1 in position $i$; a monomial $m$ in $R^m$ is of the form $x^u e_i$ for some $i$, and we say $m$ contains $e_i$. For $u, v \in \mathbb{N}^n$, $u > v$ if the leftmost nonzero entry of $u - v \in \mathbb{Z}^n$ is positive; this gives a total order on $\mathbb{N}^n$. As an example, $(1, 4, 0) > (1, 3, 1)$ since $(1, 4, 0) - (1, 3, 1) = (0, 1, -1)$, whose leftmost nonzero entry is 1. For two monomials $x^u, x^v$ in $R$, $x^u > x^v$ if $u > v$, which gives a monomial order on $R$. We then extend the order to $R^m$ by setting $x^u e_i > x^v e_j$ if $i < j$, or if $i = j$ and $x^u > x^v$. Any $r \in R^m$ can be written in a unique way as a $K$-linear combination of monomials $m_i$,
where $c_i \in K$, $c_i \neq 0$, and the $m_i$ are ordered according to the monomial order. As an example, if we consider
Then we can write f in terms of the standard basis
We then extend operations such as least common multiple to monomials in R and Rm we summarize them by saying
$$m / n = x^u / x^v = x^{u-v}.$$
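To make the order concrete, here is a minimal pure-Python sketch of the lex comparison and of its extension to monomials $x^u e_i$ of $R^m$ (the function names are ours):

# Minimal sketch of the monomial order described above; names are ours.
def lex_greater(u, v):
    # u > v iff the leftmost nonzero entry of u - v is positive.
    for a, b in zip(u, v):
        if a != b:
            return a > b
    return False  # u == v

def module_greater(m1, m2):
    # Compare x^u e_i > x^v e_j: first i < j, then lex on exponents.
    (u, i), (v, j) = m1, m2
    return i < j if i != j else lex_greater(u, v)

# (1,4,0) > (1,3,1): the difference (0,1,-1) has leftmost nonzero +1.
assert lex_greater((1, 4, 0), (1, 3, 1))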
After a division, we get
So, if $r = 0$ then $a \in \langle A \rangle$; but a nonzero remainder proves nothing, so division alone is not a sufficient condition. For that reason we use a Gröbner basis: by forcing the leading terms to be equal we obtain a sufficient condition. For uniqueness and minimality, we reduce each polynomial in $G$ by replacing $g \in G$ with the remainder of $g/(G - g)$; then $\mathrm{im}\,\partial_{i+1}$ is well computed.
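The membership test just described can be illustrated with SymPy; the example polynomials below are our own, and only the technique (divide, then divide by a Gröbner basis) is from the text.

# Sketch: ideal membership via division vs. Groebner bases (SymPy).
from sympy import symbols, groebner, reduced

x, y = symbols('x y')
A = [x*y - 1, y**2 - 1]   # generators of <A>
f = x**2*y - y            # test polynomial, vanishing on V(A)

# Division by A alone leaves a nonzero remainder, which proves nothing.
_, r = reduced(f, A, x, y, order='lex')

# Division by a Groebner basis G of <A>: remainder 0 iff f is in <A>.
G = groebner(A, x, y, order='lex')
_, rG = reduced(f, list(G.exprs), x, y, order='lex')
print(r, rG)   # x - y, then 0: f lies in <A> despite the first remainder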
It remains to compute generators for the syzygy submodule; we compute a Gröbner basis
A = {a1,...,as}
for $\langle A \rangle$, where the ordering is the monomial one; we then follow the same process as for $\mathrm{im}\,\partial_{i+1}$ and get
with $g_k$ elements of the Gröbner basis. We now need a Gröbner basis for
$$\mathrm{SYZ}(a_1, \dots, a_s),$$
which can be obtained by using Schreyer's theorem, guaranteeing the existence of a Gröbner basis for
$$\mathrm{SYZ}(g_1, \dots, g_s);$$
we then use this basis to find generators of $\mathrm{SYZ}(a_1, \dots, a_s)$.
For a matrix representation we consider the elements $a_i$ and $g_i$ as columns of matrices $M_A$ and $M_G$ respectively; the two bases generate the same module, so there exist $A, B$ such that $M_G = M_A A$ and $M_A = M_G B$, where each column of $M_A$ is divisible by $M_G$ since $M_G$ is a Gröbner basis for $M_A$. We conclude that there is a column in $B$ for each column $a_i \in M_A$, obtained by division of $a_i$ by $M_G$. Let
$$S_1, \dots, S_t$$
be the columns of the $t \times t$ matrix $I_t - AB$. Then
$$\mathrm{SYZ}(a_1, \dots, a_t) = \langle A s_1, \dots, A s_j, S_1, \dots, S_t \rangle,$$
where the $s_i$ are the generators of $\mathrm{SYZ}(g_1, \dots, g_s)$.
Then $\ker \partial_i$ is computed. Finally, we need to compute the quotient $H_i$, given $\mathrm{im}\,\partial_{i+1} = \langle G \rangle$ and $\ker \partial_i = \mathrm{SYZ}(a_1, \dots, a_t)$. We divide every column in $\ker \partial_i$ by $\mathrm{im}\,\partial_{i+1}$, using the same process as in computing $\mathrm{im}\,\partial_{i+1}$; if the remainder is nonzero, we add it both to $\mathrm{im}\,\partial_{i+1}$ and to $H_i$, so that we count only unique cycles. For the previous bifiltration we obtain the following homogeneous matrix for $\partial_1$, where $M_{11}$ is obtained by quotienting,
and so on; the full matrix then has the form
To compute the rank invariant, we can use the multigraded approach; taking the previous bifiltration, the matrices for $\mathrm{SYZ}(G_1)$ and the Gröbner basis of $Z_1$ for $\partial_1$ are obtained as previously.
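As a hedged numerical cross-check on any such computation, the dimensions $\dim H_i = \dim \ker \partial_i - \operatorname{rank} \partial_{i+1}$ can be read off from the ranks of the boundary matrices over $\mathbb{Z}_2$; the example matrix below (a hollow triangle) is ours, not the one from the bifiltration.

# Sketch: Betti numbers from boundary-matrix ranks over Z2.
import numpy as np

def rank_gf2(M):
    # Rank of a 0/1 matrix over the field with two elements.
    M = (np.array(M) % 2).astype(np.uint8)
    rank = 0
    for c in range(M.shape[1]):
        pivot = next((r for r in range(rank, M.shape[0]) if M[r, c]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(M.shape[0]):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]
        rank += 1
    return rank

# d1 for a hollow triangle: three vertices, three edges, no 2-simplices.
d1 = [[1, 0, 1],
      [1, 1, 0],
      [0, 1, 1]]
r1 = rank_gf2(d1)
betti0 = 3 - r1          # dim ker d0 - rank d1, with d0 = 0
betti1 = (3 - r1) - 0    # dim ker d1 - rank d2 = (cols - rank) - 0
print(betti0, betti1)    # 1 connected component, 1 loop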
Multi-filtered dataset: In topological data analysis, a multifiltered data set can be defined as;
Definition 1. (S, {fj} j), where S is a finite set of d-dimensional points with n – 1 real-valued functions.
fj : S → R
defined on it, for n > 1. We assume our data is a multifiltered dataset (S, {fj}j).
In the following definitions, the calculations are made in a commutative algebraic setting. This induces an order on the multifiltration, which can be viewed as an action of a ring over a module, plus inclusion maps relating copies of vertices within complexes. We use the ring of polynomials to relate the chain groups in the different grades of the module, as follows;
with
For that purpose, let us detail the definition.
Definition 2. A p-dimensional simplex, or p-simplex, σp = [e0, e1, ..., ep], is the smallest convex set in a Euclidean space Rm containing the p+1 points e0, ..., ep;
Another interesting and explicit description of persistent homology via the visualization of barcodes can be found in [7]. We give here a concise, precise definition via the classification theorem.
Remark 1 (Persistence modules). We apply the “homology functor” to the filtered chain complexes, so we get our “homology groups” category [8,9]. This can be viewed as;
Where → denotes the inclusion map.
For a finite persistence module C with field F coefficients, that is, coefficients quantifying the filtration parameter over a field, a clear description is given in [10].
Definition 3. The p-persistent k-th homology group is
$$H_k^{l,p} = Z_k^l \,\big/\, \big(B_k^{l+p} \cap Z_k^l\big),$$
well defined since $B_k^{l+p}$ and $Z_k^l$ are subgroups of $C_k^{l+p}$.
Let us consider the previous bifiltration from the introduction; we assume the computations are carried out in
$$\mathbb{Z} \oplus \mathbb{Z},$$
and u1=(0, 2), u2=(0, 1), u3=(0, 0), u4=(1, 2), u5=(1, 1), u6=(1, 0), u7=(2, 2), u8=(2, 1), u9=(2, 0), u10=(3, 2), u11=(3, 1), u12=(3, 0) to be read from top to the bottom.
In this example, we have F4 in grade (0, 0),
F5 = x1 · F4 in grade (1, 0),
F6 = x2 · F5 = x1 x2 · F4 in grade (1, 1), and so on; then ∂1, as derived from the above,
can be computed.
Computation of homologies and rank invariant for atoms point cloud: The preceding gives a clear road map to start hypothesizing over a real data set. For that reason, let us consider a folding protein that consists of N particles, with the corresponding spatiotemporal complexity. We assume that our system can be described as a set of N nonlinear oscillators of dimension $R^{nN} \times R^+$, where n is the dimensionality of a single nonlinear oscillator. We use data from the freely accessible Protein Data Bank (PDB); the molecule under consideration has PDB ID 1cos. Our point cloud lies in $R^{3 \times 700}$, and the coordinates of the atoms are taken as the input of our multidimensional filtration (Figure 1).
Figure 1: The all-atom representation of an alpha helix.
At first sight we obtain the following topological signatures, which means our final result is a three-dimensional simplex (Figure 2).
Figure 2: Topological fingerprints of the molecule. Note: A) cois5 (dimension 0); B) cois5 (dimension 1); C) cois5 (dimension 2).
To simplify the task let us consider the alpha carbon atoms of our molecule (Figure 3).
Figure 3: Coarse-Grained (CG) representation of an alpha helix generated from a protein of Protein Data Bank (PDB) ID 1cos.
The topological signature was given in Figure 4.
Figure 4: Fingerprint of Coarse-Grained (CG) representation of an alpha helix generated from a protein of Protein Data Bank (PDB) ID 1cos. Note: A) 1cos (dimension 0); B) 1cos (dimension 1).
This means our final result at the end of the multifiltration is a one-dimensional simplex, with eighteen vertices at the beginning of the multifiltration: u1=(3, 20, 21), u2=(3, 19, 21), u3=(4, 21, 22), u4=(3, 21, 23), u5=(3, 19, 23), u6=(3, 20, 24), u7=(4, 21, 23), u8=(3, 22, 25), u9=(4, 20, 22), u10=(3, 21, 24), u11=(3, 22, 26), u12=(3, 23, 26), u13=(4, 25, 26), u14=(4, 24, 25), u15=(4, 19, 25), u16=(3, 23, 19), u17=(4, 23, 26), u18=(4, 22, 27).
After calculations, we get the following matrix,
which means the final shape conserves only one type of homology, with four loops as generators. Let us now involve more parameters: we consider decreasing radial basis functions, with general form
$$\sum_j \omega_{ij}\,\Phi\big(\|x_i - x_j\|; \eta_{ij}\big),$$
where $\omega_{ij}$ is associated with the atomic types; a generalized exponential kernel then has the form
$$\Phi\big(\|x_i - x_j\|; \eta_{ij}\big) = e^{-\left(\|x_i - x_j\|/\eta_{ij}\right)^k},$$
with $k > 0$. One can then construct the following matrix.
This matrix can easily be obtained following the division algorithm mentioned in the previous section, by considering the xyz coordinates of atoms as the input of the multifiltration; the result can then be used as the input for the persistent homology calculations, following the same process. This clearly shows the path to an easier extraction of the shape of a protein, since traditional methods use many complicated parameters to build matrices supposed to rebuild the geometric conformation, as in the case of molecular nonlinear dynamics and the flexibility-rigidity index involving exponential kernels with parameters. For a detailed investigation of the topology-function relationship paradigm of proteins, see [3,11].
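As a minimal NumPy sketch of this kernel construction (the values of η and k, and the random stand-in coordinates, are our assumptions):

# Sketch: generalized exponential kernel matrix from xyz coordinates.
import numpy as np

def exponential_kernel_matrix(coords, eta=6.0, k=2.0):
    # K[i, j] = exp(-(||x_i - x_j|| / eta)^k), k > 0.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-(dists / eta) ** k)

# Stand-in for the alpha carbon coordinates used above.
coords = np.random.default_rng(0).normal(size=(18, 3))
K = exponential_kernel_matrix(coords)
print(K.shape)  # (18, 18)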
As already mentioned in the previous section, a full description of persistent homology can be obtained from the fact that the persistent homology of a filtered complex is nothing but the regular homology of a graded module over a polynomial ring. The computation is also straightforward: a division algorithm, then the Buchberger algorithm, to seek generators and then bases (ideals) for the modules. The final step for a statistical analysis is a quantification of the result of the second section to figure out the so-called replicated persistent diagrams. We can observe the significant topological difference between tertiary structures and secondary structures in Figure 5; the interesting task would be a separation between the alpha helices and beta sheets. These can be literally expressed as the length of the coefficients lying in the matrix representation of our computed quotient $H_i$ [2,12].
Figure 5: Persistent diagrams generated from xyz distributions of the alpha carbon atoms. Note: A) Protein Data Bank (PDB) id 1cos; B) PDB id 2jox; C) PDB id 6idd and D) PDB id 1dgv.
The total loss function incorporating homology into the learning process is given by
$$L(\theta) = \sum_{i} \mathrm{Loss}\big(f(x_i; \theta), y_i\big) + \lambda \sum_{j} h_j,$$
where (a short numerical sketch follows the list below):
• L (θ) is the total loss function of the model.
• Loss (f (xi; θ), yi) is the standard loss function for the ith data point.
• f (xi; θ) is the model’s prediction for input xi with parameters θ .
• yi is the true label for the ith data point.
• hj is the homology coefficient for the jth feature or level.
• λ is the regularization parameter that controls the weight of the homology term.
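A minimal NumPy sketch of this loss, assuming a squared-error base loss and a linear homology penalty (both illustrative choices of ours):

# Sketch: homology-regularized total loss, with the terms listed above.
import numpy as np

def total_loss(preds, labels, h, lam=0.1):
    # L(theta) = sum_i Loss(f(x_i; theta), y_i) + lam * sum_j h_j
    data_term = np.sum((preds - labels) ** 2)  # stand-in for Loss(., .)
    return data_term + lam * np.sum(h)         # h_j: homology coefficients

preds = np.array([0.9, 0.2, 0.7])
labels = np.array([1.0, 0.0, 1.0])
h = np.array([2.0, 1.0])  # e.g. homology coefficients per level
print(total_loss(preds, labels, h))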
After running the model on our dataset, we get a folding process describing the behavior of different types of homologies through variation of our Gaussian probability distribution.
For a neural network with a single hidden layer, the learning function can be summarized as
$$y_{\mathrm{pred}} = \sigma\big(W_2\,\sigma(W_1 x + b_1) + b_2\big),$$
where,
• σ (z) is the activation function (e.g., sigmoid for binary classification, soft-max for multi-class classification).
• L(y, ypred) is the loss function (e.g., binary cross-entropy or categorical cross-entropy).
The parameter updates using gradient descent are given by
$$\theta \leftarrow \theta - \eta\,\frac{\partial L}{\partial \theta}, \qquad \theta \in \{W_1, b_1, W_2, b_2\},$$
where,
• η is the learning rate.
• ∂L/∂W and ∂L/∂b are the gradients of the loss with respect to the weights and biases (see the sketch below).
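The following is a minimal NumPy sketch of this learning function and of the gradient-descent updates above; the layer sizes, learning rate, and toy data are illustrative choices of ours.

# Sketch: single-hidden-layer network trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                         # e.g. xyz features
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

W1, b1 = 0.1 * rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(8, 1)), np.zeros(1)
eta = 0.5                                            # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    # Forward pass: y_pred = sigma(W2 sigma(W1 x + b1) + b2).
    H = sigmoid(X @ W1 + b1)
    y_pred = sigmoid(H @ W2 + b2)
    # Backward pass for binary cross-entropy with a sigmoid output.
    d_out = (y_pred - y) / len(X)
    dW2, db2 = H.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)
    # Updates: theta <- theta - eta * dL/dtheta.
    W1, b1, W2, b2 = W1 - eta * dW1, b1 - eta * db1, W2 - eta * dW2, b2 - eta * db2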
A statistical summary of the models is given (Table 1, Figures 5-7).
Metric | DeepCNF (2018) | DNN-Pred (2019)
---|---|---
Accuracy | ~80% | 75%-78%
Precision | High for residue-level prediction | Good for domain classification
Recall | High for secondary structure tasks | Good for protein interactions
Complexity | High (CNN+CRF) | Moderate (MLP)
Training data | Large PDB datasets | CASP and other benchmarks
Computational cost | High due to CRFs and CNNs | Moderate due to MLP
Note: DeepCNF: Deep Convolutional Neural Fields; DNN-Pred: Deep Neural Network-based Prediction; CNN: Convolutional Neural Network; CRF: Conditional Random Field; MLP: Multilayer Perceptron; PDB: Protein Data Bank; CASP: Critical Assessment of Structure Prediction.
Table 1: Comparison of DeepCNF and DNN-Pred models.
Figure 6: Comparison of Deep Convolutional Neural Fields (DeepCNF) and Deep Neural Network-based Prediction (DNN-Pred) models.
Figure 7: Behaviour of homologies through variation of the distribution.
Experimental procedures
In the all-atom model, all atoms are treated identically: each atom is associated with the same radius in the distance-based filtration. The stream is constructed for the point cloud data, namely the xyz coordinates of the all-atom representation (Figure 8). The size is not large enough to require a landmark selector, so we simply build a Vietoris-Rips stream. We could choose a finer filtration, but given limited computational power we stick with the value of 8. In this case a Vietoris-Rips complex is largely sufficient to decipher the topological fingerprints (a small data set), so there is no need to use a landmark selector, as can be seen in the code shown below [13-16].
Figure 8: Example of structure used in our training data set with Protein Data Bank (PDB) Id 6IDD.
>> size(ecos)
ans = 696 3
>> max_dimension = 3;
>> max_filtration_value = 8;
>> num_divisions = 1000;
>> stream = api.Plex4.createVietorisRipsStream(ecos, max_dimension, ... max_filtration_value, num_divisions);
>> num_simplices = stream.getSize()
num_simplices = 3259289
>> persistence = api.Plex4.getModularSimplicialAlgorithm(max_dimension, 2);
>> intervals = persistence.computeIntervals(stream);
>> options.filename = 'cois';
>> options.max_filtration_value = max_filtration_value;
>> options.max_dimension = max_dimension - 1;
>> options.side_by_side = true;
>> plot_barcodes(intervals, options);
We utilize the Coarse-Grained (CG) representation, with each amino acid represented by its Alpha Carbon (Cα) atom. The simplices constructed in this way are helpful for the detection of the helix structure, so the corresponding barcode is simplified. As for the last construction, a Vietoris-Rips stream is largely sufficient to decipher the topological features of our data, which consist of 18 points in 3-dimensional space. A part of the MATLAB® code is shown below.
>> load ecos1
>> size(ecos)
ans = 18 3
>> max_dimension = 2;
>> max_filtration_value = 23;
>> num_divisions = 1000;
>> stream = api.Plex4.createVietorisRipsStream(ecos, max_dimension, ... max_filtration_value, num_divisions);
>> options.filename = 'coiis2';
>> options.max_filtration_value = max_filtration_value;
>> options.max_dimension = max_dimension - 1;
>> persistence = api.Plex4.getModularSimplicialAlgorithm(max_dimension, 2);
>> options.side_by_side = true;
>> intervals = persistence.computeIntervals(stream);
>> plot_barcodes(intervals, options);
For the beta sheet construction, we use the following:
>> load finbeta
>> max_dimension = 2;
>> max_filtration_value = 5;
>> num_divisions = 1000
num_divisions = 1000
>> stream = api.Plex4.createVietorisRipsStream(beti001, max_dimension, ... max_filtration_value, num_divisions);
>> num_simplices = stream.getSize()
num_simplices = 149776
>> persistence = api.Plex4.getModularSimplicialAlgorithm(max_dimension, 2);
>> intervals = persistence.computeIntervals(stream);
>> options.filename = '1bet';
>> options.max_filtration_value = max_filtration_value;
>> options.max_dimension = max_dimension - 1;
>> options.side_by_side = true;
>> plot_barcodes(intervals, options);
Then we simplify, without losing topological significance, by using the following:
>> load bety
>> size(beti01)
ans = 24 3
>> max_dimension = 2;
>> max_filtration_value = 20;
>> num_divisions = 1000
num_divisions = 1000
>> stream = api.Plex4.createVietorisRipsStream(beti01, max_dimension, ... max_filtration_value, num_divisions);
>> num_simplices = stream.getSize()
num_simplices = 410
>> persistence = api.Plex4.getModularSimplicialAlgorithm(max_dimension, 2);
>> intervals = persistence.computeIntervals(stream);
>> options.filename = 'bet';
>> options.max_filtration_value = max_filtration_value;
>> options.max_dimension = max_dimension - 1;
>> options.side_by_side = true;
>> plot_barcodes(intervals, options);
# -*- coding: utf-8 -*-
"""Untitled1.ipynb"""
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import colors
from mpl_toolkits.mplot3d import Axes3D
from pylab import *
from array import array

!pip install pdbreader
import pdbreader
pdb = pdbreader.read_pdb("/content/2jox.pdb")

from google.colab import drive
drive.mount('/content/drive')
"""# New section"""
for key in pdb:
    print(key)

ATOM = pdb['ATOM']
type(ATOM)
for key in ATOM:
    print(key)

matrix = [ATOM['x'],
          ATOM['y'],
          ATOM['z']]
print(matrix)

table = np.empty((696, 3))
print(table)
for i in range(0, 696):
    table[i] = [np.array(row[i]) for row in matrix]
    print(table[i])

a = np.asarray(table)
np.savetxt("/content/matrixNOTordered.csv", a, delimiter=",")
tabEX = table
print(tabEX)
a = np.asarray(tabEX)
np.savetxt("/content/cc.csv", a, delimiter=",")
def arrange(tabEX, n, m):
    # Order the rows lexicographically by swapping adjacent rows.
    for i in range(0, m):
        line = tabEX[i] - tabEX[i + 1]
        for j in range(0, n):
            if line[j] < 0:
                temp = tabEX[i].copy()  # copy to avoid aliasing numpy rows
                tabEX[i] = tabEX[i + 1]
                tabEX[i + 1] = temp
                break
            elif line[j] > 0:
                break
            elif i < 18:
                print(tabEX[i])
            else:
                continue
    return tabEX

print(tabEX)
a = np.asarray(tabEX)
np.savetxt("/content/kk.csv", a, delimiter=",")
from typing import KeysView

tableau = np.empty((696, 3))
# Assuming the intended iteration is over all 696 rows; the original
# range(696, 3) is empty.
for i in range(0, 696):
    for k in range(0, 2):
        tableau[i][k] = tabEX[i][k] * tabEX[i][k + 1]
print(tableau)
Grobner = np.empty((696, 3))

def Grobnera(tableau, X, Y, Z):
    # Assuming the intended iteration is over all 696 rows, as above.
    for j in range(0, 696):
        for k in range(0, 2):
            Grobner[j][k] = tableau[j][k] / tableau[j][k + 1]
    return Grobner

"""# New section"""
a = np.asarray(Grobner)
np.savetxt("/content/kk.csv", a, delimiter=",")
tableau.shape
def gaussian_kernel_matrix(X, sigma):
    # Pairwise squared distances between the rows of X.
    distances = np.sum((X[:, np.newaxis] - X) ** 2, axis=-1)
    kernel_matrix = np.exp(-distances / (2 * sigma ** 2))
    return kernel_matrix

X = Grobner
sigma = 1
kernel_matrix = gaussian_kernel_matrix(X, sigma)
print(kernel_matrix)
a = np.asarray(kernel_matrix)
np.savetxt("/content/matrixNOTordered.csv", a, delimiter=",")
type(Grobner)
def heatmap2d(arr: np.ndarray):
    plt.imshow(arr, cmap='viridis')
    plt.colorbar()
    plt.show()

test_array = np.arange(200 * 200).reshape(200, 200)
heatmap2d(test_array)
Linking different types of homologies to figure out the full simulation of persistent homology will be done using a training parameter under a reinforcement learning approach. We build neural networks using the NumPy library in Python; with only two hidden layers, our model seems to hide greater strategies in capturing alpha shapes from a training data set of twenty matrices representing $H_i$ (Figure 9). To figure out probabilities, we use the softmax activation function $\sigma(z)_i = e^{z_i} / \sum_j e^{z_j}$ [10].
Figure 9: Each row is a sample in our data set.
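A minimal sketch of the softmax just mentioned; subtracting the maximum is a standard numerical-stability trick, not something from the text.

# Numerically stable softmax: sigma(z)_i = exp(z_i) / sum_j exp(z_j).
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))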
!pip install tensorflow biopython numpy pandas
import numpy as np
import pandas as pd
from Bio import PDB

def load_pdb_data(file_path):
    parser = PDB.PDBParser()
    structure = parser.get_structure('protein', file_path)
    atom_data = []
    for model in structure:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    atom_data.append([atom.get_name(), atom.coord[0],
                                      atom.coord[1], atom.coord[2]])
    atom_df = pd.DataFrame(atom_data, columns=['AtomName', 'X', 'Y', 'Z'])
    return atom_df

# Example usage
pdb_data = load_pdb_data('example.pdb')
print(pdb_data.head())
# Data preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def prepare_data(df, labels):
    features = df[['X', 'Y', 'Z']].values
    scaler = StandardScaler()
    features = scaler.fit_transform(features)
    # test_size=0.2 is an assumed split
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
    return X_train, X_test, y_train, y_test

# Example labels (you would need actual labels for your dataset)
labels = np.random.randint(0, 2, len(pdb_data))  # Example: binary classification
X_train, X_test, y_train, y_test = prepare_data(pdb_data, labels)
# Define and train the model
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_model(input_shape):
    model = Sequential([
        Dense(64, activation='relu', input_shape=(input_shape,)),
        Dropout(0.5),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Example usage
model = build_model(X_train.shape[1])
model.summary()

# Train the model (validation_split=0.2 is an assumed value)
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")
It was beyond the scope of this work to deal theoretically with the use of statistical tests on the set of barcodes, but the application shows clearly that the method can surpass a simple statistical approach: instead of conducting a molecular dynamics simulation, it is easier to use existing information from models to construct a quantified sequence of barcodes and then look for its convergence limit. We can find interesting productions in the literature, but none has fully exploited persistent homology as anything more than a statistical tool. An interesting attempt using dynamical distances was made by Peter Bubenik and collaborators, but it could not theoretically justify barcodes as statistical observations; instead, it gave birth to a new functional tool, persistence landscapes [17].
This work provides a complete roadmap for persistent homology and its application to protein structure design, prediction and analysis. Persistent homology is a powerful tool in the field of computational biology, allowing researchers to analyze complex biological structures with greater accuracy and efficiency. In order to fully understand the fundamental concept, the mathematical model is thoroughly explained. To familiarize the reader with the axiomatic idea, the full mathematical model is detailed, as already mentioned in the computational part, and stochastic support is given to our work, without going into the details of the calculations and the theoretical justifications.
Citation: Lamine Z, Mansouri MW, Mamouni MI (2024). A Mathematical Model for Protein Structure Prediction and Analysis. J Proteomics Bioinform. 17:673.
Received: 03-Sep-2024, Manuscript No. JPB-24-33809; Editor assigned: 05-Sep-2024, Pre QC No. JPB-24-33809 (PQ); Reviewed: 19-Sep-2024, QC No. JPB-24-33809; Revised: 26-Sep-2024, Manuscript No. JPB-24-33809 (R); Published: 03-Oct-2024, DOI: 10.35248/0974-276X.24.17.673
Copyright: © 2024 Lamine Z, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.