Inferring Chemical Reaction Patterns Using Rule Composition in Graph Grammars

Modeling molecules as undirected graphs and chemical reactions as graph rewriting operations is a natural and convenient approach to modeling chemistry. Graph grammar rules are most naturally employed to model elementary reactions like merging, splitting, and isomerisation of molecules. It is often convenient, in particular in the analysis of larger systems, to summarize several subsequent reactions into a single composite chemical reaction. We use a generic approach for composing graph grammar rules to define chemically useful rule compositions. We iteratively apply these rule compositions to elementary transformations in order to automatically infer complex transformation patterns. This is useful, for instance, to understand the net effect of complex catalytic cycles such as the Formose reaction. The automatically inferred graph grammar rule is a generic representative that also covers the overall reaction pattern of the Formose cycle, namely two carbonyl groups that can react with a bound glycolaldehyde to a second glycolaldehyde. Rule composition can also be used to study polymerization reactions as well as more complicated iterative reaction schemes. Terpenes and the polyketides, for instance, form two naturally occurring classes of compounds of utmost pharmaceutical interest that can be understood as "generalized polymers" consisting of five-carbon (isoprene) and two-carbon units, respectively.


Introduction
Directed hypergraphs [8] are a suitable topological representation of (bio)chemical reaction networks where (catalytic) reactions are hyperedges connecting substrate nodes to product nodes. Such networks require an underlying Artificial Chemistry [3] that describes how molecules and reactions are modeled. If molecules are treated as edge and vertex labeled graphs, where the vertex labels correspond to atom types and the edge labels denote bond types, then the structural change of molecules during chemical reactions can be modeled as graph rewriting [2]. In contrast to many other Artificial Chemistries, this approach respects fundamental rules of chemical transformations like mass conservation, atomic types, and cyclic shifts of electron pairs in reactions. In general, a graph rewrite (rule) transforms a set of substrate graphs into a set of product graphs. Hence the graph rewrite formalism not only delimits an entire chemical universe in an abstract but compact form but also provides a methodology for its explicit construction.
Most methods for the analysis of this network structure are directed towards this graph (or hypergraph) structure [1,8], which is described by the stoichiometric matrix S of the chemical system. Since S is essentially the incidence matrix of the directed hypergraph, algebraic approaches such as Metabolic Flux Analysis and Flux Balance Analysis [10] have a natural interpretation in terms of the hypergraph. Indeed typical results are sets of possibly weighted reactions (i.e., hyperedges) such as elementary flux modes [12], extreme pathways [11], minimal metabolic behaviors [9] or a collection of reactions that maximize the production of a desired product in metabolic engineering. The net reaction of a given pathway is simply the linear combination of the participating hyperedges.
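The statement that the net reaction is a linear combination of hyperedges can be illustrated with a minimal sketch. The species names A to D, the two toy reactions, and the function name net_reaction are hypothetical and chosen only for illustration; each reaction is a pair of dictionaries mapping species to stoichiometric coefficients.

```python
from collections import Counter

def net_reaction(reactions, weights=None):
    """Sum signed stoichiometric coefficients over a weighted pathway."""
    weights = weights or [1] * len(reactions)
    net = Counter()
    for (substrates, products), w in zip(reactions, weights):
        for species, coeff in substrates.items():
            net[species] -= w * coeff   # consumed
        for species, coeff in products.items():
            net[species] += w * coeff   # produced
    # Species with zero net change (e.g. catalysts) drop out.
    return {s: c for s, c in net.items() if c != 0}

# Hypothetical two-step pathway in which B acts as a catalyst.
pathway = [
    ({"A": 1, "B": 1}, {"C": 1}),
    ({"C": 1}, {"D": 1, "B": 1}),
]
print(net_reaction(pathway))  # {'A': -1, 'D': 1}
```

The catalyst B and the intermediate C cancel, which is exactly the information that the stoichiometric net reaction loses and that an effective transformation rule is meant to retain.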
In the setting of generative models of chemistry, each concrete reaction is not only associated with its stoichiometry but also with the transformation rule operating on the molecules that are involved in a particular reaction. Importantly, these rules are formulated in terms of reaction mechanisms that readily generalize to large sets of structurally related molecules. It is thus of interest to derive not only the stoichiometric net reaction of a pathway but also the corresponding "effective transformation rule". Instead of attempting to address this issue a posteriori, we focus here on the possibility of composing the elementary rules of chemical transformations to new effective rules that encapsulate entire pathways.
The motivation comes from the observation that string grammars are meaningfully characterized and understood by investigating the transformation rules. Consider, as a trivial example, the context-free grammar $G$ with the starting symbol $S$ and the rules $S \to aSa \mid B$ and $B \to \varepsilon \mid bB$. Inspecting this grammar we see that we can summarize the effect of the productions as $B \to b^k$, $k \geq 0$, and $S \to a^n B a^n$, $n \geq 0$. The language generated from $G$ is thus $\{a^n b^* a^n \mid n \geq 0\}$. Here we explore whether a similar reasoning, namely the systematic combination of transformation rules, can help to characterize the language of molecules that is generated by a particular graph rewriting chemistry. Similar to the example from term rewriting above, we should at the very least be able to recognize the regularities in polymerization reactions. We shall see below, however, that the rule-based approach holds much greater promise.
In this contribution we address two issues: First we establish the formal conditions under which chemical transformation rules can be meaningfully composed. To this end, we introduce in section 2 rule composition within the framework of concurrency theory. We then discuss the specific restrictions that apply to chemical systems, leading to the constructive approach to inferring composed rules in section 3.
The basic computational task we envision starts from an unordered set $R$ of reactions such as those forming a particular metabolic reaction pathway. To derive the effective transformation rule describing the pathway we need to find the correct ordering $\pi$ in which the transformation rules $p_i$, underlying the individual chemical reactions $\rho_i$, have to be composed. We illustrate this approach in some detail using the Formose reaction as an example in Section 4.

Graph Grammars and Rule Composition
Graph grammars, or graph rewriting systems, are proper generalizations of term rewrite systems. A wide variety of formal frameworks have been explored, including several different algebraic ones rooted in category theory. As a model of chemical transformations the so-called double pushout (DPO) formulation appears to be best suited. We refer to [5] for a comprehensive treatment. In the following sections we first outline the basic setup and then introduce full and partial rule composition.

Double Pushout and Concurrency
The DPO formulation of graph transformations considers transformation rules of the form $p = (L \xleftarrow{l} K \xrightarrow{r} R)$ where $L$, $R$, and $K$ are called the left graph, right graph, and context graph, respectively. The maps $l$ and $r$ are graph morphisms. The rule $p$ transforms $G$ to $H$, in symbols $G \xRightarrow{p,m} H$, if there is a pushout graph $D$ and a "matching morphism" $m : L \to G$ such that the following diagram is valid:
[Diagram: double-pushout diagram with top row $L \leftarrow K \rightarrow R$, bottom row $G \leftarrow D \rightarrow H$, and matching morphism $m : L \to G$.]
The existence of $D$ is equivalent to the so-called gluing condition, which determines whether the rule $p$ is applicable to a match in $G$. In the following we will also write $G \xRightarrow{p} H$ and $G \Rightarrow H$ for derivations, if the specific match or transformation rule is unimportant or clear from the context.
Concurrency theory provides a canonical framework for the composition of two graph transformations. Given two rules $p_1$ and $p_2$, an $E$-concurrent rule $p_1 *_E p_2$ can be defined whenever a dependency graph $E$ exists so that in the following diagram
[Diagram: composition of $p_1$ and $p_2$ via the dependency graph $E$, with cycles labeled (1), (2), and (3).]
the cycles (1) and (2) are pushouts, and (3) is a pullback, see e.g., [7]. We then have $q_l = s_1 \bullet w_1$ and $q_r = t_2 \bullet w_2$. The concurrency theorem [4] ensures that for any sequence of consecutive direct transformations $G \xRightarrow{p_1} H \xRightarrow{p_2} G'$ a graph $E$, a corresponding $E$-concurrent rule $p_1 *_E p_2$, and a morphism $m$ can be found such that $G \xRightarrow{p_1 *_E p_2,\, m} G'$.
In order to use graph transformation as a model for chemical reactions, additional conditions must be enforced. Most importantly, atoms are neither created, nor destroyed, nor transformed to other types. Thus only graph morphisms whose restriction to the vertex sets is bijective are valid in our context. In particular, the matching morphism $m$ always corresponds to a subgraph isomorphism in our context. The context graph $K$ thus is (isomorphic to) a subgraph of both $L$ and $R$, describing the part of $L$ that remains unchanged in $R$. Conservation of atoms means that the vertex sets of $L$, $K$, and $R$ are linked by bijections known as the atom mapping. When the atom mapping is clear, we therefore do not need to represent the context explicitly.
It is important to note that the existence of the matching morphism $m : L \to G$ alone is not sufficient to guarantee the applicability of the transformation. In our context, we require in addition that the transformation rule does not attempt to introduce an edge in $R$ that was already present before the transformation is applied. Formally, the gluing condition requires that $(l(x), l(y)) \notin L$ and $(r(x), r(y)) \in R$ implies $(m(l(x)), m(l(y))) \notin G$.
Fig. 1: Full composition of two rules requires that L2 is (isomorphic to) a subgraph of R1.
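The additional edge condition stated above can be checked with a short sketch. It assumes, for illustration only, that the atom mapping is the identity on vertex names, that edges are undirected and loop-free, and that bond labels can be ignored for this test; the names edge and violates_edge_condition are hypothetical.

```python
def edge(u, v):
    """Undirected edge between two distinct vertices."""
    return frozenset((u, v))

def violates_edge_condition(left_edges, right_edges, host_edges, match):
    """True if the rule would create an edge that the host graph already has."""
    for created in right_edges - left_edges:   # edges in R but not in L
        x, y = tuple(created)
        if edge(match[x], match[y]) in host_edges:
            return True
    return False

# Toy rule: keeps the edge 1-2 and creates the edge 2-3.
left = {edge(1, 2)}
right = {edge(1, 2), edge(2, 3)}
match = {1: "a", 2: "b", 3: "c"}

host_open = {edge("a", "b")}                     # b-c absent: rule applicable
host_closed = {edge("a", "b"), edge("b", "c")}   # b-c present: rule rejected

print(violates_edge_condition(left, right, host_open, match))    # False
print(violates_edge_condition(left, right, host_closed, match))  # True
```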

Full Rule Composition
In the following we will be concerned only with special, chemically motivated, types of rule compositions. In the simplest case the dependency graph $E$ is isomorphic to $R_1$; later we will also consider a more general setting in which $E$ is isomorphic to the disjoint union of $R_1$ and some connected components of $L_2$. For ease of notation, from now on we only refer to a rule composition, and not to a composition of morphisms as in Section 2, i.e., $p_1 *_E p_2$ will be denoted as $p_2 \bullet p_1$ (note that the order of the arguments changes). If $E \cong R_1$, then $L_2 \cong e_2(L_2)$ is a subgraph of $R_1$. Omitting the explicit references to the subgraph matching morphism $e_2$ we can simply view $L_2$ as a subgraph of $R_1$, as illustrated in Figure 1.
The rule composition thus amounts to a rewriting $R_1 \xRightarrow{p_2, e_2} R$, while the left side $L_1$ is preserved. We will use the notation $p_2 \bullet p_1$ and $G \xRightarrow{p_2 \bullet p_1} G'$ for this restricted type of rule composition, and call it full composition as the complete left side of $p_2$ is a subgraph of $R_1$. Note that $L_2$ may fit into $R_1$ in more than one way so that there may be more than one composite rule. Formally, the alternative compositions are distinguished by different matching morphisms $e_2$ in the diagram (2); we will return to this point below.

Partial Rule Composition
An important issue for the application to chemical reactions is that the graphs involved in the rules are in general not connected. Typical chemical reactions combine molecules, split molecules or transfer groups of atoms from one molecule to another. The transformation rules for all these reactions therefore require multiple connected components. For the purpose of dealing with these rules, we introduce the following notation for graphs and derivations. Let $Q$ be a graph with $\#Q$ connected components $Q_i$, $i = 1, \dots, \#Q$. It will be convenient to treat $Q$ as the multiset of its components. A typical chemical graph derivation, corresponding to a bi-molecular reaction, can be written in the form […], where we take the notation to imply that all graphs $G_i$ and $H_j$ are connected. We will furthermore insist that representations of chemical reactions are minimal in the following sense: If the left graph of the rule $p = (L \leftarrow K \rightarrow R)$ matches entirely within $G_1$, i.e., $m(L) \cap G_2 = \emptyset$, then $G_2$ can be omitted. (In a chemical rewriting grammar, then, one of the $H_i$ must be isomorphic to $G_2$, becoming redundant as well.) More formally, we say that a derivation $G \xRightarrow{p,m} H$ is proper if $m(L) \cap G_i \neq \emptyset$ for all $i = 1, \dots, \#G$. That is, a proper derivation cannot be simplified. If the derivation $G \xRightarrow{p} H$ is proper then $\#G \leq \#L$. The inequality comes from the fact that multiple components of $L$ may easily be matched to a single component of $G$ while each component of $L$ must match within a component of $G$.
The conditions for the $\bullet$ composition of rules are a bit too strict for our applications. We thus relax them to respect the component structure of left and right graphs. More precisely, we require that $E$ is isomorphic to a disjoint union of a copy of $R_1$ and some connected components of $L_2$ so that for every connected component $L_2^i$ of $L_2$ either $e_2(L_2^i) \subseteq e_1(R_1)$ or $e_2(L_2^i) \cap e_1(R_1) = \emptyset$ holds. For a rule composition of this type to be well defined we need that $\exists i$ such that $e_2(L_2^i) \subseteq e_1(R_1)$ holds. We remark that the latter condition could be relaxed further to lead to additional compositions for which left and right sides are disjoint unions.
The composition of $p_1 = (L_1, K_1, R_1)$ and $p_2 = (L_2, K_2, R_2)$ yields a rule $p_3 = (L_3, K_3, R_3)$ (Figure 2). Note that the right graph $R_3$ can no longer be regarded simply as a rewritten version of $R_1$ because rule $p_2$ now adds additional vertices to both the left and the right graph. The composite context $K_3$ contains only subsets of $K_1$ and $K_2$, but it is expanded by the vertices of $L_2^2$ and the edges of $L_2^2$ that remain unchanged under rule $p_2$.
Fig. 4: The (partial) composition of two rules is mediated by the dependency graph E and the two matching morphisms e1 and e2. Since these are subgraph isomorphisms in our case, E is simply the union e1(R1) ∪ e2(L2). The (partial) match e1(R1) ∩ e2(L2) can be understood as a matching µ between R1 and L2, i.e., as a 1-1 relation of the matching nodes and edges. Whenever an edge is matched, then so are its incident vertices.
An example of a full rule composition is shown in Fig. 3. The two rules in the example, which in this case are also chemical reactions, are part of the Formose grammar. The Formose grammar consists of two pairs of rules. The first pair of rules (from now on denoted as $p_0$ and $p_1$) implements both directions of the keto-enol tautomerism. One direction, $p_1$, is visualized in Fig. 3. The second pair (from now on denoted as $p_2$, $p_3$) is the aldol addition and its reverse, respectively. The reverse ($p_3$) is also visualized in Fig. 3. We see that the left side of $p_1$ is isomorphic to a subgraph of one of the components of the right side of $p_3$. Composing the two rules by subgraph matching yields a third rule, $p_1 \bullet p_3$.
In general, we require here that the connected components of $R_1$ and $L_2$ satisfy either $e_2(L_2^i) \subseteq e_1(R_1^j)$ or $e_1(R_1^j) \cap e_2(L_2^i) = \emptyset$. We furthermore exclude the trivial case of parallel rules in which only the second alternative is realized. In the other extreme, if all components $L_2^i$ satisfy $e_2(L_2^i) \subseteq e_1(R_1)$, the partial composition becomes a full composition. Formally, these alternatives are described by different dependency graphs $E$ and/or different morphisms $e_1$ and $e_2$. Pragmatically we can understand this as a matching $\mu$ of $L_2$ and $R_1$ as in Fig. 4. Specifying $\mu$ of course removes the ambiguity from the definition of the rule composition; hence we write $p_2 \bullet_\mu p_1$ to emphasize the matching $\mu$.

Constructing Rule Compositions
Given two rules, $p_1$ and $p_2$, it is not only interesting to know if a partial composition is defined, but also to create the set of all possible compositions explicitly. This set in particular also contains all full compositions. The following describes an algorithm for enumerating all partial compositions.

Enumerating the Matchings µ
The key to finding all compositions is the enumeration of all matchings $\mu$ that respect our restrictions on overlaps between connected components. We thus start from the sets $R_1^1, R_1^2, \dots, R_1^{\#R_1}$ and $L_2^1, L_2^2, \dots, L_2^{\#L_2}$ of connected components of $R_1$ and $L_2$, respectively. In the first step we find all subgraph matches $L_2^i \subseteq R_1^j$ (represented as the corresponding matchings $\mu_{ij}$) and arrange the result in a matrix of lists of subgraph matches, Fig. 5a.
The matching matrix is extended by a virtual column to account for the possibility that $L_2^i$ is not matched with any component of $R_1$. Every partial (and full) composition is now defined by a selection of one submatch from each row of the matrix, see Supplemental Material for an example. The converse is not true, however: Not every selection of matches corresponds to a partial composition. In particular, we exclude the case that only entries from the virtual column are selected. In addition, the submatches must be disjoint to ensure that the combined match is injective. The latter condition needs to be checked only when more than one submatch is selected from the same column.
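The selection procedure described above can be sketched as follows. This is an illustrative sketch, not the implementation used for the results: it assumes that the subgraph matches have already been computed and are given as dictionaries from vertices of $L_2$ to vertices of $R_1$, that vertex names are unique across components, and that match_matrix[i][j] lists the matches of $L_2^i$ in $R_1^j$. The function name enumerate_matchings and the toy vertex names are hypothetical. For simplicity, disjointness is checked over the whole selection rather than per column.

```python
from itertools import product

def enumerate_matchings(match_matrix):
    """Yield every admissible combined matching mu as a dict L2-vertex -> R1-vertex."""
    rows = []
    for row in match_matrix:
        options = [None]              # virtual column: component stays unmatched
        for matches in row:
            options.extend(matches)
        rows.append(options)
    for selection in product(*rows):  # one entry per component of L2
        chosen = [m for m in selection if m is not None]
        if not chosen:
            continue                  # only virtual entries: parallel rules, excluded
        images = [v for m in chosen for v in m.values()]
        if len(images) != len(set(images)):
            continue                  # overlapping submatches: not injective
        mu = {}
        for m in chosen:
            mu.update(m)
        yield mu

# Toy example: R1 has one component (path r0-r1-r2); L2 has two components.
m1 = {"x0": "r0", "x1": "r1"}   # first way to place component L2^1
m2 = {"x0": "r1", "x1": "r2"}   # second way to place component L2^1
m3 = {"y0": "r2"}               # only way to place component L2^2
matrix = [[[m1, m2]], [[m3]]]

matchings = list(enumerate_matchings(matrix))
print(len(matchings))  # 4
```

Of the six possible selections, the all-virtual one and the combination of m2 with m3 (both use r2) are discarded, leaving four matchings.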

Composing the Rules
The construction of the composition $p_2 \bullet_\mu p_1$ of two rules $p_1$ and $p_2$ does not explicitly depend on the component structure of $R_2$ and $L_1$ because it is uniquely defined by the matching $\mu$ and the bijections of the nodes of $L_i$, $K_i$, and $R_i$ for each of the two rules. We obtain $L$ by extending $L_1$ with the unmatched components of $L_2$ and $R$ by extending $R_2$ with the unmatched components of $R_1$. The corresponding extension of $\mu$ to a bijection $\hat{\mu}$ of the vertex sets of $L$ and $R$ is uniquely defined. The context $K$ of the composite rule simply consists of the common vertex set of $L$ and $R$ and all edges $(x, y)$ of $L$ for which $(\hat{\mu}(x), \hat{\mu}(y))$ is an edge in $R$. We note in passing that $\hat{\mu}$ defines the atom mapping of the composite transformation. The explicit construction of $(L, K, R)$ is summarized as Algorithm 3.1. The implementation of the algorithm naturally depends heavily on the representation of transformation rules, which in our implementation is the representation from the Graph Grammar Library (GGL) [6]. The representation is a single graph, with attached vertex and edge properties defining membership of $L$, $K$ and $R$, as well as the needed labels.
Not all matchings define valid rule compositions. For instance, consider an edge $(u, v)$ that is present in $R_1$ and $R_2$ but not in $L_2$, where both $u$ and $v$ are in $L_2$. This would amount to creating, by means of rule $p_2$, an edge which was already introduced by $p_1$. Since we do not allow parallel edges, we regard such inconsistencies as undefined cases and reject the matching. Note that a parallel edge does not correspond to a "double bond" (which essentially is only an edge with a specific type).
Algorithm 3.1: Composing p1 and p2 to p, by a given partial mapping
Input: p1 = (L1, K1, R1)
Input: p2 = (L2, K2, R2)
Input: µ, a partial matching between L2 and R1
Output: p = (L, K, R)
    p ← empty rule
    Copy vertices of p1 to p
    […]
        if v is not mapped by µ then
            Copy v to p
    […]
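The construction and the rejection of parallel edges can be sketched in a simplified setting. The following is not the GGL-based implementation and does not reproduce the missing steps of Algorithm 3.1; it assumes an identity atom mapping within each rule, omits atom labels, stores the edges of the left and right graph as dictionaries from undirected edges to bond labels, and takes $\mu$ as an injective dictionary from vertices of $p_2$ to vertices of $p_1$ that respects the component structure. The names Rule, compose, and the ("p2", v) renaming scheme are hypothetical.

```python
from dataclasses import dataclass

def edge(u, v):
    """Undirected edge between two distinct vertices."""
    return frozenset((u, v))

@dataclass
class Rule:
    vertices: set
    left: dict    # edge -> bond label, edges of L
    right: dict   # edge -> bond label, edges of R

    @property
    def context(self):
        """Edges of K: present with the same bond label in L and R."""
        return {e: b for e, b in self.left.items() if self.right.get(e) == b}

def compose(p1, p2, mu):
    """Return p2 composed after p1 along mu, or None if mu is rejected."""
    name = dict(mu)
    for v in p2.vertices:                  # fresh names for unmatched vertices
        if v not in mu:
            fresh = ("p2", v)
            while fresh in p1.vertices or fresh in name.values():
                fresh = ("p2", fresh)
            name[v] = fresh

    def renamed(edges):
        return {edge(*(name[v] for v in e)): b for e, b in edges.items()}

    l2, r2 = renamed(p2.left), renamed(p2.right)
    left = dict(p1.left)
    for e, bond in l2.items():
        if e in p1.right:
            if p1.right[e] != bond:
                return None                # bond labels disagree: not a match
        elif any(v in p1.vertices for v in e):
            return None                    # matched edge of L2 missing in R1
        else:
            left[e] = bond                 # edge of an unmatched component of L2
    right = dict(r2)
    for e, bond in p1.right.items():
        if e in l2:
            continue                       # rewritten by p2
        if e in r2:
            return None                    # p2 would create a parallel edge
        right[e] = bond                    # untouched by p2
    return Rule(set(p1.vertices) | set(name.values()), left, right)

# Toy rules: p1 moves a bond from 1-2 to 2-3, p2 turns a single into a double bond.
p1 = Rule({1, 2, 3}, {edge(1, 2): "-"}, {edge(2, 3): "-"})
p2 = Rule({"a", "b"}, {edge("a", "b"): "-"}, {edge("a", "b"): "="})
p = compose(p1, p2, {"a": 2, "b": 3})
print(p.left, p.right)

# A rule that creates the edge a-b cannot be matched onto the existing edge 2-3.
p_create = Rule({"a", "b"}, {}, {edge("a", "b"): "-"})
print(compose(p1, p_create, {"a": 2, "b": 3}))  # None
```

Working directly on edge dictionaries keeps the sketch short; a label-aware subgraph isomorphism test would still be needed to produce the matchings $\mu$ in the first place.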

Graph Binding
The composition of transformation rules, and thereby chemical reactions, makes it possible to create abstract meta-rules in a way that is similar to the combination of multiple functions into more abstract functions in functional programming. A related concept from (functional) programming that seems useful in the context of graph grammars is partial function application. Consider, for example, the binding of the number 2 to the exponentiation operator, yielding either the function $f(x) = 2^x$ or $f(x) = x^2$. In the framework of rule composition, we define graph binding as a special case.
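The exponentiation example can be written out with Python's functools.partial. The helper power and the names two_to_the and squared are introduced here only for illustration.

```python
from functools import partial

def power(base, exponent):
    """Exponentiation with named arguments so either one can be bound."""
    return base ** exponent

two_to_the = partial(power, base=2)    # f(x) = 2^x
squared = partial(power, exponent=2)   # f(x) = x^2

print(two_to_the(exponent=5))  # 32
print(squared(5))              # 25
```

Binding a graph to a rule plays the same role: one argument of the transformation is fixed in advance, and the result is again a rule.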
Let $G$ be a graph and $p_2 = (L_2, K_2, R_2)$ be a transformation rule. The binding of $G$ to $p_2$ results in the transformation rule $p = (L, K, R)$ which implements the partial application of $p_2$ on $G$. This is accomplished simply by regarding $G$ as a rule $p_1 = (\emptyset, \emptyset, G)$ and using partial composition: $p = p_2 \bullet p_1$. Note that if $p_2 \bullet p_1$ is a full composition, then $p$ can be regarded as a graph $H$ and $G \xRightarrow{p_2} H$ holds. Graph binding allows a simplified representation of reactions. For instance, we can use this formal construction to omit uninteresting, ubiquitously present molecules such as water by binding the graph of the water molecule to the transformation rule of a reaction that requires water. Similarly, graph unbinding can be defined as a transformation rule that destroys graphs. In a chemical application it can be used to avoid the explicit representation of uninteresting ubiquitous molecules such as the solvent.
[…] as their directionality can be used efficiently to prune the possible orders of rule compositions. The fact that multiple reactions are instantiations of the same transformation rule, as in the example discussed in detail in the next section, further reduces the search spaces.

Results and Discussion
We illustrate the use of transformation rule composition by deriving meta-rules from the graph grammar consisting of the four rules necessary to represent the complete Formose reaction, see Fig. 3. The overall reaction pattern of the Formose cycle is $2g_0 + g_1 \to 2g_1$ with $g_0$ being formaldehyde and $g_1$ being glycolaldehyde. It amounts to the linear combination $\sum_{i=1}^{9} \rho_i$ of the eight reactions and the influx $\rho_1$ of $g_0$ listed in Fig. 3. It is important to notice that several of these reactions are instantiations of the same, well-known chemical transformations. We have forward keto-enol tautomerism ($p_0$: $\rho_2$, $\rho_4$, $\rho_6$), backward keto-enol tautomerism ($p_1$: $\rho_7$, $\rho_9$), forward aldol addition ($p_2$: $\rho_3$, $\rho_5$), and backward aldol addition ($p_3$: $\rho_8$). The composite rule models the complete autocatalytic cycle shown in Fig. 6 as a single meta-rule.
Throughout this section we will not explicitly distinguish between partial composition and full composition, and we interpret the composition operator $\bullet$ as right-associative to simplify the notation. Thus $p_c \bullet p_b \bullet p_a$ is read as $p_c \bullet (p_b \bullet p_a)$. The rules are used in the autocatalytic cycle in the following order (starting with a keto-enol tautomerisation $p_0$): $p_0, p_2, p_0, p_2, p_0, p_1, p_3, p_1$. As it is not possible to compose this sequence of rules directly, we start by binding glycolaldehyde $g_1$ to reaction $p_0$, as the aforementioned keto-enol tautomerisation is applied to molecule $g_1$. The resulting rule is denoted as $g_1$. The hyperedges in the chemical reaction network depicted in Fig. 6 are numbered according to the order in which the Formose reaction takes place, which is also the order in which the rule compositions are carried out. The first composition refers to the binding operation. This binding of glycolaldehyde results in a graph grammar rule, which is depicted in row 1 of the table in Fig. 6, i.e., the rule $(\emptyset, \emptyset, g_1)$ (see "Graph Binding"). The numbers at the hyperedges (2, 3, ..., 9) refer to the second, third, ..., ninth reaction in the sequence of reactions given above. The graph grammar rule $p_i$, $0 \leq i \leq 3$, used for the corresponding hyperedge is given next to the sequence number. The rules inferred by subsequent rule compositions are given in rows 2 to 9 of the table.
The application of the final rule results in the composed meta-rule $p_1 \bullet p_3 \bullet \dots \bullet p_0 \bullet g_1$. This rule precisely covers the reaction pattern of the Formose reaction, namely how two formaldehyde molecules and one (bound) glycolaldehyde are transformed to two glycolaldehyde molecules. Note, however, that the rule is general enough that any pair of molecules with aldehyde groups can be used, i.e., the inferred reaction pattern refers to a class of overall reactions and the product does not necessarily need to be glycolaldehyde.
The practical computation of these compositions takes less than a second in the current implementation. Even for substantially more general composition sequences the running time remains manageable. For instance, it takes less than 1 minute to compute all composition sequences with a length $k \leq 10$ of the form $p_{i_1} \bullet p_{i_2} \bullet \cdots \bullet p_{i_k} \bullet g_q$ with $i_j \in \{0, 1, 2, 3\}$, based on the binding of one of the influx molecules $g_0$ or $g_1$. This results in 1875 different inferred composite rules.
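For orientation, the number of index sequences covered by this experiment can be counted directly. The sketch assumes $k \geq 1$ and counts only the choices of rule indices and of the bound molecule; it says nothing about how many matchings each composition step admits or how many sequences admit no composition at all. The function name count_sequences is hypothetical.

```python
def count_sequences(num_rules=4, num_bound=2, max_len=10):
    """Number of index sequences (i_1, ..., i_k, q) with 1 <= k <= max_len."""
    return num_bound * sum(num_rules ** k for k in range(1, max_len + 1))

print(count_sequences())  # 2796200
```

Under these assumptions there are 2796200 candidate sequences, against which the 1875 distinct composite rules reported above can be compared.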
Polymerization can also be viewed as a pathway in a chemical reaction network, albeit one of potentially infinite size. The same methods applied to the automatic inference of the overall reaction pattern of the Formose cycle can be directly applied to detecting composition rules for polymerization reactions. Importantly, even if a chemical reaction network is not given, the approaches presented in this paper can be used to automatically find sequences of reactions that will lead to polymerization. This can be realized by a straight-forward post-processing step: all that needs to be done is to check whether an inferred composite rule exhibits a replicated functional unit. Such polymerization meta-rules also enable the analysis of chemical systems with highly complex carbon skeletons such as the natural compound classes of the terpenes or the polyketides.

Conclusions
Graph grammars provide a convenient framework for modeling chemistries on different levels of abstraction. A chemically valid approach is to see any chemical reaction as a bi-molecular reaction. This requires graph grammar rules that cover changes of molecules in a rather explicit and detailed way. Understanding chemical reaction patterns usually requires spanning the chemical reaction networks based on such rules. Obviously, this approach carries the inherent risk of an immense combinatorial explosion. In this paper we introduced the automatic inference of higher-level chemical reaction patterns based on a formal approach for graph grammar rule composition. We analyzed the autocatalytic cycle of the Formose reaction and inferred its overall reaction pattern as a rule composition of nine rules. Rule composition is also naturally applicable to inferring patterns of polymerization reactions. Future work will include, e.g., the analysis of terpene-based and hydrogen cyanide-based polymerization chemistry.

Supplemental Material: Example of Enumeration of Compositions
In this Appendix we show the complete result of the composition of two (artificial) rules, $p_1$ and $p_2$, including the selection of submatches from the match matrix. The two rules are depicted in Fig. 7 with the extended match matrix of the composition $p_2 \bullet p_1$, which corresponds to the example of an extended match matrix given in the paper. The rules in this section are all depicted with vertices that have an additional index. The numbering of the components is in increasing order with respect to these indices, e.g., $L_2^1$ denotes the component connecting nodes (A, 0) and (B, 1) and $L_2^2$ denotes the component connecting nodes (B, 2) and (C, 3). In the following we will enumerate all valid selections of submatches based on the extended match matrix and give the corresponding resulting rule composition. The chosen matches are depicted as • in the extended match matrix. If several matches can be found (in our example this is true for the component $L_2^1$, which can be matched twice in $R_1^3$), the • has an index.