The First Gene, David L. Abel, Editor 2011, pp 147-169 ISBN: 978-0-9657988-9-1
Functional Sequence Complexity in Biopolymers
KIRK DURSTON & DAVID K.Y. CHIU
Department of Computer Science, Bioinformatics
University of Guelph
50 Stone Road East, Guelph, ON, Canada, N1G 2W1
ABSTRACT. It is generally recognized that biopolymers such as DNA, RNA and proteins demonstrate a form of sequence complexity. Recent work has provided a more detailed insight into biopolymeric complexity by introducing three types of sequence complexity: Random Sequence Complexity (RSC), Ordered Sequence Complexity (OSC) and Functional Sequence Complexity (FSC). The primary feature of FSC that distinguishes it from RSC and OSC is the imposition of functional controls upon the sequence. In this paper, we propose that FSC can be measured using an extended form of Shannon uncertainty that includes a variable of functionality. Clearly, FSC can be found in human languages and carefully designed computer code, but the measure we propose in this paper reveals that it is also found in biopolymers. In the case of proteins, the measure of FSC provides an estimate for the target size of a protein family in the amino acid sequence space, revealing that functional sequences occupy an extremely small fraction of sequence space. Due to the minuscule size of functional sequence space for a given protein family, as mutations accumulate there will be an increasing likelihood of moving the mutated sequence outside that space, with a corresponding deleterious effect on FSC.
Correspondence/Reprint request: kirkdurston@gmail.com
Introduction: sequence complexity in biopolymers
It has recently been pointed out that traditional notions of complexity are inadequate when applied to biosequences [1, 2]. For example, characterizing biosequence complexity in terms of algorithmic complexity fails to account for the redundancy found in numerous different sequences even when they have the same function [1]. Functional controls imposed upon a biological sequence are critical for maintaining specific functions of the sequence within the cell and, ultimately, for the existence of life. A more rigorous formulation for complexity in biosequences that incorporates functionality is therefore required. Abel and Trevors have defined three types of sequence complexity, only one of which accounts for functional controls imposed upon biosequences such as DNA, RNA and proteins. We will discuss these three types of complexity within the context of biopolymers, with a special focus on that form of sequence complexity that incorporates functionality.
1. Random sequence complexity
Abel and Trevors have defined Random Sequence Complexity (RSC) as a linear string of stochastically linked units, the sequencing of which is dynamically inert, statistically unweighted, and unchosen by agents; a random sequence of independent and equiprobable unit occurrence [3]. Implicitly, four components contribute to RSC. First, the sequence is composed of sites, or loci. Second, there is the set of symbols that could occupy each site in the sequence. Third, there is a complete absence of constraints and controls on these symbols, statistically making all options equiprobable. Finally, the value of the symbol at each site must be independent of the values at any other site, such that no site is constrained by any other site in the sequence. An example of RSC can be found in atactic polystyrene, where the orientation of the side chains at each site appears to be completely unconstrained. In summary, if no agent or law of nature controls or constrains the outcome at any site in a sequence, then the possible outcomes are presumed to be equiprobable, and the complexity of the sequence is characterized as RSC.
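The equiprobable, independent case described above can be illustrated numerically. The following is a minimal sketch, not the measure proposed in this paper: it assumes a four-symbol alphabet and uses the ordinary Shannon uncertainty estimated from symbol frequencies. The function names `shannon_uncertainty` and `random_sequence`, the alphabet, and the sequence length are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def shannon_uncertainty(symbols):
    """Estimate Shannon uncertainty (bits per site) from observed symbol frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    # Sum of -p * log2(p) over the symbols that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def random_sequence(length, alphabet="ACGT", seed=0):
    """Draw each site independently and uniformly from the alphabet (assumed RSC model)."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))


# With 4 equiprobable symbols the maximum uncertainty is log2(4) = 2 bits per site.
max_uncertainty = math.log2(4)

seq = random_sequence(10_000)
estimate = shannon_uncertainty(seq)
print(f"maximum: {max_uncertainty:.3f} bits/site, estimated: {estimate:.3f} bits/site")
```

Under these assumptions the estimate should lie very close to the 2-bit maximum, which is what the absence of constraints at every site implies.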
2. Ordered sequence complexity
Ordered Sequence Complexity (OSC) is defined as a linear string of linked units, the sequencing of which is patterned either by the natural regularities described by physical laws (necessity) or by statistically weighted means (e.g., unequal availability of units), but which is not patterned by deliberate choice contingency (agency) [3]. Examples of OSC are repeating patterns arising out of chaotic interactions or a string of repeating alphabet characters such as TGTGTGTGTGTG…. In nature, OSC is presumed to occur when laws of nature impose such tight constraints that there is no possibility of variation. In this case, repeatable, highly constrained sequences are produced that cannot, therefore, incorporate new functional inputs as functional information. An example of OSC is the highly ordered and repeating sequence obtained through the formation of polyadenosine adsorbed onto the surface of montmorillonite clay [4].
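One way to see why a repeating string such as TGTGTG leaves no room for variation is to estimate the uncertainty of each symbol given the symbol before it. The sketch below is an illustration only and is not taken from the cited work: the function name `conditional_uncertainty` is hypothetical, and a first-order (adjacent-pair) estimate is an assumed simplification.

```python
import math
from collections import Counter


def conditional_uncertainty(seq):
    """Estimate H(X_n | X_(n-1)) in bits from adjacent symbol pairs."""
    pair_counts = Counter(zip(seq, seq[1:]))  # counts of (previous, current)
    prev_counts = Counter(seq[:-1])           # counts of the previous symbol
    total_pairs = len(seq) - 1
    h = 0.0
    for (prev, _cur), n in pair_counts.items():
        p_pair = n / total_pairs              # joint probability of the pair
        p_cur_given_prev = n / prev_counts[prev]
        h -= p_pair * math.log2(p_cur_given_prev)
    return h


ordered = "TG" * 50
print(conditional_uncertainty(ordered))  # T always precedes G and G always precedes T
```

For the repeating string, each symbol is fully determined by its predecessor, so the estimated conditional uncertainty is zero even though T and G each occupy about half of the sites.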
3. Functional sequence complexity
Given the limitations discussed above, neither RSC, nor OSC, nor a combination of the two is capable of producing significant levels of FSC, since none of them is, by definition, controlled by functionality [5]. Szostak [1] has further pointed out that, traditionally, neither algorithmic complexity [6] nor Shannon’s measure of uncertainty [7] is adequate for biopolymers. Functional Sequence Complexity (FSC) is therefore defined as a linear, digital, cybernetic string of symbols representing syntactic, semantic and pragmatic prescription; each successive symbol in the string is a representation of a decision-node configurable switch-setting, a specific selection for function [3]. Volitional agency (control) is implicitly required to properly set each configurable-switch-position symbol to achieve functionality. Examples of FSC are said to occur in well-designed computer code and, naturally, in human languages. For biopolymers, functionality can be a result of structural requirements of protein families [17], cellular processes, or specific biochemical reactions [8]. Furthermore, biological functions can be nested in a hierarchical manner from the sub-molecular domain structure necessary for the 3D structure of an enzyme, all the way up to the global function of entire species of organisms. When OSC and RSC on the one hand are compared with FSC on the other, the requirement of functionality is the distinguishing feature.
Recent advances in the synthesis of RNA chains in water are encouraging insofar as they provide a storage medium for prescriptive information and FSC [9]. However, the much greater challenge of encoding FSC within RNA remains. If an RNA sequence is highly ordered, it will tend toward OSC. If the highly ordered sequence can mutate, it will tend toward RSC over time. For the sequence to become functional, controls will be required to properly configure each switch-setting (nucleotide) to select for function.
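The drift of an ordered sequence toward RSC under unconstrained mutation can be sketched with a toy simulation. This is a hedged illustration rather than a model from this paper: it assumes uniform random point substitutions over a four-letter RNA alphabet, with no selection for function, and the names `shannon_uncertainty` and `mutate`, the sequence length, and the mutation counts are hypothetical.

```python
import math
import random
from collections import Counter


def shannon_uncertainty(symbols):
    """Estimate Shannon uncertainty (bits per site) from observed symbol frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def mutate(seq, n_mutations, rng, alphabet="ACGU"):
    """Apply point substitutions at random sites (assumed uniform, unselected model)."""
    sites = list(seq)
    for _ in range(n_mutations):
        i = rng.randrange(len(sites))
        sites[i] = rng.choice(alphabet)  # may redraw the same nucleotide
    return "".join(sites)


rng = random.Random(1)
seq = "A" * 2000                       # highly ordered start, polyadenosine-like
trajectory = [shannon_uncertainty(seq)]
for _ in range(5):
    seq = mutate(seq, 2000, rng)       # one round = 2000 point substitutions
    trajectory.append(shannon_uncertainty(seq))

print([round(h, 3) for h in trajectory])
```

Under these assumptions the per-site uncertainty starts at zero and climbs toward the 2-bit maximum of a random four-letter sequence. Because nothing in the simulation selects for function, it shows only the move from order toward randomness, not the emergence of FSC.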