Natural Genesis
A Sourcebook for the Worldwide Discovery of a Creative Organic Universe

VI. Earth Life Emergence: A Development of Body, Brain, Selves and Societies

2. The Deep Affinity of Genomes and Languages

Around 1970, the linguist Roman Jakobson and the biopsychologist Jean Piaget opined that these prime informational domains ought to share deep similarities. The common phrase Book of Life, along with the analogous use of language terms in genetics, was current even earlier. Over the subsequent decades, as gathered here, the parallels grew in veracity and value, often serving as insightful comparisons. We now enter this 2016 section because several recent contributions strongly confirm, as naturally should be, an innate similarity. A March 2016 issue of the Philosophical Transactions of the Royal Society A on “DNA as Information” (search Cartwright) braces this result by novel rootings of genome phenomena in nature’s mathematics, physics and chemistry. As our worldwise humankinder personsphere arises to learn on her/his own, a truly textual, poetic code script narrative, the Magnum Opus quest of perennial tradition, is at last being fulfilled, revealed and proven.

Asgari, Ehsaneddin and Mohammad Mofrad. Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence as a Quantitative Measure of Language Distance. arXiv:1604.08561. Within the work of Mofrad’s Molecular Cell Biomechanics Laboratory which involves the linguistic modeling of protein bioinformatics, Iranian-American, UC Berkeley researchers discern an innate affinity between linguistic volumes by way of network parsings (search Rosetta Cosmos) and similar analyses of genomes. By so doing, the mid 2010s realization of a common natural textuality from uniVerse to human increasingly gains a scientific verification. See also herein Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics by the authors.

We introduce a new measure of distance between languages based on word embedding, called word embedding language divergence, defined as divergence between unified similarity distribution of words between languages. Using such a measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantees having the same content in all languages, interestingly in many cases languages within the same family cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and two human subjects). The proposed method is a step toward defining a quantitative measure of similarity between languages, with applications in languages classification, genre identification, dialect identification, and evaluation of translations. (Abstract)

Figure 1: Hierarchical clustering of fifty natural languages according to divergence of joint distance distribution of 4097 aligned words in bible parallel corpora. Subsequently we use colors to show the ground-truth about family of languages. For Indo-European languages we use different symbols to distinguish various sub-families of Indo-European languages. We observe that the obtained clustering reasonably discriminates between various families and subfamilies.

Figure 2: Visualization of word embedding language divergence in twelve different genomes belonging to 12 organisms for various n-gram segments. Our results indicate that evolutionarily closer species have higher proximity in the syntax and semantics of their genomes. (8)
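As a reading aid, the idea of a divergence between similarity distributions of aligned words can be sketched in a few lines. This is only a minimal illustration, not the authors' implementation: the use of cosine similarity, the rescaling to a probability distribution, the choice of Jensen-Shannon divergence, and the toy two-dimensional "embeddings" are all assumptions made here for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_distribution(vectors):
    """Turn pairwise similarities of aligned words into a probability distribution."""
    n = len(vectors)
    # Rescale cosine from [-1, 1] to [0, 1] so every entry is non-negative.
    sims = [(cosine(vectors[i], vectors[j]) + 1.0) / 2.0
            for i in range(n) for j in range(i + 1, n)]
    total = sum(sims)
    return [s / total for s in sims]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical embeddings of the same three aligned words in three "languages".
lang_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lang_b = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]   # lang_a rotated by 90 degrees
lang_c = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]     # a different internal geometry

dist_a = similarity_distribution(lang_a)
dist_b = similarity_distribution(lang_b)
dist_c = similarity_distribution(lang_c)

print(js_divergence(dist_a, dist_b))  # rotation preserves similarities
print(js_divergence(dist_a, dist_c))  # differing geometry gives a positive value
```

Because only the relations among aligned words enter the measure, two embedding spaces that differ by a rotation count as identical, which is why each language can be embedded separately before comparison.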

Avise, John. The Best and the Worst of Times for Evolutionary Biology. BioScience. 53/3, 2003. More considerations by the University of Georgia geneticist and author on better metaphors for an increasingly sequenced dynamic genome beyond the old string of particulate molecules. This quote is also a good description of a complex adaptive system.

An emerging view is that the genome is in many ways like an extended intracellular society of interacting genetic elements. Within each such microecosystem are multitudinous quasi-independent DNA sequences with elaborate divisions of labor and functional collaborations….Their strategies (pieces of DNA) often bear a striking analogy to those observed among people partially bound in social arrangements. (251)

Cai, Yizhi, et al. Modeling Structure-Function Relationships in Synthetic DNA Sequences using Attribute Grammars. PLoS Computational Biology. 5/10, 2009. As a systems biology approach wholly reconceives the genetic code, Virginia Polytechnic bioinformatics scientists weigh in on how actually to identify and define what a “gene” is. As the quotes advise, a clever idea is to draw on “attribute grammars” from computer software to help represent genetic function. Once again, affinities with linguistic formats, and underlying complexity phenomena, can be noted.

Recognizing that certain biological functions can be associated with specific DNA sequences has led various fields of biology to adopt the notion of the genetic part. This concept provides a finer level of granularity than the traditional notion of the gene. However, a method of formally relating how a set of parts relates to a function has not yet emerged. Synthetic biology both demands such a formalism and provides an ideal setting for testing hypotheses about relationships between DNA sequences and phenotypes beyond the gene-centric methods used in genetics. Attribute grammars are used in computer science to translate the text of a program source code into the computational operations it represents. By associating attributes with parts, modifying the value of these attributes using rules that describe the structure of DNA sequences, and using a multi-pass compilation process, it is possible to translate DNA sequences into molecular interaction network models. (1)

Yet, despite its success, the notion of gene appears insufficient to express the complexity of the relation between an organism genome and its phenotype. The elucidation of the molecular mechanisms controlling gene expression has revealed a web of molecular interactions that have been modeled mathematically to show that important phenotypic traits are the emerging properties of a complex system. (1)

Cartwright, Julyan, et al. DNA as Information: At the Crossroads between Biology, Mathematics, Physics and Chemistry. Philosophical Transactions of the Royal Society A. Vol.374/Iss.2063, 2016. University of Granada and University of Bologna scientists introduce an issue on growing abilities to connect and explain genetic phenomena within encompassing physical, chemical, and mathematical domains. As the quotes allude, both life and cosmos phases proceed to cross-inform each other. The natural universe increasingly appears as biologically conducive in essence, while living systems become theoretically amenable and describable by these disciplines. The authors go on to recognize an historic revolution, or paradigm shift in the making, which ought to be facilitated and pursued forthright. It is worth noting that language and book are once more a metaphor for the genetic code, and by extension for a conducive nature. The copious issue contains papers such as The Meaning of Biological Information by Eugene Koonin, DNA as Information by Peter Wills, and Pragmatic Information in Biology and Physics by Juan Roederer.

On the one hand, biology, chemistry and also physics tell us how the process of translating the genetic information into life could possibly work, but we are still very far from a complete understanding of this process. On the other hand, mathematics and statistics give us methods to describe such natural systems—or parts of them—within a theoretical framework. Furthermore, there are peculiar aspects of the management of genetic information that are intimately related to information theory and communication theory. This theme issue is aimed at fostering the discussion on the problem of genetic coding and information through the presentation of different innovative points of view. The aim of the editors is to stimulate discussions and scientific exchange that will lead to new research on why and how life can exist from the point of view of the coding and decoding of genetic information. (Abstract)

Biology at present is embarked on an experimental search that we may define as functionalist; that is to say that it is attempting to understand how the functions of living material link together. This search is based upon data that are harder and harder to classify, and above all to interpret. We may compare the situation to that of the comprehension of inanimate matter before the advent of the modern atomic theory. We may thus ask ourselves: were those theoretical efforts to understand and classify matter using physico-mathematical concepts useful? The answer is of course affirmative, and indeed theoretical methods used by biology today originated in the revolution—the paradigm shift—produced by the knowledge of the atomic structure of matter, without which molecular biology would not exist. We argue that another paradigm shift is needed to understand biology: its mathematization. (5)

A common metaphor refers to DNA as the ‘book of life’. Of course, we know that the main information that represents an organism is contained or carried by nucleic acid molecules. In this respect, DNA can be considered as a book, but curiously, such a metaphor has scientific basis only in the concept of the genetic code. However, the genetic code is not a book nor a part of it; rather it is a translation dictionary between two different worlds (languages), i.e. the world of nucleic acids and the world of proteins. Hence, the genetic code allows the translation of a book written in a language into an abridged version of the same book in a different language. Moreover, little is known about the grammar, the syntax and even the orthography of the book of life. Still, we know that the genetic code is involved in the transmission of the information contained in such book and configures a relevant part of the process that defines the central dogma of molecular biology. (7)
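The image of the genetic code as a translation dictionary can be made literal. The sketch below builds the standard 64-codon table (DNA alphabet, one-letter amino acid symbols, `*` for stop) and translates a short coding sequence; the function name `translate` and the example sequence are illustrative choices, and the code assumes an input of unambiguous A, C, G, T bases read in frame from its first letter.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second and third base.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA coding sequence into protein, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]  # KeyError on non-ACGT letters
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE), len(set(CODON_TABLE.values())))
print(translate("ATGGCCTGGTAA"))
```

The dictionary maps 64 codon "words" onto only 21 meanings (20 amino acids plus stop), which is one concrete sense in which the protein version of the book is abridged.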

Chiang, David, et al. Grammatical Representations of Macromolecular Structure. Journal of Computational Biology. 13/5, 2006. With co-authors Aravind Joshi and David Searls, a contribution to the long convergence of genomes and language, genetic and linguistic composition and function, which exemplify the “same core principles.” Upon reflection, the achievement is to realize molecular realms as a literal text, while our written and spoken discourse then becomes biologically instructive in kind, if only we could learn to perceive and read this testament.

Since the first application of context-free grammars to RNA secondary structures in 1988, many researchers have used both ad hoc and formal methods from computational linguistics to model RNA and protein structure. We show how nearly all of these methods are based on the same core principles and can be converted into equivalent approaches in the framework of tree-adjoining grammars and related formalisms. (1077)

Computational linguistic methods, broadly construed, have been applied to molecular biology in two general ways, which we classify as textual and structural. Textual approaches bear mainly on the actual string content of biological sequences, whereas structural approaches deal mainly with the interactions between sequence elements in folded structures. Examples of the textual approach include the use of regular expressions to specify recurring motifs, or of grammars that capture gene structures as assemblages of codons, signal sequences, and other lexical elements. The latter application demonstrates how linguistic methods (whether based on grammars or their cognate automata) allow such primary sequence elements to be collected in rule-based fashion into flexible hierarchical descriptions that both enforce global constraints such as reading frame and provide a useful compositional framework for heuristic and statistical discrimination of, for instance, coding versus non-coding segments. (1077)
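The textual and structural approaches named in this quote can each be illustrated with a toy example. What follows is a hedged sketch and not drawn from the paper: the regular expression for an in-frame open reading frame, the one-rule hairpin grammar S -> x S y | loop, the minimum loop length of three, and the function names are all assumptions chosen for brevity.

```python
import re

# Textual approach: a regular expression that enforces reading frame.
# A start codon, any number of whole codons, then the first in-frame stop codon.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")

def find_orf(dna):
    """Return the first open reading frame found in dna, or None."""
    match = ORF_PATTERN.search(dna)
    return match.group(0) if match else None

# Structural approach: a context-free rule for an RNA hairpin,
#   S -> x S y   (x and y complementary bases)
#   S -> loop    (an unpaired run of at least min_loop bases)
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def is_hairpin(rna, min_loop=3):
    """Recognise strings derivable from the hairpin grammar with at least one base pair."""
    def derive(s, paired):
        can_pair = len(s) - 2 >= min_loop and (s[0], s[-1]) in PAIRS
        if can_pair and derive(s[1:-1], True):
            return True
        # Otherwise the remainder must serve as the loop itself.
        return paired and len(s) >= min_loop

    return derive(rna, False)

print(find_orf("CCATGAAATGACC"))
print(is_hairpin("GGGAAAUCCC"))
```

The regular expression reads the sequence left to right as a string, whereas the grammar rule links the first and last symbols of a span, a nested dependency of the kind that regular expressions cannot express in general.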

Chaudhuri, Pramit and Joseph Dexter. Bioinformatics and Classical Literary Study. arXiv:1602.08844. As part of an ongoing program, a Dartmouth College professor of classics and a Harvard University molecular biologist explain how a genome sequencing technique can be applied to parse classic Latin epics such as Vergil’s Aeneid. The literary humanities can thus become amenable to a common “-omics” analysis. As other entries in this new section report, an historic discovery, mostly unbeknownst, of an innately textual, ultimately genomic nature which reaches from cosmome and quantome to our human epitome is dawning in our midst.

This paper describes a collaborative project between classicists, quantitative biologists, and computer scientists to apply ideas and methods drawn from the sciences to the study of literature. A core goal of the project is the use of computational biology, natural language processing, and machine learning techniques to investigate intertextuality, reception, and related phenomena of literary significance. As a case study in our approach, here we describe the use of sequence alignment, a common technique in genomics, to detect intertextuality in Latin literature. Sequence alignment is distinguished by its ability to find inexact verbal parallels, which makes it ideal for identifying phonetic resemblances in large corpora of Latin texts. Although especially suited to Latin, sequence alignment in principle can be extended to many other languages. (Abstract)
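To show how sequence alignment finds inexact verbal parallels, here is a minimal sketch of local alignment in the Smith-Waterman style applied to plain character strings. It is not the project's own pipeline: the scoring values (match 2, mismatch -1, gap -1), the function name, and the second test phrase, a made-up variant of the Aeneid's opening words, are assumptions for illustration.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming table; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# "arma virumque cano" opens the Aeneid; the second phrase is an invented variant.
print(local_align("arma virumque cano", "arma virosque cano"))
```

Because mismatches and gaps only lower the score rather than forbid a match, phrases that differ by a few letters still align, which is the property the abstract singles out for detecting intertextual echoes.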

Delwiche, Charles. The Genomic Palimpsest: Genomics in Evolution and Ecology. BioScience. 54/11, 2004. Advances in the sequencing and analyzing of complete genomes can now inform the study of populations, interacting organisms and the course of evolution. A clever metaphor is then enlisted whereby the DNA code is seen to have evolved in a similar way to medieval manuscripts. Because good parchment was scarce, earlier texts were partially erased and written over by later, superimposed passages, known as a palimpsest. Genomes likewise evolve not by adding new genes but through modifying prior ones whose remnants reflect their history.

Deming, Laura, et al. Genetic Architect: Discovering Genomic Structure with Learned Neural Architectures. arXiv:1605.07156. As the genetic and brain sciences merge and cross-inform, UC San Francisco, Institute for Human Genetics, researchers show how deep learning network algorithms can also effectively parse genome phenomena. An inclusive synthesis then seems underway from celestial webs (Coutinho) to literature (Rosetta Cosmos), as a cosmos to culture anatomy and physiology just now becomes realized.

Each human genome is a 3 billion base pair set of encoding instructions. Decoding the genome using deep learning fundamentally differs from most tasks, as we do not know the full structure of the data and therefore cannot design architectures to suit it. As such, architectures that fit the structure of genomics should be learned not prescribed. Here, we develop a novel search algorithm, applicable across domains, that discovers an optimal architecture which simultaneously learns general genomic patterns and identifies the most important sequence motifs in predicting functional genomic outcomes. The architectures we find using this algorithm succeed at using only RNA expression data to predict gene regulatory structure, learn human-interpretable visualizations of key sequence motifs, and surpass state-of-the-art results on benchmark genomics challenges. (Abstract)

Deep learning demonstrates excellent performance on tasks in computer vision, text and many other fields. Most deep learning architectures consist of matrix operations composed with non-linearity activations. Critically, the problem domain governs how matrix weights are shared. In convolutional neural networks – dominant in image processing – translational equivariance is encoded through the use of the convolution operation; in recurrent networks – dominant in sequential data – temporal transitions are captured by shared hidden-to-hidden matrices. These architectures mirror human intuitions and priors on the structure of the underlying data. Genomics is an excellent domain to study how we might learn optimal architectures on poorly-understood data because while we have intuition that local patterns and long-range sequential dependencies affect genetic function, much structure remains to be discovered. (1)

Dunn, Ian. Are Molecular Alphabets Universal Enabling Factors for the Evolution of Complex Life? Origins of Life and Evolution of Biospheres. 43/6, 2013. From the CytoCure LLC geneticist and research director comes a description of nucleotides and proteins as most like an alphabetic string of characters. While a correspondence between genetics and linguistics has been in the offing for some time, Ian Dunn gives it an updated, robust affirmation. Thus “a digital self-organizing complementary primary replicative alphabet” is seen as a universal property of genomic phenomena. And as ever the implication is a textual, inscribed naturome as life’s newly legible script.

Terrestrial biosystems depend on macromolecules, and this feature is often considered as a likely universal aspect of life. While opinions differ regarding the importance of small-molecule systems in abiogenesis, escalating biological functional demands are linked with increasing complexity in key molecules participating in biosystem operations, and many such requirements cannot be efficiently mediated by relatively small compounds. It has long been recognized that known life is associated with the evolution of two distinct molecular alphabets (nucleic acid and protein), specific sequence combinations of which serve as informational and functional polymers. In contrast, much less detailed focus has been directed towards the potential universal need for molecular alphabets in constituting complex chemically-based life, and the implications of such a requirement.

To analyze this, emphasis here is placed on the generalizable replicative and functional characteristics of molecular alphabets and their concatenates. A primary replicative alphabet based on the simplest possible molecular complementarity can potentially enable evolutionary processes to occur, including the encoding of secondarily functional alphabets. Very large uniquely specified (‘non-alphabetic’) molecules cannot feasibly underlie systems capable of the replicative and evolutionary properties which characterize complex biosystems. Transitions in the molecular evolution of alphabets can be related to progressive bridging of barriers which enable higher levels of biosystem organization. It is thus highly probable that molecular alphabets are an obligatory requirement for complex chemically-based life anywhere in the universe. In turn, reference to molecular alphabets should be usefully applied in current definitions of life. (Abstract)

In order to evaluate the role of molecular alphabets in biosystems as potentially universal phenomena, it is useful to initially review how such alphabets are defined in general terms, and to consider how they may be grouped into different functional classes. An alphabet in the molecular sense generally refers to the defined set of monomers from which functional oligomers or polymers are derived, in an analogous fashion to stringing individual letters together in the correct sequence of a string itself. Consequently, it is always necessary to compare a set of alphabetic monomers with their concatenated functional forms. The following statement serves as a practical generalizable definition: ‘A biological molecular alphabet consists of a specific set of relatively small distinct molecules which by means of a templating process can covalently concatenate into a large number of combinatorial alternative oligomers or polymers with specified sequences, of essential informational and/or functional significance for biosystem operations.’ (447)

Ferrer-i-Cancho, Ramon and Nuria Forns. The Self-Organization of Genomes. Complexity. Online First, March, 2010. As the quote cites, Barcelona biologists contribute to the recent robust affirmation that genetic and linguistic codes are one and the same in their expression of the universal complex system dynamics, which then, one may add, could take on the likeness of an independent, mathematical cosmic genotype.

Menzerath-Altmann law is a general law of human language stating, for instance, that the longer a word, the shorter its syllables. With the metaphor that genomes are words and chromosomes are syllables, we examine if genomes also obey the law. We find that longer genomes tend to be made of smaller chromosomes in organisms from three different kingdoms: fungi, plants, and animals. Our findings suggest that genomes self-organize under principles similar to those of human language. (Abstract)
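The test described in this abstract can be sketched as a rank correlation between the number of chromosomes in a genome and the mean size of those chromosomes. The code below is an illustration only: the choice of Spearman correlation is an assumption about how one might check the trend, and the four data rows are invented toy values, not measurements of any real organism.

```python
import math

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy genomes: (number of chromosomes, total genome size in Mb).
toy_genomes = [(4, 400.0), (8, 600.0), (16, 800.0), (32, 960.0)]

counts = [n for n, _ in toy_genomes]
mean_sizes = [size / n for n, size in toy_genomes]  # mean chromosome size

print(mean_sizes)
print(spearman(counts, mean_sizes))
```

A negative correlation is what the law predicts: as the "word" (genome) gains more "syllables" (chromosomes), the mean size of each part falls, even when total genome size is rising, as it does in the toy rows above.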

Ferrer-i-Cancho, Ramon, et al. The Challenges of Statistical Patterns of Language: The Case of Menzerath’s Law in Genomes. Complexity. Online December, 2012. With coauthors Nuria Forns, Antoni Hernandez-Fernandez, Gemma Bel-Enguix and Jaume Baixeries, Barcelona systems scientists advise that along with (George Kingsley) Zipf’s law, the law of German linguist Paul Menzerath, which relates the size of a whole to the size of its parts in a text or score, can hold equally well for biomolecular nucleotide genomes. By these lights, another entry is gained to appreciate a deep, parallel affinity between the genetic code and literate languages.

The importance of statistical patterns of language has been debated over decades. Although Zipf's law is perhaps the most popular case, recently, Menzerath's law has begun to be involved. Menzerath's law manifests in language, music and genomes as a tendency of the mean size of the parts to decrease as the number of parts increases in many situations. This statistical regularity emerges also in the context of genomes, for instance, as a tendency of species with more chromosomes to have a smaller mean chromosome size. It has been argued that the instantiation of this law in genomes is not indicative of any parallel between language and genomes because (a) the law is inevitable and (b) noncoding DNA dominates genomes. Here mathematical, statistical, and conceptual challenges of these criticisms are discussed. Two major conclusions are drawn: the law is not inevitable and languages also have a correlate of noncoding DNA. However, the wide range of manifestations of the law in and outside genomes suggests that the striking similarities between noncoding DNA and certain linguistics units could be anecdotal for understanding the recurrence of that statistical law. (Abstract)

Languages and genomes show a striking similarity at the semantic level: both possess units that have an arbitrary semantic reference of symbolic nature. Our comparison goes further and suggests that genomes code for some abstract version of grammatical and lexical meaning, the former in non-coding regions and the latter in coding regions. (6) Quantitative linguistics offers powerful tools for discovering and investigating non-trivial connections between human language and genomes. However, the evolutionary mechanisms and the constraints that may underlie the recurrence of Menzerath’s law still must be understood. (6)

Gimona, Mario. Protein Linguistics and the Modular Code of the Cytoskeleton. Barbieri, Marcello, ed. The Codes of Life. Berlin: Springer, 2008. The University of Salzburg geneticist contributes to the long project to interpret, join and unify the molecular and literal versions, in support of the growing conclusion that “Nature is Structured in a Language-like Fashion.” See also an earlier paper “Protein Linguistics – A Grammar for Modular Protein Assembly?” in Nature Reviews: Molecular Cell Biology (7/1, 2006).
