IV. Ecosmomics: An Independent, UniVersal, Source Code-Script of Generative Complex Network Systems
2. The Innate Affinity of Genomes, Proteomes and Language
Around 1970, the linguist Roman Jakobson and the biopsychologist Jean Piaget opined that these prime informational domains ought to have common similarities. The familiar phrase Book of Life, along with analogous usage of language terms in genetics was popular at the time. Over subsequent decades, as gathered herein, parallels between the two code scripts grew in veracity and value, often as an insightful cross-comparison. We post this 2016 section because recent contributions strongly confirm an innate, natural continuity. A March 2016 issue of the Philosophical Transactions of the Royal Society A on “DNA as Information” (Julyan Cartwright) supports this view by novel rootings of genome phenomena in mathematics, physics and chemistry.
Asgari, Ehsaneddin and Mohammad Mofrad. Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence as a Quantitative Measure of Language Distance. arXiv:1604.08561. Within the work of Mofrad’s Molecular Cell Biomechanics Laboratory which involves the linguistic modeling of protein bioinformatics, Iranian-American, UC Berkeley researchers discern an innate affinity between linguistic volumes by way of network parsings (search Rosetta Cosmos) and similar analyses of genomes. By so doing, the mid 2010s realization of a common natural textuality from uniVerse to human increasingly gains a scientific verification. See also herein Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics by the authors.
We introduce a new measure of distance between languages based on word embedding, called word embedding language divergence, defined as divergence between unified similarity distribution of words between languages. Using such a measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantees having the same content in all languages, interestingly in many cases languages within the same family cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and two human subjects). The proposed method is a step toward defining a quantitative measure of similarity between languages, with applications in languages classification, genre identification, dialect identification, and evaluation of translations. (Abstract)
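As a brief illustration of what a "divergence between similarity distributions" can look like in practice, here is a minimal sketch. The abstract above does not say which divergence is used, so the choice of Jensen-Shannon divergence is an assumption made for illustration only, and the function names (`normalize`, `kl_divergence`, `js_divergence`) and the sample numbers are hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1 (a probability distribution)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded in [0, 1] when using log base 2."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # The mixture m is non-zero wherever p or q is non-zero, so KL(p || m) is always finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical similarity scores for the same word pairs in two "languages".
lang_a = normalize([0.9, 0.4, 0.1, 0.6])
lang_b = normalize([0.8, 0.5, 0.2, 0.5])
distance = js_divergence(lang_a, lang_b)
```

A symmetric, bounded measure is convenient here because a language distance should not depend on which language is listed first.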
Avise, John. The Best and the Worst of Times for Evolutionary Biology. BioScience. 53/3, 2003. More considerations by the University of Georgia geneticist and author on better metaphors for an increasingly sequenced dynamic genome beyond the old string of particulate molecules. This quote is also a good description of a complex adaptive system.
An emerging view is that the genome is in many ways like an extended intracellular society of interacting genetic elements. Within each such microecosystem are multitudinous quasi-independent DNA sequences with elaborate divisions of labor and functional collaborations….Their strategies (pieces of DNA) often bear a striking analogy to those observed among people partially bound in social arrangements. (251)
Benegas, Gonzalo, et al. DNA language models are powerful predictors of genome-wide variant effects. PNAS. 120/44, 2023. As 2023 seems to be a year of novel literacies, from chatbots and large language models to genomic and protein linguistics such as this, UC Berkeley computer scientists introduce a dedicated program-like method as a better way to parse, read and curate nature’s informative code-script. See also, for example, Exploring the Protein Sequence Space with Global Generative Models by Sergio Romero-Romero, et al at arXiv:2305.01941.
The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants remains a challenge. Recent progress in natural language processing via unsupervised pretraining on protein sequence databases has worked well in extracting complex information. Here we introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects (Abstract). As the artificial intelligence field progresses, our approach can incorporate future advancements, offering a powerful and scalable tool to decipher the vast biological sequence diversity observed in nature. (Significance)
Bepler, Tristan and Bonnie Berger. Learning the Protein Language: Evolution, Structure, and Function. Cell Systems. 12/6, 2021. We cite this by Simons Machine Learning Center, NYC and MIT computational biologists as another instance of how deep learning AI, ML (Earthificial) leading edge capabilities are beginning a new global plane of lively analytical studies.
Language models have emerged as a machine-learning approach for distilling information from protein sequence databases. These methods can discover evolutionary, structural, and functional organization across protein spaces. Here we encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. Deep protein language studies thus suggest new ways to approach protein and therapeutic design. (Excerpt)
Bolshoy, Alexander, et al. Genome Clustering: From Linguistic Models to Classification of Genetic Texts. Berlin: Springer, 2010. Israeli geneticists contribute to growing indications of a pervasive code system, which engenders phrases such as DNA texts and DNA linguistics, along with implications that our language corpus is then in some ways genomic in nature.
Caetano-Anolles, Gustavo. Agency in Evolution of Biomolecular Communication. Annals of the New York Academy of Sciences. May, 2023. The University of Illinois bioinformatics scholar (search) continues his theoretic and empirical perceptions of life’s innate self-serving proactivity because it appears to be deeply suffused and informed by biosemiotic encodings at every scalar phase. It is further proposed that biological systems are graced by recurrent anatomies, dynamic behaviors and linguistic expressions, to an extent that “lexicons, semantics and syntax” have a macromolecular essence. As this natural narrative unfolds, a surmise is that “learning and intelligence drive biomolecular communication.” So into 2023, another erudite entry describes a relative decipherment of life’s literary instructive scriptome.
The emergence of agency in biomolecular systems involves a biphasic process of communication that constructs a message before it can be transmitted for interpretation. Evolutionary genomic and bioinformatic explorations suggest agency emerges when molecular machinery generates hierarchical layers of vocabularies in an entangled communication network clustered around the universal Turing machine of the ribosome. (Annals Editor)
Cai, Yizhi, et al. Modeling Structure-Function Relationships in Synthetic DNA Sequences using Attribute Grammars. PLoS Computational Biology. 5/10, 2009. As a systems biology approach wholly reconceives the genetic code, Virginia Polytechnic bioinformatics scientists weigh in on how actually to identify and define what a “gene” is. As the quotes advise, a clever idea is to draw on “attribute grammars” from computer software to help represent genetic function. Once again, affinities with linguistic formats, and underlying complexity phenomena, can be noted.
Recognizing that certain biological functions can be associated with specific DNA sequences has led various fields of biology to adopt the notion of the genetic part. This concept provides a finer level of granularity than the traditional notion of the gene. However, a method of formally relating how a set of parts relates to a function has not yet emerged. Synthetic biology both demands such a formalism and provides an ideal setting for testing hypotheses about relationships between DNA sequences and phenotypes beyond the gene-centric methods used in genetics. Attribute grammars are used in computer science to translate the text of a program source code into the computational operations it represents. By associating attributes with parts, modifying the value of these attributes using rules that describe the structure of DNA sequences, and using a multi-pass compilation process, it is possible to translate DNA sequences into molecular interaction network models. (1)
Cartwright, Julyan, et al. DNA as Information: At the Crossroads between Biology, Mathematics, Physics and Chemistry. Philosophical Transactions of the Royal Society A. 374/2063, 2016. University of Granada and University of Bologna scientists introduce an issue on growing abilities to connect and explain genetic phenomena within encompassing physical, chemical, and mathematical domains. As the quotes allude, both life and cosmos phases proceed to cross-inform each other. The natural universe increasingly appears as biologically conducive in essence, while living systems become theoretically amenable and describable by these disciplines. The authors go on to recognize an historic revolution, or paradigm shift in the making, which ought to be facilitated and pursued forthright. It is worth noting that language and book are once more a metaphor for both the genetic code, and by extension for a conducive nature. The copious issue contains papers such as The Meaning of Biological Information by Eugene Koonin, DNA as Information by Peter Wills, and Pragmatic Information in Biology and Physics by Juan Roederer.
On the one hand, biology, chemistry and also physics tell us how the process of translating the genetic information into life could possibly work, but we are still very far from a complete understanding of this process. On the other hand, mathematics and statistics give us methods to describe such natural systems—or parts of them—within a theoretical framework. Furthermore, there are peculiar aspects of the management of genetic information that are intimately related to information theory and communication theory. This theme issue is aimed at fostering the discussion on the problem of genetic coding and information through the presentation of different innovative points of view. The aim of the editors is to stimulate discussions and scientific exchange that will lead to new research on why and how life can exist from the point of view of the coding and decoding of genetic information. (Abstract)
Chiang, David, et al. Grammatical Representations of Macromolecular Structure. Journal of Computational Biology. 13/5, 2006. With co-authors Aravind Joshi and David Searls, a contribution to the long convergence of genomes and language, genetic and linguistic composition and function, which exemplify the “same core principles.” Upon reflection, the achievement is to realize molecular realms as a literal text, while our written and spoken discourse then becomes biologically instructive in kind, if only we could learn to perceive and read this testament.
Since the first application of context-free grammars to RNA secondary structures in 1988, many researchers have used both ad hoc and formal methods from computational linguistics to model RNA and protein structure. We show how nearly all of these methods are based on the same core principles and can be converted into equivalent approaches in the framework of tree-adjoining grammars and related formalisms. (1077)
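To make the grammar idea concrete, here is a minimal sketch of base-pair maximization, a classic dynamic program (usually credited to Nussinov) that can be read as parsing an RNA string with a small context-free grammar of the form S -> x S y | x S | S x | S S. It is not the tree-adjoining grammar framework of the paper. The function name `max_pairs`, the restriction to Watson-Crick pairs, and the absence of a minimum loop length are simplifying assumptions for illustration.

```python
# Watson-Crick pairs only; G-U wobble pairs are left out in this simplified sketch.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_pairs(seq):
    """Return the maximum number of nested base pairs in an RNA sequence.

    dp[i][j] holds the best count for the substring seq[i..j] inclusive.
    Each case below mirrors one production of the grammar in the lead-in.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):            # substring length minus one
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]         # S -> x S : base i left unpaired
            best = max(best, dp[i][j - 1])  # S -> S x : base j left unpaired
            if (seq[i], seq[j]) in PAIRS:   # S -> x S y : i pairs with j
                inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):   # S -> S S : two adjacent substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


stem_loop = max_pairs("GGGAAACCC")  # three G-C pairs enclosing an AAA loop
```

The nesting of paired bases is what makes a context-free grammar, rather than a regular one, the natural fit for secondary structure.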
Chaudhuri, Pramit and Joseph Dexter. Bioinformatics and Classical Literary Study. arXiv:1602.08844. As part of an ongoing program, a Dartmouth College professor of classics and a Harvard University molecular biologist explain how a genome sequencing technique can be applied to parse classic Latin epics such as Vergil’s Aeneid. The literary humanities can thus become amenable to a common “-omics” analysis. As other entries in this new section report, an historic discovery, mostly unbeknownst, of an innately textual, ultimately genomic nature which reaches from cosmome and quantome to our human epitome is dawning in our midst.
This paper describes a collaborative project between classicists, quantitative biologists, and computer scientists to apply ideas and methods drawn from the sciences to the study of literature. A core goal of the project is the use of computational biology, natural language processing, and machine learning techniques to investigate intertextuality, reception, and related phenomena of literary significance. As a case study in our approach, here we describe the use of sequence alignment, a common technique in genomics, to detect intertextuality in Latin literature. Sequence alignment is distinguished by its ability to find inexact verbal parallels, which makes it ideal for identifying phonetic resemblances in large corpora of Latin texts. Although especially suited to Latin, sequence alignment in principle can be extended to many other languages. (Abstract)
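As a small illustration of how sequence alignment can surface inexact verbal parallels, here is a minimal sketch of local alignment scoring in the Smith-Waterman style, applied to plain text rather than DNA. The abstract does not specify the authors' algorithm or scoring scheme, so the function name `local_alignment_score` and the match, mismatch and gap values are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    Works on any indexable sequences, e.g. strings (character level)
    or lists of words. Only two rows of the score table are kept in memory.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0, so an alignment can restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two phrases that differ by a single letter still score highly.
near_match = local_alignment_score("arma virum", "arva virum")
unrelated = local_alignment_score("abc", "xyz")
```

Because a single substituted letter costs only a small penalty, near-identical phrases keep a high score, which is the property the abstract highlights for detecting phonetic resemblances.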
Delwiche, Charles. The Genomic Palimpsest: Genomics in Evolution and Ecology. BioScience. 54/11, 2004. Advances in the sequencing and analyzing of complete genomes can now inform the study of populations, interacting organisms and the course of evolution. A clever metaphor is then enlisted whereby the DNA code is seen to have evolved in a similar way to medieval manuscripts. Because good parchment was scarce, earlier texts were partially erased and written over by later, superimposed passages, known as a palimpsest. Genomes likewise evolve not by adding new genes but through modifying prior ones whose remnants reflect their history.
Deming, Laura, et al. Genetic Architect: Discovering Genomic Structure with Learned Neural Architectures. arXiv:1605.07156. As the genetic and brain sciences merge and cross-inform, UC San Francisco, Institute for Human Genetics, researchers show how deep learning network algorithms can also effectively parse genome phenomena. An inclusive synthesis then seems underway from celestial webs (Coutinho) to literature (Rosetta Cosmos), as a cosmos to culture anatomy and physiology just now becomes realized.
Each human genome is a 3 billion base pair set of encoding instructions. Decoding the genome using deep learning fundamentally differs from most tasks, as we do not know the full structure of the data and therefore cannot design architectures to suit it. As such, architectures that fit the structure of genomics should be learned not prescribed. Here, we develop a novel search algorithm, applicable across domains, that discovers an optimal architecture which simultaneously learns general genomic patterns and identifies the most important sequence motifs in predicting functional genomic outcomes. The architectures we find using this algorithm succeed at using only RNA expression data to predict gene regulatory structure, learn human-interpretable visualizations of key sequence motifs, and surpass state-of-the-art results on benchmark genomics challenges. (Abstract)