IV. Ecosmomics: Independent Complex Network Systems, Computational Programs, Genetic Ecode Scripts

2. The Innate Affinity of Genomes, Proteomes and Language

Around 1970, the linguist Roman Jakobson and the biopsychologist Jean Piaget opined that these prime informational domains ought to have much in common. The familiar phrase Book of Life, along with analogous usage of language terms in genetics, was popular at the time. Over subsequent decades, as gathered herein, parallels between the two code scripts grew in veracity and value, often as an insightful cross-comparison. We post this 2016 section because recent contributions strongly confirm an innate, natural continuity. A March 2016 issue of the Philosophical Transactions of the Royal Society A on “DNA as Information” (Julyan Cartwright) supports this view by novel rootings of genome phenomena in mathematics, physics and chemistry.

2020: As our worldwise personsphere proceeds to learn and know on her/his own, an ecosmic correlation and cross-translation between these archetypal generative codes is strongly evident. Both phases take on a commonly manifest textual, informative scriptome guise. A novel indicative sign, as entries convey, is an on-going transfer and avail of common sequencing and parsing techniques in both realms.

Bepler, Tristan and Bonnie Berger. Learning the Protein Language: Evolution, Structure, and Function. Cell Systems. 12/6, 2021.
Faltynek, Dan, et al. On the Analogy between the Genetic Code and Natural Language by Sequence Analysis. Biosemiotics. April, 2019.
Ferruz, Noelia, et al. ProtGPT2 is a Deep Unsupervised Language Model for Protein Design. Nature Communications. 13/4348, 2022.
Lackova, Ludmilla. Folding of a Peptide Continuum: Semiotic Approach to Protein Folding. Semiotica. 233/77, 2020.
Ros, Enric, et al. Learn from Nature to Expand the Genetic Code. Trends in Biotechnology. 19/5, 2021.
Wang, Li-Min, et al. Mechanism of Evolution Shared by Genes and Language. arXiv:2012.14309.
Zolyan, Suren. From Matter to Form: The Evolution of the Genetic Code as Semio-poiesis. Semiotica. March 2022.

2023:

Sala, Alba, et al. An integrated machine-learning model to predict nucleosome architecture. Nucleic Acids Research. 52/17, September 2024. We cite this prime journal entry by seven bioresearchers at the Barcelona Institute of Science and Technology and Universitat de Barcelona including Modesto Orozco as another leading edge of an integral merger of frontier genetic studies and the latest AI neural net computational methods. See also Explainable AI Methods for Multi-Omics Analysis by Ahmad Hussein, et al at arXiv:2410.11910 for more current advances.

We demonstrate that nucleosomes placed in the gene body can be located from signal decay theory. These wave signals can be in phase or in antiphase. We found that the first (+1) and the last (-last) nucleosomes are contiguous to regions signaled by transcription factor binding sites. Based on these analyses, we developed a method that combines Machine Learning and signal transmission theory which is able to predict the basal locations of the nucleosomes with an accuracy similar to that of experimental MNase-seq based methods. (Excerpt)

When we applied our model to the human genome, we obtained more accurate results than what would be expected from a random model. Considering the additional layers to deconvolute when studying more complex organisms and that the current model and architecture was optimized for yeast, our current methodology and experimental data show the potential to be used to study any nucleosome positioning array. Results presented here show the existence of a clear connection between expression level and the organization of nucleosome arrays. (11)

Abramson, Josh, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. May 8, 2024. Some fifty computational biologists at Google DeepMind, London and Isomorphic Labs, London led by John Jumper introduce a latest, much advanced edition of this Protein Structure Database capability which first came out in mid 2021. As the quotes say, major AI deep machine neural learning advances have now fostered capabilities not possible earlier. Among many news reports see Google Unveils A.I. for Predicting Behavior of Human Molecules by Cade Metz in the NY Times (May 8, 2024).

In this paper, we describe our latest AlphaFold 3 model with an updated diffusion-based architecture which can predict complex structures including proteins, nucleic acids, small molecules, ions, and modified residues. The new version has improved analytic accuracy for protein-ligand interactions, protein-nucleic acids, and higher antibody-antigen predictability. Together these results achieve a revolutionary stage of precise modelling across biomolecular space within a single unified deep learning framework. (Abstract)

The development of bottom-up modelling of cellular components is a key step in unravelling the complexity of molecular regulation within the cell, and the performance of AlphaFold 3 shows that the right deep learning frameworks can reduce the amount of data required to obtain relevant performance. We expect that structural modelling will continue to improve not only due to deep learning but also by cryo electron microscopy and tomography. The parallel advance of experimental and computational methods promises an era of structurally informed biological understanding and therapeutic development. (10)

Asgari, Ehsaneddin and Mohammad Mofrad. Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence as a Quantitative Measure of Language Distance. arXiv:1604.08561. Within the work of Mofrad’s Molecular Cell Biomechanics Laboratory which involves the linguistic modeling of protein bioinformatics, Iranian-American, UC Berkeley researchers discern an innate affinity between linguistic volumes by way of network parsings (search Rosetta Cosmos) and similar analyses of genomes. By so doing, the mid 2010s realization of a common natural textuality from uniVerse to human increasingly gains a scientific verification. See also herein Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics by the authors.

We introduce a new measure of distance between languages based on word embedding, called word embedding language divergence, defined as divergence between unified similarity distribution of words between languages. Using such a measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantees having the same content in all languages, interestingly in many cases languages within the same family cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and two human subjects). The proposed method is a step toward defining a quantitative measure of similarity between languages, with applications in languages classification, genre identification, dialect identification, and evaluation of translations. (Abstract)

Figure 1: Hierarchical clustering of fifty natural languages according to divergence of joint distance distribution of 4097 aligned words in bible parallel corpora. Subsequently we use colors to show the ground-truth about family of languages. For Indo-European languages we use different symbols to distinguish various sub-families of Indo-European languages. We observe that the obtained clustering reasonably discriminates between various families and subfamilies. Figure 2: Visualization of word embedding language divergence in twelve different genomes belonging to 12 organisms for various n-gram segments. Our results indicate that evolutionarily closer species have higher proximity in the syntax and semantics of their genomes. (8)
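The divergence measure quoted in this entry can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that the measure is a divergence between similarity distributions of aligned words, so the use of cosine similarity, the shift to non-negative weights, and the choice of Jensen-Shannon divergence are all assumptions made here for illustration, as are the function names and toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_distribution(embeddings):
    """Turn pairwise similarities among aligned concepts into a distribution.

    `embeddings` maps a shared concept key to that language's word vector.
    Assumption: cosine values are shifted from [-1, 1] to [0, 2] so they
    can be normalized into probabilities.
    """
    keys = sorted(embeddings)
    weights = []
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            weights.append(cosine(embeddings[first], embeddings[second]) + 1.0)
    total = sum(weights)
    return [w / total for w in weights]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def language_divergence(emb_a, emb_b):
    """Hypothetical stand-in for a word embedding language divergence."""
    return js_divergence(similarity_distribution(emb_a),
                         similarity_distribution(emb_b))

# Toy two-dimensional "embeddings" for three aligned concepts (made-up numbers).
lang_a = {"water": [1.0, 0.0], "fire": [0.0, 1.0], "river": [0.9, 0.1]}
lang_b = {"water": [0.8, 0.2], "fire": [0.1, 0.9], "river": [0.7, 0.3]}
lang_c = {"water": [1.0, 0.0], "fire": [1.0, 0.1], "river": [0.0, 1.0]}

print(language_divergence(lang_a, lang_b))
print(language_divergence(lang_a, lang_c))
```

In this toy setup, languages that arrange the same concepts in a similar geometry receive a small divergence, which is the intuition behind clustering languages, or genomes, by such a distance.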

Avise, John. The Best and the Worst of Times for Evolutionary Biology. BioScience. 53/3, 2003. More considerations by the University of Georgia geneticist and author on better metaphors for an increasingly sequenced dynamic genome beyond the old string of particulate molecules. This quote is also a good description of a complex adaptive system.

An emerging view is that the genome is in many ways like an extended intracellular society of interacting genetic elements. Within each such microecosystem are multitudinous quasi-independent DNA sequences with elaborate divisions of labor and functional collaborations….Their strategies (pieces of DNA) often bear a striking analogy to those observed among people partially bound in social arrangements. (251)

Benegas, Gonzalo, et al. Genomic Language Models: Opportunities and Challenges. Trends in Genetics. 41/4, 2025. UC Berkeley bioinformatic researchers scope out a latest common confluence as these two main life and mind codifications proceed to gain their essential affinity. See also Linguistics-based formalization of the antibody language as a basis for antibody language models by Mai Ha Vu et al in Nature Computational Science (4/412, 2024) and A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language by Yuchen Zhang, et al at arXiv:2407.15888 for companion work.

Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. In this review, we showcase this potential by highlighting key applications of gLMs, including fitness prediction, sequence design, and transfer learning. (Abstract)
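The analogy in this abstract between word sequences and biological sequences can be made concrete with a small sketch of how a DNA string might be cut into word-like tokens. Splitting into overlapping k-mers is one common preprocessing choice, but the review quoted here does not prescribe it; the function names, `k`, and `stride` are illustrative assumptions.

```python
from collections import Counter

def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA string into k-mers, treating each as a 'word'.

    stride=1 gives overlapping tokens; stride=k gives non-overlapping ones.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def kmer_vocabulary(sequences, k=3):
    """Count how often each k-mer occurs across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmer_tokens(seq, k))
    return counts

print(kmer_tokens("ATGCGT", k=3))
print(kmer_vocabulary(["ATGCGT", "ATGATG"], k=3).most_common(2))
```

A token list of this kind is what a sequence model would consume in place of the words of a sentence.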

Benegas, Gonzalo, et al. DNA language models are powerful predictors of genome-wide variant effects. PNAS. 120/44, 2023. As 2023 seems to be a year of novel literacies from Chatbots, large language models, to genomic and protein linguistics as this, UC Berkeley computer scientists introduce a dedicated program-like method as a better way to parse, read and curate nature’s informative code-script. See also, for example, Exploring the Protein Sequence Space with Global Generative Models by Sergio Romero-Romero, et al at arXiv:2305.01941.

The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants remains a challenge. Recent progress in natural language processing via unsupervised pretraining on protein sequence databases has worked well in extracting complex information. Here we introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects (Abstract). As the artificial intelligence field progresses, our approach can incorporate future advancements, offering a powerful and scalable tool to decipher the vast biological sequence diversity observed in nature. (Significance)

Bepler, Tristan and Bonnie Berger. Learning the Protein Language: Evolution, Structure, and Function. Cell Systems. 12/6, 2021. We cite this by Simons Machine Learning Center, NYC and MIT computational biologists as another instance of how deep learning AI, ML (Earthificial) leading edge capabilities are beginning a new global plane of lively analytical studies.

Language models have emerged as a machine-learning approach for distilling information from protein sequence databases. These methods can discover evolutionary, structural, and functional organization across protein spaces. Here we encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. Deep protein language studies thus suggest new ways to approach protein and therapeutic design. (Excerpt)

Bolshoy, Alexander, et al. Genome Clustering: From Linguistic Models to Classification of Genetic Texts. Berlin: Springer, 2010. Israeli geneticists contribute to growing indications of a pervasive code system, which engenders phrases such as DNA texts and DNA linguistics, along with implications that our language corpus is then in some ways genomic in nature.

Boulaimen, Youssef, et al. Integrating Large Language Models for Genetic Variant Classification. arXiv:2411.05055. We cite this contribution by nine bioinformatics specialists at Oncodesign Precision Medicine, Dijon, France as a leading edge example of how gene-based diagnostic research is melding with and being enhanced by AI linguistic-like sequence data analysis.

The classification of genetic variants poses a challenge in clinical genetics and precision medicine. Large Language Models (LLMs) have emerged as transformative tools which can uncover intricate patterns and predictive insights that other methods might miss. This study seeks to integrate current LLMs which leverage DNA and protein sequence data as a way to analyze variant classifications. Our models were tested on uncertain variants and showed substantial improvements. The results aver how multiple models can refine the accuracy and reliability of variant classification systems and support advanced computational models in clinics. (Excerpts)

Caetano-Anolles, Gustavo. Agency in Evolution of Biomolecular Communication. Annals of the New York Academy of Sciences. May, 2023. The University of Illinois bioinformatics scholar (search) continues his theoretic and empirical perceptions of life’s innate self-serving proactivity because it appears to be deeply suffused and informed by biosemiotic encodings at every scalar phase. It is further proposed that biological systems are graced by recurrent anatomies, dynamic behaviors and linguistic expressions, to an extent that “lexicons, semantics and syntax” have a macromolecular essence. As this natural narrative unfolds, a surmise is that “learning and intelligence drive biomolecular communication.” So into 2023, another erudite entry describes a relative decipherment of life’s literary instructive scriptome.

The emergence of agency in biomolecular systems involves a biphasic process of communication that constructs a message before it can be transmitted for interpretation. Evolutionary genomic and bioinformatic explorations suggest agency emerges when molecular machinery generates hierarchical layers of vocabularies in an entangled communication network clustered around the universal Turing machine of the ribosome. (Annals Editor)

If communication and language are biological drivers, then existing quantitative linguistic patterns follow the three types of statistical laws that exist in natural languages: probability distribution, functional-type, and developmental-type. Scale-free patterns explain power law behavior of proteome domains to an extent that Bacteria, Archaea and Eukarya phases match those for the English and Chinese languages. (18)

Finally, developmental-type laws, which place languages in the context of time, explain the growth and evolution of biological essences. For example, the Heaps’ law, which describes how the number of words in a document scales with a protein database, was found operating in proteins. Heaps’ exponents of growth for viruses approached unity and were similar to not early Chinese, Japanese, and Korean regimens with limited dictionary that reflect ancient kernel-like vocabularies. (18-19)
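Heaps' law is usually written as V(n) = K n^β, where V is the number of distinct words seen after n tokens. A minimal sketch of estimating the exponent β from a token stream follows. The log-log least-squares fit and the function names are assumptions chosen for illustration, and the source does not say how the quoted exponents were obtained. Tokens could be words of a text or, by analogy, protein domains in a proteome.

```python
import math

def vocabulary_growth(tokens):
    """Return (n, distinct_count) after each of the first n tokens."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve

def fit_heaps(curve):
    """Estimate K and beta in V(n) = K * n**beta by least squares on logs.

    Needs at least two points on the curve.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta

# A stream in which every token is new: the exponent approaches unity.
k_all_new, beta_all_new = fit_heaps(vocabulary_growth(list(range(200))))
# A stream that repeats one token: the vocabulary never grows.
k_repeat, beta_repeat = fit_heaps(vocabulary_growth(["A"] * 200))
print(beta_all_new, beta_repeat)
```

An exponent near one means almost every new item is novel, while an exponent near zero means the vocabulary is already saturated.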

Cai, Yizhi, et al. Modeling Structure-Function Relationships in Synthetic DNA Sequences using Attribute Grammars. PLoS Computational Biology. 5/10, 2009. As a systems biology approach wholly reconceives the genetic code, Virginia Polytechnic bioinformatics scientists weigh in on how actually to identify and define what a “gene” is. As the quotes advise, a clever idea is to draw on “attribute grammars” from computer software to help represent genetic function. Once again, affinities with linguistic formats, and underlying complexity phenomena, can be noted.

Recognizing that certain biological functions can be associated with specific DNA sequences has led various fields of biology to adopt the notion of the genetic part. This concept provides a finer level of granularity than the traditional notion of the gene. However, a method of formally relating how a set of parts relates to a function has not yet emerged. Synthetic biology both demands such a formalism and provides an ideal setting for testing hypotheses about relationships between DNA sequences and phenotypes beyond the gene-centric methods used in genetics. Attribute grammars are used in computer science to translate the text of a program source code into the computational operations it represents. By associating attributes with parts, modifying the value of these attributes using rules that describe the structure of DNA sequences, and using a multi-pass compilation process, it is possible to translate DNA sequences into molecular interaction network models. (1)

Yet, despite its success, the notion of gene appears insufficient to express the complexity of the relation between an organism genome and its phenotype. The elucidation of the molecular mechanisms controlling gene expression has revealed a web of molecular interactions that have been modeled mathematically to show that important phenotypic traits are the emerging properties of a complex system. (1)
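The attribute-grammar idea quoted in this entry can be suggested with a toy sketch: a sequence of genetic parts is checked against a structural rule, and attributes attached to the parts are then combined into a simple model. Everything here is hypothetical, including the part kinds, the single `value` attribute, the cassette rule, and the rate formula. It illustrates only the two-pass flavor of such a translation, not the authors' grammar or software.

```python
from dataclasses import dataclass

@dataclass
class Part:
    kind: str           # "promoter", "rbs", "cds", or "terminator"
    name: str
    value: float = 1.0  # hypothetical attribute, e.g. strength or efficiency

# Hypothetical structural rule: cassette -> promoter rbs cds terminator
CASSETTE_RULE = ("promoter", "rbs", "cds", "terminator")

def parse_cassettes(parts):
    """First pass: group parts into cassettes, rejecting ill-formed designs."""
    size = len(CASSETTE_RULE)
    if len(parts) % size != 0:
        raise ValueError("design is not a whole number of cassettes")
    cassettes = []
    for start in range(0, len(parts), size):
        group = parts[start:start + size]
        if tuple(p.kind for p in group) != CASSETTE_RULE:
            raise ValueError(f"cassette at position {start} breaks the rule")
        cassettes.append(group)
    return cassettes

def compile_design(parts):
    """Second pass: synthesize a rate attribute for each cassette."""
    model = []
    for promoter, rbs, cds, _terminator in parse_cassettes(parts):
        # Assumed semantic rule: expression rate = promoter strength * RBS efficiency
        model.append({"product": cds.name, "rate": promoter.value * rbs.value})
    return model

design = [
    Part("promoter", "pExample", 2.0),
    Part("rbs", "rbsExample", 0.5),
    Part("cds", "gfp"),
    Part("terminator", "tExample"),
]
print(compile_design(design))
```

The point of the sketch is the separation between a rule about sequence structure and a rule about what that structure means, which is what lets a DNA design be "compiled" into a model.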

Cartwright, Julyan, et al. DNA as Information: At the Crossroads between Biology, Mathematics, Physics and Chemistry. Philosophical Transactions of the Royal Society A. Vol.374/Iss.2063, 2016. University of Granada and University of Bologna scientists introduce an issue on growing abilities to connect and explain genetic phenomena with encompassing physical, chemical, and mathematical domains. As the quotes allude, both life and cosmos phases proceed to cross-inform each other. The natural universe increasingly appears as biologically conducive in essence, while living systems become theoretically amenable and describable by these disciplines. The authors go on to recognize an historic revolution, or paradigm shift in the making, which ought to be facilitated and pursued forthright. It is worth noting that language and book are once more a metaphor for both the genetic code and, by extension, for a conducive nature. The copious issue contains papers such as The Meaning of Biological Information by Eugene Koonin, DNA as Information by Peter Wills, and Pragmatic Information in Biology and Physics by Juan Roederer.

On the one hand, biology, chemistry and also physics tell us how the process of translating the genetic information into life could possibly work, but we are still very far from a complete understanding of this process. On the other hand, mathematics and statistics give us methods to describe such natural systems—or parts of them—within a theoretical framework. Furthermore, there are peculiar aspects of the management of genetic information that are intimately related to information theory and communication theory. This theme issue is aimed at fostering the discussion on the problem of genetic coding and information through the presentation of different innovative points of view. The aim of the editors is to stimulate discussions and scientific exchange that will lead to new research on why and how life can exist from the point of view of the coding and decoding of genetic information. (Abstract)

Biology at present is embarked on an experimental search that we may define as functionalist; that is to say that it is attempting to understand how the functions of living material link together. This search is based upon data that are harder and harder to classify, and above all to interpret. We may compare the situation to that of the comprehension of inanimate matter before the advent of the modern atomic theory. We may thus ask ourselves: were those theoretical efforts to understand and classify matter using physico-mathematical concepts useful? The answer is of course affirmative, and indeed theoretical methods used by biology today originated in the revolution—the paradigm shift—produced by the knowledge of the atomic structure of matter, without which molecular biology would not exist. We argue that another paradigm shift is needed to understand biology: its mathematization. (5)

A common metaphor refers to DNA as the ‘book of life’. Of course, we know that the main information that represents an organism is contained or carried by nucleic acid molecules. In this respect, DNA can be considered as a book, but curiously, such a metaphor has scientific basis only in the concept of the genetic code. However, the genetic code is not a book nor a part of it; rather it is a translation dictionary between two different worlds (languages), i.e. the world of nucleic acids and the world of proteins. Hence, the genetic code allows the translation of a book written in a language into an abridged version of the same book in a different language. Moreover, little is known about the grammar, the syntax and even the orthography of the book of life. Still, we know that the genetic code is involved in the transmission of the information contained in such book and configures a relevant part of the process that defines the central dogma of molecular biology. (7)
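The description of the genetic code as a translation dictionary between two languages can be shown literally. The sketch below builds the standard genetic code as a Python dictionary from codons to one-letter amino acid symbols and uses it to translate a coding sequence. The function name and the choice to stop at the first stop codon are illustrative assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
# "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(sequence):
    """Translate a coding sequence into protein letters, halting at a stop."""
    seq = sequence.upper().replace("U", "T")  # accept RNA spelling as well
    protein = []
    for start in range(0, len(seq) - 2, 3):   # ignore an incomplete last codon
        amino = CODON_TABLE[seq[start:start + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)

print(translate("ATGGCCTAA"))
```

The table has 64 entries but only 20 amino acid symbols plus stop, so the mapping is many-to-one, which fits the quoted image of an abridged version of the same book in a different language.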
