IV. Ecosmomics: Independent Complex Network Systems, Computational Programs, Genetic Ecode Scripts

2. The Innate Affinity of Genomes, Proteomes and Language

Karollus, Alexander, et al. Species-aware DNA language models capture regulatory elements and their evolution. Genome Biology. Vol. 25/Art 83, 2024. In this BMC journal, Technical University of Munich geneticists introduce an effective synthesis of these premier nucleotide and narrative code-script domains. By so doing, a cross-assimilation is achieved of these biomolecular and linguistic text phases to an extent that they can be seen as the same descriptive process in different sequential venues. See also How do Large Language Models understand Genes and Cells by Chen Fang, et al in bioRxiv preprints for March 27, 2024 and Gene and RNA Editing at arXiv:2409.09057.

Large-scale multi-species genome sequencing promises to shed new light on gene regulatory instructions. To this end, algorithms are needed that can leverage conservation while accounting for their evolution. Here, we introduce species-aware DNA language models trained on 800 species spanning 500 million years of evolution. We show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. These results show that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes. (Abstract)

A typical eukaryotic genome contains large regions of non-coding DNA. These are not translated into proteins but contain regulatory elements which control gene expression in response to environmental cues. Finding these regulatory elements and elucidating how their combinations and arrangements determine gene expression is a major goal of genomics research and is of great utility for synthetic biology and personalized medicine. (1)

Conclusion: In this study, we trained language models on the genomes of hundreds of fungal species, spanning more than 500 million years of evolution. We specifically directed our attention to non-coding regions, examining the ability of the models to acquire meaningful species-specific and shared regulatory attributes when trained on the genomes of many species. To our knowledge, we are the first to show that LMs are able to transfer these attributes to unseen species.

Kay, Lily. A Book of Life?: How the Genome Became an Information System and DNA a Language. Perspectives in Biology and Medicine. 41/4, 1998. The late philosopher of science discerns intrinsic congruities between the verbal and genetic codes.

Kay, Lily. Who Wrote the Book of Life? Stanford, CA: Stanford University Press, 2000. A premier history of science study of how a linguistic metaphor came to represent the genetic code. The author goes on to note a correspondence between molecular genetics, language and the Chinese divination system, the I Ching.

As with (linguist Roman) Jakobson, the answer was affirmative (to the question of one basic code) and pointed to a universe fundamentally different from that portrayed in Jacques Monod’s Chance and Necessity. Rather than viewing DNA-based life as a product of chance, it would be chance subject to the structures and patterns of the I Ching. And rather than being a gypsy living on the edge of an alien world, as Monod decried, a human being would enjoy a deep sense of security that emerged from being planted physically and spiritually in an internal natural order. (318)

Kilgore, Henry, et al. Protein codes promote selective subcellular compartmentalization. Science. February 6, 2025. In our novel phase of AI-assisted computational biology, twelve researchers at the Whitehead Institute for Biomedical Research and the Computer Science and Artificial Intelligence Laboratory, MIT describe a language-based code-script model which, in addition to functional aspects, can now predict which bounded compartments proteins localize to.

Cells have evolved mechanisms to distribute billions of protein molecules to subcellular phases where they are involved in shared functions. Here, we show that these proteins convey amino acid sequence codes that guide them to compartment destinations. A protein language model, ProtGPS, was developed that predicts their localization from the training set. Our results indicate that protein sequences contain not only a folding code, but also a previously unrecognized code governing their distribution to diverse subcellular compartments. (Excerpt)

Lackova, Ludmila, et al. Arbitrariness is not Enough: Towards a Functional Approach to the Genetic Code. Theory in Biosciences. Online May, 2017. Palacky University, Olomouc, Czech Republic linguists Lackova, Vladimir Matlach, and Dan Faltynek build a case for a semiotic definition of genomic conveyance. By this view, similar to written and oral communications, nucleotides and proteins are all about signs, symbols, interpretation and transcription. Apropos, from our home page a slide presentation, Cosmic Genesis in the 21st Century, that I gave at Palacky University in 2005 can be accessed.

Arbitrariness in the genetic code is one of the main reasons for a linguistic approach to molecular biology: the genetic code is usually understood as an arbitrary relation between amino acids and nucleobases. However, from a semiotic point of view, arbitrariness should not be the only condition for definition of a code, consequently it is not completely correct to talk about “code” in this case. Semiotically, a code should be always associated with a function and we propose to define the genetic code not only relationally (in basis of relation between nucleobases and amino acids) but also in terms of function (function of a protein as meaning of the code). In fact, if the function of a protein represents the meaning of the genetic code (the sign’s object), then it is crucial to reconsider the notion of its expression (the sign) as well. In our contribution, we will show that the actual model of the genetic code is not the only possible and we will propose a more appropriate model from a semiotic point of view. (Abstract)

Lackova, Ludmila. Folding of a Peptide Continuum: Semiotic Approach to Protein Folding. Semiotica. 233/77, 2020. The Palacky University, Olomouc, CR linguist continues her studies of innate affinities across genetic, metabolic and on to communicative codes, which each seem to have a common textual nature. What then might be their phenomenal message as we first grade readers try to interpret, translate and understand?

In this paper I attempt to study the notion of “folding of a semiotic continuum” in a direction of a possible application to the biological processes (protein folding). The process of obtaining protein structures is compared to the folding of a semiotic continuum. Consequently, peptide chain is presented as a continuous line potential to be formed (folded) in order to create functional units. The functional units are protein structures having a certain usage in the cell or organism (semiotic agents). Moreover, protein folding is analyzed in terms of tension between syntax and semantics. (Abstract)

Lee, Ji-Hoon, et al. A DNA Assembly Model of Sentence Generation. BioSystems. Online, June, 2011. Seoul National University, Kyungpook National University, and University of Arkansas bioinformatic scientists add to the evidence that these widely separated generative sources of life and culture share deep affinities with regard to their grammatical structures. Since the inklings of Roman Jakobson and Jean Piaget in the 1970s and earlier that genome and “languagome” (just coined) are deeply similar, this emergent evolutionary correspondence has been steadily proven, which this whole section seeks to document.

Recent results of corpus-based linguistics demonstrate that context-appropriate sentences can be generated by a stochastic constraint satisfaction process. Exploiting the similarity of constraint satisfaction and DNA self-assembly, we explore a DNA assembly model of sentence generation. The words and phrases in a language corpus are encoded as DNA molecules to build a language model of the corpus. Given a seed word, the new sentences are constructed by a parallel DNA assembly process based on the probability distribution of the word and phrase molecules. Here, we present our DNA code word design and report on successful demonstration of their feasibility in wet DNA experiments of a small scale. (Abstract)
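The stochastic step in this abstract, extending a seed word according to a probability distribution learned from a corpus, can be pictured with a small software analogy. The sketch below is not the authors' DNA code word design or their wet-lab procedure. It swaps in a plain bigram sampler, and the toy corpus and the function names (build_bigram_model, generate_sentence) are assumptions made only for illustration.

```python
import random
from collections import Counter, defaultdict


def build_bigram_model(corpus):
    """Count, for each word, which words follow it in the corpus."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            model[current][following] += 1
    return model


def generate_sentence(model, seed_word, max_words=8, rng=None):
    """Extend a seed word step by step, sampling each next word in
    proportion to how often it followed the current word."""
    rng = rng or random.Random()
    sentence = [seed_word]
    while len(sentence) < max_words:
        followers = model.get(sentence[-1])
        if not followers:
            break  # no observed continuation, so the sentence ends here
        words, counts = zip(*followers.items())
        sentence.append(rng.choices(words, weights=counts, k=1)[0])
    return " ".join(sentence)


# Hypothetical toy corpus, used only to exercise the sampler.
toy_corpus = [
    "the gene encodes a protein",
    "the protein folds into a structure",
    "a structure binds the gene",
]
model = build_bigram_model(toy_corpus)
print(generate_sentence(model, "the", rng=random.Random(0)))
```

In the paper the candidate continuations are molecules assembling in parallel, whereas this sampler draws one word at a time; only the idea of sampling from corpus frequencies carries over.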

Li, Zhi, et al. Extracting DNA Words Based on the Sequence Features. Theoretical Biology and Medical Modelling. 13/2, 2016. Shanxi Medical University, Taiyuan, China researchers carry out a formal interpretation of genetic systems by way of linguistic and textual terms. Nucleotide strings appear as a language with words, sentences, vocabularies, so that genomes are akin to a written book. This deep correspondence is braced by a novel algorithm that traces salient aspects of non-uniform distributions and integrity. Its validity is checked by applying to a select English volume, The Holy Bible (see quotes). How fortuitous, for here is evidence of a direct relation between religious scripture and a naturome code, God’s word and works.

Liang, Wang. Human Genome Book: Words, Sentences and Paragraphs. arXiv:2501.16982. In these mid 2020s when AI large language model publicity fills the cyberair, a Huazhong University of Science and Technology, China physicist lays out a wholesale translation of life’s hereditary endowment in full literary and textual terms. See also Find Central Dogma Again by LW at arXiv:2502.06253 whereby these AI methods are able to “rediscover” basic genetic principles and DNA and Human Language: Epigenetic Memory and Redundancy in Linear Sequence by Li Yang and Dongbo Wang at arXiv:2503.23494 for a similar review.

A consideration of the genome as a book with equivalents of words, sentences, and paragraphs has been often proposed. Recently, large language models have provided a novel approach, whereby we can train a foundational model capable of transferring from English to DNA sequences. We were then able to translate a human genome by segments and tokens into a "book" comprised of genomic "words," "sentences," and "paragraphs."
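As a rough picture of what dividing a sequence into "words," "sentences," and "paragraphs" can look like, here is a minimal sketch. It is an assumption-laden stand-in: the paper segments the genome with a trained language model, while this example simply cuts fixed-length k-mers and groups them by count. The sequence, the function names, and the sizes are all hypothetical.

```python
def to_words(sequence, k=3):
    """Cut a DNA string into consecutive, non-overlapping k-mers ("words")."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def chunk(items, size):
    """Group a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def genome_book(sequence, k=3, words_per_sentence=4, sentences_per_paragraph=2):
    """Nest words into sentences and sentences into paragraphs."""
    words = to_words(sequence, k)
    sentences = chunk(words, words_per_sentence)
    return chunk(sentences, sentences_per_paragraph)


# Hypothetical 30-base sequence, used only for illustration.
toy_sequence = "ATGCGTACGTTAGCCTAGGCTAACGTTAGC"
book = genome_book(toy_sequence)
for paragraph in book:
    print(" | ".join(" ".join(sentence) for sentence in paragraph))
```

A learned tokenizer would place the boundaries where the statistics of the sequence suggest them, rather than at fixed offsets as done here.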

Lin, Yiqun, et al. Exploiting Hierarchical Interactions for Protein Surface Learning. arXiv:2401.10144. Hong Kong University of Science and Technology, and Nanyang Technological University, Singapore computer scientists post another frontier instance of creative ways to learn to read and write life’s amino acid metabolism.

Predicting interactions between proteins is a main project in structural bioinformatics which is often based on geometric and chemical features. Here, we propose key properties of a more effective learning process: 1) a relationship among atoms linked by covalent bonds to form biomolecules; 2) a residue effect that validates hierarchical feature interactions among atoms and surface points. In this paper, we present a principled framework based on deep learning techniques, namely Hierarchical Chemical and Geometric Feature Interaction Network (HCGNet), for protein surface analysis by bridging chemical and geometric features with hierarchical interactions. (Excerpt)

In this work, we highlight the importance of the multiscale relationship between atoms and the hierarchical interaction between chemical and geometric features. To this end, we propose HCGNet, a novel learning architecture for protein surface analysis. HCGNet takes atoms and surface points of a given protein as the input. Then two hierarchical branches are used to learn chemical features from atoms and geometric features from surface points in parallel. In addition, features are hierarchically propagated from the chemical branch to the geometric branch for multi-modality feature fusion. (9)

List, Johann-Mattis, et al. Networks of Lexical Borrowing and Lateral Gene Transfer in Language and Genome Evolution. BioEssays. Online December, 2013. From our late vantage, Philipps-University Marburg and Heinrich-Heine University Düsseldorf linguists and biologists achieve a keen observation about the historical study and affinity of these disparate programs. The course of linguistics has mostly been reconstructed in terms of vertical “trees,” which is also how eukaryote cellular life proceeds. But language history is actually seen to take horizontal, net-like pathways through sharings, akin to how microbial prokaryotes trade genetic materials. So a further, novel correspondence can be elucidated between genome and languagome. See also in BioEssays 36/1, 2014 Horizontal Gene Acquisitions by Eukaryotes as Drivers of Adaptive Evolution by Gerald Schonknecht, et al, whence such parallel traffic occurs for these nucleated cells.

Like biological species, languages change over time. As noted by Darwin, there are many parallels between language evolution and biological evolution. Insights into these parallels have also undergone change in the past 150 years. Just like genes, words change over time, and language evolution can be likened to genome evolution accordingly, but what kind of evolution? There are fundamental differences between eukaryotic and prokaryotic evolution. In the former, natural variation entails the gradual accumulation of minor mutations in alleles. In the latter, lateral gene transfer is an integral mechanism of natural variation. The study of language evolution using biological methods has attracted much interest of late, most approaches focusing on language tree construction. These approaches may underestimate the important role that borrowing plays in language evolution. Network approaches that were originally designed to study lateral gene transfer may provide more realistic insights into the complexities of language evolution. (List Abstract)

In contrast to vertical gene transfer from parent to offspring, horizontal (or lateral) gene transfer moves genetic information between different species. Bacteria and archaea often adapt through horizontal gene transfer. Recent analyses indicate that eukaryotic genomes, too, have acquired numerous genes via horizontal transfer from prokaryotes and other lineages. Based on this we raise the hypothesis that horizontally acquired genes may have contributed more to adaptive evolution of eukaryotes than previously assumed. Current candidate sets of horizontally acquired eukaryotic genes may just be the tip of an iceberg. We have recently shown that adaptation of the thermoacidophilic red alga Galdieria sulphuraria to its hot, acid, toxic-metal laden, volcanic environment was facilitated by the acquisition of numerous genes from extremophile bacteria and archaea. Other recently published examples of horizontal acquisitions involved in adaptation include ice-binding proteins in marine algae, enzymes for carotenoid biosynthesis in aphids, and genes involved in fungal metabolism. (Schonknecht Abstract)

List, Johann-Mattis, et al. Unity and Disunity in Evolutionary Sciences: Process-Based Analogies Open Common Research Avenues for Biology and Linguistics. Biology Direct. Online August, 2016. University of Pierre and Marie Curie, Paris theorists including Eric Bapteste survey the long interplay, at times together and at times apart, between genetics and languages. While parallels seem innately evident, their actual discernment has proven elusive until these late days of algorithmic network complexities. As this section reports, a cross-fertilization of analytic techniques such as homolog identification, sequence alignment, and protein literacy is much underway. And may we again report that from cosmic and galactic webs to neural net connectomics, from uniVerse to human epitome, the one, same iconic scriptome recurs and informs in kind.

We compared important evolutionary processes in biology and linguistics and identified processes specific to only one of the two disciplines as well as processes which seem to be analogous, potentially reflecting core evolutionary processes. These new process-based analogies support novel methodological transfer, expanding the application range of biological methods to the field of historical linguistics. We illustrate this by showing (i) how methods dealing with incomplete lineage sorting offer an introgression-free framework to analyze highly mosaic word distributions across languages; (ii) how sequence similarity networks can be used to identify composite and borrowed words across different languages; (iii) how research on partial homology can inspire new methods and models in both fields; and (iv) how constructive neutral evolution provides an original framework for analyzing convergent evolution in languages resulting from common descent (Sapir’s drift). (Results)
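Item (ii) above, a sequence similarity network over words, can be sketched in a few lines. This is a simplified illustration and not the authors' pipeline: it assumes normalized edit distance on spellings as the similarity measure, an arbitrary threshold of 0.4, and a five-word toy list. Clusters of look-alike forms are only candidates for further study, since spelling similarity alone cannot separate borrowing from common inheritance.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity_network(entries, threshold=0.4):
    """Link (language, word) pairs whose normalized distance is small."""
    edges = {entry: set() for entry in entries}
    for x, y in combinations(entries, 2):
        a, b = x[1].lower(), y[1].lower()
        if edit_distance(a, b) / (max(len(a), len(b)) or 1) <= threshold:
            edges[x].add(y)
            edges[y].add(x)
    return edges


def connected_components(edges):
    """Collect clusters of mutually reachable nodes by depth-first search."""
    seen, components = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(edges[node] - component)
        seen |= component
        components.append(component)
    return components


# Small illustrative word list for the concept "mountain".
entries = [("English", "mountain"), ("French", "montagne"),
           ("Spanish", "montaña"), ("German", "Berg"), ("Dutch", "berg")]
for cluster in connected_components(similarity_network(entries)):
    print(sorted(cluster))
```

Here the English form clusters with the Romance forms rather than with its Germanic relatives, which is the kind of cross-family pattern a network view brings out and a strict tree would hide.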
