IV. Ecosmomics: Independent Complex Network Systems, Computational Programs, Genetic Ecode Scripts

2. The Innate Affinity of Genomes, Proteomes and Language

Ferrer-i-Cancho, Ramon, et al. The Challenges of Statistical Patterns of Language: The Case of Menzerath’s Law in Genomes. Complexity. Online December, 2012. With coauthors Nuria Forns, Antoni Hernandez-Fernandez, Gemma Bel-Enguix and Jaume Baixeries, Barcelona systems scientists advise that along with (George Kingsley) Zipf’s law, the law of German linguist Paul Menzerath, which relates the number of parts in a text or score to their mean size, can hold equally well for biomolecular nucleotide genomes. By these lights, another entry is gained to appreciate a deep, parallel affinity between the genetic code and literate languages.

The importance of statistical patterns of language has been debated over decades. Although Zipf's law is perhaps the most popular case, recently, Menzerath's law has begun to be involved. Menzerath's law manifests in language, music and genomes as a tendency of the mean size of the parts to decrease as the number of parts increases in many situations. This statistical regularity emerges also in the context of genomes, for instance, as a tendency of species with more chromosomes to have a smaller mean chromosome size. It has been argued that the instantiation of this law in genomes is not indicative of any parallel between language and genomes because (a) the law is inevitable and (b) noncoding DNA dominates genomes. Here mathematical, statistical, and conceptual challenges of these criticisms are discussed. Two major conclusions are drawn: the law is not inevitable and languages also have a correlate of noncoding DNA. However, the wide range of manifestations of the law in and outside genomes suggests that the striking similarities between noncoding DNA and certain linguistics units could be anecdotal for understanding the recurrence of that statistical law. (Abstract)

Languages and genomes show a striking similarity at the semantic level: both possess units that have an arbitrary semantic reference of symbolic nature. Our comparison goes further and suggests that genomes code for some abstract version of grammatical and lexical meaning, the former in non‐coding regions and the latter in coding regions. (6) Quantitative linguistics offers powerful tools for discovering and investigating non‐trivial connections between human language and genomes. However, the evolutionary mechanisms and the constraints that may underlie the recurrence of Menzerath’s law still must be understood. (6)
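As a reader's aid, the arithmetic behind the "not inevitable" conclusion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' statistical method: the part counts and total sizes are synthetic, and the function names `mean_part_size` and `kendall_tau` are hypothetical helpers. If the total size G is held constant, the mean part size G/n must fall as the number of parts n rises; if G grows faster than n, the trend reverses, so the negative trend depends on how G varies with n.

```python
def mean_part_size(total_size, n_parts):
    """Mean size of the parts, e.g. genome size divided by chromosome number."""
    return total_size / n_parts


def kendall_tau(xs, ys):
    """Kendall rank correlation (no tie correction) between two equal-length lists."""
    score = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[j] - xs[i]) * (ys[j] - ys[i])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)


# Synthetic part counts (illustrative only, not real genome data).
parts = [2, 4, 8, 16, 32]

# Case 1: total size fixed, so mean part size falls as parts increase.
fixed_total = [mean_part_size(1000.0, n) for n in parts]

# Case 2: total size grows as n squared (an assumption), so mean part size rises.
growing_total = [mean_part_size(10.0 * n ** 2, n) for n in parts]

tau_fixed = kendall_tau(parts, fixed_total)
tau_growing = kendall_tau(parts, growing_total)
print(tau_fixed, tau_growing)
```

A negative correlation in real data is therefore an empirical finding about how genome size scales with chromosome number, which is the kind of question the paper takes up.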

Ferruz, Noelia, et al. ProtGPT2 is a Deep Unsupervised Language Model for Protein Design. Nature Communications. 13/4348, 2022. University of Bayreuth, Germany system biochemists describe current progress toward a deep unity of life’s two prime genetic and linguistic code domains. By virtue of AI/ML facilities, into the 2020s an infinite affinity, as long sensed, is being revealed. This worldwise phase of palliative and enhanced metabolomics then brings much promise for health and welfare.

Protein design projects aim to build novel biomolecules customized for specific purposes so as to potentially solve many environmental and biomedical problems. Recent progress in Transformer-based architectures has enabled language models which can generate text with human-like capabilities. Here, we describe ProtGPT2, a linguistic form that can build de novo protein sequences following the principles of natural ones. AlphaFold prediction of ProtGPT2-sequences yields structures with embodiments and topologies not captured in current structure databases.

Natural language processing (NLP) has seen many advances in recent years. Analogies between protein sequences and human languages have long been described as a concatenation of letters from a chemically defined alphabet. Both amino acids and human text arrange letters to form structural elements (“words”) which assemble into domains (“sentences”) that undertake a function (“meaning”). A vital similarity is that protein sequences, like natural languages, are information-complete: they fully store structure and function. We propose that these methods open a new approach to the whole metabolic field of proteomics. (1)

Flam-Shepherd, Daniel, et al. Atom-by-atom protein generation and beyond with language models. arXiv:2308.09482.
We post an August entry by University of Toronto and Vector Institute researchers including Alán Aspuru-Guzik to record much current activity in biocomputational studies which now join Large Language Models of AI neural machine learning methods. As the excerpt cites, a broad continuity across chemical, genetic, biochemical and onto linguistic phases bodes for an innately informational, universe to wumanverse, literacy to literacy procreative milieu. See also, for example, PEvoLM: Protein Sequence Evolutionary Information Language Model by Issar Arab at 2308.08578.

Protein language models learn powerful representations directly from sequences of amino acids. In contrast, chemical language models learn atom-level results of smaller molecules that include every atom, bond, and ring. In this work, we show that chemical language models can learn atom-level proteins which can generate the standard genetic code and far beyond it. The results demonstrate the potential for biomolecular design at the atom level using language models. (Excerpt)

Gall, Barnabas, et al. Protein Evolution as a Complex System. arXiv:2412.06115. As an amenable integration of nonlinear science, along with linguistic codes currently spread across every realm, herein ten Australian National University, Canberra researchers achieve such an amenable meld with the prolific amino acids that vivify all life.

Protein evolution underpins all living systems but our current models do not allow quantitative interpretation and prediction of its evolutionary trajectory. Viewing protein evolution as a complex system has the potential to advance our ability to fully model protein evolution. In this perspective, we discuss aspects that are typical of complex systems such as nonlinear dynamics, sensitivity to initial conditions, self-organization, and the emergence of order from chaos and disorder. We discuss how better sequence data and machine learning can serve to treat protein evolution as a complex adaptive system, so as to reveal deep principles driving biological innovation and adaptation. (Excerpt)

Gimona, Mario. Protein Linguistics and the Modular Code of the Cytoskeleton. Barbieri, Marcello, ed. The Codes of Life. Berlin: Springer, 2008. The University of Salzburg geneticist contributes to the long project to interpret, join and unify the molecular and literal versions, in support of the growing conclusion that “Nature is Structured in a Language-like Fashion.” See also an earlier paper “Protein Linguistics – A Grammar for Modular Protein Assembly?” in Nature Reviews: Molecular Cell Biology (7/1, 2006).

Hackenberg, Michael, et al. Clustering of DNA Words and Biological Function: A Proof of Principle. Journal of Theoretical Biology. 297/127, 2012. University of Granada and University of Malaga, Spain system biologists including Pedro Carpena contribute to historic 2010s verifications that the molecular nucleotide version and human cultural literature are one and the same, that they are formed and suffused by the same informative nonlinear complex network systems. View articles of this kind, for example, in the journal Complexity over recent years.

Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text per excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less-important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k=2–9 bp) in human and mouse chromosome sequences, then checking if clustered words are enriched in the functional part of the genome. The clustering of DNA words thus appears as a novel principle to detect functionality in genome sequences. As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and the improvement of gene and promoter predictions, thus contributing to the quest for function in the genome. (Abstract excerpt)
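The idea of scoring how clustered a DNA word is can be illustrated with a rough Python sketch. It does not reproduce the clustering coefficient defined in the paper; as a stand-in, it uses the coefficient of variation of the gaps between successive occurrences of a k-mer, a simple proxy chosen here as an assumption. Evenly spaced words score near 0, randomly placed words score near 1, and words bunched into clusters score above 1. The function names and the toy sequences are hypothetical.

```python
from statistics import mean, pstdev


def kmer_positions(seq, kmer):
    """Start positions of every (possibly overlapping) occurrence of kmer in seq."""
    k = len(kmer)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]


def gap_cv(positions):
    """Coefficient of variation of gaps between successive occurrences.

    Returns None when there are fewer than two gaps to compare.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    return pstdev(gaps) / mean(gaps)


# Toy sequences (illustrative only, not real chromosome data).
even_seq = "ACGT" * 10                                  # ACG every 4 bases
clustered_seq = "ACGACGACG" + "T" * 50 + "ACGACGACG"    # two tight clusters

even_score = gap_cv(kmer_positions(even_seq, "ACG"))
clustered_score = gap_cv(kmer_positions(clustered_seq, "ACG"))
print(even_score, clustered_score)
```

In practice one would compute such a score for every k-mer and then ask, as the abstract describes, whether the high-scoring words are enriched in functional regions.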

Heckmeier, Philipp, et al. A billion years of evolution manifest in nanosecond protein dynamics. PNAS. 121/10, 2024. We cite this paper by University of Zurich and Columbia University biochemists as an example of how far the scope and range of these current techniques can reach. And again who are we peoples with an Earthomo sapience to be able to look down and back and reconstruct and re-present how it all came to occur?

Protein dynamics forms a broad bridge between structure and function, yet the impact of evolution on ultrafast protein processes remains enigmatic. This study delves into the nanosecond-scale phenomena of a conserved protein across species separated by almost a billion years as a way to investigate ten complex homologs. In so doing, we found a cascade of rearrangements which manifest in discrete time points over hundreds of millions of years. Our work poses a novel scientific inquiry within molecular paleontology compared by the rapid pace of protein processes which can connect the shortest time scale in living matter (10^-9 s) with the largest ones (10^16 s). (Abstract)

Holzer, Jacqueline. Genomes & Language. http://www.liu.se/isk/research/doc/Birgitta_forum.pdf. A website for the conference program and lengthy Concluding Reflections from a Birgitta Forum held in August 2002 in Vadstena, Sweden, reviewed more in Emergent Genetic Information. Geneticists and linguists are finding much commonality between these archetypal formative modes upon which our life and world is founded. A main resource is the work of the German philosopher Wolfgang Raible, who also spoke; search for his 2001 paper “Linguistics and Genetics: Systematic Parallels”.

Geneticists, when presenting the structure of the human genome, seem to find the metaphor of the genome as a book, or a text, useful. Genomes and texts are both multiply articulated structures, where purely contrastive units – phonemes, letters, bases – combine to form meaningful units at several levels of increasing complexity – words, sentences, texts; codons, genes, chromosomes. (4) In a very profound way he (Raible) shows the structural similarities between linguistics and genetics and sees herein a “deeper relationship between the ‘grammar of biology’ and the grammar of natural languages.” In both systems, the principles allowing the reconstruction of multi-dimensional wholes from linear sequences of basic elements are identical: double articulation, different classes of ‘signs,’ hierarchy, combinatorial rules: wholes are always more than the sum of their parts. (Holzer, 5)

Hwang, Yunha, et al. Genomic language model predicts protein co-regulation and function. Nature Communications. 15/2880, 2024. We enter this work by Cornell, Harvard, Johns Hopkins, and MIT biologists including Sergey Ovchinnikov as another literate version of the textual affinity of nucleotides and narratives. See also ProteinEngine: Empower LLM with Domain Knowledge for Protein Engineering at arXiv:2405.06658.

Deciphering the relationship between a gene and its genomic context is vital to understand and modify biological systems. Machine learning can study the sequence-structure-function paradigm but higher order genomic information remains elusive. Evolutionary processes dictate genomic contexts in which a gene occurs across phylogenetic distances, and these emergent patterns can be leveraged to uncover functional relationships. Here, we train a genomic language model (gLM) on metagenomic scaffolds to uncover regulatory relationships between genes. Our findings illustrate that gLM’s deep learning of metagenomes is an effective approach to encode the semantics and syntax of genes and uncover complex relationships in a genomic region. (Abstract)

The unprecedented amount and diversity of metagenomic data presents opportunities to learn hidden patterns and structures of biological systems. With larger amounts of data, these models can disentangle the complexity of organismal genomes and their encoded functions. The work presented here validates the concept of genomic language modeling. Our implementation of the masked genomic language modeling illustrates the feasibility of training such a model, and provides evidence that biologically meaningful information is being captured in learned contextualized embeddings. (9)
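For readers unfamiliar with the masked language modeling named in this excerpt, a minimal Python sketch of the masking step alone may help. It is not the gLM implementation: the token names, the `<mask>` symbol, the default mask fraction of 0.15, and the function `mask_tokens` are all assumptions made for illustration. A real model would then be trained to predict the hidden tokens from their surrounding context.

```python
import random

MASK = "<mask>"  # placeholder symbol; an assumption, not taken from the paper


def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Hide a random subset of tokens and return (masked sequence, targets).

    targets maps each masked position to the original token, which is what
    a masked language model would be asked to recover from context.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = MASK
    return masked, targets


# Hypothetical contig of 20 genes, each treated as one token.
genes = [f"gene{i}" for i in range(20)]
masked_genes, hidden = mask_tokens(genes)
print(masked_genes)
print(hidden)
```

Treating each gene as a token, rather than each nucleotide, is one way to read the abstract's phrase about encoding the "semantics and syntax of genes" in a genomic region.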

Igamberdiev, Abir and Nikita Shklovskiy-Kordi. Computational Power and Generative Capacity of Genetic Systems. BioSystems. 142-143/1, 2016. A Memorial University of Newfoundland theoretical biologist and a National Research Center for Hematology, Moscow research physician contribute to the intent of this journal (second quote) to achieve a natural philosophy of life’s evolution as an oriented ascent from an innately conducive cosmos. In this encompassing genesis, a “generative” agency is a textual essence which rises in kind from a physical matrix to genomic and linguistic manifestations. Once again, after decades of study, it is strongly put that these two prime codes are one and the same.

In his many writings, Igamberdiev cites Aristotelian, Greek, and Renaissance roots to provide a historical heritage for this 21st century resolve. In this paper it is said that Heraclitus’ “self-growing Logos” can now be confirmed. See his website at www.mun.ca/biology/igamberdiev/index.php for a publications page, such as Relational Universe of Leibniz (2015), and Semiotic Autopoiesis of the Universe (2001). A “quantum measurement” theory is broached here, explained more elsewhere, which means that biological systems survive, evolve, and prosper by recursively comparing new experience with prior experiential representations.

Semiotic characteristics of genetic sequences are based on the general principles of linguistics formulated by Ferdinand de Saussure, such as the arbitrariness of sign and the linear nature of the signifier. Besides these semiotic features that are attributable to the basic structure of the genetic code, the principle of generativity of genetic language is important for understanding biological transformations. The problem of generativity in genetic systems arises to a possibility of different interpretations of genetic texts, and corresponds to what Alexander von Humboldt called “the infinite use of finite means”. These interpretations appear in the individual development as the spatiotemporal sequences of realizations of different textual meanings, as well as the emergence of hyper-textual statements about the text itself, which underlies the process of biological evolution. These interpretations are accomplished at the level of the readout of genetic texts by the structures, which includes DNA, RNA and the corresponding enzymes operating with molecular addresses. The molecular computer performs physically manifested mathematical operations and possesses both reading and writing capacities. Generativity paradoxically resides in the biological computational system as a possibility to incorporate meta-statements about the system, and thus establishes the internal capacity for its evolution. (Abstract)

Life is a self-organizing and self-generating activity of open non-equilibrium systems determined by their internal semiotic structure. My vision of life in the Universe is based on the principles of the quantum measurement theory, which can be considered as a mirrored image of theoretical biology. Life by its existence in self-reflecting loops establishes basic physical parameters of the Universe. The philosophical background of this approach is in Greek philosophy, in monadology of Leibniz, and in the organism philosophy of A. N. Whitehead. The field of theoretical biology is a description of systems that possess their own embedded description. Life always maintains and solves these paradoxes since living organisms possess their internal description. The structure of the Universe includes a self-reflective loop to be observable, i.e. existing. (AI site Theoretical Biology)

BioSystems encourages experimental, computational, and theoretical articles that link biology, evolutionary thinking, and the information processing sciences. The link areas form a circle that encompasses the fundamental nature of biological information processing, computational modeling of complex biological systems, evolutionary models of computation, the application of biological principles to the design of novel computing systems, and the use of biomolecular materials to synthesize artificial systems that capture essential principles of natural biological information processing.

Jolma, Arttu, et al. DNA-dependent Formation of Transcription Factor Pairs Alters Their Binding Specificity. Nature. 527/384, 2015. A Karolinska Institute, Sweden group and colleagues, led by Jussi Taipale, report a unique parsing of nucleotide genetics by treating them much as a linguistic script. The achievement was noted in a Science Daily item for November 15, 2015 (Google SD and article keywords) entitled Complex Grammar of the Genomic Language. A gene regulatory code is thus composed by “DNA words,” which can be seen to combine and compound just as lexicons and sentences.
