IV. Ecosmomics: Independent Complex Network Systems, Computational Programs, Genetic Ecode Scripts

2. The Innate Affinity of Genomes, Proteomes and Language

Liu, Jiajia, et al. Large language models in bioinformatics: applications and perspective. arXiv:2401.04155. We cite this entry by computational biologists at the Center for Computational Systems Medicine, University of Texas Health Science Center, Zhengzhou University, Southwest Jiaotong University and the Center of Gerontology and Geriatrics, West China Hospital, Sichuan University as an example of the ongoing interplay of bioinformatic studies and these novel linguistic programs, as the quote notes.

Large language models (LLMs) as based on AI deep learning perform well on various tasks such as natural language processing (NLP). LLMs are composed of artificial neural networks with many parameters trained on unlabeled input using self- or semi- supervised learning. However, their potential for bioinformatics studies may even exceed this proficiency. In this review, we review the prominent LLMs such as BERT and GPT, and explore their applications at different omics levels in bioinformatics like transcriptomics, proteomics, drug discovery and single cell analysis. (excerpt)

Livnat, Adi. Simplification, Innateness, and the Absorption of Meaning from Context. arXiv:1605.03440. Reviewed more in Systems Evolution, the University of Haifa theorist continues his project (search) to achieve a better explanation of life’s evolution by way of algorithmic computations, innate network propensities, genome – language affinities, neural net deep learning, and more.

Maggi, Luca. The main role of fractal-like nature of conformational space in subdiffusion in protein. arXiv:2306.07825. A Barcelona Institute of Science and Technology bioinformatics disease mechanism researcher provides a recent report of how vital self-similarities appear to suffuse protein activities. See also The Evolution of Fractal Protein Modules in Multicellular Development by Harry Booth and Peter Bentley in Artificial Life Conference Proceedings (MIT Press 2022).

Protein dynamics studies their biological functions but a theoretical picture of their relevant features is still missing. For example, a prime property exhibited by this dynamic is its subdiffusivity. Here, by comparing all-atom molecular simulations and theory we show that this behavior arises from the fractal nature of the network of metastable conformational states over which protein diffusion processes take place. (Excerpt)

Majewski, Maciej, et al. Machine Learning Coarse-Grained Potentials of Protein Thermodynamics. arXiv:2212.07492. We note this work by eleven bioinformatic researchers from Universitat Pompeu Fabra, Barcelona, Rice University, Houston, FU Berlin, Princeton University and Microsoft Research, Cambridge UK as an example of the latest integrations of biological studies (e.g. genes, cells, metabolism), neural net methods, and deep rootings in a conducive physical origin.

Marin, Frederikke, et al. BEND: Benchmarking DNA Language Models on biologically meaningful tasks. arXiv:2311.12570. At a time when prior sequence techniques have run their course, this paper by Novozymes A/S, Denmark, University of Copenhagen, and Computational Health Center, Munich researchers proposes a turn to natural language processing and large language models by which to enter a new, advanced phase of rapid readings of whole genomes in their many functions. In this regard, as novel protein linguistics, deep neural network, and AI capabilities all come together in 2023, they also begin to imply the actual presence of an intrinsic textual, source code-script domain. As a result, biomolecular, genetic, linguistic, and cerebral phases, broadly conceived, gain a text-like similarity. Such a common vernacular has been a metaphor since the 1960s and maybe just now its verity and import can be realized.

The genome sequence contains the blueprint for governing cellular processes. However, experimental annotation of functional, non-coding and regulatory elements encoded in the DNA sequence remains both costly and difficult. This has sparked interest in language modeling of genomic DNA, which has seen much success for protein sequence data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic downstream tasks defined on the human genome. We find that embeddings from current DNA language models can approach performance of expert methods on some tasks, but only capture limited information about long-range features. (Abstract)

Markos, Anton and Dan Faltynek. Language Metaphors of Life. Biosemiotics. Online August 14, 2010. A Charles University (Prague) scientist and a Palacky University (Olomouc, Czech Republic, where I once gave a keynote, see home page) philosopher argue that not only is communication the essence of livingness, it involves constant “readings” by all manner of creatures. Verily a greater, textual nature is revealed that evolved, emergent beings, now we phenomenal humans, are invited to read.

We believe that linguistic processes are present at all levels of life’s organization in the biosphere. Ecosystems, for example, do not build their homes – oikos – for ever; they maintain them by incessant communications games, reading included. We tend to read like our contemporaries, and from this common ground there often emerges something new unique; understanding the text is a unique performance of the reader. The same holds, we believe, for the members of any living species – in a species-specific way.

McBride, John and Tsvi Tlusty. The physical logic of protein machines. Journal of Statistical Mechanics. Vol. 2024/Num. 2, 2025. This paper by Center for Soft and Living Matter, Institute for Basic Science, Ulsan, South Korea theorists was presented at the STATPHYS 28 conference in 2024 as another way to combine neural net learning, proteome programs and AI language methods. We also note that the word machine is used here to mean, to clarify, a computer rather than a lathe. This usage is traced herein to Simple mechanics of protein machines by Holger Flechsig and Alexander Mikhailov in the Journal of the Royal Society Interface for June 2019.

Proteins are intricate biomolecules whose complexity arises from the heterogeneity of the amino acids and their dynamic network of many-body interactions. Their functionality was shaped by an evolutionary history through intertwined paths of selection and adaptation. However, their basic logic remains open. Here, we explore a physical approach that treats proteins as mechano-chemical machines, which are adapted via a concerted evolution of structure, motion, and chemical interactions. (Excerpt)

Moghaddasi, Hanieh, et al. Distinguishing Functional DNA Words. Nature Scientific Reports. 7/41543, 2017. With Khosrow Khalifeh and Amir Darooneh (search), University of Zanjan, Iran biophysicists discuss similar algorithmic ways to parse genetic and written textualities, whence both are seen as an extension of statistical mechanics and Tsallis entropy. In an evolutionary perspective such archetypal, generative inscriptions strongly imply a common, exemplary source.

Functional DNA sub-sequences and genome elements are spatially clustered through the genome just as keywords in literary texts. Therefore, some of the methods for ranking words in texts can also be used to compare different DNA sub-sequences. In analogy with the literary texts, here we claim that the distribution of distances between the successive sub-sequences (words) is q-exponential which is the distribution function in non-extensive statistical mechanics. (Abstract)

DNA sequences as one-dimensional arrays of four nucleotides (A, C, T and G) can be considered as texts so that they can be analyzed from a linguistic point of view to discover their different linguistic features. It is believed that there is a meaningful relation between linguistic interpretation of sub-sequences and their biological significances. Here, the important matter is how to define the alphabets and words, for example nucleotides may be assumed as letters and sequences of n consecutive nucleotides (n-tuples) as words. Some genome elements like exons, introns and others can also play the role of words. (1)
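
As a minimal illustration of the word definition in this excerpt, the Python sketch below splits a sequence into overlapping n-tuples and records the distances between successive occurrences of one word. The function names, the toy sequence, and the choice of overlapping windows are assumptions made here for illustration. They are not taken from the paper and are not the authors' method.

```python
from collections import defaultdict


def ntuple_positions(sequence: str, n: int) -> dict[str, list[int]]:
    """Map each overlapping n-tuple (a "word") to the start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(sequence) - n + 1):
        positions[sequence[i:i + n]].append(i)
    return dict(positions)


def successive_distances(positions: list[int]) -> list[int]:
    """Distances between consecutive occurrences of a single word."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence, hypothetical and chosen only for illustration.
toy = "ACGTACGTTTACG"
words = ntuple_positions(toy, 3)
acg_gaps = successive_distances(words["ACG"])
```

On a real genome, the collected distances for a chosen word would form the empirical distribution that the abstract compares with a q-exponential form. That fitting step is not attempted here.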

Muskhelishvili, Georgi. DNA Information: Laws of Perception. Berlin: SpringerBriefs in Biology, 2015. A Jacobs University professor of molecular genetics offers a 100-page essay on his view, in collaboration with the University of Cambridge biologist Andrew Travers, that genomes are composed of distinct complementary phases of diverse “digital” nucleotides and integral “analog” networks. By this notice, another notable insight is gained of how these archetypal, gender modes are present in and distinguish genetic phenomena. An extrapolation, as stated throughout this site, would be that nature’s ubiquitous self-organized, complex systems with dual agency and relation are ultimately genetic in essence. See also by GM with Travers the papers Integration of Syntactic and Semantic Properties of the DNA Code Reveals Chromosomes as Thermodynamic Machines Converting Energy into Information in Cellular and Molecular Life Sciences (70/4555, 2013), and DNA Information: From Digital Code to Analogue Structure in the Philosophical Transactions of the Royal Society A (370/2960, 2012).

This book explores the double coding property of DNA, which is manifested in the digital and analog information types as two interdependent codes. This double coding principle can be applied to all living systems, from the level of the individual cell to entire social systems, seen as systems of communication. Further topics discussed include the ubiquitous problem of logical typing, which reflects our inherent incapacity to simultaneously perceive the distinction between discontinuity and continuity, the problem of time, and the peculiarities of autopoietic living systems. (Publisher)

In principle, the relationship between the analog and digital DNA codes is akin to the relationship between the syntax and semantics of natural language. It occurred to me that a closer examination of this remarkable similarity between the most ancient coding device and the most recently invented means of social communication was worth trying. (v) In a nutshell, this booklet is an attempt to show that the basic device of creating information in the living world, including our perception and social communication, is already provided in structural organisation of the DNA molecule, which alike the human mind, has a property of both, distinctness and wholeness to it. I surmise that it is this highly elaborate double coding mechanism with two structurally coupled codes mutually determining each other, which provides for the wholeness that no artificial device can ever attain. (vi)

Thus, in physical systems the capacity to “draw distinctions” and thus to self-organize, crucially depends on the threshold values of the external parameters, whereas in any living system the capacity to make distinction and directional choice is internal. Put another way, the faculty of perception in the living system appears as an internalized discrimination capacity. (8)

Understanding genetic regulation is a problem of fundamental importance. Recent studies have made it increasingly evident that, whereas the cellular genetic regulation system embodies multiple disparate elements engaged in numerous interactions, the central issue is the genuine function of the DNA molecule as information carrier. Compelling evidence suggests that the DNA, in addition to the digital information of the linear genetic code (the semantics), encodes equally important continuous, or analog, information that specifies the structural dynamics and configuration (the syntax) of the polymer. These two DNA information types are intrinsically coupled in the primary sequence organisation, and this coupling is directly relevant to regulation of the genetic function. In this review, we emphasise the critical need of holistic integration of the DNA information as a prerequisite for understanding the organisational complexity of the genetic regulation system. (2013 Article Abstract)

Nelson-Sathi, Shijulal, et al. Networks Uncover Hidden Lexical Borrowing in Indo-European Language Evolution. Proceedings of the Royal Society B. Vol. 278/Iss. 1713, 2011. Heinrich Heine University, Ulm University, and University of Auckland linguists including William Martin and Russell Gray find that genomes and languages evolve with a parallel correspondence. This paper makes the case by way of similar network phylogenies found to apply in both instances.

Genome evolution and language evolution have a lot in common. Both processes entail evolving elements—genes or words—that are inherited from ancestors to their descendants. The parallels between biological and linguistic evolution were evident both to Charles Darwin, who briefly addressed the topic of language evolution in The Origin of Species, and to the linguist August Schleicher, who in an open letter to Ernst Haeckel discussed the similarities between language classification and species evolution. Computational methods that are currently used to reconstruct genome phylogenies can also be used to reconstruct evolutionary trees of languages. However, approaches to language phylogeny that are based on bifurcating trees recover vertical inheritance only, neglecting the horizontal component of language evolution (borrowing). Horizontal interactions during language evolution can range from the exchange of just a few words to deep interference. (1794)

Outeiral, Carlos and Charlotte Deane. Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence. 6/2, 2024. We enter this note by Oxford University biostatisticians because it treats this metabolic regime as if it can be typically parsed by various grammatical methods.

Protein representations from deep language models have achieved good performance in computational protein studies surpassing the datasets they were trained on. But here we propose an alternative direction. We show that LLMs trained on codons, instead of amino acid sequences, provide high-quality results that outperform across a variety of tasks. For species recognition, prediction of protein and transcript abundance or melting point estimation, we show that a codon language surpasses every other published version. This topical shift indicates that the information content of biological data provides an orthogonal direction to expand the utility of machine learning in biology. (Excerpt)
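
To make the codon versus amino acid distinction concrete, the Python sketch below tokenizes a coding sequence into codons and translates it with the standard genetic code. The helper names and the two toy sequences are assumptions for illustration only. This is not the tokenizer or model used in the paper.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG base order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_tokens(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping three-letter codon tokens."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds: str) -> str:
    """Translate a coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[codon] for codon in codon_tokens(cds))


# Two hypothetical sequences that differ only in synonymous codons.
seq_a = "ATGGCTAAA"
seq_b = "ATGGCCAAG"
same_protein = translate(seq_a) == translate(seq_b)
same_tokens = codon_tokens(seq_a) == codon_tokens(seq_b)
```

Here the two sequences translate to the same amino acid string while their codon tokens differ, so a codon-level vocabulary keeps a distinction that an amino acid vocabulary discards.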

Romero-Romero, Sergio, et al. Exploring the Protein Sequence Space with Global Generative Models. arXiv:2305.01941. We note this entry by University of Bayreuth, Heidelberg, and Barcelona researchers as an early example of efforts to integrate and enhance life’s new intentional phase by way of deep neural machine language resources. See also, for another example, General Mechanism of Evolution Shared by Proteins and Words by Li-Min Wang, et al at arXiv:2012.14309.

Recent advancements in large-scale architectures for training images and languages have taken over the field of computer vision and natural language processing (NLP). The recent ChatGPT and GPT4 models have exceptional capabilities to process, translate, and generate textual scripts. As a result these advances are also aiding protein research by way of rapid development of new methods with unprecedented performance. Language models have been utilized to embed proteins, generate novel ones, and predict tertiary structures. In this book chapter, we discuss 1) language models for the design of novel artificial proteins, 2) works that use non-Transformer architectures, and 3) applications in directed evolution approaches. (Abstract)

The field of protein design is being transformed due to advances in the field of artificial intelligence. The use of architectures that excel in other areas, such as computer vision and natural language processing, is highly successful in generating sequences in previously inaccessible regions of the protein space. In this work, we provided an overview of these advances in sequence generation and their potential applications in directed evolution. This progress provides an optimistic outlook for designing à-la-carte protein functions with new-to-nature enzymes becoming realistic in the near future. (Conclusion, 17)
