IV. Ecosmomics: Independent, UniVersal, Complex Network Systems and a Genetic Code-Script Source

2. The Innate Affinity of Genomes, Proteomes and Language

Nelson-Sathi, Shijulal, et al. Networks Uncover Hidden Lexical Borrowing in Indo-European Language Evolution. Proceedings of the Royal Society B. Vol. 278/Iss. 1713, 2011. Heinrich Heine University, Ulm University, and University of Auckland linguists including William Martin and Russell Gray find that genomes and languages evolve with a parallel correspondence. This paper makes the case by way of similar network phylogenies found to apply in both instances.

Genome evolution and language evolution have a lot in common. Both processes entail evolving elements—genes or words—that are inherited from ancestors to their descendants. The parallels between biological and linguistic evolution were evident both to Charles Darwin, who briefly addressed the topic of language evolution in The Origin of Species, and to the linguist August Schleicher, who in an open letter to Ernst Haeckel discussed the similarities between language classification and species evolution. Computational methods that are currently used to reconstruct genome phylogenies can also be used to reconstruct evolutionary trees of languages. However, approaches to language phylogeny that are based on bifurcating trees recover vertical inheritance only, neglecting the horizontal component of language evolution (borrowing). Horizontal interactions during language evolution can range from the exchange of just a few words to deep interference. (1794)

Outeiral, Carlos and Charlotte Deane. Codon Language Embeddings Provide Strong Signals for Use in Protein Engineering. Nature Machine Intelligence. 6/2, 2024. We enter this note by Oxford University biostatisticians because it treats this metabolic regime as if it can be typically parsed by various grammatical methods.

Protein representations from deep language models have achieved good performance in computational protein studies surpassing the datasets they were trained on. But here we propose an alternative direction. We show that LLMs trained on codons, instead of amino acid sequences, provide high-quality results that outperform across a variety of tasks. For species recognition, prediction of protein and transcript abundance or melting point estimation, we show that a codon language surpasses every other published version. This topical shift indicates that the information content of biological data provides an orthogonal direction to expand the utility of machine learning in biology. (Excerpt)
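As a small illustration of why a codon vocabulary can carry more signal than an amino acid one, the sketch below tokenizes two coding sequences both ways. It is not the authors' model or tokenizer; the function names and the two toy sequences are assumptions made for this example, and the codon table is the standard genetic code written in the usual TCAG order.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_tokens(cds: str) -> list[str]:
    """Split a coding sequence into non-overlapping 3-letter codon tokens."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def amino_tokens(cds: str) -> list[str]:
    """Translate each codon token to its one-letter amino acid ('*' = stop)."""
    return [CODON_TABLE[codon] for codon in codon_tokens(cds)]


# Two hypothetical sequences that differ only by synonymous codon choices.
seq_a = "ATGGCTCTGTAA"
seq_b = "ATGGCCTTATAA"

same_protein = amino_tokens(seq_a) == amino_tokens(seq_b)
same_codons = codon_tokens(seq_a) == codon_tokens(seq_b)
```

Here the two sequences yield identical amino acid tokens but different codon tokens, so a model that sees only amino acids cannot tell them apart, while a codon-level model can.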

Romero-Romero, Sergio, et al. Exploring the Protein Sequence Space with Global Generative Models. arXiv:2305.01941. We note this entry by University of Bayreuth, Heidelberg, and Barcelona researchers as an early example of efforts to integrate and enhance life’s new intentional phase by way of deep neural machine language resources. See also, for another example, General Mechanism of Evolution Shared by Proteins and Words by Li-Min Wang, et al. at arXiv:2012.14309.

Recent advancements in large-scale architectures for training images and languages have taken over the field of computer vision and natural language processing (NLP). The recent ChatGPT and GPT4 models have exceptional capabilities to process, translate, and generate textual scripts. As a result these advances are also aiding protein research by way of rapid development of new methods with unprecedented performance. Language models have been utilized to embed proteins, generate novel ones, and predict tertiary structures. In this book chapter, we discuss 1) language models for the design of novel artificial proteins, 2) works that use non-Transformer architectures, and 3) applications in directed evolution approaches. (Abstract)

The field of protein design is being transformed due to advances in the field of artificial intelligence. The use of architectures that excel in other areas, such as computer vision and natural language processing, is highly successful in generating sequences in previously inaccessible regions of the protein space. In this work, we provided an overview of these advances in sequence generation and their potential applications in directed evolution. This progress provides an optimistic outlook for designing à-la-carte protein functions with new-to-nature enzymes becoming realistic in the near future. (Conclusion, 17)

Ros, Enric, et al. Learn from Nature to Expand the Genetic Code. Trends in Biotechnology. 39/5, 2021. Institute for Research in Biomedicine, Barcelona Institute of Science and Technology system geneticists post a latest note on how easy it has become for researchers to treat genomes as written texts and thus modify their relative nucleotide alphabets and form “designer proteins.” By way of a natural philoSophia view, a smartest, fittest speciesphere reaches a stage of being able to decipher, sequence, read, and write a natural genetic code script. See also Reprogramming the Genetic Code by Daniel de la Torre and Jason Chin in Nature Reviews Genetics (March 2021).

The genetic code is the manual that cells use to incorporate amino acids into proteins. It is possible to artificially expand these instructions through cellular, molecular, and chemical manipulations to improve protein functionality. Here, we review the approaches used to incorporate noncanonical amino acids into designer proteins through the manipulation of the translation machinery and draw parallels between these methods and natural adaptations. Following this logic, we propose new nature-inspired tactics to improve genetic code expansion (GCE) in synthetic organisms. (Abstract excerpt)

Scaiewicz, Andrea and Michael Levitt. The Language of the Protein Universe. Current Opinion in Genetics & Development. 35/1, 2015. In one of the strongest comparisons to date, Stanford University biologists illume the deep correspondence between these metabolic biochemicals and linguistic dialogue. Levitt won the 2013 Nobel Prize in Chemistry for discoveries in structural biology. A graphic illustration is included which compares human and protein language with regard to alphabet, syntax, grammar, vocabulary, semantics and pragmatics, a quite common affinity of proteins and prose/poetry. The 75 references include several that cite a “protein universe.” By literary license might one imagine a lively conducive cosmos which is innately organic, anatomic, physiological, and written in a genomic script?

Proteins, the main cell machinery which play a major role in nearly every cellular process, have always been a central focus in biology. We live in the post-genomic era, and inferring information from massive data sets is a steadily growing universal challenge. The increasing availability of fully sequenced genomes can be regarded as the ‘Rosetta Stone’ of the protein universe, allowing the understanding of genomes and their evolution, just as the original Rosetta Stone allowed Champollion to decipher the ancient Egyptian hieroglyphics. In this review, we consider aspects of the protein domain architectures repertoire that are closely related to those of human languages and aim to provide some insights about the language of proteins. (Abstract)

Proteins are a class of nitrogenous organic compounds that consist of large molecules composed of one or more long chains of amino acids and are an essential part of all living organisms, especially as structural components of body tissues such as muscle, hair, collagen, etc., and as enzymes and antibodies. (web definition)

Searls, David. A Primer in Macromolecular Linguistics. Biopolymers. 99/3, 2013. The philosophical geneticist (bio below) has been a prescient observer (search) that nature’s dual domains of informational nucleotides and literary discourse are innately similar in kind. This entry describes via graphic, evidential visuals their parallel, self-similar essence. The import is that if the relation could move from metaphor to analogy to factual, both the genetics and linguistics endeavors could much benefit from cross-applications of methods and analytic techniques.

Polymeric macromolecules, when viewed abstractly as strings of symbols, can be treated in terms of formal language theory, providing a mathematical foundation for characterizing such strings both as collections and in terms of their individual structures. In addition this approach offers a framework for analysis of macromolecules by tools and conventions widely used in computational linguistics. This article introduces the ways that linguistics can be and has been applied to molecular biology, covering the relevant formal language theory at a relatively nontechnical level. Analogies between macromolecules and human natural language are used to provide intuitive insights into the relevance of grammars, parsing, and analysis of language complexity to biology. (Abstract)

David A. Searls received degrees in Philosophy and Life Sciences from MIT and a PhD in Biology from Johns Hopkins University. Following a postdoctoral fellowship at the Wistar Institute in Philadelphia he completed a Master's in Computer and Information Science at the University of Pennsylvania. He went on to co-found the Computational Biology and Informatics Laboratory at UP. He then spent 13 years at SmithKline Beecham and GlaxoSmithKline Pharmaceuticals, where he was Senior Vice-President of Bioinformatics. He left GSK in 2008 and is now an independent consultant.

Searls, David. Reading the Book of Life. Bioinformatics. 17/7, 2001. A report on a conference between geneticists and linguists to explore the systematic affinities between the DNA molecular code and human language.

Searls, David. The Language of Genes. Nature. 420/211, 2002. An affirmation that the molecular genetic code, as now studied by computer-based bioinformatics, is in fact a true language with its own grammar and syntax. And these techniques are also being used to explore the structures of literature.

…nucleic acids may be said to be at about the same level of linguistic complexity as natural human languages.…genes do convey information, and furthermore this information is organized in a hierarchical structure whose features are ordered, constrained and related in a manner analogous to the syntactic structure of sentences in a natural language. (213)

Searls, David. Trees of Life and of Language. Nature. 426/391, 2003. The same pattern occurs for the lineage of ancient languages and the reconstruction of evolutionary ancestors, whereby “philology recapitulates phylogeny.”

Shabi, Uri, et al. Processing DNA Molecules as Text. Systems and Synthetic Biology. 4/3, 2010. Weizmann Institute of Science, Rehovot, Israel mathematicians, biochemists, and a cell biologist spell out an extensive, technically detailed consideration of the nucleotide genome in terms of, as if, a linguistic document. A paper in the next issue (4/4, 2010) “Creating Novel Protein Scripts beyond Natural Alphabets” by Anil Kumar, University of Toronto, and Vibin Ramakrishnan, Rajiv Gandhi Centre for Biotechnology, similarly reinforces this view. Are we altogether at last verifying a true literal nature, as prior traditions well know, whose creative genetic program then manifests in kind at each emergent level? And could it now be passing into our conscious knowledge, indeed to commence a new era of “synthetic biology”?

Polymerase Chain Reaction (PCR) is the DNA-equivalent of Gutenberg’s movable type printing, both allowing large-scale replication of a piece of text. De novo DNA synthesis is the DNA-equivalent of mechanical typesetting, both ease the setting of text for replication. What is the DNA-equivalent of the word processor? (227) Here we present a novel operation on DNA molecules,…and show that it provides a foundation for DNA processing as it can implement all basic text processing operations on DNA molecules including insert, delete, replace, cut and paste and copy and paste. (227) In this work we present a uniform framework for DNA processing that encompasses DNA edition, DNA synthesis, and DNA library construction. (228)
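The text processing operations named in this excerpt can be pictured with ordinary string edits. The sketch below is only an in silico analogy on Python strings, not the molecular operation the authors describe; every function name and the toy sequence are assumptions made for this example.

```python
def insert(seq: str, pos: int, fragment: str) -> str:
    """Insert a fragment before index pos."""
    return seq[:pos] + fragment + seq[pos:]


def delete(seq: str, start: int, end: int) -> str:
    """Remove the half-open span [start, end)."""
    return seq[:start] + seq[end:]


def replace(seq: str, start: int, end: int, fragment: str) -> str:
    """Swap the span [start, end) for a new fragment."""
    return seq[:start] + fragment + seq[end:]


def cut_and_paste(seq: str, start: int, end: int, dest: int) -> str:
    """Move the span [start, end); dest indexes the sequence after the cut."""
    fragment = seq[start:end]
    return insert(delete(seq, start, end), dest, fragment)


def copy_and_paste(seq: str, start: int, end: int, dest: int) -> str:
    """Duplicate the span [start, end) at index dest of the original."""
    return insert(seq, dest, seq[start:end])


toy = "AACCGGTT"  # hypothetical sequence for illustration
moved = cut_and_paste(toy, 0, 2, 6)
duplicated = copy_and_paste(toy, 0, 2, 8)
```

All five operations reduce to slicing and concatenation, which is the sense in which a single underlying primitive can support a full set of editing commands on a linear text.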

Sheinman, Michael, et al. Evolutionary Dynamics of Selfish DNA Explains the Abundance Distribution of Genomic Sequences. Scientific Reports. 6/30851, 2016. As an instance of genome complexity, with Anna Ramisch, Florian Massip, and Peter Arndt, MPI Molecular Genetics researchers draw upon physics and linguistics to finesse features from these realms. See Massip in the next section for more from this team. Circa 2016, genomes are commonly treated as whole entities, which are then seen to have deep affinities to universal nonlinear systems before and after.

Since the sequencing of large genomes, many statistical features of their sequences have been found. One intriguing feature is that certain subsequences are much more abundant than others. In fact, abundances of subsequences of a given length are distributed with a scale-free power-law tail, resembling properties of human texts, such as Zipf’s law. Despite recent efforts, the understanding of this phenomenon is still lacking. Here we find that selfish DNA elements, such as those belonging to the Alu family of repeats, dominate the power-law tail. Interestingly, for the Alu elements the power-law exponent increases with the length of the considered subsequences. Motivated by these observations, we develop a model of selfish DNA expansion. The predictions of this model qualitatively and quantitatively agree with the empirical observations. This allows us to estimate parameters for the process of selfish DNA spreading in a genome during its evolution. The obtained results shed light on how evolution of selfish DNA elements shapes non-trivial statistical properties of genomes. (Abstract)

Our genome is a sequence of A, C, G and T nucleotides and can be viewed as a long text of about three billion letters. Only a small part of our genome is functional and under selection; the rest (so-called junk DNA) mostly evolves neutrally and, therefore, is naively expected to be a random sequence. However, the junk DNA contains many homologous sequences, sharing significant similarities to each other. Hence, its statistical properties differ from those of random sequences. One of these properties, which we discuss here, is that for a given length, certain subsequences are much more abundant than others. Namely, the abundances of k-mers—sequences of length k—possess a wide, scale-free distribution, as shown in Fig. 1. This phenomenon resembles statistical properties of human texts, where abundances of words also exhibit a scale-free distribution. (1)
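The k-mer abundance distribution described here is simple to compute on a small scale. The following is a minimal sketch, assuming overlapping k-mers counted on a single strand; the function names and the toy sequence are illustrative and are not taken from the paper, whose analysis runs over whole genomes.

```python
from collections import Counter


def kmer_abundances(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def abundance_histogram(counts: Counter) -> dict[int, int]:
    """Map each abundance value to the number of distinct k-mers that have it."""
    return dict(sorted(Counter(counts.values()).items()))


toy = "ACGTACGTACGTTTGACGT"  # hypothetical sequence with one repeated motif
counts = kmer_abundances(toy, 4)
histogram = abundance_histogram(counts)
```

On a real genome, plotting this histogram on log-log axes is the usual way to look for the heavy, scale-free tail that the authors compare to word frequencies in human texts.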

Soares, Eduardo, et al. Beyond Chemical Language: A Multimodal Approach to Enhance Molecular Property Prediction. arXiv:2306.14919. Seven IBM researchers posted in Rio de Janeiro, Brazil and San Jose, USA including Dmitry Zubarev first describe current approaches as this broad field of biomolecule parsings actively shifts to deep machine learning methods. A number of technique proposals are then advanced going forward. See also Artificial Intelligence-aided Protein Engineering from Topological Data Analysis to Deep Protein Language Models at arXiv:2307.14587 for another instance, which is quoted next. Altogether such novel literacies add more evidence for an affine genetic and protein equivalence.

Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development. (Excerpt)
