IV. Ecosmomics: Independent, UniVersal, Complex Network Systems and a Genetic Code-Script Source

2. The Innate Affinity of Genomes, Proteomes and Language

Tavares, Ana, et al. DNA Word Analysis Based on the Distribution of the Distances Between Symmetric Words. Nature Scientific Reports. 7/728, 2017. We note in 2017 this paper by University of Aveiro, Portugal, medical and computational mathematicians as an example of how it has become common usage to consider genetic phenomena by way of similar linguistic features.

We address the problem of discovering pairs of symmetric genomic words (i.e., words and the corresponding reversed complements) occurring at distances that are overrepresented. For this purpose, we developed new procedures to identify symmetric word pairs with uncommon empirical distance distribution and with clusters of overrepresented short distances. We focused on the human genome, and analysed both the complete genome as well as a version with known repetitive sequences masked out. We reported several well-defined features in the distributions of distances, which can be classified into three different profiles, showing enrichment in distinct distance ranges. (Abstract excerpts)
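
The core quantity in this abstract, the distance between a genomic word and its reversed complement, can be illustrated with a small sketch. This is not the authors' procedure: the function names are hypothetical, and measuring each distance to the nearest following occurrence of the reversed complement is an assumption made here for simplicity.

```python
from collections import Counter

# Translation table for complementing the four DNA letters.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Return the reversed complement of a DNA word, e.g. AAC -> GTT."""
    return word.translate(COMPLEMENT)[::-1]


def positions(seq: str, word: str) -> list[int]:
    """Start indices of every (possibly overlapping) occurrence of word."""
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def symmetric_pair_distances(seq: str, word: str) -> Counter:
    """Empirical distribution of distances from each occurrence of `word`
    to the nearest following occurrence of its reversed complement.
    (Assumed definition, chosen for illustration only.)"""
    rc_positions = positions(seq, reverse_complement(word))
    distances = Counter()
    for p in positions(seq, word):
        following = [q - p for q in rc_positions if q > p]
        if following:
            distances[min(following)] += 1
    return distances


if __name__ == "__main__":
    toy = "ACGAATCGTTTACGCGT"  # made-up sequence, not genomic data
    print(symmetric_pair_distances(toy, "ACG"))
```

An overrepresentation test would then compare such an empirical distribution against a reference model; that step is left out here.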

Turenne, Nicolas. On a Possible Similarity between Gene and Semantic Networks. arXiv:1606.00414. The University of Paris, INRA Science and Society bioinformatics researcher contributes to growing realizations, after decades of intimations since Jean Piaget and Roman Jakobson, that as similar self-organizing systems, the disparate realms of literature and genomes are necessarily one and the same natural testaments.

In several domains such as linguistics, molecular biology or social sciences, holistic effects are hardly well-defined by modeling with single units, but more and more studies tend to understand macro structures with the help of meaningful and useful associations in fields such as social networks, systems biology or semantic web. A stochastic multi-agent system offers both accurate theoretical framework and operational computing implementations to model large-scale associations, their dynamics and patterns extraction. We show that clustering around a target object in a set of associations of object prove some similarity in specific data and two case studies about gene-gene and term-term relationships leading to an idea of a common organizing principle of cognition with random and deterministic effects. (Abstract)

Victorri, Bernard. Analogy Between Language and Biology. Cognitive Processing. 8/1, 2009. The Centre National de la Recherche Scientifique (CNRS) linguist finds a deep correspondence between the hierarchical array of protein forms and transcriptions, and how human communication employs a similar scale from phonemes (the smallest sound units that distinguish meaning) to an essay or speech. A dual “productive system” accrues in both cases of external events, as if a resultant phenotype, which springs from literal descriptions. A salient discovery might then be revealed in this work and companion approaches, which courses in both directions. Life’s evolution is distinguished by an ascendant “linguistic” essence, while our languages are in some real way akin to the molecular genetic code. Altogether an original, independent cosmic code is thereby implied, as human and universe once again mirror each other, this time as a temporal gestation.

If we now turn to the structural aspect of the analogy, the first observation to be made is that in both cases there is a primary sequential structure forming the basis of a complex hierarchical organization. As regards proteins, the discrete units composing the sequence are the twenty proteinogenic amino acids composing the polypeptide chain. As for language, the discrete units are the phonemes. Their number changes from one language to another, but the order of magnitude remains the same as the number of amino acids. (14)

Wang, Li-Min, et al. Mechanism of Evolution Shared by Genes and Language. arXiv:2012.14309. Nine National Tsing Hua University, Taiwan biologists and linguists describe a strong parallel between these premier modes of vital, prescriptive content. After consideration from 1970 to 2000 to today, life’s evolutionary emergence can indeed be seen as endowed with deeply similar, Rosetta-like versions of genetic and linguistic informative codes. We log this in with Siobhan Roberts’ review of cellular automata models such as John Conway’s Game of Life and Bert Chan’s Lenia Universe. Within a 21st century worldwise revolution, a natural genesis now well appears to have its own uniVerse to humanVerse ecosmomic code. In further regard, our Earthomo sapience may seem meant to achieve its sentient translation, and intentional continuance.

We propose a general mechanism for evolution to explain the diversity of genes and language. To quantify their common features and reveal hidden structures, several statistical properties and patterns are examined by way of a new method called the rank-rank analysis. We find that the older relation, "domain plays the role of word in gene language", is not rigorous, and propose to replace it by protein. Based on the correspondence between (protein, domain) and (word, syllgram), we discover that both genes and language share a common scaling structure and scale-free network. Like the Rosetta stone, this work may help decipher the secret behind non-coding DNA and unknown languages. (Abstract)

Among the topics of evolution, we are particularly interested in genes and natural languages. The fact that 20 kinds of codon, composed by three nucleotides in the set A, T, C, G encode genome sequence is similar to the human written text constituted by letters that form the alphabet. Therefore, it is intuitive to make an analogy between gene and language. When choosing the “space-time” of organism as nature and that of human as society, their inheritance of survival can be recorded in gene and language, respectively. (1)

The correspondence between gene and language may be the Rosetta Stone to decipher the language of genes. Scientists have applied linguistic formalisms to this goal, such as using Zipf’s and Shannon’s approach to quantify the linguistic features of non-coding DNA sequences, and exploring information hidden in genome with the aid of natural language processing (NLP). On the other hand, linguists have investigated the relationship between language and the natural selection, and discussed the language faculty in the broad and narrow sense from the viewpoint of biolinguistics. (1)
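
As a rough illustration of the Zipf and Shannon measures mentioned in this excerpt, the sketch below treats fixed-length k-mers as the "words" of a sequence. That choice, the function names, and the log-log least-squares slope are assumptions for illustration, not the method of the cited studies.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-letter word in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rank_frequency(counts: Counter) -> list[tuple[int, str, int]]:
    """(rank, word, count) rows, most frequent first; ties broken alphabetically."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, c) for rank, (word, c) in enumerate(ordered, start=1)]


def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy, in bits, of the empirical word distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts: Counter) -> float:
    """Least-squares slope of log(count) against log(rank).
    A value near -1 is the classic Zipf signature."""
    pts = [(math.log(r), math.log(c)) for r, _, c in rank_frequency(counts)]
    if len(pts) < 2:
        raise ValueError("need at least two distinct words")
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx


if __name__ == "__main__":
    toy = "AAAACCGT"  # made-up sequence, not genomic data
    counts = kmer_counts(toy, 1)
    print(rank_frequency(counts), shannon_entropy(counts))
```

On real sequences one would compare these statistics between coding and non-coding regions, or against a shuffled control, before reading anything linguistic into them.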

Waseem, Muhammad, et al. Language-independence of DisCoCirc’s Text Circuits: English and Urdu. arXiv:2208.10281. An Oxford University Computational Intelligence team, including Bob Coecke, continues to finesse reasons why genomic and linguistic descriptive phases can be found to have a common character which arises from a communicative reality.

DisCoCirc is a newly proposed framework for representing the grammar and semantics of texts using compositional, generative circuits. While it advances the Categorical Distributional Compositional (DisCoCat) framework, it achieves radical new features toward eliminating grammatical differences between languages. In this paper we suggest that this is indeed the case for restricted fragments of English and Urdu. There is a simple translation from English grammar to Urdu grammar, and vice versa. We then show that differences in grammatical structure between English and Urdu - primarily relating to the ordering of words and phrases - vanish when passing to DisCoCirc circuits. (Abstract)

Wilson, Erin, et al. Genotype Specification Language. ACS Synthetic Biology. 5/6, 2016. We cite this entry by a nine-member team including Darren Platt of Amyris Biotechnologies, Emeryville, CA as an example of how the nascent field of genoinformatics or genolinguistics is moving to respectfully reinvent a much better life, environment and sustainable planet. See also Double Dutch: A Tool for Designing Combinatorial Libraries of Biological Systems by Nicholas Roehner in this same issue.

We describe here the Genotype Specification Language (GSL), a language that facilitates the rapid design of large and complex DNA constructs used to engineer genomes. The GSL compiler implements a high-level language based on traditional genetic notation, as well as a set of low-level DNA manipulation primitives. The language allows facile incorporation of parts from a library of cloned DNA constructs and from the “natural” library of parts in fully sequenced and annotated genomes. GSL was designed to engage genetic engineers in their native language while providing a framework for higher level abstract tooling. To this end we define four language levels, Level 0 (literal DNA sequence) through Level 3, with increasing abstraction of part selection and construction paths. GSL targets an intermediate language based on DNA slices that translates efficiently into a wide range of final output formats, such as FASTA and GenBank. (Abstract)

Witzany, Gunther. Biocommunication and Natural Genome Editing. Dordrecht: Springer, 2010. A book-length exposition by the Austrian philosopher and editor (see next) of the linguistic turn to perceive living systems, across many nested whole scales, as most characterized by text-like, informational qualities. In an initial chapter an historic sequence of worldviews from “monistic-organismic” to “pluralistic-mechanistic” to this nascent “organic-morphological” phase is laid out. (Reading this synopsis, one is struck by an apparent Right to Left to Whole Brain passage for humankind that we retrace in our own lives.) The next chapters dutifully span flora and fauna from viral, genomic, fungal, bacterial, cellular, and honey bee realms so as to highlight their dialogic, quorum-sensing essence. Altogether with other recent postings (e.g. Beckner, et al) a sense of a natural genesis may accrue that is intrinsically textual as genetics and language meld in a singular evolutionary emergence. Since both modes are being found to express a self-organizing complex dynamics, this phenomenal propensity itself could take on the guise of a universe to human genetic code.

Current molecular biology as well as cell biology investigates its scientific object by using key terms such as genetic code, code without commas, misreading of the genetic, coding, open frame reading, genetic storage medium DNA, genetic information, genetic alphabet, genetic expression, messenger RNA, cell-to-cell communication…. All these terms combine a linguistic and communication theoretical vocabulary with a biological one. In this book I try to introduce an appropriate model to exemplify this vocabulary (which is used in biology all the time without people thinking about it), on the basis of explanation and understanding of a linguistic action, the great variety of communicative actions. (v)

In parallel, the usage of a ‘language’-metaphor has increased since the mid-twentieth century with the growing knowledge about this genetic code. Most of the processes which evolve, constitute, conserve, rearrange the genetic storage medium DNA are terms which were originally used in linguistics such as coding, copying, transcription, translation, signaling, signal transduction, etc. Meanwhile the linguistic approach has also lost its metaphorical character and the similarity between linguistic languages/codes and the genetic storage medium are not only accepted but are fully adapted in bioinformatics, biolinguistics, protein linguistics, biohermeneutics and biosemiotics. (198-199)

Wu, Fang, et al. Integration of pre-trained protein language models into geometric deep learning networks. Communications Biology. 6/876, 2023. Westlake University, Hangzhou, China, Yale University, and Tsinghua University, Beijing computational biologists provide another example of this frontier cross-adoption of protein linguistics with AI neural net contents. Our comment for these contributions is that as genetic and metabolic processes are able to be grammatically parsed, so to say, they gain a common textual basis. As a result, a wide and deep natural narrative is being realized in our midst, written in an ecosmome to geonome code script. See also ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training by Le Zhuo, et al. at arXiv:2403.07920 for more work in this regard.

Geometric deep learning has achieved much success in defining 3D structures of large biomolecules. Meanwhile, protein language models trained on 1D sequences apply to a broad range of applications. In this work, we integrate the knowledge learned by protein language models into geometric networks and evaluate a variety of protein representation learning benchmarks. The incorporation of protein language knowledge enhances geometric networks’ capacity and can be generalized to complex tasks. (Excerpt)

Wu, Yanying and Quanlong Wang. A Categorical Compositional Distributional Modelling for the Language of Life. arXiv:1902.09303. Oxford University computer neuroscientists treat and parse protein biology in a linguistic manner by way of the title model, which was developed by the Oxford computer science group (search Bob Coecke). We log this in amongst concurrent papers which establish a deep affinity between genetic and literary domains, broadly conceived, as a later evolutionary stage of one and the same natural script.

The Categorical Compositional Distributional (DisCoCat) Model is a powerful mathematical method for composing the meaning of sentences in natural languages. Since we can think of biological sequences as the "language of life", here we apply this model to see if we can obtain new insights and a better understanding of life’s language. We choose to focus on proteins as their linguistic features are more prominent as compared with other macromolecules such as DNA or RNA. Thus, we treat each protein as a sentence and its constituent domains as words. The meaning of a word or the sentence is its biological function, and the arrangement of protein domains corresponds to the syntax. Putting all those into the DisCoCat frame, we can "compute" the function of a protein with grammar rules that combine them together. (Abstract excerpts)
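
The compositional step described here can be hinted at with a toy calculation. In DisCoCat-style models a word of noun type is a vector, a word that takes a noun and yields a sentence is a matrix, and the grammar reduction is carried out as a contraction. The sketch below assumes, purely for illustration, a two-domain "protein" in which one domain is typed as a noun and the other as a noun-to-sentence map. The domain names, dimensions and numbers are all hypothetical and are not taken from the paper.

```python
def compose(noun_vec: list[float], functor: list[list[float]]) -> list[float]:
    """Contract a noun-type vector with a (noun -> sentence) matrix.

    functor[i][j] is the weight with which noun dimension i contributes
    to sentence (here: "function") dimension j. This mirrors the
    reduction  n . (n^r s) -> s  for a two-word sentence.
    """
    n_dims = len(noun_vec)
    s_dims = len(functor[0])
    return [
        sum(noun_vec[i] * functor[i][j] for i in range(n_dims))
        for j in range(s_dims)
    ]


if __name__ == "__main__":
    # Hypothetical "word" meanings for two protein domains.
    binding_domain = [1.0, 0.0]          # noun-type vector
    catalytic_domain = [[0.2, 0.8],      # noun -> function matrix
                        [0.5, 0.5]]
    # The "sentence" meaning: a vector over two made-up function labels.
    print(compose(binding_domain, catalytic_domain))
```

Longer domain architectures would need higher-order tensors and a full grammar for typing each domain, which this sketch does not attempt.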

Xiao, Yi, et al. Bridging Text and Molecule: A Survey on Multimodal Frameworks for Molecules. arXiv:2403.13830. Chinese Academy of Sciences AI researchers provide an example of how readily language-based content can be assimilated by computational methods as they are then employed to parse protein linguistics. Altogether a common natural narrative from nucleotides to nouns is being read and written anew.

With recent trend in machine learning and natural language processing is aimed at building multimodal frameworks to jointly model molecules with textual domain knowledge. In this paper, we present the first systematic survey of this integrative endeavor. We focus on advances in text-molecule alignment methods, categorizing current models into two groups based on their architectures and listing relevant pre-training tasks. We next delve into the utilization of large language models and prompting techniques for molecular tasks and present significant applications in drug discovery. (Excerpt)

Zaccagnino, Rocco, et al. Testing DNA Code Words Properties of Regular Languages. Theoretical Computer Science. 608/84, 2015. In a special issue From Computer Science to Biology and Back, we cite this entry by University of Salerno informatics researchers as an example of how DNA nucleotides are finding a common utility across disparate genetic, linguistic and computational domains.

One aspect of DNA Computing is the possibility of using DNA molecules for solving some “complicated” computational problems. In this context, the DNA code word design problem assumes a fundamental role: given a problem encoded in DNA strands and biochemical processes, the final computation is a concatenation of the input DNA strands that must allow us to recover the solution of the given problem in terms of the input (unique decipherability). Thus the initial set of DNA strands must be a code. In addition, it should satisfy some restrictions, called here DNA properties, in order to prevent them from interacting in undesirable ways. So a new interest towards the design of efficient algorithms for testing whether a language X is a code, has arisen from (wet) DNA Computing, but, as far as we know, only when X is a finite set. In this paper we provide an algorithm for testing whether an infinite but regular set of words is a code that avoids some DNA properties among unwanted intermolecular and intramolecular hybridizations. (Abstract)
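
For the finite case that the abstract mentions, unique decipherability can be checked with the classical Sardinas-Patterson test. The sketch below shows that textbook test only. It is not the paper's algorithm for infinite regular sets, and it does not check any of the hybridization-related DNA properties. The function name and the example word sets are hypothetical.

```python
def is_uniquely_decodable(words: set[str]) -> bool:
    """Sardinas-Patterson test for a FINITE set of words.

    Returns True when every concatenation of words from the set can be
    split back into those words in exactly one way.
    """
    code = set(words)
    if "" in code:
        return False  # the empty word can be inserted anywhere

    def residuals(prefixes: set[str], targets: set[str]) -> set[str]:
        """Suffixes left over when a word of `prefixes` starts a word of `targets`."""
        return {t[len(p):] for p in prefixes for t in targets if t.startswith(p)}

    # First generation: dangling suffixes between two distinct code words.
    current = {y[len(x):] for x in code for y in code if x != y and y.startswith(x)}
    seen: set[str] = set()
    while current:
        if current & code:
            return False  # a dangling suffix is itself a code word: ambiguity
        seen |= current
        following = residuals(code, current) | residuals(current, code)
        current = following - seen  # each suffix needs processing only once
    return True


if __name__ == "__main__":
    print(is_uniquely_decodable({"AC", "GT", "ACG"}))  # no ambiguous split exists
    print(is_uniquely_decodable({"A", "AC", "CA"}))    # "ACA" = A+CA = AC+A
```

Because every dangling suffix is a suffix of some code word, the loop visits only finitely many strings and always terminates.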

Zambon, A., et al. Structure of the space of folding protein sequences defined by large language models. Physical Biology. January, 2024. We cite this entry by Center for Complexity and Biosystems, University of Milan researchers as another instance of this mid 2020s cross-integrity of metabolic methods with AI computational network capabilities.

Proteins populate a sequence space whose geometrical structure guides their natural evolution. By way of transformer models, we examine the protein landscape as an effective energy of sequence foldability, an approach similar to optimization methods in machine learning. We then employ statistical mechanics algorithm to explore regions with high local entropy in relatively flat landscapes. Our work thus combines machine learning and statistical physics so to provide new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside narrower minima. (Excerpt)
