IV. Ecosmomics: Independent, UniVersal, Complex Network Systems and a Genetic Code-Script Source

2. The Innate Affinity of Genomes, Proteomes and Language

Deming, Laura, et al. Genetic Architect: Discovering Genomic Structure with Learned Neural Architectures. arXiv:1605.07156. As the genetic and brain sciences merge and cross-inform, UC San Francisco, Institute for Human Genetics, researchers show how deep learning network algorithms can also effectively parse genome phenomena. An inclusive synthesis then seems underway from celestial webs (Coutinho) to literature (Rosetta Cosmos), as a cosmos to culture anatomy and physiology just now becomes realized.

Each human genome is a 3 billion base pair set of encoding instructions. Decoding the genome using deep learning fundamentally differs from most tasks, as we do not know the full structure of the data and therefore cannot design architectures to suit it. As such, architectures that fit the structure of genomics should be learned not prescribed. Here, we develop a novel search algorithm, applicable across domains, that discovers an optimal architecture which simultaneously learns general genomic patterns and identifies the most important sequence motifs in predicting functional genomic outcomes. The architectures we find using this algorithm succeed at using only RNA expression data to predict gene regulatory structure, learn human-interpretable visualizations of key sequence motifs, and surpass state-of-the-art results on benchmark genomics challenges. (Abstract)

Deep learning demonstrates excellent performance on tasks in computer vision, text and many other fields. Most deep learning architectures consist of matrix operations composed with non-linearity activations. Critically, the problem domain governs how matrix weights are shared. In convolutional neural networks – dominant in image processing – translational equivariance is encoded through the use of the convolution operation; in recurrent networks – dominant in sequential data – temporal transitions are captured by shared hidden-to-hidden matrices. These architectures mirror human intuitions and priors on the structure of the underlying data. Genomics is an excellent domain to study how we might learn optimal architectures on poorly-understood data because while we have intuition that local patterns and long-range sequential dependencies affect genetic function, much structure remains to be discovered. (1)
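As a minimal sketch of the weight sharing described in the quote, the Python below slides a single filter over a one-hot encoded DNA string. It is an illustration only, not the authors' learned architecture: the `TATA` filter, the toy sequence and the helper names `one_hot` and `conv1d` are all assumptions made for this example.

```python
# Sketch only: one convolutional filter sliding over one-hot encoded DNA.
# The filter, sequence and function names are hypothetical illustrations.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def conv1d(encoded, kernel):
    """Apply the same kernel (width x 4 weights) at every sequence position."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores


# Hypothetical filter that scores one point per base matching "TATA".
tata_kernel = one_hot("TATA")

seq = "GGTATACC"
shifted = "G" + seq  # the same sequence moved one position to the right

scores = conv1d(one_hot(seq), tata_kernel)
shifted_scores = conv1d(one_hot(shifted), tata_kernel)

best = scores.index(max(scores))
shifted_best = shifted_scores.index(max(shifted_scores))
print(best, shifted_best)
```

Because the same weights are reused at each position, shifting the input by one base shifts the peak response by one position, which is the translational equivariance the quote refers to.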

Dunn, Ian. Are Molecular Alphabets Universal Enabling Factors for the Evolution of Complex Life? Origins of Life and Evolution of Biospheres. 43/6, 2013. From the CytoCure LLC geneticist and research director comes a description of nucleotides and proteins as most like an alphabetic string of characters. While a correspondence between genetics and linguistics has been in the offing for some time, Ian Dunn gives it an updated, robust affirmation. Thus “a digital self-organizing complementary primary replicative alphabet” is seen as a universal property of genomic phenomena. And as ever the implication is a textual, inscribed, naturome as life’s newly legible script.

Terrestrial biosystems depend on macromolecules, and this feature is often considered as a likely universal aspect of life. While opinions differ regarding the importance of small-molecule systems in abiogenesis, escalating biological functional demands are linked with increasing complexity in key molecules participating in biosystem operations, and many such requirements cannot be efficiently mediated by relatively small compounds. It has long been recognized that known life is associated with the evolution of two distinct molecular alphabets (nucleic acid and protein), specific sequence combinations of which serve as informational and functional polymers. In contrast, much less detailed focus has been directed towards the potential universal need for molecular alphabets in constituting complex chemically-based life, and the implications of such a requirement.

To analyze this, emphasis here is placed on the generalizable replicative and functional characteristics of molecular alphabets and their concatenates. A primary replicative alphabet based on the simplest possible molecular complementarity can potentially enable evolutionary processes to occur, including the encoding of secondarily functional alphabets. Very large uniquely specified (‘non-alphabetic’) molecules cannot feasibly underlie systems capable of the replicative and evolutionary properties which characterize complex biosystems. Transitions in the molecular evolution of alphabets can be related to progressive bridging of barriers which enable higher levels of biosystem organization. It is thus highly probable that molecular alphabets are an obligatory requirement for complex chemically-based life anywhere in the universe. In turn, reference to molecular alphabets should be usefully applied in current definitions of life. (Abstract)

In order to evaluate the role of molecular alphabets in biosystems as potentially universal phenomena, it is useful to initially review how such alphabets are defined in general terms, and to consider how they may be grouped into different functional classes. An alphabet in the molecular sense generally refers to the defined set of monomers from which functional oligomers or polymers are derived, in an analogous fashion to stringing individual letters together in the correct sequence of a string itself. Consequently, it is always necessary to compare a set of alphabetic monomers with their concatenated functional forms. The following statement serves as a practical generalizable definition: ‘A biological molecular alphabet consists of a specific set of relatively small distinct molecules which by means of a templating process can covalently concatenate into a large number of combinatorial alternative oligomers or polymers with specified sequences, of essential informational and/or functional significance for biosystem operations.’ (447)

Eetemadi, Ameen and Ilias Tagkopoulos. Genetic Neural Networks: An Artificial Neural Network Architecture for Capturing Gene Expression Relationships. Bioinformatics. 35/13, 2019. We cite this entry by UC Davis computer scientists to show how readily these popular analytic methods seem to find similar application everywhere, even in this case so as to parse life’s heredity. Could this commonality imply that brains and genomes and all else are deeply cerebral, information bearing, relatively aware in kind?

Results: We present the Genetic Neural Network (GNN), an artificial neural network for predicting genome-wide gene expression given gene knockouts and master regulator perturbations. In its core, the GNN maps existing gene regulatory information in its architecture and it uses cell nodes that have been specifically designed to capture the dependencies and non-linear dynamics that exist in gene networks. Our results argue that GNNs can become the architecture of choice when building predictors of gene expression from the growing corpus of genome-wide transcriptomics data.

Elnaggar, A., et al. ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence. 14/8, 2021. Twelve computer scientists mainly at the Technical University of Munich explore these 2020 frontiers of deep new methods and insights into the deep, natural grammars of the language of life (from the text). A long Abstract cites many computer code methods with a facile ability to read, write and take up natural, ecosmomic code-scripts. See also CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing by the same group at arXiv:2104.02443.

Faltynek, Dan, et al. Bases are not Letters: On the Analogy between the Genetic Code and Natural Language by Sequence Analysis. Biosemiotics. Online April, 2019. Palacky University, Olomouc, Czech Republic system scholars DF, Vladimir Matlach, and Ludmila Lackova (search) continue their project to parse an endemic, natural affinity between the prime informative occasions of biochemical nucleotide genomes and human linguistic complexities.

The article deals with the notion of the genetic code and its metaphorical understanding as a “language”. In the traditional view of the language metaphor of the genetic code, combinations of nucleotides are signs of amino acids. Similarly, words combined from letters (speech sounds) represent certain meanings. The language metaphor of the genetic code assumes that the nucleotides stay in the analogy to letters, triples to words and genes to sentences. We propose an application of mathematical linguistic methods on the notion of the genetic code. We provide quantitative analysis (n-gram structure, Zipf’s law) of mRNA strings and natural language texts, along with a representative analysis of DNA, RNA and proteins. Our analysis of mRNA confirms an assumption that the design of the genetic code cannot analogize DNA bases and letters. The notion of the letter is much more appropriate if analogized with triplets or amino acids. (Abstract excerpt)
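The kind of quantitative analysis the abstract mentions (n-gram structure, Zipf's law) can be sketched roughly as follows. This is not the authors' pipeline: the mRNA string is a made-up toy example, and the helper names `ngrams`, `rank_frequency` and `zipf_slope` are assumptions for illustration. The sketch tokenizes the same string as single bases and as non-overlapping triplets, then fits a log-log slope to each rank-frequency list.

```python
from collections import Counter
import math


def ngrams(seq, n, step=1):
    """Split a sequence into n-grams; step=n gives non-overlapping units."""
    return [seq[i:i + n] for i in range(0, len(seq) - n + 1, step)]


def rank_frequency(tokens):
    """Token frequencies sorted from most to least common (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks. A slope near -1 is the classic Zipf pattern.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy mRNA string invented for this example; real analyses use long transcripts.
mrna = "AUGGCUGCUGCUAAAGCUUAA"

codons = ngrams(mrna, 3, step=3)          # triplets treated as "letters"
base_freqs = rank_frequency(list(mrna))   # single bases treated as "letters"
codon_freqs = rank_frequency(codons)

print(codons)
print(base_freqs, codon_freqs)
print(zipf_slope(codon_freqs))
```

On real data one would compare the fitted slopes and n-gram distributions against those of natural language texts; a string this short only shows the mechanics.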

Ferrer-i-Cancho, Ramon and Nuria Forns. The Self-Organization of Genomes. Complexity. Online First, March, 2010. As the quote cites, Barcelona biologists contribute to the recent robust affirmation that genetic and linguistic codes are one and the same in their expression of the universal complex system dynamics, which then, one may add, could take on the likeness of an independent, mathematical cosmic genotype.

Menzerath-Altmann law is a general law of human language stating, for instance, that the longer a word, the shorter its syllables. With the metaphor that genomes are words and chromosomes are syllables, we examine if genomes also obey the law. We find that longer genomes tend to be made of smaller chromosomes in organisms from three different kingdoms: fungi, plants, and animals. Our findings suggest that genomes self-organize under principles similar to those of human language. (Abstract)

Ferrer-i-Cancho, Ramon, et al. The Challenges of Statistical Patterns of Language: The Case of Menzerath’s Law in Genomes. Complexity. Online December, 2012. With coauthors Nuria Forns, Antoni Hernandez-Fernandez, Gemma Bel-Enguix and Jaume Baixeries, Barcelona systems scientists advise that along with (George Kingsley) Zipf’s law, the theorem of German linguist Paul Menzerath about word or note frequencies in a text or score can hold equally well for biomolecular nucleotide genomes. By these lights, another entry is gained to appreciate a deep, parallel affinity between the genetic code and literate languages.

The importance of statistical patterns of language has been debated over decades. Although Zipf's law is perhaps the most popular case, recently, Menzerath's law has begun to be involved. Menzerath's law manifests in language, music and genomes as a tendency of the mean size of the parts to decrease as the number of parts increases in many situations. This statistical regularity emerges also in the context of genomes, for instance, as a tendency of species with more chromosomes to have a smaller mean chromosome size. It has been argued that the instantiation of this law in genomes is not indicative of any parallel between language and genomes because (a) the law is inevitable and (b) noncoding DNA dominates genomes. Here mathematical, statistical, and conceptual challenges of these criticisms are discussed. Two major conclusions are drawn: the law is not inevitable and languages also have a correlate of noncoding DNA. However, the wide range of manifestations of the law in and outside genomes suggests that the striking similarities between noncoding DNA and certain linguistics units could be anecdotal for understanding the recurrence of that statistical law. (Abstract)

Languages and genomes show a striking similarity at the semantic level: both possess units that have an arbitrary semantic reference of symbolic nature. Our comparison goes further and suggests that genomes code for some abstract version of grammatical and lexical meaning, the former in non‐coding regions and the latter in coding regions. (6) Quantitative linguistics offers powerful tools for discovering and investigating non‐trivial connections between human language and genomes. However, the evolutionary mechanisms and the constraints that may underlie the recurrence of Menzerath’s law still must be understood. (6)

Ferruz, Noelia, et al. ProtGPT2 is a Deep Unsupervised Language Model for Protein Design. Nature Communications. 13/4348, 2022. University of Bayreuth, Germany system biochemists describe current progress toward a deep unity of life’s two prime genetic and linguistic code domains. By virtue of AI/ML facilities, into the 2020s an infinite affinity, as long sensed, is being revealed. This worldwise phase of palliative and enhanced metabolomics then brings much promise for health and welfare.

Protein design projects aim to build novel biomolecules customized for specific purposes so as to potentially solve many environmental and biomedical problems. Recent progress in Transformer-based architectures has enabled language models which can generate text with human-like capabilities. Here, we describe ProtGPT2, a linguistic form that can build de novo protein sequences following the principles of natural ones. AlphaFold prediction of ProtGPT2-sequences yields structures with embodiments and topologies not captured in current structure databases.

Natural language processing (NLP) has seen many advances in recent years. Analogies between protein sequences and human languages have long been noted: protein sequences can be described as a concatenation of letters from a chemically defined alphabet. Both amino acid sequences and human text arrange letters to form structural elements (“words”) which assemble into domains (“sentences”) that undertake a function (“meaning”). A vital similarity is that protein sequences, like natural languages, are information-complete: they fully store structure and function. We propose that these methods open a new approach to the whole metabolic field of proteomics. (1)

Flam-Shepherd, Daniel, et al. Atom-by-atom protein generation and beyond with language models. arXiv:2308.09482. We post an August entry by University of Toronto and Vector Institute researchers including Alán Aspuru-Guzik to record much current activity in biocomputational studies which now join Large Language Models of AI neural machine learning methods. As the excerpt cites, a broad continuity across chemical, genetic, biochemical and onto linguistic phases bodes for an innately informational, universe to wumanverse, literacy to literacy procreative milieu. See also, for example, PEvoLM: Protein Sequence Evolutionary Information Language Model by Issar Arab at arXiv:2308.08578.

Protein language models learn powerful representations directly from sequences of amino acids. In contrast, chemical language models learn atom-level representations of smaller molecules that include every atom, bond, and ring. In this work, we show that chemical language models can learn atom-level proteins which can generate the standard genetic code and far beyond it. The results demonstrate the potential for biomolecular design at the atom level using language models. (Excerpt)

Gimona, Mario. Protein Linguistics and the Modular Code of the Cytoskeleton. Barbieri, Marcello, ed. The Codes of Life. Berlin: Springer, 2008. The University of Salzburg geneticist contributes to the long project to interpret, join and unify the molecular and literal versions, in support of the growing conclusion that “Nature is Structured in a Language-like Fashion.” See also an earlier paper “Protein Linguistics – A Grammar for Modular Protein Assembly?” in Nature Reviews: Molecular Cell Biology (7/1, 2006).

Hackenberg, Michael, et al. Clustering of DNA Words and Biological Function: A Proof of Principle. Journal of Theoretical Biology. 297/127, 2012. University of Granada and University of Malaga, Spain system biologists including Pedro Carpena contribute to historic 2010s verifications that the molecular nucleotide version and human cultural literature are one and the same, that they are formed and suffused by the same informative nonlinear complex network systems. View articles of this kind, for example, in the journal Complexity over recent years.

Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text par excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less-important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k=2–9 bp) in human and mouse chromosome sequences, then check if clustered words are enriched in the functional part of the genome. The clustering of DNA words thus appears as a novel principle to detect functionality in genome sequences. As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and the improvement of gene and promoter predictions, thus contributing to the quest for function in the genome. (Abstract excerpt)
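The excerpt does not define its clustering coefficient, so the Python below uses a simple stand-in as an assumption: the ratio of the standard deviation to the mean of the gaps between consecutive occurrences of a k-mer. Evenly spaced words score near 0, and words bunched into clusters separated by long gaps score above 1. The function names `kmer_positions` and `gap_cv` and the toy inputs are hypothetical.

```python
from statistics import mean, pstdev


def kmer_positions(seq, k):
    """Map each k-mer to the list of start positions where it occurs."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions


def gap_cv(positions):
    """Std/mean of gaps between consecutive occurrences.

    An assumed stand-in for a clustering coefficient, not the paper's exact
    definition. Returns None when there are fewer than three occurrences.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    return pstdev(gaps) / mean(gaps)


# Toy position lists: one evenly spaced word, one bunched word.
even = gap_cv([0, 10, 20, 30])
bunched = gap_cv([0, 2, 4, 100])
print(even, bunched)

# Positions of every 3-mer in a short made-up sequence.
print(kmer_positions("ACGACG", 3))
```

A fuller analysis would compute this score for every k-mer along a chromosome and then ask whether the high-scoring words fall preferentially in annotated functional regions, as the abstract describes.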

Heckmeier, Philipp, et al. A billion years of evolution manifest in nanosecond protein dynamics. PNAS. 121/10, 2024. We cite this paper by University of Zurich and Columbia University biochemists as an example of how far the scope and range of these current techniques can reach. And again who are we peoples with an Earthomo sapience to be able to look down and back and reconstruct and re-present how it all came to occur?

Protein dynamics forms a broad bridge between structure and function, yet the impact of evolution on ultrafast protein processes remains enigmatic. This study delves into the nanosecond-scale phenomena of a conserved protein across species separated by almost a billion years as a way to investigate ten complex homologs. In so doing, we found a cascade of rearrangements which manifest in discrete time points over hundreds of millions of years. Our work poses a novel scientific inquiry within molecular paleontology, set against the rapid pace of protein processes, which can connect the shortest time scale in living matter (10^-9 s) with the largest ones (10^16 s). (Abstract)
