ASCII_DNA: Biomolecular Writing in a Nutshell

Eugene Thacker/Biotech Hobbyist

Introduction:

This is a how-to manual for various techniques in “biomolecular writing,” using only pre-existing, freely available bioinformatics research tools.

What is bioinformatics? There are numerous definitions, and each one emphasizes a different aspect of biotech research. Broadly speaking, bioinformatics is an emerging field which involves the integration of computer technologies into molecular biology research. This combination of biology and computers takes many forms, however. In many cases bioinformatics simply refers to software tools that aid biotech research, be it the study of genes, proteins, or biochemical networks. The most famous example of bioinformatics in action is the human genome projects: both the International Human Genome Sequencing Consortium and Celera Genomics made extensive use of computational tools (hardware and software) to sequence, assemble, annotate, and archive entire genomes. At the root of bioinformatics is a view of biological life through the lens of informatics (genetic and protein codes, genome databases, protein signaling, and so forth).

What does bioinformatics do? Basically, bioinformatics is concerned with two threads of research: sequence and structure. By sequence we mean any type of research which is primarily concerned with elucidating and studying biomolecular sequences such as DNA or RNA. In this thread, a research team can take a test sample of DNA which they know nothing about, and compare it to any number of genome databases to see if there is a match or close relationship with any known gene. By structure, the second thread, we mean any type of research which is primarily concerned with the three-dimensional shape and molecular interactions of biomolecules, such as proteins. Here a research team may want to study how a particular sequence of amino acids folds into a complex 3-D shape, enabling it to act as an enzyme, antibody, or other type of molecule.

What is biomolecular writing? We can think of it first as a practice, and as a means of inquiry into the relationships between DNA and data, proteins and information, biologies and technologies. Biomolecular writing can be thought of as a means of investigating the informatic view of biological life, which is the dominant way in which human genome projects and other like efforts conceive of the body. The notion of writing is already prevalent in molecular biology (translation, transcription, RNA editing, etc.), and we can use literary texts as a kind of hinge-object between genetic “codes” and computer “codes.”



How to generate novel protein structures from literary texts:

1. Sequence preparation: Select sample text from database. Because genes and proteins vary widely in the length of their units, you can experiment with sample text length. We chose Mary Shelley’s Frankenstein from the Project Gutenberg database, one of the most well-known databases of literary texts. We took the first paragraph of Chapter 11 as our sample text, comprising approximately 1,122 characters:

It is with considerable difficulty that I remember the original era of my being; all the events of that period appear confused and indistinct. A strange multiplicity of sensations seized me, and I saw, felt, heard, and smelt at the same time; and it was, indeed, a long time before I learned to distinguish between the operations of my various senses. By degrees, I remember, a stronger light pressed upon my nerves, so that I was obliged to shut my eyes. Darkness then came over me and troubled me, but hardly had I felt this when, by opening my eyes, as I now suppose, the light poured in upon me again. I walked and, I believe, descended, but I presently found a great alteration in my sensations. Before, dark and opaque bodies had surrounded me, impervious to my touch or sight; but I now found that I could wander on at liberty, with no obstacles which I could not either surmount or avoid. The light became more and more oppressive to me, and the heat wearying me as I walked, I sought a place where I could receive shade. This was the forest near Ingolstadt; and here I lay by the side of a brook resting from my fatigue, until I felt tormented by hunger and thirst. This roused me from my nearly dormant state, and I ate some berries which I found hanging on the trees or lying on the ground. I slaked my thirst at the brook, and then lying down, was overcome by sleep.




2. Sequence preparation: Filter the input text as either nucleotide or polypeptide sequence. We used the Filter DNA or Filter Protein bioinformatics tools to extract DNA or amino acid sequence from the text; these tools discard every character that is not a valid nucleotide or amino acid letter code. From the Frankenstein sample text, we extracted a protein sequence of 948 residues:

ITISWITHCNSIDERALEDIFFICLTYTHATIREMEMERTHERIGINALERAFMYEINGALLTHEEVENTSF
THATPERIDAPPEARCNFSEDANDINDISTINCTASTRANGEMLTIPLICITYFSENSATINSSEIEDMEAND
ISAWFELTHEARDANDSMELTATTHESAMETIMEANDITWASINDEEDALNGTIMEEFREILEARNEDTDIST
INGISHETWEENTHEPERATINSFMYVARISSENSESYDEGREESIREMEMERASTRNGERLIGHTPRESSED
PNMYNERVESSTHATIWASLIGEDTSHTMYEYESDARKNESSTHENCAMEVERMEANDTRLEDMETHARDLYH
ADIFELTTHISWHENYPENINGMYEYESASINWSPPSETHELIGHTPREDINPNMEAGAINIWALKEDANDIEL
IEVEDESCENDEDTIPRESENTLYFNDAGREATALTERATININMYSENSATINSEFREDARKANDPAQEDIE
SHADSRRNDEDMEIMPERVISTMYTCHRSIGHTTINWFNDTHATICLDWANDERNATLIERTYWITHNSTACL
ESWHICHICLDNTEITHERSRMNTRAVIDTHELIGHTECAMEMREANDMREPPRESSIVETMEANDTHEHEA
TWEARYINGMEASIWALKEDISGHTAPLACEWHEREICLDRECEIVESHADETHISWASTHEFRESTNEARIN
GLSTADTANDHEREILAYYTHESIDEFARKRESTINGFRMMYFATIGENTILIFELTTRMENTEDYHNGERAND
THIRSTTHISRSEDMEFRMMYNEARLYDRMANTSTATEANDIATESMEERRIESWHICHIFNDHANGINGNTHE
TREESRLYINGNTHEGRNDISLAKEDMYTHIRSTATTHERKANDTHENLYINGDWNWASVERCMEYSLEEP
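
The filtering step can be approximated in a few lines of Python. This is only a minimal sketch, not the Filter DNA or Filter Protein tools themselves: it assumes the protein filter keeps the twenty standard one-letter amino acid codes (which is consistent with the sequence above, where letters such as B, O and U have dropped out) and that the DNA filter keeps only A, C, G and T. The function names are hypothetical, and the real tools may also accept ambiguity codes.

```python
# Assumed alphabets: the 20 standard amino acid codes and the 4 DNA bases.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDES = set("ACGT")

def filter_sequence(text, alphabet):
    """Uppercase the text and keep only the characters found in alphabet."""
    return "".join(ch for ch in text.upper() if ch in alphabet)

def filter_protein(text):
    """Reduce arbitrary text to a string of amino acid letter codes."""
    return filter_sequence(text, AMINO_ACIDS)

def filter_dna(text):
    """Reduce arbitrary text to a string of A, C, G and T."""
    return filter_sequence(text, NUCLEOTIDES)

def wrap(sequence, width=72):
    """Break a long sequence into fixed-width lines for pasting into a web form."""
    return "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))

if __name__ == "__main__":
    sample = "It is with considerable difficulty that I remember"
    print(wrap(filter_protein(sample)))
    print(filter_dna(sample))
```

Run on the opening words of the sample text, filter_protein gives ITISWITHCNSIDERALEDIFFICLTYTHATIREMEMER, which matches the start of the sequence above.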




3. First run: DNA or amino acid sequence is checked against genome and protein databases for potential near-matches. We chose a standard BLAST query at the NCBI website. Our text sample, now converted to protein sequence, is put through a BLAST search. We ran both blastp (for protein-protein comparisons) and tblastn (which compares the protein sequence against a nucleotide database translated in all six reading frames). The blastp results showed only three distant homologies, with P-values of 5 or higher (P-values greater than 0.8-1.0 indicate a lack of relevant homology). The high P-values confirmed that there were no known close matches for the Frankenstein protein code.

However, the tblastn query (which translates each sequence in its nucleotide database into protein in all six reading frames, then compares those translations to the protein code) returned several possible matches with low P-values (0.008-0.19). The best candidate was a DNA sequence from Takifugu rubripes (Japanese puffer fish).
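
To see what a translated search has to do before it can compare a protein query with DNA, it helps to translate a stretch of DNA in all six reading frames by hand. The sketch below is not BLAST's implementation and performs no alignment or scoring; it assumes the standard genetic code, and the function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate_frame(dna, offset=0):
    """Translate complete codons starting at offset; '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to their protein translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate_frame(seq, offset)
    return frames

if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six translations is a candidate protein that the query can be compared against, which is why a text-derived protein with no protein-database match can still turn up hits in a nucleotide database.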




4. Second run: The protein code is then put through protein structure prediction. While there are many tools of this type which predict different kinds of protein structure (secondary structure, conserved domains, motifs), we chose 3D-PSSM, because of its ability to predict and model various types of protein structures. The 3D-PSSM server returned an array of potentially homologous 3D structures.

As a check on protein structure prediction, we also put the Frankenstein sequence through the SWISS-MODEL server, which performs a more rigorous match of a protein sequence against multiple databases, according to parameters set by the user. As expected, SWISS-MODEL did not find any matches with P-values of less than 0.001.




5. Structural analysis: The structures with the highest statistical probability were downloaded as both static images and as 3D files. We chose the stand-alone version of RasMol to model the protein structures. As is evident from the multiple views which RasMol makes available, this novel Frankenstein protein structure is incomplete; it does not have full H-bonds or C-bonds to complete the "backbone" of the structure, although an approximation can be given of its overall structure using RasMol.

Despite the "incompleteness" or "impossibility" of this protein structure, if you get to this stage you still have a novel biomolecule generated from a literary text using bioinformatics tools. Perhaps its incompleteness is indicative of the "content" of the protein?
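
If the downloaded 3D file is in PDB format (which RasMol reads), the gaps in the backbone can also be listed directly instead of being judged by eye. The following is a rough sketch under stated assumptions: it relies on the standard fixed-column layout of PDB ATOM records, it treats N, CA, C and O as the backbone atoms of each residue, and the function name is hypothetical.

```python
from collections import defaultdict

# Assumed definition of a complete residue backbone.
BACKBONE = {"N", "CA", "C", "O"}

def backbone_gaps(pdb_text):
    """Return {(chain, residue number): [missing backbone atoms]}.

    Only residues that appear in at least one ATOM record are examined.
    """
    atoms_seen = defaultdict(set)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and len(line) >= 26:
            atom_name = line[12:16].strip()   # columns 13-16
            chain = line[21]                  # column 22
            residue_number = int(line[22:26]) # columns 23-26
            atoms_seen[(chain, residue_number)].add(atom_name)
    return {
        residue: sorted(BACKBONE - atoms)
        for residue, atoms in atoms_seen.items()
        if BACKBONE - atoms
    }

if __name__ == "__main__":
    import sys
    # Usage: python backbone_gaps.py model.pdb
    with open(sys.argv[1]) as handle:
        for residue, missing in sorted(backbone_gaps(handle.read()).items()):
            print(residue, "missing", ", ".join(missing))
```

An empty result means every listed residue has its four backbone atoms; it does not say whether the fold itself is plausible.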




6. Laboratory synthesis: When feasible, the novel protein structures are synthesized in the lab using standard molecular biology techniques such as heating and cooling DNA samples, and induced protein crystallization. Because our Frankenstein protein structure is incomplete, however, an attempted synthesis will only produce incomplete residues without structural coherence. You can still try, even though you may come out with mush (which might be appropriate).