Molecular Biology in a Nutshell

gnat on 2001-09-04T19:14:22

Nope, that's not a new O'Reilly title (hmm, but now I think of it ... :-). Here's what I understand from the molbio textbooks I've been reading.



Picture a person. Zoom in and you'll see that they're made up of cells--skin, flesh, bone marrow, hair, even snot. Zoom in on a cell and see that it has different parts to it. The most interesting are the nucleus (where the DNA is kept) and the cytoplasm (where proteins are built from the DNA in the cell).



Proteins are interesting because most everything in your body is built of, regulated by, or in some other fashion tied into, proteins. Blood, hormones, and cells themselves are built of proteins. Many diseases can be traced to a shortage or an excess of some particular protein.



DNA is interesting because it is the blueprint for the manufacture of these proteins.



Zoom in on the nucleus, and you'll see forty-six chromosomes, arranged as twenty-three pairs. Each chromosome is a pair of long DNA molecules wound around each other, and some chromosomes are longer than others (but both DNA molecules in a chromosome are the same length). Zoom in on a chromosome and you'll see that double helix shape that you've heard so much about.



The DNA double helix is very structured. A single strand of DNA is made up of a series of smaller chunks, called bases. There's no limit to the number of bases you can string together; they're like Legos. There are only four possible bases, known by the first letter of their chemical name: A, T, G, and C.



The cool thing about the bases is that they form bonds: A bonds with T, G bonds with C. Those bonds are what hold the pair of DNA molecules together in the helix shape. It's always the case that if one DNA molecule has an A in a particular place, then at the same place on the other DNA molecule in the pair will be a T. You can think of one as the complement of the other (like an XOR--do two complements and you've got the original DNA molecule back).
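The pairing rule is easy to play with in code. Here's a minimal Python sketch that treats a strand as a plain string and shows that complementing twice gets you back where you started. The names `PAIR` and `complement` are my own, and the sketch ignores the fact that the two strands run in opposite directions.

```python
# Base-pairing rule from the text: A bonds with T, G bonds with C.
# PAIR and complement() are illustrative names, not from any library.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the strand that would pair with `strand`, base by base."""
    return "".join(PAIR[base] for base in strand)


strand = "ATCGACATTTAGA"
other = complement(strand)
print(other)                        # TAGCTGTAAATCT
print(complement(other) == strand)  # True
```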



So you can think of a chromosome as just a huge string: ATCGACATTTAGA.... (Now you know why the name "GATTACA" for the movie about genetically engineering perfection was clever.)



Chunks of each chromosome called genes are templates for particular proteins. Not all of a chromosome is used for this--a lot of the chromosome has no protein coding function. Some of these "dead" parts are introns, bits that do nothing. Other non-protein-coding parts are used for regulation (deciding when to turn on genes), or for building non-protein things. The bits of the chromosome used to produce proteins are called exons because they express (produce) proteins.



Now zoom in on a protein. A protein is also made up of connected units. The units are amino acids. The gene says "this protein is built with these amino acids in this order". The order is very important, because proteins fold up into a 3D structure (the prediction of which, by the way, is a hugely difficult and important unsolved problem in biology) and that 3D structure determines which charged bits are where, and thus what the protein will form chemical bonds with.



A gene is a chunk of a chromosome, so it can be written as a string: GACTCG.... Those bases (the letters) are interpreted in chunks of three. Each triplet is called a "codon". Each codon specifies a particular amino acid. There are 64 possible combinations of 3 bases, and only 20 amino acids in proteins. Some codons specify the same amino acid, some are stop codons (that say "end of protein"--yes, the gene is like a Microsoft Windows file with a ^Z at the end. Woe is us).
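To make the arithmetic concrete, here's a small Python sketch that enumerates every possible codon and chops a gene string into triplets. The function name `codons` and the example string are made up for illustration, and the sketch assumes the reading frame starts at the first base.

```python
from itertools import product

BASES = "ATGC"

# Every possible triplet of the four bases: 4 * 4 * 4 = 64 codons.
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]


def codons(gene: str) -> list[str]:
    """Split a gene into consecutive triplets, dropping any leftover bases."""
    usable = len(gene) - len(gene) % 3
    return [gene[i:i + 3] for i in range(0, usable, 3)]


print(len(ALL_CODONS))        # 64
print(codons("GACTCGATTAG"))  # ['GAC', 'TCG', 'ATT']
```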



There are two important processes for DNA: transcription and replication. Transcription produces RNA (think of it as DNA's blue-collar cousin), and is part of producing a protein. Replication is producing another copy of the DNA (when a cell divides, the DNA strands separate, each strand is copied into its complement, and the strands re-pair so that each of the two cells gets a full set).



Transcription is the sexy stuff. It's a complex process--the nucleus is the guarded fortress of the cell, because it houses the secret plans (the DNA). So you don't want to let the blue-collar workers and raw materials in to assemble the proteins within the nucleus. Instead the cell transcribes the DNA into messenger RNA (mRNA). In this process, you get a complementary copy of a small piece of DNA. So if the original read GATTACA then the mRNA reads CTAATGT. Actually, it reads CUAAUGU, because RNA has [U]racil instead of [T]hymine; the two chemicals are almost identical, so they behave the same.
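Here's the GATTACA example as a few lines of Python. It's only a sketch of the letter-swapping: `DNA_TO_RNA` and `transcribe` are names I made up, and the real process involves a lot of machinery that a dictionary lookup doesn't capture.

```python
# DNA base -> the RNA base that pairs with it (RNA uses U where DNA uses T).
# Illustrative names only; this models the letter mapping, not the chemistry.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(dna: str) -> str:
    """Return the complementary mRNA string for a stretch of DNA."""
    return "".join(DNA_TO_RNA[base] for base in dna)


print(transcribe("GATTACA"))  # CUAAUGU
```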



The mRNA then runs to the factory (the cytoplasm) with the blueprint and the actual protein is assembled inside the cytoplasm. The protein is assembled with the help of transfer RNA (tRNA). Starting from one end of the mRNA and working to the other, the tRNA connects a codon in the mRNA with the appropriate amino acid. Specific amino acid sequences correspond to orders to leave the cell or go to a specific membrane, so when those sequences are encountered the new protein heads off to give you an erection, make you cry, or tell your blood to clot. This process of building a protein from RNA is "translation".
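Translation is basically a table lookup, three letters at a time. The Python sketch below uses a deliberately tiny slice of the codon table (the real one has all 64 entries), and the name `translate` and the example mRNA string are invented for illustration. It assumes the reading frame starts at the first base and stops at the first stop codon.

```python
# A tiny, partial codon table; the full table covers all 64 codons.
# None marks a stop codon. Names here are illustrative, not from a library.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "AAA": "Lys",
    "UAA": None,
    "UAG": None,
    "UGA": None,
}


def translate(mrna: str) -> list[str]:
    """Read mRNA codon by codon until a stop codon or the end of the string.

    Raises KeyError for any codon missing from the partial table above.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon: end of protein
            break
        protein.append(amino_acid)
    return protein


print(translate("AUGUUUGGCAAAUAAGCU"))  # ['Met', 'Phe', 'Gly', 'Lys']
```

Note that the GCU after the stop codon never gets read, which is the "^Z at the end of the file" behaviour.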



So biologists are interested in identifying genes and working out what they produce and when. The collection of all your genetic material is known as your genome, hence the Human Genome Project, which attempts to map all the genes in your DNA and the proteins they produce. Biologists are also interested in which proteins affect which parts of the body, when genes are "turned on" to produce their proteins, and so on.



And they use computers for a lot of this analysis. Take the problem of identifying genes. Given a protein, where's the gene that produces it? The human genome is huge: the average-sized gene contains at least 1200 bases, the best estimates are for about 30,000 genes, and those are scattered around the chromosomes, separated by the "noise" of introns and other non-coding regions (and even introns may have effects, just to make the task harder).
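To give a flavour of what "finding a gene in a big string" looks like, here's a toy Python scan for stretches that begin with the start codon ATG and run, three bases at a time, to a stop codon. This is a naive sketch of my own, not how real gene finders work: it ignores introns, the other strand, and regulation, and `find_orfs` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ATG...stop stretches, end exclusive.

    min_codons is the minimum number of codons before the stop codon.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk forward in steps of three from the codon after ATG.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                break
    return orfs


print(find_orfs("CCATGAAATTTTAGGC"))  # [(2, 14)]
```

The TGA that overlaps the ATG in that example is ignored because it isn't in the same reading frame.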



So there are ways to chop a single DNA strand into roughly gene-sized chunks, slosh the chemicals you're interested in over those chunks, and see what bonds form. This is "microarray" technology, also known as "gene on a chip" and "genechips". The only problem is that you get a crapload of data as a result, and you need computer programs to statistically chew through it and produce results like "hey, I'm pretty sure there's a gene for that at position 59,634,591".



The computational systems of biologists have sprung up in an ad-hoc fashion, and there are all sorts of problems.



First, some code was written by biologists hacking at code, much the same way as HTML monkeys would hack at Perl CGI code. This is changing now, though, as many more people have formal computer science and math training, and can develop the serious algorithms needed.



Second, the data that's kept is annotated using different conventions. The name of the human gene that regulates insulin is probably different from the mouse gene that regulates insulin, and both are probably different from the cow gene that regulates insulin. Often animals have genes in common (or similar), so biologists often want to search for a "gene that does X" in other species. But the terminology thwarts those searches.



Third, the publicly available data is so massive that there's been little quality control possible. If someone says that section 5,158,927 to 5,906,184 of chromosome 3 is "ATGC...", and megabytes of this data flood in every day, how do you review and quality control it? It's hard to make QA sexy; we know this from software.



So ... that's most of what I know about DNA, genes, and bioinformatics. I hope that helps those of you reading up at home. I'm chewing my way slowly through "Recombinant DNA" by James D. Watson, Michael Gilman, Jan Witkowski, and Mark Zoller (Scientific American Books). It's really well written, but there's so much to learn that it's taking me a long time to get through it.



I'm also keeping up with the bioinformatics mailing list. Hopefully, by the time of the Bioinformatics Conference in January I'll be able to understand at least a quarter of each talk :-)



--Nat