Friday, June 22, 2007

secneuqes AND sdrawkcaB

One of my early graduate rotation projects (the period when you are scoping out an advisor -- and the advisor is scoping you out!) in the Church lab was to develop a set of scripts to take a bacterial DNA sequence, extract all of the possible Open Reading Frames longer than a threshold & BLAST those against the protein database. Things went great with a sizable sequence from GenBank, so I asked for a large sequence generated in-house. The results were curious: no particularly long ORFs, and none of them matched anything.

Puzzled, I reported this to George & it took him a moment to think of the answer: the sequence was backwards.
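For readers who want to see the idea in miniature, here is a rough sketch of a six-frame ORF scan with a length threshold. It is not the original rotation scripts: the function names, the ATG-to-stop definition of an ORF, the standard stop codons and the `min_codons` parameter are all assumptions made for illustration, and the BLAST step is left out.

```python
# Illustrative sketch only; names and ORF definition are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Yield (strand, start, end, orf) for ATG..stop ORFs in all six frames.

    start and end are 0-based coordinates on the strand being scanned.
    The threshold counts codons from the ATG up to, but not including,
    the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None
```

Running `find_orfs` on a sequence and on the same sequence reversed (`seq[::-1]`, which is not the reverse complement) generally gives very different answers, because reversal turns every start and stop codon into something else.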

We write DNA sequences in a particular order for a reason: that is the order (5'->3') in which Nature makes DNA. The underlying chemistry is such that this is one of the few inviolate rules of biology: thou shalt not polymerize nucleotides in a 3'->5' direction. The technology which has dominated in recent times, Sanger dideoxy sequencing, relies on DNA polymerization and so can also read sequence only in a 5'->3' direction. Most of the 'next generation' technologies which are becoming available, such as 454 and Illumina/Solexa, also rely on polymerase extension and have an imposed direction.

But Sanger sequencing once had a serious rival: chemical sequencing. The Maxam-Gilbert approach relies on chemical cleavage of end-labeled DNA -- and depending on which end you label you can read either strand of a DNA fragment in either direction. George's genomic sequencing and multiplex sequencing also used chemical cleavage, and it turned out that the version of multiplex sequencing then being used probed the DNA in such a way that the reads came out 3'->5', and I had gotten the unreversed file.

I'm not the only person to fall into that trap. There was a burst of excitement over at the Harvard Mycoplasma sequencing project that a long true palindrome had been seen. In molecular biology the term palindrome is bent a bit to mean a sequence that reads the same forwards on one strand and back again on the other strand, but here was a sequence that actually read the same backwards and forwards on the same strand. Such a beastie hadn't been observed (I wonder if one has yet?), and would be a bit of a puzzle. A bit later: "Never mind". Someone had assembled a reversed and unreversed sequence, which were in reality the same thing.
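The two senses of "palindrome" above are easy to tell apart in code. The following is a minimal sketch with made-up function names; it also shows why gluing a read to its own reversed copy produces a "true" palindrome every time.

```python
# Illustrative sketch; function names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_biological_palindrome(seq):
    """Same 5'->3' reading on both strands, e.g. the EcoRI site GAATTC."""
    return seq.upper() == reverse_complement(seq)


def is_true_palindrome(seq):
    """Same reading forwards and backwards on a single strand."""
    s = seq.upper()
    return s == s[::-1]


def join_read_with_reversed_copy(read):
    """Mimic assembling a read with its unreversed (3'->5') twin."""
    return read + read[::-1]
```

Any output of `join_read_with_reversed_copy` passes `is_true_palindrome`, whatever the read was, which is the kind of artifact described in the Mycoplasma story.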

Some such mistakes got farther, much farther. In sequencing the E. coli genome the U. Wisconsin team would compare their results back to all E. coli sequences in GenBank. They came across one that didn't at all fit, at least not until they tried the reverse sequence, which fit perfectly.

One member of the near crop of next generation technologies is a bit different on this score. The sequencing-by-ligation approach from the Church lab, being commercialized by ABI, works with double-stranded DNA, and so you can read either way from a known region. But this isn't exactly reading in either direction, since it is double stranded DNA.

However, some of the distant concepts for DNA sequencing might really throw out the limitation, which has some interesting informatics implications. Many approaches such as nanopores or microscopic reading of DNA sequence do not use polymerases, except maybe to label the DNA. So these methods might be able to read single-stranded DNA in either direction -- and you might not even know which direction you are reading! For de-novo sequencing, this could make life interesting -- though if the read lengths are long enough, it will be much like my surprise in the Church lab -- if you don't find anything biological, try reading backwards.
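One crude way to "try reading backwards" is to score a read under both polarity assumptions and see which looks more like coding sequence. The sketch below uses the longest stop-free stretch as the score; that heuristic, the function names and the standard stop codons are assumptions for illustration, and a real pipeline would more likely rely on database matches or a gene finder.

```python
# Illustrative sketch; the scoring heuristic is an assumption.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_open_run(seq):
    """Longest stop-free stretch, in codons, over the three forward frames."""
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0
            else:
                run += 1
                best = max(best, run)
    return best


def both_strands_score(seq):
    """Best stop-free stretch on either strand of a 5'->3' sequence."""
    rc = seq.translate(COMPLEMENT)[::-1]
    return max(longest_open_run(seq), longest_open_run(rc))


def polarity_scores(read):
    """Score the read as written and under the 'it came out 3'->5'' guess."""
    read = read.upper()
    return {
        "as_written": both_strands_score(read),
        "reversed": both_strands_score(read[::-1]),
    }
```

Only two hypotheses need scoring, because the plain complement of a read is just the other strand of the reversed read. On short reads the two scores will often be too close to call, which matches the caveat above that this only works if read lengths are long enough.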


Anonymous said...

Hi Keith! There's nothing quite like a little egoboo to start one's week. Your mention of our discovery of the yrtne knaBneG sdrawkcab brought a smile to my morning.

But then I remembered having to call the lab that produced the entry in question ... I ended up talking with someone who was facing both serious health problems and a loss of research funding, and having to add my bad news on top of it all made me feel horrible.

Keith Robison said...

It's no fun being the bearer of bad news, and that certainly would really give one pause.

On the other hand, bad news early is better than bad news too late. I once came across a case where someone had cloned, sequenced, expressed, raised antibodies to and otherwise characterized a 'novel human protein', which was published in BBRC (or Biochimica et Biophysica Acta, one of those two). The 'human' cDNA in question was actually a Mycoplasma rRNA in reverse orientation!

Anonymous said...

Hi Keith,
I believe that there were some honest-to-goodness palindromic sequences found when the human Y chromosome was finally completely sequenced. There were eight massive palindromes comprising 25% of Y-chromosome male-specific euchromatin - six of the eight palindromes carry protein-coding genes. See Skaletsky et al, Nature, Vol. 423.