Wednesday, August 22, 2007

Clearing the bookshelf

My email box recently resembled the scene in the first Harry Potter book where the boy learns his true heritage. The torrent of messages did not arrive by owl, but it was from someone trying to reach me with important news: I was holding a heap of overdue library books.

Alas, I can't claim to have read them all. I don't get to the public library as often as I would like, but when I get there I tend to bring back a bunch. I'm a sucker for books in the rack or end-of-aisle displays, plus I tend to get a big cluster of books in one subject area to see which I like. Throw in the ability to request books from virtually anywhere at any time via the Internet, and it can really be feast-or-famine.

One of the books which was overdue was one I had to wait on, How Doctors Think by Jerome Groopman. This is a book everyone should take a stab at. First, it is an interesting analysis of how people think; while it is in a medical context, many of the pitfalls and strategies he explores are relevant everywhere. Second, most if not all of us will be patients at some time, or interested parties in the medical care of loved ones. By understanding the mental traps doctors can fall into, patients & patient advocates can better assist doctors in their care and recognize when the doctor is not a good match for the patient or the problem.

Groopman also comes across as a real mensch. He seems like the sort of person you'd try to grab at departmental tea or after a seminar -- and he'd actually speak with you. I certainly didn't agree with all his conclusions in the book, but I could see enjoying any discussion he might bring forth. He is also honest about when he has himself fallen into traps, such as his own arthritis coloring his early evaluation of COX-2 inhibitors (he wrote an article in a national lay magazine touting them as super aspirin).

Another overdue book was Rosalind Franklin: The Dark Lady of DNA. I started reading it on the supposition that most of what I knew about Franklin was from Watson's books, which seemed embarrassing. I later realized that I had also read Eighth Day of Creation, so some balance was already there. The book does a good job of laying out her many contributions in crystallography, why the time of the race for the double helix was completely awful for her, and the many challenges of being a Jewish woman scientist in English scientific labs of the 1940s and 1950s.

My one complaint with the book is that while it shows her famous diffraction photograph of DNA, the book (and probably every other one I've ever seen with the photo) lacks any of the prior photos for comparison. It would also be interesting to see the unpublished manuscript on the DNA structure that Aaron Klug later unearthed, to see how close she was to the solution when Watson & Crick scooped it away. On the point of how they did it, there is no extreme skullduggery discussed here: just a clueless Maurice Wilkins leaking the key data to an opportunistic Watson. It is also interesting to better understand the collaborations she had with each of W&C after the helix; it would seem that professionally she didn't see them as thieves of her glory.

One interesting speculation hit me early on and is discussed late in the book. Franklin's life was cut short by ovarian cancer. I hadn't realized she was from an Ashkenazi background, a heritage that is unfortunately at higher risk than other populations of carrying BRCA mutations. Alternatively, many who saw her work describe her as being particularly unworried by safety precautions around the X-ray beams, though to some degree this was common & their recollections may be colored by her outcome.

Alas, one book that went back unread was Invisible Frontiers, the story of the race to clone insulin. I read it as a senior in college, but it is really due for a re-read. One could imagine staying quite busy just reading biographies around the double helix: I'm really due to re-read Watson, Wilkins's autobiography is wait-listed, and Crick's autobiography somehow was in an earlier batch of books held (but not read) until overdue.

And then there are those owls; having now finished the last book in the series & read the first (and started the second) with my little wizard, the temptation is there to jump ahead and re-read the rest to better understand all the characters & threads woven in the last book. Alas, there still aren't any good clues to the genetics of Mugglery.

Tuesday, August 21, 2007

Personal breakthrough?

One of the contributing factors to a poor recent post frequency is an obsessive tackling of a particular problem at work, one that strayed into the borders of my programming competency. A complete solution is now coded; tomorrow I start trying to make it work.

Much of the programming in bioinformatics is pretty straightforward data slinging -- extract some data from a set of sources, cross-reference it, condense it, slice it, dice it, etc. Real algorithms are left to a small cadre of programmers working on, well, real algorithms.

Periodically, though, one is faced with dusting off some algorithmics. In this case, I realized my problem could be formulated as a graph-walking problem, though with some painful rules about walking the graph. One way to think about it (which only occurred to me now; it probably would have been a help earlier) is that the nodes come in different colors & there are rules for when you must or must not switch colors during the traversal. There's even another attribute (texture?) which has different alternation rules.

After figuring out the original graph idea, I started churning out code to tackle it. However, before long, I started struggling with the endgame of the algorithm -- I could set up a graph which would contain any valid solution, but I couldn't quite put together the code to pull out that solution. A sure sign that things were going south was that my classes & method signatures were becoming bloated, cluttered with lots of parameters & fields. On trying to get what I had running, memory blew up on me.

As is often the case, discussion with Miss Amanda suggested another approach. So I placed all the old code in a separate file & started on the new approach. I had figured out a clever way to reduce the memory requirements, both by a way to compress the representation of edges (because the problem results in many edges in the form S->E, S->E+1, S->E+2, etc) and an approach to avoid where I thought the memory pig had really gone hogging.
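To make the edge-compression idea concrete, here is a minimal sketch of one way it could be done. It is not the code from work: it assumes nodes are plain integers, and the names compress_edges and has_edge are made up for illustration. Each source node keeps a short list of ranges instead of one entry per edge.

```python
def compress_edges(edges):
    """Collapse (source, target) pairs into {source: [range, ...]}.

    Assumes nodes are integers, so a run of edges S->E, S->E+1, S->E+2
    can be stored as the single range(E, E+3).
    """
    by_source = {}
    for s, e in sorted(set(edges)):
        runs = by_source.setdefault(s, [])
        if runs and runs[-1].stop == e:
            # e extends the current run of consecutive targets
            runs[-1] = range(runs[-1].start, e + 1)
        else:
            # e starts a new run
            runs.append(range(e, e + 1))
    return by_source


def has_edge(compressed, s, e):
    """True if the edge s->e is present in the compressed form."""
    return any(e in run for run in compressed.get(s, ()))


if __name__ == "__main__":
    compact = compress_edges([(1, 5), (1, 6), (1, 7), (1, 10), (2, 3)])
    print(compact)
```

A range object stores only its endpoints, so a run of a thousand consecutive targets costs the same as a run of two.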

Not helping any of this was the memory of a pointed article by my personal programming guru on the problems with recursive code. Graph & tree walking are recursive problems, but can be solved with non-recursive coding. Particularly in languages which support custom iterators (such as C# & Python), the non-recursive solutions have significant advantages. But, some such solutions occurred easily to me & others just became more knots of ugly, unproductive code.
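For the curious, a non-recursive walk written as a Python generator might look something like the sketch below. The actual rules of my problem aren't spelled out here, so the color rule is a stand-in: may_step is a hypothetical function that says whether the walk may move between two colors, and walk_paths, edges and colors are likewise illustrative names.

```python
def walk_paths(edges, colors, start, goal, may_step):
    """Yield every simple path from start to goal, without recursion.

    edges:    {node: [neighbor, ...]}
    colors:   {node: color}
    may_step: hypothetical rule, may_step(prev_color, next_color) -> bool
    """
    # Each stack entry holds the path so far plus an iterator over the
    # untried neighbors of its last node; this replaces the call stack.
    stack = [([start], iter(edges.get(start, ())))]
    while stack:
        path, neighbors = stack[-1]
        nxt = next(neighbors, None)
        if nxt is None:
            stack.pop()  # neighbors exhausted: backtrack
            continue
        if nxt in path:
            continue  # keep paths simple (no repeated nodes)
        if not may_step(colors[path[-1]], colors[nxt]):
            continue  # the color rule forbids this step
        new_path = path + [nxt]
        if nxt == goal:
            yield new_path
        else:
            stack.append((new_path, iter(edges.get(nxt, ()))))


if __name__ == "__main__":
    edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    colors = {"a": "red", "b": "blue", "c": "red", "d": "red"}
    # Example rule: the color must switch on every step.
    for path in walk_paths(edges, colors, "a", "d", lambda p, n: p != n):
        print(path)
```

Because it is a generator, the caller can stop after the first acceptable path instead of building every path in memory.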

But again, the endgame started unnerving me. New classes & methods sprang up, but it wasn't clear if they were really moving me forward or simply putting me in a Red Queen setup. So, another walk with my assistant & another approach.

After a few more days' slog, that's the one that's ready to start testing. It feels good -- but nothing like what it will feel like if the thing actually WORKS!

Tuesday, August 14, 2007

King of the Migrators

I've been lucky enough lately to see a number of monarch butterflies -- or one of their imitators; I can't keep the mimics straight from monarchs, nor can I keep straight which kind of mimic they are. I enjoy seeing any butterflies, which is why it pains me that I see them so rarely in my own yard. Despite nearly zero pesticide use & plantings of all sorts of host and nectar plants, neither this house nor the previous one has seen many butterflies -- lots of dragonflies & bumblebees (and far too many mosquitos), but no butterflies.

Monarchs are amazing creatures on many scales (including their own scales!), but perhaps most amazing is their migration -- each year they schlep off to Mexico for the winter. Most striking is the fact that the monarchs which fly south for the winter clearly are homing in on a location they have never before visited -- it was their ancestors a few generations back who flew back. How they do this is still being worked out, but clearly the core of the guidance information must be inherited. Environmental triggers are apparently critical as well; the Wikipedia article notes that monarchs which have taken up residence in mild climes such as Bermuda do not migrate.

I once had an amazing monarch experience. We were going into the city one fall day, and I noted on one of the parkways a number of monarchs flitting across. While we waited for the Orange Line at Wellington, monarchs seemed to pass down the track at a rate of one every half minute or so. For once I didn't mind the long wait for a weekend train. Perhaps it should be rechristened the Orange&Black Line?

As I mused before, one interesting question is how structured these populations are. Are the monarchs I see this year mostly descendants of monarchs who summered here last year, or is everything scrambled? Of course, my solution to this is simple: sequence! With sequencing cheap, one could survey a lot of monarchs (perhaps from museum collections) to find a pool of polymorphisms, which could then be typed on even larger numbers of specimens using chips, directed sequencing or other SNP typing methods. One pleasant side-product would be a draft genome of the monarch.

Friday, August 10, 2007

Settling in

One of the reasons I got to peek in on 640 yesterday is that my branch of Codon Devices has moved to new quarters. Whereas before we were down near the east end of the Cambridge biotech zone at One Kendall Square, now I'm near the west edge, closer to Central Square.

I'm proud that I haven't gained much stuff since the last move, though the file box was nearly full this time, so things are creeping up.

One big change is that One Kendall Square had a lot of pricey but good restaurants nearby, and was also in range of the fleet of food trucks that park near the MIT Campus. The new site is in the borderlands between industrial Cambridge and residential Cambridge, with the result that there are only a few small pizza / sub shops in very close proximity. However, less than 10 minutes away is the culinary UN of Central and also 3 grocery stores (one standard one, a Trader Joe's, and Whole Foods), two of which have extensive salad bars.

The really big change is that I both have a roomy cube & am steps away from my laboratory collaborators -- most have offices/cubes on the same floor (which is all offices), and the lab is now just a single unbarricaded staircase away -- as well as the breakroom and the restrooms.

The new office space also has lots of desk space & lots of light, almost too much in the morning. My tender perennials and annual herbs have already lined up with applications for asylum; the faint aroma of basil & rosemary should brighten up those grey winter days!

Wednesday, August 08, 2007

Peeking in on the Old Homestead

I had the occasion to walk by 640 Memorial Drive, the building in which I spent half of my Millennium career. It's a grand old building with an interesting history.


640 was originally built by Henry Ford as an automobile assembly plant located close to a major market -- shipping cars from Michigan was proving troublesome and he wanted an alternative. To economize on land, he envisioned a semi-vertical assembly line -- the standard assembly line would be folded into a series of floors. Giant overhead cranes would lift parts and semi-completed assemblies between floors. The scheme proved impractical, and Ford later built a conventional assembly line over in Somerville. The building went through a number of industrial uses, including being a Polaroid camera assembly plant. It was apparently quite an eyesore in the late 80's, but by the time I first noticed it in the mid-90's it had been rehabbed very nicely. The huge bay once ranged by the cranes is now a soaring atrium & the site of the old railyard is parking.

When I interviewed at Millennium in 1996 they occupied the top 2 floors, and by the time I arrived a portion of the middle (3rd) floor had been taken, plus the mouse facility in the basement. Eventually, another major tenant in the building (who made medical alert bracelet systems) was enticed to vamoose, leaving only a single other tenant (a pathology lab).

Around the time I moved back into 640 in 1999 there was a huge effort to fit out all this space. But before a few years had passed, Millennium started its deflation and the parking lot started getting empty again. Eventually, everyone moved out, leaving Millennium with an empty building with a lot of lease left on it.

I peered in a few windows and was surprised to see more occupied than expected. I didn't have time to browse a lot, but while some 1st floor offices were clearly vacant, some of the space on the 2nd and 3rd floors was clearly occupied -- though I think my old haunt wasn't. I know there was at least recently some significant lab space vacant, as Codon took a look at it.

Millennium has, of course, been trying to unload the space ever since they moved out. Because it was lumped into restructuring costs, the space was absolutely off-limits -- even when a major power failure crippled the other buildings, 640 was not even seriously considered -- accounting rules are rules.

Which brings up a question. A major reason for vacating buildings was to save money, and even renting empty space is cheaper than having it occupied (light, heat, security, IT support, etc). But a huge chunk of the cost savings was supposed to come from subletting the space -- a story repeated with other facilities. I wonder how big the gap is (and how fast it is growing) between projected savings and actual ones. Perhaps it's buried in a financial statement somewhere, but it is certainly not a bit of forecasting anybody is going to be crowing about.

Too good to be true?

A recent GenomeWeb item stated (digested from a press release) that GATC Biotech in Germany is one of the first customers for the ABI SOLiD sequencing-by-ligation instrument. This machine will complement the Roche 454 FLX and Illumina/Solexa 1G which GATC already has in house, meaning that GATC has all three launched next-generation sequencing instruments.

The eyebrow-raiser in the press release is:
"the SOLiD™ System is expected to be installed in early autumn this year and will boost the company's current sequencing capacity from 130 gigabases to 250 gigabases a year."

Nearly doubling capacity with one SOLiD instrument in a shop that already has a 1G and an FLX? If that number is really the impact of the SOLiD, then ABI is taking a huge lead in total reads. Of course, actual performance may vary from projections. Even if that is the joint contribution of the 3 next-gen sequencers, it would underscore what an advance they are -- especially considering how much up-front sample preparation & management work can be jettisoned in comparison to feeding a conventional sequencer.

(Disclosure: my company may be in the market for such services, and I would probably be one of the decision makers in such a purchase.)

Monday, August 06, 2007

Pre-WWW Hyperlinking

I recently attempted to rhapsodize on the wonders of restriction endonucleases. My exploration of this area has also reacquainted me with an amazing invention, what I might argue is the first artifact of what we now call synthetic biology.

An important early use, still going strong, for restriction enzymes is the cutting-and-pasting of DNA sequences. An early vector which was heavily used was pBR322, and it was also one of the first DNA molecules to have its entire sequence determined. pBR322 was particularly useful because for certain popular restriction enzymes it contained only a single site and that site was not in a critical region. This facilitated cloning into that site.

However, only a few restriction enzymes fit this description. In addition, a common problem with cloning into plasmids was that of empty vector, in which the plasmid reseals without capturing a DNA of interest. A clever scheme emerged somewhere of cloning into a portion (the alpha peptide) of E. coli beta-galactosidase; if the plasmid captured an insert then beta-Gal function would be disrupted. This loss-of-function would show up as white colonies when the E. coli were grown on media containing synthetic compounds that turn blue when cleaved by beta-Gal.

It turns out that this alpha peptide will accept a significant insertion of amino acids, and somewhere the germ of the idea of a polylinker emerged. The polylinker would contain many unique restriction sites and also enable blue-white cloning. For what I believe is the first time, a human sat down and designed a specific & novel DNA sequence for a specific & novel purpose and had it synthesized. Previous DNA synthesis efforts, such as the original effort by Har Gobind Khorana to make a tRNA or the synthesis of an artificial human hormone gene at UCSF, were intended to make something already extant in nature. The first polylinker was perhaps the first creative work of DNA!

That original polylinker had a mirror-symmetry and just 4 cloning sites, with the fold preventing using pairs of sites. Not long afterwards came the pUC polylinkers, which have each site represented only once and a very dense packing of sites. These have been propagated to many other vectors.

I've seen other polylinkers, but none seem to have the popularity of the pUC polylinkers. Shown is the pUC18 polylinker; one additional twist is that this sequence reads through (no stop codons) in either direction; pUC19 simply has the polylinker in the opposite orientation.

CAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCGT

Two pedagogic angles occur to me. For any biology class, it would be fun to follow up the session on restriction enzymes by handing each student the pUC polylinker sequence. The assignment is to find as many six- or eight-basepair palindromes as possible. The other interesting assignment would be for an advanced bioinformatics class: write a program to take a set of restriction enzymes and build a polylinker with them, with shorter outputs scoring higher and bidirectionality scoring higher. Such an exercise will really underline the achievement of the pUC design, which I believe was done with pencil-and-paper, not by computer program.
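For anyone who would rather check the first assignment by machine, here is a minimal sketch. It takes "palindrome" in the restriction-enzyme sense, meaning a stretch that equals its own reverse complement; the function names are made up for illustration, and the sequence is the pUC18 polylinker exactly as shown above.

```python
POLYLINKER = "CAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCGT"

# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq, lengths=(6, 8)):
    """Return (offset, word) for every window equal to its reverse complement.

    Offsets are 0-based positions in seq.
    """
    hits = []
    for k in lengths:
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if word == reverse_complement(word):
                hits.append((i, word))
    return hits


if __name__ == "__main__":
    for offset, word in find_palindromes(POLYLINKER):
        print(offset, word)
```

Students can then look up which enzyme cuts each hit; the scan itself knows nothing about enzymes, only about symmetry.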

Wednesday, August 01, 2007

If you build it, they will come

At a game last night of the local minor league nine we got a chance to see an amazing bit of nature -- though I suspect I was in the minority marveling at it rather than being annoyed (or exhibiting gleeful sadistic destruction). The amazing sight was easily millions, perhaps tens of millions, of mayflies swarming the field. Many compared the sight to a snowstorm, with observers present the previous night comparing those conditions to a blizzard. Later, when our bleachers section had largely cleared out, I could actually hear a buzzing noise from thousands of gossamer wings hitting the aluminum bleachers.

Kevin Costner needed to build his diamond in a cornfield & start playing the game, but these mayflies were simply confused by the high intensity lights being so close to their home -- home run balls splash in one of the rivers that powered the U.S.'s Industrial Revolution.

I never learned to fly fish, and so don't really know my hatches. Indeed, if I knew the right tied fly to use it would probably make identifying the critter via Google quicker. But thanks to bugguide.net I can specify it as a white mayfly, though I remember the wings being less translucent than in the image.

Hatches like these are probably largely synchronized by environmental cues occurring after the appropriate larval development is complete. What I've found particularly striking are the insects whose development is on a long multi-year clock. Seventeen-year 'locusts' (actually cicadas) are the classic example, and a memorable one for me -- I worked at a summer camp during the largest cohort's year and the constant hum in the woods was unforgettable. You went to sleep with it, woke up with it, ate with it, worked with it -- nowhere there could it be escaped, except by swimming underwater in the pool. The creatures were thick -- and often flew into you.

The thing I've wondered for a number of years now: how accurate are their clocks? If I took one million 17-year larvae and could somehow tag them, what would be the pattern of their emergence? What fraction would emerge 17 years later, and how many would show up 1 or 2 years early or 1 or 2 years late? Obviously, the graduate thesis project from hell. But the question is interesting. For example, if the clocks were sufficiently accurate, then each of the 17 cohorts would be effectively reproductively isolated from the other 16, meaning they would be approaching a state of being 17 different species!

A more practical experiment, which I am unaware of being executed (though I am hardly a strong watcher of the cicada literature), would be to ask how genetically isolated each cohort is from the others. By isolating a lot of members of each cohort and typing a large number of polymorphic markers, one could estimate the amount of gene flow between years. This could be done on stored samples, making it a practical project.

Or, to imagine another context, consider the standard story on Pacific salmon: when the coho's thoughts turn to love, they swim back to the exact place of their birth. Presumably this tale is supported by tag-and-release studies, but at what sample size? What error rate could be detected? How often does a chinook become confused and go up the wrong stream? Again, if the simple model of near perfect birthplace location is correct, then each salmon stream's population is reproductively isolated.

In either case, perfection is dubious. Biological systems are amazing, but noise happens & mutations occur. Keeping a biological oscillator going for 17 years straight is truly incredible, but some of these metronomes must occasionally skip a beat. The existence of 17 different populations of 17-year cicadas suggests that alone: one original population bled over into the others. The other evolutionary alternative is that the 17-year period was selected multiple times from the proto-cicada population due to its useful properties -- a long, prime number period minimizes the chance of synchronizing with the population of a predator with a periodic population.
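The prime-period argument is easy to play with numerically. The sketch below is only a toy model under loud assumptions: predators peak on a fixed cycle of 2 to 9 years (a range picked purely for illustration), both clocks start in the same year, and neither ever slips. The function names are made up.

```python
from math import gcd


def years_between_coincidences(cicada_period, predator_period):
    """Years between emergences that land on a predator peak (the LCM)."""
    return cicada_period * predator_period // gcd(cicada_period, predator_period)


def mean_overlap(cicada_period, predator_periods=range(2, 10)):
    """Average fraction of emergences that coincide with a predator peak.

    For one predator cycle p, emergences coincide every lcm(c, p) years,
    so the fraction of emergences hit is c / lcm(c, p) = gcd(c, p) / p.
    The 2..9 year predator cycles are a hypothetical assumption.
    """
    periods = list(predator_periods)
    return sum(gcd(cicada_period, p) / p for p in periods) / len(periods)


if __name__ == "__main__":
    for period in range(12, 19):
        print(period, round(mean_overlap(period), 3))
```

In this toy model the prime periods (13 and 17) score lowest, since they share no factor with any shorter predator cycle, while a 16- or 18-year clock would meet a 4- or 6-year predator peak at every single emergence.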

The 'snowstorm' we witnessed was really quite harmless to the hominids, but clearly a disaster for the white mayflies. Even without the sadistic kids pounding them into the floor, the vast majority of female flies who entered the stadium the other night would die without having any opportunity to lay their eggs back in the river. So a new threat with a periodic occurrence has entered the insect world: the schedule of night games in Single A ball.