Wednesday, February 17, 2010

Anybody know some good bioinformatic programming problems?

I recently found out that I've received a summer undergraduate intern slot. I have a soft spot for summer internships -- my own was a great experience -- and the company runs a very nice program, with specific social and learning experiences for the cadre. Anyone interested in applying should do so through the company website (and not here!). I do promise not to fill this space with "can you believe what the intern did today?!?!?", though executing "sudo rm -r /" might earn a slot!

I'm still trying to sketch out a grand scheme for the internship, but it will certainly combine some data analysis with some programming. One person I've phone-screened has already asked for suggestions of programming problems to practice on. It's a great show of initiative, which I like, but one I discovered I wasn't really prepared for.

The challenge for me is to rewind my brain back to an early stage and remember what makes a good -- but doable -- problem. In my head, everything either seems too trivial or potentially discouragingly difficult. So, I'd be very interested in examples of programming challenges given to early programmers with a significant bioinformatics angle -- no bubble sorts or games of Wumpus!

I did find a couple of links with some examples: one from MIT and another from Duke (these links are really a level above). I'd love to find other examples -- and mostly don't care about the language used in the examples. I'm probably going to nudge my intern towards Java/Scala (leveraging BioJava as much as possible), if only to encourage me to put some more time in on my own retraining project.

So, any suggestions?


judowill said...

I'm actually a big fan of the Mathworks MATLAB programming competition. They keep an archived index of previous contests. There are numerous biologically inspired competitions like protein-folding and gene re-arrangements. The contests are designed so an adequate programmer can make a reasonable answer in a few hours. They come complete with test data, example programs, pretty visualizations, etc.

The only disadvantage is that you actually need MATLAB, but if it's a college student they probably have a site license.

Alan said...

HMMs of DNA sequences. Try to recognise coding sequence vs non-coding. Will learn about exons, introns, pseudogenes and HMM methods which are incredibly useful.

Should get something up and running quite quickly, but it will take at least a couple of weeks to walk it all the way through.

Ideal problem for a very bright CS undergrad that's done some algorithms modules.

But then this isn't for anyone pre-first year. I remember my first summer at Bell Labs they got me doing compilers and it was a bit beyond me then. Great experience though.
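To give a sense of what the "up and running" stage of the HMM exercise above might look like, here is a minimal sketch of Viterbi decoding for a two-state model (coding vs non-coding). Everything specific in it is an assumption made for illustration: the state names, the `viterbi` function, and especially the probabilities, which are hand-picked toy numbers (GC-rich "coding", AT-rich "noncoding") rather than anything estimated from real sequence. A real gene finder would need more states (exons, introns, codon position) and parameters trained on annotated data, which is where the couple of weeks would go.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical toy parameters, chosen by hand for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for a DNA string (A/C/G/T only)."""
    if not seq:
        return []

    # Best log-probability of any path ending in each state at position 0.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []  # one dict per position after the first

    for base in seq[1:]:
        new_scores = {}
        pointers = {}
        for s in STATES:
            # Pick the predecessor state that maximises the score of reaching s.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    sequence = "ATATATATGCGCGCGC"
    for base, state in zip(sequence, viterbi(sequence)):
        print(base, state)
```

With these toy numbers, the AT-rich first half of the example sequence is labelled noncoding and the GC-rich second half coding. Working in log space avoids numerical underflow on longer sequences, which is one of the first practical lessons the exercise teaches.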

Anonymous said...

Forgive me; this isn't exactly relevant (or perhaps it is!). We are looking to hire a computational biologist for our HiSeq facility.
Perhaps your intern is a bit too green, but if you know anyone, could you send him/her my way?
Seth Crosby
Washington University
scrosby at wustl dot edu