Wednesday, February 17, 2010

Anybody know some good bioinformatic programming problems?

I recently found out that I've received a summer undergraduate intern slot. I have a soft spot for summer internships -- my own was a great experience -- and the company runs a very nice program, with specific social and learning experiences for the cadre. Anyone interested in applying should do so through the company website (and not here!). I do promise not to fill this space with "can you believe what the intern did today?!?!?", though executing "sudo rm -r /" might earn a slot!

I'm still trying to sketch out a grand scheme for the internship, but it will certainly combine some data analysis with some programming. One person I've phone-screened has already asked for suggestions of programming problems to practice on. It's a great show of initiative, which I like, but I discovered I wasn't really prepared for the question.

The challenge for me is to rewind my brain back to an early stage and remember what makes a good -- but doable -- problem. In my head, everything either seems too trivial or potentially discouragingly difficult. So, I'd be very interested in examples of programming challenges given to early programmers with a significant bioinformatics angle -- no bubble sorts or games of Wumpus!

I did find a couple of links with some examples: one from MIT and another from Duke (though these are really a level above what I'm looking for). I'd love to find other examples -- and mostly don't care about the language used in the examples. I'm probably going to nudge my intern towards Java/Scala (leveraging BioJava as much as possible), if only to encourage me to put some more time in on my own retraining project.

So, any suggestions?

3 comments:

judowill said...

I'm actually a big fan of the Mathworks MATLAB programming competition. They keep an archived index of previous contests. There are numerous biologically inspired competitions like protein-folding and gene re-arrangements. The contests are designed so an adequate programmer can make a reasonable answer in a few hours. They come complete with test data, example programs, pretty visualizations, etc.

The only disadvantage is that you actually need MATLAB, but if it's a college student they probably have a site license.

Alan said...

HMMs of DNA sequences. Try to recognise coding sequence vs non-coding. Will learn about exons, introns, pseudogenes and HMM methods which are incredibly useful.

Should get something up and running quite quickly, but it will take at least a couple of weeks to walk it all the way through.

Ideal problem for a very bright CS undergrad that's done some algorithms modules.

But then this isn't for anyone pre-first year. I remember my first summer at Bell Labs they got me doing compilers and it was a bit beyond me then. Great experience though.
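To give a feel for the scale of the HMM suggestion above, here is a minimal sketch of Viterbi decoding with a two-state model, written in Python only because the post says the language doesn't much matter. Everything specific in it is an assumption made for illustration: the state names, the idea that "coding" sequence is simply GC-rich, and all of the probabilities, which are made up rather than trained on real data. A real gene finder would model codon structure, splice sites and much more; this only shows the core dynamic-programming step a student would start from.

```python
import math

STATES = ("coding", "noncoding")

# Illustrative, made-up parameters (assumptions, not trained on real data).
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
# Crude assumption: "coding" sequence is GC-rich, "noncoding" is AT-rich.
EMIT = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(seq):
    """Return the most probable state path for seq, using log-space Viterbi."""
    if not seq:
        return []
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state to recover the full path
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    sequence = "GCGCGCGCGCATATATATAT"
    for base, state in zip(sequence, viterbi(sequence)):
        print(base, state)
```

The natural follow-on exercises are the ones that would fill the "couple of weeks": estimating the probabilities from annotated sequence instead of hard-coding them, and adding states for exons, introns and codon positions.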

Anonymous said...

Forgive me; this isn't exactly relevant (or perhaps it is!). We are looking to hire a computational biologist for our HiSeq facility. http://seqanswers.com/forums/showthread.php?p=12004#post12004
Perhaps your intern is a bit too green, but if you know anyone, could you send him/her my way?
Seth Crosby
Washington University
scrosby at wustl dot edu