Wednesday, October 19, 2022

Better Than FizzBuzz: First Bioinformatics Problem in an Erratic Series of Indeterminant Length

Periodically in my work or during writing this blog I come across computational problems that have the aspects of making, at least in my mind, very good teaching problems.  Some of the characteristics are that the basic problem is relatively simple to explain, the skills required are reusable on other problems, the concepts are germane to other problems and that the posed problem can be expanded in steps to something much richer.  Such problems might even be the nucleus of undergraduate or even high school bioinformatics projects, though with the recent news of a high schooler sequencing his dead pet angelfish's genome the bar for high school projects has leapt a few notches! In contrast to a programming problem that doesn't fit these, I'm going to tag such posts as "Better than FizzBuzz".
What is FizzBuzz?  It's a popular early problem in coding classes that I've been wrestling with my feelings about it.  How much do I dislike it because it is new and not what I tangled with eons ago when I was learning?  Or are my objections more principled in that it is too limited in scope?  Or maybe its because of the wave of sadness that swept over me when I discovered FizzBuzz was the only project in a candidate's GitHub.

Anyway, if you google you'll find tons of descriptions.  But the short version is given a series of integers, apply certain rules so that under one condition the program prints "Fizz" and another it prints "Buzz". It's a simple exercise in writing chained conditional statements and that is of course a key early skill.  But that's pretty much it.  Once you've solved FizzBuzz, there aren't many more options for further work other than more conditions, doing it in some truly baroque way or perhaps writing a FizzBuzz interpreter that can take a file of rules and apply it.

For me, interesting problems have some variety in the solutions where that variety isn't just making things baroque but applying different algorithmic approaches or implementations.   Or perhaps something that you use as a key step in learning an additional language.  Something that might provide some interesting and useful conversation during a job interview.

On the job interview front, we've used a simple problem as a conversation starter - we give it to the candidates in advance to try to solve and it isn't intended to trip people up but rather to give a common problem where we can explore how the candidate approached the problem and what tool choices they made.  A friend who is interviewing showed me another company's problem and I really liked it -- but it wouldn't be fair of me to mention it here.  But it was exploring different aspects than ours -- ours was a basic build-a-pipeline exercise and theirs more around how to solve an interesting computational puzzle.  That distinction was also appropriate for each position, at least how I envision ours and what the job description was for the others.

I'll confess that the problem I'll describe here isn't my favorite one; that's one reason for it to lead off since it might be a letdown after the others.  But it is useful and I think the biology it explores might pique some interest as well in a young student.  And perhaps its simplicity is advantageous for the first problem to throw out.

It's a basic problem in amino acid frequencies.  Given a file of protein sequences in a standard format, determine the following:
1. For each amino acid, the longest protein which lacks that amino acid
2. Like #1, but with only 1 example of that amino acid
3. What is the shortest protein with all twenty amino acids?
4. For each amino acid, what protein has the greatest fractional abundance of that amino acid?

Reflecting on my own career, this is a problem I would have solved one way (lots of dictionaries) early in my career and these days would solve very differently (a few dictionaries but then all the numbers in dataframes).  It could even be a simple exercise in building a relational database.

This could be extended in a number of ways.  Counting pairs of adjacent amino acids is one way.  Parsing annotation fields out of Uniprot or similar might be another -- for example how do the answers change with extracellular proteins.  Or even running this on a number of genomes to see how different the results are.

I have three more such problems in mind, two of which require only writing up (one requires a handful of files to be uploaded to GitHub)


Kotofos said...

So, where one can get that file of protein sequences in a standard format?

Keith Robison said...

You’re the first to ask! I should have cooked an interesting one up

Until I do, try the fsa.gz files found at