Tuesday, July 13, 2010

There are 2 styles of Excel reports: Mine & Wrong

A key discovery which is made by many programmers, both inside and outside bioinformatics, is that Microsoft Excel is very useful as a general framework for reporting to users. Unfortunately, many developers don't get beyond that discovery to think about how to use this to best advantage. I've developed some pretty strong opinions on this, which have been repeatedly tested recently by various files I've been sent. I've also used this mechanism repeatedly, with some Codon reports for which I am guilty of excessive pride.

An overriding principle for me is that I am probably going to use any report in Excel as a starting point for further analysis, not an endpoint. I'm going to do further work in Excel or import it into Spotfire (my preference) or JMP or R or another fine tool. Unfortunately, there are a lot of practices which frustrate this.

First, as much data as possible should be packed into as few tabs as practical. Unless you have a very good reason, don't put data formatted the same way into multiple files or multiple tabs. I recently got some sequencing results from a vendor and there was one file per amplicon per sample. I want one file per total project!

Second, the column headers need to be ready for import. That means a single row of column headers and every column has a specific and unique header. Yes, for viewing it sometimes looks better to have multiple rows and use cell fusing and other tricks to minimize repetition -- but for import this is a disaster either guaranteed or likely to happen.

Third, every row needs to tell as complete a story as possible. Again, don't go fusing cells! It looks good, but nobody downstream can tell that the second row really repeats the first N cells of the row above (because they are fused).

Fourth, don't worry about extra rows. One tool I use for analysis of Sanger data spits out a single row per sample with N columns, one column for each mutation. This is not a good format! Similarly, think very carefully before packing a lot into a single cell -- Excel is terrible for parsing that back out. Don't be afraid to create lots of columns & rows -- Excel is much better at hiding, filtering or consolidating than it is at parsing or expanding.

Finally, color or font coding can be useful -- but use it carefully and generally redundantly. Ignoring the careful part means generating confusing "angry fruit salad" displays (and never EVER make text blink in a report or slide!!!).

Follow these simple rules and you can make reports which are springboards for further exploration. It's also a good start to thinking about using Excel as a simple front end to SQL databases.

So what was so great about my Codon reports? Well, I had figured out how to generate the XML to handle a lot of nice features of the sort I've discussed above. The report had multiple tabs, each giving a different view or summary of the data. The top tab did break my rules -- it was a purely summary table & was not formatted for input into other tools (though now I'm feeling guilty about that; perhaps I should have had another tab with it properly formatted). But each additional tab stuck to the rules. All of them had AutoFilter already turned on and had carefully chosen highlighting when useful -- using a combination of cell color and text highlighting to emphasize key cells. Furthermore, it also hewed to my absolute dictum "Sequences must always be in a fixed width font!". I didn't have it automatically generate Pivot Tables; perhaps eventually I would have gotten there.


marie said...

I would add that if you like Excel, but work with people who don't use it, there is the wonderfully handy feature of saving as a .csv file. This allows easy portability, and supports much larger data structures.

Anonymous said...

While not what it was designed for i actually use Microsofts LogParser a lot to fix format problems in .csv files, and to import and export data between SQL, csv and excel because it supports all the data formats.

Plus it has a SQL-like syntax that is pretty easy when you already know SQL. Unfortunately it is not complete (no JOINs) and it doesn't look like they are developing on it anymore.

Grahm said...

I agree with Marie. CSV's come in super handy.

Excel Reports