Thursday, December 13, 2018

An Unfortunate Master Class in Poor Plotting

I hope my admiration for Pacific Biosciences' intellectual acumen was clear in my post on the acquisition by Illumina, because now I'm going to be a rabid crab over a webinar they aired yesterday.  I take telling scientific stories seriously and an important part of telling such stories is displaying data well.  I'm a perfectionist in this department by intention, but not always by execution -- I'm constantly reanalyzing my plots and diagrams for errors and cringing when I find them.  The webinar is trying to extol the value of the latest developments in the SMRT platform, but the data graphs often actively fight against any understanding or excitement.
There's a running problem with the plots: if you display adjacent plots with the same type of data from two different datasets, then unless you have a damn good reason the axes must be scaled the same! If you are plotting them adjacently, then you probably want me to compare the two plots.  I might even want to compare the data even if you didn't intend it -- if you put plots next to each other you're inviting comparison.
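
For anyone who scripts their figures, here is a minimal sketch of one way to enforce that rule: compute a single set of axis limits across all the panels and apply it to each one. The function name `shared_axis_limits`, the `pad` parameter and the toy numbers are illustrative assumptions, not anything taken from the webinar.

```python
def shared_axis_limits(panels, pad=0.05):
    """Return ((xmin, xmax), (ymin, ymax)) covering every panel.

    panels: list of (xs, ys) pairs, one pair per side-by-side plot.
    pad:    fraction of the data range added as headroom on each side
            (hypothetical parameter, chosen for illustration).
    """
    xs = [x for panel_xs, _ in panels for x in panel_xs]
    ys = [y for _, panel_ys in panels for y in panel_ys]
    if not xs or not ys:
        raise ValueError("need at least one data point per axis")

    def padded(values):
        lo, hi = min(values), max(values)
        margin = (hi - lo) * pad
        return (lo - margin, hi + margin)

    return padded(xs), padded(ys)


# Toy data standing in for two read-length histograms (made-up values).
panel_a = ([0, 50_000, 140_000], [0, 900, 10])
panel_b = ([0, 50_000, 250_000], [0, 2_700, 30])

xlim, ylim = shared_axis_limits([panel_a, panel_b])
print("use for both panels:", xlim, ylim)
```

The resulting limits can then be handed to whatever plotting tool is in use. In matplotlib, for instance, `plt.subplots(1, 2, sharex=True, sharey=True)` gets the same effect without any manual bookkeeping.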

Here's an example of a good side-by-side plot from the webinar.

Two different human libraries run on the 8M SMRT cells.  X-axis and Y-axis scaled identically so we can easily compare the two libraries and see they are largely the same, though with a bit of quirk in the middle.
Now let's get to the rogues' gallery. Here's a three-graph comparison of libraries from different organisms run on the 8M SMRT cells.  The X-axes are the same -- but note that the Y's aren't remotely so -- maxing out at 1.75M for E. coli, 3.5M for B. subtilis and 2M for O. sativa. So if you want to compare the distributions, have fun rescaling everything!
But maybe you don't really care; the exact shapes of the distributions say something about the different DNA preps.  But how about trying to see the difference between 10 hour and 20 hour movies as with this slide?  This time the Y-axis is held constant, but the X axis ends at 140K for 10 hours and at 250K for 20 hours.  The audio commentary makes it clear that what you're supposed to take away is that it isn't worth running a library with this distribution for the longer movie -- but with the bad scaling it's really difficult to see what is gained by the longer instrument time.

Okay, enough of these plots of total data versus read length.  How about some plots of the read length distribution intended to show the advantage of using a single cell of the new chemistry rather than four cells of the old chemistry?  This time the X-axis is correctly fixed but the Y max varies.  Actually, for this plot I'd really rather have just curves rather than bars, as that would make it even easier to compare the curves -- well, if they can be clearly distinguished.  Since I have a very idiosyncratic color sensitivity I often struggle here, but with marker shapes one can make things clear.

Here's a truly egregious example -- the bottom plot represents three times as much data but you'd never be able to tell that, as the top plot maxes at 20K and the bottom at six times that! This Iso-Seq data also could really use some inset plots zooming in on that 40K-60K region; I'd really like to know what happened to that hump in the 2.1.  Is that the real size distribution of the mRNA population or not?

Okay, one last plot to beat on.  Yet again the sin is the X-axis (it's faint praise, but I couldn't find paired plots that differed in both X and Y axis limits). 

Okay, I'm done -- probably because I couldn't find more paired plots.  

There are many factors that can lead to poor plots -- rushed preparation, not running your slides by a naive audience in advance, etc.  So it takes awareness and discipline to avoid making these mistakes.  But you should also have strong incentives, starting with professional pride but also the desire to convince people -- and here convincing has sizable dollar signs associated with success.  And at least these are probably errors of inattention, not active crimes of execution such as perspective pie charts.


gasstationwithoutpumps said...

For that matter, most of the "paired plots" should have been done as two curves on a single plot. But perhaps their graph-making expertise is still at the 7th-grade level, and one plot per graph with autoset scales is all they can manage.

Anonymous said...

Very interesting. More importantly, 'polymerase G' is being reported, as are 'Polymerase Read Lengths'. This is not the same as what the end user can expect to align from the FastQ. Both the 'actual read length' (i.e. the length of the insert in the SMRT bells that the polymerase goes round and round on) and therefore the actual alignable G will be substantially less than the numbers shown on these slides. Very short inserts can have very long polymerase coverage, and so very high G numbers and very long 'polymerase read lengths'.