Here is the recipe, and it is simple beyond belief. Most states have a game where you pick three digits (i.e. a number between 0 and 999), with a nightly drawing for those three digits, usually televised. Before the drawing, go to your local lottery outlet. Make some friendly banter with the person at the counter, then loudly announce to everyone present that you are about to purchase a winning ticket. Now proceed to buy 1000 tickets, covering the complete sequence of values from 000 to 999. Then sit back, tune into the nightly broadcast, and enjoy your good fortune.
Wait? What's that? You wanted to actually win MONEY? Well, since such games pay off at best at 50 cents on the dollar, you are now out at least $500. But you did have a winning ticket.
Many people can see through the scheme before it is executed, but it surprising how often papers are published which fall prey to exactly the same problem. In statistics it is known at the Multiple Test problem, and dealing with it are various attempts at Multiple Test Correction.
Suppose I have a typical human microarray experiment using the Affymetrix U133 chip. I have just bought about 47,000 lottery tickets -- er, I mean probesets. Given that the number of independent samples you have are much, much fewer than this, there is a probability that you will see 'regulated' expression between your two sample populations purely by chance. Farther downstream, you take your gene list derived from your data and throw it against GO or some other set of other gene lists. You've again bought thousands of lottery tickets.
Now this is not to say these analyses are worthless; array experiments are not lotteries -- if you are careful. Good software will incorporate multiple test corrections that make your P-values less impressive sounding to account for running so many tests, just like the state lottery doesn't pay 1:1 on lottery tickets. But be careful when you roll your own stuff, or if you don't pay attention to the correction options in someone else's package. Plus, even with multiple test correction you still have junk in your results -- but you now have an estimate of how much junk.
The simplest multiple test correction is the Bonferroni correction. You simply multiply your raw P-values by the number of tests you've run. So if you ran your gene lists against 1000 categories, your impressive 0.0005 P-value goes to a quite uninteresting 0.5. Bonferroni is brutal to your results, but is a very conservative correction. It is assuming independence of your tests, which in some cases is correct -- but more often not in a genomics experiment.
However, what if your tests really aren't independent? Some of those GO categories are supersets of others, or at least strongly overlap. Some genes are strongly co-regulated no matter what the situation. Whenever there is correlative structure in the data, Bonferroni is potentially overkill.
There are a large number of other methods out there, and new ones still seem to pop up along with new terms such as Family Wise Error Rate (FWER) and False Discovery Rate (FDR).
I'm no statistician, but I've actually coded up one FDR estimation approach: Benjamini-Hochberg. Here's the recipe:
- Pick your alpha cutoff (0.05 is common)
- Sort your P-values and rank them, smallest to largest.
- Calculate a new cutoff for each test, with the cutoff equal to alpha x r/t, where r is the rank and t is the number of test you ran. So the cutoff for the top P-value is alpha/t (r=1) and for the worst is alpha (because r=t).
- Starting at rank 1, go down your list until you find a P-value that is greater than its cutoff. These are your significant hits -- you don't look below there. Because you start at the top & iterate down, it is known as a step-down procedure.
Benjamini-Hochberg has an intuitive appeal: your best value is compared against the most stringent cutoff, but after all it is your best P-value. The second one won't be quite so good, but it goes against a slightly harder standard since you've already looked at the best test. By the time you get to your last test, you are at your original alpha -- but you are now looking at your worst P-value.
A final thought. The alpha cutoff for most drug trials is 0.05, or you expect to see such results by chance about 1 in 20 times. Some trials do much better than this, but there are plenty of trials published which are right in this neighborhood. Exercise for the student: explain how this is different than buying multiple lottery tickets.