Also, this isn't meant to be a high-and-mighty, spotless expert heaping calumny on the great unwashed masses. If I look down at my metaphorical foot I find many tightly spaced patterns of scars, sometimes nearly concentric. We all make mistakes, and often we repeat those of the past. We think we've covered bases that have always been covered, or deceive ourselves that safety mechanisms which were needed in the past are no longer necessary.
A while ago at work I was exploring a standard vector backbone and became curious just how taxonomically widespread pieces of the backbone might be in nature. So I pumped the sequence into the NCBI BLASTN server & pointed it at the RefSeq genomes. As expected, a bunch of bacterial plasmids popped up. What was unsettling, though, was a bunch of provisional genomic RefSeqs for eukaryotic chromosomes. Indeed, one project had apparently deposited every chromosome with a pUC-type vector sequence at one end. YIKES!
The other day I got curious again & tried searching the non-redundant DNA and protein databases but with the species filter set to eukaryote. Again, a bunch of hits -- and the shocking part was many were very recently deposited sequences -- even human ones. In some cases, the entire deposited sequence was vector-derived (e.g. the non-human "putative reverse transcriptases" ABK60177.1, CAD59768.1, CAD59767.1 & CAL37000.1).
For example, AK302803.1 is a 1352-nucleotide sequence deposited in 2008; from position 888 onward it is clearly vector -- and the coding region is annotated as 1 to 1275! CAH85743 is a "Plasmodium" protein which is entirely vector-derived; again deposited in 2008. PIR (is anybody still curating this?) has a number of vector-derived proteins (e.g. the 231 amino acid "NZ-3 antigen" JC7702; S. pombe beta-lactamase (!) T51301). I was surprised to even find a SwissProt entry that looks like it has pUC-derived sequence:
>sp|Q63661.2|MUC4_RAT RecName: Full=Mucin-4; Short=MUC-4; AltName: Full=Pancreatic
adenocarcinoma mucin; AltName: Full=Testis mucin; AltName: Full=Ascites
sialoglycoprotein; Short=ASGP; AltName: Full=Sialomucin
complex; AltName: Full=Pre-sialomucin complex; Short=pSMC;
Contains: RecName: Full=Mucin-4 alpha chain; AltName:
Full=Ascites sialoglycoprotein 1; Short=ASGP-1; Contains: RecName:
Full=Mucin-4 beta chain; AltName: Full=Ascites sialoglycoprotein
2; Short=ASGP-2; Flags: Precursor
GENE ID: 303887 Muc4 | mucin 4, cell surface associated [Rattus norvegicus]
(Over 10 PubMed links)
Score = 46.6 bits (109), Expect = 0.006
Identities = 22/35 (62%), Positives = 25/35 (71%), Gaps = 3/35 (8%)
Frame = -3
pUC19  1427  CCLQTKKPPLPAVVCLPDQELPTLFPKVTGFSRAQ  1323
             CCLQTKKPPLPAVVCLPD   P+  P +   S+ Q
Sbjct  1051  CCLQTKKPPLPAVVCLPD---PSSVPSLMHSSKPQ  1082
Even the RefSeq mRNA section has some very provisional mammalian predicted cDNAs (from chimp) which appear to be polylinker-type sequences from vector (selected restriction sites are marked):
                        =BamHI      =SalI=      =PaeI
pUC19            415  GGGGATCCTCTAGAGTCGACCTGCAGGCATG  444
XM_001160101.1    56  GGGGATCCTCTAGAGTCGACCTGCAGGCAT   85
XM_001146903.1   439    GGATCCTCTAGAGTCGACCTGCAGGCATG  467
XM_001141474.1  1503   GGGATCCTCTAGAGTCGACCTGCAGGCA    1530
XM_001141395.1   922   GGGATCCTCTAGAGTCGACCTGCAGGCA    949
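If you want to mark sites like these yourself rather than eyeball them, a few lines of Python will do. This is only a minimal sketch: the function name `find_sites` and the `SITES` table are my own illustrative choices, it matches exact recognition sequences only, and it looks at one strand (which is sufficient here because all three sites are palindromic).

```python
# Recognition sequences for the three enzymes marked in the alignment.
SITES = {"BamHI": "GGATCC", "SalI": "GTCGAC", "PaeI": "GCATGC"}

def find_sites(seq, sites=SITES):
    """Return {enzyme: [0-based offsets]} for every exact site match in seq."""
    seq = seq.upper()
    hits = {}
    for enzyme, site in sites.items():
        offsets = []
        start = seq.find(site)
        while start != -1:
            offsets.append(start)
            # Step one base at a time so overlapping matches are not missed.
            start = seq.find(site, start + 1)
        hits[enzyme] = offsets
    return hits

# The pUC19 polylinker excerpt shown in the alignment.
polylinker = "GGGGATCCTCTAGAGTCGACCTGCAGGCATG"
hits = find_sites(polylinker)
for enzyme, offsets in hits.items():
    print(enzyme, offsets)
```

On the excerpt as pasted, BamHI and SalI are found but PaeI is not, because the excerpt stops one base short of the full GCATGC site. A finding like this in a "chimp mRNA" is a strong hint that you are looking at a cloning vector, not a primate.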
Contamination of various sorts has plagued genome projects from the get-go. Perhaps the most notorious was a large deposition of human ESTs which were donated to the public with great fanfare (as a counterpoint to private EST efforts), only to be found later to be rich in yeast sequences. The solution is to run filters -- search everything you do against vectors, E. coli and other common contaminants. In addition, especially in this day and age, if your "human" mRNA sequence doesn't match the genome, you've got some 'splaining to do.
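To make "run filters" concrete, here is a toy version of such a screen. It is a sketch under loud assumptions: it uses exact shared k-mers rather than a real alignment, the names `build_index` and `screen` are hypothetical, and the "contaminant database" is just the short polylinker excerpt from above. A production pipeline would run a proper similarity search against a full vector and host collection instead.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def build_index(contaminants, k=12):
    """Map every k-mer (both strands) of each contaminant to its name."""
    index = {}
    for name, seq in contaminants.items():
        for strand in (seq.upper(), revcomp(seq).upper()):
            for i in range(len(strand) - k + 1):
                index.setdefault(strand[i:i + k], name)
    return index

def screen(seq, index, k=12):
    """Return [(position, contaminant)] for each k-mer of seq found in index."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - k + 1):
        name = index.get(seq[i:i + k])
        if name is not None:
            hits.append((i, name))
    return hits

# Toy contaminant set: only the polylinker excerpt, not a real vector database.
index = build_index({"pUC19_polylinker": "GGGGATCCTCTAGAGTCGACCTGCAGGCATG"})

insert = "ATGGCCAAGTTGACCAGTGCCGTTCCGGTA"        # made-up "genuine" sequence
chimera = insert + "GGATCCTCTAGAGTCGACCTGCAGG"   # same insert with vector tail

hits = screen(chimera, index)
if hits:
    print("vector-like sequence starts at position", hits[0][0])
print("clean insert hits:", screen(insert, index))
```

The position of the first hit gives a rough idea of where the insert ends and the vector begins, which is exactly the information needed to trim a record like AK302803.1 before depositing it.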
What's the harm? Well, when it comes to databases I don't like mess. You always need to check your data, but it's a nuisance when you actually have to clean it a bunch. Miss something, and some experiment is dirty or, worse, ruined. Plus, and this is a bit of the theme to my proto-post, some folks haven't yet figured this out & the results are truly ugly. Even worse, these are the obvious problems, since bacterial vectors in a eukaryotic sequence truly stick out. Now I'm wondering about all the pUC-like sequences I found in bacterial sources -- can I trust them either?
So, let's all make an it's-still-a-pretty-new-year resolution to recheck our sequencing pipelines. Deliberately throw pUC19 and the E. coli genome through them & see what comes out.
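One way to turn that resolution into a habit is a spike-in test: mix known contaminants into the input, run the pipeline, and complain if any of them come out the other side. The sketch below assumes a pipeline can be modelled as a function from `{name: sequence}` to `{name: sequence}`; that interface, the `SPIKE|` tag, and both toy pipelines are hypothetical stand-ins, not anybody's real pipeline.

```python
SPIKE_PREFIX = "SPIKE|"  # hypothetical tag for recognising controls in the output

def spike_in_check(pipeline, records, controls):
    """Run pipeline on records plus tagged controls; return controls that leaked."""
    mixed = dict(records)
    for name, seq in controls.items():
        mixed[SPIKE_PREFIX + name] = seq
    survivors = pipeline(mixed)
    return sorted(name for name in survivors if name.startswith(SPIKE_PREFIX))

def naive_pipeline(records):
    """Toy pipeline: keeps anything at least 20 bases long, no vector screen."""
    return {n: s for n, s in records.items() if len(s) >= 20}

def screening_pipeline(records):
    """Toy pipeline: as above, but drops records containing a polylinker motif."""
    motif = "GGATCCTCTAGAGTCGAC"  # BamHI-XbaI-SalI stretch of the pUC19 polylinker
    return {n: s for n, s in naive_pipeline(records).items() if motif not in s}

records = {"clone_1": "ATGGCCAAGTTGACCAGTGCCGTTCCGGTA"}  # made-up sequence
controls = {"pUC19_polylinker": "GGGGATCCTCTAGAGTCGACCTGCAGGCATG"}

leaked_naive = spike_in_check(naive_pipeline, records, controls)
leaked_screened = spike_in_check(screening_pipeline, records, controls)
print("naive pipeline leaked:", leaked_naive)
print("screening pipeline leaked:", leaked_screened)
```

In real use the controls would be the full pUC19 sequence and chunks of the E. coli genome, and a non-empty "leaked" list would be a reason to stop and fix the pipeline before anything gets deposited.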