Friday, February 12, 2016

#AGBT16 Day 2: How is AGBT On Twitter Like Sequence Assembly?

I spent a bunch of time yesterday going through the Tweets from AGBT.  For me personally it is a useful exercise, plus I'll have it as a resource to go back to for future posts.  But the time and pain involved definitely had me sometimes questioning the wisdom of attempting this.
Previously, I was bemoaning some limits on Storify and wondering about Chirpstory. I spent some time during the commute yesterday exploring some of my options.  For example, it occurred to me that possibly my iPad woes with Storify could be Safari-specific, so I installed Chrome and tried it out.  Voila! Successful drag-and-drop! Of exactly one thing -- and then back to not being able to do anything. I also thought maybe I could just pull everything into the story and then weed in the iPad -- equally unsuccessful.

I also took a look at Chirpstory.  It's definitely aiming for a different experience, optimized for smartphones and tablets but with a much lesser degree of functionality.  Works great on the iPad, but that is largely by not having a drag-and-drop interface and by having a few useful high-level actions (such as delete duplicates, which Storify doesn't seem to have).  Sometimes I could see this working, other times it would be a nightmare.  For example, last night I invested a lot of time disentangling tweets from the concurrent sessions.  That's a bit slow in Storify with drag-and-drop; on Chirpstory you move things by little 'up' and 'down' arrows, which would turn this into a manual bubblesort from hell.

Getting back to my subtitle, disentangling tweets wasn't easy.  First, sometimes Storify and Twitter would conspire to fail at pulling down everything with the #agbt16 hashtag.  I experimented with searching on specific talks, using the hashtag in combination with either an individual's last name or their initials.  While nobody with a name that is a common English word (like This) was speaking, there were several sources of confusion, all understandable in the chaos of trying to live tweet a meeting.

(1) There was a substitution in the speaker list of Eddy Rubin for Felicity Jones, which unfortunately didn't get explicitly stated in the tweet stream.  (2) A number of tweets used mutated versions of initials or names -- first one correct and last one wrong or vice versa, such as those listing Eddy Jones as a speaker.  (3) Sometimes items were attributed to the wrong speaker.  (4) While some speakers had ten or so active live tweeters, others had little if any coverage, and it was unclear whether they didn't speak or just nobody tweeted them.  (5) Two different speakers with the initials SL were speaking simultaneously during the evening concurrent sessions.  (6) Some tweets had neither initials nor names attributing them to a speaker, a situation which sometimes could be resolved by correlating the tweet in time or tweeter in space.

In other words, there were (respectively) (1) a sample swap, (2) chimaeric dual-encoded barcodes, (3) samples showing up under the wrong barcodes, (4) uneven coverage, with some observed failure of anything to map to some regions of the target, (5) inadvertent reuse of a barcode for two samples within a run, and (6) some data that came in without any barcode; while heroic efforts were made to map these to a sample, it sometimes wasn't possible.
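To make the barcode analogy concrete, here is a minimal Python sketch of "demultiplexing" tweets by speaker. It is not what I actually did (that was by hand in Storify); the function names and the example tweets are made up for illustration, and the speaker table simply pairs a few last names from this post with their initials. A tweet matching exactly one speaker goes in that speaker's bin, one matching several is flagged as ambiguous (the shared-initials problem), and one matching none is left unassigned.

```python
import re
from collections import defaultdict

# Illustrative speaker table: last name -> initials (names taken from this post).
SPEAKERS = {
    "Aparicio": "SA",
    "Chiu": "CC",
    "Diaz": "LD",
    "Rubin": "ER",
}

def assign_speaker(text, speakers=SPEAKERS):
    """Return the set of speakers whose last name or initials appear in text."""
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    lowered = {t.lower() for t in tokens}   # last names match case-insensitively
    exact = set(tokens)                     # initials must match as written (e.g. "ER")
    hits = set()
    for last, initials in speakers.items():
        if last.lower() in lowered or initials in exact:
            hits.add(last)
    return hits

def demultiplex(tweets, speakers=SPEAKERS):
    """Bin tweets by speaker; keep ambiguous and unassigned tweets separate."""
    bins = defaultdict(list)
    for tweet in tweets:
        hits = assign_speaker(tweet, speakers)
        if len(hits) == 1:
            bins[next(iter(hits))].append(tweet)
        elif hits:
            bins["ambiguous"].append(tweet)    # e.g. two speakers sharing initials
        else:
            bins["unassigned"].append(tweet)   # no "barcode" at all
    return dict(bins)

if __name__ == "__main__":
    # Made-up example tweets, not real ones from the meeting.
    example = [
        "ER: microbial dark matter #agbt16",
        "Chiu on rapid diagnosis #agbt16",
        "Great poster session #agbt16",
    ]
    for speaker, items in demultiplex(example).items():
        print(speaker, items)
```

Note that a sketch like this only catches the easy cases. Sample swaps and wrong attributions (items 1 and 3) still look like clean matches, so they would need the same manual checking against time and tweeter that I ended up doing.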

Anyway, I did get things (I hope mostly) right, as the range of biology covered was wonderful.  Also hoping that exhaustion didn't completely skew any of my sparse comments or lead me to choose any inappropriate supplementary links.  If you don't read the stories, you can't find my faux pas and shame me!

The morning's clinical session is mostly a single Storify (plus Illumina's morning update).  This includes Sam Aparicio talking on cancer genomics, which covered a lot of single cell sequencing and understanding the behavior of patient-derived xenografts.  Charles Chiu covered a lot of ground on rapid diagnosis of infectious diseases, with a number of both heartwarming (lives saved) and frustrating (information arriving too late to save the patient) case studies, which he is planning to write up as a case series.  Chiu also noted that approval from the FDA is being sought (via the LDT route) for these tests, and also excitement that Oxford Nanopore sequencing can crunch the time required from a day or two down to a few hours, at least in certain contexts.  Luis Diaz followed with another cancer talk, focused on using circulating tumor DNA to monitor tumors. Franck Rapaport (who has actively tweeted other speakers' talks) looked at hematologic tumors, and John Martignetti spoke on endometrial tumors. I ended up putting Katia Sol-Church's talk on a rare pediatric skeletal disease in a separate Storify; I think if I had this to do again I'd break each speaker into their own, but once you lump them I don't know how to split without starting over. An interesting mystery is that this disease, Baratela Scott Syndrome, has only been found in boys but is not X-linked.

Chris Mason gave a lunchtime talk sponsored by QIAGEN which ranged from down low (New York's subways) to way up high (studying astronauts) with a stopover for targeted RNA sequencing.  Chris will be speaking again on Saturday on the astronaut genetics.

Afternoon session: Daniel Rokhsar on tetraploidy in Xenopus, Eske Willerslev on ancient human genomes, Eddy Rubin (the sub) on microbial dark matter and Beth Shapiro on passenger pigeon genomes (from tens of millions of individuals to extinct!).

Evening brought a poster session and then concurrent talks, along with Matt Loose doing a live demo of nanopore sequencing (in the poster session, but apparently continued on in one of the elevators!).  I won't try to summarize those here, but there is a lot of great stuff covering a variety of topics, including spatially localized RNA sequencing, challenges in standardizing large-scale human sequencing, genetic mosaicism, protein interaction screening, CRISPR, liver transplants -- and the list goes on.

Well, back to the day job -- plus I need to write up some notes from a phone call with a genomics vendor, while periodically sneaking in a start on Storifying today's stream.


Brian Krueger said...

You could do what I do and import all the tweets into a database using the twitter API. You don't miss anything that way.
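A minimal sketch of the database half of this suggestion, in Python with SQLite. The fetching step is deliberately left out, and the table layout, field names and function names are assumptions for illustration only, not the commenter's actual setup. Keying on the tweet ID means a tweet pulled in twice (say, from overlapping searches) is stored once.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a tweet store; the schema here is an assumed, minimal one."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "tweet_id TEXT PRIMARY KEY, author TEXT, created_at TEXT, text TEXT)"
    )
    return conn

def save_tweets(conn, tweets):
    """Insert tweet dicts, skipping IDs already stored; return how many were new."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO tweets "
        "VALUES (:tweet_id, :author, :created_at, :text)",
        tweets,
    )
    conn.commit()
    return conn.total_changes - before

if __name__ == "__main__":
    conn = open_store()
    # Made-up rows standing in for whatever the fetching step returns.
    batch = [
        {"tweet_id": "1", "author": "a", "created_at": "2016-02-11T09:00:00",
         "text": "first talk #agbt16"},
        {"tweet_id": "2", "author": "b", "created_at": "2016-02-11T09:05:00",
         "text": "second talk #agbt16"},
    ]
    print(save_tweets(conn, batch))      # both rows are new
    print(save_tweets(conn, batch[:1]))  # duplicate ID is ignored
```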

Keith Robison said...

I've done things like that before with a simple Perl program -- the catch is for publishing it back out. Ideally Storify would have a back-end way to load a set of tweet uids -- I should look to see if that exists

Brian Krueger said...

You should be able to do it using the twitter tweetID through the storify API. So get all the tweets using the twitter API and then organize using the storify API. I've thought about trying to write an algorithm to autosort using last names, first names, initials but haven't had the time to play with it. Database full of tables with data though...

Mark said...

While 10X strategy (like old school cosmid/bac shotgun) would work well for areas with no local repeats, give it some sequence with enhancer-like elements or some PKS/NRPS clusters, and see the results :-)

Obviously it would depend on the actual library fragment size range and distribution and repeats similarity/size, but one must remember that Sanger shotgun was effectively a 2x750 - 2x1000 run, and not all areas doable by it would assemble well if you run it only on 2x125 or 2x150 illumina run.

So, if possible, do 10X runs on 2x250 mode.

PS: Some pacbio / nanopore & gnubio data may be very useful as an independent 10K assembly QC.