AGBT16 ended a week ago, but for various reasons I'm just now catching up on my Storify project. A vacation was in there, but also some tool building. As I was griping about the pains of organizing the tweets manually, Brian Krueger suggested what was already dawning on me (but it helps to be poked -- professional embarrassment is often a stronger motivator than pure annoyance): I needed to stop doing this purely manually. So, off to deal with pulling in Tweets automatically and at least doing some of the organization programmatically.
I had some code I had used before for pulling Tweets -- two years ago! Python, of all things -- if I remember correctly, the Perl libraries were unable to deal with the newish Twitter authentication scheme. Someday I'll give up on Perl entirely, but this project would be only a partial step in that direction. Note that if you want to try the code yourself, you need to get API keys and other authentication doodads from Twitter and Storify -- it is easy, but a necessary step.
However, that previous program was dealing with favorites, not hashtags, and the interface for querying was a little different. I found the tweepy (one of a gaggle of Python libraries to deal with Twitter -- can't remember why I picked this one) documentation frustrating, but a bit of Googling found a nearly ready-made program that would not only query Tweets but dump them to an SQLite database. Sweeeeet.
Tried that out, and quickly bumped into Twitter's rate limit on pulling tweets. I understand the need for avoiding deliberate or inadvertent denial-of-service attacks, but on the other hand program development and debugging are seriously hindered when you hit these limits. I found some code using cursors and exception trapping to get around this -- and it kept failing to work. Since I wasn't in a hurry (the Tweet-pulling was during vacation evenings), I just had the program sleep for a second after each request -- this won't hit the rate limit, though it is a dumb way to do things. The program also does a commit after each write -- this saved tweets during earlier attempts that hit the rate limit. I also ran into trouble trying to generate plain-text output at the same time -- Python exited complaining that something I tried to print wasn't legal (I forget the exact message now) -- presumably some special characters.
import tweepy, time, sys
import sqlite3
import unicodedata
db = sqlite3.connect('data/agbt16.4.db')
# Get a cursor object
cursor = db.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS tweets(id INTEGER PRIMARY KEY, id_str TEXT,name TEXT, geo TEXT, image TEXT, source TEXT, timestamp TEXT, text TEXT, rt INTEGER)''')
db.commit()
#enter the corresponding information from your Twitter application:
CONSUMER_KEY = 'YouNeedAKey'
CONSUMER_SECRET ='YouNeedASecret'
OAUTH_TOKEN = 'YouNeedAToken'
OAUTH_TOKEN_SECRET = 'YouNeedATokenSecret'
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)
hashtag = '#agbt16 -RT'
for tweet in tweepy.Cursor(api.search, q=hashtag).items():
    # process tweet here
    cursor.execute('''INSERT INTO tweets(id_str,id,name, geo, image, source, timestamp, text, rt) VALUES(?,?,?,?,?,?,?,?,?)''',(tweet.id_str,tweet.id,tweet.user.screen_name, str(tweet.geo), tweet.user.profile_image_url, tweet.source, tweet.created_at, tweet.text, tweet.retweet_count))
    db.commit()
    print tweet.id,"\t",tweet.id_str,"\t",tweet.user.screen_name,'\t',tweet.created_at
    time.sleep(1)
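A less dumb alternative to the fixed one-second sleep would be to wait only when a request is actually refused, and to double the wait each time. The following is a minimal sketch of that idea in Python 3, not the code used for this post. `RateLimitError` and `call_with_backoff` are made-up names, and `request` stands in for whatever call fetches the next batch of tweets.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever the client raises on a rate limit."""


def call_with_backoff(request, max_tries=5, first_wait=1.0, sleep=time.sleep):
    """Call request(); on a rate-limit error, wait and retry.

    The wait doubles after every refusal. The last failure is re-raised
    so the caller can decide what to do with it.
    """
    wait = first_wait
    for attempt in range(1, max_tries + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_tries:
                raise
            sleep(wait)
            wait *= 2
```

The `sleep` argument exists only so the waiting can be swapped out when testing. tweepy's `API` constructor also takes a `wait_on_rate_limit` option, which may make a hand-rolled loop unnecessary; it is worth checking against the tweepy version you have installed.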
Next, some quick Perl to pre-process the data, parsing out candidates for author-identifying tags. I didn't get this quite right, but close. Parsing tags is hard, starting with different styles. For example, if I were speaking then some tweeters would use "Robison:" and others "KR:" but still others might omit the colons. Inevitably there would be a "Robinson:" in the mix. Since I tweet, some would tag with "@omicsomicsblog". And more permutations.
Perhaps a future iteration (maybe in Python?) will try to use some more smarts here. Time information is a useful prior, though sometimes Tweeters post after a talk so they can give a more thought-out summary. I have a bunch of other ideas on how to try to enhance this, which I might play around with.
#!/usr/bin/perl
use DBI;
my $dbh=DBI->connect("dbi:SQLite:dbname=data/agbt16.3.db");
my $sth=$dbh->prepare("select id,name,timestamp,text from tweets");
$sth->execute;
while ( my ($id,$name,$timestamp,$text)=$sth->fetchrow_array)
{
    $text=~s/^\#AGBT16 +//i;
    $text=~s/\n/ /g;   # flatten every embedded newline, not just the first
    next if ($text=~/^RT/ || $text=~/ RT /);
    next if ($text=~/ mt /i && $name=~/SeqComplete/);
    my ($speaker)=();
    if ($text=~/^([A-Z\-]+)[:,]/i)
    {
        $speaker=$1;
        $speaker="None" if ($speaker=~/Chirp/);
    }
    $timestamp=~s/ /\t/;
    print join("\t",$id,$timestamp,$speaker,$name,$text),"\n";
}
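Here is a rough Python 3 sketch of what a smarter tag parser might look like, covering the tag styles described above: surname with or without a colon, initials, a Twitter handle, and the inevitable misspelling. It is an illustration under assumptions, not the parser used for this post. The `ROSTER` table and the `guess_speaker` function are hypothetical, and the 0.8 similarity cutoff is a guess that would need tuning on real tweets.

```python
import difflib
import re

# Hypothetical roster for illustration: surname -> initials and Twitter handle.
ROSTER = {
    "Robison": {"initials": "KR", "handle": "omicsomicsblog"},
    "Eddy": {"initials": "SE", "handle": None},
    "Loman": {"initials": "NL", "handle": None},
}

# Leading word, optionally an @handle, optionally followed by ':' or ','.
TAG_RE = re.compile(r"^(@?[A-Za-z][\w\-]*)[:,]?\s")


def guess_speaker(text, roster=ROSTER, cutoff=0.8):
    """Return the roster surname a tweet's leading tag points at, or None."""
    match = TAG_RE.match(text)
    if not match:
        return None
    tag = match.group(1)

    # Style 1: "@omicsomicsblog ..." -- compare against known handles.
    if tag.startswith("@"):
        handle = tag[1:].lower()
        for surname, info in roster.items():
            if info["handle"] and info["handle"].lower() == handle:
                return surname
        return None

    # Style 2: "KR: ..." -- exact match on initials.
    for surname, info in roster.items():
        if tag == info["initials"]:
            return surname

    # Style 3: surname, possibly misspelled ("Robinson" for "Robison").
    by_lower = {surname.lower(): surname for surname in roster}
    close = difflib.get_close_matches(tag.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None
```

Because the colon is optional, the first word of every tweet becomes a candidate, and it is the roster lookup that weeds out ordinary words. Restricting the roster to whoever was scheduled around the tweet's timestamp would be one way to fold in the time prior.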
That data went into Excel -- which is pretty good for viewing this data (as I have advocated), though I did initially shoot myself in the foot by importing the data carelessly -- the Twitter ids are integers that apparently blow out Excel's precision.
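The precision problem is easy to reproduce outside Excel. Excel keeps 15 significant digits for numbers, while Twitter ids run longer than that, so the trailing digits are lost unless the column is imported as text. The Python 3 sketch below uses a made-up 18-digit id; `as_excel_number` is only a mimic of the 15-digit behavior, not Excel's actual code.

```python
# Hypothetical 18-digit id, made up for illustration.
tweet_id = 697500000000000001


def as_excel_number(n):
    """Mimic a 15-significant-digit cell: later digits become zeros."""
    digits = str(n)
    return int(digits[:15] + "0" * (len(digits) - 15))


def as_text_cell(n):
    """Keep every digit by treating the id as a string (like id_str)."""
    return str(n)


# The numeric route loses the last digits; the text route does not.
print(as_excel_number(tweet_id) == tweet_id)      # False
print(int(float(tweet_id)) == tweet_id)           # False: a double is not enough either
print(int(as_text_cell(tweet_id)) == tweet_id)    # True
```

This is the reason the table above stores `id_str` alongside `id`: exporting the string form and importing that column as text sidesteps the problem.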
Once in Excel, I could improve the column of tags to catch missed items. I also removed tags for modified tweets and for blatant copies of tweets -- there is one particular Tweeter who sometimes adds value, but much too often copies others' content without attribution -- in this day of quoting tweets there is no excuse. I won't fully shame them here, but let's just say in the world of Seq you ain't Complete without good faith efforts at attribution. And in that vein, let me heartily thank everyone who did live-tweet the meeting -- without you, none of this would be possible. It looks like 235 different individuals tweeted with the #agbt16 hashtag, only a handful of them spammers. My one request for the future would be to clearly tweet out when talks are scratched or changed in timing or speaker -- that would be useful to know with certainty. There's also a nice collection of blogs on AGBT16 pulled together by AllSeq.
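A head count like that can come straight from the SQLite table rather than from Excel. The Python 3 sketch below shows one way to do it, assuming the `tweets` table and its `name` column from the schema earlier in this post; the helper names and the three demo rows are invented for illustration.

```python
import sqlite3


def count_tweeters(db):
    """Number of distinct screen names in the tweets table."""
    return db.execute("SELECT COUNT(DISTINCT name) FROM tweets").fetchone()[0]


def top_tweeters(db, limit=5):
    """Most prolific screen names, as (name, tweet_count) pairs."""
    return db.execute(
        "SELECT name, COUNT(*) AS n FROM tweets "
        "GROUP BY name ORDER BY n DESC, name LIMIT ?",
        (limit,),
    ).fetchall()


# Tiny in-memory demo with made-up rows (a cut-down version of the schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets(id INTEGER PRIMARY KEY, name TEXT, text TEXT)")
db.executemany(
    "INSERT INTO tweets(id, name, text) VALUES (?, ?, ?)",
    [(1, "alice", "first"), (2, "bob", "second"), (3, "alice", "third")],
)
print(count_tweeters(db))
print(top_tweeters(db, limit=1))
```

Spammers still have to be weeded out by eye; the per-name counts from `top_tweeters` are one place to start looking.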
Once that was done, I copied and pasted the lists of Twitter ids for each speaker back to Linux. Another bit of Perl programming then generated the Storify API upload file. This took a lot of iterations, which were slowed by Storify's own rate limit -- only 10 posts per hour, which seems a tad low. Partly the trouble was my inattention to detail -- turns out an API key truncated by 1 letter won't work -- but it wasn't helped by the available resources being slightly out-of-date. For example, for the Storify URL one must now use an https address, not http. But, that's all solved.
#!/usr/bin/perl
use strict;
use Getopt::Long;
my ($title)=();
&GetOptions("t|title=s",\$title);
die unless (defined $title);
my @elements=();
while ($_ = <>)
{
    my ($id)=(/^([0-9]+)/);
    push(@elements,"\"http://twitter.com/#!/Storify/status/$id\"") if ($id>0);
}
my $apiKey="YouNeedAKey";
my $token="YouNeedAToken";
my $publish="true";
print "username=OmicsOmicsBlog&publish=$publish&api_key=$apiKey&_token=$token&story={\"title\":\"$title\"\n";
print " ,\"elements\": [\n";
print join(",\n",@elements),"\n";
print " ] }\n";
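The same upload body could be built in Python 3 with the standard library doing the JSON and form encoding, so a title containing quotes or ampersands doesn't break the output. This is a sketch under assumptions: the field names (`username`, `publish`, `api_key`, `_token`, `story`) and the element URL pattern are taken from the Perl above, `build_story_body` is a made-up name, and whether Storify accepts the percent-encoded form is untested here.

```python
import json
import re
from urllib.parse import urlencode


def build_story_body(title, lines, api_key, token,
                     username="OmicsOmicsBlog", publish=True):
    """Build a form-encoded body from lines that start with a tweet id."""
    elements = []
    for line in lines:
        match = re.match(r"([0-9]+)", line)
        # Skip lines with no leading id, as the Perl version does.
        if match and int(match.group(1)) > 0:
            elements.append(
                "http://twitter.com/#!/Storify/status/" + match.group(1)
            )
    story = {"title": title, "elements": elements}
    return urlencode({
        "username": username,
        "publish": "true" if publish else "false",
        "api_key": api_key,
        "_token": token,
        "story": json.dumps(story),
    })
```

Reading the pasted id lists from a file and writing the result out would look something like `build_story_body(title, open(path), key, token)`, with the output saved for whatever tool does the actual POST.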
So, here is the final result. I've copied the AGBT schedule, with each entry linked to its Storify. Editorial comments are in bold italics. I may continue to edit these a bit -- James Hadfield has blog entries on many that I really should add in, and I've also been pulling in appropriate pictures when Storify failed to add a thumbnail photo (sometimes because no tweets had images, but sometimes perfectly good images are mysteriously ignored).
Wednesday, February 10, 2016
11:00 a.m. – 8:00 p.m. | Meeting Registration Mediterranean Ballroom Registration Desk Wednesday morning pre-buzz More pre-buzz Wednesday afternoon workshops |
Plenary Session: | Genomics I (Mike Zody, New York Genome Center, Chair) Mediterranean Ballroom |
5:00 p.m. – 5:30 p.m. | Sean Eddy, Harvard University “Genome evolution: the future of deciphering the past” |
5:30 p.m. – 6:00 p.m. | David Haussler, University of California, Santa Cruz “Global sharing of better and more genomes” |
6:00 p.m. – 6:30 p.m. | Pardis Sabeti, Harvard University “Genomic surveillance of microbial threats” |
6:30 p.m. – 7:00 p.m. | Matthew Sullivan, The Ohio State University “Unveiling viral ocean: towards a global map of ocean viruses” |
7:00 p.m. – 10:00 p.m. | Welcome Reception Valencia Terrace & Lawn |
Thursday, February 11, 2016
7:30 a.m. – 9:00 a.m. | Breakfast Coquina North |
8:00 a.m. – 3:00 p.m. | Meeting Registration Mediterranean Ballroom Registration Desk |
Plenary Session: | Clinical Genomics (Sharon Plon, Baylor College of Medicine, Chair) Mediterranean Ballroom When I was still in manual mode, did this session as one big block -- won't ever do that again! |
9:00 a.m. – 9:30 a.m. | Sam Aparicio, British Columbia Cancer Research Centre “Clonal evolution and cancer medicine at single cell resolution” |
9:30 a.m. – 10:00 a.m. | Charles Chiu, University of California, San Francisco School of Medicine “Metagenomic deep sequencing for diagnosis of infectious diseases” |
10:00 a.m. – 10:30 a.m. | Luis Diaz, Johns Hopkins Hospital “Novel therapeutic and diagnostic applications of somatic mutations in solid tumor malignancies” |
10:30 a.m. – 11:00 a.m. | Coffee Break Coquina North * Denotes abstract selected talk |
11:00 a.m. – 11:20 a.m. | *Franck Rapaport, Memorial Sloan Kettering Cancer Center “Integrated DNA/RNA profiling for somatic alterations in adult B-cell ALL” |
11:20 a.m. – 11:40 a.m. | *John Martignetti, Icahn School of Medicine at Mount Sinai “A pre-operative, diagnostic gene panel for guiding primary treatment choices in endometrial cancer: advancing beyond the decades-old technology of dilation and curettage (D&C)” |
11:40 a.m. – 12:00 p.m. | *Katia Sol-Church, Nemours A.I. duPont Hospital for Children “Baratela Scott Syndrome is a rare recessive skeletal dysplasia caused by a new class of compound heterozygous defects” |
12:05 p.m. – 1:05 p.m. | Qiagen Workshop (Complimentary Lunch Provided) Coquina North Chris Mason's first talk, touching on all his projects including RNA capture, subway metagenomics and NASA twin study |
12:00 p.m. – 2:00 p.m. | AGBT Lunch Valencia Terrace & Lawn |
1:10 p.m. – 2:10 p.m. | Roche Workshop Coquina North |
Plenary Session: | Evolutionary Genomics (Ken Dewar, McGill University and Génome Québec Innovation Centre, Chair) Mediterranean Ballroom |
2:15 p.m. – 2:45 p.m. | Eske Willerslev, University of Copenhagen “How we developed our biological and cultural diversity” |
2:45 p.m. – 3:15 p.m. | Felicity Jones, Friedrich Miescher Laboratory of the Max Planck Society “Dissecting the genomic basis of adaption in natural stickleback populations” I believe Jones' talk was a scratch -- would love to hear it some other time! Eddy Rubin spoke in this slot, but I've linked him when he was scheduled |
3:15 p.m. – 3:45 p.m. | Coffee Break Lower Level/Sponsor’s Promenade * Denotes abstract selected talk |
3:50 p.m. – 4:10 p.m. | *Beth Shapiro, University of California, Santa Cruz “Passenger pigeon paleogenomes reveal the genomic consequences of long-term extremely large effective population sizes” |
4:10 p.m. – 4:30 p.m. | *Daniel Rokhsar, DOE Joint Genome Institute “When genomes collide: the allotetraploid Xenopus laevis” |
4:30 p.m. – 4:50 p.m. | *Stephan Schuster, Nanyang Technological University “Why ethnicity matters for precision medicine” |
5:00 p.m. – 7:00 p.m. | Poster & Software Demo – Wine & Cheese Reception Coquina North and South |
5:00 p.m. – 7:30 p.m. | Dinner on your own Thursday night concurrent sessions done as a single Storify |
Concurrent Session: | Genome Technology (Ken Dewar, McGill University and Génome Québec Innovation Centre, Chair) Mediterranean Ballroom/General Session Room |
7:30 p.m. – 7:50 p.m. | Mariateresa de Cesare, University of Oxford “Unlocking the heterogeneity of the human transcriptome utilizing the ONT MinION TM” Need to recruit more tweeters! Mariateresa's talk had only 1 tweet! |
7:50 p.m. – 8:10 p.m. | Joel Malek, Weill Cornell Medical College in Qatar “AVA-Seq: a method for all-versus-all protein interaction mapping using next generation sequencing” |
8:10 p.m. – 8:30 p.m. | Israel Steinfeld, Agilent Technologies “Improved methods and analysis tools for efficient CRISPR/Cas genome editing” |
8:30 p.m. – 8:50 p.m. | James Hadfield, University of Cambridge “Progress in developing a nanopore rapid cancer MDX test” James apparently didn't get another official blogger to cover his own talk; I can't find any tweets |
8:50 p.m. – 9:10 p.m. | GiWon Shin, Stanford University “STR-Seq: a massively parallel microsatellite sequencing and genotyping technology” |
9:10 p.m. – 9:30 p.m. | Jiabin Tang, Memorial Sloan Kettering Cancer Center “Non-invasive somatic mutation profiling of liquid biopsies using capture-based next generation” |
Concurrent Session: | Transcriptomics (Martin Hirst, University of British Columbia, Chair) Mediterranean Salons 6-8 |
7:30 p.m. – 7:50 p.m. | Sten Linnarsson, Karolinska Institutet “Molecular anatomy of the mouse brain by single-cell RNA-seq” |
7:50 p.m. – 8:10 p.m. | Stefania Giacomello, SciLifeLab “Spatially resolved gene expression in the meristem of model angiosperm and gymnosperm species enabled by spatial transcriptomics” |
8:10 p.m. – 8:30 p.m. | Chia-Lin Wei, Lawrence Berkeley National Laboratory “Polycomb mediated 3-dimensional chromatin organization directs transcriptional silencing through extensive promoter looping and prevalent lncRNA association” |
8:30 p.m. – 8:50 p.m. | Max Seibold, National Jewish Health “Large-scale single cell transcriptome sequencing of the human airway epithelium” |
8:50 p.m. – 9:10 p.m. | Mohan Bolisetty, The Jackson Laboratory “Determining exon connectivity in complex mRNAs using the MinION sequencer” |
9:10 p.m. – 9:30 p.m. | Masako Suzuki, Albert Einstein College of Medicine “Metastable epialleles in mouse epialleles are defined by targeting of 5-hydroxymethylcytosine and increased DNA methylation entropy” |
Concurrent Session: | Genomic Medicine (Sharon Plon, Baylor College of Medicine, Chair) Palazzo Ballroom Salon E |
7:30 p.m. – 7:50 p.m. | Alexander Hoischen, Radboud University Medical Center “Ultra-sensitive mosaic mutation detection for clinical applications” |
7:50 p.m. – 8:10 p.m. | Stephen Lincoln, Invitae “Clinically important variants are often technically challenging for NGS: implications for NGS methods, validation and confirmation” |
8:10 p.m. – 8:30 p.m. | Brendan Keating, University of Pennsylvania “Detection and validation of signatures of liver transplantation rejection diagnoses and successful minimization of immunosuppression from serum miRNA profiles” |
8:30 p.m. – 8:50 p.m. | Richard Moore, BC Cancer Agency Genome Sciences Centre “Whole genome and transcriptome sequencing for personalized cancer therapy: lessons learned from first 300 cases” |
8:50 p.m. – 9:10 p.m. | Matthew Bainbridge, Baylor College of Medicine “Neptune: an automated pipeline for clinical reporting of carrier-tests” |
9:10 p.m. – 9:30 p.m. | Lei Huang, Peking University “Live Births with monogenic diseases and chromosome abnormality simultaneously avoided by NGS-based PGD/PGS with linkage analyses” |
Friday, February 12, 2016
7:30 a.m. – 9:00 a.m. | Breakfast Coquina North |
Plenary Session: | Technology I (John McPherson, University of California, Davis, Chair) Mediterranean Ballroom |
9:00 a.m. – 9:30 a.m. | Jay Flatley, Illumina “Beyond the $1000 Genome-What’s next for NGS?” |
9:30 a.m. – 10:00 a.m. | Anne Wojcicki, 23andMe “Making discoveries on the 23andMe platform” * Denotes abstract selected talk |
10:00 a.m. – 10:20 a.m. | *William Greenleaf, Stanford University “Single-cell chromatin accessibility reveals principles of regulatory variation” Another scratch? |
10:25 a.m. – 10:55 a.m. | Coffee Break Lower Level/Sponsor’s Promenade |
11:00 a.m. – 11:20 a.m. | *Jason Bielas, Fred Hutchinson Cancer Research Center “Deep profiling of complex cell populations using scalable single cell gene expression analysis” |
11:20 a.m. – 11:40 a.m. | *Paolo Piazza, University of Oxford “Linking epigenetics and gene expression at single cell levels using SMART-ATAC-seq” |
11:40 a.m. – 12:00 p.m. | *Andrea Kohn, University of Florida “Epitranscriptomic landscape of single neurons: insights in the memory mechanisms” |
12:00 p.m. – 2:00 p.m. | Roche/PacBio SMRT® Sequencing Workshop (Complimentary Lunch Provided) Coquina North Ben Murrell, Deep Sequencing of Viruses; Euan Ashley, Precision Medicine; Bobby Sebra, Leveraging SMRT Sequencing |
12:00 p.m. – 2:00 p.m. | AGBT Lunch Valencia Terrace & Lawn |
Plenary Session: | Genomics II (Eric Green, NHGRI, Chair) Mediterranean Ballroom |
2:00 p.m. – 2:30 p.m. | Harold Varmus, Weill Cornell Medicine and New York Genome Center “What genomics is teaching us about carcinogenesis” |
2:30 p.m. – 3:00 p.m. | Eddy Rubin, DOE Joint Genome Institute “Metagenomic diamond mining” Given on previous day |
3:00 p.m. – 3:30 p.m. | Debbie Nickerson, University of Washington “Scaling Mendelian genomics” |
3:40 p.m. – 4:40 p.m. | 10X Workshop (Complimentary coffee and snack provided) Coquina North Michael Schnall-Levin of 10X Stacey Gabriel Scott Furlan, single cell transcriptomics |
4:45 p.m. – 6:45 p.m. | Poster Session and Break Coquina South |
Saturday, February 13, 2016
12:00 p.m. – 1:15 p.m. | AGBT Lunch Valencia Terrace & Lawn |
Plenary Session: | Technology II (Len Pennacchio, Lawrence Berkeley National Laboratory, Chair) Mediterranean Ballroom |
1:30 p.m. – 2:00 p.m. | Nick Loman, University of Birmingham “Real-time genome sequencing in the field” |
2:00 p.m. – 2:30 p.m. | Susan Rosenberg, Baylor College of Medicine “Freeze-frame synthetic proteins trap genome damage intermediates in living cells” * Denotes abstract selected talk. Rosenberg was the only speaker at AGBT16 who opted out of live tweeting |
2:30 p.m. – 2:50 p.m. | *Shawn Levy, HudsonAlpha Institute for Biotechnology “Unique run conditions allow multiplexed phased genomes on the HiSeq X to reveal high-resolution copy number changes and high-quality variant calling” |
2:50 p.m. – 3:10 p.m. | Coffee Break Mediterranean Pre-Function Space |
3:10 p.m. – 3:30 p.m. | *Jonas Korlach, Pacific Biosciences “Addressing complex diseases and hidden heritability with the sequel system” |
3:30 p.m. – 3:50 p.m. | *Tamir Biezuner, Weizmann Institute of Science “A generic, cost-effective and scalable cell lineage analysis platform” |
3:50 p.m. – 4:10 p.m. | *Christopher Hill, University of Washington “Long-read sequence assembly of the gorilla genome” |
4:10 p.m. – 4:40 p.m. | Closing Comments and Meeting Feedback |
7:00 p.m. – 12:00 a.m. | Farewell Dinner Party Coquina Ballrooms |
That concludes my organizing information on AGBT16 (more-or-less) -- next on the agenda is to distill down to bits some thoughts that pieces of this info triggered.
Wow! What a lot of work and talks to go through. Thanks.
Python, Perl, SQL, Excel and other tools as needed (e.g., Spotify). The sign of a true bioinformatics person.
Do you have an inside source at starbase or Chez Shi Tzu? How'd you know about the critical role of Spotify? :-)