Sunday, February 28, 2010

Programming Fossil Stumble, but Scala School Progresses

If you watch this space by RSS, make sure you catch the additions and corrections by BioIT World's Kevin Davies and my fellow Codon Devices alum Jack Leonard on the Ion Torrent technology (as well as Daniel MacArthur's piece on the last day of AGBT). All were on site & actually saw the machine -- it's a bit scary to see my piece on the PARE technology tweeted (and retweeted) as a substitute for a missed session.

I'll take a break for at least a few days from AGBT & try to regain some calm -- a sequencing instrument that you can buy with a home equity loan is a dangerous temptation.

I've been having trouble carving out time -- and enthusiasm -- for my Scala retraining exercise. A week or so ago I did make one try and hit an annoying roadblock. I had previously worked through online examples to automagically convert Java iterators into a clean Scala idiom. So, I decided to try this with BioJava sequence iterators -- and had the rude surprise that these don't implement the standard java.util.Iterator interface! Aaarggh! The documentation suggests the reason: when BioJava's sequence iterators were created, Java didn't support typesafe iterators (it lacked generic types). Generics have since been grafted onto Java, but BioJava hasn't updated to embrace them, most likely to guarantee backwards compatibility -- in a sense it is a fossil record of what Java once was.
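For what it's worth, the workaround shouldn't be hard: a little adapter class can wrap the fossil interface in a true Scala Iterator. Here's a sketch -- assuming my memory serves that BioJava's SequenceIterator exposes hasNext() and nextSequence():

import org.biojava.bio.seq.{Sequence, SequenceIterator}

// Adapt BioJava's pre-generics iterator to scala.Iterator, so that
// for comprehensions and the rest of the collections idiom just work
class RichSequenceIterator(si: SequenceIterator) extends Iterator[Sequence]
{
  def hasNext: Boolean = si.hasNext
  def next: Sequence = si.nextSequence
}
// then: for (seq <- new RichSequenceIterator(someBioJavaIterator)) ...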

On Friday I had a big block of meeting-free time and resolved to attack things again. Switching over to a functional programming style is a big jump (and takes a leap of faith that the Scala compiler and the JVM JIT will make efficient code out of it). Scala is also very forgiving about syntax marks -- semicolons, parentheses and the like -- but in a very context-dependent manner. Personally, I'd prefer a stricter taskmaster, as being rapped on the knuckles for every infraction tends to reinforce the lesson faster.
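To illustrate what I mean about forgiving syntax, here's a toy example of my own (not part of the program below) -- these definitions are equivalent, with the marks progressively dropped:

def doubled(x: Int): Int = { return x * 2; }
def doubled2(x: Int) = x * 2   // semicolon, braces, return and result type all optional

// Dots and parentheses can go too, in "operator notation":
val hit = "chr21".startsWith("chr")
val alsoHit = "chr21" startsWith "chr"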

For my problem, I chose to write a tool which would stream through a SAM format second gen sequencing alignment file and compute the number of reads covering each position. The program assumes that the SAM file is sorted by chromosome and then by position. Also, this first cut can't work on a very large file -- memory conservation was not a design constraint (though I did try to work some in).
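To make the input concrete, here's a made-up SAM header line and alignment record (tab-separated; the columns are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, mate RNAME, mate POS, TLEN, SEQ and QUAL):

@HD	VN:1.0	SO:coordinate
frag0001	99	chr21	1000	60	36M	=	1150	186	ACGTACGTACGTACGTACGTACGTACGTACGTACGT	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

The code below pulls out only the name, flag, chromosome, position, sequence, CIGAR and mate coordinates; a mate chromosome of "=" is SAM shorthand for "same chromosome as the read".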

Now, in some sense this is not a good problem to tackle -- not only will I avoid using the Picard library for processing SAM but there are already tools out there to perform the calculation. So I'm guilty of the sin of reinventing the wheel. But, it is a simple problem to formulate and has a nice trajectory forward for exploring multiprocessing and other fun topics. Plus, I'll try to couple things loosely enough that dropping in Picard should be possible without much acrobatics.

The topmost code is a bit boring: open a file of SAM data, hand it to a class that parses it (SamStreamer), and feed the reads to one that counts coverage (chrCoverageCollector).

object samParser extends Application
{
  val filename = "chr21.20k.sam"
  val samFile = new SamStreamer(filename)
  val ccc = new chrCoverageCollector()
  // Stream the front reads of properly-paired fragments and tally coverage
  for (read <- samFile.typicalFragmentFronts)
    ccc.addCoverage(read)
  ccc.dump
}

SamStreamer has four parts. The first (fromSamLine) converts a line of SAM into an object of class AlignedPairedShortRead -- pretty dull. The second (reads) demonstrates some key concepts. First, it looks like an ordinary immutable value definition, but its right-hand side is a whole expression rather than a simple literal or constructor call. Second, it goes through the lines of the file with a "for comprehension" -- which also filters out any lines starting with "@" (header lines). Finally, it ends with a yield -- meaning the result acts like an iterator over the set. The remaining two parts (typicalFragmentFronts and mapped) follow the same pattern -- iterate over a collection or stream, with a yield delivering each result. Iterators naturally -- without any special declarations!

class SamStreamer(samFile: String)
{
  // Split a SAM line on tabs: QNAME, FLAG, RNAME, POS are fields 0-3;
  // SEQ is field 9; CIGAR, mate chromosome and mate position are 5-7
  def fromSamLine(line: String): AlignedPairedShortRead =
  {
    val fields = line.split("\t")
    new AlignedPairedShortRead(
      fields(0), fields(1), fields(2),
      Integer.parseInt(fields(3)),
      fields(9), fields(5), fields(6), Integer.parseInt(fields(7)))
  }
  // Note these are all single-pass iterators drawing on one underlying
  // file stream -- consume only one of them per SamStreamer instance
  val reads = for { line <- Source.fromFile(samFile).getLines
                    if !line.startsWith("@") } yield fromSamLine(line)
  val typicalFragmentFronts = for { read <- reads
                                    if read.frontOfTypicalFragment } yield read
  val mapped = for { read <- reads
                     if read.mapped } yield read
}
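As an aside, the for/yield in reads is pure sugar: the compiler rewrites it into calls to filter and map, so it could equally have been written like this (same behavior, my rewrite):

val reads = Source.fromFile(samFile).getLines
              .filter(line => !line.startsWith("@"))
              .map(fromSamLine)

Which form reads better is a matter of taste, though the comprehension wins once multiple generators and guards start piling up.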

AlignedPairedShortRead is mostly a collection of fields and accessors for them. I could have coded this much more compactly -- I think. But that will be another lesson, as I just stumbled on it. The rest of the methods are mostly tests for various states. For this example, "frontOfTypicalFragment" returns true if a read is the lower coordinate member for a read pair which maps within a short distance of one another on the same chromosome. Actually, for this example calling into this is a bug -- I should have a method in SamStreamer to screen for all mapped reads. A more clever way would be to pass the filtering function in to a more generic scanner function -- another exercise for a future session, though I sketch the idea just after the class below.

class AlignedPairedShortRead(fragName: String, flag: String, chr: String, pos: Int,
                             read: String, cigar: String, mateChr: String, matePos: Int)
{
  // A read is mapped if it landed on a real chromosome with a real alignment
  def mapped: Boolean = chr.startsWith("chr") && !cigar.equals("*")
  // In SAM, a mate chromosome of "=" means "same chromosome as the read"
  def mateMapped: Boolean = (mateChr.equals("=") && pos != matePos) || mateChr.startsWith("chr")
  def bothMapped: Boolean = mapped && mateMapped
  // True for the earlier (or overlapping) member of a same-chromosome pair
  def frontOfTypicalFragment: Boolean = bothMapped && mateChr.equals("=") && pos < matePos + read.length
  def id: String = fragName
  // Half-open bounds: first covered position and one past the last
  def bounds: (Int, Int) = (pos, pos + read.length)
  def Chr: String = chr
}
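And since I brought it up, the more generic scanner would be something like this extra method on SamStreamer (a sketch only; I haven't wired it in):

// Hypothetical generic scanner: the caller supplies the filtering predicate
def readsWhere(keep: AlignedPairedShortRead => Boolean) =
  for { read <- reads if keep(read) } yield read

With that, typicalFragmentFronts collapses to readsWhere(r => r.frontOfTypicalFragment) and mapped to readsWhere(r => r.mapped).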

chrCoverageCollector (gad! no consistency in my naming convention!) takes reads and hands them to a CoverageCollector according to which chromosome they are on (addCoverage). Another method (dump) prints the results in bedGraph format.

class chrCoverageCollector()
{
  // One CoverageCollector per chromosome
  val Coverage = new HashMap[String, CoverageCollector]
  def addCoverage(read: AlignedPairedShortRead) =
  {
    if (!Coverage.contains(read.Chr)) Coverage(read.Chr) = new CoverageCollector
    Coverage(read.Chr).increment(read)
  }
  // Print one bedGraph line per merged constant-coverage range
  def dump()
  {
    for (chr <- Coverage.keys)
      for (pc <- Coverage(chr).MergedSet)
        println(pc.BedGraph(chr))
  }
}

CoverageCollector does most of the real work -- except that it introduces one last class (I'm starting to wish I had written this bottom up rather than top down!), RangeCoverage. CoverageCollector keeps a stash of RangeCoverage objects to store the coverage at individual positions. The one, mostly impotent attempt at memory conservation is a method (thin) to consolidate runs of the same coverage level -- but only those safely behind the last region incremented (remember, the reads come in sorted order!). allElemSorted delivers all the RangeCoverage objects in positional order and MergedSet delivers the same, but with adjacent elements of equal coverage consolidated.

class CoverageCollector()
{
  // Per-position coverage counters, keyed by position
  val Coverage = new HashMap[Int, RangeCoverage]
  // Fetch (creating on demand) the counters for the half-open range [st,en)
  def getCoverage(st: Int, en: Int) =
    for (pos <- st until en)
      yield { if (!Coverage.contains(pos)) Coverage(pos) = new RangeCoverage(pos)
              Coverage(pos) }
  var lastThin: Int = 0
  def increment(read: AlignedPairedShortRead) =
  {
    val (st, en) = read.bounds
    for (coverCnt <- getCoverage(st, en)) coverCnt.increment
    lastThin += 1
    // Every so often, consolidate positions safely behind the current read
    if (lastThin > 100000) thin(st - 500)
  }
  // Merge adjacent equal-coverage entries lying entirely before thinMax;
  // the entries must be visited in positional order for merging to work
  def thin(thinMax: Int)
  {
    var prev = new RangeCoverage(-10)
    for (rc <- Sorting.stableSort(Coverage.values.toList.filter(p => p.St < thinMax)))
    {
      if (prev.mergable(rc))
      {
        prev.engulf(rc)
        Coverage -= rc.St
      }
      else prev = rc   // advance only when rc survives as its own range
    }
    lastThin = 0
  }
  def allElemSorted: Array[RangeCoverage] =
    Sorting.stableSort(Coverage.values.toList)
  def MergedSet: List[RangeCoverage] =
  {
    val rcf = new RangeCoverageMerger()
    allElemSorted.toList.filter(rcf.mergeFilter)
  }
}

Finally, the last two classes (okay, I claimed only one more before -- forgot about one). RangeCoverage extends the Ordered trait so it can be sorted. Traits are Scala's approach to multiple inheritance (MI). I played with MI in C++ and got my fingers singed by it; traits will require some more study, as it seems to be quite controversial whether they really are a good solution. I'll need to play some more before I can give anything resembling an informed opinion (there's a tiny trait example after the code below). RangeCoverageMerger is a little helper class to consolidate RangeCoverage objects which are adjacent and have the same coverage. I probably could have buried this in RangeCoverage with a little more cleverness, but I ran out of time & cleverness. One final language note: the "override" keyword is required whenever you override an underlying method -- though unlike C++, Scala doesn't require the parent method to be declared "virtual" to enable overriding (I think I have that all right).

class RangeCoverage(st: Int) extends Ordered[RangeCoverage]
{
  var coverage: Int = 0
  var en: Int = st   // inclusive end; a fresh range covers just one position
  // Order ranges by their start position
  override def compare(that: RangeCoverage): Int = this.st compare that.st
  def increment = { coverage += 1 }
  def Coverage: Int = coverage
  def St: Int = st
  def En: Int = en
  // bedGraph intervals are half-open, hence en+1 for our inclusive end
  // (ignoring, for now, that SAM is 1-based while BED-style formats are 0-based)
  def BedGraph(Chr: String): String =
    Chr + "\t" + st.toString + "\t" + (en + 1).toString + "\t" + coverage.toString
  def mergable(next: RangeCoverage): Boolean = (en + 1 == next.St && coverage == next.coverage)
  def engulf(next: RangeCoverage) = { en = next.en }
}
class RangeCoverageMerger
{
  // Stateful filter: absorb each range into its predecessor when possible,
  // keeping only ranges that start a new run of constant coverage
  var prev = new RangeCoverage(-1)
  def mergeFilter(rc: RangeCoverage): Boolean =
  {
    if (prev.St == -1 || !prev.mergable(rc)) { prev = rc; return true }
    prev.engulf(rc)
    false
  }
}
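Since I grumbled about traits above, here's a toy illustration of my own (unrelated to the program) of mixing two traits into one class -- the sort of thing C++ MI made painful:

trait Named
{
  def name: String
  def describe: String = "read " + name   // a concrete method mixed in for free
}
trait Sized
{
  def size: Int
}
class Fragment(val name: String, val size: Int) extends Named with Sized
// new Fragment("frag0001", 36).describe yields "read frag0001"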

So, if I've copied this all correctly, then together with the bit below one should be able to run the whole code (if not, my profuse apologies). On my laptop it can get through 20,000 lines of aligned SAM data (which I can't post, both due to space and because it's company data) without exploding, though 50K blows out the Java heap. A next step is to deal with this problem -- and to set the stage for multiprocessing.
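(If you do try it, one stopgap for the heap problem is simply to raise the ceiling: the scala launcher passes -J options straight through to the JVM. Assuming everything is pasted into a single samParser.scala, something like

scalac samParser.scala
scala -J-Xmx1g samParser

should get you further than the default heap does.)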

Okay, one final really dull, but important bit. This actually goes at the top, as these are the import declarations. Dull, but critical -- and I always gripe about folks leaving them out.

import scala.io.Source
import scala.collection.mutable.HashMap
import scala.util.Sorting
