Think Gene: Comp Sci Sins of Biologists

  • John C · 1 year ago
    Well folks, we can up the ante a bit: codons!!
    On the way, we can note that in RNA, the nucleotide T is replaced by U. Representing all the codons (64 of them, 3 bases each) requires 6 bits, which is exactly what you get from storing each of the 3 bases at 2 bits apiece. So any effort to compress the representation stops here, unless we are only going to look for the 20 amino acids; then we can take a bit off and we are down to 5 bits. The good stuff for what I write here is at: http://en.wikipedia.org/wiki/Genetic_code

    In that article you will find this interesting quote: "A comparison may be made with computer science, where the codon is the equivalent of a word, which is the standard "chunk" for handling data (like one amino acid of a protein), and a nucleotide for a bit." Nooo...I am not going to defend this literally, because this communicates figuratively.
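
    The bit counts above can be checked with a minimal sketch. The names bits_needed and codon_index are made up for illustration, and the A/C/G/U ordering is an arbitrary assumption, not a standard encoding:

```python
# Hypothetical 2-bit code per RNA base; the ordering is an arbitrary choice.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "U": 3}


def bits_needed(n_symbols: int) -> int:
    """Smallest number of bits that gives each of n_symbols a distinct code."""
    return max(1, (n_symbols - 1).bit_length())


def codon_index(codon: str) -> int:
    """Pack a 3-base codon into one 6-bit integer (0..63), 2 bits per base."""
    index = 0
    for base in codon.upper():
        index = (index << 2) | BASE_CODE[base]
    return index


print(bits_needed(4))    # bits for one base
print(bits_needed(64))   # bits for one codon
print(bits_needed(20))   # bits for the 20 amino acids
print(codon_index("AUG"))
```

    Packing three 2-bit bases gives the same 6 bits as numbering the 64 codons directly, which is why nothing is saved at the codon level.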
  • sunny beach · 1 year ago
    That's scarcely enough to consign them to Dante's Comp Sci Hell alongside the SOA Vendors and the Physicists. Plain-text formats have their merits, and gzip should obliterate the extra inefficiency anyway, no?
  • Marcus Breese · 1 year ago
    Actually, you need 4 bits to store nucleic acid sequences, if you include ambiguity codes as well... (http://www.bioinformatics.org/SMS/iupac.html)

    0001 A
    0010 C
    0100 G
    1000 T

    Now, if you don't know the base, it's 1111 (N); if it's an A or a C, it's 0011 (M), etc...

    Using bases as a bitmap also makes comparisons much faster... you can just AND the two bitmaps together, and if the result is nonzero, it's a match.
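
    A rough sketch of that bitmap idea, assuming the one-bit-per-base assignment shown above and the standard IUPAC ambiguity letters. The helper names encode and compatible are invented for this example:

```python
# One bit per base, as in the table above.
A, C, G, T = 0b0001, 0b0010, 0b0100, 0b1000

# IUPAC ambiguity codes are the OR of the bases they stand for.
IUPAC = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G, "Y": C | T, "S": C | G, "W": A | T,
    "K": G | T, "M": A | C,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}


def encode(seq: str) -> list[int]:
    """Turn a sequence string into a list of 4-bit masks."""
    return [IUPAC[ch] for ch in seq.upper()]


def compatible(x: list[int], y: list[int]) -> bool:
    """True if every position shares at least one possible base (AND != 0)."""
    return len(x) == len(y) and all(a & b for a, b in zip(x, y))


print(compatible(encode("ACGT"), encode("NMGT")))
print(compatible(encode("ACGT"), encode("ACGA")))
```

    Note that a nonzero AND only says the two symbols could be the same base; N matches everything, so this is a compatibility test rather than strict equality.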
  • gwern · 1 year ago
    Gzip would impose a constant overhead, though (I don't mean this in the algorithmic sense; I don't actually know how gzip scales, but probably O(n) or something, since I would be surprised if it looked at the whole corpus rather than a fixed sliding window), and it could disallow lots of stuff (like random access, maybe; at least, I don't know how to get the billionth nucleotide without decompressing the previous 999 million).
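
    For contrast, a fixed-width packing keeps random access trivial. This is only a sketch under the assumption of an unambiguous A/C/G/T sequence at 2 bits per base; pack and base_at are hypothetical helper names:

```python
# Hypothetical 2-bit code per DNA base; four bases fit in each byte.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        out[i >> 2] |= CODE[ch] << ((i & 3) * 2)
    return bytes(out)


def base_at(packed: bytes, i: int) -> str:
    """Return base number i (0-based) without touching any other byte."""
    return BASES[(packed[i >> 2] >> ((i & 3) * 2)) & 0b11]


packed = pack("GATTACA")
print(len(packed))
print(base_at(packed, 4))
```

    Because every base sits at a computable byte and bit offset, fetching the billionth nucleotide is one index calculation, with no need to read anything before it.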
  • Nimish · 1 year ago
    Generally DEFLATE won't use a dictionary that large; its sliding window is capped at 32 KB, so at some point it stops growing.
  • John C · 1 year ago
    So if we use the 20 amino acid trick (5 bits), then we have all we want, plus the start and stop codons for unprocessed sequence, plus some room for a code representing an ambiguity (a codon's worth).

    I think that we are studying completed and processed genomes at this point, which means that all of the ambiguities have been resolved. Some of the other ideas mentioned are important for sequences that are still being assembled. So, except for the necessary start and end codons, everything else would have been corrected.
  • Eddie Pasternack · 1 year ago
    $327MM ?? That's a lot of Miller Mattles!