<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>Think Gene - Latest Comments in Comp Sci Sins of Biologists</title><link>http://thinkgene.disqus.com/</link><description>a bio blog about genetics, genomics, and biotechnology</description><language>en</language><lastBuildDate>Sat, 28 Jun 2008 01:33:07 -0000</lastBuildDate><item><title>Re: Comp Sci Sins of Biologists</title><link>http://www.thinkgene.com/terrible-comp-sci-sins-of-biologists/#comment-2464594</link><description>$327MM ?? That's a lot of Miller Mattles!</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Eddie Pasternack</dc:creator><pubDate>Sat, 28 Jun 2008 01:33:07 -0000</pubDate></item><item><title>Re: Comp Sci Sins of Biologists</title><link>http://www.thinkgene.com/terrible-comp-sci-sins-of-biologists/#comment-2464595</link><description>So if we use the 20 amino acid trick (5 bits) then we have all we want, plus the start and stop codons for unprocessed sequence plus some room for a code representing an ambiguity (a codon's worth).&lt;br&gt;&lt;br&gt;I think that we are studying completed and processed genomes at this point, which means that all of the ambiguities have been resolved.  Some of the other ideas mentioned are important for sequences that still are being assembled.  
So, except for the necessary start and end codon, everything else would have been corrected.</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">John C</dc:creator><pubDate>Fri, 27 Jun 2008 18:02:38 -0000</pubDate></item><item><title>Re: Comp Sci Sins of Biologists</title><link>http://www.thinkgene.com/terrible-comp-sci-sins-of-biologists/#comment-2464598</link><description>Generally DEFLATE won't use a dictionary that large, and at some point it'll stop growing.</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Nimish</dc:creator><pubDate>Fri, 27 Jun 2008 13:09:59 -0000</pubDate></item><item><title>Re: Comp Sci Sins of Biologists</title><link>http://www.thinkgene.com/terrible-comp-sci-sins-of-biologists/#comment-2464597</link><description>Gzip would impose a constant overhead (I don't mean this in the algorithmic sense - I don't actually know how gzip scales, probably O(n) or something since I would be surprised if it looks not at a fixed sliding window but the whole corpus), though, and it could disallow lots of stuff (like random access, maybe - or at least, I don't know how to get the billionth nucleotide without decompressing the previous 999 million).</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">gwern</dc:creator><pubDate>Fri, 27 Jun 2008 13:09:18 -0000</pubDate></item><item><title>Re: Comp Sci Sins of Biologists</title><link>http://www.thinkgene.com/terrible-comp-sci-sins-of-biologists/#comment-2464596</link><description>Actually, you need 4 bits to store nucleic acid sequences, if you include ambiguity codes as well... 
(&lt;a href="http://www.bioinformatics.org/SMS/iupac.html" rel="nofollow"&gt;http://www.bioinformatics.org/SMS/iupac.html&lt;/a&gt;)&lt;br&gt;&lt;br&gt;0001 A&lt;br&gt;0010 C&lt;br&gt;0100 G&lt;br&gt;1000 T&lt;br&gt;&lt;br&gt;Now, if you don't know the base, it's 1111 (N), if it's an A or a C, it's 0011, etc...&lt;br&gt;&lt;br&gt;Using bases as a bitmap also makes comparisons much faster too...  you can just AND each bitmap against each other and if the result is greater than zero, it's a match.</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Marcus Breese</dc:creator><pubDate>Fri, 27 Jun 2008 10:32:12 -0000</pubDate></item><item><title>Re: Comp Sci Sins of Biologists</title><link>http://www.thinkgene.com/terrible-comp-sci-sins-of-biologists/#comment-2464600</link><description>That's scarcely enough to consign them to Dante's Comp Sci Hell alongside the SOA Vendors and the Physicists. Plain-text formats have their merits, and gzip should obliterate the extra inefficiency anyway, no?</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">sunny beach</dc:creator><pubDate>Fri, 27 Jun 2008 04:26:53 -0000</pubDate></item><item><title>Re: Comp Sci Sins of Biologists</title><link>http://www.thinkgene.com/terrible-comp-sci-sins-of-biologists/#comment-2464599</link><description>Well folks, we can up the ante a bit: codons!!&lt;br&gt;On the way, we can note that in RNA, the nucleotide T is replaced by U.  To represent all the codons (64 of them, 3 bases each) would require 6 bits, which is the same as for representing each of the 3 bases.  So any effort to compress the representation stops here unless we are only going to look for the 20 amino acids; then we can take a bit off and we are down to five bits.  
The good stuff for what I write here is at: &lt;a href="http://en.wikipedia.org/wiki/Genetic_code" rel="nofollow"&gt;http://en.wikipedia.org/wiki/Genetic_code&lt;/a&gt;&lt;br&gt;&lt;br&gt;In that article you will find this interesting quote: "A comparison may be made with computer science, where the codon is the equivalent of a word, which is the standard "chunk" for handling data (like one amino acid of a protein), and a nucleotide for a bit." Nooo...I am not going to defend this literally, because this communicates figuratively.</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">John C</dc:creator><pubDate>Fri, 27 Jun 2008 03:34:55 -0000</pubDate></item></channel></rss>