Think Gene - Latest Comments in Comp Sci Sins of Biologists

Re: Comp Sci Sins of Biologists

Djarum Black — Tue, 09 Feb 2010 13:03:03 -0000

This encoding further has the convenient property that the bits can be inverted to get the complementary DNA strand. Storing bases as ASCII is OK for small, human readable files, but otherwise, it’s a gross waste of storage,
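A minimal sketch of that bit-inversion property. The thread does not spell out the exact 2-bit codes, so the mapping below (A=00, C=01, G=10, T=11) is an assumption, chosen so that each base and its complement are bitwise inverses; the function names are illustrative only.

```python
# Assumed 2-bit codes: complementary bases (A/T, C/G) are bitwise inverses.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {code: base for base, code in ENCODE.items()}


def complement(seq: str) -> str:
    """Complement each base by flipping both of its bits (XOR with 0b11)."""
    return "".join(DECODE[ENCODE[base] ^ 0b11] for base in seq)


def reverse_complement(seq: str) -> str:
    """Complement, then reverse, to read the opposite strand in its own direction."""
    return complement(seq)[::-1]


print(complement("ACGT"))          # TGCA
print(reverse_complement("AACG"))  # CGTT
```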

Re: Comp Sci Sins of Biologists

Eddie Pasternack — Sat, 28 Jun 2008 01:33:07 -0000

$327MM ?? That's a lot of Miller Mattles!

Re: Comp Sci Sins of Biologists

John C — Fri, 27 Jun 2008 18:02:38 -0000

So if we use the 20 amino acid trick (5 bits) then we have all we want, plus the start and stop codons for unprocessed sequence, plus some room for a code representing an ambiguity (a codon's worth).

I think that we are studying completed and processed genomes at this point, which means that all of the ambiguities have been resolved. Some of the other ideas mentioned are important for sequences that are still being assembled. So, except for the necessary start and end codon, everything else would have been corrected.

Re: Comp Sci Sins of Biologists

Nimish — Fri, 27 Jun 2008 13:09:59 -0000

Generally DEFLATE won't use a dictionary that large, and at some point it'll stop growing.

Re: Comp Sci Sins of Biologists

gwern — Fri, 27 Jun 2008 13:09:18 -0000

Gzip would impose a constant overhead, though (I don't mean this in the algorithmic sense - I don't actually know how gzip scales, probably O(n) or something, since I would be surprised if it looked at the whole corpus rather than a fixed sliding window), and it could rule out lots of stuff (like random access, maybe - or at least, I don't know how to get the billionth nucleotide without decompressing the previous 999 million).
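For contrast with the random-access worry, here is a rough sketch of how a fixed-width packed layout allows direct indexing. It assumes four bases per byte with the hypothetical codes A=0, C=1, G=2, T=3; the helper names `pack` and `base_at` are made up for illustration.

```python
# Assumed 2-bit codes; any fixed assignment would work for indexing.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack four bases per byte, lowest two bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return bytes(out)


def base_at(packed: bytes, i: int) -> str:
    """Fetch base i with index arithmetic only; nothing before it is decoded."""
    return BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11]


packed = pack("GATTACA")
print(len(packed))         # 2 bytes for 7 bases
print(base_at(packed, 4))  # A
```

Because every base occupies exactly two bits, base i always lives in byte i // 4, so the billionth nucleotide costs the same lookup as the first.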

Re: Comp Sci Sins of Biologists

Marcus Breese — Fri, 27 Jun 2008 10:32:12 -0000

Actually, you need 4 bits to store nucleic acid sequences, if you include ambiguity codes as well... (http://www.bioinformatics.o...

0001 A
0010 C
0100 G
1000 T

Now, if you don't know the base, it's 1111 (N); if it's an A or a C, it's 0011; etc.

Using bases as a bitmap also makes comparisons much faster: you can just AND the two bitmaps together, and if the result is nonzero, it's a match.
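A small sketch of the AND comparison, using the four single-base codes from the table above. The letters M, R and Y are the standard IUPAC ambiguity symbols for A/C, A/G and C/T; their bit patterns here are simply the OR of the member bases, and the function name `matches` is illustrative.

```python
# One bit per base, as in the table above; ambiguity codes OR the bits together.
CODES = {
    "A": 0b0001,
    "C": 0b0010,
    "G": 0b0100,
    "T": 0b1000,
    "M": 0b0011,  # A or C
    "R": 0b0101,  # A or G
    "Y": 0b1010,  # C or T
    "N": 0b1111,  # any base
}


def matches(x: str, y: str) -> bool:
    """Two symbols are compatible if they share at least one possible base."""
    return (CODES[x] & CODES[y]) != 0


print(matches("A", "M"))  # True
print(matches("G", "M"))  # False
print(matches("N", "T"))  # True
print(matches("R", "Y"))  # False
```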

Re: Comp Sci Sins of Biologists

sunny beach — Fri, 27 Jun 2008 04:26:53 -0000

That's scarcely enough to consign them to Dante's Comp Sci Hell alongside the SOA Vendors and the Physicists. Plain-text formats have their merits, and gzip should obliterate the extra inefficiency anyway, no?
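One way to check the gzip question empirically is a rough experiment like the one below. It uses Python's `zlib` module (DEFLATE, the algorithm behind gzip) on a randomly generated sequence, which is an assumption: real genomes contain repeats and skewed base frequencies, so their numbers will differ.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable
n = 100_000

# Hypothetical test data: uniformly random bases, one ASCII byte each.
ascii_seq = "".join(random.choices("ACGT", k=n)).encode("ascii")

deflated_size = len(zlib.compress(ascii_seq, 9))  # maximum compression level
packed_size = (n + 3) // 4                        # 2 bits per base, no compression

print("ASCII bytes:   ", len(ascii_seq))
print("DEFLATE bytes: ", deflated_size)
print("2-bit bytes:   ", packed_size)
```

On random input the 2-bit size is already at the entropy limit, so the interesting comparison is how close DEFLATE gets to it, and whether repeats in real sequence let it do better.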

Re: Comp Sci Sins of Biologists

John C — Fri, 27 Jun 2008 03:34:55 -0000

Well folks, we can up the ante a bit: codons!!
On the way, we can note that in RNA, the nucleotide T is replaced by U. To represent all the codons (64 of them, 3 bases each) would require 6 bits, which is the same as representing each of the 3 bases separately at 2 bits apiece. So any effort to compress the representation stops here unless we are only going to look for the 20 amino acids; then we can take a bit off and we are down to five bits. The good stuff for what I write here is at: http://en.wikipedia.org/wik...
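The bit counts above can be checked with a few lines of arithmetic. This is only a sketch of the counting argument; `bits_needed` is a made-up helper name.

```python
def bits_needed(n_symbols: int) -> int:
    """Smallest number of bits whose 2**bits distinct codes cover n_symbols."""
    return (n_symbols - 1).bit_length()


print(bits_needed(4))   # 2 bits per base
print(bits_needed(64))  # 6 bits per codon, i.e. 3 bases x 2 bits
print(bits_needed(20))  # 5 bits per amino acid

# Codes left over in 5 bits after the 20 amino acids,
# available for start/stop markers or an ambiguity code.
print(2 ** 5 - 20)      # 12
```

Note that the 5-bit form is lossy with respect to the DNA: several codons map to the same amino acid, so the original bases cannot be recovered from it.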

In that article you will find this interesting quote: "A comparison may be made with computer science, where the codon is the equivalent of a word, which is the standard "chunk" for handling data (like one amino acid of a protein), and a nucleotide for a bit." Nooo...I am not going to defend this literally, because this communicates figuratively.