Think Gene

a bio blog about genetics, genomics, and biotechnology

How Much Data is a Human Genome? Not Much.

Started by Andrew Yates · 9 months ago

I recently noted in Napster of Medicine that an entire human genome would fit on a music CD.
How much data IS a human genome?

2 bits per base (4 bases = 2^2)
3,080.4 Mb per human genome [1]
700 MB per CD-ROM

(1 human genome) *
(3,080,400,000 bases / 1 h ...
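
The excerpt above is cut off, so the following is not the author's calculation, only a minimal sketch of the same arithmetic from the three figures quoted. It assumes "700 MB" means decimal megabytes and that bases are packed with no overhead; the variable names are made up for illustration.

```python
# Figures quoted in the post; everything else here is an assumption.
BASES = 3_080_400_000        # bases in one human genome, as quoted
BITS_PER_BASE = 2            # 4 bases = 2^2 states
CD_BYTES = 700 * 10**6       # assumes "700 MB" is decimal megabytes

genome_bytes = BASES * BITS_PER_BASE // 8

print(genome_bytes)              # size in bytes
print(genome_bytes / 10**6)      # size in decimal megabytes (MB)
print(genome_bytes / 2**20)      # size in binary mebibytes (MiB)
print(genome_bytes / CD_BYTES)   # number of 700 MB discs needed
```

Under these assumptions the raw 2-bit figure is about 770 MB (roughly 734 MiB), a bit more than one 700 MB disc.
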

17 comments

  • 3 billion is the size of a *haploid* human genome. Since we have two copies of each of our chromosomes (except for men and the sex chromosomes), technically the number of bases is 6 billion, though of course the vast majority of these will be the same between two homologous chromosomes. But I'm sure the compression algorithm would recognize that...
  • Pretty cool. Just 10MB with the right compression. But I always thought it takes 7 bits? Thanks for letting me know. Palonek @ http://www.edwardpalonekblog.ca/
  • neandrothal: Yes and yes. I will make a note of that because it's a good point.

    Palonek: Why 7 bits? That's 2^7 = 128. You may be thinking of ASCII, which is 7 bits, to write the literal letters "A G C T." If ASCII is the encoding your biotech or lab uses for massive DNA files, you are storing over 3.5 times the data (so 3.5 times the bandwidth, 3.5 times the storage, and sometimes 3.5 times the processing power). That's bad.
  • As there are 4 bases, the base ID could be represented in a 2-bit field for each base in a sequence. So, for storage, A G C T could be stored as 00 01 10 11 respectively and then retrieved and converted back to human-readable A G C T.
    Drew's assumption would be what I would do for storing this kind of data as, of course, there is a lot of it.
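
A minimal sketch of the 2-bit packing described in the comment above, using the same A G C T = 00 01 10 11 mapping. The function names `pack` and `unpack` are hypothetical, and the sketch assumes the sequence length is stored separately, since the last byte may be padded.

```python
# Mapping taken from the comment: A G C T -> 00 01 10 11.
CODE = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}
BASE = {bits: letter for letter, bits in CODE.items()}


def pack(seq: str) -> bytes:
    """Pack a string of A/G/C/T into 2 bits per base, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, letter in enumerate(seq):
        # The first base of each group of four goes in the highest two bits.
        out[i // 4] |= CODE[letter] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    return "".join(
        BASE[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(n)
    )


packed = pack("GATTACA")
print(len("GATTACA"), len(packed))  # prints: 7 2
print(unpack(packed, 7))            # prints: GATTACA
```

Seven letters stored one per byte shrink to two packed bytes here; over a whole genome the ratio approaches 4 to 1 against one-byte-per-letter text.
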
  • Your calculations are pretty much correct. The reference human genome still contains some unknown portions, so you need to be able to represent at least one possibility in addition to ACGT. But since you were probably talking about the real human genome, not the current unfinished data, that problem wouldn't apply.

    Using the ".2bit" format, human genome version "hg18" fits into a file listed here as 770 MB.

    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/...

    The format is described here:

    http://genome.ucsc.edu/FAQ/FAQformat#format7

    That page also describes the earlier, and still sometimes useful, "nibble" format, which stores 2 bases per byte.

    In biology, the sequence of ACGT isn't all that contains inherited information. (There are also all the proteins you inherit along with DNA, and DNA methylation, and lots more stuff still to discover.) But I wouldn't know where to start to compute the information content there.
  • Ed, you bring up a very good point about methylation and other proteins on the DNA. If you only care about sequence, these things don't matter, but they definitely influence which genes are active or repressed, and even how active a gene is. I suppose it depends on what you're storing the data for.
  • As for the hope of crunching 770MB down to 700MB, it should be noted that programs like gzip rarely get better compression than 2 bits per nucleotide[1].

    Also, FASTA and other file formats are not primarily used for storage but transport. Almost no bioinformatics program operates directly on ASCII-data, but transforms such exchange formats to some internal representation.

    For the 10MB I guess the author thinks in terms of working on a diff with respect to some reference genome. While that is probably workable for applications on the human genome, it's not really patentable (UNIX patch and diff being older than me and there's probably even older prior art) and impracticable on a general scale. Impracticable because an index for describing any sequence in such a relative way would be far too big, i.e. it would probably require more storage than only transferring the sequences worked on directly.

    [1]
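
The diff-against-a-reference idea raised in the comment above can be sketched as follows. This is a toy under a strong assumption: both sequences have the same length and differ only by single-base substitutions, so insertions and deletions are ignored. The names `snp_diff` and `apply_diff` are hypothetical.

```python
def snp_diff(reference: str, sample: str) -> list[tuple[int, str]]:
    """List (position, base) pairs where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s
    ]


def apply_diff(reference: str, diff: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus the stored differences."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
diff = snp_diff(reference, sample)
print(diff)                                   # prints: [(4, 'T'), (9, 'A')]
print(apply_diff(reference, diff) == sample)  # prints: True
```

Only the differing positions are stored, so the cost depends on how far the sample diverges from the reference, and the recipient must already hold the same reference.
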
  • Thomas: I'd say that most bioinformatics scripts and programs operate on ASCII data; bit-packing the data beforehand is rather the exception, reserved for a few hard-core tools like BLAT/BLAST. Most everyday scripts still parse FASTA into strings and operate on them, I'd say.
  • In truth, human genomes can be more complex than even diploid (think CNV). This is especially true for cancer genomes. You may also want to capture more than one genome in the reference, e.g., you may want to include the variations in dbSNP in your reference. To include just SNPs, you could expand your four-letter alphabet to include all the IUPAC DNA codes. Including indels would be even more complicated. Here are some thoughts on how to represent a complex reference genome.
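
One way to picture the expanded alphabet mentioned above: each IUPAC DNA code names a set of possible bases, so it can be held as a 4-bit mask with one bit per base. The bit assignment and the helper names `merge` and `matches` are assumptions for illustration, not part of any standard file format.

```python
# One bit per base; the particular bit assignment is arbitrary.
A, C, G, T = 1, 2, 4, 8

# IUPAC nucleotide codes as sets of allowed bases.
IUPAC = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G, "Y": C | T, "S": C | G, "W": A | T, "K": G | T, "M": A | C,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}
MASK_TO_CODE = {mask: code for code, mask in IUPAC.items()}


def merge(code1: str, code2: str) -> str:
    """Return the IUPAC code covering every base allowed by either input."""
    return MASK_TO_CODE[IUPAC[code1] | IUPAC[code2]]


def matches(code: str, base: str) -> bool:
    """True if the given base is one of the possibilities the code allows."""
    return bool(IUPAC[code] & IUPAC[base])


print(merge("A", "G"))    # prints: R
print(matches("R", "C"))  # prints: False
```

At 4 bits per position this doubles the storage of the plain 2-bit encoding, and it still says nothing about indels or copy number.
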
  • For storing the methylation information you will need another bit for each base. This makes it 3 CDs :-)
  • Now, please do take a look at your fingertips. You'll see the fine lines of your fingerprint pattern. It is unique, and can be used to identify a human; so fine structures, and even much finer ones, are defined in your organism.
    Now, how large would just the 3D positional information needed to describe a human be?
    You would need to position single cells, define the inner structure of particular cell types, describe the form of single nerve cells (dendrites), etc.
    Now how many cells are there in the human organism?
    Without any calculation, we can see that the quantity of information needed to describe a human runs to uncounted terabytes. Human chromosomes contain, as calculated here, 740 MB.
    So why, for God's sake, do we believe that the whole of our hereditary information resides in the genes?
  • Because it's patterns. It's 1 genome per cell, and it's instructions that may say "keep making these until chemical_gradient_$c-32 falls below threshold x and then stylize them based on concentration of chemical_gradient_^f-03"

    Well, maybe that's a bit hard to follow so let's try this instead: how many hairs are on your arm? Well, I don't really care about the particular number but what I want to know is if you had the same number of hairs on your arm when you were a child, and I mean the three foot tall variety.

    No, no you didn't. You had many fewer BUT they were about the same distance apart. Now, I'm sure you know that your arms don't just grow at the ends- there's a lot of growth in the middle and it's more or less continuous... but how could you add new hairs evenly spaced in that?

    Well, it's simple. Much like our DNA, you just need two values to keep track of it (though it's not really bits, it's not THAT simple). You need a protein that causes hairs to grow and you need a protein that prevents them from growing. Like a lot of things in our body, the protein that prevents hair from growing just stops cells from making the protein that promotes hair formation, but the promoting protein promotes the preventer and promotes itself. There's another trick though. The preventer moves around between cells much more easily than the promoter.

    No need to do mental gymnastics here, I'll just state the end result: cells in high concentration of the promoter make enough of it to overcome the effects of the preventer and low concentrations just pool up on the preventer... up to a point. If there aren't any hairs close enough to prevent another from growing they don't have enough of the preventer so the promoter takes over and gives you another hair.

    A similar set up is also used to make sure you don't grow two heads. In fact this kind of thing is used so often that we can safely say the information used to build your body is many many times smaller than the actual information it would take to record the current state of your body.

    If you're much of a programmer, you know how just a few lines of code (the file might end up being a few KB if you didn't care about making it really small) could produce an image many gigabytes in size, if you had some reason to let it make a large enough image.

    Don't get me wrong though. There is more to us than our DNA.
    Our DNA basically lays out the boundaries of what we can possibly grow to be and the environment we grow in narrows it down until we reach that single possibility that is ultimate "you."
  • 740MB is the size of a human haploid nucleotide base string, not the data necessary to describe a mature human.

    We believe that most of our hereditary information resides in genes because it does. However, a genome, as you say, cannot possibly fully describe a mature human. A genome is more like a brief mathematical equation used to produce beautifully complex fractal design when fed with ambient noise and interpreted as colors and coordinates on a screen.
  • I am trying to draw your attention to this: the human, just like any other organism, has its qualities determined, and their description must then reside somewhere. The amount of information needed to describe a human organism is enormous; the amount of information carried by the genes is very limited in comparison.
    Now, let's take a look at this possible analogy.
    Imagine you are demonstrating a PC to someone who has no idea of computers whatsoever, and has never seen one. (Increasingly difficult to find, but there must still be some around :)
    OK, you show him how inputs on the keyboard produce results on the screen. Knowing nothing about the PC under the desk, our computer novice has to think that the keyboard alone causes all the fascinating happenings on the screen.
    Now our virtuous genetics has got hold of the keyboard - genes; making changes there changes the organism. But how, for God's sake, does it follow that all the hereditary information resides there, and not on some 'HD' somewhere, away from the 'keyboard'?
    I am simply pointing out that the 'keyboard' has practically no data storage capacity for the task.

    'We believe that most of our hereditary information resides in genes because it does. '

    Oh, pardon the heresy involved, but I really don't know how you know that.
  • My PC only needs a tiny little fractal program to generate a fractal world of incredible complexity and beauty (e.g. the Mandelbrot set). Two identical program runs will produce identical outputs, unless I introduce a bit of noise. The earlier the noise, the wider the divergence. Hence identical twins have differing fingerprints.

    So I can believe the small numbers quoted.
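
To make the "tiny program, huge output" point concrete, here is a minimal sketch of a text-mode Mandelbrot renderer. The function name, the viewing window, and the iteration cap are all arbitrary choices for illustration; the program stays the same size while the output grows with `width * height`.

```python
def mandelbrot_rows(width: int, height: int, max_iter: int = 50):
    """Yield one text row at a time; '#' marks points that never escaped."""
    for y in range(height):
        row = []
        for x in range(width):
            # Map the pixel to a point c in the window [-2, 1) x [-1, 1).
            c = complex(-2.0 + 3.0 * x / width, -1.0 + 2.0 * y / height)
            z = 0j
            n = 0
            while abs(z) <= 2.0 and n < max_iter:
                z = z * z + c
                n += 1
            row.append("#" if n == max_iter else ".")
        yield "".join(row)


for line in mandelbrot_rows(60, 20):
    print(line)
```

Calling `mandelbrot_rows(60000, 20000)` would use exactly the same code to produce over a billion characters, which is the sense in which a short description can unfold into a very large result.
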
  • Neandrothal is right, but if you know half the code can't you get the second half because G attaches to C and A attaches to T?
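
The base-pairing rule in the question above (G pairs with C, A pairs with T) can be sketched in a few lines. The function name is hypothetical. Note that this rule recovers the opposite strand of the same DNA molecule; the two copies of each chromosome, one from each parent, are separate molecules and cannot be derived from each other this way.

```python
# Each base maps to its pairing partner: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return strand.translate(COMPLEMENT)[::-1]


print(reverse_complement("GATTACA"))  # prints: TGTAATC
```

Because the second strand is fully determined by the first, storing it adds no information, which is why genome sizes are quoted in base pairs counted once.
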
  • How DARE you patent it. You should share that with the world freely as a show of good faith. To do otherwise is reprehensible. Patent and copyright are the bane of our legal system at the moment. They stifle community and promote individualistic gain at the expense of the greater good of the community.
