-
Website
http://www.thinkgene.com -
Original page
http://www.thinkgene.com/how-much-data-is-a-human-genome/ -
Palonek: Why 7 bits? That's 2^7 = 128. You may be thinking of ASCII, which is 7 bits, to write the literal letters "A G C T." If ASCII is the encoding your biotech or lab uses for massive DNA files, you are using over 3.5 times the data that 2 bits per base would need (so 3.5 times the bandwidth, 3.5 times the storage, and sometimes 3.5 times the processing power). That's bad.
Drew's assumption is what I would use for storing this kind of data since, of course, there is a lot of it.
Using the ".2bit" format, human genome version "hg18" fits into a file listed here as 770 MB.
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/...
The format is described here:
http://genome.ucsc.edu/FAQ/FAQformat#format7
As well as the earlier, and still sometimes useful, "nibble" format that used 2-bases per byte.
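To make the 2-bits-per-base arithmetic concrete, here is a minimal sketch in Python. It is not the actual ".2bit" reader or writer: the code table is chosen for illustration, the sketch ignores unknown bases ("N") and masking, and the base count of roughly 3.08 billion is an assumption back-calculated from the 770 MB figure quoted above.

```python
# Illustrative 2-bit codes for this sketch (an assumption, not a spec).
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads it first.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Reverse of pack(); n_bases trims the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


def size_mb(n_bases: int, bits_per_base: int) -> float:
    """Storage in megabytes (10^6 bytes) for a given encoding width."""
    return n_bases * bits_per_base / 8 / 1e6


seq = "GATTACAGATTACA"
packed = pack(seq)

# Assumed base count, back-calculated from the 770 MB file size.
APPROX_BASES = 3_080_000_000
two_bit_mb = size_mb(APPROX_BASES, 2)   # 2 bits per base
nibble_mb = size_mb(APPROX_BASES, 4)    # "nibble": 2 bases per byte
ascii_mb = size_mb(APPROX_BASES, 8)     # one ASCII character per byte
```

The 14-base test string packs into 4 bytes, and the same ratio is what separates a 2-bit file from a one-character-per-base text file.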
In biology, the sequence of ACGT isn't all that contains inherited information. (There is also all the proteins you inherit along with DNA, and DNA methylation, and lots more stuff still to discover.) But I wouldn't know where to start to compute the information content there.
Also, FASTA and other file formats are not primarily used for storage but transport. Almost no bioinformatics program operates directly on ASCII-data, but transforms such exchange formats to some internal representation.
For the 10MB I guess the author thinks in terms of working on a diff with respect to some reference genome. While that is probably workable for applications on the human genome, it's not really patentable (UNIX patch and diff being older than me and there's probably even older prior art) and impracticable on a general scale. Impracticable because an index for describing any sequence in such a relative way would be far too big, i.e. it would probably require more storage than only transferring the sequences worked on directly.
Now, how large would just the 3D positional information needed to describe a human be?
You would need to position single cells, define the inner structure of particular cell types, describe the form of single nerve cells (dendrites), etc.
Now how many cells are there in the human organism?
Without any calculation, we can see that the quantity of information needed to describe a human runs to uncounted terabytes. Human chromosomes contain, as calculated here, 740 MB.
So why, for God's sake, do we believe that the whole of our hereditary information resides in the genes?
Well, maybe that's a bit hard to follow so let's try this instead: how many hairs are on your arm? Well, I don't really care about the particular number but what I want to know is if you had the same number of hairs on your arm when you were a child, and I mean the three foot tall variety.
No, no you didn't. You had many fewer BUT they were about the same distance apart. Now, I'm sure you know that your arms don't just grow at the ends- there's a lot of growth in the middle and it's more or less continuous... but how could you add new hairs evenly spaced in that?
Well, it's simple. Much like our DNA, you just need two values to keep track of it (though it's not really bits, it's not THAT simple). You need a protein that causes hairs to grow and you need a protein that prevents them from growing. Like a lot of things in our body, the protein that prevents hair from growing just stops cells from making the protein that promotes hair formation, but the promoting protein promotes the preventer and promotes itself. There's another trick though. The preventer moves around between cells much more easily than the promoter.
No need to do mental gymnastics here, I'll just state the end result: cells in high concentration of the promoter make enough of it to overcome the effects of the preventer and low concentrations just pool up on the preventer... up to a point. If there aren't any hairs close enough to prevent another from growing they don't have enough of the preventer so the promoter takes over and gives you another hair.
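The spacing rule described above can be caricatured in a few lines of Python. This is a toy model, not real biochemistry: instead of simulating the two proteins, it assumes each hair suppresses new hairs within a fixed distance (`inhibit_range`, a made-up parameter), and a new hair appears wherever a gap grows too wide to be suppressed from either side.

```python
def grow(hairs, factor, inhibit_range):
    """Stretch the arm by `factor`, then fill any gap too wide to inhibit."""
    stretched = [x * factor for x in hairs]
    result = [stretched[0]]
    for x in stretched[1:]:
        start = result[-1]
        gap = x - start
        # Halve the gap until every point in it is within range of a hair.
        pieces = 1
        while gap / pieces > 2 * inhibit_range:
            pieces *= 2
        for k in range(1, pieces + 1):
            result.append(start + gap * k / pieces)
    return result


def gaps(hairs):
    """Distances between neighbouring hairs."""
    return [b - a for a, b in zip(hairs, hairs[1:])]


INHIBIT_RANGE = 0.75                       # hypothetical units
hairs = [float(i) for i in range(11)]      # a "child's arm": 11 hairs, 1 apart
initial_count = len(hairs)

for _ in range(20):                        # 20 growth steps of 10% each
    hairs = grow(hairs, 1.1, INHIBIT_RANGE)

final_gaps = gaps(hairs)
```

The arm ends up several times longer with many more hairs, yet the spacing stays between one and two inhibition ranges the whole time. Two numbers (the range and the growth factor) stand in for what would otherwise be a list of every hair position.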
A similar set up is also used to make sure you don't grow two heads. In fact this kind of thing is used so often that we can safely say the information used to build your body is many many times smaller than the actual information it would take to record the current state of your body.
If you're much of a programmer, you know how just a few lines of code (the file might end up being a few KB if you didn't need it really small) could produce an image many gigabytes in size, if you had some reason to let it make a large enough image.
Don't get me wrong though. There is more to us than our DNA.
Our DNA basically lays out the boundaries of what we can possibly grow to be and the environment we grow in narrows it down until we reach that single possibility that is ultimate "you."
(Like in a compressed image: ... 255, write the white pixel 1244 times in this line, and then gray 110, 86 times ...)
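That kind of "write this pixel N times" scheme is run-length encoding. A minimal sketch in Python, using the pixel values and counts from the example above (the function names are just for illustration):

```python
from itertools import groupby


def rle_encode(pixels):
    """Collapse each run of identical values into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original pixel list."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# One image row: 1244 white pixels (255) followed by 86 gray pixels (110).
row = [255] * 1244 + [110] * 86
runs = rle_encode(row)
restored = rle_decode(runs)
```

Here 1330 pixel values shrink to two pairs, but only because the row is so repetitive; a row of noise would not shrink at all.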
Data compression is the least that we can expect from the so obviously ingenious nature of living things. I certainly do not expect the structure of a skin cell to be written down as many times as there are cells, or mitochondria described x times, etc.
One time is enough. You still have to position the cells precisely along the lines of the fingerprints, and I do not even want to mention the brain. The exact position of a single hair, 2 mm right or left, might be of low priority to the organism, but the way the nerve cells are connected certainly isn't.
Tissues and cells have relative spatial positions and shapes. With all the compression the hereditary information is presumably subjected to (however and wherever it is stored), the size of the 'file' must still be enormous.
The human brain alone is said to have 100 billion (10^11) cells, and a multiple of that number in dendrites that realize the complex brain circuitry through synapses with other cells. Even if we take into account the certain existence of 'typical' circuits, the amount of information needed to describe the brain remains mind-boggling.
Even on the cell level, numerous cell types have a very complex internal life with very intricate and ingenious internal chemical regulation and metabolism. This exists in a scaled-up form on the tissue, organ and organism level, too.
It should strain any informed credulity a bit that even the structure and functioning of the cell types in the human organism could be described with 740 MB, with the best compression methods thinkable.
You probably mean fractals when you mention generating images with simple algorithms. Nature certainly uses fractal-like shapes (broccoli, flowers, etc.) where it suits the function (nature simply uses everything :), but try describing the wing profile of a bird, or brain circuitry for that matter, with a fractal. The exact topology and shape of the last two are crucial to the function and cannot be left to a will-o'-the-wisp fractal. That is why you cannot recognize it, the innumerable recursions of a simple form, in the design of, say, a human skull. Trying to describe it mathematically, we would end up with anything but simple fractal formulas.
Why don't we start with something simpler, say fractalizing the shape of a ship's hull, or compressing a song recording to a fractal, before trying it on the forms of the higher organisms?
The genetics has a problem here, and a big one , too: the location of the greatest part of the hereditary information is not known.
Mentioning fractals to explain this looks to me akin to hoping for a miracle. What cannot be explained is mysterious; rationalism abhors the mysterious and any suggestion that unknown things may exist. Interestingly, and typically for our age of reason, you search for the solution in a field we know something about: fractals. Indeed, is there anything that we, Descartes' grandchildren, do not know?
Wouldn't it be simpler to say 'We do not know', accepting that the answer may lie in the mysterious realm of the unknown?
Socrates would have liked that answer better, I am sure.
Denying the ignorance, one never starts searching for the answer.
We believe that most of our hereditary information resides in genes because it does. However, a genome, as you say, cannot possibly fully describe a mature human. A genome is more like a brief mathematical equation used to produce beautifully complex fractal design when fed with ambient noise and interpreted as colors and coordinates on a screen.
Now, let's take a look at this possible analogy.
Imagine you are demonstrating a PC to someone who has no idea of computers whatsoever and has never seen one. (Increasingly difficult to find, but there must still be some around :)
OK, you show him how inputs on the keyboard produce results on the screen. Knowing nothing about the PC under the desk, our computer novice has to think that the keyboard alone causes all the fascinating happenings on the screen.
Now our virtuous genetics has got hold of the keyboard: the genes. Making changes there changes the organism. But how, for God's sake, does it follow that all the hereditary information resides there, and not on some 'HD' somewhere, away from the 'keyboard'?
I am simply pointing out that the 'keyboard' has practically no data storage capacity for the task.
'We believe that most of our hereditary information resides in genes because it does. '
Oh, pardon the heresy involved, but I really don't know how you know that.
So I can believe the small numbers quoted.
Fractal pictures are generated through recursive use of an equation; it appears innumerable times in the created picture.
In addition to the simple formula, creating such a picture needs a lot of processing power to apply the formula n times. Just imagine calculating a fractal by hand.
The same is true of compression; in general, the greater the compression ratio, the more processing power is needed to create the compressed file or to reconstruct the original one.
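The trade-off between a short formula and the work needed to apply it can be seen in a small sketch. This renders the Mandelbrot set (chosen here only as a familiar example of a fractal from a one-line formula) and counts how many times the formula `z = z*z + c` is applied; the image size and iteration limit are arbitrary.

```python
def mandelbrot(width, height, max_iter):
    """Return (image, total_steps): escape counts per pixel and total work."""
    total_steps = 0
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            # Map the pixel to a point c in the region [-2, 1] x [-1, 1].
            c = complex(-2.0 + 3.0 * px / width, -1.0 + 2.0 * py / height)
            z = 0j
            n = 0
            # The entire "formula": repeat z = z*z + c until z escapes.
            while n < max_iter and abs(z) <= 2.0:
                z = z * z + c
                n += 1
            total_steps += n
            row.append(n)
        image.append(row)
    return image, total_steps


WIDTH, HEIGHT, MAX_ITER = 60, 40, 200
image, total_steps = mandelbrot(WIDTH, HEIGHT, MAX_ITER)
steps_per_pixel = total_steps / (WIDTH * HEIGHT)
```

The program is a few hundred bytes, but even this tiny 60 by 40 image applies the formula many times per pixel on average, and the cost scales with the number of pixels and the iteration limit.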
Now, chromosomes are believed to contain all the hereditary information of an organism. They contain a very small quantity of information to describe organisms of enormous complexity, and consequently some kind of data compression must be at work here, at ratios like millions to one. Let's allow even fractals as a compression method for the main forms and topologies of higher mammals, unlikely as it may seem. Whichever way such unimaginably high compression is achieved, enormous processing power is necessary to 'read' the stored information.
Can any such processing power be identified in the cell, say the fertilized egg cell, or in any cells and tissues in the later embryonic stages?
That means, if we want fantastic compression, or even fractals, as an explanation for the 'missing memory' in the cell, we are confronted with the 'missing processor' problem :). Where is it? Wouldn't it have been easier to confess 'We don't know' in the first place?
More broadly formulated: does anyone know, even in rough outline, how the information from the chromosomes is transferred to and realized in the concrete shapes and forms of organisms and their sub-structures? An answer describing how proteins are synthesized on the basis of the genes' information would not tell me much about what I am really asking here.
If the answer is no, what could this idea on the fractals in genetics be but a vague hypothesis without any causal content? It is not enough to say 'Fractals do it!'. How do they do it? Or how does anything else do it? I do not think anyone could answer these questions today.
Early in the 19th century, to explain the energy source of the sun, it was proposed that the sun is a heap of burning coal. The first idea that could come to one's mind at the time of the industrial revolution was obviously the ubiquitous coal, powering its furnaces and steam engines :). The hypothesis was taken seriously at first, only to be abandoned soon afterwards for its obvious inadequacy. Of course, nothing could have been known at the time about the nuclear processes in the sun, not even in roughest outline. The real explanation came more than a hundred years later.
I have a feeling that a similarly big chunk of knowledge is missing in the biology of today for a viable explanation of many important aspects of life, including the mechanisms of the transfer of hereditary information. We make do with coal-heap explain-it-away theories instead of a sincere and brave 'We don't know.'
As for your claim of compressing this down to 10 megabytes, this is completely unrealistic. Here is a 2008 paper estimating the entropy rate of the human genome:
http://www.biomedcentral.com/1471-2164/9/509
They come up with about 1.8 bits per base pair. Compared with the 2 bits per base of a plain 2-bit encoding, that means that even with an optimal compression algorithm, the best you can hope for is compression on the order of 10%.
Unless, of course, your compression "algorithm" itself contains 750 megabytes of data, and will only write out the differences of your genome to some reference genome. In this case, you can hope for "compression" by 99.5%, or down to a couple of megabytes. But this isn't "compression", it is transfer of information from the "data file" to the "program file".
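The "differences to a reference" idea can be sketched as follows. This is a deliberately simplified illustration: it handles only single-base substitutions between sequences of equal length (real genomes also differ by insertions and deletions), and the sequences and function names are made up for the example. The last lines just redo the arithmetic from the figures quoted above.

```python
def diff_against_reference(reference, genome):
    """List (position, base) for every position where genome differs."""
    if len(reference) != len(genome):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, genome)) if a != b]


def apply_diff(reference, diff):
    """Rebuild the genome; note that the full reference is required."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


reference = "GATTACAGATTACAGATTACA"   # made-up reference sequence
genome = "GATTACAGCTTACAGATTACT"      # same, with two substitutions
diff = diff_against_reference(reference, genome)
rebuilt = apply_diff(reference, diff)

# Arithmetic from the figures in the comment above.
entropy_saving = 1 - 1.8 / 2          # about 0.10, i.e. roughly 10%
diff_size_mb = 750 * (1 - 0.995)      # about 3.75 MB left after 99.5%
```

The diff is tiny, but `apply_diff` cannot run without the whole reference, which is the point being made: the information has moved into the "program", not disappeared.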
If you think that 800 MB is "not much", well sure, you can store your genome on your iPod nano. Your body, however, stores it in each cell nucleus. This is data storage at the molecular level, far beyond the reach of our current technology. And then the information doesn't just sit there, but is being actively processed within the cell nucleus, a structure a few micrometers in size. This is beyond any realistic scope of human-made nanotechnology and will remain so for many years.