Saturday, March 16, 2013

"Big Data and Genetics: 1000 Genomes Project"






The study of genetics with the goal of curing disease is quite an interesting field nowadays, and most people may not realize that data mining is a tool used extensively in such studies. Currently, there is an international collaboration among many researchers known as the "1000 Genomes Project." It involves a free and open data set of roughly 200 terabytes (that's absolutely humongous) of genetic information from about 2,500 people from 25 different populations.
The project's data set is by far the largest free and open data set of genetic information available to the public. From what I can gather reading over the organization's home page, the main goal of the project is to pinpoint genetic variations that occur in at least one percent of the populations in the data set. Primarily, this is done through genetic sequencing. I won't go too far into explaining genetic sequencing other than to say that it determines the order of nucleotides (the building blocks of DNA) along a DNA molecule. The genetic code of any living thing is based on the sequence of nucleotides contained within its DNA. (There are actually only four kinds of nucleotides. Crazy, huh? You'd think it would be more complicated than that.)
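To make the one-percent idea a bit more concrete, here is a small Python sketch of my own. It is not the project's actual pipeline: the function names, the made-up sample data, and the choice to measure frequency as a share of observed chromosome copies are all my assumptions. It just counts how often each nucleotide shows up at a single position and keeps the non-reference ones that reach the threshold.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return each allele's share of all observed copies at one position."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def common_variants(alleles, reference, threshold=0.01):
    """Keep non-reference alleles seen in at least `threshold` of the copies."""
    freqs = allele_frequencies(alleles)
    return {a: f for a, f in freqs.items() if a != reference and f >= threshold}


# Made-up data (an assumption for illustration): 200 chromosome copies
# observed at one position, where "A" is treated as the reference allele.
site = ["A"] * 195 + ["G"] * 4 + ["T"] * 1

print(common_variants(site, reference="A"))  # {'G': 0.02}
```

In this toy example "G" appears in 4 of 200 copies (2 percent) and passes the cutoff, while "T" appears only once (0.5 percent) and is dropped.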
DNA sequencing is required to understand the genome of a person or any other living thing. If you don't know what a genome is, it is essentially the entirety of an organism's hereditary genetic information. The project homepage goes on to say that an individual's DNA has to be sequenced about 28 times to get a complete picture of that individual's genome.
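My reading of "sequenced about 28 times" is that it refers to sequencing depth, or coverage: on average, each position in the genome is read about 28 times. Under that assumption, the arithmetic can be sketched in a few lines of Python. The genome length of roughly 3 billion bases and the 100-base read length below are round numbers I picked for illustration, not figures from the project.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Average number of times each base is read."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Illustrative assumptions, not project figures:
GENOME_LENGTH = 3_000_000_000  # rough size of the human genome in bases
READ_LENGTH = 100              # hypothetical length of one sequencing read

n = reads_needed(28, READ_LENGTH, GENOME_LENGTH)
print(n)                                        # 840000000
print(coverage(n, READ_LENGTH, GENOME_LENGTH))  # 28.0
```

With these round numbers, getting to 28 times coverage for a single person takes on the order of 840 million reads, which helps explain how the whole data set ends up in the hundreds of terabytes.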
I followed the data set to Amazon Web Services, and it turns out there are actually some MapReduce algorithms specifically tailored to genomic work (CloudBurst, for example, which maps sequencing reads to a reference genome). I will look over and explain those in my next blog post. Consider this post an introduction to the topic.
Please check out this video on the 1000 Genomes Project, and I will also have a link to the homepage of the organization posted below.



