Analytics and Visualization of Big Data: 1000 Genomes Project Part 2: Genetic Sequencing Alignment and Similarity

As I mentioned in my last post, the 1000 Genomes project is an international collaboration to develop the most comprehensive map of the genetic differences between people that has ever existed. I also mentioned that the data set is freely available on Amazon Web Services as a public data set, with over 200 terabytes of genetic information from at least 2500 people from 26 different populations around the world. Originally I was going to do a post on the CLOUDBURST genetic sequencing algorithm, but since I didn’t really know exactly what a sequence is, or what sequence alignment is and what it does, I decided to do this post on sequence alignment and similarity between genomic sequences. I know this post may seem kind of out there as far as our data mining course goes, but if you read through it you will find that genomic similarity is very similar to the similarity concept discussed in class. Pun intended. So, here we go.

A sequence in DNA is simply a string of letters representing the four basic nucleotides (adenine, cytosine, guanine, and thymine) in the order in which they occur on a DNA helix. The letters are A, C, G, and T respectively. Sometimes the nucleotide at a position is not known exactly, or could be any one of several; such a position is written with a letter other than the four above. Here is a complete list of the letters.

A = adenine
C = cytosine
G = guanine
T = thymine
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C (strong bonds)
W = A T (weak bonds)
B = G T C (all but A)
D = G A T (all but C)
H = A C T (all but G)
V = G C A (all but T)
N = A G C T (any)
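
To make the list concrete, here is a minimal Python sketch that stores each letter together with the set of nucleotides it can stand for. The names IUPAC_CODES, is_valid_sequence, and codes_compatible are my own, chosen just for illustration; they do not come from any genomics library.

```python
# Each code letter from the list above, mapped to the bases it may represent.
IUPAC_CODES = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"G", "A"},            # purine
    "Y": {"T", "C"},            # pyrimidine
    "K": {"G", "T"},            # keto
    "M": {"A", "C"},            # amino
    "S": {"G", "C"},            # strong bonds
    "W": {"A", "T"},            # weak bonds
    "B": {"G", "T", "C"},       # all but A
    "D": {"G", "A", "T"},       # all but C
    "H": {"A", "C", "T"},       # all but G
    "V": {"G", "C", "A"},       # all but T
    "N": {"A", "G", "C", "T"},  # any
}


def is_valid_sequence(seq):
    """Return True if every character of seq is one of the letters above."""
    return all(ch in IUPAC_CODES for ch in seq.upper())


def codes_compatible(x, y):
    """Return True if the two code letters share at least one possible base."""
    return bool(IUPAC_CODES[x.upper()] & IUPAC_CODES[y.upper()])
```

For example, codes_compatible("R", "A") is true because R can stand for A, while codes_compatible("R", "Y") is false because purines and pyrimidines have no base in common.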

Any particular sequence may contain information that could explain certain genetic characteristics of a particular person or other living thing.

Now that I have discussed what a sequence is as it pertains to genetics, it is important to discuss what sequence alignment is (in order to meet our goal of understanding the CLOUDBURST algorithm). Fortunately, I found an excellent YouTube video on sequence alignment and on measuring similarity, a topic from our big data class, between a pair of genomic sequences.

Similarity as discussed in class was about finding items that have a large fraction of their “market-basket” in common. In genomics, similarity is like finding a pair of genetic sequences that have a large fraction of their nucleotides in common. Here is an excellent video that talks about sequence alignment and the similarity concept as it pertains to genomics.
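
As a rough illustration of "a large fraction of their nucleotides in common", here is a small Python sketch that compares two sequences position by position. It assumes the two sequences are already lined up and have the same length, which real data usually does not guarantee; the function name fraction_identical is just something I made up for this post.

```python
def fraction_identical(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 0.0
    # Count the positions where both sequences carry the same letter.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


# GATTACA and GACTATA agree at 5 of their 7 positions.
similarity = fraction_identical("GATTACA", "GACTATA")
```

This is only a stand-in for the idea. Once the sequences have different lengths, you need an actual alignment before a comparison like this makes sense.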

From the video, similarity as it pertains to genomics is a measure of how well two genomic sequences "align" with each other. Two genomic sequences that align very well can be considered "similar", depending on the margin of error that can be tolerated.

Aha, error! From the video, essentially two types of errors occur during sequence alignment: either there is a "gap" in a sequence, or there is a "mismatch" of nucleotide characters at a particular location. A "gap", from my take on the video, may occur when two genomic sequences of different lengths are being compared and a blank must be inserted into the shorter sequence so that its following characters line up with matching characters in the longer one. A "mismatch" occurs when the two characters at an aligned position differ. That one is pretty simple to explain, but if you watch the video it will be far more apparent what I am talking about.
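
To show the two error types side by side, here is a short Python sketch that counts gaps and mismatches in a pair of sequences that have already been aligned. I am assuming the common convention of writing a gap as "-"; the function name count_errors is my own.

```python
def count_errors(aligned_a, aligned_b, gap="-"):
    """Count (gaps, mismatches) between two aligned, equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    gaps = 0
    mismatches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gaps += 1        # one sequence has a blank at this position
        elif a != b:
            mismatches += 1  # both have a letter, but the letters differ
    return gaps, mismatches


# GATTACA
# GACT-CA   -> one gap (position 5) and one mismatch (T vs C at position 3)
errors = count_errors("GATTACA", "GACT-CA")
```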

Essentially, a sequence alignment algorithm takes two genomic sequences as inputs and considers the different ways of aligning them, assigning a penalty or "cost" to every mismatch or gap that occurs. The cost of each type of error is a predefined value. A good sequence alignment algorithm should output the alignment with the lowest total cost. When the alignment is computed with the Needleman-Wunsch algorithm, a classic dynamic programming method for aligning two whole sequences, this value is often called the Needleman-Wunsch score. The score gives a quick measure of the degree of similarity between two genomic sequences.
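
Here is a minimal Python sketch of that idea, written in the lowest-cost form described above. It is a Needleman-Wunsch style dynamic program, not a tuned implementation: the function name alignment_cost and the default costs of 1 per mismatch and 1 per gap are my own assumptions, and the algorithm is often stated the other way around, as maximizing a similarity score.

```python
def alignment_cost(seq_a, seq_b, mismatch_cost=1, gap_cost=1):
    """Lowest total cost of globally aligning seq_a with seq_b."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # cost[i][j] = best cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap_cost      # seq_a prefix against nothing: all gaps
    for j in range(1, cols):
        cost[0][j] = j * gap_cost      # seq_b prefix against nothing: all gaps
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else mismatch_cost),  # pair the letters
                cost[i - 1][j] + gap_cost,                            # gap in seq_b
                cost[i][j - 1] + gap_cost,                            # gap in seq_a
            )
    return cost[-1][-1]


# GATTACA vs GATACA: one gap is enough, so the cost is 1 with the default costs.
score = alignment_cost("GATTACA", "GATACA")
```

A lower value means the two sequences are more similar, and a value of 0 means they are identical. Changing mismatch_cost and gap_cost changes which alignment counts as best, which is why those values have to be fixed ahead of time.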

In my next post I will continue discussing aspects of the 1000 Genomes project as it pertains to big data, and hopefully I will get into how to use the CLOUDBURST genomic sequencing algorithm in AWS.

Just a closing note: it is interesting to think that there is actually data explicitly encoded inside your body that perhaps defines who you are, or who we are as a species. Genetics seems to be a field where chaos, biology, computer science, and mathematics can meet. And with emerging fields like big data, perhaps one day we will bridge these gaps.

-Wade

Resources:

http://en.wikipedia.org/wiki/Sequence_alignment

http://en.wikipedia.org/wiki/Nucleotides

http://en.wikipedia.org/wiki/Nucleic_acid_sequence


Thursday, March 21, 2013
