Hidden Markov models have been successfully applied to a variety of
problems in molecular biology, ranging from alignment problems to gene finding and
annotation. Let us consider the problem of finding CpG-islands in the human genome. Since
there is a relatively high chance that methyl-C will mutate to T except at the CpG-islands in the
promoter regions of genes, we see more CpG pairs in the CpG-islands than elsewhere.
Therefore, CpG-islands are useful markers for genes in human and some other organisms.
The question is how to determine whether a segment of genome sequence comes from a CpG-island. For
instance, consider the DNA sequence
AGCGCGATC. Apply an HMM that assumes the following transition probability matrix for the two states,
CpG-island (denoted by +) and non-CpG-island (denoted by −), where rows give the left state and
columns give the right state:

                  Right
  Left          +       −
    +          0.8     0.2
    −          0.9     0.1

The emission probability matrix is assumed to be:

  State         A       C       G       T
    +          0.20    0.35    0.26    0.19
    −          0.35    0.15    0.10    0.40

These probabilities are made up here. In practice, they can be estimated through learning from
training data of known genes. Calculate the probability of observing the DNA sequence
AGCGCGATC
under the probable state path − + + + + + − − − for the assumed model. What is that probability under the path + + + + + − − − −?
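
The calculation asked for here is a product of one emission term per position and one transition term per adjacent pair of states. A minimal Python sketch of that product follows. It uses the transition and emission values from the tables above, with ASCII "-" standing in for the − state. The text gives no initial state distribution, so the sketch assumes a uniform one (0.5 for each state); that choice scales both path probabilities by the same factor and does not change their ratio. The names TRANS, EMIT, INITIAL and path_probability are illustrative, not from the source.

```python
# Transition probabilities from the text: TRANS[left_state][right_state].
TRANS = {
    "+": {"+": 0.8, "-": 0.2},
    "-": {"+": 0.9, "-": 0.1},
}

# Emission probabilities from the text: EMIT[state][nucleotide].
EMIT = {
    "+": {"A": 0.20, "C": 0.35, "G": 0.26, "T": 0.19},
    "-": {"A": 0.35, "C": 0.15, "G": 0.10, "T": 0.40},
}

# ASSUMPTION: the text does not give an initial distribution; uniform is used.
INITIAL = {"+": 0.5, "-": 0.5}


def path_probability(seq, path, initial=INITIAL):
    """Joint probability of emitting `seq` along the state path `path`."""
    if len(seq) != len(path) or not seq:
        raise ValueError("seq and path must be non-empty and of equal length")
    # First position: initial state probability times its emission.
    prob = initial[path[0]] * EMIT[path[0]][seq[0]]
    # Each later position: transition from the previous state, then emission.
    for i in range(1, len(seq)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][seq[i]]
    return prob


seq = "AGCGCGATC"
p1 = path_probability(seq, "-+++++---")
p2 = path_probability(seq, "+++++----")
print(f"path 1: {p1:.4e}")
print(f"path 2: {p2:.4e}")
print(f"ratio : {p1 / p2:.1f}")
```

For long sequences the product underflows quickly, so a practical version would sum log-probabilities instead of multiplying raw values.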