#maths
Given a Hidden Markov Model with known parameters (initial state distribution, transition probabilities, emission probabilities), we can efficiently calculate the likelihood of seeing a sequence of observations using the forward algorithm.
Let's say we observe a two-symbol sequence $O = (o_1, o_2)$. What is the likelihood of this observation sequence, given our hidden Markov model?
The naive approach is to enumerate all possible hidden state sequences and sum their probabilities. That is:
$$P(O) = \sum_{q} P(O \mid q)\,P(q)$$
Where:
- $q = q_1, q_2, \ldots, q_T$ represents a sequence of states through the HMM that generates the observations $O$.
- $P(O \mid q)$ is the probability of observing the sequence $O$ given the state sequence $q$, which is the product of the emission probabilities for each observed symbol given the state at that step: $\prod_{t=1}^{T} P(o_t \mid q_t)$.
- $P(q)$ is the probability of the state sequence $q$, which is the initial state probability times the product of the transition probabilities for each transition in the sequence: $P(q_1) \prod_{t=1}^{T-1} P(q_{t+1} \mid q_t)$.
- The sum is taken over all possible state sequences $q$ of length $T$.
We can expand the equation to:
$$P(O) = \sum_{q}\left[\left(\prod_{t=1}^{T} P(o_t \mid q_t)\right)\left(P(q_1) \prod_{t=1}^{T-1} P(q_{t+1} \mid q_t)\right)\right]$$
For $O = (o_1, o_2)$ and three hidden states $s_1, s_2, s_3$, the state sequences $q$ that can give rise to $O$ are:

$$q \in \{(s_1, s_1),\ (s_1, s_2),\ (s_1, s_3),\ (s_2, s_1),\ (s_2, s_2),\ (s_2, s_3),\ (s_3, s_1),\ (s_3, s_2),\ (s_3, s_3)\}$$
The number of possible hidden state sequences is $n^r$, where $n$ is the number of states and $r$ is the number of observations (here $3^2 = 9$). This count grows exponentially with the number of observations we make.
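A minimal Python sketch of this naive enumeration, assuming the model parameters are stored as plain nested dictionaries (`pi`, `A`, and `B` are illustrative names, not from any particular library):

```python
from itertools import product

def naive_likelihood(obs, states, pi, A, B):
    """Sum P(O | q) * P(q) over every possible state sequence q.

    pi[s]   : initial probability of state s
    A[s][r] : transition probability P(r | s)
    B[s][o] : emission probability P(o | s)
    """
    total = 0.0
    # Enumerate all n^r state sequences of length r = len(obs).
    for q in product(states, repeat=len(obs)):
        p = pi[q[0]] * B[q[0]][obs[0]]  # P(q1) * P(o1 | q1)
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]  # transition * emission
        total += p
    return total
```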
The probability of making the observation $O$ is:
$$
\begin{aligned}
P(O = (o_1, o_2)) &= \sum_{q}\left[\left(\prod_{t=1}^{T} P(o_t \mid q_t)\right)\left(P(q_1) \prod_{t=1}^{T-1} P(q_{t+1} \mid q_t)\right)\right] \\
&= \sum_{q} P(q_1)\,P(o_1 \mid q_1)\,P(q_2 \mid q_1)\,P(o_2 \mid q_2) \\
&= P(s_1)\,P(o_1 \mid s_1)\,P(s_1 \mid s_1)\,P(o_2 \mid s_1) \\
&\quad + P(s_1)\,P(o_1 \mid s_1)\,P(s_2 \mid s_1)\,P(o_2 \mid s_2) \\
&\quad + P(s_1)\,P(o_1 \mid s_1)\,P(s_3 \mid s_1)\,P(o_2 \mid s_3) \\
&\quad + P(s_2)\,P(o_1 \mid s_2)\,P(s_1 \mid s_2)\,P(o_2 \mid s_1) \\
&\quad + P(s_2)\,P(o_1 \mid s_2)\,P(s_2 \mid s_2)\,P(o_2 \mid s_2) \\
&\quad + P(s_2)\,P(o_1 \mid s_2)\,P(s_3 \mid s_2)\,P(o_2 \mid s_3) \\
&\quad + P(s_3)\,P(o_1 \mid s_3)\,P(s_1 \mid s_3)\,P(o_2 \mid s_1) \\
&\quad + P(s_3)\,P(o_1 \mid s_3)\,P(s_2 \mid s_3)\,P(o_2 \mid s_2) \\
&\quad + P(s_3)\,P(o_1 \mid s_3)\,P(s_3 \mid s_3)\,P(o_2 \mid s_3)
\end{aligned}
$$
Writing it out, we can see that we are recomputing certain parts of the equation. For example, $P(s_3)\,P(o_1 \mid s_3)$ is computed three times. We can make the computation of $P(O)$ much more efficient by computing it recursively and reusing these shared terms. This is where the forward algorithm comes in.
$$
\begin{aligned}
\text{Initialization:} \quad & \alpha_1(i) = \pi[i]\,P(o_1 \mid s_i), && 1 \le i \le N \\
\text{Recursion:} \quad & \alpha_{t+1}(i) = \left[\sum_{j=1}^{N} \alpha_t(j)\,P(s_i \mid s_j)\right] P(o_{t+1} \mid s_i), && 1 \le t \le T-1,\ 1 \le i \le N \\
\text{Termination:} \quad & P(O) = \sum_{i=1}^{N} \alpha_T(i)
\end{aligned}
$$
Let's recompute our example using the forward algorithm:
$$
\begin{aligned}
\text{Initialization } (t=1): \quad & \alpha_1(s_1) = \pi[s_1]\,P(o_1 \mid s_1) \\
& \alpha_1(s_2) = \pi[s_2]\,P(o_1 \mid s_2) \\
& \alpha_1(s_3) = \pi[s_3]\,P(o_1 \mid s_3) \\
\text{Recursion } (t=2): \quad & \alpha_2(s_1) = \left[\alpha_1(s_1)\,P(s_1 \mid s_1) + \alpha_1(s_2)\,P(s_1 \mid s_2) + \alpha_1(s_3)\,P(s_1 \mid s_3)\right] P(o_2 \mid s_1) \\
& \alpha_2(s_2) = \left[\alpha_1(s_1)\,P(s_2 \mid s_1) + \alpha_1(s_2)\,P(s_2 \mid s_2) + \alpha_1(s_3)\,P(s_2 \mid s_3)\right] P(o_2 \mid s_2) \\
& \alpha_2(s_3) = \left[\alpha_1(s_1)\,P(s_3 \mid s_1) + \alpha_1(s_2)\,P(s_3 \mid s_2) + \alpha_1(s_3)\,P(s_3 \mid s_3)\right] P(o_2 \mid s_3) \\
\text{Termination:} \quad & P(O) = \alpha_2(s_1) + \alpha_2(s_2) + \alpha_2(s_3)
\end{aligned}
$$
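The recurrence translates almost line for line into code. A minimal Python sketch, reusing the same illustrative `pi`/`A`/`B` dictionary layout as the naive version above:

```python
def forward_likelihood(obs, states, pi, A, B):
    """Compute P(O) with the forward algorithm in O(N^2 * T) time."""
    # Initialization: alpha_1(i) = pi[i] * P(o1 | s_i)
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    # Recursion: alpha_{t+1}(i) = (sum_j alpha_t(j) * P(s_i | s_j)) * P(o_{t+1} | s_i)
    for o in obs[1:]:
        alpha = {
            i: sum(alpha[j] * A[j][i] for j in states) * B[i][o]
            for i in states
        }
    # Termination: P(O) = sum_i alpha_T(i)
    return sum(alpha.values())
```

Only the $N$ values of $\alpha_t$ are carried from one step to the next, which is exactly where the shared partial products get reused.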
The complexity of the forward algorithm is $O(N^2 T)$. This is because for each of the $T$ observations we compute, for each of the $N$ states, a sum over all $N$ possible previous states, giving $N \cdot N$ operations per step.
In comparison, the complexity of the naive approach is $O(T \cdot N^T)$, which is infeasible for even moderately sized $T$.
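As a sanity check, we can run both sketches on the same made-up three-state model (all numbers below are arbitrary, chosen only so each distribution sums to 1) and confirm they agree:

```python
states = ["s1", "s2", "s3"]
pi = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
A = {
    "s1": {"s1": 0.6, "s2": 0.3, "s3": 0.1},
    "s2": {"s1": 0.2, "s2": 0.5, "s3": 0.3},
    "s3": {"s1": 0.3, "s2": 0.3, "s3": 0.4},
}
B = {
    "s1": {"o1": 0.7, "o2": 0.3},
    "s2": {"o1": 0.4, "o2": 0.6},
    "s3": {"o1": 0.1, "o2": 0.9},
}
obs = ["o1", "o2"]

# Both methods compute the same quantity; only the cost differs.
assert abs(naive_likelihood(obs, states, pi, A, B)
           - forward_likelihood(obs, states, pi, A, B)) < 1e-12
```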
Further Reading: