Principle

Working with HMMs is a 2 phase problem:

  1. learn the parameters of the HMM using oberved training data (Baum-Welch algorithm, based on expectation maximization)
  2. use the learned model on new test data to compute the likelihood of observed data and the most likely chain of unobserved states

Training data: Julius Caesar

We are using here a text corpus taken from Shakespear Julius Caesar.

library(HMM)
library(lattice)
corpus = readChar('https://www.dropbox.com/s/plwknlglf29qjv2/ceasar.txt?dl=1',100000)

We remove unwanted characters such as space of others, and define the 26 letters of the alphabet (with the underline symbol) as our set of possible outcomes:

corpus = gsub('[\n\r]*','',corpus)
a = unlist(strsplit(gsub("[^a-z]", "_", tolower (corpus)), ""))
letter.labels=c("_", letters)

Let’s see:

substr(corpus,1,1000)
## [1] "The Life and Death of Julius CaesarShakespeare homepage | Julius Caesar | Act 3, Scene 1Previous scene | Next sceneSCENE I. Rome. Before the Capitol; the Senate sitting above.    A crowd of people; among them ARTEMIDORUS and the Soothsayer. Flourish. Enter CAESAR, BRUTUS, CASSIUS, CASCA, DECIUS BRUTUS, METELLUS CIMBER, TREBONIUS, CINNA, ANTONY, LEPIDUS, POPILIUS, PUBLIUS, and others CAESAR    [To the Soothsayer] The ides of March are come.Soothsayer    Ay, Caesar; but not gone.ARTEMIDORUS    Hail, Caesar! read this schedule.DECIUS BRUTUS    Trebonius doth desire you to o'erread,    At your best leisure, this his humble suit.ARTEMIDORUS    O Caesar, read mine first; for mine's a suit    That touches Caesar nearer: read it, great Caesar.CAESAR    What touches us ourself shall be last served.ARTEMIDORUS    Delay not, Caesar; read it instantly.CAESAR    What, is the fellow mad?PUBLIUS    Sirrah, give place.CASSIUS    What, urge you your petitions in the street?    Come to the Capitol.    C"

The variable a contains the characters as a table:

head(a,n=10)
##  [1] "t" "h" "e" "_" "l" "i" "f" "e" "_" "a"

We define a small helper function needed later on:

prob = function(x) {x/sum(x)}

Learning the parameters: Baum-Welch

We first need to specify the number of hidden states that we expect in our data; here, we start with n=2, ut we can try changing this parameter!

n = 2

In order to learn the parameters, we initialize the emission and transition matrices, and run the Baum-Welch algorithm with a specified number of iterations.

We start by generating a random transition and emission probability matrix:

# initialize emission
B0 = apply(matrix(runif(n*27), n), 1, prob)

# initialize transition
A0 = apply(matrix(runif(n^2), n), 1, prob)

We now initialize an hmm object which, at this point, only contains the random matrices A0 and B0:

hmm0 = initHMM(paste0("State ",1:n), letter.labels, startProbs=(prob(runif(n))),
                transProbs=A0,
                emissionProbs=B0
  )

We are now ready to run the Baum Welch algorithm with a number of iterations:

nit = 200
hmm = baumWelch(hmm0, a,nit)

looking at the learning parameters

We can now check the learned emission probabilities, and compare them with the random initialized probabilities:

## Plot the learned emission parameters
par(mfrow=c(2,1),mar=c(4,2,2,2))
barplot(t(B0)[,-1],beside=TRUE,names=letter.labels[-1],main='Initial',
        col=c('red','lightgreen'));
barplot(hmm$hmm$emissionProbs[,-1],beside=TRUE,main='Final',col=c('red','lightgreen'));legend('topright',legend=c('State 1','State 2'),fill=c('red','lightgreen'))

Can you identify what the two states correspond to?

Applying the model to new data

We can now apply the forward and Viterbi algorithms to a new sentence:

sentence = 'this_is_a_sentence_which_should_be_analysed_by_my_hidden_markov_model'
test = unlist(strsplit(gsub("[^a-z]", "_", tolower(sentence)), ""))

Now apply the Forward algorithm to compute using a dynamic programming approach the intermediate probabilities; the output is a table with the intermediate probabilities (given as ln(p)):

## apply forward
f = forward(hmm$hmm,test)
heatmap(f,Rowv  = NA,Colv=NA)

f
##          index
## states            1          2         3         4         5         6
##   State 1 -2.612014  -5.613055  -8.70342 -11.38531 -14.21031 -16.31962
##   State 2 -5.698445 -33.161427 -11.18522 -13.91275 -13.11114 -17.71754
##          index
## states            7         8         9        10        11        12        13
##   State 1 -18.91828 -21.73441 -23.32869 -26.19705 -28.01937 -30.21763 -33.58634
##   State 2 -21.25533 -20.61215 -32.89132 -25.22255 -29.62482 -45.28763 -34.72102
##          index
## states           14        15        16        17        18        19        20
##   State 1 -36.10776 -38.36128 -41.72998 -45.25841 -47.56241 -50.43082 -67.05364
##   State 2 -38.48050 -53.56106 -42.86466 -70.57446 -62.90549 -49.45644 -53.11219
##          index
## states           21        22        23        24        25        26
##   State 1 -56.72627 -59.81663 -63.46399 -66.49008 -69.35848 -71.18087
##   State 2 -82.93107 -62.29844 -89.03813 -94.11301 -68.38411 -72.78636
##          index
## states            27        28         29         30        31        32
##   State 1  -74.10123 -77.68402  -80.55152  -83.83445 -90.22165 -91.13524
##   State 2 -101.45108 -77.91015 -107.80388 -112.24774 -87.83152 -88.85623
##          index
## states           33        34        35        36        37        38        39
##   State 1 -93.60554  -95.5118 -98.38020 -100.0651 -103.4338 -105.8593 -109.1422
##   State 2 -93.73405 -110.1483 -97.40583 -109.6654 -104.5684 -116.1234 -137.5554
##          index
## states           40        41        42        43        44        45        46
##   State 1 -143.2821 -116.8633 -119.0181 -125.4053 -126.3189 -128.7892 -162.5314
##   State 2 -113.5480 -118.1026 -134.0010 -123.0152 -124.0399 -128.9177 -132.0907
##          index
## states           47        48        49        50        51        52        53
##   State 1 -135.5471 -137.7957 -171.4848 -144.4483 -145.5175 -148.6079 -154.9497
##   State 2 -133.1546 -137.7706 -140.9919 -142.0558 -171.8356 -151.0897 -152.4296
##          index
## states           54        55        56        57        58        59        60
##   State 1 -159.2695 -158.4108 -161.7795 -164.4836 -167.5355 -169.7296 -172.6371
##   State 2 -155.5620 -172.3683 -162.9142 -163.1211 -167.6703 -179.6774 -198.4859
##          index
## states           61        62        63        64        65        66        67
##   State 1 -177.6899 -181.2670 -185.7678 -188.6362 -191.9368 -195.1533 -200.9155
##   State 2 -182.2607 -181.4754 -316.6900 -187.6618 -192.1645 -194.7066 -197.6275
##          index
## states           68        69
##   State 1 -200.4545 -203.7374
##   State 2 -214.4283 -232.1507

We can also apply the Viterbi algorithm to the new sentence to check the most likely states:

## apply viterbi
v = viterbi(hmm$hmm,test)

data.frame(prediction=v,sentence=test)
##    prediction sentence
## 1     State 1        t
## 2     State 1        h
## 3     State 1        i
## 4     State 1        s
## 5     State 2        _
## 6     State 1        i
## 7     State 1        s
## 8     State 2        _
## 9     State 1        a
## 10    State 2        _
## 11    State 1        s
## 12    State 1        e
## 13    State 1        n
## 14    State 1        t
## 15    State 1        e
## 16    State 1        n
## 17    State 1        c
## 18    State 1        e
## 19    State 2        _
## 20    State 2        w
## 21    State 1        h
## 22    State 1        i
## 23    State 1        c
## 24    State 1        h
## 25    State 2        _
## 26    State 1        s
## 27    State 1        h
## 28    State 1        o
## 29    State 1        u
## 30    State 1        l
## 31    State 2        d
## 32    State 2        _
## 33    State 1        b
## 34    State 1        e
## 35    State 2        _
## 36    State 1        a
## 37    State 1        n
## 38    State 1        a
## 39    State 1        l
## 40    State 2        y
## 41    State 1        s
## 42    State 1        e
## 43    State 2        d
## 44    State 2        _
## 45    State 2        b
## 46    State 2        y
## 47    State 2        _
## 48    State 2        m
## 49    State 2        y
## 50    State 2        _
## 51    State 1        h
## 52    State 1        i
## 53    State 2        d
## 54    State 2        d
## 55    State 1        e
## 56    State 1        n
## 57    State 2        _
## 58    State 1        m
## 59    State 1        a
## 60    State 1        r
## 61    State 1        k
## 62    State 1        o
## 63    State 1        v
## 64    State 2        _
## 65    State 2        m
## 66    State 2        o
## 67    State 2        d
## 68    State 1        e
## 69    State 1        l