Working with HMMs is a 2 phase problem:
We are using here a text corpus taken from Shakespear Julius Caesar.
library(HMM)
library(lattice)
corpus = readChar('https://www.dropbox.com/s/plwknlglf29qjv2/ceasar.txt?dl=1',100000)
We remove unwanted characters such as space of others, and define the 26 letters of the alphabet (with the underline symbol) as our set of possible outcomes:
corpus = gsub('[\n\r]*','',corpus)
a = unlist(strsplit(gsub("[^a-z]", "_", tolower (corpus)), ""))
letter.labels=c("_", letters)
Let’s see:
substr(corpus,1,1000)
## [1] "The Life and Death of Julius CaesarShakespeare homepage | Julius Caesar | Act 3, Scene 1Previous scene | Next sceneSCENE I. Rome. Before the Capitol; the Senate sitting above. A crowd of people; among them ARTEMIDORUS and the Soothsayer. Flourish. Enter CAESAR, BRUTUS, CASSIUS, CASCA, DECIUS BRUTUS, METELLUS CIMBER, TREBONIUS, CINNA, ANTONY, LEPIDUS, POPILIUS, PUBLIUS, and others CAESAR [To the Soothsayer] The ides of March are come.Soothsayer Ay, Caesar; but not gone.ARTEMIDORUS Hail, Caesar! read this schedule.DECIUS BRUTUS Trebonius doth desire you to o'erread, At your best leisure, this his humble suit.ARTEMIDORUS O Caesar, read mine first; for mine's a suit That touches Caesar nearer: read it, great Caesar.CAESAR What touches us ourself shall be last served.ARTEMIDORUS Delay not, Caesar; read it instantly.CAESAR What, is the fellow mad?PUBLIUS Sirrah, give place.CASSIUS What, urge you your petitions in the street? Come to the Capitol. C"
The variable a
contains the characters as a table:
head(a,n=10)
## [1] "t" "h" "e" "_" "l" "i" "f" "e" "_" "a"
We define a small helper function needed later on:
prob = function(x) {x/sum(x)}
We first need to specify the number of hidden states that we expect in our data; here, we start with n=2
, ut we can try changing this parameter!
n = 2
In order to learn the parameters, we initialize the emission and transition matrices, and run the Baum-Welch algorithm with a specified number of iterations.
We start by generating a random transition and emission probability matrix:
# initialize emission
B0 = apply(matrix(runif(n*27), n), 1, prob)
# initialize transition
A0 = apply(matrix(runif(n^2), n), 1, prob)
We now initialize an hmm object which, at this point, only contains the random matrices A0 and B0:
hmm0 = initHMM(paste0("State ",1:n), letter.labels, startProbs=(prob(runif(n))),
transProbs=A0,
emissionProbs=B0
)
We are now ready to run the Baum Welch algorithm with a number of iterations:
nit = 200
hmm = baumWelch(hmm0, a,nit)
We can now check the learned emission probabilities, and compare them with the random initialized probabilities:
## Plot the learned emission parameters
par(mfrow=c(2,1),mar=c(4,2,2,2))
barplot(t(B0)[,-1],beside=TRUE,names=letter.labels[-1],main='Initial',
col=c('red','lightgreen'));
barplot(hmm$hmm$emissionProbs[,-1],beside=TRUE,main='Final',col=c('red','lightgreen'));legend('topright',legend=c('State 1','State 2'),fill=c('red','lightgreen'))
Can you identify what the two states correspond to?
We can now apply the forward and Viterbi algorithms to a new sentence:
sentence = 'this_is_a_sentence_which_should_be_analysed_by_my_hidden_markov_model'
test = unlist(strsplit(gsub("[^a-z]", "_", tolower(sentence)), ""))
Now apply the Forward algorithm to compute using a dynamic programming approach the intermediate probabilities; the output is a table with the intermediate probabilities (given as ln(p)):
## apply forward
f = forward(hmm$hmm,test)
heatmap(f,Rowv = NA,Colv=NA)
f
## index
## states 1 2 3 4 5 6
## State 1 -2.612014 -5.613055 -8.70342 -11.38531 -14.21031 -16.31962
## State 2 -5.698445 -33.161427 -11.18522 -13.91275 -13.11114 -17.71754
## index
## states 7 8 9 10 11 12 13
## State 1 -18.91828 -21.73441 -23.32869 -26.19705 -28.01937 -30.21763 -33.58634
## State 2 -21.25533 -20.61215 -32.89132 -25.22255 -29.62482 -45.28763 -34.72102
## index
## states 14 15 16 17 18 19 20
## State 1 -36.10776 -38.36128 -41.72998 -45.25841 -47.56241 -50.43082 -67.05364
## State 2 -38.48050 -53.56106 -42.86466 -70.57446 -62.90549 -49.45644 -53.11219
## index
## states 21 22 23 24 25 26
## State 1 -56.72627 -59.81663 -63.46399 -66.49008 -69.35848 -71.18087
## State 2 -82.93107 -62.29844 -89.03813 -94.11301 -68.38411 -72.78636
## index
## states 27 28 29 30 31 32
## State 1 -74.10123 -77.68402 -80.55152 -83.83445 -90.22165 -91.13524
## State 2 -101.45108 -77.91015 -107.80388 -112.24774 -87.83152 -88.85623
## index
## states 33 34 35 36 37 38 39
## State 1 -93.60554 -95.5118 -98.38020 -100.0651 -103.4338 -105.8593 -109.1422
## State 2 -93.73405 -110.1483 -97.40583 -109.6654 -104.5684 -116.1234 -137.5554
## index
## states 40 41 42 43 44 45 46
## State 1 -143.2821 -116.8633 -119.0181 -125.4053 -126.3189 -128.7892 -162.5314
## State 2 -113.5480 -118.1026 -134.0010 -123.0152 -124.0399 -128.9177 -132.0907
## index
## states 47 48 49 50 51 52 53
## State 1 -135.5471 -137.7957 -171.4848 -144.4483 -145.5175 -148.6079 -154.9497
## State 2 -133.1546 -137.7706 -140.9919 -142.0558 -171.8356 -151.0897 -152.4296
## index
## states 54 55 56 57 58 59 60
## State 1 -159.2695 -158.4108 -161.7795 -164.4836 -167.5355 -169.7296 -172.6371
## State 2 -155.5620 -172.3683 -162.9142 -163.1211 -167.6703 -179.6774 -198.4859
## index
## states 61 62 63 64 65 66 67
## State 1 -177.6899 -181.2670 -185.7678 -188.6362 -191.9368 -195.1533 -200.9155
## State 2 -182.2607 -181.4754 -316.6900 -187.6618 -192.1645 -194.7066 -197.6275
## index
## states 68 69
## State 1 -200.4545 -203.7374
## State 2 -214.4283 -232.1507
We can also apply the Viterbi algorithm to the new sentence to check the most likely states:
## apply viterbi
v = viterbi(hmm$hmm,test)
data.frame(prediction=v,sentence=test)
## prediction sentence
## 1 State 1 t
## 2 State 1 h
## 3 State 1 i
## 4 State 1 s
## 5 State 2 _
## 6 State 1 i
## 7 State 1 s
## 8 State 2 _
## 9 State 1 a
## 10 State 2 _
## 11 State 1 s
## 12 State 1 e
## 13 State 1 n
## 14 State 1 t
## 15 State 1 e
## 16 State 1 n
## 17 State 1 c
## 18 State 1 e
## 19 State 2 _
## 20 State 2 w
## 21 State 1 h
## 22 State 1 i
## 23 State 1 c
## 24 State 1 h
## 25 State 2 _
## 26 State 1 s
## 27 State 1 h
## 28 State 1 o
## 29 State 1 u
## 30 State 1 l
## 31 State 2 d
## 32 State 2 _
## 33 State 1 b
## 34 State 1 e
## 35 State 2 _
## 36 State 1 a
## 37 State 1 n
## 38 State 1 a
## 39 State 1 l
## 40 State 2 y
## 41 State 1 s
## 42 State 1 e
## 43 State 2 d
## 44 State 2 _
## 45 State 2 b
## 46 State 2 y
## 47 State 2 _
## 48 State 2 m
## 49 State 2 y
## 50 State 2 _
## 51 State 1 h
## 52 State 1 i
## 53 State 2 d
## 54 State 2 d
## 55 State 1 e
## 56 State 1 n
## 57 State 2 _
## 58 State 1 m
## 59 State 1 a
## 60 State 1 r
## 61 State 1 k
## 62 State 1 o
## 63 State 1 v
## 64 State 2 _
## 65 State 2 m
## 66 State 2 o
## 67 State 2 d
## 68 State 1 e
## 69 State 1 l