Representing biological sequences as Markov chains

BioMarkovChains

A Julia package to represent biological sequences as Markov chains

Installation

BioMarkovChains is a Julia Language package. To install BioMarkovChains, please open Julia's interactive session (known as REPL) and press ] key in the REPL to use the package mode, then type the following command

pkg> add BioMarkovChains

Creating Markov chain out of DNA sequences

An important step before developing several gene finding algorithms consist of having a Markov chain representation of the DNA. To do so, we implemented the BioMarkovChain method that will capture the initials and transition probabilities of a DNA sequence (LongSequence) and will create a dedicated object storing relevant information of a DNA Markov chain. Here an example:

Let find one ORF in a random LongDNA :

using BioSequences, BioMarkovChains, GeneFinder


# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"

orfseq = findorfs(seq)[3] |> sequence

21nt DNA Sequence:
ATGCGTCGAATGGCACGGTGA

If we translate it, we get a 7aa sequence:

translate(orfseq)

7aa Amino Acid Sequence:
MRRMAR*

Now supposing I do want to see how transitions are occurring in this ORF sequence, the I can use the BioMarkovChain method and tune it to 2nd-order Markov chain:

BioMarkovChain(orfseq, 2)

BioMarkovChain of DNAAlphabet{4}() and order 1:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.25    0.25    0.0     0.5
   0.25    0.0     0.75    0.0
   0.25    0.25    0.25    0.25
   0.0     0.25    0.75    0.0
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.2     0.2     0.4     0.2

This is useful to later create HMMs and calculate sequence probability based on a given model, for instance we now have the E. coli CDS and No-CDS transition models or Markov chain implemented:

ECOLICDS

BioMarkovChain of DNAAlphabet{4}() and order 1:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
   0.31    0.224   0.199   0.268
   0.251   0.215   0.313   0.221
   0.236   0.308   0.249   0.207
   0.178   0.217   0.338   0.267
  - Initial Probabilities -> Vector{Float64}(4 × 1):
   0.245   0.243   0.273   0.239

What is then the probability of the previous DNA sequence given this model?

markovprobability(orfseq, model=ECOLICDS, logscale=true)

-39.71754773536592

This is off course not very informative, but we can later use different criteria to then classify new ORFs. For a more detailed explanation see the docs