This package provides subroutines that load the DNA sequences from a specified FASTA file and transform them into representations useful for downstream machine learning tasks, e.g. one-hot/WYK encoded vectors, k-mer-frequency-preserving shuffled sequences, Markov background estimates, and partitioned datasets for K-fold cross-validation (for FASTA files with labels). As of now, all sequences in the FASTA file must have the same length, and every sequence must be a string over the DNA alphabet {A,C,G,T}.
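To make the two input requirements and the one-hot representation concrete, here is a minimal Python sketch. It is not this package's API: the names `parse_fasta`, `validate_records`, and `one_hot` are hypothetical, and the A, C, G, T column order is an assumption chosen for illustration.

```python
ALPHABET = "ACGT"  # assumed column order for the one-hot vectors


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def validate_records(records):
    """Raise ValueError unless all sequences share one length and use only A, C, G, T."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    for header, seq in records:
        invalid = set(seq) - set(ALPHABET)
        if invalid:
            raise ValueError(f"{header}: characters outside {{A,C,G,T}}: {sorted(invalid)}")


def one_hot(seq):
    """Encode a DNA string as a list of length-4 indicator vectors, one per position."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    return [[1 if index[base] == i else 0 for i in range(len(ALPHABET))] for base in seq]


# Example usage on an in-memory FASTA string with two equal-length sequences.
example = ">seq1\nACGT\n>seq2\nTT\nGA\n"
records = parse_fasta(example)
validate_records(records)
encoded = [one_hot(seq) for _, seq in records]
```

Each sequence of length L becomes an L x 4 nested list here; a real pipeline would more likely store the result in a numeric array, and may use a different axis order.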
Coming Soon