Skip to content

rtunney/BWT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 

Repository files navigation

BWT

Purpose:

bwt.py implements linear time encoding of a text into its Burrows Wheeler Transform, and inversion of the transform back to the original text. The dc3 algorithm is implemented for linear time construction of a suffix array, which is used in the encoding to BWT. Decoding is achieved via an occurrence function on the BWT and backwards decoding.

BWT is a popular data structure for compressing the information contained within a suffix array. It can also be used for finding all positions in the encoded text matching a pattern, in time linear in the size of the pattern.

Use:

The program is run from the command line, and takes three arguments: -bwt/-ibwt for encoding/decoding, input_filename, and output_filename. For example:

  • python bwt.py -bwt moby_dick.fasta moby_dick.bwt.fasta
  • python bwt.py -ibwt moby_dick.bwt.fasta moby_dick.ibwt.fasta

The first command would read the text from moby_dick.fasta and output its bwt in fasta format to moby_dick.bwt.fasta The second command would read the bwt in fasta format and output the decoded text to moby_dick.ibwt.fasta Note that moby_dick.fasta and moby_dick.ibwt.fasta will have identical contents

bwt.py expects the input fasta file to be in the same directory as itself. It outputs to the specified file name within the same directory.

Format:

The program takes input and outputs files in fasta format. This includes a header line '>BWT' or '>iBWT', and subsequent lines limited to 80 characters of text each. The program will tolerate in the input file 1) comment lines designated with '>' or ';', and 2) newlines, but output is standard. Header is not critical to program function.

Texts input for BWT must end with a '$' This is by convention for alphabetical tie-breaking.

About

Make and invert the Burrows Wheeler Transform of a text

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages