Atomizer: Automated Segmentation of DNA Sequences with Complex
Evolutionary History
http://compbio.fmph.uniba.sk/atomizer/

The goal of this software is to automatically select suitable markers
in genomic sequences for reconstruction of evolutionary histories
involving large-scale events such as rearrangements or duplications.

This software is released into the public domain. However, if you use it
for research purposes, please cite the following paper in your publication:

Broňa Brejová, Michal Burger, Tomáš Vinař. Automated Segmentation of
DNA Sequences with Complex Evolutionary History. Accepted to WABI
2011.

Warning: this software is a prototype. Use at your own risk. If you
find any bugs or other problems, please let us know.
 
Requirements:
- Perl
- LASTZ program for alignments http://www.bx.psu.edu/~rsharris/lastz/
  (used by align.sh)
- CPLEX http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/
  (used by ilp-atoms.pl; can be avoided by setting the constants
   $max_component and $max_component2 to 0, in which case MCL is used instead)
- MCL http://micans.org/mcl/
  (used by ilp-atoms.pl; can be avoided by setting $max_component very high)

You may need to edit various constants at the end of each Perl script.
In our work, we removed (cut out) all repeats before running
the software on mammalian sequences.

Example pipeline for atomizing sequences in file seq.fa:
#run lastz, produce psl file with alignments
./align.sh seq.fa
#atomize sequence
./perl-atoms.pl seq.fa seq.fa.psl 250 > seq.atoms.tmp 2> seq.atoms.err1
#collapse adjacent atoms if possible
./collapse-atoms.pl seq.atoms.tmp > seq.atoms.tmp2 2> seq.atoms.err2
#cluster atoms to classes
./ilp-atoms.pl seq.fa seq.fa.psl seq.atoms.tmp2 > seq.atoms.tmp3 2> seq.atoms.err3
#renumber atoms
./renumber_atoms.pl < seq.atoms.tmp3 > seq.atoms.tmp4
#collapse adjacent atoms if possible
./collapse-atoms.pl seq.atoms.tmp4 > seq.atoms 2> seq.atoms.err4
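
The steps above can also be chained so that a failure in one step stops
the run instead of feeding an empty file to the next step. The following
is only a sketch, not part of the distribution: the function name atomize
and the variable names are made up for illustration, it assumes that
align.sh writes its alignments to <input>.psl as in the example, and it
passes the third argument of perl-atoms.pl through unchanged (250 by
default, the value used above).

```shell
# Hypothetical wrapper around the example pipeline (not shipped with Atomizer).
# Usage: atomize seq.fa [param]
atomize() {
    fa=$1
    param=${2:-250}        # third argument of perl-atoms.pl, 250 as in the example
    out=${fa%.fa}.atoms    # seq.fa -> seq.atoms, same naming as the example

    # Each step runs only if the previous one exited successfully.
    ./align.sh "$fa" &&
    ./perl-atoms.pl "$fa" "$fa.psl" "$param" > "$out.tmp" 2> "$out.err1" &&
    ./collapse-atoms.pl "$out.tmp" > "$out.tmp2" 2> "$out.err2" &&
    ./ilp-atoms.pl "$fa" "$fa.psl" "$out.tmp2" > "$out.tmp3" 2> "$out.err3" &&
    ./renumber_atoms.pl < "$out.tmp3" > "$out.tmp4" &&
    ./collapse-atoms.pl "$out.tmp4" > "$out" 2> "$out.err4"
}
```

Called as "atomize seq.fa" from the directory containing the scripts, this
should leave the same files as the example pipeline, with the final atoms
in seq.atoms.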

Format of the output:
List of atoms, one atom per line, 6 columns separated by tabs:
Sequence name, unique id of the atom, class id, strand (1 or -1),
start position, end position.

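Start and end positions are in UCSC-style coordinates:
start is 0-based, end is 1-based. In other words, an atom covers the
half-open interval [start, end) in 0-based positions, and its length
is end - start.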
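
As a small illustration of this format, the sketch below reads atoms
from standard input and prints the atom id, class id and length of each
atom. With a 0-based start and a 1-based end, the length is simply
end - start. The function name atom_lengths and the sample values are
made up for illustration; they do not come from a real run.

```shell
# Hypothetical helper: print "atom_id class_id length" for each atom on stdin.
# Columns: 1 sequence name, 2 atom id, 3 class id, 4 strand, 5 start, 6 end.
atom_lengths() {
    awk -F'\t' '{ print $2, $3, $6 - $5 }'
}

# Made-up sample in the six-column, tab-separated format.
printf 'chr1\t1\t1\t1\t0\t250\nchr1\t2\t2\t-1\t250\t600\nchr2\t3\t1\t1\t100\t350\n' |
    atom_lengths
```

On a real output file the same helper can be run as
"atom_lengths < seq.atoms".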

