User:La comadreja/Software/SpaceRemover

The Friedman Lab is no longer working on this program because it is ineffective for the stated purpose.
Subject classification: this is a biology resource.
Subject classification: this is an information technology resource.
Completion status: About halfway there. You may help to clarify and expand it.
This resource was intentionally abandoned by its creator. Please adopt it and change this tag.


This program is used to format .aln files before constructing hidden Markov models (HMMs).


A ClustalW2 output file in .aln form (align with numbers) is written like this:

Sequence1Name SpecificSequence Number_of_Nucleotides_or_AAs
Sequence2Name SpecificSequence Number_of_Nucleotides_or_AAs

Sequence1Name SpecificSequence Number_of_Nucleotides_or_AAs
Sequence2Name SpecificSequence Number_of_Nucleotides_or_AAs


This program takes such an .aln file, removes the spaces between the sequence name and the sequence, changes the count of nucleotides or amino acids at the end of each line to "83", and writes another .aln file for use in HMMER.
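The transformation described above can be sketched roughly as follows. This is not the original driver class. It is a minimal illustration under stated assumptions: the class name SpaceRemoverSketch, the method reformatLine, and the constant FIXED_COUNT are hypothetical; "removes the spaces" is read literally (the name and sequence are joined with nothing between them); and any line that does not consist of exactly a name, a sequence, and a numeric count is passed through unchanged.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class SpaceRemoverSketch {
    // Hypothetical constant: the fixed value the page says replaces the count.
    static final String FIXED_COUNT = "83";

    // Rewrites one "name sequence count" line.
    // Any other line (blank, header, etc.) is returned unchanged.
    static String reformatLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3 || !fields[2].matches("\\d+")) {
            return line;
        }
        // Assumption: literal reading of "removes the spaces", so no
        // separator is left between the name and the sequence.
        return fields[0] + fields[1] + " " + FIXED_COUNT;
    }

    public static void main(String[] args) throws IOException {
        // Read the alignment from standard input and write the result to
        // standard output, so shell redirection supplies the file names.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(reformatLine(line));
        }
    }
}
```

Reading from standard input and writing to standard output keeps the sketch compatible with the redirection style of invocation used on this page.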


Supply the names of the input and output files on the command line using shell redirection, e.g.: $ java SpaceRemover < inputTest.aln > outputTest.aln

Completion Stage

  • November 22, 2008: Files parsed by this program could be used to build hidden Markov models. When the HMM files were calibrated, however, the error message "FATAL: fit failed; --num may be set too small?" was returned. The programmer (R.S.) will next convert the files supplied in the HMMER tutorial to ClustalW2 form, construct hidden Markov models, and attempt to calibrate them.
  • November 30, 2008: A new version of SpaceRemover is in the works and shows several improvements over its predecessor (e.g., command line arguments are no longer required and the .aln file no longer needs to be further reformatted). R.S. was able to use ClustalW2 to convert the .msf file supplied in the HMMER tutorial (globins50.msf) to .aln form. The original version cannot parse globins50.aln. When the new version was used to reformat globins50.aln, the program ran excessively slowly. The next step for R.S. is to understand why it is so slow, fix it, and make the program available online.
  • December 2, 2008: The new version can parse globins50.aln, apparently successfully. However, when an HMM of globins50.aln is built, the error message "FATAL: Parse error: sequence myg_escgi-------------------------------------------------VLSDAEWQLVL: length 2, expected 6 in alignment" is returned. R.S. does not know what this means; the next step is to work out its meaning and fix it.
  • January 18, 2009: Regarding "length 2, expected 6 in alignment": when the length was increased to 6, the program printed "length 6, expected 18 in alignment." R.S. still does not understand the meaning.
    • HMMs of ADRA1A1.txt appear to be able to find sequences, e.g. in the FASTA file Artemia.fa, but R.S. does not know whether they are the correct sequences.
    • globins50.phylip contains an excessive number of gaps and cannot be recognized by hmmbuild.
    • The next step is to search for a better HMMER tutorial than the one in the Userguide.


Code for the driver class is here.

Code for a blueprint class that is used with the program is here. The class is supplied for programming assignments by the instructors of Princeton University's introductory computer science class.