
GCG to EMBOSS Conversion

In 1985 the Genetics Computer Group in Wisconsin, USA, developed a set of bioinformatics tools and created the GCG software package. The suite was enhanced a decade later, when European developers began contributing sequence analysis tools as part of the Extended GCG (EGCG) package.

EMBOSS was designed in 1999 by some of the EGCG developers with many of the same principles that can be found in the GCG package. The EMBOSS suite is designed as a collection of programs that can be used together to create a flexible analysis pipeline. There are no arbitrary size limits on sequence length and as EMBOSS reads and writes 42 different formats, it is easy to input and output files created from or for other software packages.

For GCG users who have started using the BioBind product, this conversion chart should help you make the switch as easily and efficiently as possible. You will find many equivalences between the two software packages, although there are currently also areas where there is no GCG equivalent for an EMBOSS program, and vice versa. Each EMBOSS program has a one-line description that identifies it. This description appears at the start of each application.

Each EMBOSS application prompts for all the information it requires to execute. Additional prompts may be accessible using the -opt flag after the program name on the command line. The default for each prompt is displayed in square brackets ([ ]). To accept it, press <return>; otherwise type in your own value. Output file names often have the name of the relevant application as their suffix. To halt an EMBOSS program, press Ctrl-C.

assemble merger

Merges two overlapping sequences into one. Produces a merged file and an alignment file. Matrix options are accessible using the -opt flag.

$ merger
Merge two overlapping nucleic acid sequences
Input sequence: cam1.fasta
Second sequence: cam2.fasta
Output sequence [cam1.fasta]: cam_both.fasta
Output alignment [cam1.out2]: cam_both.aln
   
backtranslate backtranseq

Translates protein back into a nucleotide sequence. Default codon usage table is the standard human one. To alter this use the -opt flag.

$ backtranseq
Back translate a protein sequence
Input sequence: calm_human
Output sequence [calm_human.fasta]:
   
bestfit water/matcher

Finds the best local alignment(s) between two sequences. matcher (Huang & Miller algorithm) provides a faster match and should be used for longer sequences. water (Smith-Waterman algorithm) is more accurate and should be used for shorter sequences. Matrix options for matcher are available using the -opt flag.

$ matcher
Finds the best local alignments between two sequences
Input sequence: cam1_long.fasta
Second sequence: cam2_long.fasta
Output alignment [cam1_1-429.matcher]:

$ water
Smith-Waterman local alignment.
Input sequence: cam1.fasta
Second sequence(s): cam2.fasta
Gap opening penalty [10.0]:
Gap extension penalty [0.5]:
Output alignment [cam1.water]:
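
To illustrate what a local alignment score is, here is a minimal Python sketch of the Smith-Waterman recurrence. It is not the water implementation: it assumes a simple match/mismatch score and a single linear gap penalty, whereas water prompts for separate gap opening and extension penalties and uses a comparison matrix. The function name and the scoring values are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Assumption: linear gap penalty and flat match/mismatch scores,
    chosen only to keep the sketch short.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0,
                          prev[j - 1] + step,  # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Because every cell is floored at zero, unrelated flanking sequence does not drag the score down, which is what distinguishes a local alignment from the global alignments produced by stretcher and needle.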

breakup splitter

Takes a sequence and splits it into smaller overlapping sequences. Use the -opt flag to select the size of each fragment.

$ splitter
Split a sequence into (overlapping) smaller sequences
Input sequence(s): cam1.fasta
Output sequence [cam1.fasta]:
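
The idea of splitting a sequence into overlapping fragments can be sketched in a few lines of Python. This is an illustration only, not the splitter source; the function name and the size and overlap parameters of the sketch are assumptions made for the example.

```python
def split_overlapping(seq, size, overlap):
    """Split seq into fragments of at most `size` characters.

    Consecutive fragments share `overlap` characters.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    fragments = []
    for start in range(0, len(seq), step):
        fragments.append(seq[start:start + size])
        # Stop once a fragment reaches the end of the sequence.
        if start + size >= len(seq):
            break
    return fragments


if __name__ == "__main__":
    for fragment in split_overlapping("ABCDEFGHIJ", size=4, overlap=2):
        print(fragment)
```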
   
chopup

It is not necessary to have a separate program in EMBOSS for this, as all programs read and write a number of different file formats.

codonfrequency chips/cusp/compseq

chips calculates the effective number of codons used (Wright Nc statistic). cusp creates a codon usage table from coding sequence (CDS). compseq counts the composition of user-specified words within the sequence. Use the -opt flag for further word specification.

$ chips
Codon usage statistics
Input sequence(s): cam1.fasta
Output file [cam1_1-429.chips]:

$ cusp
Create a codon usage table
Input sequence(s): cam1.fasta
Output file [cam1_1-429.cusp]:

$ compseq
Counts the composition of dimer/trimer/etc words in a sequence
Input sequence(s): cam1.fasta
Word size to consider (e.g. 2=dimer) [2]:
Output file [cam1_1-429.composition]:

codonpreference syco/wobble

syco identifies coding sequence from codon frequency bias information (Gribskov statistic). Further options for plot specification can be retrieved using the -opt flag. wobble plots the third-position ("wobble") base of the codons along a sequence. Use the -opt flag to alter the window size.

$ syco
Synonymous codon usage Gribskov statistic plot
Input sequence: cam1.fasta
Graph type [x11]: ps
Created syco.ps

$ wobble
Wobble base plot
Input sequence: cam1.fasta
Graph type [x11]: ps
Output file [cam1_1-429.wobble]:
Created wobble.ps

coilscan pepcoil

Identifies coiled coil regions in a protein sequence (Lupas, van Dyke & Stock algorithm).

$ pepcoil
Predicts coiled coil regions
Input sequence(s): calm_human
Window size [28]:
Output file [calm_human.pepcoil]:
  
compare dottup/dotmatcher

Compares similar regions across two sequences and displays them in graphical format. dottup is designed for identical matches, and dotmatcher for regions of similarity. Use the -opt flag to select matrix options.

$ dottup
Displays a wordmatch dotplot of two sequences
Input sequence: cam1.fasta
Second sequence: cam2.fasta
Word size [10]:
Graph type [x11]: ps
Created dottup.ps

$ dotmatcher
Displays a thresholded dotplot of two sequences
Input sequence: cam1.fasta
Second sequence: cam2.fasta
Graph type [x11]: ps
Created dotmatcher.ps
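
A word-match dotplot of the kind dottup draws is built from the positions where both sequences contain the same word. The Python sketch below shows one way to collect those positions. It is an assumption-laden illustration, not the dottup algorithm itself, and the function name is hypothetical.

```python
from collections import defaultdict


def word_matches(a, b, word_size=10):
    """Return 1-based (position in a, position in b) pairs of identical words."""
    # Index every word of the second sequence by its start position.
    index = defaultdict(list)
    for j in range(len(b) - word_size + 1):
        index[b[j:j + word_size]].append(j)

    hits = []
    for i in range(len(a) - word_size + 1):
        # .get avoids adding empty entries to the index for unseen words.
        for j in index.get(a[i:i + word_size], ()):
            hits.append((i + 1, j + 1))
    return hits


if __name__ == "__main__":
    # Each pair would become one dot (the start of a diagonal) in the plot.
    print(word_matches("ACGTAC", "TTACGT", word_size=4))
```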

composition compseq/pepstats

compseq counts the composition of user-specified words within the sequence. Use the -opt flag for further word specification. pepstats calculates peptide sequence composition.

$ compseq
Counts the composition of dimer/trimer/etc words in a sequence
Input sequence(s): cam1.fasta
Word size to consider (e.g. 2=dimer) [2]:
Output file [cam1_1-429.composition]:

$ pepstats
Protein statistics
Input sequence(s): calm_human
Output file [calm_human.pepstats]:
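
The word counting that compseq reports can be illustrated with a short Python sketch. It assumes that words are counted at every position, so that neighbouring words overlap; the function name is hypothetical and the output format of compseq is not reproduced.

```python
from collections import Counter


def count_words(seq, word_size=2):
    """Count every overlapping word of length word_size in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + word_size]
                   for i in range(len(seq) - word_size + 1))


if __name__ == "__main__":
    for word, count in sorted(count_words("ACGACG", 2).items()):
        print(word, count)
```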

consensus prophecy

Creates a matrix or profile from a multiple alignment.

$ prophecy
Creates matrices/profiles from multiple alignments
Input sequence set: prot2.fasta
Profile type
         F : Frequency
         G : Gribskov
         H : Henikoff
Select type [F]:
Enter a name for the profile [mymatrix]:
Enter threshold reporting percentage [75]:
Output file [prot2.prophecy]:  
   
correspond codcmp

Compares codon frequency matrices.

$ codcmp
Codon usage table comparison
Codon usage file [Ehum.cut]:
Second Codon usage file [Ehum.cut]: Eacc.cut
Output file [outfile.codcmp]:
   
corrupt msbar

Randomly mutates a sequence. Use the -opt flag to mutate in frame.

$ msbar
Mutate sequence beyond all recognition
Input sequence(s): cam1.fasta
Number of times to perform the mutation operations [1]:
Point mutation operations
         0 : None
         1 : Any of the following
         2 : Insertions
         3 : Deletions
         4 : Changes
         5 : Duplications
         6 : Moves
Types of point mutations to perform [0]:
Block mutation operations
         0 : None
         1 : Any of the following
         2 : Insertions
         3 : Deletions
         4 : Changes
         5 : Duplications
         6 : Moves
Types of block mutations to perform [0]:
Codon mutation operations
         0 : None
         1 : Any of the following
         2 : Insertions
         3 : Deletions
         4 : Changes
         5 : Duplications
         6 : Moves
Types of codon mutations to perform [0]:
Output sequence [cam1_1-429.fasta]:
   
dataset dbiblast/dbigcg/dbifasta/dbiflat

Indexes the relevant database for use with EMBOSS.

distances no direct equivalent

See the PHYLIP package offered as part of your BioBind software options.

diverge no direct equivalent

See the PHYLIP package offered as part of your BioBind software options.

dotplot dottup/dotmatcher

See compare

extractpeptide transeq

Translates a nucleotide sequence into protein. Use the -opt flag to specify information on the region, frame and genetic code.

$ transeq
Translate nucleic acid sequences
Input sequence(s): cam1.fasta
Output sequence [cam1_1-429.pep]:
    
fetch seqret/seqretsplit

seqret retrieves sequences from a database using the EMBOSS uniform sequence address. It can also be used with an input file to alter its format. seqretsplit splits a multi-sequence file into individual files, each containing a single sequence. Use the -opt flag to retrieve only the first sequence in a file.

$ seqretsplit
Reads and writes (returns) sequences in individual files
Input sequence(s): prot2.fasta
Output sequence [calm_human.fasta]:
    
findpatterns fuzznuc/fuzzpro

Fuzzy search of a pattern against a sequence or a selection of sequences. The search allows mismatches. fuzznuc searches nucleotide sequences and fuzzpro searches protein sequences.

$ fuzznuc
Nucleic acid pattern search
Input sequence(s): cam1.fasta
Search pattern: AGGT
Number of mismatches [0]: 1
Output report [cam1_1-429.fuzznuc]:

$ fuzzpro
Protein pattern search
Input sequence(s): prot2.fasta
Search pattern: PATTERN
Number of mismatches [0]: 3
Output report [calm_human.fuzzpro]:
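
What "number of mismatches" means in a fuzzy search can be shown with a small Python sketch. It assumes a plain literal pattern compared position by position; ambiguity codes and the richer pattern syntax that fuzznuc and fuzzpro accept are not handled here. The function name is hypothetical.

```python
def fuzzy_find(seq, pattern, max_mismatches=0):
    """Return (1-based start, matched text, mismatch count) for each hit."""
    seq, pattern = seq.upper(), pattern.upper()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        # Count positions where the window and the pattern differ.
        mismatches = sum(1 for x, y in zip(window, pattern) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits


if __name__ == "__main__":
    for hit in fuzzy_find("CCAGGTCCAGCT", "AGGT", max_mismatches=1):
        print(hit)
```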

frames plotorf/showorf

Plots or displays open reading frames. plotorf uses ATG as a start and TAA, TAG, TGA as stop codons and displays the results as a graphic. showorf writes out the results of a frame translation as text. Use the -opt flag for more options.

$ plotorf
Plot potential open reading frames
Input sequence: cam1.fasta
Graph type [x11]: ps
Created plotorf.ps

$ showorf
Pretty output of DNA translations
Input sequence: cam1.fasta
Select Frames To Translate
         0 : None
         1 : F1
         2 : F2
         3 : F3
         4 : R1
         5 : R2
         6 : R3
Select one or more values [1,2,3,4,5,6]:
Output file [cam1_1-429.showorf]:
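
The ORF definition given above (ATG as the start codon; TAA, TAG and TGA as stop codons) can be sketched in Python as follows. This illustration scans only the three forward frames and is not the plotorf implementation; the function name and the min_codons parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (frame, 1-based start, 1-based end) for forward-strand ORFs.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon, and the reported end includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length in codons, not counting the stop codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start + 1, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

Reverse frames could be covered by running the same scan on the reverse complement of the sequence.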

fromEMBL /
fromFasta /
fromGenbank /
fromIG /
fromStaden /
fromtrace
all

All EMBOSS applications read and write a variety of file formats, so an individual conversion program is not necessary.

gap stretcher/needle

Finds the best global alignment between two sequences. stretcher (Myers & Miller algorithm) provides a faster match and should be used for longer sequences. needle (Needleman-Wunsch algorithm) is more accurate and should be used for shorter sequences. Matrix options for stretcher are available using the -opt flag.

$ stretcher
Finds the best global alignment between two sequences
Input sequence: cam1_long.fasta
Second sequence: cam2_long.fasta
Output alignment [cam1_1-429.stretcher]:

$ needle
Needleman-Wunsch global alignment.
Input sequence: cam1.fasta
Second sequence(s): cam2.fasta
Gap opening penalty [10.0]:
Gap extension penalty [0.5]:
Output alignment [cam1_1-429.needle]:

gapshow plotcon

Plots the quality of alignment conservation across a sliding window. Use the -opt flag to alter the comparison matrix.

$ plotcon
Plots the quality of conservation of a sequence alignment
Input sequence set: emma.aln
Window size [4]:
Graph type [x11]: ps
Created plotcon.ps
    
getseq newseq

Enter a short sequence into the program for use as an input file in other applications.

$ newseq
Type in a short new sequence.
Name of the sequence: Test
Description of the sequence: Test Protein Sequence
Type of sequence
         N : Nucleic
         P : Protein
Type of sequence [N]: P
Output sequence [outfile.fasta]: Test.fasta
Enter the sequence: wearethediddymenthediddymenthediddymen
  
growtree no direct equivalent

Use emma as the interface to ClustalW, or use the PHYLIP option in your BioBind software.

helicalwheel pepwheel

Plots a protein sequence as a helix. Use the -opt flag to specify the output display.

$ pepwheel
Shows protein sequences as helices
Input sequence: calm_human
Graph type [x11]: ps
Created pepwheel.ps
    
hmmerAlign /
hmmerBuild /
hmmerCalibrate /
hmmerFetch /
hmmerIndex /
hmmerPfam /
hmmerSearch
no direct equivalent

The HMMER programs are available as an option with your BioBind software.

 
hthscan helixturnhelix

Searches for 22-residue helix-turn-helix motifs in a protein sequence (Dodd & Egan). Use the -opt flag to search using their 20-residue region and to further specify calculation parameters.

$ helixturnhelix
Report nucleic acid binding motifs
Input sequence(s): calm_human
Output report [calm_human.hth]:
    
isoelectric iep

Calculates the isoelectric point of a protein.

$ iep calm_human
Calculates the isoelectric point of a protein
Output file [calm_human.iep]:
    
lookup whichdb

Does not offer all the parameters that lookup does, but will find identifiers or accession numbers in a database, and optionally retrieve the sequence.

$ whichdb
Search all databases for an entry
ID or Accession number: p62158
Output file [outfile.whichdb]:
    
map /
mapplot /
mapsort
restrict/remap/restover

Calculates restriction maps based on the entries in the REBASE restriction enzyme database. Displays the peptide translation of open reading frames. remap is the most flexible of these applications. Use the -opt flag to force specific cutters.

$ restrict cam1.fasta
Finds restriction enzyme cleavage sites
Minimum recognition site length [4]:
Comma separated enzyme list [all]:
Output report [cam1_1-429.restrict]:

$ remap
Display a sequence with restriction cut sites, translation etc..
Input sequence(s): cam1.fasta
Comma separated enzyme list [all]:
Minimum recognition site length [4]:
Output file [cam1_1-429.remap]:

$ restover
Finds restriction enzymes that produce a specific overhang
Input sequence(s): cam1.fasta
Overlap sequence: overhang.fasta
Output file [cam1_1-429.restover]:

melttemp dan

Calculates the melting temperature of a DNA or RNA sequence (Breslauer and Baldino statistics). Use the -opt flag to further specify calculations.

$ dan
Calculates DNA RNA/DNA melting temperature
Input sequence(s): cam1.fasta
Enter window size [20]:
Enter Shift Increment [1]:
Enter DNA concentration (nM) [50.]:
Enter salt concentration (mM) [50.]:
Output report [cam1_1-429.dan]:
    
MEME no direct equivalent

See the MEME application included as an option in your BioBind software.

moment hmoment

Calculates the hydrophobic moment of protein. Use the -opt flag to specify the angle of rotation.

$ hmoment
Hydrophobic moment calculation
Input sequence(s): calm_human
Output file [calm_human.hmoment]:
    
motifs patmatmotifs/pscan

patmatmotifs searches the PROSITE database for patterns. Use the -opt flag to specify patterns. pscan searches the PRINTS database for fingerprint motifs.

$ patmatmotifs
Search a PROSITE motif database with a protein sequence
Input sequence: calm_human
Output report [calm_human.patmatmotifs]:

$ pscan
Scans proteins using PRINTS
Input sequence(s): calm_human
Minimum number of elements per fingerprint [2]:
Maximum number of elements per fingerprint [20]:
Output file [calm_human.pscan]:

names infoseq

Describes sequence attributes such as name, length, GC content.

$ infoseq
Displays some simple information about sequences
Input sequence(s): calm_human
# USA                        Name        Accession Type Length Description
fasta::calm_human:CALM_HUMAN CALM_HUMAN  P62158    P    148    Calmodulin (CaM).
    
nooverlap diffseq

Finds differences between two sequences. Use the -opt flag to output the information in columns.

$ diffseq
Find differences between nearly identical sequences
Input sequence: cam1.fasta
Second sequence: cam2.fasta
Word size [10]:
Output report [cam1_1-429.diffseq]:
Output features [CaM1_1-429.diffgff]:
Second output features [CaM2.diffgff]:
    
pepdata getorf/sixpack

Translates all six open reading frames. getorf displays selected translations. sixpack displays DNA sequence and peptide translation. Use the -opt flag for either program to specify the codon usage information.

$ getorf
Finds and extracts open reading frames (ORFs)
Input sequence(s): cam1.fasta
Output sequence [cam1_1-429.orf]:

$ sixpack
Display a DNA sequence with frame translation and ORFs
Input sequence: cam1.fasta
Output file [cam1_1-429.sixpack]:
Output sequence [cam1_1-429.fasta]:

pepplot pepinfo + garnier

pepinfo displays biophysical properties of the protein sequence and plots hydrophobicity (Kyte & Doolittle, Sweet & Eisenberg, Eisenberg). Use the -opt flag to select parameters for the hydrophobicity plots. garnier displays a secondary structure plot (Garnier, Osguthorpe & Robson).

$ pepinfo
Plots simple amino acid properties in parallel
Input sequence: calm_human
Graph type [x11]: ps
Output file [calm_human.pepinfo]:
Created pepinfo.ps

$ garnier
Predicts protein secondary structure
Input sequence(s): calm_human
Output report [calm_human.garnier]:

peptidemap digest

Peptide full or partial digest of a protein sequence.

$ digest
Protein proteolytic enzyme or reagent cleavage digest
Input sequence: calm_human
Enzymes and Reagents
         1 : Trypsin
         2 : Lys-C
         3 : Arg-C
         4 : Asp-N
         5 : V8-bicarb
         6 : V8-phosph
         7 : Chymotrypsin
         8 : CNBr
Select number [1]:
Output report [calm_human.digest]:
    
peptidestructure /
plotstructure
garnier

Displays a secondary structure plot (Garnier, Osguthorpe & Robson).

$ garnier
Predicts protein secondary structure
Input sequence(s): calm_human
Output report [calm_human.garnier]:
    
pileup emma

Wrapper to the ClustalW multiple sequence alignment program. Accepts all EMBOSS input formats.

$ emma
Multiple alignment program - interface to ClustalW program
Input sequence(s): prot_all.fasta
Output sequence [cam2.aln]:
Dendogram output filename [cam2.dnd]:



 CLUSTAL W (1.83) Multiple Sequence Alignments



Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: CaM2            148 aa
Sequence 2: CaM3            148 aa
Sequence 3: CaM1            148 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score:  93
Sequences (1:3) Aligned. Score:  100
Sequences (2:3) Aligned. Score:  97
Guide tree        file created:   [00002524C]
Start of Multiple Alignment
There are 2 groups
Aligning...
Group 1: Sequences:   2      Score:2070
Group 2: Sequences:   3      Score:1098
Alignment Score 1439
GCG-Alignment file created      [00002524B]
    
plasmidmap lindna/cirdna

Display of linear and circular DNA. Also, the Jemboss DNA Editor is a graphical interface to display and edit linear and circular DNA.

plotsimilarity plotcon

See gapshow

pretty /
prettybox
cons/prettyplot/showalign

cons calculates a consensus from a multiple alignment using specified parameters. prettyplot displays an alignment with specified colours and boxed in display. showalign displays the alignment in editable text format. Use the -opt flag for all three programs to set values. These programs are all incorporated in the Jemboss Alignment Editor together with additional capabilities.

prime eprimer3

Included in your BioBind software. Allows selection of a variety of different primers under several conditions. Use the -opt flag to alter parameters.

$ eprimer3
Picks PCR primers and hybridization oligos
Input sequence(s): cam3.fasta
Output file [cam3.eprimer3]:
    
profilegap /
profilemake
prophet/prophecy

prophecy creates matrices or profiles from multiple alignments. prophet reads in these files to create a gapped alignment of proteins.

$ prophecy
Creates matrices/profiles from multiple alignments
Input sequence set: emma.aln
Profile type
         F : Frequency
         G : Gribskov
         H : Henikoff
Select type [F]:
Enter a name for the profile [mymatrix]:
Enter threshold reporting percentage [75]:
Output file [emma.prophecy]:

$ prophet
Gapped alignment for profiles
Input sequence(s): calm_human
Profile or matrix file: emma.prophecy
Gap opening coefficient [1.0]:
Gap extension coefficient [1.0]:
Output file [calm_human.prophet]:

profilescan patmatdb

Uses a motif to search a protein sequence.

$ patmatdb
Search a protein sequence with a motif
Input sequence(s): emma.aln
Protein motif to search for: HATS
Output report [cam2.patmatdb]:
    
profilesearch profit

Scans a sequence or database with a matrix or profile. Uses the matrix file created by prophecy.

$ profit
Scan a sequence or database with a matrix or profile
Profile or matrix file: emma.prophecy
Input sequence(s): calm_human
Output file [emma.profit]:
    
reformat seqret

Reformatting files is redundant in EMBOSS, as each application reads and writes a variety of different formats. However, if anything needs converting, seqret will do it.

$ seqret
Reads and writes (returns) sequences
Input sequence(s): calm.gcg
Output sequence [calm_human.fasta]:
    
repeat equicktandem/etandem/einverted/palindrome

Searches for tandem repeats, inverted or palindromic sequences in a nucleotide input file.

$ equicktandem
Finds tandem repeats
Input sequence: cam1.fasta
Maximum repeat size [600]:
Threshold score [20]:
Output report [cam1_1-429.qtan]:

$ etandem
Looks for tandem repeats in a nucleotide sequence
Input sequence: cam1.fasta
Minimum repeat size [10]:
Maximum repeat size [10]:
Output report [cam1_1-429.tan]:

$ einverted
Finds DNA inverted repeats
Input sequence: cam1.fasta
Gap penalty [12]:
Minimum score threshold [50]:
Match score [3]:
Mismatch score [-4]:
Output file [cam1_1-429.inv]:

$ palindrome
Looks for inverted repeats in a nucleotide sequence
Input sequence(s): cam1.fasta
Enter minimum length of palindrome [10]:
Enter maximum length of palindrome [100]:
Enter maximum gap between repeated regions [100]:
Number of mismatches allowed [0]:
Output file [cam1_1-429.pal]:
Report overlapping matches [Y]:

replace biosed/degapseq

biosed replaces specified characters in a text file. degapseq is specific for removing gaps.

$ biosed
Replace or delete sequence sections
Input sequence(s): cam1.fasta
Sequence section to match [N]:
Replacement sequence section [A]:
Output sequence [cam1_1-429.fasta]:   

$ degapseq
Removes gap characters from sequences
Input sequence(s): cam1.fasta
Output sequence [cam1_1-429.fasta]:

reverse revseq

Reverses and complements a sequence. Almost any program in the suite can reverse and complement a sequence using the -sreverse option. Alternatively the [start:end:reverse] syntax will accomplish the same task.

$ revseq
Reverse and complement a sequence
Input sequence(s): cam1.fasta
Output sequence [cam1_1-429.rev]:
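
For readers unfamiliar with the operation, reversing and complementing a nucleotide sequence can be sketched in Python as below. This is a simplified illustration rather than the revseq code: it handles only A, C, G, T and U, and leaves any other character (such as N) unchanged. The function name is hypothetical.

```python
# Translation table mapping each base to its complement; U is treated as T.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAGT"))
```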
    
sample extractseq

Extracts specific regions from a sequence. Use the -opt flag to save them to a separate file.

$ extractseq
Extract regions from a sequence
Input sequence: cam1.fasta
Regions to extract (eg: 4-57,78-94) [1-429]: 1-25
Output sequence [cam1_1-429.fasta]:
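
The region syntax shown at the prompt (for example 4-57,78-94) uses 1-based, inclusive positions. The Python sketch below shows how such a region list might be interpreted, assuming the extracted regions are joined into a single sequence. It is an illustration only; the function name is hypothetical and the parsing in extractseq may accept more variants.

```python
def extract_regions(seq, regions):
    """Extract 1-based inclusive regions such as "4-57,78-94" and join them."""
    parts = []
    for item in regions.split(","):
        start_text, end_text = item.strip().split("-")
        start, end = int(start_text), int(end_text)
        if not 1 <= start <= end <= len(seq):
            raise ValueError(f"region {item!r} is outside the sequence")
        # Convert the 1-based inclusive range to a Python slice.
        parts.append(seq[start - 1:end])
    return "".join(parts)


if __name__ == "__main__":
    print(extract_regions("ABCDEFGHIJ", "1-3,8-10"))
```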
    
seg maskseq

Masks low complexity regions within a sequence. Use the -opt flag to select a region to mask.

$ maskfeat
Mask off features of a sequence.
Input sequence(s): cam1.fasta
Output sequence [cam1_1-429.fasta]:
    
shuffle shuffleseq

Shuffles one or a set of sequences.

$ shuffleseq
Shuffles a set of sequences maintaining composition
Input sequence(s): calm_human
Output sequence [calm_human.fasta]:
    
spscan sigcleave

Searches for signal sequences in proteins. Use the -opt flag to specify a prokaryotic sequence.

$ sigcleave
Reports protein signal cleavage sites
Input sequence(s): calm_human
Minimum weight [3.5]:
Output report [calm_human.sig]:
    
stemloop etandem/palindrome

See repeat

testcode wobble

See codonpreference

toFASTA /
toPIR /
toIG /
toSTADEN
seqret

See fromEMBL

translate transeq

See extractpeptide.

window + statplot freak

Calculates the base or residue frequency of a sequence. Use the -opt flag to select the window type for calculation of the plot.

$ freak
Residue/base frequency table or plot
Input sequence(s): cam1.fasta
Residue letters [gc]:
Output file [cam1_1-429.freak]:
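
A windowed residue or base frequency of the kind freak reports can be sketched in Python as follows. The window and step values here are illustrative assumptions, as is the function name; the sketch does not reproduce the freak output format.

```python
def window_frequency(seq, letters="gc", window=30, step=1):
    """Return the fraction of `letters` in each sliding window of seq."""
    seq = seq.lower()
    wanted = set(letters.lower())
    frequencies = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        hits = sum(1 for ch in chunk if ch in wanted)
        frequencies.append(hits / window)
    return frequencies


if __name__ == "__main__":
    print(window_frequency("GGCCAATT", "gc", window=4, step=2))
```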
    
gcghelp tfm

Stands for "the fine manual" and contains the individual program documentation. Type tfm followed by the program name.

$ tfm stretcher
Displays a program's help documentation manual
    

©2005 BioBind.com All rights reserved.