
Topics

  • Introduction
  • Qualifiers
  • Specifying Sequences - Uniform Sequence Address
  • Files
  • Formats
  • Alignment Formats
  • Report Formats
  • Graphic Format
  • Feature Format
  • Help Documentation

    Introduction

    EMBOSS is a fully flexible sequence analysis package containing over 200 individual applications, which may be run independently or linked together as part of a personal pipeline. It is designed to run in a UNIX environment; on Windows it runs under the Cygwin environment.

    It is free of arbitrary size limits - memory for sequences or matrices is allocated dynamically; the only restriction is the hardware. Thus the software can be used to analyse even the longest regions of the genome.

    For the developer wishing to use the BioBind software as a platform for their own code, EMBOSS contains library functions for general string handling, pattern-matching, sorting, iteration and extremely fast indexing. Library functions for all common sequence analysis tasks are also available. Although we recommend using the Jemboss interface contained within your BioBind software, EMBOSS has a consistent API, enabling students to design and program interfaces.

    Qualifiers

    EMBOSS programs are designed to be run from the command-line, as well as within scripts. To customise their behaviour, each has a distinct set of qualifiers, also known as options or flags. They should be added to the command line after the program name (and possibly the input file) has been specified. Some qualifiers require an input which should come after the qualifier. e.g.,

    wossname -alphabetical   Forces wossname to specify all programs in alphabetical order

    seqret sw:dsx_human -osformat gcg   Retrieves the human DSX protein from Swiss-Prot and outputs the file in GCG format.

    There are 3 classes of parameters: standard (mandatory), additional and advanced. Information on allowable flags for each program is given in the individual documentation files. Whilst the majority of these qualifiers are program specific, there are a number of general qualifiers which are relevant to all programs. Other qualifiers are relevant to a large proportion of programs, such as those that take a sequence as input.

    Mandatory Qualifiers

    Mandatory parameters must be specified in order for the program to run. If values for them are not given on the command line, the program will prompt for them. The most obvious example is an input file.

    Additional Qualifiers

    Should optional parameters not be specified, default values will be used. To access additional qualifiers whilst running a program, add the flag -options (which may be shortened to -opt) to your command line; the program will then prompt for them all.

    Advanced Qualifiers

    EMBOSS programs will never prompt for advanced parameters; these must be explicitly specified. These are defined in the program documentation and in the majority of cases consist of files inherent to the functioning of the program, such as amino acid data files.

    General Qualifiers

    These may be used with any program to alter its behaviour.

    • -auto   -   Turns off prompts and descriptions. Used when running programs in scripts.
    • -debug   -   Writes debug output to the file programname.dbg
    • -die   -   Reports program termination
    • -error   -   Reports errors in the program
    • -fatal   -   Reports fatal errors in the program
    • -filter   -   Reads from standard input (keyboard) and writes to standard output (screen) by default.
    • -help   -   Reports command line options. Use -help -verbose for a longer, more verbose listing.
    • -options   -   Prompts for all required and additional values
    • -stdout   -   Writes to standard output (screen) by default.
    • -warning   -   Reports warnings from the program

    Each of these qualifiers can be prefixed with no (e.g. -nowarning) to negate the action.
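
As a rough illustration of how qualifiers combine on a command line, the Python sketch below assembles an argument list for an EMBOSS program. The helper name build_command and its keyword convention are assumptions made for this example and are not part of EMBOSS; the helper only builds the list and does not run anything.

```python
def build_command(program, *args, **qualifiers):
    """Assemble an argument list for an EMBOSS program (hypothetical helper).

    True        -> "-name"        (switch the qualifier on)
    False       -> "-noname"      (negate it with the "no" prefix)
    other value -> "-name value"  (qualifier that takes an input)
    """
    command = [program, *args]
    for name, value in qualifiers.items():
        if value is True:
            command.append(f"-{name}")
        elif value is False:
            command.append(f"-no{name}")
        else:
            command.extend([f"-{name}", str(value)])
    return command


cmd = build_command("seqret", "sw:dsx_human", osformat="gcg", auto=True, warning=False)
print(" ".join(cmd))  # seqret sw:dsx_human -osformat gcg -auto -nowarning
```

A list of this kind could be handed to a process launcher such as subprocess.run, which avoids any shell quoting concerns.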

    Specifying Sequences - Uniform Sequence Address

    The Uniform Sequence Address (USA) is an unambiguous means of specifying sequences in EMBOSS and has the following syntax: format::database:entry. The USA thus specifies what format the program should expect, what file or database to open and which entry to look for, with each piece of information separated by colons.

    Format

    It is not always essential to specify the format of a sequence, as EMBOSS will start with the default fasta format and then cycle through other formats until it finds the correct one. In addition, the format of entries in a sequence database is determined when it is indexed by EMBOSS. Thus all database formats are already known. Sequences in plain or IG format are sometimes not recognised by EMBOSS. If this is the case for your sequence, the format should be explicitly stated.

    Less common sources of sequences like programs and URLs can also be specified using the USA.

    Database

    The USA needs an explicit statement of the location of the data. Thus the database information must be available if a sequence is to be read in from one. Alternatively, if the sequence or other entry data is in a file, the path of this file must be specified here. Short, single sequences can be entered into a program on an ad hoc basis. In this case, the location to be specified is asis.

    Your BioBind installation currently contains the following database definitions:

    Nucleotide Databases

    • em   -   EMBL IDs from the EBI
    • ema   -   EMBL ACCs from the EBI
    • embl   -   EMBL IDs from the EBI
    • embla   -   EMBL ACCs from the EBI
    • gb   -   GenBank IDs from Infobiogen
    • gba   -   GenBank ACCs from Infobiogen
    • genbank  -   GenBank IDs from Infobiogen
    • genbanka  -   GenBank ACCs from Infobiogen
    • refseq   -   REFSEQ IDs from EBI
    • refseqa   -   REFSEQ ACCs from EBI

    Protein Databases

    • pir  -   PIR IDs from the EBI
    • pira  -   PIR ACCs from the EBI
    • sw  -   SWISSPROT IDs from the EBI
    • swa  -   SWISSPROT ACCs from the EBI
    • swissprot  -   SWISSPROT IDs from the EBI
    • swissprota   -   SWISSPROT ACCs from the EBI
    • uni   -   UNIPROT IDs from the EBI
    • unia  -   UNIPROT ACCs from the EBI
    • uniprot   -   UNIPROT IDs from the EBI
    • uniprota   -   UNIPROT ACCs from the EBI

    The abbreviations for some of the database names (e.g., em for EMBL and gb for GenBank) are included to reduce the length of input on the command line - and thus potentially the number of typos associated with it.

    Should you wish to create your own database and index it for your BioBind software, please bear in mind that dot characters (.) are not legal in database names. There is no other naming convention for databases, so you can call yours whatever is most appropriate for you.

    Entry

    Finally the desired database entry or filename must be specified. If the sequence is in a database, then the entry should be an accession number or an identifier. If the data is elsewhere, the name of the data within a file is required. Should this information be omitted, the entire contents of the database or file will be read in by the program. Thus typing embl: and omitting the entry accession would result in the download of the entire EMBL database! If the entire file is necessary as input, then no information need appear and the entire USA would consist of just the path and name of the file.

    Currently your BioBind software retrieves database entries from repositories around the world, and thus it is necessary to specify, by means of your database selection, whether you wish to use an accession number (e.g. embla:z83307) or an identifier (e.g. embl:hsa1280) to retrieve the sequence. The entry is not case sensitive, so an accession number or identifier typed in either upper or lower case will retrieve the desired sequence. A sequence filename, on the other hand, IS case sensitive, so ensure it is typed on the command line exactly as it is stored in your directory.

    Sequence start and end positions may also be specified as part of the USA by appending them in square brackets ([ ]), separated by a colon (:). Thus a sequence input intended to analyse only the coding region of the 5385bp DNA sequence that translates to the Human Opsin 2 protein may be written as:

    seqret embl:AB065668[201:5185]

    which would ensure retrieval of only bases 201 to 5185 of this sequence from the database.
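
To make the pieces of a USA concrete, here is a minimal Python sketch that splits a string of the form format::database:entry, with an optional [start:end] range, into its parts. The names USA and parse_usa are invented for this example, and the parser is deliberately simplified (for instance, it does not cope with file paths that themselves contain colons); it is not how EMBOSS itself reads a USA.

```python
import re
from typing import NamedTuple, Optional


class USA(NamedTuple):
    format: Optional[str]   # e.g. "fasta"; None when not stated
    source: str             # database name or file path
    entry: Optional[str]    # accession/identifier; None means "everything"
    start: Optional[int]
    end: Optional[int]


# Optional trailing range such as [201:5185]
_RANGE = re.compile(r"\s*\[(\d+):(\d+)\]\s*$")


def parse_usa(text):
    """Split a USA string into its parts (simplified, illustrative parser)."""
    start = end = None
    match = _RANGE.search(text)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        text = text[:match.start()]
    fmt = None
    if "::" in text:
        fmt, text = text.split("::", 1)
    source, _, entry = text.partition(":")
    return USA(fmt, source, entry or None, start, end)


print(parse_usa("embl:AB065668[201:5185]"))
print(parse_usa("sw:dsx_human"))
print(parse_usa("embl:"))  # no entry: the whole database would be read
```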

    Wildcards

    In addition to retrieving a single entry, multiple entries may be specified in the USA format by the use of one of two wildcards. The asterisk (*) matches any number of characters and will retrieve a number of sequences. The question mark (?) matches a single character and thus allows for a more refined selection for retrieval. The UNIX shell gives both characters a special meaning of its own, so any USA typed on the command line that contains an * or a ? must either be surrounded by double quotes: "embl:hsa128?" or have the wildcard preceded by a backslash: embl:hsa128\?. Quoting of wildcard characters is not required when replying to a prompt from a program or when using a Jemboss program form.
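
When command lines are generated by a script rather than typed, the quoting can be automated. The sketch below uses Python's standard shlex.quote; the wrapper name quote_usa is an assumption for this example. Note that shlex.quote wraps the text in single quotes rather than the double quotes shown above; both stop the shell from expanding * and ?.

```python
import shlex


def quote_usa(usa):
    """Quote a USA for a POSIX shell so that * and ? reach the program intact."""
    return shlex.quote(usa)


print("seqret " + quote_usa("embl:hsa128?"))  # seqret 'embl:hsa128?'
print("seqret " + quote_usa("embl:hsa1280"))  # nothing to protect, left unchanged
```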

    Files

    Both single and multiple sequence files may be read in by an EMBOSS application. Alternatively List Files may be used instead of multiple sequence files. Some of the more common inputs appear below:

    • filename   -   all sequences in a file
    • filename:entry   -   an entry in a file
    • @filename   -   a list file
    • asis::ACTTAGGCTGACGG   -   a specific short sequence

    Sequence Files

    Single sequence files contain a single sequence in whatever format has been selected. By default this will be fasta format and look like this:

    >Sequence description on first line
    ATGACCCACCGTGTACGTGGACAGGTGTACAGCTCGATCCATCGACTCGCCTAGACTAC
    ACGCTACGCTCGATCGATCGACATCACGATCAGCATCGACATCAGCACTACGACTACGA
    ACACGTCGCTCGCTACGTACGCCCGTACCGATCGACACTACGACGCTCGCCGCGCAAAA

    There is no naming convention for these files, so they may be called anything you like. They must, however, be saved as text only.

    Multiple sequence files contain more than one sequence in the file and, in the default fasta format, will look something like this:

    >First Sequence Description
    ATGACCCACCGTGTACGTGGACAGGTGTACAGCTCGATCCATCGACTCGCCTAGACTAC
    ACGCTACGCTCGATCGATCGACATCACGATCAGCATCGACATCAGCACTACGACTACGA
    ACACGTCGCTCGCTACGTACGCCCGTACCGATCGACACTACGACGCTCGCCGCGCAAAA
    >Second Sequence Description
    ATGACCCACCGTGTACGTGGACAGGTGTACAGCTCGATCCATCGACTCGCCTAGACTAC
    >Third description
    ATGACCCACCGTGTACGTGGACAGGTGTACAGCTCGATCCATCGACTCGCCTAGACTAC
    ACGCTACGCTATCACGATCAGCATCGACATCAGCACTACGACTACGACGCCGCGCAAAA
    ACACGTCGCTCGCTACGTACGCCCGTACCGATCGACACTACGACGCT
    >Fourth Sequence Description
    ATGACCCACCGTGTAAAAAAAAAAAAAAAAAGGTGTACAGCTCGATCCATCGAGACTAC
    ACGCTACGCTCGATCGATCGACATCACGATCAGCATCGACATCAGCACTACGACTACGA
    ACACGTCGCTCGCTACGTACGCCCGTACCGATCGACACTACGACGCTCGCCGCGCAAAA
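
The fasta layout shown above is simple enough to read with a few lines of code. The following Python sketch is one possible reader, not the EMBOSS implementation: each line starting with > opens a new record, and the sequence lines that follow are joined together. The function name parse_fasta and the shortened example sequences are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from fasta-formatted text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """\
>First Sequence Description
ATGACC
CACCGT
>Second Sequence Description
ATGACCCACC
"""

for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```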

    There are several file formats that do not lend themselves to saving more than one sequence in a single file. These include staden and plain format as they have no indication of where a sequence starts and ends.

    To assist users who are converting from GCG, an additional syntax allows a single entry to be specified within a multiple sequence file in curly brackets, e.g. file:{entry}. If this syntax is used on the command line, it should appear in double quotes: "file:{entry}" otherwise it will be identified by the UNIX system as something else.

    Several sequences with a similar name can be specified using wildcards. The asterisk (*) is used to specify any number of characters and thus *.fasta would specify all files that ended in .fasta. The question mark (?) is used to specify a single character. Thus sequence?.fasta may be used to read in all sequence files with a single character difference, such as: sequence1.fasta, sequence2.fasta, sequence3.fasta.
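
The behaviour of the two wildcards can be tried out with Python's standard fnmatch module, which follows the usual shell conventions for * and ?. This is only an approximation of how EMBOSS matches names (fnmatch also understands [ ] character classes, for example), and the filenames sequence10.fasta and mutant.gcg are invented for the example.

```python
from fnmatch import fnmatchcase

names = ["sequence1.fasta", "sequence2.fasta", "sequence10.fasta", "mutant.gcg"]

# ? stands for exactly one character, so sequence10.fasta is not selected.
single = [name for name in names if fnmatchcase(name, "sequence?.fasta")]

# * stands for any number of characters.
anything = [name for name in names if fnmatchcase(name, "*.fasta")]

print(single)    # ['sequence1.fasta', 'sequence2.fasta']
print(anything)  # ['sequence1.fasta', 'sequence2.fasta', 'sequence10.fasta']
```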

    List Files

    A list file is a list of USAs. Instead of containing the actual sequences within one file, a list file contains references to those sequences. They may be a mix of database and local file references and there is no limit to the number of different file formats the list may contain.

    In order for the application to treat the input as a list of file locations and not full sequences, an at (@) sign must be placed in front of the list filename (and path if that is appropriate), e.g. @topdirectory/subdirectory/workingdirectory/sequence.list. Alternatively, the format specifier list: could be used, e.g. list:topdirectory/subdirectory/workingdirectory/sequence.list.

    For example, a list file might contain the following entries:

    • sequence1.fasta   -   Local file with sequence(s) in fasta format
    • sequence2.gcg   -   Local file with sequence(s) in GCG format
    • embl:hsa128?   -   All EMBL sequences with hsa128 as the start of their identifier
    • @mutant.list   -   Another list file

    Blank lines within the list file are ignored. Individual USAs may be commented out with a hash (#) character at the beginning of the line. This is useful in multiple sequence alignments, where the same list file may be reused and individual sequences left in or out of the calculation.
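
The rules for list files (one USA per line, blank lines ignored, # for comments, @ for a nested list file) can be sketched in Python as follows. To keep the example self-contained, the files are held in a dictionary instead of being read from disk; expand_list, the dictionary and the name mutant1.fasta are all assumptions made for this illustration.

```python
def expand_list(name, files, _seen=None):
    """Flatten a list file into the USAs it refers to.

    files maps a list-file name to its text and stands in for the file system.
    """
    seen = _seen if _seen is not None else set()
    if name in seen:
        return []  # guard against list files that include each other
    seen.add(name)
    usas = []
    for line in files[name].splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out USA
        if line.startswith("@"):
            usas.extend(expand_list(line[1:], files, seen))  # nested list file
        else:
            usas.append(line)
    return usas


files = {
    "sequence.list": "sequence1.fasta\n\n#sequence2.gcg\nembl:hsa128?\n@mutant.list\n",
    "mutant.list": "mutant1.fasta\n",
}
print(expand_list("sequence.list", files))
# ['sequence1.fasta', 'embl:hsa128?', 'mutant1.fasta']
```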

    Formats

    A format in this context refers to the identifying characters that surround a sequence and its associated features. It does not refer to a document type such as "PDF" or ".doc". No format contains any hidden characters, as all are ASCII text only.

    There was no individual format devised specifically for EMBOSS. Instead it reads and writes 42 different formats, allowing sequence data to be read in from a variety of different databases and other analysis tools. By allowing so many choices of output, data could potentially be retrieved as the result of an EMBOSS application and read in by something else.

    Sequence Formats

    Sequence formats are ASCII text. Thus they contain no formatting, other than the identification of each sequence and the location within the file of any features of the sequence, such as ID, description, coding regions, motifs and references. Information will be more or less complete depending on the format choice. There are currently 30 input sequence formats contained in your BioBind software, including: EMBL, GCG, Genbank, PIR, MSF, SwissProt and plain (raw). There are also 30 output sequence options.

    • abi   -   [single sequence]   ABI trace file format   (input only)
    • acedb   -   [multiple sequence]   ACeDB file format   (input and output)
    • asis   -   [single sequence]   Command line short sequence entry   (input only)
    • asn1   -   [multiple sequence]   ASN.1 format   (output only)
    • clustal   -   [multiple sequence]   Clustal ALN multiple alignment format   (input and output)
    • codata   -   [multiple sequence]   CODATA format   (input and output)
    • dbid   -   [multiple sequence]   FASTA type format   (input and output)
    • ddbj   -   [multiple sequence]   GenBank entry format   (input and output)
    • embl   -   [multiple sequence]   EMBL entry format   (input and output)
    • fasta   -   [multiple sequence]   FASTA format   (default input and output)
    • fitch   -   [multiple sequence]   Fitch DNA format   (output only)
    • gcg   -   [multiple sequence]   GCG 9.x and 10.x format   (input and output)
    • gcg8   -   [multiple sequence]   GCG 8.x format   (input and output)
    • genbank   -   [multiple sequence]   GenBank entry format   (input and output)
    • gff   -   [multiple sequence]   GFF format   (input and output)
    • Hennig86   -   [multiple sequence]   Hennig86 format   (input and output)
    • IG   -   [multiple sequence]   IntelliGenetics format   (input and output; must be specified)
    • jackknifer   -   [multiple sequence]   Jackknifer format   (input and output)
    • jackknifernon   -   [multiple sequence]   Jackknifernon format   (input and output)
    • mega   -   [multiple sequence]   Mega format   (input and output)
    • meganon   -   [multiple sequence]   Meganon format   (input and output)
    • msf   -   [multiple sequence]   GCG Multiple Sequence format   (input and output)
    • nexus   -   [multiple sequence]   Nexus/PAUP interleaved format   (input and output)
    • nexusnon   -   [multiple sequence]   Nexus/PAUPnon format   (input and output)
    • ncbi   -   [multiple sequence]   NCBI style fasta format   (input and output)
    • pfam   -   [multiple sequence]   Pfam format   (input only)
    • pir   -   [multiple sequence]   NBRF PIR format   (input and output)
    • plain   -   [single sequence]   no format (sequence only)   (input and output)
    • phylip   -   [multiple sequence]   PHYLIP interleaved format   (input and output)
    • phylip3   -   [multiple sequence]   PHYLIP non-interleaved format   (input and output)
    • raw   -   [single sequence]   no format (sequence only)   (input and output; must be specified)
    • selex   -   [multiple sequence]   SELEX format   (input and output)
    • staden   -   [multiple sequence]   STADEN format (defined by GCG)   (input and output)
    • stockholm   -   [multiple sequence]   STOCKHOLM format (used in Pfam and HMMER)   (input and output)
    • strider   -   [multiple sequence]   STRIDER DNA format   (input and output)
    • swissprot   -   [multiple sequence]   Swiss-Prot entry format   (input and output)
    • text   -   [single sequence]   no format (sequence only)   (input and output; must be specified)
    • treecon   -   [multiple sequence]   TREECON format   (input and output)
    • debug   -   [single sequence]   Report designed for debugging   (input and output)

    The default sequence file format is fasta. With two exceptions, the format of an input sequence need not be specified, as EMBOSS detects the rest automatically. Only plain (raw) or IG format need be explicitly stated.

    The default output can be altered by an environment setting: setenv EMBOSS_OUTFORMAT format

    where format is a specified sequence format for the new default setting.

      Sequence Input Qualifiers

      There are a variety of additional qualifiers that can be used to alter the behaviour of a sequence input.

      • -sbegin   -   integer   specifies first base used
      • -send   -   integer   specifies last base used   (default: seq length)
      • -sreverse   -   boolean   reverse sequence   (requires: DNA sequence )
      • -sask   -   boolean   ask for begin, end, reverse  (requires: reverse only if DNA sequence )
      • -snucleotide   -   boolean   specify sequence as nucleotide 
      • -sprotein   -   boolean   specify sequence as protein   (requires: protein sequence )
      • -slower   -   boolean   convert letters to lower case
      • -supper   -   boolean   convert letters to uppercase  
      • -sformat   -   string   specifies input sequence format   (requires: sequence format)
      • -sopenfile   -   string   specifies input filename   (requires: filename)
      • -sdbname   -   string   specifies database name  (requires: database)
      • -sid   -   string   specifies database or file entry name  (requires: entry name)
      • -ufo   -   string   specifies UFO features  (requires: filename)
      • -fformat   -   string   specifies feature format  (requires: feature format)
      • -fopenfile   -   string   specifies features filename   (requires: filename)

      Sequence Output Qualifiers

      There are a variety of additional qualifiers that can be used to alter the behaviour of a sequence output.

      • -osformat   -   string  specifies output sequence file format  
      • -osextension   -   string  specifies filename extension 
      • -osname   -   string  specifies a base filename 
      • -osdirectory   -   string  specifies output sequence file directory
      • -osdbname   -   string  specifies name of added database 
      • -ossingle   -   boolean  creates separate outfile for each entry 
      • -ufo   -   string  specifies feature file to create 
      • -offormat   -   string  specifies feature format  
      • -ofname   -   string  specifies feature filename 
      • -ofdirectory   -   string  specifies output directory 

    Alignment Formats

    There are currently 2 multiple alignment formats and 8 pairwise alignment formats. They have either been adopted from other programs or written especially for this software. Each format is biased towards either a human-readable layout of the type needed for publication, or a more easily parseable layout so that the results of an alignment can be used as part of a pipeline. There are various descriptors within an alignment format, for example to indicate similarity, identity and score.

    Different programs will have different default alignment formats. You may accept the default or choose your preferred format when you run the program. The formats have been given names that correspond to the names of existing alignment styles or programs. Some of the alignment formats can cope with an unlimited number of sequences, while others are only for pairwise alignments.

    Pairwise Alignment Formats

    • pair   -   Simple format for pairwise output (default output)
    • markx   -   Standard output from FASTA program suite
    • srspair   -   Similar to pair format
    • score   -   Score output only. No sequence display

    Multiple Alignment Formats

    • fasta   -   Standard fasta display. Gaps displayed as - (default output)
    • msf   -   Standard MSF format

    Alignment Format Qualifiers

    • -aformat   -   Alters output format
    • -awidth   -   Specifies alignment width
    • -ausashow   -   Displays the full USA in the alignment

Gaps

In all EMBOSS alignment formats, gaps that have been introduced into the sequences to make them align are indicated by the - character. The exception to this rule is msf format which uses . as the gap character inside the sequences and ~ as the gap character at the terminal ends of the alignment.

The header block contains a line similar to:

# Gaps: 25/131 (19.1%)

This is a count of the number of positions (25) over the length of the alignment where there are one or more sequences with a gap, followed by the length (131) of the alignment and the percentage (19.1%) of positions in the alignment where there are gaps.

Head and tail of the format

The majority of the alignment formats (with the exception of those that are also standard sequence formats, such as fasta or MSF) have a block of information at the start of the alignment describing the program, date, output filename, sequence identifiers and some of the parameters and statistics relevant to the alignment.

########################################
# Program: demoalign
# Rundate: Thu Jan 17 09:30:08 2002
# Report_file: stdout
########################################
#=======================================
#
# Aligned_sequences: 4
# 1: IXI_234
# 2: IXI_235
# 3: IXI_236
# 4: IXI_237
# Matrix: EBLOSUM62
# Gap_penalty: 9
# Extend_penalty: -1
#
# Length: 131
# Identity: 95/131 (72.5%)
# Similarity: 127/131 (96.9%)
# Gaps: 25/131 (19.1%)
#
#
#=======================================

There is also a block of data at the end of the alignment for summary information. This is used by a few programs e.g. merger.
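
Because every line of the header block has the form "# Name: value", it is straightforward to read back into a program. The Python sketch below shows one way to do this; the function name parse_header is an assumption, and numbered lines such as "# 1: IXI_234" are deliberately skipped to keep the example short.

```python
import re

# "# Name: value" where Name consists of letters and underscores
_FIELD = re.compile(r"^#\s*([A-Za-z_]+):\s*(.*)$")


def parse_header(text):
    """Collect the "# Name: value" lines of a header block into a dictionary."""
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


header = """\
########################################
# Program: demoalign
# Rundate: Thu Jan 17 09:30:08 2002
# Report_file: stdout
########################################
# Matrix: EBLOSUM62
# Length: 131
# Identity: 95/131 (72.5%)
"""

fields = parse_header(header)
print(fields["Program"], fields["Length"])  # demoalign 131
```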

Length

The header block contains a line similar to:

# Length: 131

This is the length of the alignment, including any gaps that have been introduced to construct the alignment.

Identity

The header block contains a line similar to:

# Identity: 95/131 (72.5%)

This is a count of the number of positions (95) over the length of the alignment where all of the residues or bases at that position are identical, followed by the length of the alignment (131) and the percentage (72.5) of positions in the alignment where there are identities.

Similarity

The header block contains a line similar to:

# Similarity: 127/131 (96.9%)

This is a count of the number of positions (127) over the length of the alignment where all the residues or bases at that position are similar - i.e. they score positively in the comparison matrix used in the alignment, followed by the length (131) of the alignment and the percentage (96.9) of positions in the alignment where there are similarities. Note that the sum of identical and similar positions is greater than 100%. This is because the count of similar positions includes the count of identical positions, as these will also score positively on the comparison matrix.
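
The Length, Identity and Gaps figures are simple column counts, as the following Python sketch for a pairwise alignment illustrates. The function names and the two short aligned sequences are invented for the example; similarity is left out because it depends on the comparison matrix in use.

```python
def alignment_stats(seq_a, seq_b, gap="-"):
    """Return (length, identical columns, gapped columns) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    length = len(seq_a)
    gaps = sum(1 for a, b in zip(seq_a, seq_b) if gap in (a, b))
    identity = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != gap)
    return length, identity, gaps


def stat_line(label, count, length):
    """Format a count in the style of the header block."""
    return f"# {label}: {count}/{length} ({100 * count / length:.1f}%)"


length, identity, gaps = alignment_stats("ACGT-ACGTA", "ACGTTACCTA")
print(stat_line("Identity", identity, length))  # Identity: 8/10 (80.0%)
print(stat_line("Gaps", gaps, length))          # Gaps: 1/10 (10.0%)
print(stat_line("Identity", 95, 131))           # Identity: 95/131 (72.5%)
```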

Score

The header block may contain a line similar to:

# Score: 100.0

This is the score used by the program that calculated the alignment to determine which is the best possible alignment to report. The algorithm that was used to derive the score is not part of the alignment formatting routines.

Markup Line

The markup line is commonly placed between the two sequences of a pairwise alignment, or at the bottom of alignments of 3 or more sequences, to show where sequences are mismatched, gapped, identical or similar. In general the markup line uses a space for a mismatch or a gap, a colon (:) for a similarity and a pipe (|) for an identity. The markx set of alignment formats uses a dot (.) for similarity and a colon (:) for identity.
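
The general markup convention can be reproduced with a short Python function. This is a sketch under stated assumptions: the function name markup_line is hypothetical, and the set of "similar" residue pairs is a tiny hand-picked stand-in for a real comparison matrix.

```python
def markup_line(seq_a, seq_b, similar_pairs, gap="-"):
    """Build a markup line: | identity, : similarity, space for mismatch or gap."""
    marks = []
    for a, b in zip(seq_a, seq_b):
        if gap in (a, b):
            marks.append(" ")
        elif a == b:
            marks.append("|")
        elif frozenset((a, b)) in similar_pairs:
            marks.append(":")
        else:
            marks.append(" ")
    return "".join(marks)


# Illustrative only: a real program would look scores up in its comparison matrix.
SIMILAR = {frozenset("IL"), frozenset("KR")}

top, bottom = "MKIL-A", "MRLLGW"
middle = markup_line(top, bottom, SIMILAR)
print(top)
print(middle)
print(bottom)
```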

Report Formats

Your BioBind software currently supports 18 different report formats. Standardising the report formats makes them easier to become familiar with and lets you select the output most appropriate for its future use. Report formats are available in human-readable form for publication, as well as more parseable options for input into further analysis tools.

  • embl   -   Reports in EMBL feature table format
  • genbank   -   Reports in GenBank feature table format
  • gff   -   Reports in GFF feature table format (default output)
  • pir   -   Reports in PIR feature table format
  • swiss   -   Reports in Swiss-Prot feature table format
  • listfile   -   Reports motifs in List File format with USA style [start:end] feature positions
  • dbmotif   -   Reports in DBMotif feature table format
  • diffseq   -   Reports output similar to diffseq output
  • excel   -   TAB delimited table format for export into spreadsheets
  • feattable   -   Reports in FeatTable format
  • motif   -   Reports in Motif feature table format
  • regions   -   Reports in Regions feature table format
  • seqtable   -   Reports in SeqTable format
  • simple   -   Reports in SRS simple format
  • srs   -   Reports in SRS format
  • table   -   Reports in Table format
  • tagseq   -   Reports in TagSeq format
  • trace   -   Used for debugging. Writes out bug report

Head and tail of the format

The majority of the report formats have a block of information at the start of the report describing the program, date, output filename, ID name of the sequence and some of the parameters and statistics of the report. The exceptions are those formats which are also standard sequence or feature table formats, such as embl, genbank, gff, pir, swiss, excel and feattable.

########################################
# Program: garnier
# Rundate: Mon Feb 11 15:14:40 2002
# Report_file: report.dbmotif
########################################

#=======================================
#
# Sequence: 100K_RAT from: 1 to: 889
# HitCount: 206
#
# DCH = 0, DCS = 0
#
# Please cite:
# Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120
#
#
#
#=======================================

There is also a block of information at the end of the report for summary information.

#---------------------------------------
#
# Residue totals: H:364 E:149 T:191 C:185
# percent: H: 41.7 E: 17.1 T: 21.9 C: 21.2
#
#
#---------------------------------------

Each program that writes a report has a default report format defined for that program. This format is usually a table, but other more appropriate formats may be chosen as the default.

    Report Format Qualifiers

    There are several options that change the behaviour of report formats. These apply to both the input and output files.

    • -rformat   -   string  specifies format  
    • -ropenfile   -   string  specifies report filename 
    • -rextension   -   string  specifies filename extension 
    • -rname   -   string  specifies a base filename
    • -raccshow   -   boolean  displays sequence accession number in report
    • -rdesshow   -   boolean  displays sequence description in report
    • -rusashow   -   boolean  displays sequence USA in report
    • -rdirectory   -   string  specifies output file directory

Graphic Format

Currently EMBOSS will output graphical displays in a variety of formats. Graphics are in the style of the static PLplot plotting library. New, interactive graphics are planned for the next release of your BioBind software.

    Graphic Format Qualifiers

    There is only one qualifier, -graph, that changes the behaviour of the output graphic. It accepts the following values:

    • -graph X11   -   Outputs graphics in X11 format (default output for EMBOSS)
    • -graph PNG   -   Outputs graphics in PNG format (default output for Jemboss)
    • -graph ps   -   Outputs graphics in postscript format
    • -graph tektronics   -   Outputs graphics in tektronics format
    • -graph cps   -   Outputs coloured postscript

Feature Format

A feature is a region of interest in a specified nucleic or protein sequence. It has a specified start and end position and a name descriptor to identify exactly what type of feature it is. The majority of feature table definitions have a controlled vocabulary (i.e. there is a specified list of feature key names that can be used), thus any edits to the feature tables must adhere to the allowed set of feature keys. Features may also explicitly or implicitly hold the name of the program or database that they are derived from, the sense (in a nucleic sequence), the score and many other pieces of information. Feature tables are groups of features.

Different programs may have different default feature formats. You may accept the default or select your preferred format when you run the program. All feature files within your BioBind software suite store both the feature table and its associated sequence; they do not store raw feature tables.

There are currently 5 feature formats contained in your BioBind software:

  • embl   -   Format used by the EMBL nucleotide database
  • gff   -   General Feature Format defined by the Sanger Institute. Compatible with its genome software (default output)
  • swissprot   -   Format used by the Swiss-Prot protein database (default output)
  • pir   -   Format used by the PIR database
  • nbrf   -   Same as PIR format. Only available for input

Uniform Feature Object

A Uniform Feature Object (UFO) is a standard way of referring to a feature file so that it specifies both the format of the features in a file and the name of that file. In an analogous way to the USA, the feature format is given first and a colon (:) separates it from the name of the file, e.g. embl:results.dat. UFOs can be used to specify feature format and file on both input and output.
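
Splitting a UFO into its two parts takes one line of Python, as sketched below. The function name parse_ufo is hypothetical, and treating a UFO without a colon as "filename only, format not stated" is an assumption made for this example rather than documented behaviour.

```python
def parse_ufo(ufo):
    """Split a UFO of the form format:filename into (format, filename)."""
    fmt, sep, filename = ufo.partition(":")
    if not sep:
        # Assumption: no colon means only a filename was given.
        return None, ufo
    return fmt, filename


print(parse_ufo("embl:results.dat"))  # ('embl', 'results.dat')
```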

    Feature Format Qualifiers

    The commands available to modify the behaviour of the programs with regards to feature formats differ depending on whether the features are included in a sequence file or database entry.

    • -ufo   -   Uniform Feature Object (UFO) specifying the features
    • -fformat   -   Feature format

    Feature Format Input Qualifiers

    • -fbegin   -   Specifies first position from which to report feature
    • -fend   -   Specifies final position up to which to report features
    • -freverse   -   Reports features displayed on reverse strand   (requires: DNA sequence )

    Feature Format Output Qualifiers

    • -ofbegin   -   Specifies first position from which to report feature
    • -ofend   -   Specifies final position up to which to report features
    • -ofreverse   -   Reports features displayed on reverse strand   (requires: DNA sequence )

Help Documentation

There are three avenues of assistance open to users of EMBOSS through BioBind. The first is in selecting the correct application to use, which can be done with the program wossname. This is a search program that lets you specify a keyword. Each BioBind application has a single 55-character description line; wossname searches these descriptions and returns all those that contain the keyword, together with their programs. It is worth searching with the stem of a word, such as translat instead of translate or translation, to retrieve as many hits as possible.

The second option is to access documentation on an individual program. The information can be retrieved in its entirety by using the program tfm. Type

tfm programname

at the command line prompt.

Alternatively, if you would like to know the parameter options for an individual program, add the -help qualifier to a program name on the command line. To obtain a longer, more verbose help output, add -help -verbose onto the command line after the program name.

Finally, if all else fails and you are a BioBind customer, contact us.

©2005 BioBind.com All rights reserved.