Topics
- Introduction
- Qualifiers
- Specifying Sequences - Uniform Sequence Address
- Files
- Formats
- Alignment Formats
- Report Formats
- Graphic Format
- Feature Format
- Help Documentation
Introduction
EMBOSS is a flexible sequence analysis package containing
over 200 individual applications, which may be run independently
or linked together as part of a personal pipeline. It has been
designed to operate in a UNIX environment; on Windows,
it runs under the Cygwin environment.
It is free of arbitrary size limits - memory for sequences or matrices is allocated dynamically;
the only restriction is the hardware. Thus the software can be used to analyse even the longest
regions of the genome.
For the developer wishing to use the BioBind software as a platform for initiating their own code,
EMBOSS contains library functions for general string handling, pattern-matching, sorting, iteration and
extremely fast indexing. Library functions for all common sequence analysis tasks are also available.
Although we recommend using the Jemboss interface contained within your BioBind software,
EMBOSS has a consistent API, enabling students to design and program their own interfaces.
Qualifiers
EMBOSS programs are designed to be run from the command-line, as well as within scripts.
To customise their behaviour, each has a distinct set of qualifiers, also known
as options or flags. They should be added to the command line after the program
name (and possibly the input file) has been specified. Some qualifiers require a value,
which should follow the qualifier, e.g.:
wossname -alphabetical   Forces wossname to list all programs in alphabetical order
seqret sw:dsx_human -osformat gcg   Retrieves the human DSX protein from Swiss-Prot and
outputs the file in GCG format.
There are 3 classes of parameters: standard (mandatory), additional and advanced. Information on allowable
flags for each program is given in the individual documentation files. Whilst the majority of these qualifiers are
program specific, there are a number of general qualifiers which are relevant to all programs.
Additional qualifiers are relevant to a large proportion of programs, such as those that take
a sequence as input.
Mandatory Qualifiers
If values for mandatory parameters are not specified, the programs will prompt for them. These need to
be specified in order for the program to run. The most obvious of these would be an input file.
Additional Qualifiers
Should optional parameters not be specified, default values will be used.
To access additional qualifiers whilst running a program, adding the flag -opt to
your command line will ensure that the program prompts for them all.
Advanced Qualifiers
EMBOSS programs will never prompt for advanced parameters; these must be explicitly specified.
These are defined in the program documentation and in the majority of cases consist of files
inherent to the functioning of the program, such as amino acid data files.
General Qualifiers
These may be used with any program to alter its behaviour.
- -auto   -   Turns off prompts and descriptions. Used when running programs in scripts.
- -debug   -   Writes debug output to the file programname.dbg
- -die   -   Reports program termination
- -error   -   Reports errors in the program
- -fatal   -   Reports fatal errors in the program
- -filter   -   Reads from standard input (keyboard) and writes to
standard output (screen) by default.
- -help   -   Reports command line options. Use -help -verbose for a longer, more verbose help output
- -options   -   Prompts for all required and additional values
- -stdout   -   Writes to standard output (screen) by default.
- -warning   -   Reports warnings from the program
Each of these qualifiers can be prefixed with no (e.g. -nowarning) to negate the action.
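The rules above (program name first, then the input, then qualifiers, with values following their qualifier and "no" prefixing a negated switch) can be sketched in Python as a small helper for scripts. This is only an illustration: build_command is a hypothetical function, not part of EMBOSS or BioBind, and it only assembles the argument list without running anything.

```python
from typing import List, Union

Value = Union[bool, int, float, str]


def build_command(program: str, *positional: str, **qualifiers: Value) -> List[str]:
    """Assemble an argument list: program, positional inputs, then qualifiers.

    Hypothetical helper for illustration; it does not check that a qualifier
    is actually accepted by the named program.
    """
    args = [program, *positional]
    for name, value in qualifiers.items():
        if value is True:
            args.append(f"-{name}")                # boolean switch turned on
        elif value is False:
            args.append(f"-no{name}")              # negated with the "no" prefix
        else:
            args.extend([f"-{name}", str(value)])  # qualifier followed by its value
    return args


cmd = build_command("seqret", "sw:dsx_human", osformat="gcg", auto=True, warning=False)
print(" ".join(cmd))  # seqret sw:dsx_human -osformat gcg -auto -nowarning
```

Passing the resulting list to a process launcher (rather than a single shell string) also avoids any shell interpretation of special characters in the arguments.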
Specifying Sequences - Uniform Sequence Address
The Uniform Sequence Address (USA) is an unambiguous means of specifying sequences
in EMBOSS and has the following syntax: format::database:entry. Thus the USA
specifies what format the program should expect, what file or database to open and
which entry to look for and each piece of information is separated
by colons.
Format
It is not always essential to specify the format of a sequence
as EMBOSS will start with the default fasta format and then cycle through other formats
until it finds the correct one.
In addition, the format of entries in a sequence database is determined
when the database is indexed by EMBOSS.
Thus all database formats are already known.
Sequences in plain or IG format are sometimes not recognised by EMBOSS.
If this is the case for your sequence, the format should be explicitly stated.
Less common sources of sequences like programs and URLs
can also be specified using the USA.
Database
The USA needs an explicit statement of the location of the data. Thus
the database information must be available if a sequence is to be read in from one.
Alternatively, if the sequence or other
entry data is in a file, the path of this file must be specified here. Short, single sequences
can be entered into a program on an ad hoc basis. In this case, the location to be
specified is asis.
Your BioBind installation currently contains the following database
definitions:
Nucleotide Databases
- em   -   EMBL IDs from the EBI
- ema   -   EMBL ACCs from the EBI
- embl   -   EMBL IDs from the EBI
- embla   -   EMBL ACCs from the EBI
- gb   -   GenBank IDs from Infobiogen
- gba   -   GenBank ACCs from Infobiogen
- genbank  -   GenBank IDs from Infobiogen
- genbanka  -   GenBank ACCs from Infobiogen
- refseq   -   REFSEQ IDs from EBI
- refseqa   -   REFSEQ ACCs from EBI
Protein Databases
- pir  -   PIR IDs from the EBI
- pira  -   PIR ACCs from the EBI
- sw  -   SWISSPROT IDs from the EBI
- swa  -   SWISSPROT ACCs from the EBI
- swissprot  -   SWISSPROT IDs from the EBI
- swissprota   -   SWISSPROT ACCs from the EBI
- uni   -   UNIPROT IDs from the EBI
- unia  -   UNIPROT ACCs from the EBI
- uniprot   -   UNIPROT IDs from the EBI
- uniprota   -   UNIPROT ACCs from the EBI
The abbreviations for some of the database names (e.g., em for EMBL and gb for
GenBank) are included to reduce the length of input on the command line - and thus potentially the number
of typos associated with it.
Should you wish to create your own database and index it for your BioBind software, please bear in mind
that dot characters (.) are not legal in database names. There is no naming convention for databases, and so you can call it what is most
appropriate for you.
Entry
Finally the desired database entry or filename must be specified.
If the sequence is in a database, then the entry should be an accession number or
an identifier. If the data is elsewhere, the name of the data within a file is required.
Should this information be omitted, the entire contents of the database or file will be
read in by the program. Thus typing embl: and omitting the entry accession
would result in the download of the entire EMBL database! If the entire file is necessary
as input, then no information need appear and the entire
USA would consist of just the path and name of the file.
Currently your BioBind software retrieves database entries from repositories around the world, and thus it is necessary
to specify by means of your database selection, whether you wish to use an accession number (e.g. embla:z83307)
or an identifier (e.g. embl:hsa1280) to retrieve the sequence. The entry is not case sensitive, and thus an accession
number or identifier typed in either upper or lower case will retrieve the
desired sequence. A sequence file name, on the
other hand, IS case sensitive, so ensure it is typed on the command line exactly as it is stored
in your directory.
Sequence start and end positions may also be specified as part of the USA by using square brackets ([ ])
and the positions separated by a colon (:). Thus a sequence input intended to analyse only the coding
region of the 5385bp DNA sequence that translates to the Human Opsin 2 protein may be written as:
seqret embl:AB065668[201:5185]
which would ensure retrieval of this sequence from the database between bases 201 to 5185 only.
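As an illustration of how the pieces of a USA fit together, the following Python sketch splits a USA string into its format, location (database or file), entry and optional start and end positions. It is a simplified reading of the syntax described above, not the parser EMBOSS itself uses; parse_usa and the USA tuple are hypothetical names, and the sketch assumes that the file path contains no colons.

```python
import re
from typing import NamedTuple, Optional


class USA(NamedTuple):
    fmt: Optional[str]      # text before "::", if given
    location: str           # database name, file path or asis sequence
    entry: Optional[str]    # text after the single colon, if given
    start: Optional[int]    # first position from [start:end], if given
    end: Optional[int]      # last position from [start:end], if given


_RANGE = re.compile(r"\[(\d+):(\d+)\]$")


def parse_usa(usa: str) -> USA:
    """Split a USA of the form format::location:entry[start:end] (sketch only)."""
    usa = usa.strip()
    start = end = None
    match = _RANGE.search(usa)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        usa = usa[:match.start()].rstrip()
    fmt = None
    if "::" in usa:
        fmt, usa = usa.split("::", 1)
    location, _, entry = usa.partition(":")
    return USA(fmt, location, entry or None, start, end)


print(parse_usa("embl:AB065668[201:5185]"))
print(parse_usa("asis::ACTTAGGCTGACGG"))
```

An empty entry, as in embl:, is returned as None, which corresponds to the "entire database or file" case described above.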
Wildcards
In addition to retrieving a single entry, multiple entries may be specified in the USA format by the
use of one of two wildcards. The asterisk (*) is used to specify
any number of characters and will retrieve a number of sequences. The question mark (?) is used to specify a single character
and thus allows for a more refined selection for retrieval. If either of these wildcards is typed unprotected on the command line, the UNIX
shell will misinterpret it. Thus any USA that involves either an * or a ? must be surrounded by double quotes:
"embl:hsa128?" or have the wildcard preceded by a backslash: embl:hsa128\?
when typed on the command line only.
Quoting of wildcard characters is not required when replying to a prompt from
a program or when using a Jemboss program form.
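The difference between the two wildcards can be illustrated with Python's standard fnmatch module, which uses the same * and ? conventions. This is only a sketch of the matching behaviour: the identifiers below are made up for the example, select_entries is a hypothetical helper, and EMBOSS performs its own matching internally (fnmatch additionally gives special meaning to square brackets, which is not relevant here).

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_entries(pattern: str, identifiers: Iterable[str]) -> List[str]:
    """Return the identifiers matching a wildcard pattern, ignoring case."""
    pattern = pattern.lower()
    return [ident for ident in identifiers if fnmatchcase(ident.lower(), pattern)]


# Hypothetical identifiers, used only to show the matching rules.
ids = ["HSA1280", "HSA1281", "HSA12800", "HSA12"]

print(select_entries("hsa128?", ids))  # ['HSA1280', 'HSA1281']
print(select_entries("hsa128*", ids))  # ['HSA1280', 'HSA1281', 'HSA12800']
```

The question mark matches exactly one character, so HSA12800 is excluded by the first pattern, while the asterisk also matches it.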
Files
Both single and multiple sequence files may be read in by an EMBOSS application. Alternatively
List Files may be used instead of multiple sequence files. Some of the more common inputs appear below:
- filename   -   all sequences in a file
- filename:entry   -   an entry in a file
- @filename   -   a list file
- asis::ACTTAGGCTGACGG   -   a specific short sequence
Sequence Files
Single sequence files contain a single sequence in whatever format has been selected.
By default this will be fasta format and look like this:
>Sequence description on first line
ATGACCCACCGTGTACGTGGACAGGTGTACAGCTCGATCCATCGACTCGCCTAGACTAC
ACGCTACGCTCGATCGATCGACATCACGATCAGCATCGACATCAGCACTACGACTACGA
ACACGTCGCTCGCTACGTACGCCCGTACCGATCGACACTACGACGCTCGCCGCGCAAAA
There is no naming convention for these
files, so they may be called anything you like, but they must be saved as text only.
Multiple sequence files contain more than one sequence in the file and, in the default fasta
format will look something like this.
>First Sequence Description
ATGACCCACCGTGTACGTGGACAGGTGTACAGCTCGATCCATCGACTCGCCTAGACTAC
ACGCTACGCTCGATCGATCGACATCACGATCAGCATCGACATCAGCACTACGACTACGA
ACACGTCGCTCGCTACGTACGCCCGTACCGATCGACACTACGACGCTCGCCGCGCAAAA
>Second Sequence Description
ATGACCCACCGTGTACGTGGACAGGTGTACAGCTCGATCCATCGACTCGCCTAGACTAC
>Third description
ATGACCCACCGTGTACGTGGACAGGTGTACAGCTCGATCCATCGACTCGCCTAGACTAC
ACGCTACGCTATCACGATCAGCATCGACATCAGCACTACGACTACGACGCCGCGCAAAA
ACACGTCGCTCGCTACGTACGCCCGTACCGATCGACACTACGACGCT
>Fourth Sequence Description
ATGACCCACCGTGTAAAAAAAAAAAAAAAAAGGTGTACAGCTCGATCCATCGAGACTAC
ACGCTACGCTCGATCGATCGACATCACGATCAGCATCGACATCAGCACTACGACTACGA
ACACGTCGCTCGCTACGTACGCCCGTACCGATCGACACTACGACGCTCGCCGCGCAAAA
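To make the layout of a multiple sequence fasta file concrete, here is a minimal Python sketch that splits such a file into (description, sequence) pairs. It assumes only what is shown above: each record starts with a > line and the sequence continues over the following lines. parse_fasta is a hypothetical helper, not an EMBOSS function.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (description, sequence) pairs from fasta-formatted lines."""
    records: List[Tuple[str, str]] = []
    description = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if description is not None:   # close the previous record
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)           # sequence lines are concatenated
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """>First Sequence Description
ATGACC
CACCGT
>Second Sequence Description
ATGACCCACC
"""
for desc, seq in parse_fasta(example.splitlines()):
    print(desc, len(seq))
```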
There are several file formats that do not lend themselves to saving more than
one sequence in a single file. These include staden and plain format
as they have no indication of where a sequence starts and ends.
To assist users who are converting from GCG, an additional syntax allows a single entry to
be specified within a multiple sequence file in curly brackets, e.g. file:{entry}.
If this syntax is used on the command line, it should appear in double quotes: "file:{entry}", otherwise the UNIX
shell will interpret the curly brackets itself.
Several sequences with a similar name can be specified using wildcards. The asterisk (*) is used to specify
any number of characters and thus *.fasta would specify all files that ended in .fasta. The question
mark (?) is used to specify a single character. Thus sequence?.fasta may be used to read in all
sequence files with a single character difference, such as: sequence1.fasta, sequence2.fasta, sequence3.fasta.
List Files
A list file is a list of USAs. Instead of containing the actual sequences
within one file, a list file contains
references to those sequences. They may be a mix of database and local file
references and there is no limit to the number of different file formats the list
may contain.
In order for the application to treat the input as a list of file locations and not
full sequences, an at (@) sign must be placed in front of the list filename (and
path if that is appropriate), e.g. @topdirectory/subdirectory/workingdirectory/sequence.list.
Alternatively, the format specifier list: could be used, e.g.
list:topdirectory/subdirectory/workingdirectory/sequence.list.
A list file might, for example, contain the following entries:
- sequence1.fasta   -   Local file with sequence(s) in fasta format
- sequence2.gcg   -   Local file with sequence(s) in GCG format
- embl:hsa128?   -   All EMBL sequences with hsa128 as the start of their identifier
- @mutant.list   -   Another list file
Blank lines within the list file are ignored. Individual USAs may be commented out
with a hash (#) character at the beginning of the line. This is useful in multiple sequence
alignments where the same list file may be used and individual sequences left in or out
of the calculation.
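The blank-line and comment rules for list files can be sketched as follows. This Python fragment only mimics the rules stated above for illustration; read_usa_list is a hypothetical helper, and nested list files (entries starting with @) are returned unchanged rather than expanded.

```python
from typing import Iterable, List


def read_usa_list(lines: Iterable[str]) -> List[str]:
    """Return the active USAs from the lines of a list file."""
    usas = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue              # blank lines are ignored
        if line.startswith("#"):
            continue              # USAs commented out with a hash are skipped
        usas.append(line)
    return usas


list_file = """sequence1.fasta
#sequence2.gcg

embl:hsa128?
@mutant.list
"""
print(read_usa_list(list_file.splitlines()))
# ['sequence1.fasta', 'embl:hsa128?', '@mutant.list']
```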
Formats
A format in this context refers to the identifying characters surrounding a sequence
and its related features. It does not refer to a document type such as
"PDF" or ".doc". No format contains any hidden characters, as all formats are ASCII text only.
There was no individual format devised specifically for EMBOSS. Instead it reads and writes
42 different formats, allowing sequence data to be read in from a variety of different databases
and other analysis tools. With so many choices of output, data produced
by an EMBOSS application can potentially be read in by other software.
Sequence Formats
Sequence formats are ASCII text. Thus they contain no formatting, other than the identification of each sequence and the location
within the file of any features of the sequence, such as ID, description, coding regions, motifs and references. Information will
be more or less complete depending on the format choice. There are currently 30 input sequence formats contained in your BioBind software,
including: EMBL, GCG, Genbank, PIR, MSF, SwissProt and plain (raw). There are also 30
output sequence options.
| abi | [single sequence] | ABI trace file format | input format only |
| acedb | [multiple sequence] | ACeDB file format | input and output format |
| asis | [single sequence] | Command line short sequence entry | input format only |
| asn1 | [multiple sequence] | ASN.1 format | output format only |
| clustal | [multiple sequence] | Clustal ALN multiple alignment format | input and output format |
| codata | [multiple sequence] | CODATA format | input and output format |
| dbid | [multiple sequence] | FASTA type format | input and output format |
| ddbj | [multiple sequence] | GenBank entry format | input and output format |
| embl | [multiple sequence] | EMBL entry format | input and output format |
| fasta | [multiple sequence] | FASTA format | default input and output format |
| fitch | [multiple sequence] | Fitch DNA format | output format only |
| gcg | [multiple sequence] | GCG 9.x and 10.x format | input and output format |
| gcg8 | [multiple sequence] | GCG 8.x format | input and output format |
| genbank | [multiple sequence] | GenBank entry format | input and output format |
| gff | [multiple sequence] | GFF format | input and output format |
| Hennig86 | [multiple sequence] | Hennig86 format | input and output format |
| IG | [multiple sequence] | IntelliGenetics format | input and output format (must be specified) |
| jackknifer | [multiple sequence] | Jackknifer format | input and output format |
| jackknifernon | [multiple sequence] | Jackknifernon format | input and output format |
| mega | [multiple sequence] | Mega format | input and output format |
| meganon | [multiple sequence] | Meganon format | input and output format |
| msf | [multiple sequence] | GCG Multiple Sequence format | input and output format |
| nexus | [multiple sequence] | Nexus/PAUP interleaved format | input and output format |
| nexusnon | [multiple sequence] | Nexus/PAUPnon format | input and output format |
| ncbi | [multiple sequence] | NCBI style fasta format | input and output format |
| pfam | [multiple sequence] | Pfam format | input format only |
| pir | [multiple sequence] | NBRF PIR format | input and output format |
| plain | [single sequence] | no format (sequence only) | input and output format |
| phylip | [multiple sequence] | PHYLIP interleaved format | input and output format |
| phylip3 | [multiple sequence] | PHYLIP non-interleaved format | input and output format |
| raw | [single sequence] | no format - sequence only | input and output format (must be specified) |
| selex | [multiple sequence] | SELEX format | input and output format |
| staden | [multiple sequence] | STADEN format (defined by GCG) | input and output format |
| stockholm | [multiple sequence] | STOCKHOLM (used in Pfam and HMMER) format | input and output format |
| strider | [multiple sequence] | STRIDER DNA format | input and output format |
| swissprot | [multiple sequence] | Swiss-Prot entry format | input and output format |
| text | [single sequence] | no format - sequence only | input and output format (must be specified) |
| treecon | [multiple sequence] | TREECON format | input and output format |
| debug | [single sequence] | Report designed for debugging | input and output format |
The default sequence file format is fasta. With two exceptions, the format of an input
sequence need not be specified, as EMBOSS detects the rest automatically.
Only plain (raw) or IG format need be explicitly stated.
The default output can be altered by an environment setting:
setenv EMBOSS_OUTFORMAT format
where format is a specified sequence format for
the new default setting.
Sequence Input Qualifiers
There are a variety of additional qualifiers that can be used to alter the behaviour of a sequence input.
- -sbegin   -   integer   specifies first base used
- -send   -   integer   specifies last base used   (default: seq length)
- -sreverse   -   boolean   reverse sequence   (requires: DNA sequence)
- -sask   -   boolean   ask for begin, end, reverse   (requires: reverse only if DNA sequence)
- -snucleotide   -   boolean   specify sequence as nucleotide
- -sprotein   -   boolean   specify sequence as protein   (requires: protein sequence)
- -slower   -   boolean   convert letters to lowercase
- -supper   -   boolean   convert letters to uppercase
- -sformat   -   string   specifies input sequence format   (requires: sequence format)
- -sopenfile   -   string   specifies input filename   (requires: filename)
- -sdbname   -   string   specifies database name   (requires: database)
- -sid   -   string   specifies database or file entry name   (requires: entry name)
- -ufo   -   string   specifies UFO features   (requires: filename)
- -fformat   -   string   specifies feature format   (requires: feature format)
- -fopenfile   -   string   specifies features filename   (requires: filename)
Sequence Output Qualifiers
There are a variety of additional qualifiers that can be used to alter the behaviour of a sequence output.
- -osformat   -   string   specifies output sequence file format
- -osextension   -   string   specifies filename extension
- -osname   -   string   specifies a base filename
- -osdirectory   -   string   specifies output sequence file directory
- -osdbname   -   string   specifies name of added database
- -ossingle   -   boolean   creates separate outfile for each entry
- -ufo   -   string   specifies feature file to create
- -offormat   -   string   specifies feature format
- -ofname   -   string   specifies feature filename
- -ofdirectory   -   string   specifies output directory
Alignment Formats
There are currently 2 multiple alignment formats and 8 pairwise
alignment formats. They have been adopted
from other programs, or written especially for this software. Each format is biased towards either a human-readable
form of the type that would be necessary for publication, or a
more easily parseable form,
such that the results of an alignment could be used as part of a pipeline. There are various descriptors within
an alignment format, for example to indicate similarity, identity and score.
Different programs will have different default alignment formats. You may
accept the default or choose
your preferred format when you run the program. The formats have been given names that correspond
to the names of existing alignment styles or programs. Some of the alignment formats can cope with an
unlimited number of sequences, while others are only for pairwise alignments.
Pairwise Alignment Formats
| pair | Simple format for pairwise output (default output) |
| markx | Standard output from FASTA program suite |
| srspair | Similar to pair format |
| score | Score output only. No sequence display |
Multiple Alignment Formats
| fasta | Standard fasta display. Gaps displayed as - (default output) |
| msf | Standard MSF format |
Alignment Format Qualifiers
- -aformat   -   Alters output format
- -awidth   -   Sets alignment width
- -ausashow   -   Displays the full USA in the alignment
Gaps
In all EMBOSS alignment formats, gaps that have been introduced into the sequences to make them align are indicated by the - character.
The exception to this rule is msf format which uses . as the gap character inside the sequences and ~ as the gap character at the
terminal ends of the alignment.
The header block contains a line similar to:
# Gaps: 25/131 (19.1%)
This is a count of the number of positions (25) over the length of the alignment where there are one or more sequences with a gap,
followed by the length (131) of the alignment and the percentage (19.1%) of positions in the alignment where there are gaps.
Head and tail of the format
The majority of the alignment formats (with the exception of those that are also standard sequence formats, such as fasta or MSF)
have a block of information at the start of the alignment describing the program, date, output filename, sequence identifiers
and some of the parameters and statistics relevant to the alignment.
########################################
# Program: demoalign
# Rundate: Thu Jan 17 09:30:08 2002
# Report_file: stdout
########################################
#=======================================
#
# Aligned_sequences: 4
# 1: IXI_234
# 2: IXI_235
# 3: IXI_236
# 4: IXI_237
# Matrix: EBLOSUM62
# Gap_penalty: 9
# Extend_penalty: -1
#
# Length: 131
# Identity: 95/131 (72.5%)
# Similarity: 127/131 (96.9%)
# Gaps: 25/131 (19.1%)
#
#
#=======================================
There is also a block of data at the end of the alignment for summary information. This is used by a few programs e.g. merger.
Length
The header block contains a line similar to:
# Length: 131
This is the length of the alignment, including any gaps that have been introduced to construct the alignment.
Identity
The header block contains a line similar to:
# Identity: 95/131 (72.5%)
This is a count of the number of positions (95) over the length of the alignment where all of the residues or bases at that position are identical,
followed by the length of the alignment (131) and the percentage (72.5) of positions in the alignment where there are identities.
Similarity
The header block contains a line similar to:
# Similarity: 127/131 (96.9%)
This is a count of the number of positions (127) over the length of the alignment where all the residues or bases at that position are
similar - i.e. they score positively in the comparison matrix used in
the alignment,
followed by the length (131) of the alignment and the percentage (96.9) of positions in the alignment where there are similarities.
Note that the sum of the identity and similarity percentages can be greater than 100%. This is because the count of similar positions includes the count
of identical positions, as these will also score positively in the comparison matrix.
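The count/length (percentage) lines described above are easy to read back from a header block. The following Python sketch extracts the Identity, Similarity and Gaps counts and recomputes the percentages from them; it assumes only the line layout shown in the example header, and alignment_stats is a hypothetical helper rather than part of EMBOSS.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches header lines such as "# Identity: 95/131 (72.5%)".
_STAT = re.compile(r"^#\s*(Identity|Similarity|Gaps):\s*(\d+)/(\d+)")


def alignment_stats(header_lines: Iterable[str]) -> Dict[str, Tuple[int, int, float]]:
    """Map each statistic to (count, alignment length, percentage)."""
    stats = {}
    for line in header_lines:
        match = _STAT.match(line)
        if match:
            name = match.group(1)
            count, length = int(match.group(2)), int(match.group(3))
            stats[name] = (count, length, round(100.0 * count / length, 1))
    return stats


header = [
    "# Length: 131",
    "# Identity: 95/131 (72.5%)",
    "# Similarity: 127/131 (96.9%)",
    "# Gaps: 25/131 (19.1%)",
]
for name, (count, length, percent) in alignment_stats(header).items():
    print(name, count, length, percent)
```

For the example header this reproduces the percentages quoted in the text: 72.5 for identity, 96.9 for similarity and 19.1 for gaps.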
Score
The header block may contain a line similar to:
# Score: 100.0
This is the score used by the program that calculated the alignment to determine which is the best possible alignment to report.
The algorithm that was used to derive the score is not part of the alignment formatting routines.
Markup Line
The markup line is commonly placed between the two sequences of a pairwise alignment or at the bottom of alignments of 3 or more sequences
to show where sequences are mismatched, gapped, identical or similar.
In general the markup line uses a space for a mismatch or a gap, a colon
(:) for a similarity and a pipe (|)
to display identity. The markx set of alignment formats uses a dot (.) for similarity
and a colon (:) for identity.
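A markup line following the general convention (space, colon, pipe) can be sketched in a few lines of Python. This is an illustration only: EMBOSS decides similarity from positive scores in the comparison matrix, whereas this sketch takes a hypothetical, caller-supplied set of similar residue pairs, and markup_line is not an EMBOSS function.

```python
from typing import FrozenSet, Iterable


def markup_line(seq_a: str, seq_b: str,
                similar_pairs: Iterable[FrozenSet[str]] = (),
                gap: str = "-") -> str:
    """Build a markup line for two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    similar = set(similar_pairs)
    marks = []
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            marks.append(" ")      # gap
        elif a == b:
            marks.append("|")      # identity
        elif frozenset((a, b)) in similar:
            marks.append(":")      # similarity
        else:
            marks.append(" ")      # mismatch
    return "".join(marks)


# Hypothetical similar pairs, standing in for positive matrix scores.
similar = [frozenset("KR"), frozenset("IV")]
print("KLV-E")
print(markup_line("KLV-E", "RLIAE", similar))  # ":|: |"
print("RLIAE")
```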
Report Formats
Your BioBind software currently supports 18 different report formats. Standardisation of report formats
makes it easier both to become familiar with them and to select a specific output
appropriate for its future use. Report formats are available in a human-readable
form for publication as well as
more parseable options for input into further analysis tools.
- embl   -   Reports in EMBL feature table format
- genbank   -   Reports in GenBank feature table format
- gff   -   Reports in GFF feature table format (default output)
- pir   -   Reports in PIR feature table format
- swiss   -   Reports in Swiss-Prot feature table format
- listfile   -   Reports motifs in List File format with USA style [start:end] feature positions
- dbmotif   -   Reports in DBMotif feature table format
- diffseq   -   Reports output similar to diffseq output
- excel   -   TAB delimited table format for export into spreadsheets
- feattable   -   Reports in FeatTable format
- motif   -   Reports in Motif feature table format
- regions   -   Reports in Regions feature table format
- seqtable   -   Reports in SeqTable format
- simple   -   Reports in SRS simple format
- srs   -   Reports in SRS format
- table   -   Reports in Table format
- tagseq   -   Reports in TagSeq format
- trace   -   Used for debugging. Writes out bug report
Head and tail of the format
The majority of the report formats have a block of information at the start of the report
describing the program, date, output filename, ID name of the sequence and some of the parameters and statistics of the report.
The exceptions are those formats which are also standard sequence or feature table formats, such as embl, genbank, gff, pir, swiss, excel and feattable.
########################################
# Program: garnier
# Rundate: Mon Feb 11 15:14:40 2002
# Report_file: report.dbmotif
########################################
#=======================================
#
# Sequence: 100K_RAT from: 1 to: 889
# HitCount: 206
#
# DCH = 0, DCS = 0
#
# Please cite:
# Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120
#
#
#
#=======================================
There is also a block of information at the end of the report for summary information.
#---------------------------------------
#
# Residue totals: H:364 E:149 T:191 C:185
# percent: H: 41.7 E: 17.1 T: 21.9 C: 21.2
#
#
#---------------------------------------
Each program that writes a report has a default report format defined for that program.
This format is usually a table, but other more appropriate formats may be chosen as the default.
Report Format Qualifiers
There are several options that change the behaviour of report formats. These apply
to both the input and output files.
- -rformat   -   string   specifies format
- -ropenfile   -   string   specifies report filename
- -rextension   -   string   specifies filename extension
- -rname   -   string   specifies a base filename
- -raccshow   -   boolean   displays sequence accession number in report
- -rdesshow   -   boolean   displays sequence description in report
- -rusashow   -   boolean   displays sequence USA in report
- -rdirectory   -   string   specifies report output file directory
Graphic Format
Currently EMBOSS will output graphical displays in a variety of formats.
Graphics are in the style of the static PLplot libraries. New, interactive
graphics are planned for the next release of your BioBind software.
Graphic Format Qualifiers
There is only one qualifier, -graph, that will change the behaviour of the output graphic. It accepts the following values:
- -graph X11   -   Outputs graphics in X11 format (default output for EMBOSS)
- -graph PNG   -   Outputs graphics in PNG format (default output for Jemboss)
- -graph ps   -   Outputs graphics in postscript format
- -graph tektronics   -   Outputs graphics in tektronics format
- -graph cps   -   Outputs coloured postscript
Feature Format
A feature is a region of interest in a specified nucleic acid or protein sequence. It has a specified start and
end position and a name descriptor to identify exactly what type of feature
it is. The majority of feature table
definitions have a controlled vocabulary (i.e. there is a specified list of feature key names that can be used),
thus any edits to the feature tables must adhere to the allowed set of feature keys.
Features may also explicitly or implicitly hold the name of the program or database that they are
derived from, the sense (in a nucleic sequence), the score and many other pieces of information.
Feature tables are groups of features.
Different programs may have different default feature formats. You may accept the
default or select your preferred format when you run the program. All feature files within
your BioBind software suite store both the feature table and its related sequence. They do
not store raw feature tables.
There are currently 5 feature formats contained in your BioBind software:
- embl   -   Format used by the EMBL nucleotide database
- gff   -   General Feature Format defined by the Sanger Institute. Compatible with its genome software (default output)
- swissprot   -   Format used by the Swiss-Prot protein database (default output)
- pir   -   Format used by the PIR database
- nbrf   -   Same as PIR format. Only available for input
Uniform Feature Object
A Uniform Feature Object (UFO) is a standard way of referring to a feature file so that it specifies
the format of the features in a file and the name of that file. In an analogous way to the USA, the feature
format is given and then a colon (:) separates it from the name of the file. e.g. embl:results.dat
UFOs can be used to specify feature format and file both on input or output.
Feature Format Qualifiers
The commands available to modify the behaviour of the programs with regards to feature
formats differ depending on whether the features are included in a sequence file or database entry.
- -ufo   -   Uniform Feature Object (UFO) features
- -fformat   -   Feature format
Feature Format Input Qualifiers
- -fbegin   -   Specifies first position from which to report feature
- -fend   -   Specifies final position up to which to report feature
- -freverse   -   Reports features displayed on reverse strand   (requires: DNA sequence )
Feature Format Output Qualifiers
- -ofbegin   -   Specifies first position from which to report feature
- -ofend   -   Specifies final position up to which to report feature
- -ofreverse   -   Reports features displayed on reverse strand   (requires: DNA sequence )
Help Documentation
There are three avenues of assistance that are open to users of EMBOSS through BioBind. The
first is in selecting the correct application to use. This can be accomplished with the program
wossname. This is a search program and will allow you to specify
a keyword. Each BioBind
application has a single 55 character description line and wossname will search these
descriptions and return all those that contain the relevant word, together with their programs.
It is worth searching with the stem of a word such as translat instead of translate
or translation to retrieve as many hits as possible.
The second option is to access documentation on an individual program. The information can be
retrieved in its entirety by using the program tfm. Type
tfm programname at the command line prompt.
Alternatively, if you would like to know the parameter options for an individual program,
add the -help qualifier to a program name on the command line. To
obtain a longer, more verbose help output, add -help -verbose onto
the command line after the program name.
Finally, if all else fails and you are a BioBind customer,
contact us.