featureparse
Parse features from GenBank, GenPept, or EMBL data
Syntax
FeatStruct = featureparse(Features)
FeatStruct = featureparse(Features, ...'Feature', FeatureValue, ...)
FeatStruct = featureparse(Features, ...'Sequence', SequenceValue, ...)
Input Arguments
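Features | Any of the following: a character vector or string containing
GenBank, GenPept, or EMBL features; a MATLAB character array including text
describing GenBank, GenPept, or EMBL features; or a MATLAB structure with
fields corresponding to GenBank, GenPept, or EMBL data, such as those returned
by genbankread, genpeptread, emblread, getgenbank, getgenpept, or getembl. |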
FeatureValue | Name of a feature contained in Features.
When specified, featureparse returns only the substructure
that corresponds to this feature. If there are multiple features with
the same FeatureValue, then FeatStruct is
an array of structures. |
SequenceValue | Property to control the extraction, when possible, of the sequences
respective to each feature, joining and complementing pieces of the
source sequence and storing them in the Sequence field
of the returned structure, FeatStruct.
When extracting the sequence from an incomplete CDS feature, featureparse uses
the codon_start qualifier to adjust the frame of
the sequence. Choices are true or false (default). |
Output Arguments
FeatStruct | Output structure containing a field for every database feature.
Each field name in FeatStruct matches the
corresponding feature name in the GenBank, GenPept, or EMBL database,
with the exceptions listed in the table below. Fields in FeatStruct contain
substructures with feature qualifiers as fields. In the GenBank,
GenPept, and EMBL databases, for each feature, the only mandatory
qualifier is its location, which featureparse translates
to the field Location. When possible, featureparse also
translates this location to numeric indices, creating an Indices field.
Note: If you use the Indices field to extract sequence information, you may
need to complement the sequences. |
Description
FeatStruct = featureparse(Features) parses the features from Features, which
contains GenBank, GenPept, or EMBL features. Features can be a:
Character vector or string containing GenBank, GenPept, or EMBL features
MATLAB character array including text describing GenBank, GenPept, or EMBL features
MATLAB structure with fields corresponding to GenBank, GenPept, or EMBL data, such as those returned by genbankread, genpeptread, emblread, getgenbank, getgenpept, or getembl
FeatStruct is the output structure containing a field for every database
feature. Each field name in FeatStruct matches the corresponding feature name
in the GenBank, GenPept, or EMBL database, with the following exceptions.
Feature Name in GenBank, GenPept, or EMBL Database | Field Name in MATLAB Structure |
---|---|
-10_signal | minus_10_signal |
-35_signal | minus_35_signal |
3'UTR | three_prime_UTR |
3'clip | three_prime_clip |
5'UTR | five_prime_UTR |
5'clip | five_prime_clip |
D-loop | D_loop |
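The renaming in the table is needed because names such as 3'UTR are not valid MATLAB field names. As a minimal sketch of that mapping (an illustration only, not how featureparse is implemented; the variable names dbNames, fieldNames, and nameMap are hypothetical), a lookup in base MATLAB could look like this:

```matlab
% Illustrative lookup only; featureparse performs this renaming internally.
% All variable names here are hypothetical.
dbNames    = {'-10_signal','-35_signal','3''UTR','3''clip', ...
              '5''UTR','5''clip','D-loop'};
fieldNames = {'minus_10_signal','minus_35_signal','three_prime_UTR', ...
              'three_prime_clip','five_prime_UTR','five_prime_clip','D_loop'};
nameMap = containers.Map(dbNames, fieldNames);

featureName = '3''UTR';                 % feature name as written in the database
if isKey(nameMap, featureName)
    fieldName = nameMap(featureName);   % one of the exceptions in the table
else
    fieldName = featureName;            % names such as CDS or gene stay unchanged
end
disp(fieldName)
```

With a mapping like this, a feature written as 3'UTR in the record would be reached in the output as FeatStruct.three_prime_UTR.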
Fields in FeatStruct contain substructures with feature qualifiers as fields.
In the GenBank, GenPept, and EMBL databases, for each feature, the only
mandatory qualifier is its location, which featureparse translates to the
field Location. When possible, featureparse also translates this location to
numeric indices, creating an Indices field.
Note
If you use the Indices field to extract sequence information, you may need to
complement the sequences.
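The point of the note can be sketched with toy data. The example below assumes a single-span feature whose Location text contains the word complement, and it assumes a particular order for Indices; both the data and the variable names (sourceSeq, feat, piece) are made up for illustration, so check the actual Indices values returned for your record before relying on this pattern.

```matlab
% Toy data standing in for a parsed record; all values are hypothetical.
sourceSeq     = 'aaacatgggtttcc';
feat.Location = 'complement(4..9)';
feat.Indices  = [9 4];              % assumed order; it may differ in practice

idx   = sort(feat.Indices);         % accept either order for a single span
piece = sourceSeq(idx(1):idx(2));   % bases as written on the forward strand

if contains(feat.Location, 'complement')
    % Reverse the piece, then complement each base. With the Bioinformatics
    % Toolbox, seqrcomplement(piece) does both steps in one call.
    piece = fliplr(piece);
    [~, pos] = ismember(piece, 'acgt');
    complementBases = 'tgca';
    piece = complementBases(pos);
end
disp(piece)
```

Setting the 'Sequence' property to true, described below, lets featureparse do the joining and complementing itself, which avoids this manual step.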
FeatStruct = featureparse(Features, ...'PropertyName', PropertyValue, ...)
calls featureparse with optional properties that use property name/property
value pairs. You can specify one or more properties in any order. Each
PropertyName must be enclosed in single quotation marks and is case
insensitive. These property name/property value pairs are as follows:
FeatStruct = featureparse(Features, ...'Feature', FeatureValue, ...) returns
only the substructure that corresponds to FeatureValue, the name of a feature
contained in Features. If there are multiple features with the same
FeatureValue, then FeatStruct is an array of structures.
FeatStruct = featureparse(Features, ...'Sequence', SequenceValue, ...)
controls the extraction, when possible, of the sequences respective to each
feature, joining and complementing pieces of the source sequence and storing
them in the field Sequence. When extracting the sequence from an incomplete
CDS feature, featureparse uses the codon_start qualifier to adjust the frame
of the sequence. Choices are true or false (default).
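To make the frame adjustment concrete, here is a minimal sketch of what a codon_start offset means for a partial coding sequence. It is an assumption about the general idea, not a description of featureparse internals; cdsSeq and codonStart are hypothetical stand-ins for a feature's extracted sequence and its codon_start qualifier, and the trailing trim is there only so the toy sequence splits evenly into codons.

```matlab
% Toy values; cdsSeq and codonStart are hypothetical.
cdsSeq     = 'gatggccaaataa';
codonStart = 2;                        % first complete codon begins at base 2

inFrame = cdsSeq(codonStart:end);      % skip the leading partial codon
nCodons = floor(numel(inFrame) / 3);
inFrame = inFrame(1:3*nCodons);        % drop any trailing partial codon

% Split into a column cell array with one codon per cell.
codons = cellstr(reshape(inFrame, 3, [])');
disp(codons)
```

Reading the toy sequence from base 1 instead would group the bases into different triplets, which is why the qualifier matters for an incomplete CDS.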
Examples
The following example obtains all the features stored in the GenBank file nm175642.txt:

    gbkStruct = genbankread('nm175642.txt');
    features = featureparse(gbkStruct)

    features =
        source: [1x1 struct]
          gene: [1x1 struct]
           CDS: [1x1 struct]
The following example obtains only the coding sequences (CDS) feature of the Caenorhabditis elegans cosmid record (accession number Z92777) from the GenBank database:
    worm = getgenbank('Z92777');
    CDS = featureparse(worm,'feature','cds')

    CDS =
    1x12 struct array with fields:
        Location
        Indices
        locus_tag
        standard_name
        note
        codon_start
        product
        protein_id
        db_xref
        translation
Retrieve two nucleotide sequences from the GenBank database for the neuraminidase (NA) protein of two strains of the Influenza A virus (H5N1).
    hk01 = getgenbank('AF509094');
    vt04 = getgenbank('DQ094287');
Extract the sequence of the coding region for the neuraminidase (NA) protein from the two nucleotide sequences. The sequences of the coding regions are stored in the Sequence fields of the returned structures, hk01_cds and vt04_cds.

    hk01_cds = featureparse(hk01,'feature','CDS','Sequence',true);
    vt04_cds = featureparse(vt04,'feature','CDS','Sequence',true);
Once you have extracted the nucleotide sequences, you can use the nt2aa and nwalign functions to align the amino acid sequences converted from the nucleotide sequences.

    [sc,al] = nwalign(nt2aa(hk01_cds),nt2aa(vt04_cds),'extendgap',1);
Then you can use the seqinsertgaps function to copy the gaps from the aligned amino acid sequences to their corresponding nucleotide sequences, thus codon-aligning them.

    hk01_aligned = seqinsertgaps(hk01_cds,al(1,:))
    vt04_aligned = seqinsertgaps(vt04_cds,al(3,:))
Once you have codon-aligned the two sequences, you can use them as input to other functions such as dnds, which calculates the synonymous and nonsynonymous substitution rates of the codon-aligned nucleotide sequences. By setting Verbose to true, you can also display the codons considered in the computations and their amino acid translations.

    [dn,ds] = dnds(hk01_aligned,vt04_aligned,'verbose',true)
See Also
emblread | genbankread | genpeptread | getgenbank | getgenpept