multialign

Align multiple sequences using progressive method

Description

SeqsMultiAligned = multialign(Seqs) performs a progressive multiple alignment for a set of sequences.

By default, multialign computes the pairwise distances by aligning each pair of sequences with the Gonnet scoring matrix and then counting the proportion of sites at which the two sequences differ (ignoring gaps). It then calculates the guide tree by the neighbor-joining method, assuming equal variance and independence of the evolutionary distance estimates.
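
The distance described above is a proportion of differing sites over gap-free columns. As a minimal sketch in base MATLAB, assuming two toy sequences that are already pairwise aligned (the variables a, b, valid, and pDistance are hypothetical names, and this is not the toolbox's internal code), that proportion could be computed as follows:

```matlab
% Two hypothetical sequences, already pairwise aligned ('-' marks a gap).
a = 'ACG-TACGT';
b = 'ACGGTAC-A';

% Keep only the columns where neither sequence has a gap.
valid = (a ~= '-') & (b ~= '-');

% Proportion of gap-free sites at which the two sequences differ.
pDistance = sum(a(valid) ~= b(valid)) / sum(valid);
fprintf('p-distance = %.4f\n', pDistance);
```

In this toy pair, seven columns are gap-free and one of them differs, so the distance is 1/7. With Bioinformatics Toolbox, seqpdist (with its 'Method','p-distance' option) and seqneighjoin (with its 'equivar' method) offer comparable steps, although their output is not guaranteed to match the tree that multialign builds internally.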

SeqsMultiAligned = multialign(Seqs,Tree) uses a tree as a guide for the progressive alignment. The sequences must either appear in the same order as the leaves of the tree or, when Seqs is a vector of structures, include a field ("Header" or "Name") that identifies each sequence.

SeqsMultiAligned = multialign(___,Name=Value) uses additional options specified by one or more name-value arguments.

Examples

This example shows how to align multiple protein sequences.

Use the fastaread function to read p53samples.txt, a FASTA-formatted file included with Bioinformatics Toolbox™, which contains p53 protein sequences of seven species.

p53 = fastaread('p53samples.txt')
p53=7×1 struct array with fields:
    Header
    Sequence

Compute the pairwise distances between each pair of sequences using the 'GONNET' scoring matrix.

dist = seqpdist(p53,'ScoringMatrix','GONNET');

Build a phylogenetic tree using the unweighted average distance (UPGMA) method. This tree serves as the guide tree in the next step, the progressive alignment.

tree = seqlinkage(dist,'average',p53)
    Phylogenetic tree object with 7 leaves (6 branches)

Perform the progressive alignment using a series of PAM scoring matrices.

ma = multialign(p53,tree,'ScoringMatrix',...
                {'pam150','pam200','pam250'})
ma=7×1 struct array with fields:
    Header
    Sequence

Enter an array of sequences.

seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};

Align the sequences with terminal gap adjustment enabled, which makes gaps at the ends of the sequences cheaper to open.

multialign(seqs,'terminalGapAdjust',true)
ans = 3x17 char array
    '--CACGTAACATCTC--'
    'ACGACGTAACATCTTCT'
    '-AAACGTAACATCTCGC'

Compare this result with the alignment produced without terminal gap adjustment.

multialign(seqs)
ans = 3x17 char array
    'CA--CGTAACATCT--C'
    'ACGACGTAACATCTTCT'
    'AA-ACGTAACATCTCGC'

Input Arguments

Nucleotide or amino acid sequences, specified as a cell array of character vectors, vector of strings, matrix of characters, or vector of structures.

You can specify:

  • Cell array of character vectors or vector of strings containing nucleotide or amino acid sequences.

  • Matrix of characters, in which each row corresponds to a nucleotide or amino acid sequence.

  • Vector of structures containing a Sequence field for the residues and a Header or Name field for the labels.

Phylogenetic tree, specified as a phytree object. You can calculate the tree using the seqlinkage or seqneighjoin function.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: SeqsMultiAligned = multialign(Seqs,Weights="equal") assigns the same weight to every sequence.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: SeqsMultiAligned = multialign(Seqs,"Weights","equal")

Sequence weighting method, specified as "THG" or "equal". Weights emphasize highly divergent sequences by scaling the scoring matrix and gap penalties. Closer sequences receive smaller weights.

  • "THG" — Thompson-Higgins-Gibson method using the phylogenetic tree branch distances weighted by their thickness.

  • "equal" — Assigns the same weight to every sequence.

Scoring matrix for the progressive alignment, specified as a character vector, string scalar, or numeric matrix. You can specify a series of scoring matrices as a cell array of character vectors, array of strings, or numeric array.

Match and mismatch scores are interpolated from the series of scoring matrices by considering the distances between the two profiles or sequences being aligned. The first matrix corresponds to the smallest distance, and the last matrix to the largest distance. Intermediate distances are calculated using linear interpolation.
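
As a rough illustration of this interpolation, the following base MATLAB sketch blends a toy series of three 2-by-2 matrices. It assumes the distance is normalized to [0,1] and that the matrices sit at evenly spaced distances. Both are assumptions made for illustration, not documented behavior, and the names S, d, t, k, w, and Sd are hypothetical.

```matlab
% Toy series of N = 3 scoring matrices stacked as an M-by-M-by-N array.
S = cat(3, [2 -1; -1 2], [4 -2; -2 4], [6 -3; -3 6]);
N = size(S, 3);

% Hypothetical distance between the two profiles, normalized to [0,1].
% Assumption: the N matrices sit at distances 0, 1/(N-1), ..., 1.
d = 0.25;

% Locate the two neighboring matrices and the blend weight between them.
t = d * (N - 1);                 % position along the series, in [0, N-1]
k = min(floor(t) + 1, N - 1);    % index of the lower neighbor
w = t - (k - 1);                 % fraction of the way toward matrix k+1

% Linear interpolation of every match and mismatch score.
Sd = (1 - w) * S(:, :, k) + w * S(:, :, k + 1);
disp(Sd)
```

With d = 0.25, the result lies halfway between the first and second matrices, giving 3 on the diagonal and -1.5 off the diagonal.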

You can specify scoring matrix names. Valid choices are:

  • "BLOSUM62"

  • "BLOSUM30" increasing by 5 up to "BLOSUM90" (default for amino acid sequences is the "BLOSUM80" to "BLOSUM30" series)

  • "BLOSUM100"

  • "PAM10" increasing by 10 up to "PAM500"

  • "DAYHOFF"

  • "GONNET"

  • "NUC44" (default for nucleotide sequences). This choice is not supported for amino acid sequences.

Note

The above scoring matrices, provided with the software, also include a scale factor that converts the units of the output score to bits.

You can also specify a numeric matrix of size M-by-M, such as the one returned by the blosum, pam, dayhoff, gonnet, or nuc44 function. You can also specify a numeric array of size M-by-M-by-N for a series of N user-defined scoring matrices.

Note

  • If you use a scoring matrix that you created or that one of the above functions returned, the matrix does not include a scale factor, and the output score is returned in the same units as the scoring matrix. When passing your own series of scoring matrices, ensure that they share the same scale.

  • If you need to compile multialign into a standalone application or software component using MATLAB® Compiler™, use a numeric matrix instead of the scoring matrix name.

Example: "BLOSUM62" or 'BLOSUM62' specifies a BLOSUM scoring matrix with a percent identity level of 62, and includes a scale factor.

Example: ["pam150","pam200","pam250"] or {'pam150','pam200','pam250'} specifies a series of three PAM scoring matrices.

Example: blosum(62) specifies the numeric matrix returned by the blosum function, and does not include a scale factor.

Use linear interpolation of the scoring matrices, specified as a numeric or logical true (1) or false (0). When SMInterp is false, each scoring matrix is assigned to a fixed range depending on the distances between the two profiles or sequences being aligned.
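
For contrast with interpolation, the fixed-range assignment can be pictured as binning the distance. The sketch below is a minimal base MATLAB illustration that assumes the distance is normalized to [0,1] and that the ranges have equal width. The actual range boundaries used by multialign are not stated here, so treat the binning rule and the names N, d, and idx as hypothetical.

```matlab
N = 3;                         % number of scoring matrices in the series
d = [0.10 0.40 0.70 1.00];     % hypothetical normalized distances

% Assumption: [0,1] is split into N equal-width bins, and bin k selects
% the k-th matrix of the series (the largest distance maps to matrix N).
idx = min(floor(d * N) + 1, N);
disp(idx)
```

Under this assumption, the four distances select matrices 1, 2, 3, and 3, so each pair of profiles is scored with exactly one matrix from the series rather than a blend.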

Initial penalty for opening a gap, specified as a positive scalar or a function handle.

If you enter a function, multialign passes four values to the function: the average score for two matched residues (sm), the average score for two mismatched residues (sx), and the lengths of the two profiles or sequences (len1, len2). By default, multialign uses the function handle @(sm,sx,len1,len2) 5*sm, which sets the initial penalty for opening a gap at five times the average score for two matched residues. Although the default function does not depend on sx, len1, or len2, your custom function can use these values.

Data Types: double

Initial penalty for extending a gap, specified as a positive scalar or a function handle. If you specify this value, the function uses the affine gap penalty scheme, that is, it scores the first gap using the GapOpen value and scores subsequent gaps using the ExtendGap value. If you do not specify this value, the function scores all gaps equally, using the GapOpen penalty.

If you enter a function, multialign passes four values to the function: the average score for two matched residues (sm), the average score for two mismatched residues (sx), and the lengths of the two profiles or sequences (len1, len2). By default, multialign uses the function handle @(sm,sx,len1,len2) sm/4, which sets the initial penalty for extending a gap at one-fourth the average score for two matched residues. Although the default function does not depend on sx, len1, or len2, your custom function can use these values.

Data Types: double
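
To make the gap-penalty function handles concrete, the sketch below evaluates the two default handles quoted above alongside one custom handle. The custom handle (customOpen) and the input values for sm, sx, len1, and len2 are hypothetical and chosen only for illustration.

```matlab
% Default handles as quoted in the GapOpen and ExtendGap descriptions.
defaultOpen   = @(sm, sx, len1, len2) 5 * sm;
defaultExtend = @(sm, sx, len1, len2) sm / 4;

% Hypothetical custom handle: a lower base penalty that grows slowly
% with the length of the shorter profile (illustrative only).
customOpen = @(sm, sx, len1, len2) 3 * sm + log10(min(len1, len2));

% Hypothetical average scores and profile lengths.
sm = 2; sx = -1; len1 = 100; len2 = 1000;

openPenalty   = defaultOpen(sm, sx, len1, len2);
extendPenalty = defaultExtend(sm, sx, len1, len2);
customPenalty = customOpen(sm, sx, len1, len2);

fprintf('default open   = %g\n', openPenalty);
fprintf('default extend = %g\n', extendPenalty);
fprintf('custom open    = %g\n', customPenalty);
```

A handle such as customOpen could then be passed in a call like multialign(Seqs,GapOpen=customOpen). Any custom function must accept all four inputs, even if it ignores some of them.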

Threshold for delaying the alignment of divergent sequences, specified as a numeric scalar. The multialign function delays the alignment of any divergent sequence whose closest neighbor is farther than:

(DelayCutoff) * (median patristic distance between sequences)

The default value is 1, so a sequence is delayed when its closest neighbor is farther away than the median patristic distance.
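
The delay rule can be sketched in base MATLAB on a toy distance matrix. The matrix D, the variable names, and the choice to take the median over all distinct pairs of sequences are assumptions for illustration. In practice, the patristic distances would come from the guide tree.

```matlab
% Hypothetical symmetric matrix of patristic distances for four sequences.
D = [0    0.10 0.20 0.90;
     0.10 0    0.25 0.85;
     0.20 0.25 0    0.80;
     0.90 0.85 0.80 0   ];
cutoff = 1;                          % DelayCutoff value
n = size(D, 1);

% Distinct pairwise distances (upper triangle, excluding the diagonal).
pairs = D(triu(true(n), 1));

% Distance from each sequence to its closest neighbor.
Dn = D;
Dn(logical(eye(n))) = Inf;           % ignore self-distances
nearest = min(Dn, [], 2);

% A sequence is delayed when its closest neighbor exceeds the threshold.
threshold = cutoff * median(pairs);
delayed = nearest > threshold;
disp(delayed')
```

Here the median pairwise distance is 0.525, and only the fourth sequence, whose closest neighbor is 0.8 away, exceeds the threshold. Raising cutoff makes delays rarer, and lowering it makes them more common.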

Use parallel computation of the pairwise alignments, specified as a numeric or logical false (0) or true (1).

  • If true, and Parallel Computing Toolbox™ is installed, then computation occurs using parfor-loops.

    • If a parpool is open, then the computation uses the open parpool and occurs in parallel.

    • If no parpool is open, but automatic creation is enabled in the Parallel Preferences, then the default pool opens automatically and the computation occurs in parallel.

    • If no parpool is open and automatic creation is disabled, then the computation uses parfor-loops in serial mode.

  • If Parallel Computing Toolbox is not installed, then computation uses parfor-loops in serial mode.

  • If false, then the computation uses for-loops in serial mode.

Display the sequences with sequence information, specified as a numeric or logical false (0) or true (1).

Control automatic adjustment based on existing gaps, specified as a numeric or logical true (1) or false (0).

When true, for every profile position, multialign proportionally lowers the penalty for opening a gap toward the penalty of extending a gap based on the proportion of gaps found in the contiguous symbols and on the weight of the input profile.

When false, multialign turns off the automatic adjustment, based on existing gaps, of the position-specific penalties for opening a gap.

This argument is analogous to the corresponding option of the profalign function and is applied at every step of the progressive alignment of profiles.

Adjust the penalty for opening a gap at the ends of the sequence, specified as a numeric or logical false (0) or true (1). When true, the multialign function adjusts the penalty for opening a gap at the ends of the sequence to be equal to the penalty for extending a gap.

This argument is analogous to the corresponding option of the profalign function and is applied at every step of the progressive alignment of profiles.

Output Arguments

Aligned sequences, returned as a cell array of character vectors, vector of strings, matrix of characters, or vector of structures. The format of SeqsMultiAligned matches the format of the input sequences to align, Seqs.

  • When Seqs is a cell array of character vectors, vector of strings, or matrix of characters, the output alignment in SeqsMultiAligned follows the same order as the input.

  • When Seqs is a vector of structures, the Sequence field of SeqsMultiAligned is updated with the alignment. Other fields of SeqsMultiAligned match the fields of Seqs.

Version History

Introduced before R2006a