cuffnorm
Normalize transcript expression levels
Syntax
cuffnorm(transcriptsAnnot,alignmentFiles)
cuffnorm(transcriptsAnnot,alignmentFiles,opt)
cuffnorm(transcriptsAnnot,alignmentFiles,Name,Value)
Description
cuffnorm(transcriptsAnnot,alignmentFiles) normalizes transcript expression to FPKM for the samples in alignmentFiles and corrects for differences in library size [1].
cuffnorm requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download link. For details, see Bioinformatics Toolbox Software Support Packages.
cuffnorm(transcriptsAnnot,alignmentFiles,opt) uses additional options specified by opt.
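As a minimal sketch of the opt syntax, the following sets a few options on a CuffNormOptions object and passes it to cuffnorm. It assumes that the Cufflinks Support Package is installed, that the example files gyrAB.gtf and Myco_*.sam used on this page are on the MATLAB path, and that the object's property names mirror the name-value arguments listed under Name-Value Arguments. The output folder name is illustrative.

```matlab
% Sketch only: requires the Cufflinks Support Package and the example files.
opt = CuffNormOptions;                            % options object with default values
opt.NumThreads = 4;                               % number of parallel threads
opt.LibraryNormalizationMethod = "classic-fpkm";  % no library-size scaling
opt.OutputDirectory = "./cuffnormOptOut";         % illustrative folder name

% One alignment file per sample (no replicates in this sketch).
isoformNorm = cuffnorm("gyrAB.gtf", ["Myco_1_1.sam","Myco_2_1.sam"], opt);
```

Keeping the settings in one object can be convenient when the same options are reused across several calls.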
cuffnorm(transcriptsAnnot,alignmentFiles,Name,Value) uses additional options specified by one or more name-value pair arguments. For example, cuffnorm('gyrAB.gtf',["Myco_1_1.sam", "Myco_2_1.sam"],'NumThreads',5) specifies to use five parallel threads.
Examples
Assemble Transcriptome and Normalize Expression Levels
Create a CufflinksOptions object to define cufflinks options, such as the number of parallel threads and the output directory to store the results.
cflOpt = CufflinksOptions;
cflOpt.NumThreads = 8;
cflOpt.OutputDirectory = "./cufflinksOut";
The SAM files provided for this example contain aligned reads for Mycoplasma pneumoniae from two samples with three replicates each. The reads are simulated 100bp-reads for two genes (gyrA and gyrB) located next to each other on the genome. All the reads are sorted by reference position, as required by cufflinks.
sams = ["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam",...
    "Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"];
Assemble the transcriptome from the aligned reads.
[gtfs,isofpkm,genes,skipped] = cufflinks(sams,cflOpt);
gtfs is a list of GTF files that contain assembled isoforms.
Compare the assembled isoforms using cuffcompare.
stats = cuffcompare(gtfs);
Merge the assembled transcripts using cuffmerge.
mergedGTF = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput');
mergedGTF reports only one transcript. This is because the two genes of interest are located next to each other, and cuffmerge cannot distinguish two distinct genes. To guide cuffmerge, use a reference GTF (gyrAB.gtf) containing information about these two genes. If the file is not located in the same directory that you run cuffmerge from, you must also specify the file path.
gyrAB = which('gyrAB.gtf');
mergedGTF2 = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput2',...
    'ReferenceGTF',gyrAB);
Calculate abundances (expression levels) from aligned reads for each sample.
abundances1 = cuffquant(mergedGTF2,["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
    'OutputDirectory','./cuffquantOutput1');
abundances2 = cuffquant(mergedGTF2,["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"],...
    'OutputDirectory','./cuffquantOutput2');
Assess the significance of changes in expression for genes and transcripts between conditions by performing the differential testing using cuffdiff.
The cuffdiff function operates in two distinct steps: the function first estimates abundances from aligned reads, and then performs the statistical analysis. In some cases (for example, distributing computing load across multiple workers), performing the two steps separately is desirable. After performing the first step with cuffquant, you can then use the binary CXB output file as an input to cuffdiff to perform statistical analysis. Because cuffdiff returns several files, specifying the output directory is recommended.
isoformDiff = cuffdiff(mergedGTF2,[abundances1,abundances2],...
    'OutputDirectory','./cuffdiffOutput');
Display a table containing the differential expression test results for the two genes gyrB and gyrA.
readtable(isoformDiff,'FileType','text')
ans = 2×14 table
    test_id             gene_id          gene      locus                      sample_1    sample_2    status    value_1       value_2       log2_fold_change_    test_stat    p_value    q_value    significant
    ________________    _____________    ______    _______________________    ________    ________    ______    __________    __________    _________________    _________    _______    _______    ___________
    'TCONS_00000001'    'XLOC_000001'    'gyrB'    'NC_000912.1:2868-7340'    'q1'        'q2'        'OK'      1.0913e+05    4.2228e+05    1.9522               7.8886       5e-05      5e-05      'yes'
    'TCONS_00000002'    'XLOC_000001'    'gyrA'    'NC_000912.1:2868-7340'    'q1'        'q2'        'OK'      3.5158e+05    1.1546e+05    -1.6064              -7.3811      5e-05      5e-05      'yes'
You can use cuffnorm to generate normalized expression tables for further analyses. cuffnorm results are useful when you have many samples and you want to cluster them or plot expression levels for genes that are important in your study. Note that you cannot perform differential expression analysis using cuffnorm.
Specify a cell array, where each element is a string vector containing file names for a single sample with replicates.
alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
    ["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"]}
isoformNorm = cuffnorm(mergedGTF2, alignmentFiles,...
    'OutputDirectory', './cuffnormOutput');
Display a table containing the normalized expression levels for each transcript.
readtable(isoformNorm,'FileType','text')
ans = 2×7 table
    tracking_id         q1_0          q1_2          q1_1          q2_1          q2_0          q2_2
    ________________    __________    __________    __________    __________    __________    __________
    'TCONS_00000001'    1.0913e+05    78628         1.2132e+05    4.3639e+05    4.2228e+05    4.2814e+05
    'TCONS_00000002'    3.5158e+05    3.7458e+05    3.4238e+05    1.0483e+05    1.1546e+05    1.1105e+05
Column names starting with q have the format: conditionX_N, indicating that the column contains values for replicate N of conditionX.
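The naming convention can be unpacked with ordinary string functions. The following sketch is not part of cuffnorm: it copies the column names and the TCONS_00000001 values from the table displayed above, splits each name into a condition label and a replicate index, and averages the replicates per condition. All variable names are illustrative, and the split assumes that condition labels contain no underscore.

```matlab
% Column names copied from the normalized expression table.
colNames = ["q1_0"; "q1_2"; "q1_1"; "q2_1"; "q2_0"; "q2_2"];

% Split each name at the underscore: condition label, then replicate index.
parts = split(colNames, "_");        % 6-by-2 string array
condition = parts(:,1);
replicate = str2double(parts(:,2));

% Values for TCONS_00000001, in the same column order as colNames.
fpkm = [1.0913e+05; 78628; 1.2132e+05; 4.3639e+05; 4.2228e+05; 4.2814e+05];

% Average the replicates within each condition.
[grp, condLabels] = findgroups(condition);
meanFpkm = splitapply(@mean, fpkm, grp);
summary = table(condLabels, meanFpkm)
```

In practice, you would read the values from the table returned by readtable instead of typing them in.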
Input Arguments
transcriptsAnnot — Name of transcript annotation file
string | character vector
Name of the transcript annotation file, specified as a string or character vector. The file can be a GTF or GFF file produced by cufflinks, cuffcompare, or another source of GTF annotations.
Example: "gyrAB.gtf"
Data Types: char | string
alignmentFiles — Names of SAM, BAM, or CXB files
string vector | cell array
Names of SAM, BAM, or CXB files containing alignment records for each sample, specified as a string vector or cell array. If you use a cell array, each element must be a string vector or cell array of character vectors specifying alignment files for every replicate of the same sample.
Example: ["Myco_1_1.sam", "Myco_2_1.sam"]
Data Types: char | string | cell
opt — cuffnorm options
CuffNormOptions object | string | character vector
cuffnorm options, specified as a CuffNormOptions object, string, or character vector. The string or character vector must be in the original cuffnorm option syntax (prefixed by one or two dashes) [1].
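When opt is a string, it carries the flags exactly as the command-line tool expects them. The sketch below composes such a string and passes it as the third argument. The flags --num-threads and --library-norm-method are assumptions taken from the Cufflinks command-line documentation, so check them against your installed version. Running the call also assumes that the support package is installed and the example files are on the MATLAB path.

```matlab
% Compose an option string in native cuffnorm syntax.
% The flag names are assumptions from the Cufflinks command-line tool.
numThreads = 5;
normMethod = "classic-fpkm";
optStr = sprintf("--num-threads %d --library-norm-method %s", numThreads, normMethod);

% Pass the string in place of a CuffNormOptions object.
% Requires the Cufflinks Support Package and the example files.
cuffnorm("gyrAB.gtf", ["Myco_1_1.sam","Myco_2_1.sam"], optStr);
```

A CuffNormOptions object or name-value arguments are usually easier to read. The string form is mainly useful when you already have a command line from an existing pipeline.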
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: cuffnorm('gyrAB.gtf',["Myco_1_1.sam", "Myco_2_1.sam"],'NumThreads',5)
ExtraCommand — Additional commands
"" (default) | string | character vector
Additional commands, specified as a string or character vector. The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB® properties.
Example: 'ExtraCommand','--library-type fr-secondstrand'
Data Types: char | string
IncludeAll — Flag to apply all available options
false (default) | true
Flag to apply all available options, specified as true or false. By default, the function converts only the specified options to the original (native) syntax, which is prefixed by one or two dashes. If the value is true, the software converts all available options, with default values for unspecified options, to the original syntax.
Note
When IncludeAll is true, the only exception is a property whose default value is NaN, Inf, [], '', or "". In this case, the software does not translate the corresponding property.
Example: 'IncludeAll',true
Data Types: logical
Labels — Labels for samples
[] (default) | string | character vector | string vector | cell array of character vectors
Labels for samples, specified as a string, character vector, string vector, or cell array of character vectors. If you provide labels, you must specify the same number of labels as input samples.
Example: 'Labels',["mutant1","mutant2"]
Data Types: char | string | cell
LibraryNormalizationMethod — Method to normalize library size
"geometric" (default) | "classic-fpkm" | "quartile"
Method to normalize the library size, specified as one of the following options:
"geometric" — The function scales the FPKM values by the median geometric mean of fragment counts across all libraries as described in [2].
"classic-fpkm" — The function applies no scaling to the FPKM values or fragment counts.
"quartile" — The function scales the FPKM values by the ratio of upper quartiles between fragment counts and the average value across all libraries.
Example: 'LibraryNormalizationMethod',"classic-fpkm"
Data Types: char | string
NormalizeCompatibleHits — Flag to use only fragments compatible with reference transcript to calculate FPKM values
true (default) | false
Flag to use only fragments compatible with a reference transcript to calculate FPKM values, specified as true or false.
Example: 'NormalizeCompatibleHits',false
Data Types: logical
NormalizeTotalHits — Flag to include all fragments to calculate FPKM values
false (default) | true
Flag to include all fragments to calculate FPKM values, specified as true or false. If the value is true, the function includes all fragments, including fragments without a compatible reference.
Example: 'NormalizeTotalHits',true
Data Types: logical
NumThreads — Number of parallel threads to use
1 (default) | positive integer
Number of parallel threads to use, specified as a positive integer. Threads are run on separate processors or cores. Increasing the number of threads generally improves the runtime significantly, but increases the memory footprint.
Example: 'NumThreads',4
Data Types: double
OutputDirectory — Directory to store analysis results
current directory ("./") (default) | string | character vector
Directory to store analysis results, specified as a string or character vector.
Example: 'OutputDirectory',"./AnalysisResults/"
Data Types: char | string
OutputFormat — Format for result files
"simple-table" (default) | "cuffdiff"
Format for result files, specified as "simple-table" or "cuffdiff".
"simple-table" — The output is in tab-delimited table format.
"cuffdiff" — The output is in the same format used by cuffdiff.
Example: 'OutputFormat',"cuffdiff"
Data Types: char | string
Seed — Seed for random number generator
0 (default) | nonnegative integer
Seed for the random number generator, specified as a nonnegative integer. Setting a seed value ensures the reproducibility of the analysis results.
Example: 'Seed',10
Data Types: double
Output Arguments
isoform — Name of file containing normalized expression level for isoform
"./isoforms.fpkm_table"
Name of a file containing the normalized expression level for each isoform, returned as a string.
The output string also includes the directory information defined by OutputDirectory. The default is the current directory. If you set OutputDirectory to "/local/tmp/", the output becomes "/local/tmp/isoforms.fpkm_table".
gene — Name of file containing normalized expression level for gene
"./genes.fpkm_table"
Name of a file containing the normalized expression level for each gene, returned as a string.
The output string also includes the directory information defined by OutputDirectory. The default is the current directory. If you set OutputDirectory to "/local/tmp/", the output becomes "/local/tmp/genes.fpkm_table".
tss — Name of file containing normalized expression level for transcript start site
"./tss_groups.fpkm_table"
Name of a file containing the normalized expression level for each transcript start site (TSS), returned as a string.
The output string also includes the directory information defined by OutputDirectory. The default is the current directory. If you set OutputDirectory to "/local/tmp/", the output becomes "/local/tmp/tss_groups.fpkm_table".
cds — Name of file containing normalized expression level for coding sequence
"./cds.fpkm_table"
Name of a file containing the normalized expression level for each coding sequence, returned as a string.
The output string also includes the directory information defined by OutputDirectory. The default is the current directory. If you set OutputDirectory to "/local/tmp/", the output becomes "/local/tmp/cds.fpkm_table".
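As a hedged sketch of how the four outputs might be requested together, the following assumes that cuffnorm returns them in the order listed in this section (isoform, gene, tss, cds). It also assumes that the support package is installed and the example files are on the MATLAB path. The output folder name is illustrative.

```matlab
% Two samples with three replicates each, as in the example on this page.
alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"], ...
                  ["Myco_2_1.sam","Myco_2_2.sam","Myco_2_3.sam"]};

% Assumed output order: isoform, gene, TSS group, and CDS tables.
[isoform, gene, tss, cds] = cuffnorm("gyrAB.gtf", alignmentFiles, ...
    'OutputDirectory', "./cuffnormAllOut");

% Each output is a file name; read the gene-level table as text.
geneTable = readtable(gene, 'FileType', 'text');
```

Each returned name includes the folder set by OutputDirectory, so it can be passed directly to readtable.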
References
[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.
Version History
Introduced in R2019a