affyinvarsetnorm

Perform rank invariant set normalization on probe intensities from multiple Affymetrix CEL or DAT files

Syntax

NormData = affyinvarsetnorm(Data)
[NormData, MedStructure] = affyinvarsetnorm(Data)
... affyinvarsetnorm(..., 'Baseline', BaselineValue, ...)
... affyinvarsetnorm(..., 'Thresholds', ThresholdsValue, ...)
... affyinvarsetnorm(..., 'StopPercentile', StopPercentileValue, ...)
... affyinvarsetnorm(..., 'RayPercentile', RayPercentileValue, ...)
... affyinvarsetnorm(..., 'Method', MethodValue, ...)
... affyinvarsetnorm(..., 'Showplot', ShowplotValue, ...)

Arguments

`Data`	Matrix of intensity values where each row corresponds to a perfect match (PM) probe and each column corresponds to an Affymetrix® CEL or DAT file. (Each CEL or DAT file is generated from a separate chip. All chips should be of the same type.)
`MedStructure`	Structure of each column's intensity median before and after normalization, and the index of the column chosen as the baseline.
`BaselineValue`	Property to control the selection of the column index `N` from `Data` to be used as the baseline column. Default is the column index whose median intensity is the median of all the columns.
`ThresholdsValue`	Property to set the thresholds for the lowest average rank and the highest average rank, which are used to determine the invariant set. The rank invariant set is a set of data points whose proportional rank difference is smaller than a given threshold. The threshold for each data point is determined by interpolating between the threshold for the lowest average rank and the threshold for the highest average rank. Select these two thresholds empirically to limit the spread of the invariant set, but allow enough data points to determine the normalization relationship. `ThresholdsValue` is a 1-by-2 vector [`LT, HT`] where `LT` is the threshold for the lowest average rank and `HT` is the threshold for the highest average rank. Values must be between `0` and `1`. Default is [`0.05, 0.005`].
`StopPercentileValue`	Property to stop the iteration process when the number of data points in the invariant set reaches `N` percent of the total number of data points. Default is `1`. Note: If you do not use this property, the iteration process continues until no more data points are eliminated.
`RayPercentileValue`	Property to select the highest-ranked `N` percent of the invariant set of data points, through which a straight line is fitted, while the remaining data points are fitted to a running median curve. The final running median curve is a piecewise linear curve. Default is `1.5`.
`MethodValue`	Property to select the smoothing method used to normalize the data. Enter `'lowess'` or `'runmedian'`. Default is `'lowess'`.
`ShowplotValue`	Property to control the plotting of two pairs of scatter plots (before and after normalization). The first pair plots baseline data versus data from a specified column (chip) from the matrix `Data`. The second is a pair of M-A scatter plots, which plots M (ratio between baseline and sample) versus A (the average of the baseline and sample). Enter either `'all'` (plot a pair of scatter plots for each column or chip) or specify a subset of columns (chips) by entering the column number(s) or a range of numbers.

Description

NormData = affyinvarsetnorm(Data) normalizes the values in each column (chip) of probe intensities in Data to a baseline reference, using the invariant set method. NormData is a matrix of normalized probe intensities from Data.

Specifically, affyinvarsetnorm:

1. Selects a baseline index, typically the column whose median intensity is the median of all the columns.
2. For each column, determines the proportional rank difference (prd) for each pair of ranks, RankX and RankY, from the sample column and the baseline reference.
   prd = abs(RankX - RankY)
3. For each column, determines the invariant set of data points by selecting data points whose proportional rank differences (prd) are below threshold, which is a predetermined threshold for a given data point (defined by the ThresholdsValue property). It repeats the process until either no more data points are eliminated, or a predetermined percentage of data points is reached.
   The invariant set is data points with a prd < threshold.
4. For each column, uses the invariant set of data points to calculate the lowess or running median smoothing curve, which is used to normalize the data in that column.
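
The selection loop in the steps above can be sketched in base MATLAB. This is an illustrative sketch under stated assumptions, not the toolbox implementation: the `baseline` and `sample` vectors are synthetic, the rank difference is divided by the current set size so that it is comparable with thresholds between 0 and 1, and the per-point threshold is linearly interpolated over the average rank.

```matlab
% Illustrative sketch only; not the affyinvarsetnorm source code.
rng(0);                                        % reproducible synthetic data
n = 1000;
baseline = sort(rand(n, 1)) * 1000;            % hypothetical baseline column
sample   = 1.2 * baseline + 20 * randn(n, 1);  % hypothetical sample column

LT = 0.05;  HT = 0.005;                        % documented default thresholds
stopPct = 1;                                   % documented default StopPercentile

idx = (1:n)';                                  % current candidate invariant set
while true
    m = numel(idx);
    % Rank the candidates within the sample and within the baseline.
    [~, ox] = sort(sample(idx));    rx = zeros(m, 1);  rx(ox) = 1:m;
    [~, oy] = sort(baseline(idx));  ry = zeros(m, 1);  ry(oy) = 1:m;

    prd = abs(rx - ry) / m;                    % assumption: scaled by set size
    avgRank = (rx + ry) / 2;
    % Assumption: linear interpolation from LT (lowest rank) to HT (highest rank).
    thr = LT + (HT - LT) * (avgRank - 1) / max(m - 1, 1);

    keep = prd < thr;
    idx = idx(keep);
    % Stop when nothing was eliminated or the set is small enough.
    if all(keep) || numel(idx) <= stopPct / 100 * n
        break
    end
end
fprintf('Invariant set size: %d of %d points\n', numel(idx), n);
```

The points indexed by `idx` would then feed the smoothing step (lowess or running median) that defines the normalization curve.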

[NormData, MedStructure] = affyinvarsetnorm(Data) also returns a structure of the index of the column chosen as the baseline and each column's intensity median before and after normalization.
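
As a minimal sketch of the two-output form, the call below uses the `pmMatrix` variable from the example data file described under Examples. The field names of `MedStructure` are not listed on this page, so the sketch only displays the structure and recomputes the column medians independently with `median`; the variable names `medBefore` and `medAfter` are illustrative.

```matlab
load prostatecancerrawdata                 % provides pmMatrix
[NormData, MedStructure] = affyinvarsetnorm(pmMatrix);

disp(MedStructure)                         % inspect the returned fields

% Independent check of the per-column medians before and after normalization.
medBefore = median(pmMatrix, 1);
medAfter  = median(NormData, 1);
```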

Note

If Data contains NaN values, then NormData will also contain NaN values at the corresponding positions.

... affyinvarsetnorm(..., 'PropertyName', PropertyValue, ...) calls affyinvarsetnorm with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:

... affyinvarsetnorm(..., 'Baseline', BaselineValue, ...) lets you select the column index N from Data to be the baseline column. Default is the index of the column whose median intensity is the median of all the columns.
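
The default baseline rule can be approximated in base MATLAB as follows. This is a sketch, not the toolbox code: the toy matrix is made up, and picking the column whose median is closest to the median of the column medians is an assumption about how ties and even column counts are handled.

```matlab
% Toy matrix: the column medians are 2, 6, and 4.
Data = [1 5 3; 2 6 4; 3 7 5];

colMed = median(Data, 1);
% Assumption: choose the column whose median is nearest the overall median.
[~, baseIdx] = min(abs(colMed - median(colMed)));   % baseIdx is 3

% Passing an explicit baseline to the real data set (column 1 chosen
% arbitrarily for illustration).
load prostatecancerrawdata
NormData = affyinvarsetnorm(pmMatrix, 'Baseline', 1);
```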

... affyinvarsetnorm(..., 'Thresholds', ThresholdsValue, ...) sets the thresholds for the lowest average rank and the highest average rank, which are used to determine the invariant set. The rank invariant set is a set of data points whose proportional rank difference is smaller than a given threshold. The threshold for each data point is determined by interpolating between the threshold for the lowest average rank and the threshold for the highest average rank. Select these two thresholds empirically to limit the spread of the invariant set, but allow enough data points to determine the normalization relationship.

ThresholdsValue is a 1-by-2 vector [LT, HT], where LT is the threshold for the lowest average rank and HT is the threshold for the highest average rank. Values must be between 0 and 1. Default is [0.05, 0.005].
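
To make the interpolation concrete, the sketch below evaluates the per-point threshold at three average ranks for a hypothetical set of 101 points, assuming the interpolation is linear in the average rank (the page does not state the exact form). The custom thresholds in the final call are arbitrary illustrative values.

```matlab
LT = 0.05;  HT = 0.005;                    % documented defaults
n = 101;                                   % hypothetical number of data points
avgRank = [1 51 101];                      % lowest, middle, highest average rank

% Assumption: linear interpolation between LT and HT.
thr = interp1([1 n], [LT HT], avgRank);    % approximately [0.05 0.0275 0.005]

% Tighter thresholds give a narrower invariant set.
load prostatecancerrawdata
NormData = affyinvarsetnorm(pmMatrix, 'Thresholds', [0.03 0.003]);
```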

... affyinvarsetnorm(..., 'StopPercentile', StopPercentileValue, ...) stops the iteration process when the number of data points in the invariant set reaches N percent of the total number of data points. Default is 1.

Note

If you do not use this property, the iteration process continues until no more data points are eliminated.

... affyinvarsetnorm(..., 'RayPercentile', RayPercentileValue, ...) selects the highest-ranked N percent of the invariant set of data points, through which a straight line is fitted, while the remaining data points are fitted to a running median curve. The final running median curve is a piecewise linear curve. Default is 1.5.

... affyinvarsetnorm(..., 'Method', MethodValue, ...) selects the smoothing method for normalizing the data. When MethodValue is 'lowess', affyinvarsetnorm uses the lowess method. When MethodValue is 'runmedian', affyinvarsetnorm uses the running median method. Default is 'lowess'.
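
The properties described so far can be combined in one call. The sketch below normalizes the example data with both smoothing methods and compares the results; the `StopPercentile` and `RayPercentile` values are arbitrary illustrative choices, and pairing `RayPercentile` with `'runmedian'` reflects the description above rather than a confirmed restriction.

```matlab
load prostatecancerrawdata                 % provides pmMatrix

% Default smoothing method.
normLowess = affyinvarsetnorm(pmMatrix, 'Method', 'lowess');

% Running median, with illustrative stopping and ray settings.
normRunMed = affyinvarsetnorm(pmMatrix, 'Method', 'runmedian', ...
    'StopPercentile', 5, 'RayPercentile', 2);

% Largest absolute difference between the two results (NaN values are ignored).
maxAbsDiff = max(abs(normLowess(:) - normRunMed(:)));
```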

... affyinvarsetnorm(..., 'Showplot', ShowplotValue, ...) plots two pairs of scatter plots (before and after normalization). The first pair plots baseline data versus data from a specified column (chip) from the matrix Data. The second is a pair of M-A scatter plots, which plots M (ratio between baseline and sample) versus A (the average of the baseline and sample). When ShowplotValue is 'all', affyinvarsetnorm plots a pair of scatter plots for each column or chip. When ShowplotValue is one or more column numbers, or a range of numbers, affyinvarsetnorm plots a pair of scatter plots for the indicated columns (chips).
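
For readers unfamiliar with M-A plots, the quantities can be computed by hand as in the sketch below. It assumes the conventional log2 definitions and a sample-over-baseline sign convention; this page does not specify the scale or sign that the function's own plots use, and the intensities are made-up values.

```matlab
baseline = [100; 200; 400; 800];           % hypothetical baseline intensities
sample   = [200; 200; 200; 1600];          % hypothetical sample intensities

% Assumption: log2 ratio and mean log2 intensity.
M = log2(sample ./ baseline);              % [1; 0; -1; 1]
A = (log2(sample) + log2(baseline)) / 2;
```

After a successful normalization, the M values of the invariant points should scatter around zero across the range of A.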

Examples

Normalize Affymetrix data

This example shows how to normalize Affymetrix data. The prostatecancerrawdata.mat file used in the example contains data from Best et al., 2005.

Load a MAT-file, included with the Bioinformatics Toolbox™ software, which contains Affymetrix data variables, including pmMatrix, a matrix of PM probe intensity values from multiple CEL files.

load prostatecancerrawdata

Normalize the data in pmMatrix and plot data from columns (chips) 2 and 3. Column 1 is the baseline.

NormMatrix = affyinvarsetnorm(pmMatrix, 'Showplot',[2 3]);

References

[1] Li, C., and Wong, W.H. (2001). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2(8): research0032.1-0032.11.

[2] Best, C.J.M., Gillespie, J.W., Yi, Y., Chandramouli, G.V.R., Perlmutter, M.A., Gathright, Y., Erickson, H.S., Georgevich, L., Tangrea, M.A., Duray, P.H., Gonzalez, S., Velasco, A., Linehan, W.M., Matusik, R.J., Price, D.K., Figg, W.D., Emmert-Buck, M.R., and Chuaqui, R.F. (2005). Molecular alterations in primary prostate cancer after androgen ablation therapy. Clinical Cancer Research 11, 6823–6834.

Version History

Introduced in R2006a