seqfilter

Filter out sequences based on specified criterion

Syntax

seqfilter(fastqFile)

seqfilter(fastqFile,Name,Value)

[outFiles,nSeqIn,nSeqOut] = seqfilter(___)

Description

seqfilter(fastqFile) applies a filtering criterion to the sequences in fastqFile and saves the sequences that meet the criterion in a new FASTQ file. By default, the output file name is the input file name with the suffix '_filtered' inserted before the file extension. If you do not specify a criterion, the function uses the default method, 'MaxNumberLowQualityBases', with its default threshold of [0 10].

seqfilter(fastqFile,Name,Value) uses additional options specified by one or more Name,Value pair arguments.

[outFiles,nSeqIn,nSeqOut] = seqfilter(___) returns a cell array outFiles with the names of the output files. nSeqIn and nSeqOut are the numbers of sequences kept and filtered out from each input file, respectively.

Examples

Filter next-generation sequencing data

Filter out sequences with more than 10% of low quality bases, where a base is considered low quality when its quality score is less than 20.

[outFile,in,out] = seqfilter('SRR005164_1_50.fastq',...
                             'Method','MaxPercentLowQualityBases',...
                             'Threshold',[10 20]);

Check the number of sequences saved in the output file.

in

in = 
39

Check the number of sequences filtered out.

out

out = 
11

Filter out sequences with an average quality score below 20.

[outFile,in,out] = seqfilter('SRR005164_1_50.fastq',...
                             'Method','MeanQuality',...
                             'Threshold',20);

Apply the same criterion over a sliding window of 10 bases.

[outFile,in,out] = seqfilter('SRR005164_1_50.fastq',...
                             'Method','MeanQuality',...
                             'Threshold',20,'WindowSize',10);

Filter out sequences with fewer than 100 bases.

[outFile,in,out] = seqfilter('SRR005164_1_50.fastq',...
                             'Method','MinLength',...
                             'Threshold',100);

Input Arguments

`fastqFile` — Names of FASTQ files with sequence and quality information
character vector | string | string vector | cell array of character vectors

Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.

Example: 'SRR005164_1_50.fastq'

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Method','MaxNumberLowQualityBases','Threshold',[5 15] specifies to filter out sequences with a total of more than 5 low-quality bases, where a base is considered a low-quality base if its quality score is less than 15.
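The two calling styles described above can be compared side by side. The following is a minimal sketch, not an official example: the file name demo_reads.fastq, the two reads, and the suffixes '_a' and '_b' are made up for illustration, and the code assumes Bioinformatics Toolbox is installed and that tempdir is writable.

```matlab
% Write a tiny two-read FASTQ file so the sketch is self-contained.
% (File name and read contents are hypothetical.)
fq = fullfile(tempdir,'demo_reads.fastq');
fid = fopen(fq,'w');
fprintf(fid,'@read1\nACGTACGT\n+\nIIIIIIII\n');   % 'I' decodes to 40 under Phred+33
fprintf(fid,'@read2\nACGTACGT\n+\n########\n');   % '#' decodes to 2 under Phred+33
fclose(fid);

% R2021a and later: Name=Value syntax.
[filesA,inA,outA] = seqfilter(fq,Method="MeanQuality",Threshold=20, ...
    OutputDir=tempdir,OutputSuffix="_a",OverWrite=true);

% Any release: comma-separated pairs with quoted names.
[filesB,inB,outB] = seqfilter(fq,'Method','MeanQuality','Threshold',20, ...
    'OutputDir',tempdir,'OutputSuffix','_b','OverWrite',true);
```

Both calls request the same filter, so the counts they return should agree; only the output file names differ because of the suffix.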

`Method` — Criterion to filter sequences
`'MaxNumberLowQualityBases'` (default) | `'MaxPercentLowQualityBases'` | `'MeanQuality'` | `'MinLength'`

Criterion to filter sequences, specified as one of the following options. Specify only one filtering criterion per function call.

'MaxNumberLowQualityBases' – applies a maximum threshold on the number of low-quality bases allowed.
'MaxPercentLowQualityBases' – applies a maximum threshold on the percentage of low-quality bases allowed.
'MeanQuality' – applies a minimum threshold on the average base quality across each sequence.
'MinLength' – applies a minimum threshold on the sequence length.

Use this name-value pair argument together with 'Threshold' to specify the appropriate threshold value. Depending on the filtering criterion, the corresponding value for 'Threshold' can be a scalar or two-element vector. See the 'Threshold' option for the default values. If you do not specify 'Threshold', then the function uses the default threshold value of the specified method. For each filtering criterion, the function uses the base quality encoding format specified by the 'Encoding' name-value pair argument.

`Threshold` — Threshold value for filtering criterion
scalar | vector

Threshold value for the filtering criterion, specified as a scalar or vector. Use this name-value pair to define the threshold value for the filtering criterion specified by 'Method'.

Depending on the filtering criterion, the corresponding value for 'Threshold' can be a scalar or two-element vector. If you do not specify 'Threshold', then the function uses the default threshold value of the corresponding method. For each filtering criterion, the function uses the encoding format of the base quality specified by the 'Encoding' name-value pair argument.

`'Method'`	`'Threshold'`	Default `'Threshold'` value
`'MaxNumberLowQualityBases'`	Two-element vector `[V1 V2]`. V1 is a nonnegative integer that specifies the maximum number of low-quality bases allowed. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. Any sequence containing a number of low-quality bases greater than V1 is filtered out and not saved in the output file.	`[0 10]`
`'MaxPercentLowQualityBases'`	Two-element vector `[V1 V2]`. V1 is a scalar between 0 and 100 that specifies the maximum percentage of low-quality bases allowed. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. Any sequence containing a percentage of low-quality bases greater than V1 is filtered out and not saved in the output file.	`[0 10]`
`'MeanQuality'`	Positive scalar that specifies the minimum threshold on the average base quality across each sequence. Any sequence with average base quality less than this value is filtered out.	`0`
`'MinLength'`	Nonnegative integer that specifies the minimum threshold on the sequence length allowed. Any sequence with length less than this value is filtered out.	`1`
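To make the threshold rules in the table concrete, the sketch below evaluates three of the criteria by hand on one quality string. It is an illustration only, not the code seqfilter runs: the quality string and the threshold values are hypothetical, and the decoding assumes Phred+33 quality characters.

```matlab
% Hypothetical quality string for one read, assuming Phred+33 encoding.
qualStr = 'IIII##II5I';
q = double(qualStr) - 33;            % numeric quality score per base

% 'MaxNumberLowQualityBases' with Threshold [V1 V2] = [1 20]
V1 = 1;  V2 = 20;
nLow = sum(q < V2);                  % bases with quality strictly below V2
keepByCount = nLow <= V1;            % kept only if at most V1 low-quality bases

% 'MaxPercentLowQualityBases' with Threshold [V1 V2] = [25 20]
pctLow = 100 * nLow / numel(q);      % percentage of low-quality bases
keepByPercent = pctLow <= 25;

% 'MeanQuality' with Threshold 20
keepByMean = mean(q) >= 20;          % kept if the average quality is at least 20
```

Note that the base with quality exactly 20 is not counted as low quality, because the table defines a low-quality base as one with quality less than V2.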

`WindowSize` — Size of sliding window to apply filtering criterion to sequence
`Inf` (default) | positive integer

Size of the sliding window to apply the filtering criterion to a sequence, specified as a positive integer. The size of the window corresponds to the number of bases that the function uses at one time to apply the criterion. If any window fails the criterion, the whole sequence is discarded.

The default is Inf, that is, the filtering criterion is applied to the whole sequence.
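The difference between whole-sequence and windowed filtering can be sketched for the 'MeanQuality' criterion. This is a hand-rolled illustration under stated assumptions: the read is hypothetical, Phred+33 encoding is assumed, and the window is assumed to advance one base at a time, which this page does not confirm.

```matlab
% Hypothetical read: good bases with a short low-quality stretch in the middle.
q = double('IIIIIIII####IIII') - 33;   % Phred+33 decoding assumed
w = 4;                                 % window size in bases
thr = 20;                              % minimum mean quality

% Mean of every full window of w consecutive bases (stride of 1 is an assumption).
winMeans = movmean(q,[0 w-1],'Endpoints','discard');

keepWhole  = mean(q) >= thr;           % behaves like WindowSize = Inf
keepWindow = all(winMeans >= thr);     % any failing window discards the read
```

Here the read passes when averaged as a whole but fails once the window isolates the low-quality stretch, which is the situation a finite WindowSize is meant to catch.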

`Encoding` — Base quality encoding format
`'Illumina18'` (default) | `'Sanger'` | `'Solexa'` | `'Illumina13'` | `'Illumina15'`

Base quality encoding format, specified as a character vector or string.
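The encoding matters because the same quality character decodes to different scores under different formats. The sketch below uses the ASCII offsets from general FASTQ conventions (33 for Sanger and Illumina 1.8, 64 for Solexa and Illumina 1.3/1.5); these offsets are not taken from this page, and the lookup table is a hypothetical helper rather than part of seqfilter. Solexa scores also use a different, odds-based scale, which this sketch does not convert.

```matlab
% ASCII offsets by encoding name (general FASTQ conventions, stated as an
% assumption; not read from seqfilter itself).
offsets = containers.Map( ...
    {'Sanger','Illumina18','Solexa','Illumina13','Illumina15'}, ...
    {33, 33, 64, 64, 64});

qualStr = 'hhhh';                                  % hypothetical quality string
q18 = double(qualStr) - offsets('Illumina18');     % decoded with offset 33
q13 = double(qualStr) - offsets('Illumina13');     % decoded with offset 64
```

A threshold chosen for one encoding can therefore keep or reject very different reads if the Encoding value does not match the file.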

`OutputDir` — Relative or absolute path to output file directory
character vector | string

Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.

Example: 'OutputDir','F:\results'

`OutputSuffix` — Suffix to use in output file name
`'_filtered'` (default) | character vector | string

Suffix to use in the output file name, specified as a character vector or string. It is inserted after the input file name and before the file extension. The default is '_filtered'.
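The naming rule described above can be reproduced with fileparts to predict where the output is likely to land. This is a sketch based only on the description on this page; the outFiles output of seqfilter remains the authoritative source for the actual names.

```matlab
inFile = 'SRR005164_1_50.fastq';
outDir = pwd;                          % default OutputDir is the current directory
suffix = '_filtered';                  % default OutputSuffix

% Insert the suffix between the base name and the extension.
[~,name,ext] = fileparts(inFile);
expectedOut = fullfile(outDir,[name suffix ext]);
```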

`PairedFiles` — Whether to consider input files as pairs for paired-end sequence data
`false` (default) | `true`

Whether to consider the input files as pairs for paired-end sequence data, specified as true or false.

If true, the input files are read as pairs, and the sequence data is maintained in sync between the files. That is, if a sequence is filtered out in the first file, the corresponding sequence in the paired file is also filtered out.

`WriteSingleton` — Whether to save singleton sequences in a separate output file
`false` (default) | `true`

Whether to save singleton sequences in a separate output file, specified as true or false. To set this to true, the 'PairedFiles' option must also be set to true.

A singleton is a sequence that passes the filtering criterion while its mate in the paired file does not. If true, singleton sequences are saved in a separate file with the suffix '_singleton'. The default is false, meaning that only sequences that pass the filtering criterion in both input files of a given pair are saved in the output files.
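The pairing logic can be pictured with two logical vectors, one per file, marking which reads pass the criterion. The pass/fail values below are hypothetical, and the sketch only illustrates the rule as described here; it does not show how seqfilter implements or counts it.

```matlab
% Hypothetical per-read results for a pair of files (same read order in both).
pass1 = logical([1 1 0 0 1]);   % reads in file 1 that pass the criterion
pass2 = logical([1 0 1 0 1]);   % their mates in file 2

keepPair   = pass1 & pass2;     % written to the paired output files
singleton1 = pass1 & ~pass2;    % passes in file 1 only
singleton2 = pass2 & ~pass1;    % passes in file 2 only
```

With 'WriteSingleton' set to false, the reads flagged in singleton1 and singleton2 would be dropped along with the reads that fail in both files.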

`UseParallel` — Option to perform computations in parallel
`"off"` (default) | `"auto"` | `"on"`

Option to perform computations in parallel using a parallel pool of workers, specified as one of these values:

"off" – Run in serial on the MATLAB® client.
"auto" – Use a parallel pool if one is open or if MATLAB can automatically create one. If a parallel pool is not available, run in serial on the MATLAB client.
"on" – Use a parallel pool if one is open or if MATLAB can automatically create one. If a parallel pool is not available, throw an error.

If you do not have a parallel pool open and automatic pool creation is enabled, MATLAB opens a pool using the default cluster profile. To use a parallel pool to run computations in MATLAB, you must have Parallel Computing Toolbox™.

Before R2026a: You can specify this argument as true or false only. The default value is false. To run computations in parallel, set this argument to true.

Note

There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.

`OverWrite` — Flag to overwrite existing files
`false` or 0 (default) | `true` or 1

Flag to overwrite existing files, specified as a numeric or logical 1 (true) or 0 (false).

When the value is false and a file matching one of the output file names already exists, the function generates an error.

Data Types: double | logical

Output Arguments

`outFiles` — Output file names
cell array of character vectors

Output file names, returned as a cell array of character vectors.

`nSeqIn` — Number of sequences selected from each input file
scalar | vector

Number of sequences selected from each input file, returned as a scalar or an n-by-1 vector where n is the number of input files. If there are multiple input files, the order within nSeqIn corresponds to the order of the input files.

`nSeqOut` — Number of sequences excluded from each input file
scalar | vector

Number of sequences excluded from each input file, returned as a scalar or an n-by-1 vector where n is the number of input files. If there are multiple input files, the order within nSeqOut corresponds to the order of the input files.
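Because nSeqIn and nSeqOut are returned per input file, they can be combined to summarize how aggressive a filter was. In the sketch below the first row reuses the counts from the example on this page (39 kept, 11 removed); the second row is hypothetical.

```matlab
nSeqIn  = [39; 45];                  % sequences kept per input file
nSeqOut = [11;  5];                  % sequences filtered out per input file

total   = nSeqIn + nSeqOut;          % sequences read from each file
pctKept = 100 * nSeqIn ./ total;     % percentage of reads that survived
```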

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

seqfilter has automatic parallel support.

To run computations in parallel, set the UseParallel argument to "on" or "auto".

Version History

Introduced in R2016b

R2026a: Enhanced control over parallel execution with `UseParallel` argument

The UseParallel name-value argument now accepts "off", "auto", or "on" instead of true or false. This change gives you more control over when to use a parallel pool for parallel execution.

Specifying the UseParallel argument as true or false is not recommended.

This table shows how to update your code depending on your goal.

Goal	Not recommended	Recommended
Write code that runs on the MATLAB client	`seqfilter(fastqFile,UseParallel=false)`	`seqfilter(fastqFile,UseParallel="off")` (default)
Write portable code that runs on a parallel pool and, if a pool is not available, runs on the MATLAB client	`seqfilter(fastqFile,UseParallel=true)`	`seqfilter(fastqFile,UseParallel="auto")`
Write code that runs on a parallel pool and errors if a pool is not available.	N/A	`seqfilter(fastqFile,UseParallel="on")`

There are no plans to remove support for true or false values.
