seqsplit

Split sequences into separate files based on barcodes

Description

seqsplit(fastqFile,barcodeFile) splits sequences in fastqFile according to the barcodes in barcodeFile and saves the sequences in separate files. By default, the output file name consists of the input file name followed by the output suffix ('_split') and the barcode identifier. Sequences that do not match any provided barcodes, or that match multiple barcodes ambiguously, are saved in a file with the suffix '_unmatched' instead of the barcode identifier.

seqsplit(___,Name,Value) uses additional options specified by one or more Name,Value pair arguments.

[outFiles,N] = seqsplit(___) returns the names of the output files in a cell array outFiles. N is a vector containing the number of sequences saved in each output file.

Examples

Create a tab-delimited file with barcode IDs and barcode sequences.

 barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
 writetable(cell2table(barcodeInfo), 'barcodeExample.txt', ...
        'Delimiter', '\t', 'WriteVariableNames', false);

Split sequences into separate output files based on the barcode sequences. By default, the function assumes that the barcode is located at the 5' end of each sequence, and no mismatches are allowed during barcode matching.

[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt');

Check the number of sequences in each output file after splitting.

N
N = 3×1

     2
     1
     1

Allow up to two mismatches during the barcode matching.

[outFiles, N] = seqsplit('SRR005164_1_50.fastq', 'barcodeExample.txt', ...
        'MaxMismatches',2,'OutputSuffix','_MM2_split');
N
N = 3×1

     5
     9
     5

Input Arguments

fastqFile

Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.

Example: 'SRR005164_1_50.fastq'

barcodeFile

Name of the barcode file, specified as a character vector or string. The file must be tab-delimited and contain barcode IDs and barcode sequences. Each ID must be followed by a barcode sequence, and all barcode sequences must have the same length.

Example: 'barcodeExample.txt'
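The barcode file requirements above can be checked before splitting. The following is a minimal sketch, not part of seqsplit itself: it writes the three-barcode file from the example on this page to the temporary folder, reads it back with readtable, and tests that every ID has a sequence and that all sequences share one length. The variable names (ids, seqs, isValid) are illustrative.

```matlab
% Write the same three-barcode file used in the example on this page.
barcodeInfo = {'ID1', 'AAAAC'; 'ID2', 'AGATT'; 'ID3', 'GACTT'};
barcodeFile = fullfile(tempdir, 'barcodeExample.txt');
writetable(cell2table(barcodeInfo), barcodeFile, ...
    'Delimiter', '\t', 'WriteVariableNames', false);

% Read it back as two text columns: barcode ID and barcode sequence.
barcodes = readtable(barcodeFile, 'FileType', 'text', ...
    'Delimiter', '\t', 'ReadVariableNames', false);
ids  = barcodes{:, 1};
seqs = barcodes{:, 2};

% Every ID needs a sequence, and all sequences must share one length.
barcodeLengths = cellfun(@length, seqs);
hasEmptyEntry  = any(cellfun(@isempty, ids)) || any(barcodeLengths == 0);
sameLength     = all(barcodeLengths == barcodeLengths(1));
isValid        = ~hasEmptyEntry && sameLength;
```

If isValid is false, inspect barcodeLengths to find the rows whose sequence length differs from the others.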

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'MaxMismatches',2 allows up to 2 mismatches during barcode matching.

MaxMismatches

Maximum number of mismatches allowed during barcode matching, specified as a nonnegative integer. The default is 0, that is, no mismatches are allowed.
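To make the mismatch limit concrete, the sketch below counts position-by-position mismatches (the Hamming distance) between the 5' end of a read and each barcode. This is one straightforward way to count mismatches and is not necessarily how seqsplit is implemented. The read is a made-up value, and the variable names are illustrative.

```matlab
% Hypothetical read and the barcodes from the example on this page.
read     = 'AGTTTCCGGA';
barcodes = {'AAAAC'; 'AGATT'; 'GACTT'};
maxMismatches = 1;                 % plays the role of 'MaxMismatches'

k      = length(barcodes{1});      % all barcodes share one length
prefix = read(1:k);                % 5' end of the read

% Count position-by-position mismatches against each barcode.
numMismatches = cellfun(@(b) sum(b ~= prefix), barcodes);

% A barcode is a candidate if it is within the allowed mismatch count.
candidates = find(numMismatches <= maxMismatches);
```

Here only the second barcode is within one mismatch of the read prefix. With maxMismatches set to 0, candidates is empty and the read counts as unmatched.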

Type of barcode to match, specified as 3 or 5. A value of 5 means that the barcode is located at the 5' end of each sequence, and 3 means that it is located at the 3' end. The default is 5.

Whether to remove the barcode and corresponding quality information from the matched sequences, specified as true or false. The default is true.
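The barcode location and barcode removal described above can be pictured with plain indexing. The sketch below is illustrative only and does not call seqsplit. The read, the quality string, and the variable names (side, keep) are hypothetical.

```matlab
% Hypothetical read, its quality string, and the barcode length.
seq  = 'AGATTCCGGATTACA';
qual = 'IIIIIHHHHHGGGGG';
k    = 5;
side = 5;                 % 5 for the 5' end, 3 for the 3' end

if side == 5
    barcodeRegion = seq(1:k);          % first k bases
    keep = (k+1):length(seq);
else
    barcodeRegion = seq(end-k+1:end);  % last k bases
    keep = 1:(length(seq)-k);
end

% Drop the barcode bases and the quality scores at the same positions.
trimmedSeq  = seq(keep);
trimmedQual = qual(keep);
```

The sequence and quality strings are trimmed with the same index vector, so each remaining base keeps its own quality score.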

Whether to save unmatched sequences and corresponding quality information in a separate output file, specified as true or false. The output file name has the suffix '_unmatched' instead of the barcode ID.

OutputDir

Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.

Example: 'OutputDir','F:\results'

OutputSuffix

Suffix to use in the output file name, specified as a character vector or string. The suffix is inserted after the input file name and before the barcode ID. The default is '_split'.
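The naming rule above (input file name, then suffix, then barcode ID) can be sketched as follows. The exact separator and extension handling are assumptions, not taken from this page: the sketch assumes an underscore between the suffix and the barcode ID and assumes the input file extension is kept. Compare the result with the outFiles value that seqsplit returns before relying on it.

```matlab
inputFile    = 'SRR005164_1_50.fastq';
outputSuffix = '_split';              % documented default suffix
barcodeIDs   = {'ID1'; 'ID2'; 'ID3'};
outputDir    = pwd;                   % documented default: current directory

[~, baseName, ext] = fileparts(inputFile);

% Assumption: an underscore separates the suffix from the barcode ID and
% the input file extension is kept.
outNames = cellfun(@(id) fullfile(outputDir, ...
    [baseName, outputSuffix, '_', id, ext]), barcodeIDs, ...
    'UniformOutput', false);
```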

UseParallel

Whether to perform computation in parallel, specified as true or false.

For parallel computing, you must have Parallel Computing Toolbox™. If a parallel pool does not exist, one is created automatically when the auto-creation option is enabled in your parallel preferences. Otherwise, computation runs in serial mode.

Note

There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.

Example: 'UseParallel',true

Output Arguments

outFiles

Output file names, returned as a cell array of character vectors. By default, the name of each output file consists of the input file name followed by the output suffix ('_split') and the barcode identifier.

N

Number of sequences saved in each output file, returned as a scalar or an n-by-1 vector, where n is the number of output files. If there are multiple output files, the order of the elements in N corresponds to the order of the files in outFiles.

Version History

Introduced in R2016b