I have raw files from next-generation sequencing of 16S rRNA in .fastq format and I want to analyse them to obtain the OTUs and the taxonomic relative abundance of all the microbial species present in the sample.
A complete answer to this question is outside the scope of a single MATLAB Answers post. I suggest reading some of the published papers on approaches to reconstructing phylogeny from 16S rRNA. Here is one such paper, though there are many others: https://academic.oup.com/nar/article/36/18/e120/1070009.
In general, you will need to perform the following series of steps:
1. Obtain reference sequences of the 16S gene (likely in FASTA format) for each of the microbial species you wish to test for. These can likely be obtained from public databases such as NCBI: https://www.ncbi.nlm.nih.gov/gene/?term=16s%20rrna. For particular sequences of interest, you can retrieve them in MATLAB using getgenbank.
2. Assign each of your input reads to its closest species match. There are several ways to do this; one is to use blastlocal with the FASTA reference sequences from step 1 as the database and your FASTQ reads as the queries. The relative abundance of each species can then be inferred from the number of reads that match each reference sequence.
3. To construct a taxonomy, perform a multiple alignment of the 16S gene for each of your observed species (likely a subset of your references from step 1), and construct a phylogenetic tree from the pairwise distances between the sequences. In MATLAB, this can be done with multialign, seqpdist, and seqlinkage. The definition of an OTU is not set in stone, but in general it is a group of very similar sequences. From the phytree created with seqlinkage, you can construct OTUs by passing a threshold to cluster, for example cluster(tree, threshold).
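As a rough illustration of steps 2 and 3, here is a minimal sketch that assumes the Bioinformatics Toolbox is installed. The reference sequences, the species names, the per-read best-hit assignments (standing in for the blastlocal results), and the 0.03 distance threshold are all made-up placeholders, so treat this as a starting point rather than a finished pipeline.

```matlab
% Sketch only: requires Bioinformatics Toolbox. All data below are
% hypothetical placeholders, not real 16S sequences or BLAST results.

% Hypothetical reference fragments for the species observed in step 2.
refs = struct( ...
    'Header',   {'speciesA', 'speciesB', 'speciesC', 'speciesD'}, ...
    'Sequence', {'ACGTACGTTAGCCGATAGCTAGGCTA', ...
                 'ACGTACGTTAGCCGATAGCTAGGCTT', ...
                 'ACGAACGTTTGCCGTTAGCAAGGCTA', ...
                 'ACGAACGTTTGCCGTTAGCAAGCCTA'});

% Hypothetical best-hit reference for each read (one entry per read),
% standing in for the output of step 2.
bestHit = {'speciesA', 'speciesA', 'speciesB', 'speciesC', ...
           'speciesA', 'speciesD', 'speciesC', 'speciesB'};

% Relative abundance per reference = fraction of reads assigned to it.
[~, refIdx]  = ismember(bestHit, {refs.Header});
counts       = accumarray(refIdx(:), 1, [numel(refs) 1]);
relAbundance = counts / sum(counts);

% Step 3: multiple alignment, pairwise distances, and a UPGMA tree.
aligned = multialign(refs);
dist    = seqpdist(aligned, 'Method', 'Jukes-Cantor', 'Alphabet', 'NT');
tree    = seqlinkage(dist, 'average', aligned);

% Group leaves into OTUs. The threshold is an assumed value; pick one
% that suits your own definition of an OTU.
otuThreshold = 0.03;
otuIdx = cluster(tree, otuThreshold);

% Sum the per-species abundances within each OTU, matching leaves to
% references by name so the result does not depend on leaf order.
leafNames    = get(tree, 'LeafNames');
[~, order]   = ismember(leafNames, {refs.Header});
otuAbundance = accumarray(otuIdx(:), relAbundance(order));
```

With real data, refs would come from your FASTA reference file and bestHit from parsing the BLAST report. The threshold passed to cluster is applied to distances in the tree, so a smaller value gives more, tighter OTUs.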
Feel free to ask more specific questions about any of these steps in a follow-up question. If you need broader help constructing a pipeline for this analysis, we do offer consulting.