Parsing a file with multiple entries consisting of strings. Each entry contains a header followed by a descriptor. The objective is to use the headers to obtain a subset of entries.

The following is an example of a file to be parsed.
Each entry contains a header line beginning with ">" followed by one or more lines of letters (the amino acid sequence). Please note that the sequences shown in the example are truncated. The objective is to use a list of header IDs, such as "FBtr0077276" and "FBtr0080587", to extract both the header and the corresponding amino acid sequence, in the same format as the input file. The format is widely used in bioinformatics and is known as FASTA.
Thank you for your comments/help
Example of input file. The headers are the lines beginning with ">".
>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1
MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL
RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA
>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2
MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI
>FBtr0079925 | Cyp4e3 Cytochrome P450 4e3
MWLAVLALLVLPLITLVYFERKASQRRQLLKEFNGPTPVPILGNANRIGKNPAEILSTFF
>FBtr0080587 | Cyp28a5 Cytochrome P450 28a5
MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD
>FBtr0081077 | Cyp310a1 Cytochrome P450 310a1
MWLLLPILLYSAVFLSVRHIYSHWRRRGFPSEKAGITWSFLQKAYRREFRHVEAICEAYQ
SGKDRLLGIYCFFRPVLLVRNVELAQTILQQSNGHFSELKWDYISGYRRFNLLEKLAPMF

Answers (1)

Star Strider on 24 Sep 2024
The Bioinformatics Toolbox has a number of functions for these files. The fastaread function appears to be appropriate. (I don’t have that Toolbox, I’m simply aware of some of its functions.)
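As a minimal sketch of that suggestion, assuming the Bioinformatics Toolbox is installed: fastaread can return a struct array with Header and Sequence fields, and fastawrite can write such a struct back to a FASTA file. The temporary file names and the variable names (wanted, ids, subset) are illustrative and not from the original post, and the ID is assumed to be the first whitespace-delimited token of each header.

```matlab
% Sketch only: requires the Bioinformatics Toolbox (fastaread, fastawrite).
% File names and the "wanted" list are placeholders for illustration.

% Write a small FASTA file so the example is self-contained.
inFile = [tempname '.fasta'];
fid = fopen(inFile, 'w');
fprintf(fid, '%s\n', ...
    '>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1', ...
    'MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL', ...
    'RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA', ...
    '>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2', ...
    'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI', ...
    '>FBtr0080587 | Cyp28a5 Cytochrome P450 28a5', ...
    'MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD');
fclose(fid);

wanted = {'FBtr0077276', 'FBtr0080587'};   % IDs of the entries to keep

entries = fastaread(inFile);               % struct array with Header and Sequence fields
ids = strtok({entries.Header});            % first whitespace-delimited token of each header
keep = ismember(ids, wanted);              % true for entries whose ID is in the list
subset = entries(keep);

outFile = [tempname '.fasta'];
fastawrite(outFile, subset);               % write the selected entries as FASTA
```

Note that fastaread stores each sequence as a single character vector, so the line breaks of the input file are not kept; fastawrite applies its own line wrapping when writing.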
  2 Comments
George on 24 Sep 2024
Moved: Star Strider on 24 Sep 2024
Thank you for the prompt reply.
I've used the fastaread function.
A truncated output is shown below.
head(header)
{'FBtr0077276 | Cyp6v1 Cytochrome P450 6v1' }
{'FBtr0079061 | Cyp28d2 Cytochrome P450 28d2' }
{'FBtr0079925 | Cyp4e3 Cytochrome P450 4e3' }
{'FBtr0080587 | Cyp28a5 Cytochrome P450 28a5' }
head(sequence)
head(a)
{'MVYSTNILLAIVTILTGVFIWSR.....................'}
{'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}
I could generate a cell array from this and parse it, but unfortunately the sequence format is lost.
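If the format that is lost is the original line wrapping of each sequence (an assumption about what is meant here), one option is to skip fastaread and copy the matching blocks of lines unchanged. The following is a sketch in base MATLAB (readlines needs R2020b or later); the file names and the variable names are hypothetical, and the ID is assumed to be the text between ">" and the first space of each header.

```matlab
% Sketch only: base MATLAB, no toolbox. Copies selected FASTA entries
% line by line, so the original line breaks are preserved.

% Hypothetical input file, written here so the example is self-contained.
inFile = [tempname '.fasta'];
fid = fopen(inFile, 'w');
fprintf(fid, '%s\n', ...
    '>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1', ...
    'MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL', ...
    'RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA', ...
    '>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2', ...
    'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI', ...
    '>FBtr0080587 | Cyp28a5 Cytochrome P450 28a5', ...
    'MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD');
fclose(fid);

wanted = ["FBtr0077276", "FBtr0080587"];   % IDs of the entries to keep

lines = readlines(inFile);                 % string column, one element per line
lines(lines == "") = [];                   % drop empty lines (e.g. after the last newline)
isHeader = startsWith(lines, ">");         % header lines begin with ">"
entryIdx = cumsum(isHeader);               % entry number that each line belongs to

% ID of each entry: text after ">" up to the first space.
% A space is appended so that headers without any space still yield an ID.
ids = extractBefore(extractAfter(lines(isHeader), 1) + " ", " ");

keepEntry = ismember(ids, wanted);         % one flag per entry
keepLine = false(size(lines));
inEntry = entryIdx > 0;                    % ignore any lines before the first header
keepLine(inEntry) = keepEntry(entryIdx(inEntry));
selected = lines(keepLine);                % headers and sequence lines, unchanged

outFile = [tempname '.fasta'];
fid = fopen(outFile, 'w');
fprintf(fid, '%s\n', join(selected, newline));
fclose(fid);
```

Because whole lines are copied, the output keeps the header text and the sequence line breaks exactly as they appear in the input.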
Star Strider on 24 Sep 2024
I am not certain what you intend by ‘the sequence format is lost’. I also don’t have your file, so I can’t run fastaread with it to test this.
Perhaps you could create a table with:
CYP = cell2table(sequence, 'RowNames',header)
That might work.
Experimenting with something like that:
% Manually reconstructed stand-ins for the fastaread outputs
header = {'FBtr0077276 | Cyp6v1 Cytochrome P450 6v1';
          'FBtr0079061 | Cyp28d2 Cytochrome P450 28d2'};
sequence = {{'MVYSTNILLAIVTILTGVFIWSR.....................'};
            {'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}};
% One table row per entry, with the header text as the row name
CYP_Table = cell2table(sequence, 'RowNames',header)
CYP_Table = 2x1 table
                                                                          sequence
                                                  __________________________________________________________
    FBtr0077276 | Cyp6v1 Cytochrome P450 6v1      {'MVYSTNILLAIVTILTGVFIWSR.....................'             }
    FBtr0079061 | Cyp28d2 Cytochrome P450 28d2    {'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}
This required a bit of manual editing because I don’t have the actual function outputs (and I don’t have actual experience with the function). It might be possible to avoid the manual edits, perhaps using cellfun, and maybe compose as well. (I can’t tell from here.)
It should be relatively straightforward to get the information from ‘CYP_Table’ after that, although I don’t know what you want to do with the data after reading it and creating the table (if that’s what you actually want to do).
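A sketch of one way to pull a subset out of such a table, assuming the ID is the first whitespace-delimited token of each row name. The wanted list and the ids and subset variables are illustrative, and the sequences are the truncated placeholders from the example above, not real data.

```matlab
% Sketch only: select table rows by the ID at the start of each row name.
header = {'FBtr0077276 | Cyp6v1 Cytochrome P450 6v1';
          'FBtr0079061 | Cyp28d2 Cytochrome P450 28d2'};
sequence = {'MVYSTNILLAIVTILTGVFIWSR.....................';
            'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '};
CYP_Table = cell2table(sequence, 'VariableNames', {'sequence'}, 'RowNames', header);

wanted = {'FBtr0077276'};                      % IDs of the rows to keep (hypothetical list)

ids = strtok(CYP_Table.Properties.RowNames);   % first token of each row name
subset = CYP_Table(ismember(ids, wanted), :);  % rows whose ID is in the list

% Print the selected entries in FASTA-like form: ">" + header, then the sequence.
for k = 1:height(subset)
    fprintf('>%s\n%s\n', subset.Properties.RowNames{k}, subset.sequence{k});
end
```

Matching on the extracted ID rather than on the full row name means the list only needs the short identifiers, not the complete header text.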