MATLAB Answers

Extracting length information of pattern from specific string (not fixed string)

Jimmy cho on 20 Aug 2020
Commented: Jimmy cho on 23 Aug 2020
Hi guys!
I want to write a MATLAB function that takes (string, substring) as input and returns all the data that follows each occurrence of my substring. The length of that data isn't known in advance, so I need to extract it from the string itself.
Assumptions:
The length of the data that follows an occurrence of "0101" isn't known in advance. I must read it from the 8 bits that immediately follow the substring: the data length is always encoded in binary in exactly those 8 bits. The length should be the same at every occurrence, so every row of the output matrix has the same number of columns, but I still have to read the length field at each occurrence.
for example:
string="0101000100001111111111100000001000010100010000111111111110000011000" , substring is always constant and it's "0101".
00010000-> 16 in decimal.
So here the output is the 16 bits of data that follow the length field "00010000", which are: 1111111111100000. How do I know the length of the data? It's given in the string itself: the 8 bits immediately after the substring "0101" always hold the length of the data that comes after those 8 bits. Here those 8 bits are 00010000, which is 16 in decimal, so I take the 16 bits that follow the length field. As a result the output here is 1111111111100000.
the output is:
output=[1111111111100000 ; 1111111111100000], where each row holds the data following one occurrence: the first row for the first occurrence, the second row for the second, and so on.
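The decoding of the first occurrence in this example can be sketched as follows. This is only an illustration of the arithmetic described above, assuming the input is a character vector; the variable names `p`, `n` and `data` are made up for the sketch.

```matlab
% Example string from the question, stored as a character vector.
str = '0101000100001111111111100000001000010100010000111111111110000011000';

pos  = strfind(str, '0101');      % start indices of every "0101"
p    = pos(1);                    % first occurrence of the marker
n    = bin2dec(str(p+4 : p+11));  % 8-bit length field '00010000' gives 16
data = str(p+12 : p+11+n);        % the n characters after the length field
```

Note that `strfind` also reports any "0101" that happens to sit inside a length or data field, so later entries of `pos` need to be filtered before they are used.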
Another example:
String="01010000111111111111111000001000100101000011111111111111100010111111" , substring is always constant and it's "0101".
00001111 -> 15 in decimal for the first occurrence of "0101".
So here the output is the 15 bits of data that follow the length field "00001111", which are: 111111111110000. As before, the 8 bits immediately after "0101" hold the length: they are 00001111, which is 15 in decimal, so I take the 15 bits that follow the length field. As a result the output for the first occurrence is 111111111110000.
00001111 -> 15 in decimal for the second occurrence of "0101", and the 15 bits that follow its length field are
111111111110001.
So the output matrix has two rows because there are two occurrences of "0101". The number of rows equals the number of occurrences of my substring, and each row holds the data that immediately follows the length field of that occurrence, with the length read from that 8-bit field.
the output is:
output=[111111111110000; 111111111110001], where each row holds the data following one occurrence: the first row for the first occurrence, the second row for the second, and so on.
The length field (the 8 bits immediately after each occurrence of "0101") should have the same value at every occurrence, but I need to check it, so assume I must read the length at each occurrence even though it should always be the same.
Note: there can be more than one occurrence of "0101" in my string. I need to return the data for each occurrence as one row of a matrix, as explained above. Occurrences never overlap, so assume there is always enough data between one occurrence and the next.
The occurrences can be anywhere in the string, not only at the beginning!
So the input could be, for example, string=[11111111101010000111111111111111000001000100101000011111111111111100010111111]
the function that I tried to implement in matlab is: (I get wrong outputs unfortunately):
function TruncateSubstringResultCheck = TruncateSyncWordResultCheck(input1,substring) % input1 is my string; substring is always "0101" in my case
positions = strfind(input1, substring);
% N (the data length) is never defined in this function
TruncatedSubstring = cell2mat(arrayfun(@(idx) input1(idx+length(substring):idx+length(substring)+N-1), positions, 'uniform', 0 ).');
% NumberOfRows is never defined either
for i=1:NumberOfRows
    substring = TruncatedSubstring(i,:);
    TruncateSubstringResultCheck(i,:)=substring;
end
Could anyone help me to fix that and get the required output ? thanks for any assistance !

  4 Comments

the cyclist on 20 Aug 2020
I see now that you mentioned that there could be multiple occurrences of 0101. I can adapt my solution for that case, pretty easily. I'll do that later today, if no one else has done so.
Jimmy cho on 20 Aug 2020
Hi, "0101" can be anywhere in my input string, and it can occur more than once (occurrences of "0101" never overlap). If there are, for instance, three occurrences of "0101" in my string, my output is a matrix with 3 rows: the number of rows equals the number of occurrences, and each row corresponds to one occurrence in order (first occurrence in the first row, second occurrence in the second row, etc.).
thanks !
Jimmy cho on 21 Aug 2020
Because things got messy in this thread, I opened a new thread with more clarification and detail:
It would be appreciated if you can help! Thanks a lot.
Hope it's now clearer and more understandable.


Accepted Answer

Stephen Cobeldick on 21 Aug 2020
Edited: Stephen Cobeldick on 21 Aug 2020
One simple dynamic regular expression can do this quite efficiently:
>> fun = @(s)sprintf('[01]{%d}',bin2dec(s));
>> rgx = '0101([01]{8})((??@fun($1)))';
>> str = '010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110';
>> tkn = regexp(str,rgx,'tokens');
>> tkn = vertcat(tkn{:});
>> out = tkn(:,2);
>> out{:}
ans =
111111111110000
ans =
0111111111111111000001111100101000010001
Note that this returns an output following the rules that you described, and so does not match the (incorrect) examples.
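To see what the dynamic part `(??@fun($1))` contributes, it may help to evaluate the helper on its own. The sketch below only calls the same anonymous function with the length field of the first match; the variable name `frag` is made up for illustration.

```matlab
% Helper from the answer above: turns the 8 length bits into a pattern fragment.
fun = @(s) sprintf('[01]{%d}', bin2dec(s));

% Length field captured by the first token of the first match.
frag = fun('00001111');
disp(frag)   % prints [01]{15}
```

So once the first token has captured the length bits, the rest of the pattern becomes "exactly that many binary digits", which is what the second token then captures.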

  9 Comments

the cyclist on 22 Aug 2020
@Stephen, you misinterpreted my earlier comment. I meant that I had not pursued this type of solution because I did not know how the regular expression would work in this case. (But now I do, thanks to you!) I didn't mean that I thought it would not work.
Really glad to see this works as intended. It's certainly the more elegant algorithm.
Stephen Cobeldick on 22 Aug 2020
"output=[00000;00000]"
If the input is a character vector and the data subvectors can have different lengths then it is not possible to concatenate them into one character matrix. You could pad them to have the same length and then concatenate them together. Or convert to string, in which case you will get a vector of strings (where each element is a scalar string with a different number of characters).
Converting to numeric is possible, but note that apart from some coincidental visual similarity, the decimal number 101 is totally unrelated to the binary number 101.
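A minimal sketch of those options, using two made-up data fields of different lengths (the values and the variable names `c`, `m`, `s` and `d` are only for illustration):

```matlab
% Two data fields of different lengths (made-up values).
c = {'00000'; '000'};

m = char(c);         % 2x5 char matrix, the shorter row is padded with trailing spaces
s = string(c);       % 2x1 string array, no padding needed

% The characters '101' read as binary are the number 5, not 101.
d = bin2dec('101');
```

Which form is preferable probably depends on what happens to the data afterwards: the padded matrix keeps a rectangular shape, while the string array keeps each field at its true length.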
"I will explain what my issue, my input(str) isn't string it's a binary array integers.... str=[00010100000101000001010000010100000]"
Your example cannot be stored as one integer by any standard integer class supported by MATLAB. Perhaps you actually meant that each of those digits is a separate element of an integer array, e.g.:
vec = [0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0];
in which case you can trivially convert those integers to character:
str = sprintf('%d',vec);


More Answers (3)

the cyclist on 20 Aug 2020
Edited: the cyclist on 20 Aug 2020
If 0101 is always at the beginning of the string, then
% Example input
str ="0101000011111111111111100000101010";
% The 8 digits after 0101 define the length.
% In other words, the 5th to 12th digits.
L = bin2dec(extractBetween(str,5,12));
% The L digits after 0101 and the next 8, are the output string.
% In other words, start from the 13th digit, and get L digits.
output = extractBetween(str,13,12+L);
or if you actually have a character array :
% Example input
str ='0101000011111111111111100000101010';
% The 8 digits after 0101 define the length.
% In other words, the 5th to 12th digits.
L = bin2dec(str(5:12));
% The L digits after 0101 and the next 8, are the output string.
% In other words, start from the 13th digit, and get L digits.
output = str(13:(12+L));

  1 Comment

Jimmy cho on 20 Aug 2020
Hi, "0101" can be anywhere in my input string, and it can occur more than once (occurrences of "0101" never overlap). If there are, for instance, three occurrences of "0101" in my string, my output is a matrix with 3 rows: the number of rows equals the number of occurrences, and each row corresponds to one occurrence in order (first occurrence in the first row, second occurrence in the second row, etc.).
Thanks!
It doesn't give me a matrix if there's more than one occurrence of my substring in my string, as I explained above.

the cyclist on 20 Aug 2020
% Sample input
str ="0101000000011010100000001101010000000110101000000011010100000001101010000000110101000000111110101";
% Initialize with first index of 0101, and string length
idx0101 = regexp(str,"0101","once");
strL = strlength(str);
% Initialize string array for output
segments = strings(0);
% Loop over string, while it is long enough to hold 0101 and the length
% identifier segment
while strL >= idx0101 + 11
    % Find the segment length
    segmentL = bin2dec(extractBetween(str,idx0101+4,idx0101+11));
    % If the string is long enough to contain a segment of that length
    % after this occurrence of 0101, extract it
    if strL >= idx0101+11+segmentL
        % Pull segment of the correct length
        thisSegment = extractBetween(str,idx0101+12,idx0101+11+segmentL);
        % Append the segment to the array
        segments = [segments; thisSegment];
        % Remove the segment and its identifiers
        str = extractAfter(str,idx0101+11+segmentL);
        % Find the length of the shortened string, and first location of
        % "0101", so that we can start over
        strL = strlength(str);
        idx0101 = regexp(str,"0101","once");
    else
        break % Break out of the loop if the string is not long enough to have a new segment
    end
end

  11 Comments

the cyclist on 21 Aug 2020
My solution here gives the output that you specified for the input/output combinations you specified in the other location, if you do
str2double(segments)
as I suggested.
Jimmy cho on 22 Aug 2020
Hi @the cyclist !
I edited my question again and hope it's now clearer and better explained!

per isakson on 22 Aug 2020
This is an answer to the follow_up question, which was closed when I tried to submit.
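%%
chr = '01010000111111111111111000001000100101000011111111111111100010111111';
sbs = '0101';
%%
pos  = strfind( chr, sbs );
out  = cell( numel(pos), 1 );
keep = false( numel(pos), 1 );   % marks occurrences with a complete record
%%
for pp = 1 : numel(pos)
    ix1 = pos(pp) + 4;           % first bit of the length field
    ix2 = ix1 + 8 - 1;           % last bit of the length field
    if ix2 <= numel(chr)
        % read the length before it is used in the bounds check
        len = bin2dec( chr(ix1:ix2) );
        if ix2+len <= numel(chr)
            out{pp}  = chr(ix2+1:ix2+len);
            keep(pp) = true;
        end
    end
end
%%
output = string( out(keep) );    % drop incomplete occurrences after the loop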
This script prints
output =
2×1 string array
"111111111110000"
"111111111110001"
And the script outputs the same result for
chr = 'xxxxxxxxxxxx01010000111111111111111000001000100101000011111111111111100010111111';
and for
chr = '11111111101010000111111111111111000001000100101000011111111111111100010111111';
There is at least one problem with the script and that is handling of the case where the distance between substring is less than 12+1 positions.

  3 Comments

Jimmy cho on 22 Aug 2020
Hi !
I edited my question again and hope it's now clearer and better explained!
thanks for any assistance.
Jimmy cho on 22 Aug 2020
Thanks for instructing me to re-edit my question here, hope it's now more understandable and better explained.
Jimmy cho on 22 Aug 2020
It doesn't work for me because the input is actually an array of binary integers, and the substring is also an array of binary integers: [0 1 0 1]. Sorry for not mentioning that in my thread!
So your solution doesn't work for me, maybe because the inputs are arrays of binary integers:
%%
chr = [01010000111111111111111000001000100101000011111111111111100010111111];
sbs = [0101];
%%
pos = strfind( chr, sbs );
out = cell( numel(pos), 1 );
%%
for pp = 1 : numel(pos)
ix1 = pos(pp) + 4;
ix2 = ix1 + 8 - 1;
if ix2+len <= numel(chr)
len = bin2dec( chr(ix1:ix2) );
out{pp,1} = chr(ix2+1:ix2+len);
else
out(pp) = [];
end
end
%%
output = string( out );
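One thing worth noting about the snippet above is that MATLAB reads `[0101]` as the single number 101, not as four digits. A possible way around this, sketched here under the assumption that the input is a row vector with one 0/1 digit per element, is to convert the digits to characters first and convert the results back afterwards. The short test vector and the names `vec`, `key`, `hdrEnd` and `nextFree` are made up for illustration, and skipping any "0101" that lies inside an earlier record is a design choice of this sketch, not something taken from the answers.

```matlab
% Assumption: one 0/1 digit per element (made-up test vector).
vec = [0 1 0 1  0 0 0 0 0 0 1 1  1 0 1  0 0];   % marker, length field = 3, data = 1 0 1
sbs = [0 1 0 1];

chr = char(vec + '0');      % numeric digits to '0'/'1' characters
key = char(sbs + '0');

pos = strfind(chr, key);    % candidate marker positions
out = {};
nextFree = 1;               % first index not used by an earlier record
for p = pos
    hdrEnd = p + numel(key) + 8 - 1;          % last bit of the length field
    if p < nextFree || hdrEnd > numel(chr)
        continue            % inside an earlier record, or length field cut off
    end
    len = bin2dec(chr(p+numel(key) : hdrEnd));
    if hdrEnd + len > numel(chr)
        continue            % data field cut off
    end
    out{end+1,1} = chr(hdrEnd+1 : hdrEnd+len) - '0';   % back to numeric digits
    nextFree = hdrEnd + len + 1;
end
```

For this test vector `out` holds a single entry, `[1 0 1]`. If all data fields are known to have the same length, `vertcat(out{:})` should give the numeric matrix with one row per occurrence.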
