What exactly is going on with this code? I'm having difficulty understanding it

21 views (last 30 days)
Hello,
I'm struggling to understand the code below. I'll give my best interpretation of what's going on, and I hope that I'm right.
for FilesToRead = 1 : length(FileNames)
    y = load(fullfile(Directory, FileNames{FilesToRead}));
    y = {y};
    n = 400000;
    for i = 1:length(y)
        x{i} = (1:length(y{i}))';
    end
    for ii = 1:length(y)
        for k = 1:length(y{ii})/n
            coeffs{ii}(k,:) = polyfit(x{ii}(n*(k-1)+1 : k*n), y{ii}(n*(k-1)+1 : k*n), 1);
            yfit{ii}(k,:) = coeffs{ii}(k,2) + coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
        end
    end
    for ii = 1:length(y)
        dd{ii} = subsref(yfit{ii}.', substruct('()', {':'})).';
        dd2{ii} = y{ii}(1:length(dd{ii})) - dd{ii}';
    end
    for i = 1:length(dd2)
        t{i} = zeros(1, length(dd2{i}));
    end
    t{1} = 1:length(dd2{1});
    for i = 2:length(dd2)
        t{i} = t{i-1}(end) + [1:length(dd2{i})];
    end
    clear coeffs dd i ii k prompt ws yfit
    yRaw = y; xRaw = x;
    x = t; y = dd2;
    clear dd2 t
    % Window for variance - 5 seconds = 100,000
    n = 100000;
    for i = 1:length(y)
        for k = 1:length(y{i})/n
            a(k) = var(y{i}(n*(k-1)+1 : k*n));
        end
        v{i} = a;
        a = [];
    end
    t{1} = 1:length(v{1});
    for i = 2:length(y)
        t{i} = t{i-1}(end) + [1:length(v{i})];
    end
    for i = 1:length(t)
        tm{i} = t{i}.*((n/20000)/60); % time in minutes
    end
end
So essentially, for every 400,000 data points, a 1st-degree polynomial is fitted to those 400,000 points. Then the data is zeroed, or flattened: the slope of the polynomial is removed, so each data point has the value of the fitted line at its location subtracted from it.
This is done to the entire data set. Those new values are then stored, and variance is computed over every 100,000 data points.
Is this the general gist?
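For a sanity check of that reading, here is a small toy sketch (an assumed block length of 10 instead of 400,000, and made-up data) of the same fit-a-line-per-block-and-subtract-it step:
% Toy check of the interpretation: fit a line to each block, then subtract it
rng(0);
y = (1:30)'*0.01 + randn(30,1)*0.05;    % made-up drifting signal, 30 samples
n = 10;                                  % block length (stand-in for 400,000)
flattened = zeros(size(y));
for k = 1:numel(y)/n
    idx = (n*(k-1)+1 : k*n)';            % indices of this block
    c   = polyfit(idx, y(idx), 1);       % 1st-degree polynomial over the block
    flattened(idx) = y(idx) - polyval(c, idx);   % remove the fitted line
end
disp([y(1:5) flattened(1:5)])            % trend gone, noise left behind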
9 Comments
EL
EL on 23 Sep 2019
Edited: EL on 23 Sep 2019
I have three 400,000-point files ready if you'd like to test those.
Here's the new code I'm working with to test it out:
clc
clear
%% This section is to load the correct folder to gather data from
% ***Below is the same code for Linux!
%rootfolder = '/Data/';
% ***Don't forget to switch the slashes!
rootfolder = 'Z:Data\'; %%MAJOR ISSUE AREA. THE DIRECTORY HERE ALWAYS CHANGES! WATCH OUT!!!!
% Chopped = '/Chopped';
Chopped = '\Chopped\';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory = strcat(rootfolder,DataDate,Chopped);
Directory2=strcat(rootfolder,DataDate,'\');
%% This is to select the save file name
prompt = 'Enter the name for the processed files according to mmddyyyy_bug_media_oC ';
Filenamesave = input(prompt, 's');
%% This is to select the signal type. This should have already been done with datachop =]
prompt = 'Enter the Signal Type DEF, LFM or SUM ';
signal = input(prompt, 's');
%% Below are some code alternatives that were previously tried. Maybe I should delete these...
% FileNames=dir(Directory);
%FileNames=dir(fullfile(Directory, '*.txt'))
% FileNames=dir(Directory);
% FileNames(1:2) = [];
% FileNames = {FileNames.name};
%% This selects all .txt files in the folder. If the linux script DataChop is used, then you're good!
FileNames=dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, '.txt'));
disp(' ');
disp(' ');
disp('File structure generated');
disp(' ');
disp(' ');
%% This is the run inputs
% signal = 'DEF';
% epflplot = 'N';
% rawplot = 'N';
% fftplot = 'N';
% ft = 20000;
% bt = 30;
%% This is the loop to process the data. First, a 1st-degree polynomial
% is fitted to the data, and the data points are zeroed according to
% that slope. Once zeroed, variance is gathered.
for FilesToRead = 1 : length(FileNames)
y = load(fullfile(Directory, FileNames{FilesToRead}));
y = {y};
n = 400000;
for i=1:length(y)
x{i}=(1:length(y{i}))';
end
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
clear coeffs dd i ii k prompt ws yfit
yRaw=y; xRaw=x;
x=t; y=dd2;
clear dd2 t
% Window for variance - 5 seconds = 100,000
n = 100000;
for i=1:length(y)
for k=1:length(y{i})/n
a(k)=var(y{i}(n*(k-1)+1:k*n));
end
v{i}=a;
a=[];
end
t{1}=1:length(v{1});
for i=2:length(y)
t{i}=t{i-1}(end)+[1:length(v{i})];
end
for i=1:length(t)
tm{i} = t{i}.*((n/20000)/60); % time in minutes
end
%% Saving Data
numstr=num2str(FilesToRead);
PathSave=strcat(Directory, Filenamesave,'_',signal,'_', numstr, '.mat');
save(strcat(PathSave), 't', 'tm', 'v','xRaw', 'x', 'yRaw','y','n','FileNames');
%% This is a counter, giving the display an output of the number of files processed.
denom=num2str(length(FileNames));
numer=num2str(FilesToRead);
fraction=strcat(numer,'/',denom);
counter=strcat(fraction, ' Files Processed');
disp(counter);
%% Writing variance data to ASCII File
% dlmwrite(filename, v, delimiter, row, col)
% Data is appended each time. Data is added by row.
%
VarianceASCIIFilename=strcat(Directory2, 'VarianceData', '_', DataDate, '_', signal, '.txt');
dlmwrite(VarianceASCIIFilename, v{1,1}, 'delimiter', '\t', '-append');
%% Appending Data to .mat file
% VarianceMATFilename=strcat(Directory2, 'VarianceData', '_', DataDate, '_', signal, '_');
% save(VarianceMATFilename, 'v', '-append');
%
%% Hopefully the clear/clc doesn't mess anything up!!
% clear v t tm xRaw x yRaw y n
end
% disp(' ');
% disp(' ');
% disp('All Files Processed');
% disp(' ');
% disp(' ');
% disp('Graph Building Starting');
%% Making Graphs
%Starting with variance. Using data from ascii file. Each row is a new
%'interval' of data. In our case, 30 minutes if 'DataChop' functions
%correctly
%% Box and Whisker Plots, 30 min bin time
%Load individual variables with the following format
% load('FilePath/Name', 'variable');
% For plotting the data, do....
% boxplot('variable'{1,1});
% For loading your .txt file; below is an example 09232019
%
%% Various Variables from previous script
% Variables that were not saved with this update. Might be needed when
% making graphs
% -v7.3
% Names
%PathNames
%ft
This makes the same outputs.
If you set the directory to the three files linked, you'll process them in about 30 seconds.
dpb
dpb on 24 Sep 2019
...
%rootfolder = '/Data/';
% ***Don't forget to switch the slashes!
rootfolder = 'Z:Data\'; %%MAJOR ISSUE AREA. THE DIRECTORY HERE ALWAYS CHANGES! WATCH OUT!!!!
% Chopped = '/Chopped';
Chopped = '\Chopped\';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory = strcat(rootfolder,DataDate,Chopped);
Directory2=strcat(rootfolder,DataDate,'\');
...
We talked about this issue previously I think, too...???
Use fullfile() to build qualified filenames without specifying the system-specific delimiter character; it will automagically recognize which system you are running on and do the right thing for you.
%% This selects all .txt files in the folder.
FileNames=dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, '.txt'));
...
for FilesToRead = 1 : length(FileNames)
y = load(fullfile(Directory, FileNames{FilesToRead}));
...
is just
rootfolder = 'Z:Data';
Chopped = 'Chopped';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory=fullfile(rootfolder,DataDate,Chopped);
FileNames=dir(fullfile(Directory,'*.txt'));
...
for FilesToRead=1:length(FileNames)
y=load(fullfile(Directory,FileNames(FilesToRead).name));
...
and it will include only the files with the .txt extension, with the system dependency handled automatically.
I don't think there's any point in then turning y into a cell array instead of just operating on the resulting vector as native double; that alone should speed things up noticeably as well as reduce memory footprint.
Then, as G has noted (and as I did in our previous conversations as well), the loop and the initial x are completely superfluous and a waste of time and memory.
Also, as I noted before, one could then compute the statistics on N elements of the vector at a time by using reshape() to form an N-by-M array, letting builtin routines like mean and var operate down the columns by default, and iterating over the M columns for functions that haven't been vectorized, such as polyfit. At that point you may find it convenient to store the coefficients in a cell array since there are two per column, but it isn't clear there's any need to save them at all rather than just processing each column (section) and moving on to the next. Doing it this way, whether you save the intermediates or not, would eliminate almost all the convoluted indexing that makes so much of the code so difficult to parse.
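A rough sketch of that idea (untested against the real data; it assumes the file loads as a single numeric column and that the block sizes divide the length evenly):
% Hypothetical reshape-based version; names and sizes are placeholders
y = load(fullfile(Directory, FileNames{1}));   % one file's samples
y = y(:);                                      % ensure a column vector
N = 400000;                                    % detrend block length
Y = reshape(y(1:N*floor(numel(y)/N)), N, []);  % one block per column, drop any remainder
x = (1:N)';                                    % common x-axis for every block
flat = zeros(size(Y));
for col = 1:size(Y, 2)                         % polyfit isn't vectorized, so loop over columns
    c = polyfit(x, Y(:, col), 1);
    flat(:, col) = Y(:, col) - polyval(c, x);  % subtract the fitted line from this block
end
M = 100000;                                    % variance window (divides N here)
v = var(reshape(flat(:), M, []));              % var works down the columns by default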
Once you have done that, working through the rest of the machinations should be relatively straightforward with the same idea, although I still contend it would be far simpler to start with a much smaller dataset, N = 400 or somesuch, so you don't get overwhelmed by the sheer size, can see results quickly on screen, and aren't inundated by the mass of numbers that obfuscates the underlying problem. Once you do that, you can go back and verify that your revised code produces the same results for a real data set over that time length.
Again, it would also be wise to recode so that these magic numbers aren't buried in the bowels of the code but are easily set and modifiable, so the code could be used generically if conditions change in the future.
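For instance, a minimal sketch of pulling those constants out to the top (the names are illustrative, and the 20 kHz sample rate is inferred from the n/20000 factor in the original code):
% Hypothetical configuration block at the top of the script
fs            = 20000;    % sample rate, Hz (inferred)
detrendLenSec = 20;       % detrend window in seconds -> 400,000 samples at 20 kHz
varLenSec     = 5;        % variance window in seconds -> 100,000 samples at 20 kHz
detrendLen    = detrendLenSec * fs;   % use instead of the hard-coded 400000
varLen        = varLenSec * fs;       % use instead of the hard-coded 100000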


Accepted Answer

Guillaume
Guillaume on 24 Sep 2019
Edited: Guillaume on 24 Sep 2019
Removing all the unnecessary cruft, this is probably what the code reduces to. I've gone with the fact that your files are 400,000 elements long to remove the badly coded split into 400,000 elements.
I haven't tested the code, so I may have got it wrong. I've kept the meaningless variable names; if I had written it, I would have used much better ones (seriously, a as a variable name?).
%inputs:
y = load(fullfile(Directory, FileNames{FilesToRead})); %since the file is text, it would be better to use csvread or similar instead of load
x = (1:numel(y))';
coeffs = polyfit(x, y(:), 1);
yfit = polyval(coeffs, x); %much simpler than coding the polynomial yourself
dd2 = y - yfit;
n = 100000;
a = var(reshape(dd2, n, []));
tm = x * n / 20000 / 60;
I've not kept the t variable that is recreated again and again, as it doesn't do anything useful. If you still need it, it's the same as x.
It would be simple to add the split into 400,000 elements if it's really needed.
Edit: as discussed below, I missed that dd2 gets renamed to y, replacing the original y. Seriously, who does that? What was wrong with the original dd2 (other than both names being meaningless)?
18 Comments
EL
EL on 29 Sep 2019
Q1: Is this file you're processing now the full experiment file, without the partitioning of it into multiple pieces as before? I THINK that's what I'm reading; just making sure I'm on the same page in thinking through the process.
It is the base number of files produced by the machine. There will be a minimum of two, maybe more. It depends on how often I have to stop data acquisition and change conditions.
Q2: Why the variable names bt and ybt? What does "bt" stand for, and why would you submit that it's a good variable name for a length constant? To me, at least, it makes reading the code difficult because it doesn't seem to have any meaning related to what it is. Making the code as human-readable as possible, like the plot of a novel, is always a goal.
bt was the bin time that the box plots would be made in: 36,000,000 samples, or 30 minutes of data. ybt was a variable I threw in to connect bt and y between different lines. It's not really needed anymore, though.
Q3: Can you hold ybt in memory OK on your system and the other systems you expect to run this on? You did not check the optional return value of the number of elements read, so you don't know whether the read returned the whole wanted number or not. I would expect so, 3.6E6*8/(1024*1024) --> <30 MB on machines with GB of memory these days, but your code needs to verify and handle the case if it fails--if nothing else, inform the user and abort; better yet, have backup logic to process the file in smaller chunks.
Not using ybt anymore, so we don’t need it.
Q4: What do you think is the purpose of the second loop using fscanf? You've already got the double array ybt in memory; just operate on it. Oh--I get it... this appears to be an attempt to implement the suggestion to just loop over the files back when you were reading 400K individual files. OK, now that you are processing the whole file: instead of actually reading the full 3.6E6 array at once, if the processing really is always on chunks of 400K elements at a time (or some other size that may evolve in the future, so it should definitely be a variable and not a buried "magic number"), it makes sense to just read that number of elements from the file each time, operate on them, and then go on to the next chunk. Then you keep a secondary running count of the total number of elements you've read/processed and quit when that reaches the other, bigger magic number (3.6E6 here, or N1/N2 --> 3.6E6/400K = 9 iterations of the loop).
I don’t recall off the top of my head. It was something I thought of that may have let the script work.
Q5: Where is the plotting, and over what sequence of data in the process? The individual smaller sections, or the whole thing, or...?? It would help to know that so the data can be organized for it, too.
All the data analyzed would be plotted together. I intend on having box plots made with 30 minutes of data (36,000,000 divided by 100,000 for the variance results in n = 360 per box plot).
Then the raw y data over time, so the drift of the machine can be measured.
Then the flattened y data over time, so the average amplitude can be viewed without noise (the drift)
Then the flattened y data would undergo FFT, so the most common amplitudes can be viewed as a control (see the sketch below).
Additionally, violin graphs over 30 minute bins would measure which amplitudes were most common in their bin time.
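For the FFT step, a generic one-sided amplitude-spectrum sketch (the 20 kHz sample rate is assumed, as elsewhere in the thread, and yFlat is a placeholder for one detrended section) might look like this:
% Hypothetical one-sided amplitude spectrum of a flattened (detrended) section
fs    = 20000;                         % sample rate, Hz (assumed)
yFlat = detrend(randn(400000,1));      % stand-in for one detrended section
L     = numel(yFlat);
Yf    = fft(yFlat);
P2    = abs(Yf/L);                     % two-sided amplitude
P1    = P2(1:floor(L/2)+1);            % keep one side
P1(2:end-1) = 2*P1(2:end-1);           % fold in the negative-frequency energy
f     = fs*(0:floor(L/2))/L;           % frequency axis in Hz
plot(f, P1); xlabel('Frequency (Hz)'); ylabel('Amplitude');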
Regarding uigetfile, this script will be run remotely through the command line without an interface. The data date and the signal type are enough to differentiate between data. I also intend on having a list of the selected files populate, showing the order they'll be processed in, so the script can be terminated if the wrong files were loaded.
dpb
dpb on 29 Sep 2019
"Not using ybt anymore, so we don’t need it."
That doesn't answer the underlying question -- can you hold the full vector in memory on the systems you're running on? What you use for the variable name is immaterial. If so, then the full file can be operated on at once and there's no need to loop through the file at all. If not, then you need the loop to read segments and process them... but you don't need to (and shouldn't) separate the data out into individual files to do that.
"data date and the signal type is enough to differentiate between data"
Yeah, you've said that, but you're going at it in what appears to be a very awkward way... If you would illustrate the naming pattern used, I'm certain a wildcard pattern could be built from the inputs you've asked for from the user (which will have to come from somewhere no matter how you actually end up running the script, which should undoubtedly be turned into a function for data integrity and scoping), so that dir could return only the files of interest directly.
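As an illustration only (the naming pattern below is a guess, since the actual file-name convention hasn't been shown here):
% Hypothetical: build a wildcard from the user inputs and let dir() do the filtering
rootfolder = 'Z:Data';                                  % as in the snippet above; fullfile supplies the separators
DataDate   = input('Enter date of experiment to be analyzed: ', 's');
signal     = input('Enter the Signal Type DEF, LFM or SUM ', 's');
pattern    = ['*' DataDate '*' signal '*.txt'];         % guessed pattern; adjust to the real names
FileList   = dir(fullfile(rootfolder, DataDate, 'Chopped', pattern));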
Anyway, details of interface and some minor efficiencies aside, it looks like you've got a handle on how to proceed from here; I'll reiterate the idea of playing at the command line with a small dataset so you can understand what the reshape stuff is doing and how.
Think I'll bow out now...good luck.


More Answers (1)

dpb
dpb on 25 Sep 2019
Edited: dpb on 25 Sep 2019
"If you set the directory to the three fiels linked, you'll process them in about 30 seconds."
I did set breakpoint in debugger and ran the first of the three to the point of computing dd2
As we already knew, the result of
y=load('x000.txt');
ypr=detrend(y); % use the Signal Processing TB detrend function
resulted in
K>> dd2{1}(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>> ypr(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>>
which is the same result in one line of code instead of the 50 or so in your function, and it runs in a fraction of a second. And, in fact,
K>> max(abs(dd2{1}-ypr))
ans =
2.8449e-16
K>>
shows that the difference between the two computation methods is within machine precision of a double-precision variable, so we've verified that's what the gosh-awful mess actually does.
So, all you now need is to compute the variance of the detrended time series over the time length of choice -- as G shows and I recommended, the easiest way to do this is to reshape() the vector to an N-by-M array and let MATLAB's default column-wise operation compute the values for each column in one call.
It appears the only thing left is the computation of the tm array and then plotting the results.
All the calculations could be in a loop of 20 lines at the absolute outside; more like 10 or even fewer, it would seem.
Wrap that in an outer loop that opens the big file, reads the desired number of records (400K or whatever was chosen), and computes each section.
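A bare-bones sketch of that outer loop, assuming a plain text file with one number per line (the filename and sizes are placeholders):
% Hypothetical: read the big file in fixed-size sections and process each one
N   = 400000;                              % samples per section (the detrend block)
M   = 100000;                              % variance window within a section
fid = fopen('x000.txt', 'r');              % placeholder filename
v   = [];                                  % accumulated variances
while true
    [section, count] = fscanf(fid, '%f', N);    % read up to N values
    if count < M, break, end                    % not enough left for even one window
    flatSec = detrend(section);                 % remove the linear trend of this section
    nWin    = floor(count/M);
    v = [v, var(reshape(flatSec(1:nWin*M), M, []))];  %#ok<AGROW> one variance per window
end
fclose(fid);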
4 Comments
Guillaume
Guillaume on 25 Sep 2019
@EL, were you the one asking the same question on Discord?
If so, then start a new question (here on answers) for just that, explaining in a lot more details what you want to save, to what kind of file (mat, text, excel, binary, whatever), and why it is incremental.
Note that if your input data files are huge, then using datastore and tall arrays would probably be a good idea. Again, this requires discarding the current code.
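Purely as an illustration of that suggestion (untested; the datastore options and the Var1 variable name would need adjusting to the real file layout):
% Hypothetical: process huge text files lazily with datastore + tall arrays
ds = tabularTextDatastore(fullfile(Directory, '*.txt'), ...
        'ReadVariableNames', false, 'NumHeaderLines', 0);
ty = tall(ds);                 % tall table; column Var1 holds the samples
m  = mean(ty.Var1);            % deferred computations
s  = var(ty.Var1);
[m, s] = gather(m, s);         % evaluate only when gathered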
EL
EL on 25 Sep 2019
Yeah, that was me on Discord.
I'll start a new question regarding that.

