Simple for loop on 90k records runs 51 seconds?!?

1 Ansicht (letzte 30 Tage)
Viktor Nikolov
Viktor Nikolov am 8 Jul. 2021
Kommentiert: Viktor Nikolov am 12 Jul. 2021
Hi!
I'm puzzled by a very bad performance of a simple loop running on 90k records. :-(
% I read a table from MySQL, which contains data counters from a router:
data = fetch(conn,query);
% I do datetime column conversion:
data.time = datetime(data.time,'InputFormat','yyyy-MM-dd HH:mm:ss.S');
>>height(data)
ans =
89458
>> head(data, 3)
ans =
3×3 table
time in_delta out_delta
____________________ __________ __________
01-Sep-2020 00:00:02 7.6231e+05 1.1959e+06
01-Sep-2020 00:05:02 7.0894e+05 8.4548e+05
01-Sep-2020 00:10:02 1.0837e+06 8.3609e+05
% Then I perform this simple loop to calculate router throughput based on data counters:
tic
for i = 1:height(data)-1
data.duration(i) = data.time(i+1) - data.time(i); % interval between measurements
data.in_Mbps(i) = data.in_delta(i+1) * 8 / seconds(data.duration(i)) / 1e6; % incoming througput in Mbps
data.out_Mbps(i) = data.out_delta(i+1) * 8 / seconds(data.duration(i)) / 1e6; % outgoing througput in Mbps
end;
toc
Elapsed time is 51.2 seconds
I run this on a laptop with Intel Core i5 on 2.9 GHz, 8 GB RAM.
I know Matlab is an interpret but 51 sec for 90k records still seem extremely slow.
I know that doing a vector calculation would take milliseconds but I need to combine data(i) and data(i+1) in a single statement.
What am I doing wrong?
  2 Kommentare
Yongjian Feng
Yongjian Feng am 8 Jul. 2021
Bearbeitet: Yongjian Feng am 8 Jul. 2021
8GB RAM seems low.
Can you do a profile to see where the bottleneck is?
https://www.mathworks.com/help/matlab/matlab_prog/profiling-for-improving-performance.html
Viktor Nikolov
Viktor Nikolov am 8 Jul. 2021
I ran the profiler (full report attached).
If I understand it well, the most of the time is consumed by the call
t = move(t).subsasgnDot(s,b);
made from function tabular.subsasgn.
Is it assignment command which takes most time??? Can I rewrite it to be more efficient?
RAM doesn't seem to be an issue. Matlab consumes 1 GB and Task Manager still reports 2.3 GB available.

Melden Sie sich an, um zu kommentieren.

Akzeptierte Antwort

dpb
dpb am 8 Jul. 2021
Bearbeitet: dpb am 8 Jul. 2021
for i = 1:height(data)-1
data.duration(i) = data.time(i+1) - data.time(i);
...
You've haven't preallocated and even though it's a table, I'm guessing that's the problem. Time
data.duration=repmat(seconds(0),height(data),1);
tic
for i = 1:height(data)-1
data.duration(i) = data.time(i+1) - data.time(i); % interval between measurements
data.in_Mbps(i) = data.in_delta(i+1) * 8 / seconds(data.duration(i)) / 1e6; % incoming througput in Mbps
data.out_Mbps(i) = data.out_delta(i+1) * 8 / seconds(data.duration(i)) / 1e6; % outgoing througput in Mbps
end;
toc
in comparison.
But, you don't need a loop at all;
data.duration=[seconds(0); diff(data.time)];
data.in_Mbps =8*data.in_delta(i+1)/seconds(data.duration)/1e6;
data.out_Mbps=8*data.out_delta/seconds(data.duration)/1e6;
Just offset the time difference to align with the variables as you wish to avoid the indexing offset.
Or, of course, you could write row indexing vectors of (1:height-1) and (2:height) that are of the same length and leave the offsets as were if that were really crucial.
I just timed the loop with and without preallocation here for an existing table of only ~300 records and the difference was about a factor of 10X at the command line. I'd expect in a m-file to be better with JIT, but still illustrates the problem.
  4 Kommentare
dpb
dpb am 8 Jul. 2021
Bearbeitet: dpb am 8 Jul. 2021
Will be curious to hear what speedup the vectorized solution gives...
Viktor Nikolov
Viktor Nikolov am 12 Jul. 2021
Just for the record, let me document the vectorized solution and its performance:
% I replaced this loop (which runs 51 sec on 90k records)
for i = 1:height(data)-1
data.duration(i) = data.time(i+1) - data.time(i);
data.in_Mbps(i) = data.in_delta(i+1) * 8 / seconds(data.duration(i)) / 1e6;
data.out_Mbps(i) = data.out_delta(i+1) * 8 / seconds(data.duration(i)) / 1e6;
end;
% by a vector computation:
tic
data.duration = [ diff(data.time); seconds(0) ];
data.in_Mbps = ...
[ data.in_delta(2:height(data)) * 8 ./ seconds(data.duration(1:height(data)-1)) / 1e6; ...
0 ];%must add this last zero element to have the vector of the same height as table data
data.out_Mbps = ...
[ data.out_delta(2:height(data)) * 8 ./ seconds(data.duration(1:height(data)-1)) / 1e6; 0 ];
toc
Elapsed time is 0.008785 seconds.
The lesson learned for me is never to use a for loop for processing data in a table. There are simple tricks allowing you to do it as a vector computation.

Melden Sie sich an, um zu kommentieren.

Weitere Antworten (1)

Steven Lord
Steven Lord am 8 Jul. 2021
I would use vector operations. Let's see, try something like this (untested):
interval = diff(data.time);
s = seconds(interval);
in_Mbps = data.in_delta(2:height(data)) * 8 ./ s / 1e6;
out_Mbps = data.out_delta(2:height(data)) * 8 ./ s / 1e6;
data.duration = [0; interval];
data.in_Mbps = [NaN; in_Mbps];
data.out_Mbps = [NaN; out_Mbps];

Produkte


Version

R2021a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by