Find duplicate elements and remove the rows that have similar values in one column
Dear Matlab experts,
I am using the following function to find the rows that have similar values in their 9th column. The calculation is very slow because the data is big. Any suggestions for modifying my code to increase the speed, or any other way to achieve this?
Thank you in advance.
function in1=dup_remove(out2)
    b=[];
    for i=1:size(out2,1)
        [r,c]=find(out2(:,9)==out2(i,9));
        if(length(r)==1)
            b=[b;out2(i,:)];
        end
    end
    if (~isempty(b))
        in1=b;
    end
end
5 Comments
Jan
on 19 Oct 2022
@KSSV: How? I've tried it without success. The only way I've found with standard Matlab functions uses unique to get a list of the occurring values and histcounts to identify the elements that occur only once. This was much slower than sorting the input, comparing neighbors with diff, removing the duplicates and restoring the original order.
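For reference, the counting approach mentioned in this comment can be sketched roughly as follows. This is a minimal sketch, not the code that was timed: it uses accumarray on the index output of unique instead of histcounts, and the small matrix x is made-up example data with the key in column 9.

```matlab
% Made-up example data: column 1 is a row label, column 9 holds the key.
x = [(1:6).', zeros(6, 7), [5; 3; 5; 7; 3; 9]];

x9 = x(:, 9);
[~, ~, j] = unique(x9);        % j(i) = index of x9(i) in the list of distinct values
cnt = accumarray(j(:), 1);     % cnt(k) = how often the k-th distinct value occurs
keep = cnt(j(:)) == 1;         % true for rows whose key occurs exactly once
y = x(keep, :);                % rows 4 and 6 remain (keys 7 and 9)
```

Keys 3 and 5 each occur twice here, so only the rows with keys 7 and 9 survive. According to the comment above, a count-based route of this kind was much slower than the sort-based one.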
Accepted Answer
Jan
on 18 Oct 2022
Edited: Jan
on 19 Oct 2022
Avoid growing arrays iteratively, because this is extremely expensive. See:
x = [];
for k = 1:1e6
    x(k) = rand;
end
This creates a new vector x in each iteration and copies the former contents of the vector to the new one, so Matlab reserves and copies sum(1:1e6)*8 bytes in total, which is more than 4 TB!
Pre-allocation solves the problem:
x = zeros(1, 1e6);
for k = 1:1e6
    x(k) = rand;
end
This reserves only 8 MB and copies just the scalar elements.
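The two figures can be checked with a few lines of arithmetic. This is a sketch under the simple model used above, where iteration k allocates and copies a vector of k doubles; the variable names are only illustrative, and the memory traffic of a real Matlab session may differ.

```matlab
n = 1e6;                 % number of loop iterations
bytesPerDouble = 8;      % one double-precision value

% Growing: iteration k is modeled as allocating and copying a k-element vector.
grownBytes = sum(1:n) * bytesPerDouble;      % = n*(n+1)/2 * 8

% Pre-allocated: the vector is reserved once.
preallocBytes = n * bytesPerDouble;

fprintf('growing:       %.6f TB\n', grownBytes / 1e12);
fprintf('pre-allocated: %.0f MB\n', preallocBytes / 1e6);
```

Under this model the growing loop moves 4.000004 TB, while the pre-allocated version reserves 8 MB.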
In your case:
function y = dup_remove(x)
    x9 = x(:, 9);  % Slightly faster than indexing each time
    n = size(x,1);
    match = false(n, 1);
    for i = 1:n
        [r, c] = find(x9 == x9(i));
        match(i) = (numel(r) == 1);
    end
    y = x(match, :);
end
It is rather strange to call the input "out2" and the output "in1".
A smarter method:
function y = dup_remove(x)
    x9 = x(:, 9);                     % Slightly faster than indexing each time
    T = true(numel(x9), 1);
    [S, idx] = sort(x9(:).');
    m = [true, diff(S) ~= 0];         % True at the 1st occurrence of each value
    ini = strfind(m, [true, false]);  % 1st occurrences followed by a repetition
    m(ini) = false;                   % Unmark the 1st occurrence of repeated values, too
    T(idx) = m;                       % Restore original order
    y = x(T, :);
end
Sorting avoids comparing each element with all the others: after sorting, only a comparison with the neighboring elements is required.
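To make the neighbor comparison concrete, here is a small trace on made-up keys. It is a sketch of the same idea, not the function above: instead of strfind it combines two masks ("differs from predecessor" and "differs from successor"), which marks the same elements, namely those whose value occurs exactly once.

```matlab
x9 = [5; 3; 5; 7; 3; 9];         % made-up keys; 3 and 5 occur twice

[S, idx] = sort(x9(:).');        % S = [3 3 5 5 7 9], idx = [2 5 1 3 4 6]
d = diff(S) ~= 0;                % true where a sorted value differs from its successor
differsFromPrev = [true, d];     % the first element has no predecessor
differsFromNext = [d, true];     % the last element has no successor
m = differsFromPrev & differsFromNext;   % [0 0 0 0 1 1]: value occurs exactly once

T = false(numel(x9), 1);
T(idx) = m;                      % map the mask back to the original order
disp(x9(T).')                    % keys that occur exactly once: 7 and 9
```

Each element is compared only with its two sorted neighbors, so the cost is dominated by the sort rather than by comparing every element with all the others.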
2 Comments
Jan
on 18 Oct 2022
Edited: Jan
on 18 Oct 2022
Some timings:
x = randi([0, 65535], 1e4, 9);
n = 10;  % Repeat loops for accurate timings
tic
for k = 1:n
    y0 = dup_remove(x);
end
toc  % Original:
tic
for k = 1:n
    y1 = dup_remove1(x);
end
toc  % Avoid iterative growing:
tic
for k = 1:n
    y11 = dup_remove11(x);
end
toc  % Without FIND:
tic
for k = 1:n
    y2 = dup_remove2(x);
end
toc  % Using SORT and comparison of neighbors:
function in1=dup_remove(out2)
    b=[];
    for i=1:size(out2,1)
        [r,c]=find(out2(:,9)==out2(i,9));
        if(length(r)==1)
            b=[b;out2(i,:)];
        end
    end
    if (~isempty(b))
        in1=b;
    end
end
function y = dup_remove1(x)
    x9 = x(:, 9);  % Slightly faster than indexing each time
    n = size(x,1);
    m = false(n, 1);
    for i = 1:n
        [r, c] = find(x9 == x9(i));
        m(i) = (numel(r) == 1);
    end
    y = x(m, :);
end
function y = dup_remove11(x)
    x9 = x(:, 9);  % Slightly faster than indexing each time
    n = size(x,1);
    m = false(n, 1);
    for i = 1:n
        m(i) = (sum(x9 == x9(i)) == 1);
    end
    y = x(m, :);
end
function y = dup_remove2(x)
    x9 = x(:, 9);                     % Slightly faster than indexing each time
    T = true(numel(x9), 1);
    [S, idx] = sort(x9(:).');
    m = [true, diff(S) ~= 0];         % True at the 1st occurrence of each value
    ini = strfind(m, [true, false]);  % 1st occurrences followed by a repetition
    m(ini) = false;                   % Unmark the 1st occurrence of repeated values, too
    T(idx) = m;                       % Restore original order
    y = x(T, :);
end
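Before comparing timings, it is worth confirming that the variants select the same rows. The following is a minimal sketch under stated assumptions: the test data is smaller than above (2000 rows, keys in 0..999, so that many keys repeat), and the sort-based mask is written with two neighbor comparisons rather than strfind, which is an equivalent reformulation and not the code that was timed.

```matlab
x  = randi([0, 999], 2000, 9);         % assumed test data with many repeated keys
x9 = x(:, 9);
n  = numel(x9);

% Reference mask: count the matches of every key (as in dup_remove11).
mLoop = false(n, 1);
for i = 1:n
    mLoop(i) = (sum(x9 == x9(i)) == 1);
end

% Sort-based mask: a key is kept if it differs from both sorted neighbors.
[S, idx] = sort(x9(:).');
d = diff(S) ~= 0;
mSort = false(n, 1);
mSort(idx) = [true, d] & [d, true];

assert(isequal(mLoop, mSort), 'The two masks differ');
fprintf('%d of %d rows have a key that occurs once\n', nnz(mSort), n);
```

If the assertion passes, x(mLoop, :) and x(mSort, :) are the same rows in the same order, so any timing difference reflects only the method.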