unique is giving the same expression twice

Wesso on 29 Jan 2021
Edited: dpb on 29 Jan 2021
Hi,
(data is attached)
[Country,~,ix] = unique(A);
tally = accumarray(ix, 1);
Q2= table(Country, tally);
Q2 lists 'Audit and assurance, and tax services' twice among the unique values. What could be the reason, and how can I overcome it? Is it a bug?
4 Comments
Steven Lord on 29 Jan 2021
They may look the same, but can you prove they're stored the same? Store two of the expressions that look identical in separate variables x and y then run the following code and show us the results.
disp(x)
disp(y)
isequal(x, y)
whos x y
x==y % only if x and y are the same size
dpb on 29 Jan 2021
Edited: dpb on 29 Jan 2021
This undoubtedly is the same issue I pointed out before at https://www.mathworks.com/matlabcentral/answers/730643-replacing-999-in-a-table-to-nan-regardless-of-the-type-of-the-column?s_tid=srchtitle#comment_1294958 where the encoding is different. The strings look the same on screen, but one contains a non-ASCII character where the other has the plain ASCII one.
Here's the specifics to show what was there for that particular set of values I looked at; undoubtedly you'll find the same thing here if you look carefully...
>> sort(categories(Final.org04b))
ans =
46×1 cell array
{'-999' }
{'-9999' }
...
{'I don't know' }
{'I don’t know' }
...
>> tmp=ans(42:43)
tmp =
2×1 cell array
{'I don't know'}
{'I don’t know'}
>> strcmp(tmp(1),tmp(2))
ans =
logical
0
>> [double(tmp{1});double(tmp{2})]
ans =
73 32 100 111 110 39 116 32 107 110 111 119
73 32 100 111 110 8217 116 32 107 110 111 119
>>
NB: the second string contains character code 8217 (U+2019, the right single quotation mark) where the first has ASCII 39, the plain single quote.
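The same check can be run over a whole list of labels at once. What follows is only a minimal sketch: the `labels` variable and its two sample entries are made up for illustration, and "code above 127" is used here as a rough test for "not plain ASCII".

```matlab
% Hypothetical sample labels; the second uses code 8217 instead of ASCII 39.
labels = {'I don''t know'; ['I don' char(8217) 't know']};

% Flag every label that contains a character code above 127 (non-ASCII).
hasNonAscii = cellfun(@(s) any(double(s) > 127), labels);

% Report the position and code of each offending character.
for k = find(hasNonAscii).'
    pos = find(double(labels{k}) > 127);
    fprintf('label %d: position %s, code %s\n', ...
        k, mat2str(pos), mat2str(double(labels{k}(pos))));
end
```

With real data, `labels` could be the output of `categories(...)`; any flagged entry is a candidate for a look-alike duplicate.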

Accepted Answer

dpb on 29 Jan 2021
Edited: dpb on 29 Jan 2021
I didn't notice the data attached for this case -- the same exercise as above shows:
>> sort(categories(A))
ans =
29×1 cell array
{'Agriculture and fishing' }
{'Audit and assurance, and tax services' }
{'Audit and assurance, and tax services' }
{'Banking and capital markets' }
{'Civil Societies/NGOs' }
{'Civil society/NGOs' }
{'Construction' }
{'Consulting services' }
{'Education and academia' }
{'Electronics' }
{'Energy, utilities and resources' }
{'Financial services' }
{'Food Services' }
{'Government and public services' }
{'Health and healthcare services' }
{'Hospitality' }
{'IT and telecommunications' }
{'Manufacturing' }
{'Mining and Quarrying' }
{'Oil and gas' }
{'Other' }
{'Other business services' }
{'Other business services, please specify: ____________'}
{'Petrochemicals' }
{'Real Estate' }
{'Tourism' }
{'Transportation and logistics' }
{'Wholesale and retail trade' }
{'org03' }
>> tmp=ans(2:3)
tmp =
2×1 cell array
{'Audit and assurance, and tax services'}
{'Audit and assurance, and tax services'}
>>
There's an extended character (code 160, the non-breaking space) in the second where there's an ordinary space (code 32) in the first:
>> find(tmp{1}~=tmp{2})
ans =
25
>> [double(tmp{1}(25));double(tmp{2}(25))]
ans =
32
160
>>
Besides that, there are other anomalous entries as well, just as were pointed out for the other categorical array in the previous question:
...
{'Civil Societies/NGOs' }
{'Civil society/NGOs' }
...
{'Other business services' }
{'Other business services, please specify: ____________'}
...
These need to be cleaned up, or one will never be able to match all elements of what are obviously intended to be the same categories but are not.
The data need a thorough cleaning before being ready for prime time.
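One possible cleanup pass is sketched below; it is not the only way to do it. The sample `A` is a made-up stand-in for the attached data, and the choice of which spelling variants to merge is an assumption that only the data owner can confirm. The sketch normalizes the look-alike characters in the text, folds near-duplicate categories together with `mergecats`, and then recounts.

```matlab
% Hypothetical sample data standing in for the attached column A.
nbsp = char(160);                                  % non-breaking space
A = categorical({ ...
    'Audit and assurance, and tax services'; ...
    ['Audit and assurance, and' nbsp 'tax services']; ...
    'Civil Societies/NGOs'; ...
    'Civil society/NGOs'; ...
    'Construction'});

% Step 1: normalize look-alike characters in the text itself.
s = string(A);
s = replace(s, nbsp, ' ');            % code 160  -> ordinary space (32)
s = replace(s, char(8217), '''');     % code 8217 -> ASCII apostrophe (39)
s = strtrim(s);                       % drop leading/trailing whitespace
A = categorical(s);

% Step 2: fold spelling variants into one category (first name listed wins).
% Which variants belong together is a judgment call, assumed here.
A = mergecats(A, {'Civil society/NGOs', 'Civil Societies/NGOs'});

% Step 3: recount on the cleaned categories.
Country = categories(A);
tally = countcats(A);
Q2 = table(Country, tally);
disp(Q2)
```

After the pass, the two 'Audit and assurance, and tax services' entries collapse into a single row, and so do the two NGO spellings.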
