Problem with text analysis

Question

Hi, I'm trying to clean a table containing both Latin and non-Latin strings so that I can plot a word cloud. I used the regexprep function, but without success: I can't remove the Korean strings. Any ideas? Here is an example of the code and the output:

pathName = 'Keyword Aug. 2020 to Oct. 2021_MatlabSmall.xlsx';
T = readtable(pathName,'Range','A:B');
% Convert all keywords to lowercase
T.Keyword = lower(T.Keyword);
% Remove unhelpful keywords
T(strcmp(T.Keyword, '(not provided)'), :)=[];
T(strcmp(T.Keyword, '(not set)'), :)=[];
% Remove links
T(contains(T.Keyword, 'http'), :)=[];
T(contains(T.Keyword, '.'), :)=[];
T.Keyword = strrep(T.Keyword, ' ', '_');
display(head(T));
% Replace non alphanumerics
T.Keyword = regexprep(T.Keyword,'^a-z','');
 
8×2 table
                 Keyword                 Sessions
    _________________________________    ________
    'stuff'                                390   
    'forum'                                128   
    'student'                               76   
    '재료'                                  59   
    'stuff'                                 56   
    'uninstall_stuff_license_manager'       52   
    'stuff_resource_center'                 43   
    'stuff_student_community'               34   

Answer 1

DGM on 19 Oct 2021

I'm terrible with regex, but this might get you somewhere. It removes everything except lowercase letters and underscores.

A = {'9.banana' 'orange-123_juice' 'ン戦国時' 'apple_sauce' 'abcクルミ' 'peach' 'pear' 'ピラミッド' 'cherry'}.'
A = 9×1 cell array
    {'9.banana'        }
    {'orange-123_juice'}
    {'ン戦国時'         }
    {'apple_sauce'     }
    {'abcクルミ'        }
    {'peach'           }
    {'pear'            }
    {'ピラミッド'       }
    {'cherry'          }
B = regexprep(A,'[^a-z_]','')
B = 9×1 cell array
    {'banana'      }
    {'orange_juice'}
    {0×0 char      }
    {'apple_sauce' }
    {'abc'         }
    {'peach'       }
    {'pear'        }
    {0×0 char      }
    {'cherry'      }
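
To connect this back to the table in the question, here is a minimal sketch. The small table is a made-up stand-in for the spreadsheet, and the variable `unchanged` and the final step of dropping empty rows are my own additions, not something the question asked for. It shows why the pattern '^a-z' leaves the keywords untouched (without brackets, '^' is a start-of-string anchor), and how the bracketed pattern behaves on the same data.

```matlab
% Small stand-in for the spreadsheet in the question (values are illustrative).
Keyword  = {'stuff'; 'forum'; '재료'; 'stuff_resource_center'};
Sessions = [390; 128; 59; 43];
T = table(Keyword, Sessions);

% Without brackets, '^' anchors the match to the start of the string, so
% '^a-z' only matches the literal text "a-z" at the beginning of a keyword.
unchanged = regexprep(T.Keyword, '^a-z', '');

% Inside brackets, a leading '^' negates the set: every character that is
% not a lowercase ASCII letter or an underscore is removed.
T.Keyword = regexprep(T.Keyword, '[^a-z_]', '');

% Keywords that contained only non-Latin characters are now empty; drop those rows.
T(cellfun(@isempty, T.Keyword), :) = [];

disp(T)
```

Note that this pattern also strips digits and hyphens. If you want to keep digits, '[^a-z0-9_]' would be the corresponding pattern.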
