Removing commas between columns in text data
Kim Maria Damiani
on 16 Oct 2021
Commented: Kim Maria Damiani
on 16 Oct 2021
I have a text file containing the output of a lemmatizer, in the form
Sometimes, ,, I, use, commas, .
I, like, writing, ,, I, like, reading
How can I read it into a tokenizedDocument while deleting the unnecessary commas between tokens? A simple approach would be
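test = readlines('/path/to/file.txt');  % read each line of the file as a string
test = strrep(test, ',', '');           % deletes every comma, including real punctuation
test = tokenizedDocument(test);         % requires Text Analytics Toolbox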
but that would also remove the commas that were already present in the original text, and I'd like to preserve punctuation.
Accepted Answer
Walter Roberson
on 16 Oct 2021
test = {'Sometimes, ,, I, use, commas, .'
'I, like, writing, ,, I, like, reading'};
test = regexprep(test, {'(?<=[^,]),\s', '\s*,,', '\s+\.'}, {' ', ',', '.'})
Notice that we need a special rule for periods. You have 'use, commas', which should almost certainly become 'use commas' (so a comma-space pair becomes a space), but 'commas, .' should become 'commas.' rather than 'commas .'.
To put it another way, we cannot simply delete every comma-space pair: that works for the pair between the word 'commas' and the period, but not for the pair between 'use' and 'commas', where it would merge the two words into 'usecommas'.
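To make the effect of each rule visible, here is a minimal sketch that applies the same three patterns one at a time instead of in a single call. The variable names step1, step2 and step3 are only illustrative; the patterns and sample lines are the ones used above.

```matlab
% Sample lines copied from the question.
test = {'Sometimes, ,, I, use, commas, .'
        'I, like, writing, ,, I, like, reading'};

% Rule 1: a comma that is not preceded by another comma and is followed by
% whitespace is a separator, so the pair becomes a single space.
step1 = regexprep(test, '(?<=[^,]),\s', ' ');
% step1{1} is 'Sometimes ,, I use commas .'

% Rule 2: a doubled comma (with any whitespace before it) stands for a real
% comma in the original text, so it collapses to one comma.
step2 = regexprep(step1, '\s*,,', ',');
% step2{1} is 'Sometimes, I use commas .'

% Rule 3: whitespace in front of a period is dropped.
step3 = regexprep(step2, '\s+\.', '.');
% step3{1} is 'Sometimes, I use commas.'

disp(step3)
```

Passing the patterns and replacements as cell arrays, as in the one-line version, applies them in this same order, each to the result of the previous one.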
More Answers (1)
Chunru
on 16 Oct 2021
test = {'Sometimes, ,, I, use, commas, .'
'I, like, writing, ,, I, like, reading'};
test = regexprep(test, ',\s', ' ')
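As a rough illustration of what this single rule does to the sample lines, the sketch below applies it and then splits the first line on spaces. The names out and tokens are only illustrative, and strsplit is used here just to list the pieces; it is not a substitute for tokenizedDocument.

```matlab
% Sample lines copied from the question.
test = {'Sometimes, ,, I, use, commas, .'
        'I, like, writing, ,, I, like, reading'};

% Every comma that is followed by whitespace becomes a space. In a doubled
% comma only the second one is followed by whitespace, so the first survives.
out = regexprep(test, ',\s', ' ');
% out{1} is 'Sometimes , I use commas .'
% out{2} is 'I like writing , I like reading'

% Splitting on spaces shows that the punctuation stays as separate pieces.
tokens = strsplit(out{1}, ' ');
% tokens is {'Sometimes', ',', 'I', 'use', 'commas', '.'}

disp(out)
```

With this rule the original comma and the period stay separated from the neighbouring words by a space, rather than being attached to the preceding word.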