textAnalytics toolbox: removing Entity details from documents

Question

0 votes

I have a very large set of documents that I am preprocessing for use in a BERT classification model.

I have tokenized the documents and added the entity details.

Now I want to remove all of the tokens within the documents that have been tagged as "organization".

I have the following variables:

documents: tokenized documents

tdetails: a table of tokens with the document number, sentence number, line number, Type, Language, PartOfSpeech and Entity.

Token                   DocumentNumber  SentenceNumber  LineNumber  Type       Language  PartOfSpeech   Entity
"Astoria"               1               2               3           'letters'  'en'      'proper-noun'  'person'
"Federal Savings Bank"  1               2               3           'other'    'en'      'proper-noun'  'organization'
"settled"               1               2               3           'letters'  'en'      'verb'         'non-entity'

How do I remove all of the tokens in the variable documents whose Entity is "organization"?

For example, in documents(1,1).Vocabulary(7) I can find "Federal Savings Bank", which is in row 7 of the example above. I could loop through all of the documents and the rows of tdetails where Entity is "organization", but that would take quite a while.

I can't seem to figure out how to do this more simply.


Answer 1

Cris LaPierre on 18 Nov 2023

2 votes

I would use removeWords.

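documents = tokenizedDocument(Text(:));
documents = addEntityDetails(documents);   % adds the Entity column; skip if already done
tdetails = tokenDetails(documents);
% Select the tokens tagged as organizations, then remove them from every document
documents2 = removeWords(documents,tdetails.Token(tdetails.Entity=="organization"));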
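For anyone who wants to try this end to end, here is a minimal sketch on a tiny made-up corpus. The two sentences and the variable names str, orgTokens, and documentsClean are illustrative assumptions, not from the question. It assumes Text Analytics Toolbox is installed; depending on the release, addEntityDetails may also need an additional support package. Which tokens end up tagged as organizations depends on the entity model, so no expected output is shown.

```matlab
% Illustrative input only; replace with your own text.
str = [
    "Astoria Federal Savings Bank settled the case."
    "The bank paid a fine last year."];

documents = tokenizedDocument(str);
documents = addEntityDetails(documents);   % tags tokens with entity types
tdetails  = tokenDetails(documents);       % one row per token, all documents

% Logical indexing on the table replaces the per-document loop.
isOrg     = tdetails.Entity == "organization";
orgTokens = unique(tdetails.Token(isOrg));

% Remove the tagged tokens (guarded in case nothing was tagged).
documentsClean = documents;
if ~isempty(orgTokens)
    documentsClean = removeWords(documents, orgTokens);
end
```

Note that removeWords matches by word, not by position: any token equal to one of orgTokens is removed from every document, including occurrences that were not themselves tagged as an organization.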

david cowan on 19 Nov 2023

Moved: Cris LaPierre on 19 Nov 2023

Really appreciate that.

removeWords!!

I won't forget that now. I knew there had to be a simple approach that I was just missing.
