Reduce the Size of Matrix

5 Ansichten (letzte 30 Tage)
Isay
Isay am 22 Nov. 2014
Kommentiert: Stephen23 am 1 Dez. 2014
I need to Reduce my Matrices(Xbool_last and Xfreq_last) , because in the 1000th step of loop(it means docx=1000) , Matlab said : out of Memory!(loop is from 1 to 5549 !! )
please look at the part of code with information in it:
%%The loop for exploring ALL the documents to create the tf-idf weight matrix
for docx = 1 : length(DBlast)
docx
for word = 1 : length(DBlast{docx})
% In docx , we search all words in docx
word_xi = DBlast{docx}{word,1} ;
for docy = 1 : length(DBlast)
% While the source words are from docx search for them in
% the rest of documents
% if word_1i found in document i(=doc) vote 1
if sum(strcmpi(DBlast{docy},word_xi)) ~= 0
ind = find(strcmpi(DBlast{docy},word_xi) ~= 0) ;
Xbool(word,docy) = 1 ;
Xfreq(word,docy) = Freqlast{docy}(ind) ;
else
% else vote 0
Xbool(word,docy) = 0 ;
Xfreq(word,docy) = 0 ;
end
end
end
Xbool_last = [Xbool_last;uint8(Xbool)];
Xfreq_last = [Xfreq_last;uint8(Xfreq)];
Xbool = [] ;
Xfreq = [] ;
end
===============================================================================
So, questions: 1- how can i Reduce the size of Xbool_last and Xfreq_last? if i need to export Matrices TO .txt file (or something else) for Using it , How can I save them? or load them?
can you say the recommended code?
2. How can I use, the output of above code in tf-idf algorithm?(if you konw),
the tf-idf code is attached

Akzeptierte Antwort

Guillaume
Guillaume am 22 Nov. 2014
Bearbeitet: Guillaume am 22 Nov. 2014
You're already using uint8 to store your values. There isn't a smaller type unless you start packing booleans into bits which I assume is not possible for Xfreq_last anwyay. Using bits to store boolean is bound to be slow in matlab and awkward in matlab. There's no built-in function for that.
However, your storage looks incredibly inefficient to me. Say you're processing the first word of the first document. You find it in documents 2, 10, 150, 2048, 4125 for example. For a start, instead of storing those values (which woudln't take much memory, ~20 bytes as uint32), you store a boolean array of size 1x5549 (~5549 bytes) with only a few ones. But more importantly, in document 2, you're going to be looking for the exact same word, which you'll find in the exact same documents and store that again. Why?
Why not do the storage per word, instead of document, and for each word, just store which document it's found in?
  1 Kommentar
Stephen23
Stephen23 am 1 Dez. 2014
+1 for excellent advice of data management.

Melden Sie sich an, um zu kommentieren.

Weitere Antworten (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by