I have a large table with over 13,000 rows, each row represents an experiment and containins a set of genes. I would like to get generate a list of all the genes and the number of experiments that contain that gene. For example:
1 Ansicht (letzte 30 Tage)
Ältere Kommentare anzeigen
Hello,
tableA:
Experiment, Gene1, Gene2, Gene3, Gene4;
'1', 'A', 'C', 'D', '';
'2', 'A', 'C', 'G', '';
'3', 'C', 'B', 'D', 'D';
What I would like is a count of number of experiments that each gene occurs in ignoring duplicates within the same experiment. For Example A = 2, B = 1, C = 3, D = 2
0 Kommentare
Antworten (3)
Guillaume
am 5 Okt. 2017
Here is how I'd do it:
%demo input
tableA = cell2table({...
'1', 'A', 'C', 'D', '';
'2', 'A', 'C', 'G', '';
'3', 'C', 'B', 'D', 'D';}, ...
'VariableNames', ...
{'Experiment', 'Gene1', 'Gene2', 'Gene3', 'Gene4'})
%processing
genes = unique(tableA{:, 2:5}).'; %get list of genes
genes(1) = []; %get rid of ''
expcount = sum(cell2mat(rowfun(@(g) ismember(genes, g), tableA, 'SeparateInputs', false, 'InputVariables', 2:5, 'OutputFormat', 'cell')));
result = array2table(expcount, 'VariableNames', compose('gene_%s', string(genes))) %for pretty display
2 Kommentare
Guillaume
am 5 Okt. 2017
Exactly, g is the the name of the input to the anonymous function. It receives a row of the table, and check the list of genes against it. That gives you an array of 0 and 1 telling which genes are present in the row. Summing all the rows gives you the result you want.
Stephen
am 5 Okt. 2017
Bearbeitet: Stephen
am 5 Okt. 2017
Given your example and format, you will probably want to make use of a for loop and the "unique" function:
Gene = struct('numExp',[],...) %Add other properties of experiments
AllGenes = struct('A',Gene ,'B',Gene ,'C',Gene ,...etc) % Each gene gets a structure field
for ptr = 1:13000
currGenes = unique(tableA(2:end));
for ptr2 = 1:length(currGenes)
incGene = currGenes{ptr2};
AllGenes.(incGene).numExp = AllGenes.(incGene).numExp + 1;
end
end
Peter Perkins
am 13 Okt. 2017
+1 for Guillaume's use of rowfun.
Here's another possibility that might be wiorth looking at if you have very large number of experiments and a smaller number of genes:
>> tableA = cell2table({...
'1', 'A', 'C', 'D', '';
'2', 'A', 'C', 'G', '';
'3', 'C', 'B', 'D', 'D';}, ...
'VariableNames', ...
{'Experiment', 'Gene1', 'Gene2', 'Gene3', 'Gene4'});
Turn things into categorical because categorical is useful:
>> tableA.Experiment = categorical(tableA.Experiment);
>> genes = {'A' 'B' 'C' 'D' 'G'};
>> tableA.Gene1 = categorical(tableA.Gene1,genes);
>> tableA.Gene2 = categorical(tableA.Gene2);
>> tableA.Gene3 = categorical(tableA.Gene3);
>> tableA.Gene4 = categorical(tableA.Gene4);
Create one array, then look across rows for each gene, looping over genes instead of rows. This could create a 1-row table, or a scalar struct, or a double vector (with no names). I chose the struct, just because a table with only one row is kind of overkill:
>> allGenes = [tableA.Gene1 tableA.Gene2 tableA.Gene3 tableA.Gene4]
allGenes =
3×4 categorical array
A C D <undefined>
A C G <undefined>
C B D D
>> for i = 1:length(genes)
counts.(genes{i}) = sum(any(allGenes==genes{i},2));
end
>> counts
counts =
struct with fields:
A: 2
B: 1
C: 3
D: 2
G: 1
0 Kommentare
Siehe auch
Kategorien
Mehr zu Cell Arrays finden Sie in Help Center und File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!