Statistics of datastore of tabular data

Omar Kamel on 25 Mar 2024
Commented: Omar Kamel on 28 Mar 2024
Hey all,
I have thousands of Parquet files. Each file has more than 50,000 rows of numerical data and more than 100 columns. My data can't fit in memory, so I use datastores to import and handle the data for a downstream machine learning workflow. I would like to know whether it is possible to calculate some statistics (max, min, mean, std for each channel) for each file during the datastore creation process, which I can then use to filter and select the relevant segments of data for my downstream analysis.
Thanks in advance

Accepted Answer

Abhas on 26 Mar 2024
Hi Omar,
To calculate statistics (max, min, mean, std for each channel) during the datastore creation process in MATLAB and use them for filtering and selecting relevant data segments for downstream analysis, you can follow these steps:
  1. Create a Datastore: Initialize a 'datastore' for your Parquet files.
  2. Define Custom Function: Create a function to compute the desired statistics for each chunk of data.
  3. Apply Transformation: Use the 'transform' function to apply your custom statistics calculation to the datastore.
  4. Read and Aggregate Statistics: Iterate over the datastore to read the statistics of each chunk and aggregate them globally.
  5. Use Statistics for Filtering: Leverage the aggregated statistics to filter and select relevant data segments.
Here's the MATLAB code to reflect the above steps:
% Step 1: Create Your Datastore
% ReadSize 'file' makes each read return one whole file, so the statistics below are per file
ds = parquetDatastore('path/to/your/parquet/files/*.parquet', 'ReadSize', 'file');
% Step 3: Apply the Transformation (ds itself is kept for reading the raw data later)
statsDs = transform(ds, @calculateStats);
% Step 4: Read and Aggregate the Statistics
globalMin = inf; % Initialize for min. Do similarly for max
while hasdata(statsDs)
    statsChunk = read(statsDs); % one row of statistics per file
    isMin = endsWith(statsChunk.Properties.VariableNames, '_min');
    chunkMin = table2array(statsChunk(:, isMin)); % 1-by-nChannels row vector
    globalMin = min(globalMin, chunkMin); % per-channel minimum over all files read so far
    % Update global max similarly. Mean and std cannot be combined with a plain
    % min/max-style update; they also need the row count of each file.
end
% At this point, globalMin (and other statistics) can be used for filtering and selecting relevant data segments

% Step 2: Define Your Custom Function
% In R2023b, local functions in a script must come at the end of the file
% (alternatively, save this function in its own file, calculateStats.m).
function statsTable = calculateStats(tbl)
    % varfun already prefixes each name with the function name (e.g. min_ch1);
    % the suffix added here gives names such as min_ch1_min.
    statsTable = varfun(@min, tbl, 'OutputFormat', 'table');
    statsTable.Properties.VariableNames = strcat(statsTable.Properties.VariableNames, '_min');
    maxTable = varfun(@max, tbl, 'OutputFormat', 'table');
    maxTable.Properties.VariableNames = strcat(maxTable.Properties.VariableNames, '_max');
    statsTable = [statsTable, maxTable];
    meanTable = varfun(@mean, tbl, 'OutputFormat', 'table');
    meanTable.Properties.VariableNames = strcat(meanTable.Properties.VariableNames, '_mean');
    statsTable = [statsTable, meanTable];
    stdTable = varfun(@std, tbl, 'OutputFormat', 'table');
    stdTable.Properties.VariableNames = strcat(stdTable.Properties.VariableNames, '_std');
    statsTable = [statsTable, stdTable];
end
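One caveat on aggregating: min and max can be combined across files by taking the min or max of the per-file values, but mean and std cannot. They have to be pooled using the row count of each file. The following is a minimal sketch for a single channel, using two small in-memory vectors as hypothetical stand-ins for two files; it assumes the per-file statistics also record the number of rows (for example via height(tbl)), which the function above does not do as written.

```matlab
% Hypothetical stand-ins for one channel in two files
chunks = {(1:4)', (5:10)'};

% Per-file summaries: row count, mean, and sample standard deviation
n  = cellfun(@numel, chunks);
mu = cellfun(@mean,  chunks);
sd = cellfun(@std,   chunks);

% Pooled mean: weight each per-file mean by its row count
N = sum(n);
globalMean = sum(n .* mu) / N;

% Pooled sample variance: within-file plus between-file sums of squares
ssWithin   = sum((n - 1) .* sd.^2);
ssBetween  = sum(n .* (mu - globalMean).^2);
globalStd  = sqrt((ssWithin + ssBetween) / (N - 1));
```

For this toy data the pooled values agree with mean and std computed on the concatenated vector. With many channels, the same expressions should work with n as a column vector and mu and sd as files-by-channels matrices, provided the sums are taken along the first dimension.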
At this point, you have the aggregated statistics (e.g., globalMin) which you can use to filter and select relevant segments of your data for further analysis.
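As one possible way to carry out the filtering step, the sketch below keeps a table of per-file statistics and uses it to build a second datastore containing only the files of interest. The demo files, the channel names ch1 and ch2, and the threshold of 7 are all made up for illustration; the sketch also assumes that 'ReadSize', 'file' is accepted by parquetDatastore in your release, so that each read corresponds to exactly one file.

```matlab
% Hypothetical demo data: three small Parquet files in a temporary folder
folder = tempname;
mkdir(folder);
for k = 1:3
    T = table((1:5)' * k, (1:5)' - k, 'VariableNames', {'ch1', 'ch2'});
    parquetwrite(fullfile(folder, sprintf('file%d.parquet', k)), T);
end

% One read per file, so each transformed read yields one row of statistics
ds = parquetDatastore(fullfile(folder, '*.parquet'), 'ReadSize', 'file');
statsDs = transform(ds, @(tbl) varfun(@max, tbl));

% The statistics table is tiny compared with the data, so readall is safe here.
% Row i describes ds.Files{i}; varfun names the columns max_ch1 and max_ch2.
perFileMax = readall(statsDs);

% Keep only the files whose ch1 maximum exceeds an illustrative threshold
keep = perFileMax.max_ch1 > 7;
selectedDs = parquetDatastore(ds.Files(keep), 'ReadSize', 'file');
```

Because the selection is expressed as a logical index over ds.Files, the same pattern applies to any per-file statistic (min, mean, std) and to combinations of conditions.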
You may refer to the following documentation links for a better understanding of working with datastores and transform in MATLAB:
  1. parquetDatastore: https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.parquetdatastore.html?s_tid=doc_ta
  2. transform: https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.transform.html?s_tid=doc_ta
  1 Comment
Omar Kamel on 28 Mar 2024
Hi Abhas, Thanks a lot for the elaborate answer. This is exactly what I was looking for.

Version

R2023b
