Grouped Statistics Calculations with Tall Arrays

This example shows how to use the findgroups and splitapply functions to calculate grouped statistics of a tall timetable containing power outage data. findgroups and splitapply enable you to break up tall variables into groups, use those groups to separate data, and then apply a function to each group of data. Alternatively, if you have Statistics and Machine Learning Toolbox™, then you also can use the grpstats function to calculate grouped statistics.

This example creates a tall timetable for the power outage data, even though the raw data only has about 1500 rows. However, you can use the techniques presented here on much larger data sets because no assumptions are made about the size of the data.

Create Datastore and Tall Timetable

The sample file, outages.csv, contains data representing electric utility outages in the United States. The file contains six columns: Region, OutageTime, Loss, Customers, RestorationTime, and Cause.

Create a datastore for the outages.csv file. Use the 'TextScanFormats' option to specify the kind of data each column contains: categorical ('%C'), floating-point numeric ('%f'), or datetime ('%D').

data_formats = {'%C','%D','%f','%f','%D','%C'};
ds = datastore('outages.csv','TextscanFormats',data_formats);

Create a tall table on top of the datastore, and convert the tall table into a tall timetable. The OutageTime variable is used for the row times since it is the first datetime or duration variable in the table.

T = tall(ds);
T = table2timetable(T)
T =

  Mx5 tall timetable

    OutageTime    Region    Loss    Customers    RestorationTime    Cause
    __________    ______    ____    _________    _______________    _____

        ?           ?        ?          ?               ?             ?  
        ?           ?        ?          ?               ?             ?  
        ?           ?        ?          ?               ?             ?  
        :           :        :          :               :             :
        :           :        :          :               :             :

Clean Missing Data

Some of the rows in the tall table have missing data represented by NaN and NaT values. Remove all of the rows that are missing at least one piece of data.

idx = ~any(ismissing(T),2);
T = T(idx,:)
T =

  Mx5 tall timetable

    OutageTime    Region    Loss    Customers    RestorationTime    Cause
    __________    ______    ____    _________    _______________    _____

        ?           ?        ?          ?               ?             ?  
        ?           ?        ?          ?               ?             ?  
        ?           ?        ?          ?               ?             ?  
        :           :        :          :               :             :
        :           :        :          :               :             :

Mean Power Outage Duration by Region

Determine the mean power outage duration in each region. The findgroups function groups the data by the categorical values in Region. The splitapply function applies the specified function to each group of data and concatenates the results together.

[G,regions] = findgroups(T.Region);
times = splitapply(@mean,T.RestorationTime-T.OutageTime,G)
times =

  Mx1 tall duration array

    ?
    ?
    ?
    :
    :

Change the display format of the duration results to be in days, and put the results in an in-memory table with the associated regions.

times.Format = 'd';
varnames = {'Regions','MeanOutageDuration'};
maxOutageDurations = gather(table(regions,times,'VariableNames',varnames))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 3: Completed in 0.34 sec
- Pass 2 of 3: Completed in 0.7 sec
- Pass 3 of 3: Completed in 0.84 sec
Evaluation completed in 3.1 sec
maxOutageDurations=5×2 table
     Regions     MeanOutageDuration
    _________    __________________

    MidWest          52.258 days   
    NorthEast        1.8538 days   
    SouthEast         1.835 days   
    SouthWest        2.0212 days   
    West            0.99895 days   

Most Common Power Outage Causes by Region

Determine how often each power outage cause occurs in each region. First, group the data by both cause and region. Specify three outputs to findgroups to return information on the cause and region for each piece of data. Then use splitapply to count the number of occurrences of each cause in each region.

[G2,causes,regions] = findgroups(T.Cause,T.Region);
C = splitapply(@(x) numel(x),T.Cause,G2)
C =

  Mx1 tall double column vector

    ?
    ?
    ?
    :
    :

Convert the results into an in-memory table and unstack the 'Count' and 'Region' variables. Use fillmissing on the in-memory table to replace NaN values with zeros.

RegionCauses = gather(table(causes,regions,C,'VariableNames',{'Cause','Region','Count'}))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.62 sec
- Pass 2 of 2: Completed in 0.45 sec
Evaluation completed in 1.6 sec
RegionCauses=43×3 table
         Cause           Region      Count
    ________________    _________    _____

    attack              MidWest        4  
    attack              NorthEast     75  
    attack              SouthEast      6  
    attack              West          44  
    earthquake          NorthEast      1  
    earthquake          West           1  
    energy emergency    MidWest       11  
    energy emergency    NorthEast     11  
    energy emergency    SouthEast     39  
    energy emergency    SouthWest      5  
    energy emergency    West          19  
    equipment fault     MidWest        6  
    equipment fault     NorthEast     13  
    equipment fault     SouthEast     28  
    equipment fault     SouthWest      1  
    equipment fault     West          50  
      ⋮

RegionCauses = unstack(RegionCauses,'Count','Region');
RegionCauses = fillmissing(RegionCauses,'constant',{'',0,0,0,0,0})
RegionCauses=10×6 table
         Cause          MidWest    NorthEast    SouthEast    SouthWest    West
    ________________    _______    _________    _________    _________    ____

    attack                 4          75            6            0         44 
    earthquake             0           1            0            0          1 
    energy emergency      11          11           39            5         19 
    equipment fault        6          13           28            1         50 
    fire                   0           4            2            0         10 
    severe storm          17          54           86            4         13 
    thunder storm         22          37           39            6          4 
    unknown                4           4            2            0          1 
    wind                  12          19           11            3         15 
    winter storm           9          30           23            1         17 

See Also

| |

Related Topics