Lazy Evaluation of Tall Arrays
One of the differences between tall arrays and in-memory MATLAB® arrays is that tall arrays typically remain unevaluated until you request that calculations be performed. (The exceptions to this rule include plotting functions like plot and histogram and some statistical fitting functions like fitlm, which automatically evaluate tall array inputs.) While a tall array is in an unevaluated state, MATLAB might not know its size, its data type, or the specific values it contains. However, you can still use unevaluated arrays in your calculations as if the values were known. This allows you to work quickly with large data sets instead of waiting for each command to execute. For this reason, it is recommended that you use gather only when you require output.
MATLAB keeps track of all the operations you perform on unevaluated tall arrays as you enter them. When you eventually call gather to evaluate the queued operations, MATLAB uses the history of unevaluated commands to optimize the calculation by minimizing the number of passes through the data. Used properly, this lazy evaluation can save huge amounts of execution time by eliminating unnecessary passes through large data sets.
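As a minimal sketch of the queueing behavior described above, the following builds a small tall array from in-memory data and chains several operations before a single call to gather. The data and the variable names (A, t, centered, scaled, total) are hypothetical and chosen only to keep the sketch self-contained; real tall arrays are usually backed by a datastore.

```matlab
% Hypothetical in-memory data, used only so the sketch is self-contained.
A = (1:1000)';
t = tall(A);                      % convert the in-memory array to a tall array

% Each statement below is queued rather than executed immediately.
centered = t - mean(t);           % subtract the column mean
scaled   = centered ./ std(t);    % divide by the standard deviation
total    = sum(scaled .^ 2);      % reduce to a single (still unevaluated) value

% One gather call triggers evaluation of all the queued operations.
result = gather(total);
```

Until gather runs, total displays as an unevaluated tall array; afterward, result is an ordinary in-memory double.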
Display of Unevaluated Tall Arrays
The display of unevaluated tall arrays varies depending on how much MATLAB knows about the array and its values. There are three pieces of information reflected in the display:
- Array size: Unknown dimension sizes are represented by the variables M or N in the display. If no dimension sizes are known, then the size appears as M×N×....
- Array data type: If the array has an unknown underlying data type, then its type appears as tall array. If the type is known, it is listed as, for example, tall double array.
- Array values: If the array values are unknown, then they appear as ?. Known values are displayed.
MATLAB might know all, some, or none of these pieces of information about a given tall array, depending on the nature of the calculation.
For example, if the array has a known data type but unknown size and values, then the unevaluated tall array might look like this:
    M×N×... tall double array

        ?    ?    ?    ...
        ?    ?    ?    ...
        ?    ?    ?    ...
        :    :    :
        :    :    :
If the type and relative size are known, then the display could be:
    1×N tall char array

        ?    ?    ?    ...
If some of the data is known, then MATLAB displays the known values:
    100×3 tall double matrix

        0.8147    0.1622    0.6443
        0.9058    0.7943    0.3786
        0.1270    0.3112    0.8116
        0.9134    0.5285    0.5328
        0.6324    0.1656    0.3507
        0.0975    0.6020    0.9390
        0.2785    0.2630    0.8759
        0.5469    0.6541    0.5502
          :         :         :
          :         :         :
Evaluation with gather
The gather function is used to evaluate tall arrays. gather accepts tall arrays as inputs and returns in-memory arrays as outputs. For this reason, you can think of this function as a bridge between tall arrays and in-memory arrays. For example, you cannot control if or while loop statements using a tall logical array, but once the array is evaluated with gather it becomes an in-memory logical value that you can use in these contexts.
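The bridging role described above can be sketched as follows. The data and the names t and anyLarge are hypothetical; the point is only that the condition passed to if is the gathered, in-memory logical rather than the tall one.

```matlab
% Hypothetical data for illustration.
t = tall([3; 7; 1; 9]);

% any returns an unevaluated tall logical scalar.
anyLarge = any(t > 8);

% Gather the tall logical first, then use the in-memory value as the condition.
if gather(anyLarge)
    disp('At least one value exceeds 8.')
end
```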
gather performs all queued operations on a tall array and returns the entire result in memory. Since gather returns results as in-memory MATLAB arrays, standard memory considerations apply. MATLAB might run out of memory if the result returned by gather is too large.
Most of the time you can use gather to see the entire result of a calculation, particularly if the calculation includes a reduction operation such as sum or mean. However, if the result is too large to fit in memory, then you can use gather(head(X)) or gather(tail(X)) to perform the calculation and look at only the first or last few rows of the result.
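A short sketch of inspecting only the ends of a result follows. The data and the names X, Y, firstRows, and lastRows are hypothetical. The sketch relies on head returning the first eight rows by default and on tail accepting an optional row count.

```matlab
% Hypothetical data: a tall column vector derived from an in-memory array.
X = tall((1:100000)');
Y = X .^ 2 + 1;                   % queued elementwise calculation

% Bring only a few rows into memory instead of the whole result.
firstRows = gather(head(Y));      % first rows of Y (eight by default)
lastRows  = gather(tail(Y, 3));   % last three rows of Y
```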
Resolve Errors with gather
If you enter an erroneous command and gather fails to evaluate a tall array variable, then you must delete the variable from your workspace and recreate the tall array using only valid commands. This is because MATLAB keeps track of all the operations you perform on unevaluated tall arrays as you enter them. The only way to make MATLAB “forget” about an erroneous statement is to reconstruct the tall array from scratch.
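The recovery pattern might look like the sketch below. It does not reproduce a real failure; it assumes that some earlier statement applied to t queued an invalid operation, and it shows only the delete-and-recreate steps. The names A, t, and result are hypothetical.

```matlab
% Hypothetical source data; in practice this is often a datastore.
A = (1:10)';
t = tall(A);

% Assume an erroneous statement was applied to t here and gather(t) failed.

clear t                 % delete the affected variable from the workspace
t = tall(A);            % recreate the tall array from its source
t = t .* 2;             % reapply only the valid operations
result = gather(t);     % evaluation now uses the clean operation history
```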
Example: Calculate Size of Tall Array
This example shows what an unevaluated tall array looks like, and how to evaluate the array.
Create a datastore for the data set airlinesmall.csv. Convert the datastore into a tall table and then calculate the size.
    varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
    ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
        'SelectedVariableNames', varnames);
    tt = tall(ds)

    tt =

      M×4 tall table

        ArrDelay    DepDelay    Origin    Dest
        ________    ________    ______    _____

            8          12       'LAX'     'SJC'
            8           1       'SJC'     'BUR'
           21          20       'SAN'     'SMF'
           13          12       'BUR'     'SJC'
            4          -1       'SMF'     'LAX'
           59          63       'LAX'     'SJC'
            3          -2       'SAN'     'SFO'
           11          -1       'SEA'     'LAX'
            :          :          :         :
            :          :          :         :

    s = size(tt)

    s =

      1×2 tall double row vector

        ?    ?

    Preview deferred. Learn more.
Calculating the size of a tall array returns a small answer (a 1-by-2 vector), but the display indicates that an entire pass through the data is still required to calculate the size of tt.
Use the gather function to fully evaluate the tall array and bring the results into memory. As the command executes, there is a dynamic progress display in the command window that is particularly helpful with long calculations.
Note
Always ensure that the result returned by gather will be able to fit in memory. If you use gather directly on a tall array without reducing its size using a function such as mean, then MATLAB might run out of memory.
    tableSize = gather(s)

    Evaluating tall expression using the Local MATLAB Session:
    - Pass 1 of 1: Completed in 0.42 sec
    Evaluation completed in 0.48 sec

    tableSize =

          123523           4
Example: Multi-pass Calculations with Tall Arrays
This example shows how several calculations can be combined to minimize the total number of passes through the data.
Create a datastore for the data set airlinesmall.csv. Convert the datastore into a tall table.
    varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
    ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
        'SelectedVariableNames', varnames);
    tt = tall(ds)

    tt =

      M×4 tall table

        ArrDelay    DepDelay    Origin    Dest
        ________    ________    ______    _____

            8          12       'LAX'     'SJC'
            8           1       'SJC'     'BUR'
           21          20       'SAN'     'SMF'
           13          12       'BUR'     'SJC'
            4          -1       'SMF'     'LAX'
           59          63       'LAX'     'SJC'
            3          -2       'SAN'     'SFO'
           11          -1       'SEA'     'LAX'
            :          :          :         :
            :          :          :         :
Subtract the mean value of DepDelay from ArrDelay to create a new variable AdjArrDelay. Then calculate the mean value of AdjArrDelay and subtract this mean value from AdjArrDelay. If these calculations were all evaluated separately, then MATLAB would require four passes through the data.
    AdjArrDelay = tt.ArrDelay - mean(tt.DepDelay,'omitnan');
    AdjArrDelay = AdjArrDelay - mean(AdjArrDelay,'omitnan')

    AdjArrDelay =

      M×1 tall double column vector

        ?
        ?
        ?
        :
        :

    Preview deferred. Learn more.
Evaluate AdjArrDelay and view the first few rows. Because some calculations can be combined, only three passes through the data are required.
    gather(head(AdjArrDelay))

    Evaluating tall expression using the Local MATLAB Session:
    - Pass 1 of 3: Completed in 0.4 sec
    - Pass 2 of 3: Completed in 0.39 sec
    - Pass 3 of 3: Completed in 0.23 sec
    Evaluation completed in 1.2 sec

    ans =

        0.8799
        0.8799
       13.8799
        5.8799
       -3.1201
       51.8799
       -4.1201
        3.8799
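A related way to let MATLAB combine work is to pass several unevaluated results to a single gather call, which accepts multiple inputs and returns one output per input. The sketch below uses a small hypothetical in-memory matrix instead of airlinesmall.csv so that it is self-contained. The column meanings and the names X, meanArr, and meanDep are assumptions for illustration.

```matlab
% Hypothetical data: column 1 stands in for ArrDelay, column 2 for DepDelay.
X = tall([8 12; 8 1; 21 20; NaN 12; 4 NaN]);

% Queue two independent reductions on the same underlying data.
meanArr = mean(X(:,1), 'omitnan');
meanDep = mean(X(:,2), 'omitnan');

% Gather both results in one call so they are evaluated together.
[meanArr, meanDep] = gather(meanArr, meanDep);
```

Gathering the two means in separate calls would give the same values, but it would evaluate each one independently.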
Summary of Behavior and Recommendations
- Tall arrays remain unevaluated until you request output using gather, an optimization called lazy evaluation.
- Use gather in most cases to evaluate tall array calculations. If you believe the result of the calculations might not fit in memory, then use gather(head(X)) or gather(tail(X)) instead.
- Work primarily with unevaluated tall arrays and request output only when necessary. The more queued calculations there are that are unevaluated, the more optimization MATLAB can do to minimize the number of passes through the data.
- If you enter an erroneous tall array command and gather fails to evaluate a tall array variable, then you must delete the variable from your workspace and recreate the tall array using only valid commands.