Problem 1983. Big data

Optimize this line of code:

 B = sum(gradient(corrcoef(A)).^2);

for a matrix A with size(A,2)>>size(A,1)

Description:

Analyses of large datasets often require some level of optimization for speed and/or memory usage. This is an example problem where both the initial data A and final measure of interest B fit perfectly well in memory, but the intermediate variables (in this case an impossibly large correlation matrix) required to compute the final measure of interest do not.

We have a large 2-dimensional matrix A (with dimensions 100 x 100,000).

We need to compute the following row vector B (with dimensions 1 x 100,000):

 B = sum(gradient(corrcoef(A)).^2);

This computes first the matrix of correlation coefficients for each pair of columns in A:

 a = corrcoef(A)

(a 100,000 by 100,000 matrix), then computes the spatial derivative of the resulting correlation matrix along the second dimension:

 b = gradient(a)

(another 100,000 by 100,000 matrix), and finally computes the squared norm of each column in the resulting matrix:

 B = sum(b.^2,1) 

(a 100,000 element vector)

This straight-forward "vectorized" approach, nevertheless, fails because it requires too much memory (enough to store a 100,000 x 100,000 correlation matrix, around 80Gb).

We clearly need some form of simplification/optimization. Can you compute the measure B within the time-limit of a Cody solution? (typically about 30 seconds)

Solutions will be scored based on computation time (score equal to total time in seconds).

Context: (not relevant to solving this problem)

This problem arises in the analyses of fMRI datasets. A typical result from a fMRI scanner session is a 4-dimensional matrix A(x,y,z,t), where the first three dimensions are spatial dimensions (a scanner of the subject's head/brain) and the fourth dimension is temporal (sequential scans obtained during a typical fMRI session). Think of these as time-varying three-dimensional pictures of your brain activation. A lot of research in the past few years has focused on functional connectivity, a measure of the temporal correlation between the "activation" of any pair of brain areas. Several recent papers have investigated the possibility to obtain entirely data-driven parcellations of the brain (partitioning the brain into functionally-homogeneous areas) based on these spatial patterns of functional connectivity. The measure B above represents one of the measures that have been suggested as a way to drive these data-driven parcellations (borders between two adjacent but functionally distinct brain areas are expected to show higher spatial gradients in functional connectivity profiles). For simplicity I have collapsed the three spatial dimensions into one for this problem, but the computational complexity of the original computation is approximately the same (a typical scanner session results in something of the order of several hundred thousands "voxels" -three dimensional "pixels"- within the brain, and a few hundred time-points; this makes computing the entire "voxel-to-voxel" correlation matrix, or measures derived from it, rather challenging).

Solution Stats

12.82% Correct | 87.18% Incorrect
Last Solution submitted on Nov 29, 2020

Problem Comments

Solution Comments

Show comments

Problem Recent Solvers4

Suggested Problems

More from this Author38

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!