Read and Analyze Hadoop Sequence File

This example shows how to create a datastore for a Sequence file containing key-value data, and then read and process the data one block at a time. Sequence files are outputs of mapreduce operations that use Hadoop®.

Set the MATLAB_HADOOP_INSTALL environment variable to the folder where Hadoop is installed.

setenv('MATLAB_HADOOP_INSTALL','/mypath/hadoop-folder')

hadoop-folder is the folder where Hadoop is installed and mypath is the path to that folder.
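Before creating the datastore, it can help to confirm that the variable is set and points at an existing folder. The following is a minimal sketch, not part of the original example: the variable name hadoopDir and the warning messages are illustrative, and isfolder is assumed to be available (it was introduced in R2017b).

```matlab
% Read back the environment variable; getenv returns '' if it is unset.
hadoopDir = getenv('MATLAB_HADOOP_INSTALL');

if isempty(hadoopDir)
    % Illustrative message, not produced by MATLAB itself
    warning('MATLAB_HADOOP_INSTALL is not set.');
elseif ~isfolder(hadoopDir)
    % The variable is set but does not name an existing folder
    warning('MATLAB_HADOOP_INSTALL points to a missing folder: %s', hadoopDir);
end
```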

Create a datastore from the sample file, mapredout.seq, using the datastore function. The sample file contains unique keys representing airline carrier codes and corresponding values that represent the number of flights operated by that carrier.

ds = datastore('mapredout.seq')
ds = 
  KeyValueDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.seq'
              }
    ReadSize: 1 key-value pairs
    FileType: 'seq'

datastore returns a KeyValueDatastore. The datastore function automatically determines the appropriate type of datastore to create.
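To look at the layout of the data before reading it, one option is the preview function, which returns a small subset of the data without advancing the read position of the datastore. This is a sketch that assumes ds was created as above; the variable name p is illustrative.

```matlab
% Peek at the first key-value pairs without consuming them.
p = preview(ds);

% The result is a table; list its variable names (expected: Key and Value).
p.Properties.VariableNames
```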

Set the ReadSize property to six so that each call to read reads at most six key-value pairs.

ds.ReadSize = 6;

Read subsets of the data from ds using the read function in a while loop. For each subset of data, compute the sum of the values. Store the sum for each subset in an array named sums. The while loop executes until hasdata(ds) returns false.

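sums = [];                            % one partial sum per block read
while hasdata(ds)
    T = read(ds);                     % table with at most ReadSize rows
    T.Value = cell2mat(T.Value);      % convert the cell array of values to numeric
    sums(end+1) = sum(T.Value);       % append the sum for this block
end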

View the last subset of key-value pairs read.

T
T = 

      Key       Value
    ________    _____

    'WN'        15931
    'XE'         2357
    'YV'          849
    'ML (1)'       69
    'PA (1)'      318

Compute the total number of flights operated by all carriers.

numflights = sum(sums)
numflights =

      123523
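As a cross-check, the same total can be computed in a single step when the whole file fits in memory. This is a sketch under that assumption, not part of the original example; allPairs and total are illustrative names, and the values are assumed to be stored in a cell array, as in the loop above.

```matlab
% Rewind the datastore, since the loop above read it to the end.
reset(ds);

% Read every key-value pair at once into a table.
allPairs = readall(ds);

% Sum the values and compare against the blockwise total.
total = sum(cell2mat(allPairs.Value));
isequal(total, numflights)
```

Reading one block at a time, as in the loop, remains the better fit when the file is too large to hold in memory.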
