matlab.compiler.mlspark.RDD Class

Namespace: matlab.compiler.mlspark
Superclasses:

Interface class to represent a Spark Resilient Distributed Dataset (RDD)

Description

A Resilient Distributed Dataset or RDD is a programming abstraction in Spark™. It represents a collection of elements distributed across many nodes that can be operated on in parallel. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. You can create RDDs in two ways:

  • By loading an external dataset

  • By parallelizing a collection of objects in the driver program

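Once an RDD is created, you can perform two types of operations on it: transformations, which define a new RDD from an existing one, and actions, which compute a result and return it to the driver program or write it to storage.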

Construction

An RDD object can only be created using the methods of the SparkContext class. A collection of SparkContext methods used to create RDDs is listed below for convenience. See the documentation of the SparkContext class for more information.

SparkContext Method Name   Purpose
parallelize                Create an RDD from local MATLAB® values
datastoreToRDD             Convert MATLAB datastore to a Spark RDD
textFile                   Create an RDD from a text file

Once an RDD has been created using a method from the SparkContext class, you can use any of the methods in the RDD class to manipulate your RDD.
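
As a minimal sketch of this workflow, the following code creates an RDD with parallelize and then applies one transformation and one action to it. It assumes MATLAB Compiler and a local Spark installation; the application name, the 'local[1]' master setting, and the Spark property shown are illustrative choices rather than requirements.

```matlab
% Configure a connection to Spark (illustrative settings for a local run).
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'rddExample', ...   % assumed name, chosen for illustration
    'Master', 'local[1]', ...      % assumed: run Spark locally on one core
    'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

% Create an RDD from local MATLAB values.
inputRDD = sc.parallelize({1, 2, 3, 4, 5});

% Transformation: defines a new RDD from the existing one.
squaredRDD = inputRDD.map(@(x) x^2);

% Action: computes the result and returns it as a MATLAB cell array.
result = squaredRDD.collect();

% Shut down the connection to Spark when finished.
delete(sc);
```

On a cluster, the Master value and Spark properties would differ; see the SparkConf and SparkContext documentation for the supported settings.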

Properties

The properties of this class are hidden.

Methods

Transformations

aggregateByKey   Aggregate the values of each key, using given combine functions and a neutral “zero value”
cartesian        Create an RDD that is the Cartesian product of two RDDs
coalesce         Reduce the number of partitions in an RDD
cogroup          Group data from RDDs sharing the same key
combineByKey     Combine the elements for each key using a custom set of aggregation functions
distinct         Return a new RDD containing the distinct elements of an existing RDD
filter           Return a new RDD containing only the elements that satisfy a predicate function
flatMap          Return a new RDD by first applying a function to all elements of an existing RDD, and then flattening the results
flatMapValues    Pass each value in the key-value pair RDD through a flatMap method without changing the keys
foldByKey        Merge the values for each key using an associative function and a neutral “zero value”
fullOuterJoin    Perform a full outer join between two key-value pair RDDs
glom             Coalesce all elements within each partition of an RDD
groupBy          Return an RDD of grouped items
groupByKey       Group the values for each key in the RDD into a single sequence
intersection     Return the set intersection of one RDD with another
join             Return an RDD containing all pairs of elements with matching keys
keyBy            Create tuples of the elements in an RDD by applying a function
keys             Return an RDD with the keys of each tuple
leftOuterJoin    Perform a left outer join
map              Return a new RDD by applying a function to each element of an input RDD
mapValues        Pass each value in a key-value pair RDD through a map function without modifying the keys
reduceByKey      Merge the values for each key using an associative reduce function
repartition      Return a new RDD that has exactly numPartitions partitions
rightOuterJoin   Perform a right outer join
sortBy           Sort an RDD by a given function
sortByKey        Sort RDD consisting of key-value pairs by key
subtract         Return the values resulting from the set difference between two RDDs
subtractByKey    Return key-value pairs resulting from the set difference of keys between two RDDs
union            Return the set union of one RDD with another
values           Return an RDD with the values of each tuple
zip              Zip one RDD with another
zipWithIndex     Zip an RDD with its element indices
zipWithUniqueId  Zip an RDD with generated unique Long IDs
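
Transformations can be chained, since each one returns a new RDD. The sketch below counts occurrences of each word by pairing every element with a count of 1 and merging the counts per key. It assumes a local Spark installation and that key-value pairs are represented as two-element cell arrays of the form {key, value}; the sample words and connection settings are illustrative.

```matlab
% Connect to Spark (illustrative local settings, not requirements).
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf('AppName', 'transformExample', ...
    'Master', 'local[1]', 'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

% Sample data, chosen only for illustration.
words = sc.parallelize({'spark', 'matlab', 'spark', 'rdd', 'spark'});

% map: turn each word into a {key, value} pair with an initial count of 1.
pairs = words.map(@(w) {w, 1});

% reduceByKey: merge the counts of identical keys with an associative function.
counts = pairs.reduceByKey(@(a, b) a + b);

% collect is an action; it returns the key-value pairs to MATLAB.
result = counts.collect();

delete(sc);
```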

Actions

aggregate                Aggregate the elements of each partition and subsequently the results for all partitions into a single value
collect                  Return a MATLAB cell array that contains all of the elements in an RDD
collectAsMap             Return the key-value pairs in an RDD as a MATLAB containers.Map object
count                    Count number of elements in an RDD
fold                     Aggregate elements of each partition and the subsequent results for all partitions
reduce                   Reduce elements of an RDD using the specified commutative and associative function
reduceByKeyLocally       Merge the values for each key using an associative reduce function, but return the results immediately to the driver
saveAsKeyValueDatastore  Save key-value RDD as a binary file that can be read back using the datastore function
saveAsTallDatastore      Save RDD as a MATLAB tall array to a binary file that can be read back using the datastore function
saveAsTextFile           Save RDD as a text file
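
Actions return a value to the driver program instead of another RDD. A short sketch, assuming a local Spark installation and illustrative connection settings and data, that counts the elements of an RDD and sums them with reduce:

```matlab
% Connect to Spark (illustrative local settings, not requirements).
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf('AppName', 'actionExample', ...
    'Master', 'local[1]', 'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

numbers = sc.parallelize({1, 2, 3, 4});

% count: number of elements in the RDD.
n = numbers.count();

% reduce: combine all elements with a commutative and associative function.
total = numbers.reduce(@(a, b) a + b);

delete(sc);
```

Addition is used here because reduce requires a function that is both commutative and associative, so the result does not depend on how elements are partitioned.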

Operations

cache                       Store an RDD in memory
checkpoint                  Mark an RDD for checkpointing
getCheckpointFile           Get the name of the file to which an RDD is checkpointed
getDefaultReducePartitions  Get the number of default reduce partitions in an RDD
getNumPartitions            Return the number of partitions in an RDD
isEmpty                     Determine if an RDD contains any elements
keyLimit                    Return threshold of unique keys that can be stored before spilling to disk
persist                     Set the value of an RDD’s storage level to persist across operations after it is computed
toDebugString               Obtain a description of an RDD and its recursive dependencies for debugging
unpersist                   Mark an RDD as nonpersistent, remove all blocks for it from memory and disk
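
These methods manage or inspect an RDD rather than transform its data. The following sketch, which assumes a local Spark installation and illustrative connection settings, caches an RDD that is reused by two actions, inspects it, and then releases it:

```matlab
% Connect to Spark (illustrative local settings, not requirements).
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf('AppName', 'operationExample', ...
    'Master', 'local[1]', 'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

doubled = sc.parallelize({1, 2, 3, 4}).map(@(x) 2*x);

% Keep the RDD in memory because more than one action uses it.
doubled.cache();

% Inspect the RDD.
numParts = doubled.getNumPartitions();
tf = doubled.isEmpty();

% Two actions that reuse the cached RDD.
n = doubled.count();
values = doubled.collect();

% Release the cached blocks once the RDD is no longer needed.
doubled.unpersist();

delete(sc);
```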

More About

References

See the latest Spark documentation for more information.

Version History

Introduced in R2016b