matlab.compiler.mlspark.RDD Class

Namespace: matlab.compiler.mlspark
Superclasses:

Interface class to represent a Spark Resilient Distributed Dataset (RDD)

Description

A Resilient Distributed Dataset or RDD is a programming abstraction in Spark™. It represents a collection of elements distributed across many nodes that can be operated on in parallel. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. You can create RDDs in two ways:

By loading an external dataset
By parallelizing a collection of objects in the driver program

After you create an RDD, you can perform two types of operations on it: transformations and actions.

Construction

An RDD object can only be created using the methods of the SparkContext class. A collection of SparkContext methods used to create RDDs is listed below for convenience. See the documentation of the SparkContext class for more information.

SparkContext Method Name	Purpose
`parallelize`	Create an RDD from local MATLAB® values
`datastoreToRDD`	Convert MATLAB `datastore` to a Spark `RDD`
`textFile`	Create an RDD from a text file

Once an RDD has been created using a method from the SparkContext class, you can use any of the methods in the RDD class to manipulate your RDD.
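The construction path described above can be sketched as follows. This is a minimal sketch, not a definitive setup: it assumes MATLAB Compiler with a local Spark installation already configured, uses `'local[1]'` as the master, and the application name `rddConstructionSketch` and the single Spark property are illustrative placeholders.

```matlab
% Assumption: Spark is installed locally and reachable from MATLAB.
% 'local[1]' runs Spark on the local machine with one worker thread.
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'rddConstructionSketch', ...   % hypothetical application name
    'Master', 'local[1]', ...
    'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

% Create an RDD from local MATLAB values held in a cell array.
rdd = sc.parallelize({1, 2, 3});

% Once the RDD exists, any RDD method applies, for example the count action.
numElements = rdd.count();

% Shut down the connection to Spark when finished.
delete(sc);
```

On a cluster, the `Master` value and the Spark properties would differ; consult the SparkConf and SparkContext documentation for the options that apply to your deployment.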

Properties

The properties of this class are hidden.

Methods

Transformations

aggregateByKey	Aggregate the values of each key, using given combine functions and a neutral “zero value”
cartesian	Create an RDD that is the Cartesian product of two RDDs
coalesce	Reduce the number of partitions in an RDD
cogroup	Group data from RDDs sharing the same key
combineByKey	Combine the elements for each key using a custom set of aggregation functions
distinct	Return a new RDD containing the distinct elements of an existing RDD
filter	Return a new RDD containing only the elements that satisfy a predicate function
flatMap	Return a new RDD by first applying a function to all elements of an existing RDD, and then flattening the results
flatMapValues	Pass each value in the key-value pair RDD through a `flatMap` method without changing the keys
foldByKey	Merge the values for each key using an associative function and a neutral “zero value”
fullOuterJoin	Perform a full outer join between two key-value pair RDDs
glom	Coalesce all elements within each partition of an RDD
groupBy	Return an RDD of grouped items
groupByKey	Group the values for each key in the RDD into a single sequence
intersection	Return the set intersection of one RDD with another
join	Return an RDD containing all pairs of elements with matching keys
keyBy	Create tuples of the elements in an RDD by applying a function
keys	Return an RDD with the keys of each tuple
leftOuterJoin	Perform a left outer join
map	Return a new RDD by applying a function to each element of an input RDD
mapValues	Pass each value in a key-value pair RDD through a map function without modifying the keys
reduceByKey	Merge the values for each key using an associative reduce function
repartition	Return a new RDD that has exactly `numPartitions` partitions
rightOuterJoin	Perform a right outer join
sortBy	Sort an RDD by a given function
sortByKey	Sort an RDD consisting of key-value pairs by key
subtract	Return the values resulting from the set difference between two RDDs
subtractByKey	Return key-value pairs resulting from the set difference of keys between two RDDs
union	Return the set union of one RDD with another
values	Return an RDD with the values of each tuple
zip	Zip one RDD with another
zipWithIndex	Zip an RDD with its element indices
zipWithUniqueId	Zip an RDD with generated unique Long IDs
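
As an illustration of how the key-value transformations in this table combine, the following sketch counts occurrences of each element with `map` and `reduceByKey`. It assumes a local Spark installation; the application name and input values are made up for the example, and the order of the collected pairs is not guaranteed.

```matlab
% Assumption: local Spark installation; names and values are illustrative.
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'keyValueSketch', ...          % hypothetical application name
    'Master', 'local[1]', ...
    'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

words = sc.parallelize({'a', 'b', 'a', 'c', 'b', 'a'});

% map: turn each element into a {key, value} pair.
pairs = words.map(@(w) {w, 1});

% reduceByKey: merge the values for each key with an associative function.
counts = pairs.reduceByKey(@(a, b) a + b);

% collect is an action; it returns a cell array of {key, value} cells.
result = counts.collect();

delete(sc);
```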

Actions

aggregate	Aggregate the elements of each partition and subsequently the results for all partitions into a single value
collect	Return a MATLAB cell array that contains all of the elements in an RDD
collectAsMap	Return the key-value pairs in an RDD as a MATLAB `containers.Map` object
count	Count the number of elements in an RDD
fold	Aggregate elements of each partition and the subsequent results for all partitions
reduce	Reduce elements of an RDD using the specified commutative and associative function
reduceByKeyLocally	Merge the values for each key using an associative reduce function, but return the results immediately to the driver
saveAsKeyValueDatastore	Save a key-value RDD as a binary file that can be read back using the `datastore` function
saveAsTallDatastore	Save an RDD as a MATLAB tall array to a binary file that can be read back using the `datastore` function
saveAsTextFile	Save an RDD as a text file

Operations

cache	Store an RDD in memory
checkpoint	Mark an RDD for checkpointing
getCheckpointFile	Get the name of the file to which an RDD is checkpointed
getDefaultReducePartitions	Get the number of default reduce partitions in an RDD
getNumPartitions	Return the number of partitions in an RDD
isEmpty	Determine if an RDD contains any elements
keyLimit	Return the threshold of unique keys that can be stored before spilling to disk
persist	Set the value of an RDD’s storage level to persist across operations after it is computed
toDebugString	Obtain a description of an RDD and its recursive dependencies for debugging
unpersist	Mark an RDD as nonpersistent and remove all blocks for it from memory and disk
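
The following sketch shows how a few of these operations might be used around a computation: caching an RDD before reusing it, inspecting it, and releasing it afterward. It assumes a local Spark installation; the application name and data are illustrative, and the partition count you see depends on your configuration.

```matlab
% Assumption: local Spark installation; names and values are illustrative.
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'rddOperationsSketch', ...     % hypothetical application name
    'Master', 'local[1]', ...
    'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

rdd = sc.parallelize({1, 2, 3, 4});

% Keep the RDD in memory so that repeated actions do not recompute it.
rdd.cache();

% Inspect the RDD without changing its contents.
numParts  = rdd.getNumPartitions();
emptyFlag = rdd.isEmpty();

% Run an action that benefits from the cached data.
total = rdd.reduce(@(a, b) a + b);

% Release the cached blocks once the RDD is no longer needed.
rdd.unpersist();

delete(sc);
```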

More About

Resilient Distributed Dataset

A Resilient Distributed Dataset or RDD is a programming abstraction in Spark. It represents a collection of elements distributed across many nodes that can be operated on in parallel. RDDs are also fault-tolerant. You can create RDDs in two ways:

By loading an external dataset.
By parallelizing a collection of objects in the driver program.

After creation, you can perform two types of operations using RDDs: transformations and actions.

Transformations

Transformations are operations on an existing RDD that return a new RDD. Many, but not all, transformations are element-wise operations.
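
A minimal sketch of chaining two element-wise transformations, `filter` and `map`, each of which returns a new RDD and leaves its input unchanged. It assumes a local Spark installation; the application name and input values are illustrative. In Spark, transformations are evaluated lazily, so the final `collect` call is what triggers the computation.

```matlab
% Assumption: local Spark installation; names and values are illustrative.
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'transformationSketch', ...    % hypothetical application name
    'Master', 'local[1]', ...
    'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

numbers = sc.parallelize({1, 2, 3, 4, 5});

% filter: keep only the elements that satisfy the predicate.
evens = numbers.filter(@(v) mod(v, 2) == 0);

% map: apply a function to every element of the filtered RDD.
doubled = evens.map(@(v) v * 2);

% Bring the transformed elements back to MATLAB as a cell array.
result = doubled.collect();

delete(sc);
```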

Actions

Actions compute a final result based on an RDD and either return that result to the driver program or save it to an external storage system such as HDFS™.
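
A minimal sketch of three actions that return results to the driver program: `count`, `reduce`, and `collect`. It assumes a local Spark installation; the application name and input values are illustrative.

```matlab
% Assumption: local Spark installation; names and values are illustrative.
sparkProp = containers.Map({'spark.executor.cores'}, {'1'});
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'actionSketch', ...            % hypothetical application name
    'Master', 'local[1]', ...
    'SparkProperties', sparkProp);
sc = matlab.compiler.mlspark.SparkContext(conf);

numbers = sc.parallelize({1, 2, 3, 4, 5});

% count: number of elements in the RDD.
n = numbers.count();

% reduce: combine all elements with a commutative and associative function.
total = numbers.reduce(@(a, b) a + b);

% collect: return all elements to MATLAB as a cell array.
elements = numbers.collect();

delete(sc);
```

Actions such as `saveAsTextFile` write to external storage instead of returning a value; they are omitted here because the output location depends on your environment.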

References

See the latest Spark documentation for more information.

Version History

Introduced in R2016b
