Introduction to Apache Spark Paired RDD

Spark provides special operations on RDDs that contain key/value pairs; such RDDs are called paired RDDs. Paired RDDs are a useful building block in many programs, because they expose operations that let us act on each key in parallel or regroup data across the network. A Spark paired RDD is simply an RDD whose elements are key-value pairs. A key-value pair (KVP) consists of two linked data items: the key is an identifier, and the value is the data associated with that key.

Most Spark operations work on RDDs containing objects of any type, but a few special operations are available only on RDDs of key-value pairs: for example, distributed "shuffle" operations such as grouping or aggregating the elements by key. In Scala, these operations are automatically available on RDDs containing Tuple2 objects; the key-value pair operations live in the PairRDDFunctions class, which wraps around an RDD of tuples. Pair RDDs are a useful building block in Apache Spark because they expose operations that let us act on each key in parallel and regroup data across the network.

Paired RDDs can be created by running a map() function that returns key/value pairs. The procedure for building key/value RDDs differs by language. In Python, for the functions on keyed data to work, we need to return an RDD composed of tuples.

For instance, the paired RDD reduceByKey() method aggregates data separately for each key, and the join() method merges two RDDs by grouping elements with the same key. It is also common to extract a field from an RDD (for example, an event time, a customer ID, or another identifier) and use that field as the key in pair RDD operations.

Creating a paired RDD using the first word as the key in Python:

pairs = lines.map(lambda x: (x.split(" ")[0], x))

In Scala, too, for the functions on keyed data to be available, we need to return tuples, as shown in the previous example. An implicit conversion on RDDs of tuples provides the additional key/value functions as required.

Creating a paired RDD using the first word as the key in Scala:

val pairs = lines.map(x => (x.split(" ")(0), x))

Java doesn’t have a built-in tuple type, so Spark’s Java API has users create tuples using the scala.Tuple2 class. Java users can construct a new tuple by writing new Tuple2(elem1, elem2) and can then access its elements with the _1() and _2() methods.

Java users also need to call special versions of Spark’s functions when creating paired RDDs. For instance, the mapToPair() function should be used in place of the basic map() function.

Creating a paired RDD using the first word as the key in Java:

PairFunction<String, String, String> keyData =
    new PairFunction<String, String, String>() {
        public Tuple2<String, String> call(String x) {
            return new Tuple2<String, String>(x.split(" ")[0], x);
        }
    };
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);

Aggregations

When datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key. Spark has a set of operations that combine values that have the same key. These operations return RDDs and thus are transformations rather than actions. The main transformations are:

  • reduceByKey()
  • foldByKey()
  • combineByKey()
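To make the semantics concrete without needing a Spark cluster, here is a minimal pure-Python sketch of what reduceByKey() computes on a paired RDD; the helper name `reduce_by_key` and the sample pairs are hypothetical, chosen only for illustration:

```python
def reduce_by_key(pairs, func):
    """Merge the values for each key using an associative function,
    the way Spark's reduceByKey() does. foldByKey() is similar but
    also takes a zero value to start each per-key accumulation."""
    acc = {}
    for key, value in pairs:
        # Combine the incoming value with the running result for this key.
        acc[key] = func(acc[key], value) if key in acc else value
    return sorted(acc.items())

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
print(reduce_by_key(pairs, lambda x, y: x + y))  # [('a', 9), ('b', 6)]
```

In real Spark, the combining happens per-partition first and the partial results are then merged across the network, which is why the function must be associative.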

Grouping of Data

Grouping a dataset with respect to a predefined key is a common use case with keyed data: for example, viewing all of a customer’s orders together.

If our data is already keyed in the way we want, groupByKey() will group our data using the key in our RDD. On an RDD consisting of keys of type K and values of type V, we get back an RDD of type [K, Iterable[V]].

The groupBy() transformation works on unpaired data, or on data where we want to use a different condition besides equality on the current key. It takes a function that it applies to every element in the source RDD and uses the result to determine the key.
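The difference between the two can be sketched in pure Python; the helper names and sample data below are hypothetical stand-ins for the Spark operations:

```python
from collections import defaultdict

def group_by_key(pairs):
    """Like Spark's groupByKey(): (K, V) pairs -> (K, list of V)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def group_by(elements, key_func):
    """Like Spark's groupBy(): derive the key by applying key_func
    to every element, then group the elements themselves."""
    return group_by_key((key_func(x), x) for x in elements)

orders = [("alice", "book"), ("bob", "pen"), ("alice", "lamp")]
print(group_by_key(orders))   # [('alice', ['book', 'lamp']), ('bob', ['pen'])]
print(group_by([1, 2, 3, 4], lambda n: n % 2))  # [(0, [2, 4]), (1, [1, 3])]
```

Note that when a grouping is followed immediately by a per-key reduction, reduceByKey() is usually preferred in Spark, since it avoids materialising the full list of values for each key.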

Joins

The most useful operations we get with keyed data come from using it together with other keyed data. Joining datasets together is probably one of the most common operations we can perform on a paired RDD.

  • join(): performs an inner join; only keys that are present in both paired RDDs are returned in the output.
  • leftOuterJoin(): the resulting paired RDD has entries for each key in the source RDD. The value associated with each key in the result is a tuple of the value from the source RDD and an Option (Optional in Java) for the value from the other paired RDD.
  • rightOuterJoin(): almost identical to leftOuterJoin(), except the key must be present in the other RDD, and the tuple holds an option for the value from the source RDD rather than the other one.
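A pure-Python sketch of the inner and left outer join semantics, assuming hypothetical sample data; Python's None stands in here for Scala's None (or Java's empty Optional), and rightOuterJoin() is the mirror image of the left outer case:

```python
from collections import defaultdict

def _index(pairs):
    """Group a list of (key, value) pairs into key -> list of values."""
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index

def inner_join(left, right):
    """Like Spark's join(): keep only keys present on both sides,
    pairing every left value with every right value for that key."""
    li, ri = _index(left), _index(right)
    return sorted((k, (lv, rv))
                  for k in li if k in ri
                  for lv in li[k] for rv in ri[k])

def left_outer_join(left, right):
    """Like Spark's leftOuterJoin(): every left-side key appears;
    missing right-side values become None."""
    li, ri = _index(left), _index(right)
    out = []
    for k, left_values in li.items():
        for lv in left_values:
            if k in ri:
                out.extend((k, (lv, rv)) for rv in ri[k])
            else:
                out.append((k, (lv, None)))
    return sorted(out)

storefronts = [("store1", "NYC"), ("store2", "SF")]
orders = [("store1", "book"), ("store1", "pen"), ("store3", "lamp")]
print(inner_join(storefronts, orders))
# [('store1', ('NYC', 'book')), ('store1', ('NYC', 'pen'))]
print(left_outer_join(storefronts, orders))
# [('store1', ('NYC', 'book')), ('store1', ('NYC', 'pen')), ('store2', ('SF', None))]
```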

Sorting Data

We can sort an RDD of key/value pairs using sortByKey(), provided that there is an ordering defined on the keys. Once we have sorted our data, any subsequent call on the sorted data to collect() or save() will return an ordered dataset.
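The behaviour can be sketched in plain Python; the helper name `sort_by_key` is hypothetical, but like Spark's sortByKey() it accepts an ascending flag:

```python
def sort_by_key(pairs, ascending=True):
    """Like Spark's sortByKey(): order the pairs by key, using the
    natural ordering of the keys."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)

pairs = [(3, "c"), (1, "a"), (2, "b")]
print(sort_by_key(pairs))                   # [(1, 'a'), (2, 'b'), (3, 'c')]
print(sort_by_key(pairs, ascending=False))  # [(3, 'c'), (2, 'b'), (1, 'a')]
```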

Actions Available on Pair RDDs

  • countByKey(): counts the number of elements for each key
  • collectAsMap(): collects the result as a map to provide easy lookup
  • lookup(key): returns all values associated with the provided key
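What these three actions return can be sketched in pure Python on a small hypothetical dataset; note that collectAsMap(), like a plain dict, keeps only one value per key when keys repeat:

```python
from collections import Counter

pairs = [("a", 1), ("b", 2), ("a", 3)]

# countByKey(): number of elements per key.
count_by_key = Counter(k for k, _ in pairs)

# collectAsMap(): a map view of the pairs; for a duplicated key,
# a later value overwrites an earlier one.
collect_as_map = dict(pairs)

# lookup(key): all values associated with one key.
lookup_a = [v for k, v in pairs if k == "a"]

print(count_by_key)    # Counter({'a': 2, 'b': 1})
print(collect_as_map)  # {'a': 3, 'b': 2}
print(lookup_a)        # [1, 3]
```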