Skip to main content
Home
  • Tutorials
    • Quality Assurance
    • Software Development
    • Machine Learning
    • Data Science
  • About Us
  • Contact
programsbuzz facebook programsbuzz twitter programsbuzz linkedin
  • Log in

Main navigation

  • Tutorials
    • Quality Assurance
    • Software Development
    • Machine Learning
    • Data Science
  • About Us
  • Contact

Introduction to Apache Spark RDD

Profile picture for user shiksha.dahiya
Written by shiksha.dahiya on 07/27/2021 - 22:04

The Resilient Distributed Dataset is a concept at the heart of Spark. It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient. Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse grained sets of data. Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes. A distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently:

  • iterative algorithms 
  • interactive data mining tools.

In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs:

  • parallelizing an existing collection in your driver program,
  • or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.

Spark makes use of the concept of RDD to achieve faster and efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.

Once data is loaded into an RDD, two basic types of operation can be carried out:

• Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more;

• Actions, such as counts, which measure but do not change the original data. The original RDD remains unchanged throughout.

The chain of transformations from RDD1 to RDDn are logged, and can be repeated in the event of data loss or the failure of a cluster node. Transformations are said to be lazily evaluated, meaning that they are not executed until a subsequent action has a need for the result. This will normally improve performance, as it can avoid the need to process data unnecessarily. It can also, in certain circumstances, introduce processing bottlenecks that cause applications to stall while waiting for a processing action to conclude. Where possible, these RDDs remain in memory, greatly increasing the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes.

Resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators. 

Data sharing

Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Most of the Hadoop applications, they spend more than 90% of the time doing HDFS read write operations. Recognizing this problem, researchers developed a specialized framework called Apache Spark.

The key idea of spark is Resilient Distributed Datasets (RDD); it supports in memory processing computation. This means, it stores the state of memory as an object across the jobs and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than network and Disk.

Iterative operations on Spark RDD

It will store intermediate results in a distributed memory instead of Stable storage (Disk) and make the system faster. If the Distributed memory (RAM) is sufficient to store intermediate results (State of the JOB), then it will store those results on the disk. If different queries are run on the same set of data repeatedly, this particular data can be kept in memory for better execution times.

RDD Transformations

RDD transformations returns pointer to new RDD and allows you to create dependencies between RDDs. Each RDD in dependency chain (String of Dependencies) has a function for calculating its data and has a pointer (dependency) to its parent RDD.

Related Content
Apache Spark Tutorial
Introduction to Apache Spark
How to Get Started with PySpark
Tags
Apache Spark
  • Log in or register to post comments

Choose Your Technology

  1. Agile
  2. Apache Groovy
  3. Apache Hadoop
  4. Apache HBase
  5. Apache Spark
  6. Appium
  7. AutoIt
  8. AWS
  9. Behat
  10. Cucumber Java
  11. Cypress
  12. DBMS
  13. Drupal
  14. GitHub
  15. GitLab
  16. GoLang
  17. Gradle
  18. HTML
  19. ISTQB Foundation
  20. Java
  21. JavaScript
  22. JMeter
  23. JUnit
  24. Karate
  25. Kotlin
  26. LoadRunner
  27. matplotlib
  28. MongoDB
  29. MS SQL Server
  30. MySQL
  31. Nightwatch JS
  32. PactumJS
  33. PHP
  34. Playwright
  35. Playwright Java
  36. Playwright Python
  37. Postman
  38. Project Management
  39. Protractor
  40. PyDev
  41. Python
  42. Python NumPy
  43. Python Pandas
  44. Python Seaborn
  45. R Language
  46. REST Assured
  47. Ruby
  48. Selenide
© Copyright By iVagus Services Pvt. Ltd. 2023. All Rights Reserved.

Footer

  • Cookie Policy
  • Privacy Policy
  • Terms of Use