Features of Apache Spark

Apache Spark is often deployed in conjunction with a Hadoop cluster, and it benefits from a number of Hadoop's capabilities as a result. On its own, Spark is a powerful tool for processing large volumes of data, but it is not yet well suited to enterprise production workloads. Integration with Hadoop gives Spark many of the capabilities that broad adoption and use in production environments require, including:

  • The YARN resource manager, which takes responsibility for scheduling tasks across the available nodes in the cluster;

  • The Hadoop Distributed File System (HDFS), which stores data when the cluster runs out of free memory, and which persistently stores historical data when Spark is not running;

  • Disaster recovery capabilities, inherent to Hadoop, which enable recovery of data when individual nodes fail. These capabilities include basic (but reliable) data mirroring across the cluster, as well as richer snapshot and mirroring capabilities such as those offered by the MapR Data Platform;

  • Data security, which becomes increasingly important as Spark tackles production workloads in regulated industries such as healthcare and financial services. Projects like Apache Knox and Apache Ranger offer data security capabilities that augment Hadoop. Each of the big three vendors has an alternative approach for security implementations that complement Spark, and Hadoop's core code is increasingly exposing advanced security capabilities that Spark can exploit;

  • A distributed data platform that benefits from all of the preceding points, meaning that Spark jobs can be deployed on available resources anywhere in a distributed cluster, without the need to manually allocate and track those individual jobs (a minimal sketch of this deployment model follows this list).
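To make that deployment model concrete, here is a minimal sketch of a Spark job that reads from and writes to HDFS and would be submitted to YARN; the application name, jar name, and HDFS paths are all placeholders invented for this example.

```scala
import org.apache.spark.sql.SparkSession

// Minimal Spark-on-YARN job; submitted with something like:
//   spark-submit --master yarn --deploy-mode cluster hdfs-wordcount.jar
// (the jar name, app name, and HDFS paths below are placeholders)
object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsWordCount")
      .getOrCreate() // the master URL comes from spark-submit, i.e. YARN

    import spark.implicits._

    // Read persistent data from HDFS, which survives Spark restarts
    val lines = spark.read.textFile("hdfs:///data/events.log")

    // Count words and write the result back to HDFS
    lines.flatMap(_.split("\\s+"))
      .groupBy("value").count()
      .write.csv("hdfs:///data/events_wordcount")

    spark.stop()
  }
}
```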

Spark is a general-purpose data processing engine, suitable for use in a wide range of circumstances. Interactive queries across large data sets, processing of streaming data from sensors or financial systems, and machine learning tasks are the workloads most frequently associated with Spark. Developers can also use it to support other data processing tasks, benefiting from Spark's extensive set of developer libraries and APIs and its comprehensive support for languages such as Java, Python, R, and Scala. Spark is often used alongside Hadoop's data storage module, HDFS, but it integrates equally well with other popular data storage subsystems such as HBase, Cassandra, MapR-DB, MongoDB, and Amazon S3.
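As a rough sketch of that generality (not a definitive recipe), the example below uses a single SparkSession for both an interactive-style SQL query and a streaming read; the file path, table name, column names, and socket source are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object GeneralPurposeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GeneralPurposeDemo")
      .master("local[*]") // local mode, purely for illustration
      .getOrCreate()

    // Interactive-style SQL over a batch data set (path is a placeholder)
    spark.read.json("file:///tmp/trades.json").createOrReplaceTempView("trades")
    spark.sql("SELECT symbol, avg(price) AS avg_price FROM trades GROUP BY symbol").show()

    // The same engine also processes streaming data, e.g. lines from a socket
    val stream = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999)
      .load()
    stream.writeStream.format("console").start().awaitTermination()
  }
}
```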

There are many reasons to choose Spark, but three key features stand out:

Simplicity:

  • Spark’s capabilities are accessible via a set of rich APIs, all designed specifically for interacting quickly and easily with data at scale. These APIs are well documented, and structured in a way that makes it straightforward for data scientists and application developers to quickly put Spark to work;
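As a small illustration of how concise those APIs can be, the spark-shell sketch below filters and aggregates a CSV file in a handful of declarative calls; the file path and the country and amount columns are invented for the example.

```scala
import org.apache.spark.sql.functions._

// In spark-shell, `spark` (a SparkSession) is already defined.
// The file path and the country/amount columns are placeholders.
spark.read.option("header", "true").csv("file:///tmp/sales.csv")
  .filter(col("country") === "DE")
  .groupBy(col("country"))
  .agg(sum(col("amount").cast("double")).as("total"))
  .show()
```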

Speed:

  • Spark is designed for speed, operating both in memory and on disk. In 2014, Spark was used to win the Daytona GraySort benchmarking challenge, processing 100 terabytes of data stored on solid-state drives in just 23 minutes; the previous winner, which used Hadoop and a different cluster configuration, took 72 minutes. That win involved a static data set. Spark’s performance can be even greater when supporting interactive queries of data stored in memory, with claims that Spark can be 100 times faster than Hadoop’s MapReduce in these situations, as the caching sketch below illustrates;
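The in-memory point can be made concrete with Spark's explicit caching. In the hedged spark-shell sketch below (the parquet path and the status column are placeholders), the first action materialises the data in executor memory, so subsequent queries avoid re-reading from disk.

```scala
// In spark-shell (which predefines `spark` and its implicits);
// the parquet path and the status column are placeholders.
val events = spark.read.parquet("hdfs:///data/events.parquet")
events.cache()   // ask Spark to keep this data set in executor memory

events.count()                                // first action: reads from disk, fills the cache
events.filter($"status" === "ERROR").count() // subsequent work is served from memory
```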

Support:

  • Spark supports a range of programming languages, including Java, Python, R, and Scala. Although often closely associated with Hadoop’s underlying storage system, HDFS, Spark includes native support for tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond. Additionally, the Apache Spark community is large, active, and international. A growing set of commercial providers, including Databricks, IBM, and all of the main Hadoop vendors, deliver comprehensive support for Spark-based solutions.
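One practical consequence of that breadth of storage support is that switching backends is often just a change of input URI, as the sketch below suggests; the paths are placeholders, and the S3 variant additionally assumes that the hadoop-aws connector and credentials are configured.

```scala
// Same DataFrame code, different storage backends (all paths are placeholders).
val fromHdfs  = spark.read.parquet("hdfs:///warehouse/users.parquet")
val fromS3    = spark.read.parquet("s3a://my-bucket/warehouse/users.parquet")
val fromLocal = spark.read.parquet("file:///tmp/users.parquet")

// All three expose the same API from here on:
fromHdfs.printSchema()
```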