In Spark, a partition is a logical chunk of data. Spark's data structures (RDDs, DataFrames, and Datasets) are split into partitions that are processed on different cores and nodes. Partitioning lets you control the parallelism of an application, which results in better utilisation of the cluster. In general, work is distributed to more workers by creating more partitions; with fewer partitions, each worker processes the data in larger chunks. However, partitioning is not helpful in every application. For example, if a given RDD is scanned only once, partitioning it in advance is pointless. Pre-partitioning pays off only when a data set is reused multiple times, particularly in key-oriented operations such as join().
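To make the benefit for key-oriented operations concrete, here is a minimal pure-Python sketch (not Spark itself, just a simulation of the idea) of hash partitioning: keyed records are routed to a partition by hashing the key, so two datasets partitioned the same way can be joined partition-by-partition, with no records moving between partitions. The function names and data are hypothetical, chosen only for illustration.

```python
def partition_by_key(records, num_partitions):
    """Distribute (key, value) pairs into num_partitions buckets by key hash,
    mimicking how Spark's hash partitioner routes keyed records."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def copartitioned_join(left_parts, right_parts):
    """Join two co-partitioned datasets partition-by-partition.
    Because matching keys always land in the same partition index,
    each partition can be joined locally -- no shuffle is needed."""
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        right_index = {}
        for key, value in right_part:
            right_index.setdefault(key, []).append(value)
        for key, left_value in left_part:
            for right_value in right_index.get(key, []):
                joined.append((key, (left_value, right_value)))
    return joined

users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (1, "lamp")]

user_parts = partition_by_key(users, 4)
order_parts = partition_by_key(orders, 4)
print(sorted(copartitioned_join(user_parts, order_parts)))
# [(1, ('alice', 'book')), (1, ('alice', 'lamp')), (3, ('carol', 'pen'))]
```

This is exactly why pre-partitioning pays off on reuse: the cost of hashing and routing the records is paid once, and every subsequent join() (or other key-oriented operation) over the same partitioning avoids a shuffle. In real Spark code the analogous step would be something like rdd.partitionBy(numPartitions) followed by persist(), so the partitioned layout is kept across actions.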