Between Coalesce and Repartition, which causes less network shuffling? Which operation is recommended in which situations?

  • Coalesce merges the existing partitions instead of building new ones, so it avoids a full shuffle and moves less data over the network. Use Coalesce when you want to reduce the number of partitions.
  • Repartition performs a full network shuffle and creates new partitions of roughly equal size. Use Repartition when you want to increase the number of partitions, or when you need to rebalance skewed partitions.
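  • Coalesce is not suited to increasing the number of partitions. On a DataFrame, requesting more partitions than currently exist leaves the count unchanged; on an RDD, Coalesce can increase the count only with shuffle = true, which is what Repartition does. Because Coalesce merges partitions without rebalancing them, it may also leave partitions of unequal size, and a Spark job does not perform well with skewed partitions. Evening them out, in turn, requires the network shuffle that Repartition performs.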
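
The difference described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's actual implementation: the names toy_coalesce, toy_repartition and sizes are hypothetical, partitions are modelled as plain lists, and the round-robin redistribution is an assumption chosen to keep the example simple. In Spark itself the corresponding calls are df.coalesce(n) and df.repartition(n).

```python
from typing import List

Partition = List[int]


def toy_coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: merge whole neighbouring partitions into n groups.

    Records are never split out of their original partition, which stands in
    for "no full shuffle". Asking for more partitions than exist changes nothing.
    """
    if n >= len(partitions):
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Map input partition i onto one of the n output partitions.
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: redistribute every record round-robin into n partitions.

    Every record may move, which stands in for a full shuffle.
    """
    out: List[Partition] = [[] for _ in range(n)]
    records = [rec for part in partitions for rec in part]
    for idx, rec in enumerate(records):
        out[idx % n].append(rec)
    return out


def sizes(partitions: List[Partition]) -> List[int]:
    """Return the number of records in each partition."""
    return [len(p) for p in partitions]


# Four skewed partitions holding 10, 2, 2 and 2 records.
skewed = [list(range(0, 10)), list(range(10, 12)),
          list(range(12, 14)), list(range(14, 16))]

print(sizes(toy_coalesce(skewed, 2)))     # [12, 4]  fewer partitions, still skewed
print(sizes(toy_repartition(skewed, 2)))  # [8, 8]   balanced, but every record moved
print(sizes(toy_coalesce(skewed, 8)))     # [10, 2, 2, 2]  count not increased
print(sizes(toy_repartition(skewed, 8)))  # [2, 2, 2, 2, 2, 2, 2, 2]
```

In this model, coalescing keeps the skew it started with, while repartitioning trades a full redistribution of the records for evenly sized partitions.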