What are the different approaches/ways in which you can optimise any Spark job?

The first is code optimisation, wherein:

  • You broadcast the smaller DataFrames so that joins avoid shuffling the larger DataFrame across the network,
  • You use checkpoints to truncate long lineages, so that after a failure large DataFrames do not have to be recreated from scratch, and
  • You use caching to serve reused DataFrames from memory instead of recomputing them (see the sketch after this list).
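Here is a minimal PySpark sketch of these three code-level techniques. The table paths, column names, and the checkpoint directory are illustrative assumptions, not part of any specific job:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("code-optimisation-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # assumed path

large_df = spark.read.parquet("/data/transactions")    # assumed large fact table
small_df = spark.read.parquet("/data/country_codes")   # assumed small lookup table

# 1. Broadcast the smaller DataFrame so the join does not shuffle large_df.
joined = large_df.join(broadcast(small_df), on="country_code", how="left")

# 2. Checkpoint a DataFrame with an expensive lineage so a failure does not
#    force Spark to recompute it from the original sources.
checkpointed = joined.checkpoint()

# 3. Cache a DataFrame that is reused several times so it is served from
#    memory (spilling to disk if needed) instead of being recomputed.
reused = checkpointed.filter("amount > 0").cache()
reused.count()                                   # materialises the cache
reused.groupBy("country_code").count().show()    # reuses the cached data
```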
Secondly, cluster optimisation, wherein:

  • You select the right amount of cluster resources (executors, cores, and memory) for the given Spark job, and
  • You partition the data properly so that tasks run close to their data, ideally at process level (PROCESS_LOCAL) or at least at node level (NODE_LOCAL); a sketch follows below.
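And a minimal sketch of the cluster-level side. The executor sizes, partition count, locality-wait value, and paths are illustrative assumptions that would need tuning for a real cluster and dataset:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-optimisation-sketch")
    # Right-size executors instead of accepting defaults (values are assumptions).
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    # How long the scheduler waits for PROCESS_LOCAL / NODE_LOCAL slots
    # before falling back to less local ones.
    .config("spark.locality.wait", "3s")
    .getOrCreate()
)

df = spark.read.parquet("/data/transactions")  # assumed input

# Partition by the key that downstream joins/aggregations use, so that matching
# records land in the same partitions and tasks can run close to their data.
repartitioned = df.repartition(200, "country_code")
repartitioned.write.mode("overwrite").parquet("/data/transactions_by_country")
```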