The first area is code optimisation, wherein:
- You broadcast the smaller DataFrames to reduce shuffle and network traffic,
- Use checkpoints so that large DataFrames do not have to be recreated from scratch after a failure, and
- Use caching to reuse DataFrames directly from memory (see the sketch after this list).
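Below is a minimal PySpark sketch of these three code-level techniques. The `spark` session, the `large_df`/`small_df` DataFrames, the file paths, and the checkpoint directory are hypothetical placeholders, not part of the original answer.

```python
# Sketch of the three code-level optimisations; all names and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("code-optimisation-sketch").getOrCreate()

large_df = spark.read.parquet("/data/transactions")   # hypothetical large fact table
small_df = spark.read.parquet("/data/countries")      # hypothetical small lookup table

# 1. Broadcast the smaller DataFrame: each executor receives a full copy,
#    so the join avoids shuffling the large DataFrame across the network.
joined = large_df.join(broadcast(small_df), on="country_code")

# 2. Checkpoint a large intermediate result: lineage is truncated and the data
#    is written to reliable storage, so it is not recomputed after a failure.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
joined = joined.checkpoint()

# 3. Cache a DataFrame that is reused several times, so later actions read it
#    from memory instead of recomputing it from the source.
joined.cache()
joined.count()                                        # materialises the cache
summary = joined.groupBy("country_code").count()      # reuses the cached data
```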
The second area is cluster optimisation, wherein:
- You select the right resources from the cluster (executors, cores, and memory) for the given Spark job, and
- You partition the data so that tasks run with good data locality, ideally at the process level or at least at the node level (see the sketch below).
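The following PySpark sketch illustrates both cluster-level points. The executor counts, memory sizes, partition count, and the `events_df`/`customer_id` names are assumptions for illustration, not recommendations for any particular cluster.

```python
# Sketch of cluster-level optimisation; all resource values and names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-optimisation-sketch")
    # Right-size the resources requested from the cluster for this job.
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    # Match shuffle parallelism to the data volume and available cores.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

events_df = spark.read.parquet("/data/events")  # hypothetical input

# Repartition on the key used by downstream aggregations so related rows land in
# the same partition; tasks can then run process- or node-local to the data
# instead of pulling it across the network.
events_df = events_df.repartition(200, "customer_id")
events_df.groupBy("customer_id").count().write.mode("overwrite").parquet("/data/event_counts")
```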