[Figure: Apache Spark performance over time, before and after tuning]

10 Apache Spark Performance Tuning Tips for 2025

Published by Quarkray · Big Data Engineering

Apache Spark remains the backbone of large-scale data processing. But out-of-the-box performance rarely meets production demands. Whether you are running ETL pipelines, training machine learning models, or powering real-time dashboards, tuning Spark can mean the difference between a job that takes hours and one that finishes in minutes. Here are ten battle-tested tips to help you get the most out of Spark in 2025.

1. Right-Size Your Partitions

Partition sizing is the single most impactful lever for Spark performance. Too few partitions and executors sit idle while a handful struggle with oversized tasks. Too many partitions and the scheduler overhead dominates actual compute. The sweet spot is usually between 128 MB and 256 MB per partition. Use spark.sql.files.maxPartitionBytes to control input partition size and repartition() or coalesce() to adjust intermediate stages.

Monitor the Spark UI's task distribution. If the max task duration is several times larger than the median, you have a skew problem that partitioning alone cannot solve. Consider salting keys or using Adaptive Query Execution to handle it dynamically.
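The arithmetic behind the 128-256 MB guideline is simple enough to script. A back-of-the-envelope helper (plain Python, hypothetical name) for estimating a partition count from total input size:

```python
import math

def target_partitions(input_bytes: int, target_partition_bytes: int = 256 * 1024**2) -> int:
    """Estimate a partition count that keeps each partition near the target size.

    Illustrative helper for the 128-256 MB rule of thumb; the result could feed
    DataFrame.repartition() or spark.sql.shuffle.partitions.
    """
    return max(1, math.ceil(input_bytes / target_partition_bytes))

# A 1 TiB input at ~256 MiB per partition:
print(target_partitions(1024**4))  # → 4096
```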

2. Enable Adaptive Query Execution (AQE)

Adaptive Query Execution, introduced in Spark 3.0 and significantly improved in Spark 3.2+, is a game-changer. AQE dynamically optimizes query plans at runtime based on actual data statistics collected after shuffle stages. Enable it with spark.sql.adaptive.enabled=true; as of Spark 3.2 it is enabled by default.

AQE provides three core optimizations: coalescing post-shuffle partitions to eliminate tiny partitions, converting sort-merge joins to broadcast joins when one side turns out to be small, and optimizing skew joins by splitting oversized partitions. In our experience, AQE alone can reduce job runtimes by 20-40% on skewed workloads without any code changes.
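The relevant settings can be sketched as follows, assuming an active SparkSession bound to `spark` (these are runtime-settable SQL configs):

```python
# Sketch: AQE and its sub-features. All are on by default in Spark 3.2+,
# so on older 3.x versions these lines matter most.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce small post-shuffle partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split oversized partitions in skewed joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```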

3. Optimize Shuffle Operations

Shuffles are the most expensive operation in Spark. Every shuffle writes intermediate data to disk, transfers it across the network, and reads it back. Reduce shuffles by using broadcast joins for small tables (under 10 MB by default, adjustable via spark.sql.autoBroadcastJoinThreshold). Prefer reduceByKey over groupByKey for RDD operations, as the former combines values locally before shuffling.

Use the external shuffle service (spark.shuffle.service.enabled=true) to decouple shuffle data from executor lifecycles. This is essential for dynamic allocation and prevents data loss when executors are reclaimed. Also consider tuning spark.sql.shuffle.partitions—the default of 200 is rarely optimal for your workload.
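As a sketch, the session-level knobs above look like this in PySpark (assuming an active SparkSession named `spark`; the values are illustrative, not recommendations):

```python
# Runtime-settable SQL configs:
spark.conf.set("spark.sql.shuffle.partitions", "400")  # default is 200
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024**2))  # raise to 50 MB

# Cluster-level settings like the external shuffle service cannot be changed
# at runtime; they belong on the builder or spark-submit:
#   spark.shuffle.service.enabled=true
```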

4. Leverage Caching Strategically

Caching DataFrames with .cache() or .persist() avoids recomputation when a dataset is used multiple times. However, caching is not free: it consumes memory that could be used for execution, and caching a dataset that is only used once actually hurts performance.

Use StorageLevel.MEMORY_AND_DISK_SER (in the Scala and Java APIs; PySpark always stores cached data serialized) for large datasets that do not fit entirely in memory. Serialized storage uses less memory at the cost of CPU for deserialization. Always call .unpersist() when a cached dataset is no longer needed to free memory for subsequent stages.
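The cost of skipping .cache() is recomputation: every downstream action re-runs the lineage. A toy illustration in plain Python (not Spark API) counts how often an expensive step executes:

```python
calls = {"count": 0}

def expensive_transform(rows):
    """Stand-in for a costly lineage (e.g. a join followed by parsing)."""
    calls["count"] += 1
    return [r * 2 for r in rows]

data = [1, 2, 3]

# Without caching: two downstream "actions" each recompute the transform.
total = sum(expensive_transform(data))
count = len(expensive_transform(data))
print(calls["count"])  # → 2

# With caching: materialize once and reuse (analogous to .cache()).
calls["count"] = 0
cached = expensive_transform(data)
total, count = sum(cached), len(cached)
print(calls["count"])  # → 1
```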

5. Choose the Right File Format

Columnar file formats like Parquet and ORC dramatically reduce I/O for analytical queries. Parquet with Snappy compression is the de facto standard for Spark workloads. It enables predicate pushdown (skipping entire row groups that do not match filter conditions) and column pruning (reading only the columns needed for a query).

Avoid small files. Hundreds of thousands of small Parquet files cause excessive metadata overhead and scheduler pressure. Use compaction jobs to merge small files, or configure your writers to produce files in the 256 MB to 1 GB range. Delta Lake, Iceberg, and Hudi all provide automatic compaction features that handle this for you.
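Compaction is essentially bin-packing small files into target-sized outputs. A toy planner in plain Python (hypothetical; table formats like Delta and Iceberg do this for you) groups file sizes into roughly 256 MB bins:

```python
def plan_compaction(file_sizes, target_bytes=256 * 1024**2):
    """Greedily group file sizes into bins of roughly target_bytes each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Forty 10 MB files compact into two output groups:
small = [10 * 1024**2] * 40
print(len(plan_compaction(small)))  # → 2
```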

6. Tune Memory Configuration

Spark divides executor memory into execution memory (for shuffles, joins, and aggregations) and storage memory (for caching). The unified memory model, introduced in Spark 1.6, allows these regions to borrow from each other, but you still need to set the total correctly. A common starting point is 4-8 GB per executor with 4-5 cores each.

Watch for garbage collection pauses. If GC time exceeds 10% of task time, consider using the G1 GC collector (-XX:+UseG1GC) and increasing executor memory. Off-heap memory via spark.memory.offHeap.enabled can also help reduce GC pressure for memory-intensive workloads.
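Memory settings must be fixed before the session starts, so they go on the builder (or spark-submit) rather than spark.conf.set at runtime. A sketch, with illustrative values:

```python
from pyspark.sql import SparkSession  # assumes pyspark is installed

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "5")
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    # Off-heap storage to reduce GC pressure; size is required when enabled.
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()
)
```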

7. Use Broadcast Variables for Lookup Data

When you need to join a large dataset with a small lookup table, broadcast the small table to all executors. This eliminates the shuffle entirely and converts an expensive sort-merge join into a fast map-side join. Use broadcast() explicitly or increase the auto-broadcast threshold.

Broadcast variables are also useful beyond joins. If your UDFs reference a dictionary, model, or configuration map, broadcasting it ensures each executor gets a single copy in memory rather than one per task. This can reduce memory usage by orders of magnitude for jobs with thousands of tasks.
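The map-side join that broadcasting enables is just a hash lookup on each row. In plain Python terms (illustrative data, not Spark API):

```python
# Small lookup table -- shipped once per executor when broadcast.
country_names = {"US": "United States", "DE": "Germany", "JP": "Japan"}

# Large fact rows; each task joins by probing the shared dict, with no shuffle.
orders = [("o1", "US"), ("o2", "JP"), ("o3", "DE")]

joined = [(oid, country_names.get(code, "unknown")) for oid, code in orders]
print(joined)  # → [('o1', 'United States'), ('o2', 'Japan'), ('o3', 'Germany')]
```

In PySpark this corresponds to wrapping the small side in broadcast() within a join, or calling sc.broadcast(country_names) and reading .value inside a UDF.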

8. Minimize Data Serialization Overhead

Spark's default Java serialization is slow and produces large payloads. Switch to Kryo serialization (spark.serializer=org.apache.spark.serializer.KryoSerializer) for a 2-10x improvement in serialization speed. Register your custom classes with Kryo for even better performance.
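Like executor memory, the serializer must be set before the session starts. A sketch (the registered class name is a hypothetical example):

```python
from pyspark.sql import SparkSession  # assumes pyspark is installed

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Registering classes lets Kryo write a short ID instead of the full
    # class name in every record. "com.example.Event" is hypothetical.
    .config("spark.kryo.classesToRegister", "com.example.Event")
    .getOrCreate()
)
```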

For DataFrame operations, Spark uses its own Tungsten binary format, which bypasses Java serialization entirely. Prefer the DataFrame and Dataset APIs over raw RDDs whenever possible. The Catalyst optimizer and Tungsten execution engine provide optimizations that are impossible with the RDD API.

9. Handle Data Skew Proactively

Data skew occurs when a few partition keys contain disproportionately more data than others. This causes some tasks to take far longer than the rest, making the entire stage wait. Identify skew by checking the Spark UI for tasks that take much longer than the median.

Solutions include salting skewed keys (appending a random number to the key, performing the join, then aggregating), using AQE's skew join optimization, or isolating the skewed keys and processing them separately with a broadcast join. For aggregation skew, a two-phase approach—first aggregate with salted keys, then aggregate the partial results—distributes work more evenly.
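The two-phase salted aggregation reads naturally as code. A plain-Python sketch of the idea (in Spark, each phase would be a distributed aggregation):

```python
import random
from collections import defaultdict

def salted_sum(pairs, n_salts=4, seed=0):
    """Two-phase sum: first by (key, salt) to spread hot keys, then by key."""
    rng = random.Random(seed)

    # Phase 1: aggregate on salted keys -- a hot key fans out over n_salts tasks.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[(key, rng.randrange(n_salts))] += value

    # Phase 2: merge the partial results back down to the original key.
    final = defaultdict(int)
    for (key, _salt), value in partial.items():
        final[key] += value
    return dict(final)

print(salted_sum([("hot", 1)] * 6 + [("cold", 5)]))  # → {'hot': 6, 'cold': 5}
```

The result is identical to a direct aggregation; only the distribution of work changes.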

10. Monitor, Profile, and Iterate

Performance tuning is not a one-time activity. Use the Spark UI to identify bottlenecks: look at task durations, shuffle read/write sizes, GC time, and spill metrics. Spark's event logs can be analyzed offline with the History Server. Third-party tools like Datadog, Unravel, and Databricks' built-in profiler provide deeper insights.

Establish baselines and benchmark systematically. Change one parameter at a time and measure the impact. What works for one workload may not work for another. Keep a performance log documenting what you tried, what worked, and what didn't. This institutional knowledge is invaluable as your data volumes grow and your team scales.
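A lightweight way to keep that performance log honest is to record each run programmatically rather than eyeballing the UI. A minimal harness (plain Python, hypothetical structure):

```python
import time

def benchmark(label, fn, log):
    """Run fn once, append (label, elapsed_seconds) to log, return fn's result."""
    start = time.perf_counter()
    result = fn()
    log.append((label, time.perf_counter() - start))
    return result

runs = []
benchmark("baseline", lambda: sum(range(1_000_000)), runs)
benchmark("after shuffle.partitions=400", lambda: sum(range(1_000_000)), runs)
for label, seconds in runs:
    print(f"{label}: {seconds:.3f}s")
```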

Conclusion

Spark performance tuning is part science, part art. Start with the fundamentals—partition sizing, file formats, and shuffle reduction—before diving into advanced techniques like memory tuning and skew handling. Enable AQE as a baseline, and invest in monitoring to catch regressions early. With these ten tips in your toolkit, you will be well-equipped to handle even the most demanding workloads in 2025.

At Quarkray, we help organizations optimize their Spark infrastructure for maximum throughput and minimum cost. If you need expert guidance on tuning your data pipelines, get in touch.