Finding Bottlenecks with Just the Spark UI
Identify and resolve performance issues in Apache Spark using this battle-tested checklist for Spark UI analysis
Apache Spark is powerful — until it’s not. Performance issues can sneak in through skewed joins, improper caching, excessive shuffles, and memory pressure. The good news? You can uncover most of these problems without logs or third-party tools. All you need is the Spark UI.
This post gives you a battle-tested checklist of what to look for and the red flags that scream “optimize me.” Bookmark it. You’ll need it.
✅ Jobs Tab — First Glance, Big Picture
- What to check:
- Jobs sorted by duration
- Failed or retried jobs
- Number of stages per job
- Red flags:
- Long job durations (especially after retries)
- High failure/retry count
- Jobs with uneven stage durations
- Why it matters:
- This tab helps narrow your investigation to the most expensive parts of your pipeline.
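When several pipelines share a cluster, naming your jobs makes this tab much easier to read. Here is a minimal PySpark sketch (the session name and description string are placeholders, not from the original workflow) that prints the UI address and labels the jobs that follow:

```python
from pyspark.sql import SparkSession

# Placeholder session; any existing SparkSession works the same way.
spark = SparkSession.builder.appName("jobs-tab-demo").getOrCreate()

# Address of the live Spark UI for this application (None if the UI is disabled).
print(spark.sparkContext.uiWebUrl)

# Labels every job triggered after this call, so it is easy to spot in the Jobs tab.
spark.sparkContext.setJobDescription("nightly-aggregation: build customer totals")

# Trigger a job so the labeled entry appears in the Jobs tab.
spark.range(1_000_000).selectExpr("id % 10 AS k").groupBy("k").count().count()
```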
✅ Stages Tab — Stage-level Heat Zones
- What to check:
- Stage duration (sort descending)
- Shuffle Read/Write volume
- GC Time per stage
- Input/output records
- Task duration variance
- Red flags:
- A single stage dominating runtime
- High shuffle read/write (GBs instead of MBs)
- GC Time > 10% of stage duration
- Tasks with large duration variance = data skew
- Why it matters:
- Most Spark performance problems surface at the stage level — especially wide transformations like groupBy, join, and distinct.
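To see how a single wide transformation becomes its own shuffle-heavy stage, here is a minimal sketch (the orders DataFrame is synthetic, purely for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stages-tab-demo").getOrCreate()

# Synthetic data standing in for a real fact table.
orders = spark.range(1_000_000).selectExpr("id % 1000 AS customer_id", "id AS amount")

# groupBy is a wide transformation: it forces a shuffle, which shows up in the
# Stages tab as a separate stage with non-zero Shuffle Read/Write.
totals = orders.groupBy("customer_id").sum("amount")

totals.explain()   # The Exchange node in the plan marks the shuffle boundary.
totals.count()     # Run it so the stage appears in the UI.
```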
✅ Tasks Tab — Microscope Mode
- What to check:
- Distribution of task durations
- Failed tasks
- Input/output bytes
- Spill data
- Task locality (NODE_LOCAL, PROCESS_LOCAL, ANY)
- Red flags:
- A few tasks take much longer than the rest
- Spill (memory or disk) present in most tasks
- Uneven I/O = bad partitioning or skew
- Tasks consistently running as ANY → locality delay
- Why it matters:
- This is where you catch anomalies like stragglers and poorly distributed workloads.
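Straggler tasks usually trace back to a handful of oversized partitions. A quick way to confirm that from code is to count rows per partition; the sketch below uses deliberately skewed synthetic data for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, count

spark = SparkSession.builder.appName("tasks-tab-demo").getOrCreate()

# Deliberately skewed key: most rows share key 0.
df = spark.range(1_000_000).selectExpr(
    "CASE WHEN id % 100 = 0 THEN id ELSE 0 END AS key"
)

# Rows per partition after hashing on the skewed key. One giant partition next to
# many small ones is exactly what shows up as straggler tasks in the Tasks tab.
(df.repartition("key")
   .groupBy(spark_partition_id().alias("partition"))
   .agg(count("*").alias("rows"))
   .orderBy("rows", ascending=False)
   .show())
```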
✅ Executors Tab — Cluster Health Pulse
- What to check:
- GC Time %
- Task and input distribution
- Shuffle read/write volume per executor
- Failed tasks
- Memory and storage usage
- Red flags:
- High GC Time (esp. >20%)
- One executor doing more work than others = imbalanced partitions
- Memory usage close to max (with spill) = consider tuning executor memory
- Repeated task failures → hardware issues or bad code
- Why it matters:
- Executor imbalance, memory pressure, and garbage collection can tank performance silently.
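When this tab shows heavy GC or spill, the usual lever is executor sizing. A minimal sketch of where those knobs live (the values are illustrative starting points, not recommendations):

```python
from pyspark.sql import SparkSession

# Illustrative values only; tune them against the GC time, spill, and task
# distribution that the Executors tab reports for your own workload.
spark = (
    SparkSession.builder
    .appName("executors-tab-demo")
    .config("spark.executor.memory", "8g")     # heap available to each executor
    .config("spark.executor.cores", "4")       # concurrent tasks per executor
    .config("spark.memory.fraction", "0.6")    # share of heap for execution + storage
    .getOrCreate()
)
```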
✅ SQL Tab — Plan or Pain?
- What to check:
- Query duration
- Join strategies used (SortMergeJoin vs BroadcastHashJoin)
- Scan size and output rows
- Partition pruning effectiveness
- Physical plan hierarchy
- Red flags:
- Full table scans without filters
- Small dimension tables not being broadcasted
- Lots of SortMergeJoins where a BroadcastHashJoin could be used
- Missing partition filters (check your WHERE clause)
- Why it matters:
- Optimizing Spark SQL often brings the biggest wins. Poor query planning is a hidden cost center.
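If a small dimension table is not being broadcast automatically, you can ask for it explicitly and then confirm the chosen strategy in the SQL tab’s physical plan. A minimal sketch with synthetic fact and dimension tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("sql-tab-demo").getOrCreate()

# Synthetic fact and dimension tables for illustration.
facts = spark.range(1_000_000).selectExpr("id", "id % 100 AS dim_id")
dims = spark.range(100).selectExpr("id AS dim_id", "concat('name_', id) AS name")

# The broadcast hint asks the optimizer for a BroadcastHashJoin instead of a
# SortMergeJoin; verify which one was actually picked in the SQL tab.
joined = facts.join(broadcast(dims), "dim_id")
joined.explain()
joined.count()
```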
✅ Storage Tab — Caching Gone Wrong
- What to check:
- Cached RDD/DataFrame size
- Memory used vs memory cached
- Fraction cached
- Recomputations after actions
- Red flags:
- RDDs cached but not materialized
- Low fraction cached = partitions are being evicted, so you pay the memory cost without avoiding recomputation
- Frequent recomputation = improper persistence strategy
- Why it matters:
- Caching is only helpful if it’s used right. Otherwise, it just eats memory.
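The most common trap is that cache() is lazy: nothing shows up in the Storage tab until an action runs. A minimal sketch (synthetic data, hypothetical reuse pattern):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-tab-demo").getOrCreate()

df = spark.range(10_000_000).selectExpr("id", "id % 7 AS bucket")

cached = df.cache()   # Lazy: the Storage tab is still empty at this point.
cached.count()        # Materializes the cache; Fraction Cached should now read 100%.

# ...reuse `cached` across several downstream actions here...

cached.unpersist()    # Free the memory once the DataFrame is no longer needed.
```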
✅ Environment & Config Tabs — Sanity Checks
- What to check:
- Spark version
- Configurations like:
- spark.sql.shuffle.partitions
- spark.executor.memory
- spark.sql.autoBroadcastJoinThreshold
- spark.dynamicAllocation.enabled
- Red flags:
- Default partition count (e.g., 200) for huge datasets
- Broadcast join thresholds too low
- Dynamic allocation turned off without reasoning
- Why it matters:
- Misconfigurations at this level cascade into all the tabs above.
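If a value in the Environment tab looks suspicious, most of these settings can be fixed when the session is built, and the spark.sql.* ones can even be changed at runtime. A minimal sketch with illustrative values, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative starting points; the right values depend on data volume and cluster size.
spark = (
    SparkSession.builder
    .appName("environment-tab-demo")
    .config("spark.sql.shuffle.partitions", "400")   # the default 200 is often too low for large shuffles
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))  # 64 MB
    .config("spark.dynamicAllocation.enabled", "true")  # also needs shuffle tracking or an external shuffle service
    .getOrCreate()
)

# spark.sql.* settings can also be adjusted on a live session and verified immediately.
spark.conf.set("spark.sql.shuffle.partitions", "800")
print(spark.conf.get("spark.sql.shuffle.partitions"))
```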
Pro Tips
- Use the timeline view in the Stages tab for quick visual detection of stragglers.
- Combine SQL plan info with stage/task data to correlate slow queries with compute overhead.
- Bookmark long-running job links to compare pre- and post-optimization.
Conclusion
Debugging Spark performance isn’t a black box. The Spark UI gives you all the visibility you need — if you know where to look.
Next time your pipeline misbehaves, skip the logs and dashboards. Open the Spark UI, walk this checklist, and tackle bottlenecks with confidence.
✨ One UI to find them all. One checklist to debug them. ✨