Apache Spark Performance Linter

spark-perf-lint

Enterprise-grade static analysis for PySpark — catches anti-patterns before they reach production, explains why they hurt, and hands you a concrete fix.

93 Performance Rules (CRITICAL · WARNING · INFO)
11 Dimensions (Config · Shuffle · Joins · …)
3 Tiers (Pre-commit · CI/LLM · Runtime)
<3s Pre-commit Speed (Zero Spark dependency)

Quick Start

Up and running in 30 seconds. No Spark installation required for Tier 1.

terminal
# Install
pip install spark-perf-lint

# Scan your PySpark code
spark-perf-lint scan src/

# Or add as pre-commit hook
spark-perf-lint init && pre-commit install

How it works

1. Discovers .py files that import PySpark
2. Parses Python AST — no Spark runtime needed
3. Matches 93 rules across 11 performance dimensions
4. Reports findings with severity, fix, and impact
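The AST pass above can be sketched in a few lines of pure Python. This is a toy illustration of a Tier 1 rule, not the tool's actual implementation — it flags crossJoin calls using only the standard library, with no Spark dependency:

```python
import ast

def find_cross_joins(source: str):
    """Flag every .crossJoin(...) call with its line number.

    A minimal Tier 1-style rule: parse the AST, walk every Call
    node, and match on the attribute name -- no Spark needed.
    """
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "crossJoin"):
            findings.append(("SPL-D03-001", node.lineno))
    return findings

code = "result = df.crossJoin(other)\nok = df.join(other, 'id')\n"
print(find_cross_joins(code))  # → [('SPL-D03-001', 1)]
```

Because the scan is purely syntactic, it runs in milliseconds per file — which is what keeps the pre-commit tier under the 3-second budget.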

What You'll See

Clear, actionable findings with severity levels and concrete recommendations.

bash
$ spark-perf-lint scan etl/customer_pipeline.py
✗ CRITICAL SPL-D03-001  etl/customer_pipeline.py:42
Cross join / cartesian product
→ df.join(other, on='id', how='inner')
✗ CRITICAL SPL-D09-005  etl/customer_pipeline.py:87
.collect() without prior filter or limit
→ df.join(other, 'id').limit(1000).collect()
⚠ WARNING   SPL-D06-001  etl/customer_pipeline.py:55
.cache() without .unpersist()
→ Call df_cached.unpersist() when done
⚠ WARNING   SPL-D01-010  etl/customer_pipeline.py:12
Default shuffle partitions unchanged (200)
→ .config("spark.sql.shuffle.partitions", "400")
ℹ INFO      SPL-D07-003  etl/customer_pipeline.py:31
select("*") disables column pruning
→ df.select('status').groupBy('status').count()
──────────────────────────────────────────────
Scanned 1 file · 5 findings · 2 critical · 2 warnings · 1 info
✗ FAILED (2 critical findings)

11 Dimensions of Spark Performance

Every dimension of PySpark performance, from cluster configuration to production observability.

Three-Tier Architecture

From fast pre-commit hook to deep runtime analysis — use the tier that fits your workflow.

🔒 Tier 1: Pre-Commit Hook
Pure Python AST analysis. Runs on every git commit, scanning only staged files. Zero Spark dependency — works in any environment.
Scan time: <3s · Cost: Free · Spark needed: No · Rules: 93

🔍 Tier 2: CI/PR Analysis
Claude API enrichment. Sends full file context to Claude for context-aware explanations and refactoring suggestions. Runs on pull requests.
Scan time: 10–30s · Cost: ~$0.05 per PR · Spark needed: No · API key: Required

🔬 Tier 3: Deep Audit
Live Spark runtime analysis. Inspects physical query plans, runs benchmark comparisons, and profiles real execution metrics against a running cluster.
Scan time: 5–15m · Cost: Cluster · Spark needed: Yes · Interface: Notebook

Pre-Commit Hook Setup

Block bad PySpark from ever reaching your repo. Installs in 60 seconds.

Step 01 · Install pre-commit: pip install pre-commit
Step 02 · Add to .pre-commit-config.yaml: add the repo stanza shown below
Step 03 · Install the hook: pre-commit install
Step 04 · Test it: pre-commit run --all-files
.pre-commit-config.yaml
repos:
  - repo: https://github.com/sunilpradhansharma/spark-perf-lint
    rev: v0.1.0
    hooks:
      - id: spark-perf-lint
What happens on commit
$ git commit -m "add customer ETL pipeline"
spark-perf-lint...................................................Failed
✗ CRITICAL SPL-D08-001 etl/pipeline.py:8
AQE disabled — spark.sql.adaptive.enabled = false
→ Remove the line — AQE is on by default in Spark 3.2+
Commit blocked. Fix findings and try again.
Use git commit --no-verify to bypass (emergencies only)

All 93 Rules

Every rule with ID, severity, description, and expandable before/after examples.

⚙️ D01 — Cluster Configuration (10 rules)

SPL-D01-001 · WARNING · Missing Kryo serializer
spark.serializer not set — 10× slower Java serializer used.
Impact: 10× serialization speedup; 3–5× reduction in shuffle data size
Before:
SparkSession.builder.appName("job").getOrCreate()
After:
spark = (SparkSession.builder
  .config("spark.serializer",
    "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate())

SPL-D01-002 · WARNING · Executor memory not configured
spark.executor.memory not set — cluster default (often 1g) used.

SPL-D01-003 · WARNING · Driver memory not configured
spark.driver.memory not set — risk of OOM on collect or broadcast.

SPL-D01-004 · INFO · Dynamic allocation not enabled
spark.dynamicAllocation.enabled not true — idle executors waste resources.

SPL-D01-005 · WARNING · Executor cores set too high
spark.executor.cores above 5 — HDFS throughput degrades, GC pauses increase.

SPL-D01-006 · WARNING · Memory overhead too low
spark.executor.memoryOverhead below 384 MB — container killed without OOM log.

SPL-D01-007 · WARNING · Missing PySpark worker memory config
Python UDFs present but spark.python.worker.memory not set.

SPL-D01-008 · WARNING · Network timeout too low
spark.network.timeout below 120s — false-positive executor evictions during GC.

SPL-D01-009 · INFO · Speculation not enabled
spark.speculation not enabled — slow nodes stall entire stages.

SPL-D01-010 · WARNING · Default shuffle partitions unchanged
spark.sql.shuffle.partitions at default 200 — too low for large data, too high for small.
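Several of the D01 rules collapse into a single builder block. A minimal sketch — every value here is illustrative, not a recommendation; size them for your own cluster and workload:

```python
from pyspark.sql import SparkSession

# Illustrative values only -- tune for your own cluster.
spark = (
    SparkSession.builder
    .appName("customer_etl")
    .config("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer")   # SPL-D01-001
    .config("spark.executor.memory", "8g")                  # SPL-D01-002
    .config("spark.driver.memory", "4g")                    # SPL-D01-003
    .config("spark.executor.cores", "4")                    # SPL-D01-005
    .config("spark.executor.memoryOverhead", "1g")          # SPL-D01-006
    .config("spark.network.timeout", "300s")                # SPL-D01-008
    .config("spark.sql.shuffle.partitions", "400")          # SPL-D01-010
    .getOrCreate()
)
```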
🔀 D02 — Shuffle & Sort (8 rules)

SPL-D02-001 · CRITICAL · groupByKey() instead of reduceByKey()
groupByKey shuffles all values — use reduceByKey for 10–100× less shuffle data.
Impact: 10–100× shuffle data reduction for aggregation workloads
Before:
rdd.groupByKey().mapValues(sum)
After:
rdd.reduceByKey(lambda a, b: a + b)

SPL-D02-002 · WARNING · Default shuffle partitions unchanged
spark.sql.shuffle.partitions at default 200 — causes spill on large data.

SPL-D02-003 · WARNING · Unnecessary repartition before join
repartition() before join() causes a redundant shuffle — join already shuffles on key.

SPL-D02-004 · WARNING · orderBy/sort without limit
Global sort without .limit() is O(n log n) — add limit for top-N patterns.

SPL-D02-005 · INFO · distinct() without prior filter
distinct() triggers full shuffle — apply filters first to reduce data volume.

SPL-D02-006 · INFO · Shuffle followed by coalesce
coalesce() after shuffle creates unbalanced partitions — use repartition().

SPL-D02-007 · WARNING · Multiple shuffles in sequence
Two or more shuffle operations in sequence without intermediate caching.

SPL-D02-008 · INFO · Shuffle file buffer too small
spark.shuffle.file.buffer below 1 MB — up to 32× more syscalls per shuffle write.
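The gap behind SPL-D02-001 is easy to see without a cluster. A toy pure-Python simulation of shuffle volume — the partition layout and record counts are invented for illustration, not taken from Spark:

```python
from collections import defaultdict

# Simulate shuffle volume for a word-count-style job:
# 4 partitions, 3 distinct keys, 1000 records.
records = [("key%d" % (i % 3), 1) for i in range(1000)]
partitions = [records[i::4] for i in range(4)]

# groupByKey: every (key, value) record crosses the network.
group_shuffle = sum(len(p) for p in partitions)

# reduceByKey: each partition combines locally first, then ships
# at most one partial sum per key (the map-side combine).
reduce_shuffle = 0
for p in partitions:
    partials = defaultdict(int)
    for k, v in p:
        partials[k] += v
    reduce_shuffle += len(partials)

print(group_shuffle, reduce_shuffle)  # → 1000 12
```

With real data the gap scales with values-per-key, which is where the 10–100× figure comes from.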
🔗 D03 — Joins (10 rules)

SPL-D03-001 · CRITICAL · Cross join / cartesian product
crossJoin() creates O(n²) rows — almost always unintentional, can OOM or run indefinitely.
Impact: Unbounded row explosion; can OOM or run indefinitely
Before:
df.crossJoin(other)  # O(n²) rows!
After:
df.join(other, on='id', how='inner')

SPL-D03-002 · WARNING · Missing broadcast hint on small DataFrame
join() without broadcast hint — small lookup tables should be broadcast to eliminate shuffle.

SPL-D03-003 · CRITICAL · Broadcast threshold disabled
autoBroadcastJoinThreshold = -1 disables all broadcast joins — every join shuffles both sides.

SPL-D03-004 · WARNING · Join without prior filter/select
join() on unfiltered read — push filter/select before join to reduce shuffle volume.

SPL-D03-005 · WARNING · Join key type mismatch risk
Join on differently named key columns — if their types also differ, the comparison silently returns empty results.

SPL-D03-006 · INFO · Multiple joins without intermediate repartition
3+ joins without broadcast/repartition hints — each adds a full shuffle.

SPL-D03-007 · INFO · Self-join that could be a window function
Self-join doubles shuffle — window functions accomplish the same with zero cross-node transfer.

SPL-D03-008 · CRITICAL · Join inside loop
join() in a loop creates N shuffles and O(N) plan growth — causes driver OOM for large N.
Impact: N shuffles where N = loop iterations; plan explosion for large loops
Before:
for date in date_range:
    result = result.join(
        daily_df.filter(f'date="{date}"'), 'id')
After:
all_dates = spark.createDataFrame(date_range)
daily_all = daily_df.join(all_dates, 'date')
result = result.join(daily_all, 'id')

SPL-D03-009 · INFO · Left join without null handling
Left join without null handling — nulls propagate silently, causing wrong aggregations.

SPL-D03-010 · WARNING · CBO/statistics not enabled for complex joins
spark.sql.cbo.enabled not set — Catalyst uses heuristics instead of cost-based join ordering.
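For SPL-D03-002 the fix is a one-line hint. A sketch assuming an active SparkSession and two hypothetical DataFrames, orders (large) and countries (a small lookup table):

```python
from pyspark.sql import functions as F

# Ship the small lookup table to every executor once; the large side
# is then joined in place with no shuffle of either input.
enriched = orders.join(F.broadcast(countries),
                       on="country_code", how="left")
```

The hint only helps when the broadcast side genuinely fits in executor memory — for anything larger, a shuffle join is the right answer.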
📦 D04 — Partitioning (10 rules)

SPL-D04-001 · CRITICAL · repartition(1) bottleneck
repartition(1) forces all data through one task — eliminates all parallelism.

SPL-D04-002 · WARNING · coalesce(1) bottleneck
coalesce(1) reduces to one partition — all downstream stages run single-threaded.

SPL-D04-003 · WARNING · Repartition with very high partition count
repartition(n) above 10,000 — driver OOM, millions of shuffle files, slow task scheduling.

SPL-D04-004 · WARNING · Repartition with very low partition count
repartition(n) where n < 10 — cluster cores sit idle, up to 25× slower.

SPL-D04-005 · INFO · Coalesce before write — unbalanced output files
coalesce() before write creates unbalanced files — use repartition() for even output.

SPL-D04-006 · INFO · Missing partitionBy on write
Writing without partitionBy — every query does a full table scan, no pruning possible.

SPL-D04-007 · WARNING · Over-partitioning — high-cardinality partition column
partitionBy on user_id / *_id creates millions of tiny files — driver OOM on partition listing.

SPL-D04-008 · INFO · Missing bucketBy for repeatedly joined tables
saveAsTable() without bucketBy — every future join on this table requires a full shuffle.

SPL-D04-009 · WARNING · Partition column not used in query filters
Reading partitioned table without a partition filter — full table scan instead of pruned scan.

SPL-D04-010 · WARNING · Repartition by column different from join key
repartition(col_A) then join on col_B — two full shuffles instead of one.
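SPL-D04-005 and SPL-D04-006 combine into one write pattern. A sketch assuming a DataFrame df with an event_date column — the S3 path is hypothetical:

```python
# Before: a flat directory of files -- every query scans everything.
# df.write.parquet("s3://bucket/events/")

# After: one directory per event_date, so filters on event_date read
# only the matching directories; repartition() (not coalesce(),
# per SPL-D04-005) keeps the output file sizes balanced.
(df.repartition("event_date")
   .write
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3://bucket/events/"))
```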
⚖️ D05 — Data Skew (7 rules)

SPL-D05-001 · WARNING · Join on low-cardinality column
join() on status/type/is_* — dominant value routes to one task, rest of cluster idles.

SPL-D05-002 · WARNING · GroupBy on low-cardinality column without secondary key
groupBy on low-cardinality column creates unbalanced partitions — the dominant group becomes a straggler.

SPL-D05-003 · WARNING · AQE skew join handling disabled
skewJoin.enabled = false — skewed partitions not auto-split, straggler tasks persist.

SPL-D05-004 · INFO · AQE skew threshold too high
Skew detection threshold exceeds 1 GB — moderate skew silently degrades performance.

SPL-D05-005 · INFO · Missing salting pattern for known skewed keys
join() on potentially skewed ID column without rand()/explode() salting pattern.

SPL-D05-006 · WARNING · Window function partitioned by skew-prone column
Window.partitionBy on low-cardinality column — executor OOM for dominant partition.

SPL-D05-007 · INFO · Null-heavy join key
join() on null-heavy column without isNotNull filter — all nulls hash to same partition.
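The salting pattern behind SPL-D05-005 is easiest to see outside Spark. A toy pure-Python simulation, with zlib.crc32 standing in for Spark's hash partitioner — the key distribution and bucket count are invented for illustration:

```python
import zlib

def bucket(key: str, n: int = 8) -> int:
    """Deterministic stand-in for Spark's hash partitioner."""
    return zlib.crc32(key.encode()) % n

# 90% of rows carry one hot key -- classic join-key skew.
keys = ["hot"] * 900 + ["k%d" % i for i in range(100)]

# Without salting: every "hot" row lands in the same partition.
plain = [0] * 8
for k in keys:
    plain[bucket(k)] += 1

# With salting: suffix each key with one of 8 salt values; the other
# side of the join is exploded with the same 8 salts so matches
# still line up, but the hot key now spreads across partitions.
salted = [0] * 8
for i, k in enumerate(keys):
    salted[bucket("%s#%d" % (k, i % 8))] += 1

print(max(plain), max(salted))
```

The largest partition shrinks roughly by the salt factor, which is exactly what AQE's skew-join splitting automates at runtime.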
💾 D06 — Caching & Persistence (8 rules)

SPL-D06-001 · WARNING · cache() without unpersist()
Cached DataFrame never released — memory leak causes OOM in long-running jobs.

SPL-D06-002 · WARNING · cache() used only once
Caching a DataFrame that is only read once — serialisation overhead with no benefit.

SPL-D06-003 · CRITICAL · cache() inside loop
cache() in a loop accumulates memory each iteration — OOM after O(N) iterations.
Impact: OOM after O(N) iterations; executor memory accumulates without release
Before:
for epoch in range(10):
    df = model.transform(df).cache()
After:
for epoch in range(10):
    prev = df
    df = model.transform(df).checkpoint()
    prev.unpersist()

SPL-D06-004 · WARNING · cache() before filter
filter() after cache() — caches all rows when only filtered subset is needed.

SPL-D06-005 · INFO · MEMORY_ONLY storage level for potentially large datasets
MEMORY_ONLY silently drops partitions under pressure — prefer MEMORY_AND_DISK.

SPL-D06-006 · WARNING · Reused DataFrame without cache
Expensive DataFrame (join/groupBy) used multiple times without cache — N recomputations.

SPL-D06-007 · INFO · cache() after repartition
cache() after repartition() is wasteful if the result is only used once.

SPL-D06-008 · INFO · checkpoint vs cache misuse
checkpoint() in non-iterative code — unnecessary disk write, use cache() instead.
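The caching rules above reduce to one discipline: filter first, cache what is reused, release it when done. A sketch assuming hypothetical orders/customers DataFrames and invented output paths:

```python
# Filter before caching (SPL-D06-004), then cache the expensive,
# reused result (SPL-D06-006)...
base = (orders.join(customers, "customer_id")
              .filter("status = 'active'")
              .cache())

daily = base.groupBy("order_date").count()
by_region = base.groupBy("region").count()
daily.write.mode("overwrite").parquet("s3://bucket/daily/")
by_region.write.mode("overwrite").parquet("s3://bucket/by_region/")

# ...and release the memory once both consumers have run (SPL-D06-001).
base.unpersist()
```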
📂 D07 — I/O & File Formats (10 rules)

SPL-D07-001 · WARNING · CSV/JSON used for analytical workload
Row-oriented formats — no column pruning or predicate pushdown, 5–20× slower than Parquet.

SPL-D07-002 · WARNING · Schema inference enabled
inferSchema=True reads data twice — 2× I/O per job, unstable schemas in production.

SPL-D07-003 · WARNING · select("*") — no column pruning
select("*") forces all columns to be decoded from storage — defeats columnar pruning.

SPL-D07-004 · INFO · Filter applied after join — missing predicate pushdown
filter() after join() — push filter before join to reduce shuffle volume.

SPL-D07-005 · INFO · Small file problem on write
partitionBy() without controlling partition count — creates O(partitions × values) small files.

SPL-D07-006 · CRITICAL · JDBC read without partition parameters
JDBC read without partitionColumn/numPartitions — single-thread sequential read, does not scale.
Impact: Sequential single-thread read; does not scale to large JDBC tables
Before:
df = spark.read.jdbc(url, 'orders')
After:
df = spark.read.jdbc(
  url, 'orders',
  column='order_id',
  lowerBound=1,
  upperBound=10_000_000,
  numPartitions=50)

SPL-D07-007 · INFO · Parquet compression not set
Writing Parquet without explicit compression — cluster-dependent codec, may write uncompressed.

SPL-D07-008 · INFO · Write mode not specified
write without .mode() defaults to 'error' — job fails on re-run if output exists.

SPL-D07-009 · INFO · No format specified on read/write
read.load()/write.save() without .format() — format depends on cluster config.

SPL-D07-010 · INFO · mergeSchema enabled without necessity
mergeSchema=true reads footer of every Parquet file — O(files) metadata scan on every read.
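For SPL-D07-002, pinning the schema removes the extra inference pass. A sketch assuming an active SparkSession; the path and the two-column layout are invented for illustration:

```python
from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType)

# Before: two full passes over the data just to guess types.
# df = spark.read.csv("s3://bucket/orders/", header=True,
#                     inferSchema=True)

# After: one pass, and stable types across runs.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])
df = spark.read.csv("s3://bucket/orders/", header=True, schema=schema)
# (For analytical workloads, SPL-D07-001 pushes this toward Parquet anyway.)
```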
🧠 D08 — Adaptive Query Execution (7 rules)

SPL-D08-001 · CRITICAL · AQE disabled
spark.sql.adaptive.enabled = false — loses partition coalescing, skew join, and join strategy switching.

SPL-D08-002 · WARNING · AQE coalesce partitions disabled
coalescePartitions.enabled = false — hundreds of tiny tasks, high driver scheduling overhead.

SPL-D08-003 · INFO · AQE advisory partition size too small
advisoryPartitionSizeInBytes below 16 MB — too many small post-coalesce partitions.

SPL-D08-004 · WARNING · AQE skew join disabled with skew-prone joins detected
skewJoin.enabled = false with joins present — skewed partitions not auto-split at runtime.

SPL-D08-005 · INFO · AQE skew factor too aggressive
skewedPartitionFactor below 2 — every above-median partition treated as skewed, excessive splits.

SPL-D08-006 · INFO · AQE local shuffle reader disabled
localShuffleReader.enabled = false — forces remote reads even for locally available shuffle data.

SPL-D08-007 · INFO · Manual shuffle partition count set high with AQE enabled
shuffle.partitions > 400 with AQE on — unnecessary map-side overhead; tune advisoryPartitionSizeInBytes instead.
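The D08 defaults are already sane in Spark 3.2+ — these settings only matter when something upstream has turned them off. A configuration sketch (the advisory size is illustrative, not a recommendation):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")             # SPL-D08-001
spark.conf.set(
    "spark.sql.adaptive.coalescePartitions.enabled", "true")     # SPL-D08-002
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")    # SPL-D08-004
spark.conf.set(
    "spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")   # SPL-D08-003
```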
🐍 D09 — UDFs & Code Patterns (12 rules)

SPL-D09-001 · WARNING · Python UDF (row-at-a-time) detected
@udf executes row-at-a-time across the JVM↔Python boundary — 2–10× slowdown, blocks Catalyst.

SPL-D09-002 · WARNING · Python UDF replaceable with native Spark function
@udf applies a built-in string method — use native SQL function (lower, trim, etc.) instead.

SPL-D09-003 · CRITICAL · withColumn() inside loop
withColumn() in a loop rebuilds the full query plan each iteration — O(N²) plan growth, driver hangs for N > 100.
Impact: O(N²) plan building; driver hangs for N > 100 columns
Before:
for col_name in column_list:
    df = df.withColumn(col_name, f(col(col_name)))
After:
df = df.withColumns(
    {c: f(col(c)) for c in column_list})  # one plan rebuild (Spark 3.3+)

SPL-D09-004 · CRITICAL · Row-by-row iteration over DataFrame
Iterating over df.collect() — forces all data to the driver, entire cluster idles.

SPL-D09-005 · CRITICAL · .collect() without prior filter or limit
.collect() on unbounded DataFrame — driver OOM for datasets larger than heap.

SPL-D09-006 · CRITICAL · .toPandas() on potentially large DataFrame
.toPandas() without limit() — materialises entire dataset in driver memory.

SPL-D09-007 · WARNING · .count() used for emptiness check
df.count() == 0 triggers full table scan — use df.isEmpty() (Spark 3.3+) instead.

SPL-D09-008 · INFO · .show() in production code
.show() triggers materialisation and prints to stdout — remove before production.

SPL-D09-009 · INFO · .explain() or .printSchema() in production code
Debugging tools that print to stdout — remove or guard with a debug flag.

SPL-D09-010 · WARNING · .rdd conversion dropping out of DataFrame API
.rdd access deserialises columnar rows to Python Row objects — bypasses Catalyst and Tungsten.

SPL-D09-011 · INFO · pandas_udf without type annotations
@pandas_udf without type hints — relies on runtime inference, uses deprecated API.

SPL-D09-012 · WARNING · Nested UDF calls
A UDF calls another UDF — each adds a separate JVM↔Python serialisation boundary.
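SPL-D09-002 in practice: most string-manipulation UDFs map one-to-one onto pyspark.sql.functions. A sketch assuming a DataFrame df with a name column:

```python
from pyspark.sql import functions as F

# Before -- every row crosses the JVM <-> Python boundary:
# @F.udf("string")
# def clean(s):
#     return s.strip().lower() if s else None
# df = df.withColumn("name", clean("name"))

# After -- the same logic stays inside the JVM, and Catalyst can
# optimise through it (e.g. still push filters down, per SPL-D10-001):
df = df.withColumn("name", F.lower(F.trim("name")))
```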
🔍 D10 — Catalyst Optimizer (6 rules)

SPL-D10-001 · WARNING · UDF blocks predicate pushdown
@udf creates an opaque boundary — Catalyst cannot push filters below the UDF, full scan required.

SPL-D10-002 · WARNING · CBO not enabled for complex queries
File has 2+ joins but CBO disabled — Catalyst uses heuristics, 5–20× sub-optimal plans.

SPL-D10-003 · INFO · Join reordering disabled for multi-table joins
3+ joins without CBO join reordering — executed left-to-right, may produce large intermediates.

SPL-D10-004 · INFO · Table statistics not collected
Joins present but no ANALYZE TABLE — CBO operates without accurate row counts or cardinalities.

SPL-D10-005 · INFO · Non-deterministic function in filter
rand()/uuid() in filter() — potential double-evaluation by Catalyst, non-reproducible results.

SPL-D10-006 · INFO · Deep method chain (>20 chained ops)
Single expression with 20+ chained operations — plan visitor stack overflow risk, hard to debug.
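SPL-D10-002 and SPL-D10-004 pair up: CBO only helps once statistics exist. A sketch against a hypothetical sales.orders table (the table and column names are invented):

```python
# Collect table- and column-level statistics so the cost-based
# optimiser has real row counts and cardinalities (SPL-D10-004)...
spark.sql("ANALYZE TABLE sales.orders COMPUTE STATISTICS")
spark.sql(
    "ANALYZE TABLE sales.orders "
    "COMPUTE STATISTICS FOR COLUMNS customer_id, order_date")

# ...then let Catalyst use them to order joins by cost
# (SPL-D10-002 / SPL-D10-003).
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
```

Statistics go stale as the table grows, so the ANALYZE step belongs in the load pipeline, not a one-off.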
📊 D11 — Monitoring & Observability (5 rules)

SPL-D11-001 · INFO · No explain() for plan validation in test file
Test file has complex Spark ops but no explain() — plan regressions invisible until production.

SPL-D11-002 · INFO · No Spark listener configured
SparkSession without a listener — job metrics and query events not captured in production.

SPL-D11-003 · INFO · No metrics logging in long-running job
2+ write operations but no Python logging — silent data drops and slow stages invisible.

SPL-D11-004 · INFO · Missing error handling around Spark actions
Spark actions (count, collect, write) not in try/except — partial writes, no error context in logs.

SPL-D11-005 · INFO · Hardcoded storage path
Spark read/write uses hardcoded string path — environment coupling, production paths in source control.
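SPL-D11-003 and SPL-D11-004 together come down to: wrap actions, and log what moved. A sketch with a hypothetical df and an invented output path:

```python
import logging

log = logging.getLogger("customer_etl")
output_path = "s3://bucket/output/"  # hypothetical; see SPL-D11-005

try:
    row_count = df.count()
    log.info("writing %d rows to %s", row_count, output_path)
    df.write.mode("overwrite").parquet(output_path)
except Exception:
    # Surface the failing path and the stack trace instead of
    # letting the job die silently mid-write.
    log.exception("write failed for %s", output_path)
    raise
```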