Apache Spark Performance Linter
spark-perf-lint
Enterprise-grade static analysis for PySpark — catches anti-patterns before they reach production, explains why they hurt, and hands you a concrete fix.
93
Performance Rules
CRITICAL · WARNING · INFO
11
Dimensions
Config · Shuffle · Joins · …
3
Tiers
Pre-commit · CI/LLM · Runtime
<3s
Pre-commit Speed
Zero Spark dependency
Installation
Quick Start
Up and running in 30 seconds. No Spark installation required for Tier 1.
terminal
# Install
pip install spark-perf-lint

# Scan your PySpark code
spark-perf-lint scan src/

# Or add as pre-commit hook
spark-perf-lint init && pre-commit install
How it works
1
Discovers .py files that import PySpark
2
Parses Python AST — no Spark runtime needed
3
Matches 93 rules across 11 performance dimensions
4
Reports findings with severity, fix, and impact
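The AST pass in steps 2 and 3 can be sketched with Python's stdlib `ast` module. This is a toy, hypothetical detector for a single rule (the cross-join check, SPL-D03-001), not the real rule engine; it shows why no Spark runtime is needed:

```python
import ast

def find_cross_joins(source: str, filename: str = "<string>"):
    """Return (line, message) findings for crossJoin() calls; a toy
    version of rule SPL-D03-001. Pure stdlib: no Spark runtime needed."""
    findings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        # Match any call of the form <expr>.crossJoin(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "crossJoin"):
            findings.append((node.lineno,
                             "CRITICAL: cross join / cartesian product"))
    return findings

if __name__ == "__main__":
    code = "result = df.crossJoin(other)\nok = df.join(other, 'id')\n"
    print(find_cross_joins(code))
```

Because the source is only parsed, never executed, `df` need not exist and the scan stays fast enough for a pre-commit hook.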
Example Output
What You'll See
Clear, actionable findings with severity levels and concrete recommendations.
bash
$ spark-perf-lint scan etl/customer_pipeline.py
✗ CRITICAL SPL-D03-001 etl/customer_pipeline.py:42
Cross join / cartesian product
→ df.join(other, on='id', how='inner')
✗ CRITICAL SPL-D09-005 etl/customer_pipeline.py:87
.collect() without prior filter or limit
→ df.join(other, 'id').limit(1000).collect()
⚠ WARNING SPL-D06-001 etl/customer_pipeline.py:55
.cache() without .unpersist()
→ Call df_cached.unpersist() when done
⚠ WARNING SPL-D01-010 etl/customer_pipeline.py:12
Default shuffle partitions unchanged (200)
→ .config("spark.sql.shuffle.partitions", "400")
ℹ INFO SPL-D07-003 etl/customer_pipeline.py:31
select("*") disables column pruning
→ df.select('status').groupBy('status').count()
──────────────────────────────────────────────
Scanned 1 file · 5 findings · 2 critical · 2 warnings · 1 info
✗ FAILED (2 critical findings)
Coverage
11 Dimensions of Spark Performance
Every dimension of PySpark performance, from cluster configuration to production observability.
D01
Cluster Config
Serializer, memory settings, executor sizing, timeouts, and dynamic allocation.
10 rules
D02
Shuffle & Sort
groupByKey vs reduceByKey, redundant repartition, orderBy without limit.
8 rules
D03
Joins
Cross joins, missing broadcast hints, join inside loop, CBO settings.
10 rules
D04
Partitioning
repartition(1) bottlenecks, over/under-partitioning, bucketBy strategy.
10 rules
D05
Data Skew
Low-cardinality join keys, missing salting patterns, null-heavy keys.
7 rules
D06
Caching
cache without unpersist, cache inside loop, wrong storage level.
8 rules
D07
I/O & Formats
CSV for analytics, schema inference, JDBC without partitioning, mergeSchema.
10 rules
D08
AQE
AQE disabled, coalesce disabled, skew join off, advisory partition size.
7 rules
D09
UDFs & Code
Python UDFs, withColumn in loop, collect without filter, toPandas on large data.
12 rules
D10
Catalyst
UDF blocks pushdown, CBO disabled, missing table statistics, deep chains.
6 rules
D11
Monitoring
No Spark listeners, no metrics logging, hardcoded paths, missing error handling.
5 rules
93
Total Rules
across 11 dimensions
Architecture
Three-Tier Architecture
From fast pre-commit hook to deep runtime analysis — use the tier that fits your workflow.
Tier 1
Pre-Commit Hook
Pure Python AST analysis. Runs on every
git commit, scanning only staged files. Zero Spark dependency — works in any environment.
<3s
Scan time
Free
Cost
No
Spark needed
93
Rules
Tier 2
CI/PR Analysis
Claude API enrichment. Sends full file context to Claude for context-aware explanations and refactoring suggestions. Runs on pull requests.
10–30s
Scan time
~$0.05
Per PR
No
Spark needed
API key
Required
Tier 3
Deep Audit
Live Spark runtime analysis. Inspects physical query plans, runs benchmark comparisons, and profiles real execution metrics against a running cluster.
5–15m
Scan time
Cluster
Cost
Yes
Spark needed
Notebook
Interface
Integration
Pre-Commit Hook Setup
Block bad PySpark from ever reaching your repo. Installs in 60 seconds.
Step 01
Install pre-commit
pip install pre-commit
Step 02
Add to .pre-commit-config.yaml
Add the repo stanza shown below
Step 03
Install the hook
pre-commit install
Step 04
Test it
pre-commit run --all-files
.pre-commit-config.yaml
repos:
  - repo: https://github.com/sunilpradhansharma/spark-perf-lint
    rev: v0.1.0
    hooks:
      - id: spark-perf-lint
What happens on commit
$ git commit -m "add customer ETL pipeline"
spark-perf-lint...................................................Failed
✗ CRITICAL SPL-D08-001 etl/pipeline.py:8
AQE disabled — spark.sql.adaptive.enabled = false
→ Remove the line — AQE is on by default in Spark 3.2+
Commit blocked. Fix findings and try again.
Use git commit --no-verify to bypass (emergencies only)
Complete Reference
All 93 Rules
Every rule with ID, severity, description, and expandable before/after examples.
D01 — Cluster Configuration
10 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D01-001 | WARNING | Missing Kryo serializer — spark.serializer not set, so the ~10× slower Java serializer is used. Impact: 10× serialization speedup; 3–5× reduction in shuffle data size. Before: `SparkSession.builder.appName("job").getOrCreate()` After: `spark = (SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").getOrCreate())` |
| SPL-D01-002 | WARNING | Executor memory not configured spark.executor.memory not set — cluster default (often 1g) used. |
| SPL-D01-003 | WARNING | Driver memory not configured spark.driver.memory not set — risk of OOM on collect or broadcast. |
| SPL-D01-004 | INFO | Dynamic allocation not enabled spark.dynamicAllocation.enabled not true — idle executors waste resources. |
| SPL-D01-005 | WARNING | Executor cores set too high spark.executor.cores above 5 — HDFS throughput degrades, GC pauses increase. |
| SPL-D01-006 | WARNING | Memory overhead too low spark.executor.memoryOverhead below 384 MB — container killed without OOM log. |
| SPL-D01-007 | WARNING | Missing PySpark worker memory config Python UDFs present but spark.python.worker.memory not set. |
| SPL-D01-008 | WARNING | Network timeout too low spark.network.timeout below 120s — false-positive executor evictions during GC. |
| SPL-D01-009 | INFO | Speculation not enabled spark.speculation not enabled — slow nodes stall entire stages. |
| SPL-D01-010 | WARNING | Default shuffle partitions unchanged spark.sql.shuffle.partitions at default 200 — too low for large data, too high for small. |
D02 — Shuffle & Sort
8 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D02-001 | CRITICAL | groupByKey() instead of reduceByKey() — groupByKey shuffles all values; use reduceByKey for 10–100× less shuffle data. Impact: 10–100× shuffle data reduction for aggregation workloads. Before: `rdd.groupByKey().mapValues(sum)` After: `rdd.reduceByKey(lambda a, b: a + b)` |
| SPL-D02-002 | WARNING | Default shuffle partitions unchanged spark.sql.shuffle.partitions at default 200 — causes spill on large data. |
| SPL-D02-003 | WARNING | Unnecessary repartition before join repartition() before join() causes a redundant shuffle — join already shuffles on key. |
| SPL-D02-004 | WARNING | orderBy/sort without limit Global sort without .limit() is O(n log n) — add limit for top-N patterns. |
| SPL-D02-005 | INFO | distinct() without prior filter distinct() triggers full shuffle — apply filters first to reduce data volume. |
| SPL-D02-006 | INFO | Shuffle followed by coalesce coalesce() after shuffle creates unbalanced partitions — use repartition(). |
| SPL-D02-007 | WARNING | Multiple shuffles in sequence Two or more shuffle operations in sequence without intermediate caching. |
| SPL-D02-008 | INFO | Shuffle file buffer too small spark.shuffle.file.buffer below 1 MB — up to 32× more syscalls per shuffle write. |
D03 — Joins
10 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D03-001 | CRITICAL | Cross join / cartesian product — crossJoin() creates O(n²) rows; almost always unintentional. Impact: unbounded row explosion; can OOM or run indefinitely. Before: `df.crossJoin(other)  # O(n²) rows!` After: `df.join(other, on='id', how='inner')` |
| SPL-D03-002 | WARNING | Missing broadcast hint on small DataFrame join() without broadcast hint — small lookup tables should be broadcast to eliminate shuffle. |
| SPL-D03-003 | CRITICAL | Broadcast threshold disabled autoBroadcastJoinThreshold = -1 disables all broadcast joins — every join shuffles both sides. |
| SPL-D03-004 | WARNING | Join without prior filter/select join() on unfiltered read — push filter/select before join to reduce shuffle volume. |
| SPL-D03-005 | WARNING | Join key type mismatch risk Joining on different column names — type mismatch produces silent empty results. |
| SPL-D03-006 | INFO | Multiple joins without intermediate repartition 3+ joins without broadcast/repartition hints — each adds a full shuffle. |
| SPL-D03-007 | INFO | Self-join that could be window function Self-join doubles shuffle — window functions accomplish the same with zero cross-node transfer. |
| SPL-D03-008 | CRITICAL | Join inside loop — join() in a loop creates N shuffles and O(N) plan growth; causes driver OOM for large N. Impact: N shuffles where N = loop iterations; plan explosion for large loops. Before: `for date in date_range: result = result.join(daily_df.filter(f'date="{date}"'), 'id')` After: `all_dates = spark.createDataFrame(date_range); daily_all = daily_df.join(all_dates, 'date'); result = result.join(daily_all, 'id')` |
| SPL-D03-009 | INFO | Left join without null handling Left join without null handling — nulls propagate silently, causing wrong aggregations. |
| SPL-D03-010 | WARNING | CBO/statistics not enabled for complex joins spark.sql.cbo.enabled not set — Catalyst uses heuristics instead of cost-based join ordering. |
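In PySpark, the fix for SPL-D03-002 is simply `df.join(broadcast(lookup_df), 'id')` using `pyspark.sql.functions.broadcast`. To show why that helps, here is the underlying map-side hash-join mechanic sketched in plain Python (no Spark required; `broadcast_hash_join` and the sample rows are illustrative, not the linter's or Spark's API):

```python
def broadcast_hash_join(big_partition, small_table, key):
    """Map-side hash join: build a lookup from the small (broadcast) side
    once, then probe it per row of the big side; the big table is never
    shuffled across the network."""
    lookup = {row[key]: row for row in small_table}  # what Spark ships to every executor
    for row in big_partition:
        match = lookup.get(row[key])
        if match is not None:  # inner-join semantics
            yield {**row, **match}

orders = [{"id": 1, "amt": 10}, {"id": 2, "amt": 5}, {"id": 9, "amt": 7}]
countries = [{"id": 1, "cc": "US"}, {"id": 2, "cc": "DE"}]
joined = list(broadcast_hash_join(orders, countries, "id"))
```

This is why disabling `autoBroadcastJoinThreshold` (SPL-D03-003) hurts: without the broadcast, both sides must be shuffled on the join key instead.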
D04 — Partitioning
10 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D04-001 | CRITICAL | repartition(1) bottleneck repartition(1) forces all data through one task — eliminates all parallelism. |
| SPL-D04-002 | WARNING | coalesce(1) bottleneck coalesce(1) reduces to one partition — all downstream stages run single-threaded. |
| SPL-D04-003 | WARNING | Repartition with very high partition count repartition(n) above 10,000 — driver OOM, millions of shuffle files, slow task scheduling. |
| SPL-D04-004 | WARNING | Repartition with very low partition count repartition(n) where n < 10 — cluster cores sit idle, up to 25× slower. |
| SPL-D04-005 | INFO | Coalesce before write — unbalanced output files coalesce() before write creates unbalanced files — use repartition() for even output. |
| SPL-D04-006 | INFO | Missing partitionBy on write Writing without partitionBy — every query does a full table scan, no pruning possible. |
| SPL-D04-007 | WARNING | Over-partitioning — high-cardinality partition column partitionBy on user_id / *_id creates millions of tiny files — driver OOM on partition listing. |
| SPL-D04-008 | INFO | Missing bucketBy for repeatedly joined tables saveAsTable() without bucketBy — every future join on this table requires a full shuffle. |
| SPL-D04-009 | WARNING | Partition column not used in query filters Reading partitioned table without a partition filter — full table scan instead of pruned scan. |
| SPL-D04-010 | WARNING | Repartition by column different from join key repartition(col_A) then join on col_B — two full shuffles instead of one. |
D05 — Data Skew
7 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D05-001 | WARNING | Join on low-cardinality column join() on status/type/is_* — dominant value routes to one task, rest of cluster idles. |
| SPL-D05-002 | WARNING | GroupBy on low-cardinality column without secondary key groupBy on low-cardinality column creates unbalanced partitions — dominant group straggler. |
| SPL-D05-003 | WARNING | AQE skew join handling disabled skewJoin.enabled = false — skewed partitions not auto-split, straggler tasks persist. |
| SPL-D05-004 | INFO | AQE skew threshold too high Skew detection threshold exceeds 1 GB — moderate skew silently degrades performance. |
| SPL-D05-005 | INFO | Missing salting pattern for known skewed keys join() on potentially skewed ID column without rand()/explode() salting pattern. |
| SPL-D05-006 | WARNING | Window function partitioned by skew-prone column Window.partitionBy on low-cardinality column — executor OOM for dominant partition. |
| SPL-D05-007 | INFO | Null-heavy join key join() on null-heavy column without isNotNull filter — all nulls hash to same partition. |
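SPL-D05-005 mentions the salting pattern without showing it. The idea, sketched in plain Python with a deterministic stand-in for Spark's hash partitioner (all names and counts here are illustrative; in Spark you would append a random salt to the skewed key, e.g. via `concat` with `(rand() * N).cast('int')`, and explode the matching salt range on the other side of the join):

```python
import zlib

NUM_PARTITIONS = 8   # stand-in for the shuffle partition count
SALT_BUCKETS = 8     # number of salt values appended to the skewed key

def partition_of(key: str) -> int:
    """Deterministic stand-in for Spark's hash partitioner."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Unsalted: every row carrying the skewed key hashes to the same partition,
# so one straggler task processes all of them while the cluster idles.
hot_plain = {partition_of("hot_key")}

# Salted: the same key fans out across up to SALT_BUCKETS partitions.
# (The other join side must be exploded across the same salt range to match.)
hot_salted = {partition_of(f"hot_key_{salt}") for salt in range(SALT_BUCKETS)}

print(f"partitions reached by the hot key: plain={len(hot_plain)}, salted={len(hot_salted)}")
```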
D06 — Caching & Persistence
8 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D06-001 | WARNING | cache() without unpersist() Cached DataFrame never released — memory leak causes OOM in long-running jobs. |
| SPL-D06-002 | WARNING | cache() used only once Caching a DataFrame that is only read once — serialisation overhead with no benefit. |
| SPL-D06-003 | CRITICAL | cache() inside loop — cache() in a loop accumulates memory each iteration; OOM after O(N) iterations. Impact: executor memory accumulates without release. Before: `for epoch in range(10): df = model.transform(df).cache()` After: `for epoch in range(10): prev = df; df = model.transform(df).checkpoint(); prev.unpersist()` |
| SPL-D06-004 | WARNING | cache() before filter filter() after cache() — caches all rows when only filtered subset is needed. |
| SPL-D06-005 | INFO | MEMORY_ONLY storage level for potentially large datasets MEMORY_ONLY silently drops partitions under pressure — prefer MEMORY_AND_DISK. |
| SPL-D06-006 | WARNING | Reused DataFrame without cache Expensive DataFrame (join/groupBy) used multiple times without cache — N recomputations. |
| SPL-D06-007 | INFO | cache() after repartition cache() after repartition() is wasteful if the result is only used once. |
| SPL-D06-008 | INFO | checkpoint vs cache misuse checkpoint() in non-iterative code — unnecessary disk write, use cache() instead. |
D07 — I/O & File Formats
10 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D07-001 | WARNING | CSV/JSON used for analytical workload Row-oriented formats — no column pruning or predicate pushdown, 5–20× slower than Parquet. |
| SPL-D07-002 | WARNING | Schema inference enabled inferSchema=True reads data twice — 2× I/O per job, unstable schemas in production. |
| SPL-D07-003 | WARNING | select("*") — no column pruning select("*") forces all columns to be decoded from storage — defeats columnar pruning. |
| SPL-D07-004 | INFO | Filter applied after join — missing predicate pushdown filter() after join() — push filter before join to reduce shuffle volume. |
| SPL-D07-005 | INFO | Small file problem on write partitionBy() without controlling partition count — creates O(partitions × values) small files. |
| SPL-D07-006 | CRITICAL | JDBC read without partition parameters — no partitionColumn/numPartitions means a single-thread sequential read that does not scale. Impact: sequential single-thread read of large JDBC tables. Before: `df = spark.read.jdbc(url, 'orders')` After: `df = spark.read.jdbc(url, 'orders', column='order_id', lowerBound=1, upperBound=10_000_000, numPartitions=50)` |
| SPL-D07-007 | INFO | Parquet compression not set Writing Parquet without explicit compression — cluster-dependent codec, may write uncompressed. |
| SPL-D07-008 | INFO | Write mode not specified write without .mode() defaults to 'error' — job fails on re-run if output exists. |
| SPL-D07-009 | INFO | No format specified on read/write read.load()/write.save() without .format() — format depends on cluster config. |
| SPL-D07-010 | INFO | mergeSchema enabled without necessity mergeSchema=true reads footer of every Parquet file — O(files) metadata scan on every read. |
D08 — Adaptive Query Execution
7 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D08-001 | CRITICAL | AQE disabled spark.sql.adaptive.enabled = false — loses partition coalescing, skew join, and join strategy switching. |
| SPL-D08-002 | WARNING | AQE coalesce partitions disabled coalescePartitions.enabled = false — hundreds of tiny tasks, high driver scheduling overhead. |
| SPL-D08-003 | INFO | AQE advisory partition size too small advisoryPartitionSizeInBytes below 16 MB — too many small post-coalesce partitions. |
| SPL-D08-004 | WARNING | AQE skew join disabled with skew-prone joins detected skewJoin.enabled = false with joins present — skewed partitions not auto-split at runtime. |
| SPL-D08-005 | INFO | AQE skew factor too aggressive skewedPartitionFactor below 2 — every above-median partition treated as skewed, excessive splits. |
| SPL-D08-006 | INFO | AQE local shuffle reader disabled localShuffleReader.enabled = false — forces remote reads even for locally available shuffle data. |
| SPL-D08-007 | INFO | Manual shuffle partition count set high with AQE enabled shuffle.partitions > 400 with AQE on — unnecessary map-side overhead; tune advisoryPartitionSizeInBytes instead. |
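Most D08 findings reduce to a handful of session settings. A hedged sketch of applying them in one place; the values shown are illustrative starting points rather than universal tuning advice, and on Spark 3.2+ the first three are already the defaults:

```python
# Illustrative AQE settings covering rules SPL-D08-001 through -003 above.
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Target size for post-coalesce partitions; "64MB" is the stock default.
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64MB",
}

def apply_conf(builder, conf):
    """Fold a settings dict into a SparkSession builder via .config()."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

# Usage (requires Spark):
#   spark = apply_conf(SparkSession.builder.appName("job"), aqe_conf).getOrCreate()
```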
D09 — UDFs & Code Patterns
12 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D09-001 | WARNING | Python UDF (row-at-a-time) detected @udf executes row-at-a-time across the JVM↔Python boundary — 2–10× slowdown, blocks Catalyst. |
| SPL-D09-002 | WARNING | Python UDF replaceable with native Spark function @udf applies a built-in string method — use native SQL function (lower, trim, etc.) instead. |
| SPL-D09-003 | CRITICAL | withColumn() inside loop — withColumn() in a loop rebuilds the full query plan each iteration; O(N²) plan growth, driver hangs for N > 100 columns. Impact: O(N²) plan building. Before: `for col_name in column_list: df = df.withColumn(col_name, f(col(col_name)))` After: `exprs = [f(col(c)).alias(c) for c in column_list]; df = df.select('*', *exprs)` |
| SPL-D09-004 | CRITICAL | Row-by-row iteration over DataFrame Iterating over df.collect() — forces all data to driver, entire cluster idles. |
| SPL-D09-005 | CRITICAL | .collect() without prior filter or limit .collect() on unbounded DataFrame — driver OOM for datasets larger than heap. |
| SPL-D09-006 | CRITICAL | .toPandas() on potentially large DataFrame .toPandas() without limit() — materialises entire dataset in driver memory. |
| SPL-D09-007 | WARNING | .count() used for emptiness check df.count() == 0 triggers full table scan — use df.isEmpty (Spark 3.3+) instead. |
| SPL-D09-008 | INFO | .show() in production code .show() triggers materialisation and prints to stdout — remove before production. |
| SPL-D09-009 | INFO | .explain() or .printSchema() in production code Debugging tools that print to stdout — remove or guard with a debug flag. |
| SPL-D09-010 | WARNING | .rdd conversion dropping out of DataFrame API .rdd access deserialises columnar rows to Python Row objects — bypasses Catalyst and Tungsten. |
| SPL-D09-011 | INFO | pandas_udf without type annotations @pandas_udf without type hints — relies on runtime inference, uses deprecated API. |
| SPL-D09-012 | WARNING | Nested UDF calls A UDF calls another UDF — each adds a separate JVM↔Python serialisation boundary. |
D10 — Catalyst Optimizer
6 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D10-001 | WARNING | UDF blocks predicate pushdown @udf creates opaque boundary — Catalyst cannot push filters below the UDF, full scan required. |
| SPL-D10-002 | WARNING | CBO not enabled for complex queries File has 2+ joins but CBO disabled — Catalyst uses heuristics, 5–20× sub-optimal plans. |
| SPL-D10-003 | INFO | Join reordering disabled for multi-table joins 3+ joins without CBO join reordering — executed left-to-right, may produce large intermediates. |
| SPL-D10-004 | INFO | Table statistics not collected Joins present but no ANALYZE TABLE — CBO operates without accurate row counts or cardinalities. |
| SPL-D10-005 | INFO | Non-deterministic function in filter rand()/uuid() in filter() — potential double-evaluation by Catalyst, non-reproducible results. |
| SPL-D10-006 | INFO | Deep method chain (>20 chained ops) Single expression with 20+ chained operations — plan visitor stack overflow risk, hard to debug. |
D11 — Monitoring & Observability
5 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D11-001 | INFO | No explain() for plan validation in test file Test file has complex Spark ops but no explain() — plan regressions invisible until production. |
| SPL-D11-002 | INFO | No Spark listener configured SparkSession without a listener — job metrics and query events not captured in production. |
| SPL-D11-003 | INFO | No metrics logging in long-running job 2+ write operations but no Python logging — silent data drops and slow stages invisible. |
| SPL-D11-004 | INFO | Missing error handling around Spark actions Spark actions (count, collect, write) not in try/except — partial writes, no error context in logs. |
| SPL-D11-005 | INFO | Hardcoded storage path Spark read/write uses hardcoded string path — environment coupling, production paths in source control. |
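The error-handling gap flagged by SPL-D11-003 and -004 can be closed with one small wrapper around Spark actions. A minimal sketch in plain Python; `run_action` and the logger name are illustrative choices of ours, not linter or Spark API:

```python
import logging

logger = logging.getLogger("etl")  # illustrative logger name

def run_action(step_name, action, *args, **kwargs):
    """Run a Spark action (df.count, df.write.parquet, ...) with logging and
    error context, so a failure names the pipeline step instead of surfacing
    only a bare stack trace."""
    logger.info("starting %s", step_name)
    try:
        result = action(*args, **kwargs)
    except Exception:
        logger.exception("step %r failed", step_name)
        raise
    logger.info("finished %s", step_name)
    return result

# Usage against a real DataFrame (output_path supplied by config rather than
# hardcoded, per SPL-D11-005):
#   n = run_action("count customers", df.count)
#   run_action("write output", df.write.mode("overwrite").parquet, output_path)
```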