Apache Spark Performance Linter
spark-perf-lint
Enterprise-grade static analysis for PySpark — catches anti-patterns before they reach production, explains why they hurt, and hands you a concrete fix.
93
Performance Rules
CRITICAL · WARNING · INFO
11
Dimensions
Config · Shuffle · Joins · …
3
Tiers
Pre-commit · CI/LLM · Runtime
<3s
Pre-commit Speed
Zero Spark dependency
Installation
Quick Start
Up and running in 30 seconds. No Spark installation required for Tier 1.
terminal
# Install
pip install spark-perf-lint

# Scan your PySpark code
spark-perf-lint scan src/

# Or add as pre-commit hook
spark-perf-lint init && pre-commit install
How it works
1
Discovers .py files that import PySpark
2
Parses Python AST — no Spark runtime needed
3
Matches 93 rules across 11 performance dimensions
4
Reports findings with severity, fix, and impact
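The AST pass in steps 2 and 3 can be sketched with Python's stdlib `ast` module. This is a toy, hypothetical detector for a single rule (the cross-join check, SPL-D03-001), not the real rule engine; it shows why no Spark runtime is needed:

```python
import ast

def find_cross_joins(source: str, filename: str = "<string>"):
    """Return (line, message) findings for crossJoin() calls; a toy
    version of rule SPL-D03-001. Pure stdlib: no Spark runtime needed."""
    findings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        # Match any call of the form <expr>.crossJoin(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "crossJoin"):
            findings.append((node.lineno,
                             "CRITICAL: cross join / cartesian product"))
    return findings

if __name__ == "__main__":
    code = "result = df.crossJoin(other)\nok = df.join(other, 'id')\n"
    print(find_cross_joins(code))
```

Because the source is only parsed, never executed, `df` need not exist and the scan stays fast enough for a pre-commit hook.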
Example Output
What You'll See
Clear, actionable findings with severity levels and concrete recommendations.
bash
$ spark-perf-lint scan etl/customer_pipeline.py
✗ CRITICAL SPL-D03-001 etl/customer_pipeline.py:42
Cross join / cartesian product
→ df.join(other, on='id', how='inner')
✗ CRITICAL SPL-D09-005 etl/customer_pipeline.py:87
.collect() without prior filter or limit
→ df.join(other, 'id').limit(1000).collect()
⚠ WARNING SPL-D06-001 etl/customer_pipeline.py:55
.cache() without .unpersist()
→ Call df_cached.unpersist() when done
⚠ WARNING SPL-D01-010 etl/customer_pipeline.py:12
Default shuffle partitions unchanged (200)
→ .config("spark.sql.shuffle.partitions", "400")
ℹ INFO SPL-D07-003 etl/customer_pipeline.py:31
select("*") disables column pruning
→ df.select('status').groupBy('status').count()
──────────────────────────────────────────────
Scanned 1 file · 5 findings · 2 critical · 2 warnings · 1 info
✗ FAILED (2 critical findings)
Coverage
11 Dimensions of Spark Performance
Every dimension of PySpark performance, from cluster configuration to production observability.
D01
Cluster Config
Serializer, memory settings, executor sizing, timeouts, and dynamic allocation.
10 rules
D02
Shuffle & Sort
groupByKey vs reduceByKey, redundant repartition, orderBy without limit.
8 rules
D03
Joins
Cross joins, missing broadcast hints, join inside loop, CBO settings.
10 rules
D04
Partitioning
repartition(1) bottlenecks, over/under-partitioning, bucketBy strategy.
10 rules
D05
Data Skew
Low-cardinality join keys, missing salting patterns, null-heavy keys.
7 rules
D06
Caching
cache without unpersist, cache inside loop, wrong storage level.
8 rules
D07
I/O & Formats
CSV for analytics, schema inference, JDBC without partitioning, mergeSchema.
10 rules
D08
AQE
AQE disabled, coalesce disabled, skew join off, advisory partition size.
7 rules
D09
UDFs & Code
Python UDFs, withColumn in loop, collect without filter, toPandas on large data.
12 rules
D10
Catalyst
UDF blocks pushdown, CBO disabled, missing table statistics, deep chains.
6 rules
D11
Monitoring
No Spark listeners, no metrics logging, hardcoded paths, missing error handling.
5 rules
93
Total Rules
across 11 dimensions
Architecture
Three-Tier Architecture
From fast pre-commit hook to deep runtime analysis — use the tier that fits your workflow.
Tier 1
Pre-Commit Hook
Pure Python AST analysis. Runs on every
git commit, scanning only staged files. Zero Spark dependency — works in any environment.
<3s
Scan time
Free
Cost
No
Spark needed
93
Rules
Tier 2
CI/PR Analysis
Claude API enrichment. Sends full file context to Claude for context-aware explanations and refactoring suggestions. Runs on pull requests.
10–30s
Scan time
~$0.05
Per PR
No
Spark needed
API key
Required
Tier 3
Deep Audit
Live Spark runtime analysis. Inspects physical query plans, runs benchmark comparisons, and profiles real execution metrics against a running cluster.
5–15m
Scan time
Cluster
Cost
Yes
Spark needed
Notebook
Interface
Integration
Pre-Commit Hook Setup
Block bad PySpark from ever reaching your repo. Installs in 60 seconds.
Step 01
Install pre-commit
pip install pre-commit
Step 02
Add to .pre-commit-config.yaml
Add the repo stanza shown below
Step 03
Install the hook
pre-commit install
Step 04
Test it
pre-commit run --all-files
.pre-commit-config.yaml
repos:
  - repo: https://github.com/sunilpradhansharma/spark-perf-lint
    rev: v0.1.0
    hooks:
      - id: spark-perf-lint
What happens on commit
$ git commit -m "add customer ETL pipeline"
spark-perf-lint...................................................Failed
✗ CRITICAL SPL-D08-001 etl/pipeline.py:8
AQE disabled — spark.sql.adaptive.enabled = false
→ Remove the line — AQE is on by default in Spark 3.2+
Commit blocked. Fix findings and try again.
Use git commit --no-verify to bypass (emergencies only)
Complete Reference
All 93 Rules
Every rule with ID, severity, description, and expandable before/after examples.
D01 — Cluster Configuration
10 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D01-001 | WARNING | Missing Kryo serializer — spark.serializer not set, so the ~10× slower Java serializer is used. Impact: 10× serialization speedup; 3–5× reduction in shuffle data size. Before: `SparkSession.builder.appName("job").getOrCreate()` After: `spark = (SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").getOrCreate())` |
| SPL-D01-002 | WARNING | Executor memory not configured spark.executor.memory not set — cluster default (often 1g) used. |
| SPL-D01-003 | WARNING | Driver memory not configured spark.driver.memory not set — risk of OOM on collect or broadcast. |
| SPL-D01-004 | INFO | Dynamic allocation not enabled spark.dynamicAllocation.enabled not true — idle executors waste resources. |
| SPL-D01-005 | WARNING | Executor cores set too high spark.executor.cores above 5 — HDFS throughput degrades, GC pauses increase. |
| SPL-D01-006 | WARNING | Memory overhead too low spark.executor.memoryOverhead below 384 MB — container killed without OOM log. |
| SPL-D01-007 | WARNING | Missing PySpark worker memory config Python UDFs present but spark.python.worker.memory not set. |
| SPL-D01-008 | WARNING | Network timeout too low spark.network.timeout below 120s — false-positive executor evictions during GC. |
| SPL-D01-009 | INFO | Speculation not enabled spark.speculation not enabled — slow nodes stall entire stages. |
| SPL-D01-010 | WARNING | Default shuffle partitions unchanged spark.sql.shuffle.partitions at default 200 — too low for large data, too high for small. |
D02 — Shuffle & Sort
8 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D02-001 | CRITICAL | groupByKey() instead of reduceByKey() — groupByKey shuffles all values; use reduceByKey for 10–100× less shuffle data. Impact: 10–100× shuffle data reduction for aggregation workloads. Before: `rdd.groupByKey().mapValues(sum)` After: `rdd.reduceByKey(lambda a, b: a + b)` |
| SPL-D02-002 | WARNING | Default shuffle partitions unchanged spark.sql.shuffle.partitions at default 200 — causes spill on large data. |
| SPL-D02-003 | WARNING | Unnecessary repartition before join repartition() before join() causes a redundant shuffle — join already shuffles on key. |
| SPL-D02-004 | WARNING | orderBy/sort without limit Global sort without .limit() is O(n log n) — add limit for top-N patterns. |
| SPL-D02-005 | INFO | distinct() without prior filter distinct() triggers full shuffle — apply filters first to reduce data volume. |
| SPL-D02-006 | INFO | Shuffle followed by coalesce coalesce() after shuffle creates unbalanced partitions — use repartition(). |
| SPL-D02-007 | WARNING | Multiple shuffles in sequence Two or more shuffle operations in sequence without intermediate caching. |
| SPL-D02-008 | INFO | Shuffle file buffer too small spark.shuffle.file.buffer below 1 MB — up to 32× more syscalls per shuffle write. |
D03 — Joins
10 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D03-001 | CRITICAL | Cross join / cartesian product — crossJoin() creates O(n²) rows; almost always unintentional. Impact: unbounded row explosion; can OOM or run indefinitely. Before: `df.crossJoin(other)  # O(n²) rows!` After: `df.join(other, on='id', how='inner')` |
| SPL-D03-002 | WARNING | Missing broadcast hint on small DataFrame join() without broadcast hint — small lookup tables should be broadcast to eliminate shuffle. |
| SPL-D03-003 | CRITICAL | Broadcast threshold disabled autoBroadcastJoinThreshold = -1 disables all broadcast joins — every join shuffles both sides. |
| SPL-D03-004 | WARNING | Join without prior filter/select join() on unfiltered read — push filter/select before join to reduce shuffle volume. |
| SPL-D03-005 | WARNING | Join key type mismatch risk Joining on different column names — type mismatch produces silent empty results. |
| SPL-D03-006 | INFO | Multiple joins without intermediate repartition 3+ joins without broadcast/repartition hints — each adds a full shuffle. |
| SPL-D03-007 | INFO | Self-join that could be window function Self-join doubles shuffle — window functions accomplish the same with zero cross-node transfer. |
| SPL-D03-008 | CRITICAL | Join inside loop — join() in a loop creates N shuffles and O(N) plan growth; causes driver OOM for large N. Impact: N shuffles where N = loop iterations; plan explosion for large loops. Before: `for date in date_range: result = result.join(daily_df.filter(f'date="{date}"'), 'id')` After: `all_dates = spark.createDataFrame(date_range); daily_all = daily_df.join(all_dates, 'date'); result = result.join(daily_all, 'id')` |
| SPL-D03-009 | INFO | Left join without null handling Left join without null handling — nulls propagate silently, causing wrong aggregations. |
| SPL-D03-010 | WARNING | CBO/statistics not enabled for complex joins spark.sql.cbo.enabled not set — Catalyst uses heuristics instead of cost-based join ordering. |
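In PySpark, the fix for SPL-D03-002 is simply `df.join(broadcast(lookup_df), 'id')` using `pyspark.sql.functions.broadcast`. To show why that helps, here is the underlying map-side hash-join mechanic sketched in plain Python (no Spark required; `broadcast_hash_join` and the sample rows are illustrative, not the linter's or Spark's API):

```python
def broadcast_hash_join(big_partition, small_table, key):
    """Map-side hash join: build a lookup from the small (broadcast) side
    once, then probe it per row of the big side; the big table is never
    shuffled across the network."""
    lookup = {row[key]: row for row in small_table}  # what Spark ships to every executor
    for row in big_partition:
        match = lookup.get(row[key])
        if match is not None:  # inner-join semantics
            yield {**row, **match}

orders = [{"id": 1, "amt": 10}, {"id": 2, "amt": 5}, {"id": 9, "amt": 7}]
countries = [{"id": 1, "cc": "US"}, {"id": 2, "cc": "DE"}]
joined = list(broadcast_hash_join(orders, countries, "id"))
```

This is why disabling `autoBroadcastJoinThreshold` (SPL-D03-003) hurts: without the broadcast, both sides must be shuffled on the join key instead.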
D04 — Partitioning
10 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D04-001 | CRITICAL | repartition(1) bottleneck repartition(1) forces all data through one task — eliminates all parallelism. |
| SPL-D04-002 | WARNING | coalesce(1) bottleneck coalesce(1) reduces to one partition — all downstream stages run single-threaded. |
| SPL-D04-003 | WARNING | Repartition with very high partition count repartition(n) above 10,000 — driver OOM, millions of shuffle files, slow task scheduling. |
| SPL-D04-004 | WARNING | Repartition with very low partition count repartition(n) where n < 10 — cluster cores sit idle, up to 25× slower. |
| SPL-D04-005 | INFO | Coalesce before write — unbalanced output files coalesce() before write creates unbalanced files — use repartition() for even output. |
| SPL-D04-006 | INFO | Missing partitionBy on write Writing without partitionBy — every query does a full table scan, no pruning possible. |
| SPL-D04-007 | WARNING | Over-partitioning — high-cardinality partition column partitionBy on user_id / *_id creates millions of tiny files — driver OOM on partition listing. |
| SPL-D04-008 | INFO | Missing bucketBy for repeatedly joined tables saveAsTable() without bucketBy — every future join on this table requires a full shuffle. |
| SPL-D04-009 | WARNING | Partition column not used in query filters Reading partitioned table without a partition filter — full table scan instead of pruned scan. |
| SPL-D04-010 | WARNING | Repartition by column different from join key repartition(col_A) then join on col_B — two full shuffles instead of one. |
D05 — Data Skew
7 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D05-001 | WARNING | Join on low-cardinality column join() on status/type/is_* — dominant value routes to one task, rest of cluster idles. |
| SPL-D05-002 | WARNING | GroupBy on low-cardinality column without secondary key groupBy on low-cardinality column creates unbalanced partitions — dominant group straggler. |
| SPL-D05-003 | WARNING | AQE skew join handling disabled skewJoin.enabled = false — skewed partitions not auto-split, straggler tasks persist. |
| SPL-D05-004 | INFO | AQE skew threshold too high Skew detection threshold exceeds 1 GB — moderate skew silently degrades performance. |
| SPL-D05-005 | INFO | Missing salting pattern for known skewed keys join() on potentially skewed ID column without rand()/explode() salting pattern. |
| SPL-D05-006 | WARNING | Window function partitioned by skew-prone column Window.partitionBy on low-cardinality column — executor OOM for dominant partition. |
| SPL-D05-007 | INFO | Null-heavy join key join() on null-heavy column without isNotNull filter — all nulls hash to same partition. |
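SPL-D05-005 mentions the salting pattern without showing it. The idea, sketched in plain Python with a deterministic stand-in for Spark's hash partitioner (all names and counts here are illustrative; in Spark you would append a random salt to the skewed key, e.g. via `concat` with `(rand() * N).cast('int')`, and explode the matching salt range on the other side of the join):

```python
import zlib

NUM_PARTITIONS = 8   # stand-in for the shuffle partition count
SALT_BUCKETS = 8     # number of salt values appended to the skewed key

def partition_of(key: str) -> int:
    """Deterministic stand-in for Spark's hash partitioner."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Unsalted: every row carrying the skewed key hashes to the same partition,
# so one straggler task processes all of them while the cluster idles.
hot_plain = {partition_of("hot_key")}

# Salted: the same key fans out across up to SALT_BUCKETS partitions.
# (The other join side must be exploded across the same salt range to match.)
hot_salted = {partition_of(f"hot_key_{salt}") for salt in range(SALT_BUCKETS)}

print(f"partitions reached by the hot key: plain={len(hot_plain)}, salted={len(hot_salted)}")
```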
D06 — Caching & Persistence
8 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D06-001 | WARNING | cache() without unpersist() Cached DataFrame never released — memory leak causes OOM in long-running jobs. |
| SPL-D06-002 | WARNING | cache() used only once Caching a DataFrame that is only read once — serialisation overhead with no benefit. |
| SPL-D06-003 | CRITICAL | cache() inside loop — cache() in a loop accumulates memory each iteration; OOM after O(N) iterations. Impact: executor memory accumulates without release. Before: `for epoch in range(10): df = model.transform(df).cache()` After: `for epoch in range(10): prev = df; df = model.transform(df).checkpoint(); prev.unpersist()` |
| SPL-D06-004 | WARNING | cache() before filter filter() after cache() — caches all rows when only filtered subset is needed. |
| SPL-D06-005 | INFO | MEMORY_ONLY storage level for potentially large datasets MEMORY_ONLY silently drops partitions under pressure — prefer MEMORY_AND_DISK. |
| SPL-D06-006 | WARNING | Reused DataFrame without cache Expensive DataFrame (join/groupBy) used multiple times without cache — N recomputations. |
| SPL-D06-007 | INFO | cache() after repartition cache() after repartition() is wasteful if the result is only used once. |
| SPL-D06-008 | INFO | checkpoint vs cache misuse checkpoint() in non-iterative code — unnecessary disk write, use cache() instead. |
D07 — I/O & File Formats
10 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D07-001 | WARNING | CSV/JSON used for analytical workload Row-oriented formats — no column pruning or predicate pushdown, 5–20× slower than Parquet. |
| SPL-D07-002 | WARNING | Schema inference enabled inferSchema=True reads data twice — 2× I/O per job, unstable schemas in production. |
| SPL-D07-003 | WARNING | select("*") — no column pruning select("*") forces all columns to be decoded from storage — defeats columnar pruning. |
| SPL-D07-004 | INFO | Filter applied after join — missing predicate pushdown filter() after join() — push filter before join to reduce shuffle volume. |
| SPL-D07-005 | INFO | Small file problem on write partitionBy() without controlling partition count — creates O(partitions × values) small files. |
| SPL-D07-006 | CRITICAL | JDBC read without partition parameters — no partitionColumn/numPartitions means a single-thread sequential read that does not scale. Impact: sequential single-thread read of large JDBC tables. Before: `df = spark.read.jdbc(url, 'orders')` After: `df = spark.read.jdbc(url, 'orders', column='order_id', lowerBound=1, upperBound=10_000_000, numPartitions=50)` |
| SPL-D07-007 | INFO | Parquet compression not set Writing Parquet without explicit compression — cluster-dependent codec, may write uncompressed. |
| SPL-D07-008 | INFO | Write mode not specified write without .mode() defaults to 'error' — job fails on re-run if output exists. |
| SPL-D07-009 | INFO | No format specified on read/write read.load()/write.save() without .format() — format depends on cluster config. |
| SPL-D07-010 | INFO | mergeSchema enabled without necessity mergeSchema=true reads footer of every Parquet file — O(files) metadata scan on every read. |
D08 — Adaptive Query Execution
7 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D08-001 | CRITICAL | AQE disabled spark.sql.adaptive.enabled = false — loses partition coalescing, skew join, and join strategy switching. |
| SPL-D08-002 | WARNING | AQE coalesce partitions disabled coalescePartitions.enabled = false — hundreds of tiny tasks, high driver scheduling overhead. |
| SPL-D08-003 | INFO | AQE advisory partition size too small advisoryPartitionSizeInBytes below 16 MB — too many small post-coalesce partitions. |
| SPL-D08-004 | WARNING | AQE skew join disabled with skew-prone joins detected skewJoin.enabled = false with joins present — skewed partitions not auto-split at runtime. |
| SPL-D08-005 | INFO | AQE skew factor too aggressive skewedPartitionFactor below 2 — every above-median partition treated as skewed, excessive splits. |
| SPL-D08-006 | INFO | AQE local shuffle reader disabled localShuffleReader.enabled = false — forces remote reads even for locally available shuffle data. |
| SPL-D08-007 | INFO | Manual shuffle partition count set high with AQE enabled shuffle.partitions > 400 with AQE on — unnecessary map-side overhead; tune advisoryPartitionSizeInBytes instead. |
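Most D08 findings reduce to a handful of session settings. A hedged sketch of applying them in one place; the values shown are illustrative starting points rather than universal tuning advice, and on Spark 3.2+ the first three are already the defaults:

```python
# Illustrative AQE settings covering rules SPL-D08-001 through -003 above.
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Target size for post-coalesce partitions; "64MB" is the stock default.
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64MB",
}

def apply_conf(builder, conf):
    """Fold a settings dict into a SparkSession builder via .config()."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

# Usage (requires Spark):
#   spark = apply_conf(SparkSession.builder.appName("job"), aqe_conf).getOrCreate()
```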
D09 — UDFs & Code Patterns
12 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D09-001 | WARNING | Python UDF (row-at-a-time) detected @udf executes row-at-a-time across the JVM↔Python boundary — 2–10× slowdown, blocks Catalyst. |
| SPL-D09-002 | WARNING | Python UDF replaceable with native Spark function @udf applies a built-in string method — use native SQL function (lower, trim, etc.) instead. |
| SPL-D09-003 | CRITICAL | withColumn() inside loop — withColumn() in a loop rebuilds the full query plan each iteration; O(N²) plan growth, driver hangs for N > 100 columns. Impact: O(N²) plan building. Before: `for col_name in column_list: df = df.withColumn(col_name, f(col(col_name)))` After: `exprs = [f(col(c)).alias(c) for c in column_list]; df = df.select('*', *exprs)` |
| SPL-D09-004 | CRITICAL | Row-by-row iteration over DataFrame Iterating over df.collect() — forces all data to driver, entire cluster idles. |
| SPL-D09-005 | CRITICAL | .collect() without prior filter or limit .collect() on unbounded DataFrame — driver OOM for datasets larger than heap. |
| SPL-D09-006 | CRITICAL | .toPandas() on potentially large DataFrame .toPandas() without limit() — materialises entire dataset in driver memory. |
| SPL-D09-007 | WARNING | .count() used for emptiness check df.count() == 0 triggers full table scan — use df.isEmpty (Spark 3.3+) instead. |
| SPL-D09-008 | INFO | .show() in production code .show() triggers materialisation and prints to stdout — remove before production. |
| SPL-D09-009 | INFO | .explain() or .printSchema() in production code Debugging tools that print to stdout — remove or guard with a debug flag. |
| SPL-D09-010 | WARNING | .rdd conversion dropping out of DataFrame API .rdd access deserialises columnar rows to Python Row objects — bypasses Catalyst and Tungsten. |
| SPL-D09-011 | INFO | pandas_udf without type annotations @pandas_udf without type hints — relies on runtime inference, uses deprecated API. |
| SPL-D09-012 | WARNING | Nested UDF calls A UDF calls another UDF — each adds a separate JVM↔Python serialisation boundary. |
D10 — Catalyst Optimizer
6 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D10-001 | WARNING | UDF blocks predicate pushdown @udf creates opaque boundary — Catalyst cannot push filters below the UDF, full scan required. |
| SPL-D10-002 | WARNING | CBO not enabled for complex queries File has 2+ joins but CBO disabled — Catalyst uses heuristics, 5–20× sub-optimal plans. |
| SPL-D10-003 | INFO | Join reordering disabled for multi-table joins 3+ joins without CBO join reordering — executed left-to-right, may produce large intermediates. |
| SPL-D10-004 | INFO | Table statistics not collected Joins present but no ANALYZE TABLE — CBO operates without accurate row counts or cardinalities. |
| SPL-D10-005 | INFO | Non-deterministic function in filter rand()/uuid() in filter() — potential double-evaluation by Catalyst, non-reproducible results. |
| SPL-D10-006 | INFO | Deep method chain (>20 chained ops) Single expression with 20+ chained operations — plan visitor stack overflow risk, hard to debug. |
D11 — Monitoring & Observability
5 rules
| Rule ID | Severity | Name & Description |
|---|---|---|
| SPL-D11-001 | INFO | No explain() for plan validation in test file Test file has complex Spark ops but no explain() — plan regressions invisible until production. |
| SPL-D11-002 | INFO | No Spark listener configured SparkSession without a listener — job metrics and query events not captured in production. |
| SPL-D11-003 | INFO | No metrics logging in long-running job 2+ write operations but no Python logging — silent data drops and slow stages invisible. |
| SPL-D11-004 | INFO | Missing error handling around Spark actions Spark actions (count, collect, write) not in try/except — partial writes, no error context in logs. |
| SPL-D11-005 | INFO | Hardcoded storage path Spark read/write uses hardcoded string path — environment coupling, production paths in source control. |
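The error-handling gap flagged by SPL-D11-003 and -004 can be closed with one small wrapper around Spark actions. A minimal sketch in plain Python; `run_action` and the logger name are illustrative choices of ours, not linter or Spark API:

```python
import logging

logger = logging.getLogger("etl")  # illustrative logger name

def run_action(step_name, action, *args, **kwargs):
    """Run a Spark action (df.count, df.write.parquet, ...) with logging and
    error context, so a failure names the pipeline step instead of surfacing
    only a bare stack trace."""
    logger.info("starting %s", step_name)
    try:
        result = action(*args, **kwargs)
    except Exception:
        logger.exception("step %r failed", step_name)
        raise
    logger.info("finished %s", step_name)
    return result

# Usage against a real DataFrame (output_path supplied by config rather than
# hardcoded, per SPL-D11-005):
#   n = run_action("count customers", df.count)
#   run_action("write output", df.write.mode("overwrite").parquet, output_path)
```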