Optimizing Spark Structured Streaming on Databricks: Best Practices You Should Know
Spark Structured Streaming is a powerful streaming engine that enables you to build scalable data pipelines and perform real-time data transformations. Although it is referred to as streaming, it works on the basis of micro-batch processing: at regular intervals, a batch of data is ingested, optionally transformed, and then written to storage in formats such as JSON, CSV, Parquet, or Delta.
Structured Streaming supports various data sources, including message brokers like Kafka or Azure Event Hubs, as well as file-based sources. For example, you can stream data from a data lake simply by monitoring new files as they arrive.
There are different ways to write output data. The most efficient method is appending new records only, but update operations are also supported. It’s important to note that real-time streaming typically incurs higher costs compared to traditional batch pipelines. This is because the job cluster needs to run continuously, and you’re billed for every minute it operates.
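To make the model concrete, here is a minimal sketch of such a pipeline. The paths, the input schema, and the choice of JSON as the source format are illustrative placeholders, not a prescribed setup:
# input_schema, source_path, checkpoint_path, and table_path are placeholders.
df = spark.readStream \
    .format("json") \
    .schema(input_schema) \
    .load(source_path)          # monitor a data lake folder for new files

df.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .outputMode("append") \
    .start(table_path)          # write each micro-batch to a Delta table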
Like any engine, Spark Structured Streaming can be fine-tuned to get the best performance for your specific use case. This article focuses on classical streaming pipelines in Databricks, specifically those using unmanaged tables.
Bear in mind that there’s no single configuration that guarantees optimal performance in every scenario — achieving good results often requires experimentation and tuning.

Trigger Types in Structured Streaming: Choose the Right One
In Spark Structured Streaming, we can control how often the engine triggers a new batch for processing. This is done using triggers, which define the frequency of micro-batch execution. Choosing the right trigger type and interval can significantly impact performance and cost.
There are several trigger types available:
Fixed Interval Micro-Batches (e.g., .trigger(processingTime='10 seconds'))
This is the most commonly used trigger type. The engine reads data from the source at a fixed interval—for example, every 10 seconds or every minute. The batch size and processing time can vary depending on how much data has arrived during that interval.
- Choosing the optimal interval requires experimentation — too short, and you'll read very small files frequently (potentially increasing storage costs and metadata operations); too long, and the batch may become large and require more memory and compute resources.
- This type is suitable for near-real-time pipelines where low latency is important.
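For example, assuming a streaming DataFrame dfs and the usual checkpoint and table paths (placeholder names), a fixed 10-second interval looks like this:
dfs.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(processingTime='10 seconds') \
    .outputMode("append") \
    .start(table_path)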
Available Now (.trigger(availableNow=True))
This trigger is used when we want to process all available data in one go and then stop the query automatically. It’s useful in two main scenarios:
- When you want to run the pipeline once (e.g., as a hybrid between batch and streaming).
- When you want to trigger processing on a schedule (e.g., once a day using an external orchestrator like Azure Data Factory or Apache Airflow).
This mode is more cost-efficient when you don't need continuous processing and can tolerate some delay in data freshness.
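A sketch of the same write with the Available Now trigger; once the current backlog has been processed, the query stops on its own (dfs and the paths are placeholders again):
query = dfs.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(availableNow=True) \
    .outputMode("append") \
    .start(table_path)

# Block until all currently available data has been processed and the query stops.
query.awaitTermination()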
Fun fact: there is also Trigger.Once(), which is similar to AvailableNow but processes everything in a single batch, ignoring rate limits such as maxFilesPerTrigger. It is deprecated, and AvailableNow should be used instead.
| Trigger Type | Description | Use Case Example | Notes |
| --- | --- | --- | --- |
| Trigger.ProcessingTime | Executes a micro-batch every X seconds/minutes, e.g. .trigger(processingTime='10 seconds') | Near-real-time data pipelines. | Most common. The interval needs tuning to balance latency and resource usage. |
| Trigger.AvailableNow | Processes all available data in multiple batches and then stops automatically, e.g. .trigger(availableNow=True) | Daily scheduled runs, full-data refreshes. | Good for cost-saving scenarios. Can be combined with external schedulers. |
MaxFilesPerTrigger: Tuning File-Based Streaming
When using a file-based source for your Spark Structured Streaming pipeline (like a Delta table or files in JSON/CSV format), one common issue is the number and size of files processed per batch. This is where the maxFilesPerTrigger parameter becomes crucial.

What does it do?
The maxFilesPerTrigger option limits how many new files are picked up for processing in each micro-batch. For a Delta source, the default is 1,000 files per batch; plain file sources such as CSV or JSON have no limit by default. Depending on your data volume and file sizes, these defaults may be far from optimal.
Why does it matter?
Let’s say you’re working with thousands of small files. If Spark reads them in small batches, your cluster will be underutilized — CPU and memory will sit idle, and you'll end up with longer processing times and higher costs.
On the other hand, trying to process too many files at once might overload the driver or executors, especially if you're running on limited resources.
Strategy to tune maxFilesPerTrigger:
- Initial load: Start with a higher value, like 5,000 or more, to speed up the initial file processing (especially useful for historical backfill).
- Regular streaming: Lower the value after the backlog is cleared to ensure stable performance and avoid OOM errors.
- Monitor: Track the number of records per batch and batch processing times; you can do this directly in the notebook (see the sketch after Listing 1) or via the Spark UI on your cluster.
- Also, keep an eye on cluster metrics like memory usage, CPU load, and garbage collection activity. These help you spot when the pipeline is hitting resource limits.
Listing 1: maxFilesPerTrigger with a value of 2,000 files
# maxFilesPerTrigger is a source option, so it is set when reading the stream;
# source_path, checkpoint_path, and table_path are placeholders.
dfs = spark.readStream \
    .format("delta") \
    .option("maxFilesPerTrigger", 2000) \
    .load(source_path)

dfs.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(processingTime='60 seconds') \
    .outputMode("append") \
    .start(table_path)
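For the monitoring tip above, the progress reports exposed on the query handle are a convenient starting point. A small sketch, assuming query is the StreamingQuery object returned by writeStream.start():
# Most recent micro-batch: number of input rows and per-phase durations.
progress = query.lastProgress
if progress:
    print(progress["numInputRows"], progress["durationMs"])

# Recent history of progress reports, useful for spotting growing batches or slowdowns.
for p in query.recentProgress:
    print(p["batchId"], p["numInputRows"], p["durationMs"].get("triggerExecution"))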
State Store & File Formats: Delta vs. CSV in Streaming
When streaming from file-based sources (like CSV, JSON, or Delta tables), Spark needs to keep track of what has already been processed. Progress is recorded in checkpoints, and stateful operations (such as aggregations or stream deduplication) additionally rely on the state store.
How it works
Each time a batch is processed, Spark saves metadata (e.g., which files were already read or which Kafka offsets were committed) into a checkpoint directory. When the stream restarts, Spark uses that information to continue processing only new data.
Listing 2: Write stream with checkpoint location indicated
df.writeStream \
.format("delta") \
.option("checkpointLocation", checkpoint_path) \
.outputMode("append") \
.start(table_path)
However, with a large number of files or a lot of streaming state, restarting the stream can become slow. Why? Spark's default state store provider is the HDFS-backed one, which keeps state in executor memory and persists it to the checkpoint location, and under heavy load it may not scale efficiently.
To improve state store performance, you can switch to RocksDB, which stores state in a local key-value database optimized for high throughput:
Listing 3: Spark session configuration to use the RocksDB provider
spark.conf.set(
"spark.sql.streaming.stateStore.providerClass",
"com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
)
Use Delta Instead of CSV/JSON
Another key tip: if you're streaming from files, prefer the Delta format over CSV or JSON. Why? Delta has a transaction log, so Spark can track what has already been processed using offsets into that log instead of listing and scanning all the files. This is both faster and more stable.
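As a sketch, streaming from a Delta table only requires switching the source format to delta; the path is a placeholder, and no schema needs to be supplied because it comes from the table itself:
dfs = spark.readStream \
    .format("delta") \
    .load(delta_source_path)   # progress is tracked via the Delta transaction log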
The table below compares loading data from CSV files into Delta with loading from Delta to Delta. In both cases the same cluster was used: Driver: F4, Workers: 3xF4, Runtime: 15.4 LTS.
| Source Format | Destination Format | # Files | Duration | Notes |
| --- | --- | --- | --- | --- |
| CSV | Delta | 3,200 | 6 minutes 44 sec | Optimized writes enabled; few output files created |
| Delta | Delta | ~12 | 4 minutes 37 sec | Faster read due to transaction log & offset-based logic |
Correct Partitioning Strategy
Partitioning is essential, but too fine-grained partitioning (e.g., per hour/day) can result in inefficient performance if each partition holds very little data.
Recommendations:
- Aim for approximately 1 GB per partition.
- Use monthly or yearly partitions instead of daily if the data per day is too small (see the sketch after this list).
- Do not partition tiny tables — you might get better performance without it!
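A minimal sketch of coarse-grained (monthly) partitioning on write; the event_time and year_month columns, as well as the paths, are illustrative:
from pyspark.sql import functions as F

# Derive a monthly partition column (e.g., '2024-03') from an event timestamp.
dfs_partitioned = dfs.withColumn("year_month", F.date_format("event_time", "yyyy-MM"))

dfs_partitioned.writeStream \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .partitionBy("year_month") \
    .outputMode("append") \
    .start(table_path)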
Auto Optimize and Auto Compaction
When a streaming pipeline writes data into a Delta table, it typically generates many small files, which is inefficient from a read performance perspective.
To address this issue, Delta Lake offers two features: Auto Optimize and Auto Compaction. These features help reduce the number of small files during streaming writes and improve performance.
They can be enabled at the table level using table properties:
Listing 4: Set optimized writes and auto compact on the table level
ALTER TABLE my_table SET TBLPROPERTIES (
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = true
);
Or, at the session (notebook) level, where these settings become the default table properties for tables created in that session:
Listing 5: Set optimized writes and auto compact on the notebook level
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact", "true")
You can verify the properties using:
Listing 6: Check if the options are enabled on the table level
SHOW TBLPROPERTIES my_table;
I did a small comparison by loading around 18 GB (3,000 CSV files) into a Delta table. Here are the results, measured on the Data Lake folder where the Delta table is stored:
- With Optimized Writes and Auto Compaction disabled:
  - Active blobs: 257 blobs, 2.42 GiB (2,594,153,665 bytes)
  - Snapshots: 0 blobs, 0 B (0 bytes)
  - Deleted blobs: 0 blobs, 0 B (0 bytes, does not include blobs in deleted folders)
  - Total: 257 items, 2.42 GiB (2,594,153,665 bytes)
- With Optimized Writes and Auto Compaction enabled:
  - Active blobs: 20 blobs, 2.43 GiB (2,610,875,123 bytes)
  - Snapshots: 0 blobs, 0 B (0 bytes)
  - Deleted blobs: 0 blobs, 0 B (0 bytes, does not include blobs in deleted folders)
  - Total: 20 items, 2.43 GiB (2,610,875,123 bytes)
The OPTIMIZE Command: Manual Compaction
Even when auto-compaction is enabled, it is still recommended to run the OPTIMIZE command regularly. OPTIMIZE compacts small files, which significantly improves read performance.
This command can be executed on the entire table, but if the table is partitioned, it's more efficient to run it on specific partitions. This reduces processing time and resource usage.
Keep in mind that OPTIMIZE does not delete old files; to remove obsolete data files and free up storage, use the VACUUM command.
Listing 7: Run the optimize command for a specific period, and the data is partitioned by date
OPTIMIZE my_table
WHERE date >= '2024-03-01' AND date < '2024-04-01';
Tips:
- Run OPTIMIZE daily on recent partitions (e.g., the last day or week); a short sketch follows this list.
- Avoid full-table optimizations on large tables — they can interfere with active streaming jobs.
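As a sketch of the first tip, a daily job could compact only yesterday's partition; my_table and the date column are the same illustrative names used in Listing 7:
from datetime import date, timedelta

# Compact only yesterday's partition of the (illustrative) my_table, partitioned by date.
yesterday = (date.today() - timedelta(days=1)).isoformat()
spark.sql(f"OPTIMIZE my_table WHERE date = '{yesterday}'")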
The VACUUM Command: Clean Up Old File Versions
Delta Lake maintains a history of table versions for time travel (by default: 30 days). This means historical versions of rows are stored in the underlying files. However, it’s important to note that these files are not deleted automatically, even if they are older than the retention period.
To physically remove them and free up storage, you need to run the VACUUM command. You can use the RETAIN parameter to specify how long to keep old files. Here's an example in the code below.
If you set a very short retention period (e.g., less than seven days), you must disable the retentionDurationCheck property — otherwise, the VACUUM command will fail.
Just like OPTIMIZE, it’s good practice to run VACUUM regularly to keep only the necessary files.
Pro tip: If your table might be used as a source in streaming jobs, use RETAIN 24 HOURS instead of 0 HOURS to avoid accidental data loss before the stream has had a chance to read the data.
Listing 8: Disable the retentionDurationCheck option
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)
Listing 9: Run the Vacuum command with a retention of 24 hours
VACUUM my_table RETAIN 24 HOURS;
Final Thoughts
There is no single silver bullet for making streaming pipelines always perform well. It often requires experimentation and tuning — adjust batch sizes, triggers, formats, and compaction strategies to fit your specific workload.
