• Apache Spark
  • Technical
Apache Spark™ Structured Streaming With DataFrames

This blog provides an exploration of Spark Structured Streaming with DataFrames

The blog extends the previous Spark MLLib Instametrics data prediction blog example to make predictions from streaming data.  We demonstrate a 2-phase approach to debugging, starting with static DataFrames first, and then turning on streaming. Finally, we explain Spark structured streaming in more detail by looking at the trigger, sliding, and window time.



A small continuously flowing watercourse.



(Australia) an ephemeral stream, often dry, but occasionally floods.

The Australian Outback can be extreme (hot, dry, wet).

Many roads have restrictions!

spark streaming dataframe(Source: Shutterstock)



(Australia) see Creek.

The Todd River Race is a “boat” race held annually in the dry sandy bed of the Todd River in Alice Springs, Australia (although it was canceled one year—due to rain!)

(Source: Alli Polin – Flickr: Boat Race in the Desert, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=25241465)

Apache Spark supports Streaming Data Analytics. The original RDD version was based on micro-batching. Traditionally “pure” stream processing works by executing the operations every time a new event arrives. This ensures fast latency but it is harder to ensure fault tolerance and scalability. Micro-batches were a trade-off and worked by grouping multiple individual records into batches for processing together. Latencies of around 2 seconds were realistic, which was adequate for many applications where the timescale of the trends being monitored and managed is longer than the latency of the micro-batches (i.e. > 2s). However, in practice, the batching latency is only one contributor of many to the overall latency of the system (not necessarily even the main contributor).

The current “Spark Structured Streaming” version supports DataFrames, and models stream as infinite tables rather than discrete collections of data. The benefits of the newer approach are:

  1. A simpler programming model (in theory you can develop, test, and debug code with DataFrames, and then switch to streaming data later after it’s working correctly on static data); and
  2. The time model is easier to understand and may have less latency.

1. Dry Run: DataFrame Design and Debugging

The Todd River is dry for most of the time!

(Source: Shutterstock)

The streaming problem we’re going to tackle in this blog is built on the predictive data analytics exploration in the previous blogs. Let’s say we have produced a model using Spark MLlib which can be applied to data over a time period (say 10 minutes) to predict if the SLA will be violated in the next 10 minute period and we want to put it into production using streaming data as the input. Rather than dividing the streaming data up into fixed 10-minute intervals, forcing us to wait for up to 10 minutes before obtaining an SLA warning, a better approach is to use a sliding window (e.g. 1-minute duration) that continuously updates the data available and aggregation calculations (such as a moving average) every minute. This will ensure that we get SLA warnings every minute – i.e. every minute (sliding interval) we want to know what happened over the last 10 minutes (window duration).

After spending several frustrating days attempting to design, debug and test a complete solution to a sample problem involving DataFrames and Spark Streaming at the same time, I recommend developing streaming code in two steps. First (1) design and debug a static DataFrame version, and then (2) add streaming. In theory, this should work as you can focus on the correct series of DataFrame transformations required to get from your input data to the desired output data, and then test it with static data. Also given that Spark DataFrames use lazy evaluation it’s best (for sanity) to test each operation in order before combining them. All without having to also worry about streaming data issues (yet).

Here’s the initial pure DataFrame code (I developed and ran this code on an Instaclustr Spark + Zeppelin cluster, see last blog on Apache Zeppelin):

The variables windowDuration and slideDuration are Strings defining the window size and sliding window times. Next, we define a class Raw and Seq inSeq (a few samples) for the raw data (with node, service, and metric fields). This uses the scale “case class” syntax which enables automatic construction. We then turn the inSeq data into a DataFrame (inDF).

To this raw data DataFrame, we apply a series of dataFrame transformations to turn the raw input data into a prediction of SLA violation for each (node, window). (When debugging, it’s best to try these one at a time.):


.withColumn(“time”, current_timestamp())
The raw data doesn’t have an event-timestamp so add a processing-time timestamp column using withColumn and the current_timestamp().


.filter($”service” === “s1″ || $”service” === “s2”)
Assuming we have a MLLib model for prediction of SLAs, and we know what features it uses, we can filter the rows by retaining only relevant metrics (for this simple demo we assume that only s1 and s2 are used).


.withColumn(“window”, window($”time”, windowDuration, slideDuration))
Add a new column called “window” using withColumn, which creates a sliding window using the sql window function.  .groupBy(“node”, “window”)


.groupBy(“node”, “window”)
.agg(min(“metric”), avg(“metric”), max(“metric”))
These three operations are used together to produce a wide table. groupBy produces a single row per node+window permutation. pivot and agg result in 3 new columns (min, avg, max) being computed for each service name.


.filter($”s1_avg(metric)” > 3 && $”s2_max(metric)” > 1)

The next filter is a “stand-in” for the MLLib model prediction function for demonstration purposes. Assuming we have previously produced a decision tree model it’s easy enough to extract simple conditional expressions for the positive example predictions as a filter which only returns rows that are predicted to have a SLA violation in the next time period.


.select(“node”, “window”)

Finally, select the node and window columns (for debugging it’s better to return all columns).

Pick up your boat and run!  Here’s the step by step output.

The raw sample input data:

Add current time column:

Filter relevant service names:

(Same output as above)

Add sliding windows:

This produced an error:

java.lang.RuntimeException: UnaryExpressions must override either eval or nullSafeEval

So I introduced a hack for testing, by using the time column as the window:

Group, pivot, agg (TODO Formatting of wider tables is a problem!!!!):

Filter (stand-in for MLLib model prediction):

Select node and window columns:

This seems to be what we want! Each row represents a warning that the SLA may be violated for the node in the next time period (so in theory we can take action in advance and check the prediction later on).  Now it’s time to get our feet wet with Streaming data…

2. Wet Run (Creeking?): Stream Debugging



Data Analytics with a Sporadic data stream?! That is, either none or lots of data.

I thought I’d invented this term, but Creeking was already taken. It’s extreme whitewater kayaking involving the descent of steep waterfalls and slides!


(Source: Shutterstock)

Perhaps Creeking Data (torrents of data with fast sliding windows) could be a thing? We do have creeks with flowing water in Australia. I once accidentally went “creeking” over an unexpected waterfall and lost my glasses, while “Liloing” (leisurely floating on an air mattress) down Bell’s Canyon).

Apart from DataFrames, the Spark structured streaming architecture has a few more moving parts of interest: input stream source (rawIn, in the code below), input table (inDF), query (querySLA), result table (outSLA), and output sink (slaTable).

Spark Structured Streaming

Spark Structured Streaming – Apache Spark Structured Streaming High-Level Architecture

The inbuilt streaming sources are FileStreamSource, Kafka Source, TextSocketSource, and MemoryStream. The last two are only recommended for testing as they are not fault-tolerant, and we’ll use the MemoryStream for our example, which oddly isn’t documented in the main documents

Spark structured streaming can provide fault-tolerant end-to-end exactly-once semantics using checkpointing in the engine. However, the streaming sinks must be idempotent for handling reprocessing, and sources must be “replayable” – sockets and memory streams aren’t replayable.

How do we add a streaming data source to a DataFrame? Simply replace inSeq in the above code with an input MemoryStream like this:

val rawIn = MemoryStream[Raw]

val inDF = rawIn.toDF

Note that once a DataFrame has a streaming data source it cannot be treated as a normal DataFrame, and you get an error saying:

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();

And now put the streaming window function back in:

The final abstraction we need to add is a “query”, which requires options for output mode, an output sink, query name, trigger interval, checkpoint, etc. The query can then be started as follows:

Even though Streaming operations can be written as if they are just DataFrames on a static bounded table, Spark actually runs them as an incremental query on an unbounded input table. Every time the query is run (determined by the Trigger interval option), any new rows that have arrived on the input stream will be added to the input table, computations will be updated, and the results will be updated. The changed results can then be written to an external sink.

There are three write modes: Complete, Update and Append (default), but only some are applicable depending on the DataFrame operations used. You can optionally specify a trigger interval. If the trigger interval is not specified the query will be run as fast as possible (new data will be checked as soon as the previous query has been completed).  The syntax for providing an optional trigger time is:

.trigger(ProcessingTime(“10 seconds”))

For this example, we used the simple Memory Sink (which is for debugging only).  For production, a different sink should be used, for example, a Cassandra sink. For production, a real input stream will be needed. Kafka is a good choice, see the Instaclustr Spark Streaming, Kafka, and Cassandra Tutorial.

Did the streaming code actually work? Not exactly.  There is some “fine print” in the documentation about unsupported operators.  The list isn’t explicit and sometimes you will have to wait for a run-time error to find out if there’s a problem.  For example, the above code didn’t run (and the error message didn’t help), but by a process of trial and error I worked out that pivot isn’t supported. Here’s a workaround, assuming you know in advance which service names need to be aggregated for model prediction input (I used the spark sql when function in the agg to check for each required service name):

Start Your Streams!

Once a query is started it runs continuously in the background, automatically checking for new input data and updating the input table, computations, and results table. You can have as many queries as you like running at once, and queries can be managed (e.g. find, check and stop queries). Note that stop appears to result in the data in the input sink vanishing (logically I guess as the data has already been read once!)

Is that it? Well not exactly. What’s missing? You will notice that we don’t have any input data yet, and no way of checking the results! The addData() method is used to add input rows to the MemoryStream:

rawIn.addData(Raw(“n1”, “s1”, 3.14))

This is cool for testing. Just add some data, and then look at the result table (see below) to check what’s happened! But It is tedious to create a large data set for testing like this, so here’s the code I used to create more torrential and realistic input data:

(Note that for the final version of the code the names “nodeX” and “serviceX” were used instead of “nX” and “sX”).

Finally, here’s the code to inspect the results. “sla1” is the queryName from the query started above, and processAllAvailable()is only used for testing as it may block with real streaming data, see documentation:

val slaTable = spark.table(“sla1”).as[(String, String)]

One potential issue with streaming data and small sliding window times is data quality and quantity. I.e. there may be insufficient data in a window to compute aggregations to justify an SLA violation prediction. A simple hack is to include a count of the number of measurements in the window as follows:

Here are some results (with avg and max and count cols left in for debugging). Note that because there are 10 windows for each minute of events (as sliding duration was 1 minute) there may be multiple SLA warnings for the same node):

Even after debugging the static pure DataFrame version of the code, there were still a few surprises. One useful trick for debugging is to run multiple streaming queries. One query produces the final results, but other queries can be used to show intermediate steps. For example, use another query to look at the partially processed raw input data (after adding the window):

A cool Zeppelin fact.  I discovered a new Zeppelin trick while debugging this code. There’s an implicit Zeppelin context available (called “z”!) which provides a few tricks. For example, a nicer version of the show has options for different types of graphs.  Just type z.show(DataFrameName)”.

3. Triggers, Slides, and Windows

Triggers, Slides, and Windows sound like a risky combination?! I realised I’d glossed over the treatment of time in Spark Streaming, so here’s another attempt at trying to explain how “time works” (at least at a high level).

Trigger time: Rate at which new input data is added to the input table from the input source stream. The default is that after the query is finished it just looks again.

Sliding duration: How often the window slides (possibly resulting in some events being added or removed from the window if they no longer fall within the window interval).

Window duration: The duration of the window, determines the start and end time (end-start = window).

The following 3 diagrams illustrate three cases. Example 1 has a trigger time of 1 (unit of time), a slide of 1, and a window of 10. The timeline is at the top (from 1 to 17), and one event arrives during each period of time (a, b, c, etc). Events are added to the input table once per trigger duration, resulting in one event being added each unit time in this example. Multiple sliding windows are created, each is 10 time units long, and 1 unit apart.  After 10 time units, the 1st window has 10 events (a-j) and then doesn’t grow any further. The 2nd window (which started 1 time unit later than the 1st) contains events b-k, etc.

Example 2 shows what happens if the trigger time (2 time units) is longer than the sliding time (1 time unit). This results in a potential backlog of events, with all the events that have arrived on the input stream since the last trigger being added to the window at the next trigger time (in this case we get both events a and b added during the time 2 period, c and d during period 4, etc.).  

Spark Structured Streaming

Triggers, Slides and Windows Apache Spark Example 2

Example 3 shows a Tumbling window when the sliding and window times are equal (5 units):

Spark Structured Streaming

Triggers, Slides, and Windows Apache Spark Example 3

What are sensible defaults for these times? Trigger time <= Slide time <= Window duration.

Is there anything wrong with having a very short trigger or sliding times? Short trigger times may increase resource usage (as the query must be re-run over each new input event), and short sliding times will increase the number of concurrent windows that need to be managed.  Long window times may increase the amount of data and processing required for each window.  I.e. more frequent and larger computations will consume memory and CPU.

However, the actual times will need to be determined based on your use case, taking into account the data velocity and volumes and the time-scales of the “physical” system being monitored and managed.

Also (as we noticed from the example output), sliding windows will overlap each other, and each event can be in more than one window. “Tumbling” windows don’t overlap (i.e. duration and sliding intervals are the same).  If no sliding duration is provided in the window() function you get a tumbling window by default (slide time equal to duration time). Which sort of window do you need?  For the same use cases, the data must not (or just don’t need to be overlapping), so a tumbling window must be used is applicable (e.g. periodic report generation such as a daily summary) For other applications it’s important to have more frequent updates but still a longer period (the window time) for computing over, so sliding windows are the answer.

The location of the window() function documentation isn’t obvious. It’s in the org.apache.spark.sql.functions package, spark sql streaming window() function documentation.  The parameters windowDuration and slideDuration are strings, with valid durations defined in org.apache.spark.unsafe.types.CalendarInterval. In theory, durations can range from microseconds to years, although using units greater than weeks produces an exception, probably so that you don’t accidentally have a window with too much data in it. Durations greater than months can be specified using units less than months (e.g. “52 weeks”). Long windows can make sense when the timespan of the data is long but the quantity of data is manageable, otherwise, batch processing may be preferable. Also, note that slideDuration must be <= windowDuration.