Apache Kafka® Streams vs. Apache Spark™ Structured Streaming

In our previous blog, we compared Apache Flink® and Apache Kafka® Streams. Now, let’s compare Kafka Streams and Apache Spark™ Structured Streaming! 

How can ride-sharing applications display up-to-the-moment vehicle availability and prices for thousands of concurrent ride requests? How do financial institutions detect anomalies that signal fraud in real time among millions of other transactions? And how do video streaming services track subscriber behavior, tastes, and preferences across billions of hours of streamed content?

The answer in each case begins with data streaming.

Data streaming is the high-speed, continuous transmission of data from its source, often from thousands of sources, to one or many destinations. Businesses across all industries use data streaming to strengthen cybersecurity, monitor IT infrastructure, improve customer success, and make business decisions in real time.

But with so many data streaming platforms, finding the right solution can be confusing.  

In this post, we will discuss data streaming, compare two of the most popular solutions, Apache Kafka Streams and Spark Structured Streaming, and help you determine which is the best fit for your use case.

The Evolution to Data Streaming  

In traditional IT infrastructures, only a handful of applications, such as CRM, ERP, or HR systems, generated data. Periodic ETL batch processing could then be used to extract, transform, and load that data into structured repositories, typically a data warehouse or data lake.

A visual representation of a traditional IT infrastructure. Select applications, such as CRM, ERP, or HR, generate data that is periodically transformed and stored in a data warehouse for reporting and additional analysis.

Once in the data warehouse, data analysts and other business users could make informed decisions based on large amounts of historical data from a variety of sources.

However, the advent of high-speed internet, cloud computing, and the Internet of Things (IoT), along with the need for real-time data, has changed the way we view and process data. Sensors capturing events in real time require a data streaming architecture capable of ingesting and processing massive amounts of data.

Unlike traditional IT architectures, which focus on batch writing and reading, a streaming data architecture consumes data as it is generated, stores it persistently, and can perform real-time processing, distribution, transformation, and analysis. Data streaming platforms are also designed to be highly scalable and available, to deliver high throughput, and to connect to nearly any data source.

A visual representation of a modern data streaming platform. Data is streamed continuously from a variety of sources, then processed, analyzed, and distributed in real time.

Data streaming platforms are designed to both receive and distribute data from and to multiple systems simultaneously and differ from traditional batch processing in the following ways:  

 

| | Traditional Data Architecture | Data Streaming |
| --- | --- | --- |
| Data Processing | Periodic; processes all data at once | Simultaneous, real-time processing, storage, and analysis |
| Data Storage | Centralized data warehouse or monolithic database | Distributed computing and storage systems |
| Integration and Transformation | Periodic, batch-mode ETL processes | Real-time integration, transformation, and enrichment |
| Event Handling | Request-driven inquiries (e.g., SQL queries) | Real-time processing of incoming data |

 

What Is Kafka?  

Apache Kafka is a leading open source, distributed event streaming platform used by over 80% of the Fortune 500. Originally developed at LinkedIn and later donated to the Apache Software Foundation, Kafka is used by companies requiring high-performance data pipelines, streaming analytics, data integration, and support for mission-critical applications.

Kafka’s publish/subscribe model was designed with fault tolerance and scalability in mind, capable of handling over a million messages per second or trillions of messages per day.  

Kafka Key Concepts

Producers: Producers publish data from event-generating devices, applications, and other data sources to Kafka topics. For example, in a data stream focused on detecting credit card fraud, producers might include payment processing systems, point-of-sale systems, payment gateways, and e-commerce platforms.

Topics: Producers publish events to topics, which are categories of data that consumers subscribe to. In our credit card fraud detection example, transaction data might be stored in an “in-store” topic, an “online” topic, or an “ATM withdrawals” topic.

Partitions: Kafka topics are divided into partitions, smaller subsets of a topic’s data. Partitioning helps Kafka achieve scalability, load balancing, and fault tolerance by writing, storing, and processing partitions across brokers. In our example, transaction records might be partitioned by a key such as card number or region, with each partition distributed and replicated across Kafka brokers.

Consumers: Consumers are the applications, platforms, and systems that subscribe to Kafka topics and analyze the real-time data they contain. Examples of fraud detection consumers might be fraud detection systems, customer notification systems, transaction monitoring dashboards, or payment authorization systems.

Kafka Brokers: Brokers are the Kafka instances where data published by producers is made available to consumers by subscription. Kafka ensures fault tolerance by replicating data across multiple broker instances. Kafka brokers contain topic log partitions, where data is stored and distributed:

Kafka Brokers (Source: Instaclustr)

Combined, Kafka delivers three overarching capabilities: publishing and subscribing to data streams, processing records in real time, and storing data reliably with fault tolerance.

Note: Messages written to a topic are immutable; they cannot be altered, and they are retained for a preconfigured period (e.g., 7 days) before being deleted.
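
To make these concepts concrete, here is a minimal producer sketch in Java; the broker address, the “transactions” topic name, and the record format are assumptions for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // try-with-resources closes the producer and flushes buffered records
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by card ID routes all of a card's events to the same partition,
            // preserving per-card ordering for downstream fraud checks.
            producer.send(new ProducerRecord<>("transactions", "card-42", "card-42,12500.00,ATM"));
        }
    }
}
```

A consumer such as a fraud detection service would then subscribe to the same topic and read these records in order within each partition.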

Kafka Streams

Kafka comes with five primary API libraries:

  1. Producer API: enables applications to publish data to Kafka topics 
  2. Consumer API: enables applications to subscribe to Kafka topics 
  3. Streams API: enables streaming data transformation between Kafka topics 
  4. Connect API: enables building and running connectors that integrate Kafka topics with external systems and applications 
  5. Admin API: enables management and review of Kafka objects 

Apache Kafka on its own is a highly capable streaming solution. However, Kafka Streams delivers a more efficient and powerful framework for processing and transforming streaming data. 

As described by IBM, the Streams API “builds on the Producer and Consumer APIs and adds complex processing capabilities that enable an application to perform continuous, front-to-back stream processing.” Although the Producer and Consumer APIs can be used together for stream processing, it’s the Streams API specifically that allows for advanced event and data streaming applications.
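
As a rough illustration, here is a minimal Kafka Streams sketch that reads a stream of transactions and routes large ones to a separate topic; the broker address, topic names, and record format are all hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FraudFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Values are assumed to look like "cardId,amount"; keep transactions over $10,000
        // and write them to a topic that fraud-checking consumers subscribe to.
        KStream<String, String> transactions = builder.stream("transactions"); // hypothetical topic
        transactions
            .filter((key, value) -> Double.parseDouble(value.split(",")[1]) > 10_000.0)
            .to("flagged-transactions"); // hypothetical output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Notably, this topology runs inside an ordinary Java application; Kafka Streams requires no separate processing cluster.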

What Is Apache Spark? 

Apache Spark is a distributed computing framework designed to perform processing tasks on big data workloads. First developed in 2009 at UC Berkeley’s AMPLab, Spark was donated to the Apache Software Foundation in 2013.

In-memory caching and optimized query execution for fast analytic queries make Spark a popular choice for data engineering, data science, and machine learning projects. Spark provides APIs in Java, Scala, Python, and R, allowing developers to write data processing applications in their preferred language.

Spark Key Components 

Spark’s ecosystem consists of five key components:

  • Spark Core: Contains the foundational functionality of Spark, including input/output operations, data transformation, fault tolerance, monitoring, and task scheduling. Spark Core provides in-memory computing and can reference datasets in external storage systems. The core also comes with language-independent REST APIs.
  • Spark SQL: Spark SQL is a Spark module used for processing structured data stored in Spark programs or externally stored structured data accessed via standard JDBC and ODBC connectors.
    • As an example, if a car manufacturer were using Spark to measure the effectiveness of a new car ad campaign, it could use Spark SQL to ingest structured data from website analytics, customer surveys, and dealership records. It could then develop SQL queries to correlate when and where ads ran with the impact the ads had on website traffic and, ultimately, on dealership new car sales (see the Spark SQL sketch below).
  • Streaming: Spark Streaming is responsible for scalable, high-throughput, fault-tolerant processing of live data streams. Spark Streaming applies algorithms like map, reduce, join, and window on data streams, and then pushes the data to destinations such as file systems, dashboards, or databases. Spark Streaming natively supports both batch and streaming workloads.
    • In the new car campaign example, streaming data could continuously monitor social media platforms and news websites in real-time and generate alerts to the marketing team of changes in customer or market sentiment.
  • MLlib: The MLlib (Machine Learning Library) comes with common learning algorithms for building, training, and deploying machine learning models at scale. The broad set of MLlib algorithms include tasks such as classification, regression, clustering, recommendation, and frequency analysis.
    • In our new car campaign example, MLlib could be used to segment customers by their car preferences (e.g., luxury, eco-friendly, safety), region, and buying behavior, and then to predict future sales and customer demand based on those segments.
  • GraphX: GraphX is a graph processing library built on top of Spark. GraphX leverages the distributed computing capabilities of Spark, making it an ideal approach for processing and analyzing massive and complex graph-structured data.
    • In our example, GraphX could be used to track and visualize user interactions across channels and identify audiences with interest in the campaign, such as auto enthusiasts or tech pundits so that marketing could further hone their messaging to the specific buyer groups.

An example of Apache Spark’s architecture designed to measure the effectiveness of a new car ad campaign. The combination of Spark components is used to measure the campaign’s reach and performance (Spark SQL), monitor sentiment in real time across social media and news sites (Streaming), forecast future sales based on buyer segment (MLlib), and visualize target audiences across channels (GraphX). (Source: Instaclustr)
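
As a sketch of the Spark SQL piece of that architecture, the example below joins two hypothetical CSV exports (the file paths and column names are invented for illustration) to relate ad clicks to sales by region:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CampaignAnalysis {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("CampaignAnalysis")
            .master("local[*]") // run locally; use a cluster master URL in production
            .getOrCreate();

        // Hypothetical CSV exports of website analytics and dealership sales records.
        Dataset<Row> webStats = spark.read().option("header", "true").csv("data/web_analytics.csv");
        Dataset<Row> sales = spark.read().option("header", "true").csv("data/dealership_sales.csv");

        webStats.createOrReplaceTempView("web_stats");
        sales.createOrReplaceTempView("sales");

        // Correlate ad-driven site visits with new car sales per region.
        Dataset<Row> byRegion = spark.sql(
            "SELECT w.region, SUM(w.ad_clicks) AS clicks, COUNT(s.sale_id) AS sales " +
            "FROM web_stats w JOIN sales s ON w.region = s.region " +
            "GROUP BY w.region");
        byRegion.show();

        spark.stop();
    }
}
```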

Structured Streaming 

Spark 2.0 introduced Structured Streaming, which is now generally favored for processing data streams. One reason is that the older Spark Streaming API sends data in micro-batches over fixed time intervals, which can add latency.

Structured Streaming, on the other hand, offers lower latency because it can send data on a record-by-record basis as it becomes available, loses less data thanks to its event-handling capabilities, and is generally preferred for developing real-time streaming applications.
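
Here is a minimal sketch of Structured Streaming consuming a Kafka topic; it assumes a local broker, a hypothetical “transactions” topic, and the spark-sql-kafka connector package on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToConsole {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("KafkaToConsole")
            .master("local[*]")
            .getOrCreate();

        // Subscribe to a hypothetical "transactions" topic on an assumed local broker.
        Dataset<Row> stream = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "transactions")
            .load();

        // Kafka records arrive as binary key/value; cast them to strings for inspection.
        Dataset<Row> decoded = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Write each batch of results to the console as it arrives.
        StreamingQuery query = decoded.writeStream()
            .format("console")
            .outputMode("append")
            .start();

        query.awaitTermination();
    }
}
```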

Kafka Streams vs. Structured Streaming: Comparison 

The release of Structured Streaming enabled Spark to stream data much like Apache Kafka. Each platform consumes data as it arrives, making both suitable for low-latency data processing use cases. Both platforms also integrate tightly with Kafka: Kafka through its Streams library for building applications and microservices, and Structured Streaming as an easy-to-configure consumer of Kafka topics.

However, there are also differences between Kafka Streams and Structured Streaming that are important to note:

| | Kafka Streams | Spark Structured Streaming |
| --- | --- | --- |
| Primary Objective | Streams is a foundational component built into Kafka | Structured Streaming is a library that enables stream processing for Spark |
| Performance | With inherent data parallelism, scalability, and built-in fault tolerance, Kafka Streams is faster and easier to use than most platforms | Offers high-throughput, scalable, fault-tolerant processing, but requires regular performance tuning in order to scale |
| Processing Model | Event-driven processing | Supports both micro-batch and event-driven models |
| Programming Language | The Kafka Streams API natively supports Java and Scala | Provides API libraries in Java, Scala, and Python |
| Integration | Designed to import and process data from Kafka topics; data can be exported to a variety of other systems via Kafka® Connect | Can import from and export to a variety of applications and devices |
| Machine Learning | No built-in machine learning library; an external ML library or system can be used with Kafka, but not as straightforwardly as Spark’s MLlib | Ships with a machine learning library (MLlib) whose algorithms can inject machine learning, such as real-time prediction, into streaming workflows |
| Data Processing | Based on event time, ingestion time, or processing time | Based on event time, ingestion time, or processing time |
| Fault Tolerance | Uses Kafka changelog topics to store application state | Uses checkpointing to maintain processing state |
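
To illustrate the event-time processing and checkpoint-based fault tolerance called out in the table, here is a minimal Structured Streaming sketch that counts events in one-minute event-time windows with a 30-second watermark; it uses Spark’s built-in rate test source, and the checkpoint path is hypothetical:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EventTimeWindows {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("EventTimeWindows")
            .master("local[*]")
            .getOrCreate();

        // Built-in test source emitting (timestamp, value) rows for demonstration.
        Dataset<Row> events = spark.readStream()
            .format("rate")
            .option("rowsPerSecond", "10")
            .load()
            .withColumnRenamed("timestamp", "eventTime");

        // Count events in 1-minute event-time windows, tolerating 30s of lateness.
        Dataset<Row> counts = events
            .withWatermark("eventTime", "30 seconds")
            .groupBy(window(col("eventTime"), "1 minute"))
            .count();

        counts.writeStream()
            .outputMode("update")
            .format("console")
            .option("checkpointLocation", "/tmp/event-time-checkpoint") // state for fault tolerance
            .start()
            .awaitTermination();
    }
}
```

Kafka Streams achieves comparable durability differently: its windowed state stores are backed by changelog topics in Kafka itself.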

When To Use Kafka or Spark? 

The streaming solution you use depends on a variety of factors. If you want to analyze streaming data against multiple other data sources, run analytics on the data, or train machine learning models on it, then Spark Structured Streaming is a good choice.

Kafka Streams is a good option where ultra-low latency is required, where the tasks are simple (such as filtering, routing, or transforming data), and where high data throughput is needed. But ultimately, the decision should come down to your existing architecture, use cases, and platform experience.

Interested in learning more, or how Instaclustr can help with your use case? Reach out to our friendly sales team to get started or try out the Instaclustr Managed Platform and provision your first cluster for free!