What is Apache Spark?

Apache Spark is an open-source distributed computing system primarily used for big data processing and analytics. It was developed to improve the speed and simplicity of big data analysis by providing an easy-to-use interface and built-in capabilities for distributed data processing. Spark can process data from various sources such as Hadoop Distributed File System (HDFS), Apache Cassandra, Apache HBase, and Amazon S3, among others.

One of Spark’s defining features is its ability to perform in-memory computations, drastically reducing the time required for iterative tasks and interactive queries. The system leverages resilient distributed datasets (RDDs) to store data in memory, minimizing the need for disk I/O and significantly speeding up analytics processes. Furthermore, Spark supports multiple programming languages including Java, Python, Scala, and R, broadening its accessibility to data engineers and data scientists.

Editor’s note: Updated to reflect Apache Spark version 4.

This is part of a series of articles about Apache Spark.

Understanding Apache Spark concepts

Resilient Distributed Datasets (RDDs)

Resilient distributed datasets (RDDs) are fundamental to Apache Spark. They are immutable and distributed collections of objects that can be processed in parallel. An RDD can be created from a local collection, or from datasets stored in external storage systems like HDFS, HBase, and S3.

The immutability of RDDs, combined with lineage information that tracks the sequence of operations that produced them, ensures fault tolerance. If a node fails, Spark uses the lineage to recompute lost data rather than replicating it across nodes. This approach reduces replication overhead and speeds up data recovery.
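
A minimal PySpark sketch of the ideas above; the HDFS path is a placeholder and the values are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# An RDD from a local collection:
numbers = sc.parallelize([1, 2, 3, 4, 5])

# An RDD from external storage (the HDFS path is a placeholder):
# lines = sc.textFile("hdfs://namenode:8020/data/events.txt")

# RDDs are immutable: map() returns a new RDD and records its lineage,
# which is what Spark replays if a partition is lost.
squares = numbers.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16, 25]
```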

Partition

Partitioning is essential for efficiently distributing data across the cluster. An RDD is divided into multiple partitions, each processed by a separate task. This method facilitates parallelism, leveraging the cluster’s computational power effectively.

Partitions are distributed across the nodes of the cluster (a single node typically holds many partitions), and the number of partitions is governed by the cluster configuration and the dataset size. Managing partitions effectively can optimize resource utilization, reduce latency, and cut the overall processing time of the data.
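
A small sketch of inspecting and changing partition counts; the partition numbers here are arbitrary, not recommendations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Ask for 8 partitions explicitly; each partition becomes one task.
rdd = sc.parallelize(range(1_000_000), 8)
print(rdd.getNumPartitions())  # 8

# Changing the partition count redistributes the data (a shuffle).
print(rdd.repartition(16).getNumPartitions())  # 16
```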

Transformation

Transformations in Spark are operations applied to RDDs to create a new RDD. Common transformations include map(), filter(), and reduceByKey(). These operations are lazily evaluated, meaning they don’t execute immediately but are instead recorded in a lineage graph.

Lazy evaluation ensures that Spark optimizes the entire execution plan rather than executing operations one by one. When an action is called, Spark utilizes this lineage graph to build an efficient query execution plan, significantly reducing computational overhead.

Action

Actions trigger the execution of transformations to return results to the driver program or to write data to an external storage system. Examples of actions include collect(), count(), and saveAsTextFile(). Unlike transformations, actions are executed immediately.

Actions play a critical role in materializing the computations applied to RDDs. They kick off the data processing workflow and gather the results once all transformations have been applied. Spark optimizes execution by pipelining transformations and converting them into a series of tasks submitted to the cluster.
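
A short sketch of lazy transformations followed by actions; the data and the commented-out output path are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Transformations only record lineage; nothing has executed yet.
summed = pairs.filter(lambda kv: kv[1] > 0).reduceByKey(lambda x, y: x + y)

# Actions trigger execution of the whole pipeline:
print(summed.collect())  # [('a', 4), ('b', 2)]
print(summed.count())    # 2
# summed.saveAsTextFile("/tmp/summed")  # illustrative output path
```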

Directed Acyclic Graphs (DAG)

Directed acyclic graphs (DAGs) optimize the execution plan in Spark. When a transformation is performed, Spark constructs a DAG of stages representing the sequence of computation steps derived from RDD operations. Because the graph contains no cycles, execution is guaranteed to terminate rather than loop indefinitely.

DAGs contribute to fault tolerance and efficient job scheduling. By splitting tasks into stages, Spark can recover from failures by rerunning only the affected stages rather than the entire job, resulting in improved reliability and fault tolerance.

Datasets and DataFrames

Datasets and DataFrames are higher-level APIs that offer expressive data processing capabilities over RDDs. They provide a structured way to work with semi-structured and structured data, integrating schema information into the processing tasks.

Datasets are strongly typed using JVM objects and provide compile-time type safety. DataFrames, on the other hand, are untyped, and their schema is represented by a collection of rows and columns, similar to a relational database table. This abstraction simplifies complex data manipulations and optimizes execution through the Catalyst optimizer.
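
Since Datasets are JVM-only (Scala/Java), here is a DataFrame sketch in PySpark; the column names and values are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The schema (given as a DDL string) travels with the data, unlike a raw RDD.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)],
    "name STRING, age INT",
)

df.printSchema()
df.filter(df.age > 40).select("name").show()  # optimized by Catalyst
```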

Tasks

In Spark, tasks are units of work sent to worker nodes for execution. Each task operates on a partition of data, leveraging the distributed nature of the system to parallelize computations. Tasks are fundamental in the execution of transformations and actions defined on RDDs.

Detailed task tracking ensures fault tolerance and load balancing. If a task fails or straggles, Spark can reassign it to another node for re-execution. Effective task scheduling and execution management can significantly boost the speed of data processing workflows.


What's new in Apache Spark 4?

Apache Spark 4 represents a major evolution in the Spark ecosystem, introducing a wide range of powerful new features and improvements that enhance performance, developer productivity, SQL expressiveness, Python usability, and real-time analytics. This release builds on the solid foundation of previous versions while modernizing key components like Spark Connect, SQL support, Structured Streaming, and the PySpark API to meet the needs of today’s data engineering and data science workloads.

Key new features in Apache Spark 4:

  • Enhanced SQL capabilities: Spark 4 introduces substantial SQL language improvements, including ANSI SQL mode by default for stricter semantics, SQL scripting with session variables and control flow, named parameter markers, and a new PIPE syntax to make queries more expressive and maintainable (see the SQL sketch after this list).
  • Spark Connect evolution: The Spark Connect architecture now achieves high feature parity with the classic mode and supports a broader range of languages (Go, Swift, Rust), delivering a modular client-server experience with easier migration via configuration.
  • Advanced Python support: PySpark sees major upgrades, such as native plotting using Plotly directly on DataFrames, a Python Data Source API for custom connectors, polymorphic UDTFs (user-defined table functions), and improved compatibility with recent Python versions.
  • New data types and data integrity: A new VARIANT data type simplifies handling semi-structured JSON and related data, and enabling ANSI SQL mode by default improves data reliability and portability of SQL workloads.
  • Structured Streaming improvements: Enhancements include a new arbitrary stateful processing API (e.g., transformWithState) for robust state handling and better state store observability and debugging support.
  • Developer productivity and observability: Spark 4 introduces structured logging with JSON output that integrates with modern observability tools, along with cleaner error messages and improved diagnostics.
  • Performance and ecosystem upgrades: This release includes broad performance optimizations, expanded SQL functions, improved Kubernetes support, and a more efficient shuffle and resource management framework, alongside ongoing enhancements to machine learning and connector ecosystems.
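
As a hedged illustration of two of the SQL additions above (exact syntax can vary across 4.x minor releases), this sketch tries the pipe syntax and the VARIANT type on invented data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql(
    "CREATE OR REPLACE TEMP VIEW t AS "
    "SELECT * FROM VALUES (1, 'a'), (2, 'b'), (3, 'a') AS v(id, grp)"
)

# SQL pipe syntax: each |> step transforms the result of the previous one.
spark.sql("""
    FROM t
    |> WHERE id > 1
    |> AGGREGATE COUNT(*) AS n GROUP BY grp
""").show()

# VARIANT: parse semi-structured JSON once, extract typed fields later.
spark.sql("""
    SELECT variant_get(parse_json('{"user": {"id": 42}}'), '$.user.id', 'int') AS user_id
""").show()
```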

In addition to the initial 4.0 release, the Spark community continues to refine the platform with updates such as Spark 4.0.2 and previews for upcoming 4.1 and 4.2 releases, bringing incremental features and improvements post-4.0.

Spark architecture and components

Spark Core Engine

Spark’s core engine is the underlying distributed analytics execution framework that makes Spark a unified platform for large-scale data processing across clusters. It provides the foundational task scheduling and dispatch layer, which takes user transformation and action requests and breaks them down into fine-grained tasks that can run across many cluster nodes in parallel.

Core also manages memory and caching, allowing intermediate data to be held in-memory to reduce disk I/O and speed up iterative and interactive jobs, and implements fault tolerance through lineage tracking, meaning lost data segments can be recomputed from their transformation history rather than being redundantly stored. All higher-level libraries such as Spark SQL, Structured Streaming, and MLlib leverage this engine to execute complex workloads efficiently.

Driver program

The driver program in a Spark application is where the user’s main process runs and orchestrates the entire workflow. It initializes the SparkSession (or SparkContext), parses application code, and constructs the logical execution plan that represents the series of operations requested. The driver breaks this plan down, requests resources from the cluster manager, then schedules tasks on worker nodes via executors, continuously monitoring progress, handling retries, and consolidating results.
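
A minimal sketch of a driver program, assuming a local master for illustration (a production deployment would point at YARN, Kubernetes, or a standalone cluster instead):

```python
from pyspark.sql import SparkSession

# The driver starts here: building the SparkSession connects to the
# cluster manager and registers the application.
spark = (
    SparkSession.builder
    .appName("example-driver")
    .master("local[4]")  # placeholder; production might use yarn or k8s://...
    .getOrCreate()
)

df = spark.range(100)  # the logical plan is constructed on the driver
print(df.count())      # the driver schedules tasks on executors and gathers the result
spark.stop()           # release the application's resources
```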

Cluster manager

The cluster manager acts as the resource allocator for Spark, deciding how cluster machines’ CPU, memory, and other resources are assigned to running applications. Spark supports multiple managers, including its own standalone manager, YARN in Hadoop ecosystems, and Kubernetes in containerized cloud deployments.

Once the driver requests resources, the cluster manager launches executor processes on selected worker nodes, ensuring that each application receives isolated runtime capacity. This division of resource control and execution allows Spark to scale efficiently and share cluster capacity among multiple concurrent jobs without interference.

Executors

Executors are the worker processes Spark launches on cluster nodes to carry out actual data processing tasks. Each Spark application gets its own set of executors, which persist for the application’s lifetime and run many tasks over their existence.

Executors pull tasks from the driver, execute them on partitions of the dataset in parallel, and handle intermediate data storage in memory or disk as needed. They also report task status and outputs back to the driver and deal with data shuffles between nodes when required by operations such as joins and aggregations. Because executors persist, they enable reuse of cached data and reduce overhead from repeatedly launching processes.
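
A sketch of sizing executors through standard configuration properties; the values below are placeholders, not tuned recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-sizing")
    # Standard executor properties; the values are illustrative.
    .config("spark.executor.instances", "4")  # how many executor processes
    .config("spark.executor.cores", "2")      # concurrent tasks per executor
    .config("spark.executor.memory", "4g")    # heap available for tasks and caching
    .getOrCreate()
)
```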

Directed Acyclic Graph (DAG) Scheduler

The DAG scheduler in Spark takes the logical transformations defined by a user and organizes them into a Directed Acyclic Graph (DAG) of operations with clear dependencies. This graph represents how data flows through each transformation or action in the job.

Spark then partitions this DAG into stages and tasks, grouping operations that can be run in parallel while respecting data dependencies. This transformation from high-level logic into executable units enables efficient parallel execution across the cluster and allows Spark to perform optimizations such as pipelining stages and minimizing data movement before sending tasks to executors.
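
One way to peek at what the scheduler builds is the lineage and plan output; a small sketch (the exact output format varies by version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = (
    sc.parallelize(range(100))
      .map(lambda x: (x % 10, x))
      .reduceByKey(lambda a, b: a + b)
)
# The lineage the DAG scheduler splits into stages; PySpark returns bytes here.
print(rdd.toDebugString().decode("utf-8"))

# For DataFrames, explain() shows the optimized physical plan instead;
# the Exchange operator marks a shuffle, i.e. a stage boundary.
df = spark.range(100).selectExpr("id % 10 AS key").groupBy("key").count()
df.explain()
```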

Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs) are the original distributed data abstraction in Spark, representing immutable collections of data partitioned across a cluster. RDDs track their lineage, or the sequence of operations that created them, enabling Spark to recompute partitions if they are lost due to failures rather than having to store extra copies.

Users perform functional transformations on RDDs (like map, filter, reduce) to express parallel computation, and RDDs form the low-level substrate beneath higher-level APIs like DataFrames and Datasets, meaning they still influence how computations are executed internally.

DataFrames and Datasets

DataFrames and Datasets are higher-level APIs built on RDDs that provide schema-aware abstractions for structured data. They allow developers to express complex queries with familiar relational operations, while Spark applies its Catalyst optimizer to refine and optimize the execution plan before running it.

DataFrames (like tables) and Datasets (typed, language-integrated structures) make Spark code easier to write and maintain compared to raw RDDs, and because they expose schema and query structure, Spark can plan more efficient data access paths and join strategies.

Spark SQL

Spark SQL is the component that enables SQL-centric and structured data processing on Spark. It allows users to execute SQL queries over DataFrames, integrate with BI tools via JDBC/ODBC, and benefit from optimized query planning and execution through Catalyst.

Spark SQL bridges the gap between traditional SQL database workflows and large-scale distributed data processing, letting users treat structured data just like in a relational database while still leveraging Spark’s scalability and speed.
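
A minimal Spark SQL sketch, with the table and column names invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 45)], "name STRING, age INT")
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL

# The same Catalyst-optimized engine runs both the SQL and DataFrame forms.
spark.sql("SELECT name FROM people WHERE age > 40").show()
```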

Structured Streaming

Structured Streaming is Spark’s stream processing engine that treats a live data stream as a continuously appended table, allowing developers to express streaming computations in terms of high-level operations similar to batch queries. Spark then incrementally updates results as new data arrives, enabling continuous data processing with built-in fault tolerance and state management.

Recent enhancements in Spark include a Real-Time Mode for ultra-low latency processing, making it possible to handle sub-second, event-by-event patterns without rewriting existing Structured Streaming code.
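
A short Structured Streaming sketch using the built-in rate source for test data; the rate, window size, and console sink are illustrative choices:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Treat the stream as a continuously appended table and aggregate over windows.
counts = stream.groupBy(window(stream.timestamp, "10 seconds")).count()

query = (
    counts.writeStream
    .outputMode("complete")  # re-emit full aggregates as new data arrives
    .format("console")
    .start()
)
query.awaitTermination(30)  # run briefly for the sketch
query.stop()
```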

MLlib (Machine Learning Library)

MLlib is Spark’s scalable machine learning library designed to work efficiently across distributed datasets. It includes implementations of common algorithms for classification, regression, clustering, and collaborative filtering, allowing model training and evaluation at scale.

MLlib integrates with Spark’s execution engine so that training and prediction tasks are distributed across executors and benefit from in-memory computation, simplifying the development of large-scale machine learning workflows on vast datasets.
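
A compact MLlib sketch with a toy, hand-made dataset (the column names and values are invented):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Toy dataset; a real workload would read a distributed dataset instead.
data = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0), (0.1, 0.9, 0.0)],
    "f1 DOUBLE, f2 DOUBLE, label DOUBLE",
)

# Assemble feature columns into the vector column MLlib estimators expect.
train = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(data)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()
```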

GraphX / GraphFrames

GraphX and GraphFrames extend Spark for graph-centric analytics, enabling users to represent and process graph data structures at scale. GraphX provides a specialized API for graph algorithms and computation in Scala/Java, while GraphFrames offers a DataFrame-based approach for graph queries. Both leverage Spark’s distributed engine to execute graph traversals, pattern matching, and analytics across large networks of connected data.
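
A hedged GraphFrames sketch; it assumes the external graphframes package is installed alongside Spark, and the graph itself is invented:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # external package: graphframes

spark = SparkSession.builder.getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"]
)
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# PageRank runs as ordinary distributed Spark jobs under the hood.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```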

Spark Connect

Spark Connect is a client-server API architecture introduced to decouple client applications from the execution engine. With Spark Connect, developers run a standalone Spark server that accepts remote commands from clients (e.g., Python or Scala) and executes them in the cluster.

This separation allows for cleaner API surfaces, easier upgrades, better stability for interactive workloads, and remote debugging from familiar development environments, all while maintaining compatibility with existing Spark features.
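
A minimal Spark Connect sketch; the sc:// address is a placeholder for wherever your Spark Connect endpoint runs (15002 is the server's default port):

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of an in-process engine.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The client builds an unresolved plan and ships it to the server,
# which plans and executes it on the cluster.
spark.range(10).filter("id % 2 = 0").show()
```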

Spark Declarative Pipelines (SDP)

Spark Declarative Pipelines introduce a declarative approach to defining data pipelines, letting developers describe the desired state of datasets and their dependencies rather than manually orchestrating every transformation and execution step.

SDP automatically handles dependency resolution, execution sequencing, parallelism, checkpoints, and retries for both batch and streaming workflows. This model simplifies pipeline development, increases reliability, and leverages Spark’s unified engine to optimize performance and resource usage.


Tips from the expert

Ahmed Khaldi

Solutions Architect

Ahmed Khaldi is a Solutions Architect with extensive experience in Apache Spark and a proven track record in optimizing cloud infrastructure for teams including DevOps, FinOps, SecOps, and ITOps.

In my experience, here are tips that can help you make the most of Apache Spark architecture:

  1. Leverage advanced memory tuning: Beyond basic caching, experiment with memory fractions (spark.memory.fraction, spark.memory.storageFraction) to optimize the balance between execution and storage memory. Fine-tuning these parameters can prevent excessive garbage collection and improve job stability (several of these settings appear in the sketch after this list).
  2. Optimize data serialization: Use Kryo serialization instead of Java serialization (spark.serializer). Kryo is more efficient and can significantly reduce the serialization overhead, especially with large and complex data structures.
  3. Strategic use of coalesce and repartition: While repartition increases parallelism, it can be costly due to shuffling. Use coalesce() to reduce the number of partitions efficiently without a full shuffle, particularly after filtering operations that reduce dataset size.
  4. Implement efficient join strategies: For large-scale joins, consider using broadcast joins (broadcast function) to distribute smaller datasets across all nodes, thus avoiding shuffling large datasets. Optimize join performance by ensuring the join keys are well-distributed.
  5. Utilize custom partitioners: For operations that involve repeated shuffling, such as aggregations and joins, creating custom partitioners can ensure data is distributed in a way that minimizes shuffling and balances workload across the cluster.
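
A configuration sketch pulling together several of these tips; all values and data are illustrative starting points, not tuned recommendations:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # Tip 2: Kryo serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Tip 1: memory fractions; these values are illustrative.
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.4")
    .getOrCreate()
)

big = spark.range(10_000_000)

# Tip 3: coalesce after a selective filter avoids a full shuffle.
small_subset = big.filter("id % 1000 = 0").coalesce(4)

# Tip 4: broadcast the small side of a join so the large side isn't shuffled.
dims = spark.createDataFrame([(0, "even"), (1, "odd")], ["key", "label"])
joined = big.withColumn("key", big.id % 2).join(broadcast(dims), "key")
```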

Best practices for designing Apache Spark applications

Optimize Data Partitioning

Effective data partitioning is crucial for maximizing the performance and scalability of your Spark application. When data is unevenly distributed across partitions, it can lead to skewed workloads where some nodes are overburdened while others are underutilized, causing bottlenecks and inefficiencies.

To avoid these issues, carefully design your partitioning strategy. Utilize hash partitioning or range partitioning based on the specific characteristics of your data and the operations you plan to perform. Additionally, consider re-partitioning your data after certain transformations to ensure balanced workloads throughout your processing pipeline. Monitoring partition sizes and adjusting the number of partitions based on the data volume can also help maintain optimal performance.
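
A sketch of this pattern; the path, column names, and partition count are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("/data/events")  # placeholder path
active = events.filter("status = 'ACTIVE'")  # placeholder column and value

# Hash-partition on the key used by later joins/aggregations so the
# workload is spread evenly; 200 is an illustrative partition count.
balanced = active.repartition(200, "customer_id")
print(balanced.rdd.getNumPartitions())  # 200
```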

Utilize In-Memory Computations

Leveraging Spark’s in-memory computing capabilities can dramatically reduce the latency of your data processing tasks. By caching datasets that are frequently accessed or used in iterative algorithms, you minimize the need for expensive disk I/O operations.

Use the cache() method for datasets that fit entirely in memory and are used multiple times, or the persist() method with different storage levels for more control over memory usage and fault tolerance. In-memory computations are particularly beneficial for machine learning algorithms and interactive data analysis, where speed is essential. Regularly review your caching strategy to ensure that the most beneficial datasets are cached, and uncache datasets that are no longer needed to free up memory.
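
A short caching sketch; the path is a placeholder, and the storage-level choice depends on your memory budget:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/features")  # placeholder path

df.cache()   # DataFrames default to a memory-and-disk storage level
df.count()   # an action is needed to actually materialize the cache

# persist() gives explicit control over the storage level:
ids = df.select("id").persist(StorageLevel.MEMORY_ONLY)

df.unpersist()  # free memory once the cached data is no longer needed
```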

Choose the Right Data Format

Selecting the appropriate data format for your Spark application can significantly impact both performance and storage efficiency.

Columnar storage formats like Parquet and ORC are optimized for read-heavy operations, offering efficient compression and encoding schemes that reduce storage space and improve I/O performance. These formats support column pruning and predicate pushdown, which can greatly accelerate query execution times by reading only the necessary data.

For write-heavy workloads, consider formats like Avro, which are designed for efficient data serialization and deserialization. Additionally, if your application involves complex data types or nested structures, these formats handle such scenarios more effectively than traditional row-based formats.
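
A small sketch contrasting the formats; the paths are illustrative, and the Avro line assumes the external spark-avro package:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).selectExpr("id", "id % 7 AS category")

# Columnar, compressed storage suited to read-heavy analytics.
df.write.mode("overwrite").parquet("/tmp/events.parquet")

# Column pruning + predicate pushdown: only the needed column and
# matching row groups are read back.
spark.read.parquet("/tmp/events.parquet").filter("category = 3").select("id").show()

# Row-oriented Avro suits write-heavy pipelines (requires the spark-avro package):
# df.write.mode("overwrite").format("avro").save("/tmp/events.avro")
```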

Optimize Transformations and Actions

Designing efficient transformations is key to ensuring optimal performance in Spark. Narrow transformations, such as map, filter, and mapPartitions, operate on each partition independently and do not require data shuffling across the network, making them inherently more efficient.

Wide transformations like reduceByKey, groupByKey, and join involve redistributing data across partitions, which can introduce significant overhead. To minimize this overhead, apply filters and aggregations as early as possible in your processing pipeline to reduce the volume of data that needs to be shuffled. Additionally, use combiners for associative operations to reduce the amount of data transferred during shuffles. Monitoring and tuning these transformations can lead to substantial performance gains.
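
A sketch of the narrow-versus-wide distinction with toy data, contrasting reduceByKey and groupByKey:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)] * 1000)

# reduceByKey combines values within each partition before shuffling,
# so far less data crosses the network.
good = pairs.reduceByKey(lambda x, y: x + y)

# groupByKey ships every value across the shuffle first, then aggregates.
worse = pairs.groupByKey().mapValues(sum)

# Filtering early shrinks whatever the wide transformation must shuffle.
lean = pairs.filter(lambda kv: kv[1] > 1).reduceByKey(lambda x, y: x + y)
print(lean.collect())
```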

Use Broadcast Variables and Accumulators

Broadcast variables and accumulators are tools in Spark that can optimize your application’s performance. Broadcast variables allow you to efficiently distribute a read-only copy of a large dataset to all worker nodes, preventing the need to send a copy with every task. This is particularly useful for operations that join a large RDD with a smaller dataset.

To use broadcast variables, simply call the broadcast method on the dataset. Accumulators, on the other hand, provide a mechanism for aggregating information across tasks, such as counters or sums. They are particularly useful for implementing global counters or collecting debugging information. Use them judiciously to track metrics or gather aggregated results without introducing significant overhead.
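
A minimal sketch of both tools, using an invented lookup table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Broadcast: ship the lookup table to each executor once, not with every task.
country_names = sc.broadcast({"us": "United States", "fr": "France"})

# Accumulator: aggregate a counter across all tasks.
misses = sc.accumulator(0)

def resolve(code):
    name = country_names.value.get(code)
    if name is None:
        misses.add(1)  # counted on the executors, summed on the driver
    return name

print(sc.parallelize(["us", "fr", "us", "de"]).map(resolve).collect())
print(misses.value)  # 1, reliable here because the action ran exactly once
```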

Monitor and Tune Performance

Regular performance monitoring and tuning are essential for maintaining the efficiency of your Spark applications. The Spark UI provides valuable insights into job execution, including information on task durations, stage progress, and resource utilization. Use these metrics to identify bottlenecks, such as tasks that take significantly longer than others (stragglers) or excessive garbage collection times.

Adjust Spark configurations like executor memory, number of cores, and parallelism settings to optimize resource usage. Experiment with different configurations to find the optimal balance for your specific workload. Additionally, consider using external monitoring tools like Ganglia, Graphite, or custom dashboards to gain deeper insights into cluster performance and to automate alerting for potential issues.

Unlocking data insights: Apache Spark and Ocean for Apache Spark integration

Learn more about the power of integration with our product, Ocean for Apache Spark. It brings the best of two worlds, Apache Spark and Ocean Protocol, together in a seamless and efficient data processing system. With Ocean for Apache Spark, developers can leverage the power of Apache Spark to perform complex data processing tasks on data from various sources within the Ocean Protocol network. This integration opens up new possibilities for data-driven applications, such as machine learning and analytics, by providing access to a wide range of diverse and valuable data sets.

Tap into a vast array of data sets, uncovering insights and possibilities that were previously out of reach:

  • Harness the power of Apache Spark for complex data processing tasks
  • Access and process data from various Ocean Protocol sources
  • Maintain data privacy, security, and control
  • Discover new insights and opportunities with big data analytics

Ready to explore the potential of your big data? For more information, visit Ocean for Apache Spark.