What is Apache Spark™?
Apache Spark™ is a high-performance, open source analytics engine built for speed, scalability, and ease of use. Developed in 2009 at UC Berkeley’s AMPLab and open sourced in 2010, Spark is a fast, general-purpose cluster computing system for big data, providing an optimized engine that supports general computation graphs for data analysis.
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Spark can run workloads up to 100x faster than Hadoop MapReduce.
Apache Spark Ecosystem
Spark has an increasing number of use cases in various industries—including retail, healthcare, finance, advertising, and education. It powers many new-age companies and has contributors from 300+ companies. It is continuing to gain traction in the big data space.
The components of the Apache Spark ecosystem make it more popular than other big data frameworks. It is a platform for many use cases, ranging from real-time data analytics to structured data processing and graph processing.
Programming Language Support: Apache Spark supports popular languages for data analysis like Python and R, as well as the enterprise-friendly Java and Scala. Whether you are an application developer or a data scientist, you can harness the scalability and speed of Apache Spark.
Spark’s components form a set of powerful, higher-level libraries that can be used seamlessly in the same application.
Spark SQL is a module for working with structured data. It provides a standard interface for reading from and writing to other datastores, as well as powerful integration with the rest of the Spark ecosystem (for example, integrating SQL query processing with machine learning). A server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.
An early addition to Apache Spark, Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications. It enables powerful, interactive, analytic applications across both streaming and historical data, and readily integrates with a wide variety of popular data sources. Spark Streaming extended the Apache Spark concept of batch processing into streaming by breaking the stream down into a continuous series of micro-batches.
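Conceptually, micro-batching just slices an unbounded stream into small batches that are each processed with ordinary batch logic. A plain-Python sketch of the idea (not the Spark API; Spark actually batches by time interval rather than by element count):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Break an unbounded stream into small batches, mimicking how
    Spark Streaming turns a stream into micro-batches. batch_size is
    a stand-in for the batch interval Spark Streaming really uses."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # each batch is then processed with normal batch logic

# Process a "stream" of events batch by batch, e.g. summing each batch.
events = range(10)
sums = [sum(batch) for batch in micro_batches(events, 4)]
```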
Check out Apache Spark Structured Streaming with DataFrames blog by Paul Brebner, Tech Evangelist. Our Spark Streaming, Kafka and Cassandra tutorial demonstrates how to set up Apache Kafka and use it to send data back to Spark Streaming where it is summarized before being saved in Cassandra.
MLlib is Apache Spark’s scalable machine learning library. It is usable in Java, Scala, and Python as part of Spark applications, and includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selection, and transformation on any structured dataset.
The “Third Contact with a Monolith: In the Pod” blog is an introduction to Spark Machine Learning using MLlib and RDDs, and covers data splitting, model training, and evaluation. The “Fourth Contact with a Monolith” blog shows how to build an example DataFrames machine learning pipeline in Scala.
The “behind the scenes” blog shows how to pre-process data from Cassandra to prepare a wide column format table for use in Spark MLlib, using DataFrames operations.
View our step-by-step example of using Apache Spark MLlib to do linear regression.
GraphX is Apache Spark’s API for graphs and graph-parallel computation. It includes a growing collection of distributed graph algorithms and builders to simplify graph analytics tasks.
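The “graph-parallel” model GraphX builds on is vertex-centric: each vertex iteratively updates its state from messages passed along edges. A plain-Python sketch of that idea, using connected components as the example (illustrative code, not the GraphX API, which is Scala-based):

```python
def connected_components(edges, vertices):
    """Vertex-centric ("think like a vertex") iteration: each vertex
    repeatedly adopts the smallest component label seen among its
    neighbours until no label changes."""
    label = {v: v for v in vertices}            # initial label = own id
    changed = True
    while changed:
        changed = False
        for a, b in edges:                      # "messages" along each edge
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    return label

# Two components: {1, 2, 3} and {4, 5}.
labels = connected_components([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5])
```

GraphX distributes exactly this kind of iteration across the cluster via its Pregel-style API.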
Spark Core (General Execution Engine)
Spark Core is the general processing engine for the Spark platform; it provides the in-memory computing capabilities that deliver fast execution of a wide variety of applications. It is the foundation for parallel and distributed processing of large datasets, providing distributed task dispatching, scheduling, and basic I/O functionality. It also handles node failures and recomputes missing pieces.
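That recovery idea can be sketched in plain Python (illustrative, not Spark’s actual internals): each partition records the lineage of transformations that produced it, so a lost partition can be recomputed from its source rather than restored from a replica:

```python
def build_lineage(source, transforms):
    """Record how each partition is derived, instead of storing copies."""
    return {"source": source, "transforms": transforms}

def compute(lineage, partition_index):
    """Re-derive one partition by replaying its transformations."""
    data = lineage["source"][partition_index]
    for fn in lineage["transforms"]:
        data = [fn(x) for x in data]
    return data

partitions = [[1, 2], [3, 4]]                    # input split into partitions
lineage = build_lineage(partitions, [lambda x: x * 10])

cache = {0: compute(lineage, 0)}                 # partition 1's result is "lost"
missing = compute(lineage, 1)                    # recompute it from lineage
```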
Spark supports many data sources, including (via the Spark Cassandra Connector) Apache Cassandra, as well as Apache Kafka, Amazon Kinesis, and Apache Flume.
Cluster Management in Apache Spark
Apache Spark applications can run on three different cluster managers:
- Standalone cluster: Used if only Spark is running. In standalone mode, Spark manages its own cluster, and each application runs an executor on every node in the cluster.
- Apache Mesos: A dedicated cluster and resource manager that provides rich resource scheduling capabilities.
- YARN: Comes with most Hadoop distributions and is the only one of the three that supports Kerberos security. It allows dynamic sharing and central configuration of the same pool of cluster resources between the various frameworks that run on YARN.
Check out our blog Debugging Jobs in Apache Spark UI. The post runs a basic Spark job that selects data from a Cassandra database using a couple of different methods and examines how to compare the performance of those methods using the Spark UI.
Apache Spark Architecture
Spark has a well-defined, layered architecture in which all the Spark components and layers are loosely coupled and integrated with various extensions and libraries.
Spark uses a master/worker architecture: a driver program talks to a single coordinator, called the master, which manages the workers on which executors run.
The Spark engine is responsible for scheduling, distributing, and monitoring applications across the cluster. It can detect patterns and provide actionable insight into big data in real time, and it remains one of the most active open source tools reshaping the big data market.
Spark lets you write applications in different programming languages, which makes developers’ lives easier. Spark Streaming handles real-time stream processing and integrates with other frameworks, making streaming in Spark easy, fault-tolerant, and well integrated.
Since a Spark cluster can be deployed standalone or under a cluster manager, it can access data from diverse sources. The project also has many contributing developers, an active mailing list, and a JIRA for issue tracking.
Additionally, Spark lets you seamlessly combine various libraries to create a workflow and manage analytics. Its in-memory analytics accelerate machine learning algorithms and reduce read and write round trips to and from disk.
Spark and Cassandra
Apache Spark provides advanced analytics capabilities; however, it requires a fast, distributed back-end data store. Apache Cassandra is a modern, reliable, and scalable choice for that data store. Apache Spark, when fully integrated with the key components of Cassandra, provides the resilience and scale required for big data analytics.
Additionally, the Spark engine resides in the operational database, so there is no need for extracting, transforming, and loading into a new environment. Fundamentally, Spark and Cassandra clusters are deployed to the same set of machines. While Cassandra stores the data, Spark nodes are co-located with Cassandra and do the data processing. Spark understands how Cassandra distributes the data and reads only from the local node.
Our Getting Started With Instaclustr Spark and Cassandra tutorial is a good launch point for learning to provision a cluster using Spark, Cassandra, and more.
You can also download our white paper, Powering Intelligent Application with Apache Cassandra, Apache Spark, and Apache Kafka to understand how the three technologies work together.
We performed two benchmark studies. The first, Multi Datacenter Apache Spark and Apache Cassandra Benchmark, compared the performance and stability of running Spark and Cassandra collocated in the same data center (DC) versus running a second DC dedicated to analytic Spark jobs. Round 2 focused on running a single DC versus two DCs of the same price.
Spark Cassandra Connector
Thanks to the Spark Cassandra Connector, the integration between Spark and Cassandra is seamless allowing for efficient distributed processing.
Learn a few key lessons about how to get the best out of the Cassandra Connector for Spark with the blog by Ben Slater, CPO at Instaclustr, Cassandra Connector for Spark: 5 Tips for Success. Additionally, our tutorial Instaclustr Spark With SSL Configured Cassandra Cluster covers the additional steps that must be taken when submitting jobs to configure the Spark Cassandra Connector to use SSL.
Elasticsearch is a JSON-based search and analytics engine popular with log processing systems. Elasticsearch-Hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Elasticsearch can be used from Spark in two ways: through the dedicated native support available since Elasticsearch-Hadoop 2.1, or through the Map/Reduce bridge available since 2.0. Elassandra (Elasticsearch + Cassandra) is a fork of Elasticsearch modified to run on top of Apache Cassandra to provide advanced search features on Cassandra tables. Our tutorial will help you learn the basic steps of setting up an Instaclustr Elassandra cluster with Spark on Amazon Web Services (AWS) and how to write to and query Elassandra from Spark.
Apache Spark Use Cases
Apache Spark’s ability to process streaming data is one of its key use cases. With so much data being generated daily, it has become essential for companies to be able to stream and analyze data in real time.
Apache Spark allows for entirely new use cases to enhance the value of big data. Some of the areas where Spark can be used include:
- Supply chain optimization and maintenance
- Optimization in advertising and marketing, to understand the probability of users clicking on available ads and maximize revenue and engagement.
- Fraud detection, by conducting real-time inspection of data packets to trace malicious activity and anomalous user behavior.
- Interactive, exploratory analytics: Spark is fast enough to perform exploratory queries without sampling.
- Fog computing, which increasingly requires low-latency, massively parallel processing for machine learning and extremely complex graph analytics algorithms; Spark’s key components qualify it as a fog computing solution.
Spark simplifies intensive, high-volume processing of real-time or archived data and seamlessly integrates relevant complex capabilities such as machine learning and graph algorithms. Its potential is limited only by imagination.
Apache Spark vs Hadoop
Apache Spark is a powerful alternative to Hadoop MapReduce, with rich features such as machine learning, real-time stream processing, and graph computation. It has been gaining popularity and is emerging as the standard execution engine in Hadoop because of its extensible and flexible APIs, high performance, ease of use, and increased developer productivity.
Spark is typically faster than Hadoop because, by default, it uses RAM rather than disk to store intermediate results (see, e.g., performance modelling of Hadoop vs. Spark). Any results that won’t fit in RAM are spilled to disk or recomputed on demand. This means that (1) you need lots of RAM for Spark to work efficiently, and (2) with less RAM it will still work, just more slowly. A performance comparison between the Hadoop and Spark frameworks using the HiBench benchmarks concludes that the performance of these frameworks varies significantly, and that Spark deals with large amounts of data more efficiently than Hadoop in most cases.
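A toy illustration of the difference (plain Python, not a real benchmark): recomputing an intermediate result for every downstream job stands in for MapReduce’s disk round trips between stages, while reusing a cached copy mirrors Spark’s in-memory approach:

```python
def expensive_stage(data):
    """Stand-in for an expensive intermediate transformation."""
    return [x * x for x in data]

data = list(range(1000))

# MapReduce-style: each downstream job re-derives the intermediate
# result (analogous to writing to and re-reading from disk).
result_a = sum(expensive_stage(data))
result_b = max(expensive_stage(data))

# Spark-style: compute once, keep in memory, reuse across actions
# (analogous to rdd.cache()).
cached = expensive_stage(data)
result_a2, result_b2 = sum(cached), max(cached)
```

Both approaches give identical answers; the in-memory version simply avoids repeating the expensive stage.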
To provide easy access to your Spark processing engine, Instaclustr’s Spark clusters include the Spark Jobserver (REST API).
We provide integrated Spark management and monitoring through the Instaclustr Management Console as well as provide 24×7 monitoring and support to our Spark customers. From our Managed Platform you can Pick‘n’Mix: Cassandra, Spark, Elasticsearch, Kibana, and Kafka.
Our Managed Spark 2.1.1 provides increased stability and feature enhancements while giving access to the key benefits of Spark 2.0, including up to 10x performance improvements, the Structured Streaming API, an ANSI SQL parser for Spark SQL, and a streamlined API. We have come a long way since the preview release of Managed Spark on Cassandra, followed by its full release a few years ago.
Instaclustr supports multi-cloud managed Spark on Amazon Web Services (Spark on AWS), Google Cloud Platform (Spark on GCP), and Microsoft Azure (Spark on Azure). Spark Jobserver provides a simple, secure method of submitting jobs to Spark without many of the complex setup requirements of connecting to the Spark master directly.
Gartner has some good resources to help understand the benefits of multi-cloud solutions:
- Architecting Portable and Multicloud Applications
- A Guidance Framework for Architecting Portable Cloud and Multicloud Applications
The study Performance Comparison of Spark Clusters Configured Conventionally and a Cloud Service shows how Spark as a cloud service gives more promising outcomes in terms of time, effort, and throughput.
Apache Spark Consulting
Our consulting experts have hands-on experience with open source big data technologies and are ready to assist at each lifecycle stage of your application. We offer a Spark with Cassandra Kickstart Package, meant for those looking to adopt Apache Spark on Cassandra. It brings you all the benefits and features of the Cassandra Kickstart Package, along with the expertise to evaluate the suitability of your concept and intended application on Apache Spark. To find out the cost of consulting packages, talk to a consultant.