Managed Spark

Instaclustr provides a fully hosted and managed Apache Spark™ solution on Cassandra so that you can embrace the analytical power of Spark without having to move your data.

Features of Managed Spark

Instaclustr Managed Spark is available on AWS, Azure, GCP, and IBM Cloud and provides a range of key features so that you can focus on the productive work of developing analytics with Spark.

Our management console allows you to deploy a fully managed and monitored Apache Spark cluster.

We provide 24/7 technical expert support for our Managed Apache Spark customers.

Spark runs directly on top of your data in Apache Cassandra, so you can execute computation on your storage nodes.

Spark fully integrates with the key components of Cassandra and provides the resilience and scale you need for your application.

Instaclustr Managed Apache Spark is SOC 2 certified, providing cluster security and availability assurance. Our SOC 2 program builds security and availability considerations into our design, and we continually review, test, and monitor the environment.

Benefits of Apache Spark

Spark runs analytics across your data and provides access to sophisticated open source machine learning and graph analytics tools for your business.

We customize and optimize the configuration of your cluster so that you can receive all the benefits of our fully managed and hosted Spark service with your Cassandra deployment on AWS, Azure, GCP, IBM Cloud, and in your private data center.

Spark lets you seamlessly combine libraries such as Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph) to create complex workflows and manage analytics.

Your Apache Spark engine is right where your operational database resides. There is no need to extract, transform, and load (ETL) your data into a new environment.

Apache Spark can be deployed in standalone cluster mode or in the cloud. Spark can access diverse data sources, including Cassandra, and it has easy-to-use APIs for working with large datasets.

Apache Spark is entirely open source under the Apache Software Foundation model. This means there is no lock-in, and an extensive ecosystem of supporting companies and projects exists.

Instaclustr has done an amazing job helping us design and build the backbone of the platform with Cassandra and Spark. Their team of consultants integrated with our own team and went beyond their core expertise to provide value at all times.

Mehdi Charafeddine, Senior Industry Architect, Distribution Sector at IBM

What is Apache Spark™?

Apache Spark is a fast and powerful open source processing engine built around speed, ease of use, and sophisticated analytics.

With an advanced DAG execution engine that supports acyclic data flow and in-memory computing, Apache Spark has been reported by the Spark project to run up to 100x faster than Hadoop MapReduce for in-memory workloads. UC Berkeley’s AMPLab developed Spark in 2009 and open sourced it in 2010. Since then, it has grown into one of the largest open source communities in big data. More than a thousand developers from 200+ companies have contributed to Spark since 2009.

Apache Spark Ecosystem

Apache Spark is a lightning-fast in-memory cluster computing engine that requires a fast, distributed back-end data store to provide advanced analytics capabilities. Apache Cassandra is the most modern, reliable, and scalable choice for that data store.

Programming Languages

Apache Spark supports popular languages for data analysis like Python and R, as well as the enterprise-friendly Java and Scala, thus allowing everyone from application developers to data scientists to harness its scalability and speed.

Libraries

The Spark core is complemented by a set of powerful, higher-level libraries that can be used seamlessly in the same application.

Spark SQL: Spark SQL is a module for working with structured data. It provides a standard interface for reading from and writing to other datastores. It also provides powerful integration with the rest of the Spark ecosystem (e.g. integrating SQL query processing with machine learning). A server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.

Spark Streaming: An early addition to Apache Spark, Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications. It enables powerful interactive and analytic applications across both streaming data (new data arriving in real time) and historical data, and it readily integrates with a wide variety of popular data sources. Spark Streaming extends the Apache Spark concept of batch processing into streaming by breaking the stream down into a continuous series of microbatches.
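The microbatch idea can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark Streaming code: the function names micro_batches and running_totals are hypothetical, and batches are cut by record count here for simplicity, whereas Spark Streaming groups records by a time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches from a (possibly unbounded) stream.

    Assumption: batches are cut by record count, not by time interval.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def running_totals(stream: Iterable[int], batch_size: int) -> List[int]:
    """Run an ordinary batch computation (sum) on each microbatch in turn."""
    totals: List[int] = []
    running = 0
    for batch in micro_batches(stream, batch_size):
        running += sum(batch)  # the per-batch job is plain batch code
        totals.append(running)
    return totals


if __name__ == "__main__":
    # Batches are [1, 2, 3], [4, 5, 6], [7]
    print(running_totals(range(1, 8), batch_size=3))  # prints [6, 21, 28]
```

The point of the sketch is that the per-batch logic is the same code you would write for a batch job; only the loop that feeds it changes.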

MLlib: Apache Spark’s scalable machine learning library, usable in Java, Scala, and Python as part of Spark applications. It includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selection, and transformation on any structured dataset.

GraphX: Apache Spark’s API for graphs and graph-parallel computation. It includes a growing collection of distributed graph algorithms and builders to simplify graph analytics tasks.

Apache Spark Core

Spark Core (General Execution Engine): The general processing engine for the Spark platform, Spark Core provides in-memory computing capabilities to deliver fast execution of a wide variety of applications. It is the foundation for parallel and distributed processing of large datasets, providing distributed task dispatching, scheduling, and basic I/O functionality. It also handles node failures and recomputes missing pieces.
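To make "recomputes missing pieces" concrete, here is a toy plain-Python sketch of the general idea: each derived partition remembers the input and transformation it came from, so a lost partition can be rebuilt on demand. The class LineageDataset and its methods are hypothetical names for illustration only and are not part of any Spark API.

```python
from typing import Callable, Dict, List


class LineageDataset:
    """A toy partitioned dataset that remembers how each partition is derived."""

    def __init__(self, source: List[List[int]], transform: Callable[[int], int]):
        self._source = source        # durable input partitions
        self._transform = transform  # the recorded derivation step
        self._cache: Dict[int, List[int]] = {}  # in-memory results
        self.computations = 0        # how many partition computations ran

    def compute(self, index: int) -> List[int]:
        """Return one partition, rebuilding it from the source if missing."""
        if index not in self._cache:
            self._cache[index] = [self._transform(x) for x in self._source[index]]
            self.computations += 1
        return self._cache[index]

    def lose_partition(self, index: int) -> None:
        """Simulate a node failure that drops one in-memory partition."""
        self._cache.pop(index, None)

    def collect(self) -> List[int]:
        """Gather all partitions, computing only the ones not in memory."""
        return [x for i in range(len(self._source)) for x in self.compute(i)]


if __name__ == "__main__":
    squares = LineageDataset([[1, 2], [3, 4], [5]], lambda x: x * x)
    print(squares.collect())     # prints [1, 4, 9, 16, 25]
    squares.lose_partition(1)    # partition 1 disappears
    print(squares.collect())     # prints [1, 4, 9, 16, 25]
    print(squares.computations)  # prints 4: three initial, one recomputation
```

Only the lost partition is rebuilt; the partitions that survived are reused as they are.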

Data Sources

Spark supports many data sources, including Apache Cassandra through the Spark Cassandra Connector.
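As a rough sketch of what reading Cassandra data from Spark can look like in Python, the helper below loads a table through the connector's DataFrame data source. It assumes you already have a SparkSession created with the Spark Cassandra Connector available and with spark.cassandra.connection.host pointing at your cluster. The helper names and the keyspace and table names are hypothetical; the session is passed in as an argument, so the snippet itself does not import PySpark.

```python
from typing import Any, Dict

# Data source name registered by the Spark Cassandra Connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def cassandra_options(keyspace: str, table: str) -> Dict[str, str]:
    """Build the reader options that identify a Cassandra table."""
    if not keyspace or not table:
        raise ValueError("keyspace and table must be non-empty")
    return {"keyspace": keyspace, "table": table}


def read_cassandra_table(spark: Any, keyspace: str, table: str) -> Any:
    """Load a Cassandra table as a DataFrame using an existing SparkSession.

    Assumption: `spark` was created with the connector on its classpath.
    """
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(**cassandra_options(keyspace, table))
        .load()
    )


if __name__ == "__main__":
    # Hypothetical names; replace with your own keyspace and table.
    print(cassandra_options("demo_keyspace", "demo_table"))
```

With a live session, a call such as read_cassandra_table(spark, "demo_keyspace", "demo_table") would return a DataFrame that can then be queried with Spark SQL or passed to the other libraries described above.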