Apache Spark Ecosystem
Spark has an increasing number of use cases across industries – including retail, healthcare, finance, advertising, and education. It powers many new-age companies and has contributors from more than 300 companies, and it continues to gain traction in the big data space.
The components of the Apache Spark ecosystem make it more popular than other big data frameworks. It is a platform for many use cases, ranging from real-time data analytics and structured data processing to graph processing.
Programming Language Support: Apache Spark supports popular languages for data analysis like Python and R, as well as the enterprise-friendly Java and Scala. Whether you are an application developer or a data scientist, you can harness the scalability and speed of Apache Spark.
On top of its core engine, Spark provides a set of powerful higher-level libraries that can be used seamlessly within the same application.
Spark SQL is a module for working with structured data. It provides a standard interface for reading from and writing to other datastores, and it integrates tightly with the rest of the Spark ecosystem (for example, combining SQL query processing with machine learning). A server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.
An early addition to Apache Spark, Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications. It enables powerful interactive and analytic applications across both real-time and historical data, and it readily integrates with a wide variety of popular data sources. Spark Streaming extended the Apache Spark concept of batch processing into streaming by breaking the stream down into a continuous series of micro-batches.
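The micro-batch idea can be illustrated in plain Python (this is a conceptual sketch, not the Spark Streaming API – Spark groups records by a fixed time interval, while for simplicity this sketch groups by count):

```python
# Conceptual sketch of micro-batching: records arriving on a stream are
# grouped into small batches, and each batch is processed with ordinary
# batch logic, just as Spark Streaming runs a Spark job per interval.
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield successive fixed-size batches from a (possibly infinite) stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = range(10)                       # stands in for an incoming stream
totals = [sum(b) for b in micro_batches(events, 4)]  # per-batch computation
print(totals)  # [6, 22, 17]
```

Because each micro-batch is just a small batch job, the same code paths (and fault-tolerance machinery) used for batch processing apply to streaming.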
MLlib – Apache Spark’s scalable machine learning library, usable in Java, Scala, and Python as part of Spark applications. It includes a framework for creating machine learning pipelines, allowing easy implementation of feature extraction, selection, and transformation on any structured dataset.
GraphX – Apache Spark’s API for graphs and graph-parallel computation. It includes a growing collection of distributed graph algorithms and graph builders to simplify graph analytics tasks.
Spark Core (General Execution Engine):
Spark Core is the general processing engine of the Spark platform; it provides in-memory computing capabilities that deliver fast execution for a wide variety of applications. It is the foundation for parallel and distributed processing of large datasets, providing distributed task dispatching, scheduling, and basic I/O functionality. It also handles node failures and recomputes missing pieces of data.
Spark supports many data sources, including Apache Cassandra (via the Spark Cassandra Connector), Apache Kafka, Amazon Kinesis, and Apache Flume.