Tutorial: Apache Zeppelin with Spark and Cassandra

Apache Zeppelin is a web-based notebook that facilitates interactive data analysis using Spark. Instaclustr now supports Apache Zeppelin as an add-on component to our managed clusters. This tutorial walks you through the basic steps of using Apache Zeppelin with Instaclustr Spark and Cassandra.

Provision a cluster with Cassandra, Spark and Zeppelin

  1. If you haven’t already signed up for an Instaclustr account, refer to our support article to sign up and create an account.
  2. Once you have signed up for Instaclustr and verified your email, log into the Instaclustr console and click the Create Cluster button.
  3. On the Create Cluster page, enter an appropriate name for your cluster. Under the Applications section, select:
    • Apache Cassandra
    • Apache Spark as an Add-on (Apache Spark 2.1.3 – Hadoop 2.6)
    • Apache Zeppelin as an Add-on (Apache Zeppelin 0.7.1 with Scala 2.11/Spark 2.1.1)
  4. Under the Data Centre section, select:
    • Amazon Web Services as the Infrastructure Provider
    • A minimum node size of t3.medium

  5. Under the Cassandra Options section, select:
    • Use Private IP Addresses for node discovery
    • Do not enable client encryption for Cassandra (see this article if you want to use Spark with Cassandra client-to-server encryption)
    • Add

  6. Leave the other options as default. Accept the terms and conditions and click the Create Cluster button. The cluster will be provisioned automatically and will be available for use once all nodes are in the running state.

Getting Started with Zeppelin

  1. Once all nodes in the cluster are in the running state, click on the Zeppelin tab on the cluster’s page.
  2. Go to the listed URL and enter the given credentials to access Zeppelin.
  3. You should then see the Zeppelin home page.

Basic Interaction with Zeppelin Notebook

  1. Create a new Notebook by clicking on the Create new note link. Give your note a name, leave Spark as the Default Interpreter, and click the Create Note button.
  2. The notebook is already preconfigured to use the Spark interpreter. Click the gear button on the top right of the notebook to see the enabled interpreters.
  3. Make sure the Spark interpreter is at the top of the list and the Cassandra interpreter is enabled. Click the Save button to save the settings.
  4. Load the dependencies using the following code:

    Then you will see the following output:

    Make sure the dependencies load without an error. If an error is thrown, click the gear button on the top right, go to the Interpreter menu, and restart the Spark interpreter. Then go back to the Notebook and re-run the code.

  5. Run the following code:

    You should then get a result like the following:
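A paragraph for step 5 might look like the following minimal sketch. It assumes that the Spark Cassandra Connector was loaded in step 4 and that the cluster runs Cassandra 3.x, where the built-in system_schema.keyspaces table exists. The table and the variable name are chosen only for illustration.

```scala
// Zeppelin Spark paragraph (Scala). Assumes the Spark Cassandra Connector
// is already on the interpreter classpath (loaded in step 4).
import com.datastax.spark.connector._

// `sc` is the SparkContext that Zeppelin provides to Spark paragraphs.
// system_schema.keyspaces is a built-in Cassandra 3.x table, used here
// only as an illustration because it needs no test data.
val keyspaces = sc.cassandraTable("system_schema", "keyspaces")

// count reads the table through Spark and returns the number of rows.
println("Row count: " + keyspaces.count)
```

If the connector is missing from the classpath, the import fails to compile, which is a quick way to confirm that step 4 worked.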

Using Spark SQL from Zeppelin Notebook

  1. In the same Notebook, add a new paragraph, then write and run the following code.
  2. You should then get a result like the following:

    If you try to run the above code in a new Notebook, you have to load the dependencies in the new Notebook first.
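As a rough sketch of querying Cassandra through Spark SQL, the paragraph below registers a Cassandra table as a temporary view and runs a SQL statement against it. It assumes Zeppelin's Spark 2.x interpreter, which exposes a SparkSession as `spark`, and that the Spark Cassandra Connector is loaded in the Notebook. The system_schema.keyspaces table, the view name and the selected columns are illustrative choices.

```scala
// Zeppelin Spark paragraph (Scala). Assumes `spark` (SparkSession) is provided
// by Zeppelin and the Spark Cassandra Connector is on the classpath.
val keyspacesDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "system_schema", "table" -> "keyspaces"))
  .load()

// Register the DataFrame under a name that SQL statements can refer to.
keyspacesDf.createOrReplaceTempView("keyspaces")

// Query the Cassandra-backed view with plain Spark SQL and print the rows.
spark.sql("SELECT keyspace_name, durable_writes FROM keyspaces").show()
```

Once the view is registered, a %sql paragraph in the same Notebook should also be able to query it, since Zeppelin shares the Spark session between the two interpreters.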

Using CQL from Zeppelin Notebook

Zeppelin can also be used to connect directly to Cassandra to execute CQL commands.

  1. Create a new Notebook.
  2. Put the following code into your Notebook and run it.

    You should then get a result like the following:
