Using Spark to Sample Data from One Cassandra Cluster and Write to Another

This tutorial describes how to use Apache Spark and Zeppelin, running as part of an Instaclustr-managed cluster, to extract and sample data from one cluster and write it to another cluster.

Prerequisites

  1. At least two clusters running in Instaclustr. In this tutorial, the cluster from which we read data is called the “source cluster” and the cluster to which we write the data is called the “target cluster”.
  2. The target cluster is provisioned with Zeppelin and Spark.
  3. The keyspace of the target table must be identical to that of the source table (table names can be different).

Configure Network Access

Because Spark in your target cluster needs to connect to your source cluster to read data, the public IP addresses of the nodes in your target cluster must be added to the “Cassandra Allowed Addresses” of your source cluster. The detailed steps are as follows:

  • Open your source cluster dashboard page.
  • Click the “Settings” panel.
  • Add the public IP addresses of your target cluster nodes to “Cassandra Allowed Addresses”.
  • Click “Save Cluster Settings”.

Create Table Definition on Source Cassandra Cluster and Target Cassandra Cluster

  1. Check the public IP address of your source cluster node.
  2. Open a terminal.
  3. Make sure cqlsh is installed on your system.
  4. Execute:
  5. Change to instaclustr keyspace:
  6. Create a table called “users”:
  7. Insert test data:
  8. Execute “quit” to exit the Cassandra environment.
  9. Check the public IP addresses of your target cluster nodes.
  10. Execute:
  11. Change to instaclustr keyspace:
  12. Create a table called “users”:
  13. Execute “select” CQL command:

    The result should be empty.                                                         

  14. Keep this terminal window open; you will return to it after sampling and loading the data.

Sample and Load Data

  1. Open the dashboard page of your target cluster.
  2. Open the “Details” panel and click the “Zeppelin” button. The Zeppelin web page opens in your web browser.
  3. Create a new notebook by clicking the “Notebook” button on the home page of Zeppelin.
  4. Put the following code in the first paragraph to load dependencies.
  5. Use the following Spark code in the next paragraph to sample the source data and write it to the target cluster:

    For a large dataset, extracting the whole dataset into Spark and then sampling it there is very time consuming. To make this more efficient, the example samples the partition keys first and then joins the sampled keys with the source table, which avoids pulling the complete dataset down to Spark.

  6. Check the result on the target Cassandra cluster.
    1. Go back to the terminal environment.
    2. Execute the “select” CQL command again:

      The result should now contain the rows sampled from the source table.
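
The partition-key sampling idea described in step 5 can be illustrated outside Spark. The following is a minimal Python sketch, not the tutorial's Spark code: the in-memory `SOURCE_TABLE`, the `users`-like row shape, and all function names are hypothetical stand-ins. It shows the two phases: first sample only the keys, then read only the partitions whose keys were sampled.

```python
import random

# Hypothetical in-memory stand-in for the source table:
# partition key -> list of rows stored in that partition.
SOURCE_TABLE = {
    user_id: [{"id": user_id, "name": f"user-{user_id}"}]
    for user_id in range(1000)
}


def sample_partition_keys(keys, fraction, seed=None):
    """Keep each partition key independently with probability `fraction`."""
    rng = random.Random(seed)
    return [key for key in keys if rng.random() < fraction]


def fetch_partitions(table, keys):
    """Read only the partitions whose keys were sampled (the 'join' step)."""
    for key in keys:
        yield from table.get(key, [])


def sample_and_copy(source, target, fraction, seed=None):
    """Sample keys first, then copy only the matching rows into `target`."""
    sampled_keys = sample_partition_keys(source.keys(), fraction, seed)
    for row in fetch_partitions(source, sampled_keys):
        target.append(row)
    return sampled_keys


target_table = []  # hypothetical stand-in for the target table
keys = sample_and_copy(SOURCE_TABLE, target_table, fraction=0.1, seed=42)
print(f"sampled {len(keys)} keys, copied {len(target_table)} rows")
```

Only the key column is scanned in full; the remaining columns are read just for the sampled partitions. With the Spark Cassandra Connector, the lookup step would typically correspond to a join against the Cassandra table (for example `joinWithCassandraTable`), though the exact code used in your notebook may differ.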

SSL Connection

If encryption is enabled in your source cluster, you will need to contact support@instaclustr.com to have the truststore file of the source cluster loaded onto your target cluster. In addition, the Spark context should be configured using the following code:
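
As a rough sketch of what such a configuration involves, the Python snippet below collects the SSL-related connection properties of the Spark Cassandra Connector into a dictionary. The helper name `cassandra_ssl_conf`, the host address, the truststore path, and the password are all placeholder assumptions; use the values supplied for your own cluster.

```python
def cassandra_ssl_conf(host, truststore_path, truststore_password):
    """Build Spark properties for an SSL-encrypted Cassandra connection."""
    return {
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.ssl.enabled": "true",
        "spark.cassandra.connection.ssl.trustStore.path": truststore_path,
        "spark.cassandra.connection.ssl.trustStore.password": truststore_password,
    }


conf = cassandra_ssl_conf(
    host="203.0.113.10",                        # placeholder source node address
    truststore_path="/path/to/truststore.jks",  # placeholder path
    truststore_password="changeit",             # placeholder password
)

# Print the property names only, so the password is not echoed.
for key in sorted(conf):
    print(key)
```

These key-value pairs would then be applied to the Spark configuration before the context connects to the source cluster, for example through `SparkConf.setAll` in PySpark.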
