This article provides a step-by-step example of using Apache Spark MLlib to do linear regression, illustrating some more advanced concepts of using Spark and Cassandra together. We will start by fetching real data from an external source and then work through a practical machine learning exercise. The data structure we use in this article is the Spark DataFrame. It lets users run SQL/DataFrame queries on the dataset, which makes it more flexible than an RDD. The programming environment for this example is Zeppelin and the programming language is Scala. We assume that Zeppelin and a cluster have been set up and provisioned properly as shown in our previous tutorials, and that cqlsh is installed correctly in your local environment: “Getting started with Instaclustr Spark & Cassandra”, “Using Apache Zeppelin with Instaclustr Spark & Cassandra Tutorial” and “Connecting to Instaclustr Using Cqlsh”.
Find a suitable dataset
In real life, when we want to buy a good CPU, we want to check how well it performs so that we can make an optimal decision among the different choices. The real dataset (cpu-performance) we use comes from the UCI Machine Learning Repository. Its attributes are explained as follows:
vendor_name: vendor of the machine (string)
model_name: model name (string)
MYCT: machine cycle time in nanoseconds (integer)
MMIN: minimum main memory in kilobytes (integer)
MMAX: maximum main memory in kilobytes (integer)
CACH: cache memory in kilobytes (integer)
CHMIN: minimum channels in units (integer)
CHMAX: maximum channels in units (integer)
PRP: published relative performance (integer)
ERP: estimated relative performance from the original article (integer)
In this article we use only a simple strategy for selecting and normalising variables, so our estimated relative performance may not be very close to the original result. With regard to column 10 (ERP), Ein-Dor and Feldmesser estimated the relative performance using a linear regression model, and the result they generated is much closer to the original one. If you are interested in achieving a better result, you can refer to their paper.
The first thing we need to do is to prepare the data that we want to use from the UCI repository. You can go here to see what this data looks like. In this example, we need to create a cpuperformance table that mirrors the columns of the dataset. In the cpuperformance table, we have eight features (vendor_name, model_name, myct, mmin, mmax, cach, chmin, chmax) that we will consider using later.
Create a new table
Open the Zeppelin page and copy the following code into a notebook.
Cassandra imposes two constraints here. First, every table needs a primary key, so we create the table with an additional uuid column that serves as the primary key. Second, Cassandra reorders a table's columns into alphabetical order. Since we retrieve the data directly from the network and use the Cassandra COPY command to insert it, the data source keeps its original column order, and we do not want a reordered table. The workaround is to prefix each column name with an extra character chosen so that alphabetical order matches the original column order.
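A minimal sketch of such a table definition, assuming a keyspace named machine_learning and the letter-prefix workaround described above (the keyspace name and some prefixed column names are assumptions; emmin, fmmax, gcach, ichmax and jprp are the names referenced later in this article):

```sql
CREATE TABLE machine_learning.cpuperformance (
    auuid uuid PRIMARY KEY,  -- surrogate key; the "a" prefix keeps it first alphabetically
    bvendor_name text,
    cmodel_name text,
    dmyct int,
    emmin int,
    fmmax int,
    gcach int,
    hchmin int,
    ichmax int,
    jprp int,
    kerp int
);
```

With this naming, the alphabetical order Cassandra imposes is exactly the order of the columns in the source CSV, so COPY can map them positionally.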
Import the data from UCI data repository
Open the terminal and go to the folder where cqlsh is installed: cd apache-cassandra-3.11.0/bin
Check that you are running the latest version of cqlsh:
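The version check can be run from the bin directory (this assumes a tarball installation of Apache Cassandra 3.11.0; adjust the path to your setup):

```shell
cd apache-cassandra-3.11.0/bin
./cqlsh --version
```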
The result should be:
Go to the Instaclustr console and check the public IP of your cluster.
Replace <PUBLIC IP> with your cluster's real public IP, provide your username and password, and execute the following command (this command will work on Mac & Linux, but probably not on Windows):
This command accomplishes three things:
1. Creates a new uuid column and generates a UUID for each row automatically.
2. Combines this uuid column with the other columns.
3. Imports the new dataset into the Cassandra cpuperformance table.
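One way to implement these three steps on Mac/Linux is to fetch the CSV, prepend a generated UUID to each row, and feed the result to cqlsh's COPY command. This is a hedged sketch, not the article's exact command: the URL is the UCI location of machine.data, and the keyspace and prefixed column names follow the schema assumed earlier in this article.

```shell
# 1. Download the dataset and prepend a generated UUID to every row
curl -s "https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data" \
  | awk '{ cmd = "uuidgen"; cmd | getline id; close(cmd); print id "," $0 }' \
  > machine_with_uuid.csv

# 2 + 3. Combine and load the rows into the cpuperformance table
./cqlsh <PUBLIC IP> -u <USERNAME> -p <PASSWORD> \
  -e "COPY machine_learning.cpuperformance (auuid, bvendor_name, cmodel_name, dmyct, emmin, fmmax, gcach, hchmin, ichmax, jprp, kerp) FROM 'machine_with_uuid.csv';"
```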
Check that the data has been inserted correctly
Open the Zeppelin notebook and copy the following code:
The result should be as follows:
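As a sketch of what that check might look like in a Zeppelin paragraph (this assumes the spark-cassandra-connector is available and the keyspace is named machine_learning; adjust the names to your cluster):

```scala
// Read the cpuperformance table into a Spark DataFrame via the Cassandra connector
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "machine_learning", "table" -> "cpuperformance"))
  .load()

// Display the first rows to confirm the data was inserted correctly
df.show()
```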
Given this prepared dataset, we can now train and test a model that can then be used to predict the performance of new data instances – a typical regression problem.
Train and Test Model
In this section, we use the linear regression algorithm to train a linear model. Before training, we need to select suitable features, transform them into a form that Spark's linear model accepts, and split our dataset into a training set and a testing set. After training, we can print out a summary of the training result.
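The train/test split itself is a one-liner on a DataFrame; a sketch (the 80/20 ratio and the seed are illustrative assumptions, not the article's exact values):

```scala
// Split the DataFrame into training and testing sets (assumed 80/20 ratio)
val Array(trainingData, testData) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
```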
* If you type df.show() in the notebook, you can see the first 20 rows that have been inserted into df.
By default, you do not need to explicitly configure the connection settings in your code. If you wish to change the Cassandra connection configuration, you can go to the interpreter page and make the appropriate changes.
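If you prefer to set the connection in code rather than on the interpreter page, a hedged sketch using the standard spark-cassandra-connector settings (the placeholder values are yours to fill in):

```scala
// Explicitly point the Cassandra connector at the cluster (placeholder values)
spark.conf.set("spark.cassandra.connection.host", "<PUBLIC IP>")
spark.conf.set("spark.cassandra.auth.username", "<USERNAME>")
spark.conf.set("spark.cassandra.auth.password", "<PASSWORD>")
```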
Create features input column
Before training, we need to prepare the required input columns, labelCol and featuresCol. Their default names and types are “label” (Double) and “features” (Vector). This means that if the DataFrame already has a “label” column of type Double and a “features” column of type Vector, you do not need to change the columns before training; otherwise, you need to make the necessary changes. The required input columns and the output column are listed below:

labelCol: “label” (Double) – the label to predict
featuresCol: “features” (Vector) – the feature vector
predictionCol: “prediction” (Double) – the predicted label (output)
In this case, we need to select the proper features and assemble them into a Vector, and we also need to specify that “jprp” is our label column.
With regard to variable selection, we decided to use these four best features: “emmin”, “fmmax”, “gcach”, “ichmax”, according to wekaleamstudios’s research.
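Assembling those four columns into a single Vector is typically done with Spark's VectorAssembler; a sketch, assuming df holds the cpuperformance data:

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Combine the four selected feature columns into a single "features" vector
val assembler = new VectorAssembler()
  .setInputCols(Array("emmin", "fmmax", "gcach", "ichmax"))
  .setOutputCol("features")

val assembled = assembler.transform(df)
```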
Since we do not have columns named “label” or “features”, we need to set them explicitly in the code. The setLabelCol() and setFeaturesCol() methods identify the label column and the features column, and these two columns are used as the input during training.
In addition, when considering which type of regularization to use, note that the Lasso method also performs variable selection indirectly, so we set the elastic net parameter to 1.0 in order to activate Lasso regularization.
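Putting this together, a hedged sketch of the estimator configuration (the normalizer stage and the regParam value are illustrative assumptions, and trainingData is assumed to be a DataFrame that already contains the assembled “features” column and the “jprp” label):

```scala
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.regression.LinearRegression

// Normalize the assembled feature vector, producing the "normFeatures" column
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")

// Configure linear regression: "jprp" is the label; elasticNetParam = 1.0 activates Lasso
val lr = new LinearRegression()
  .setLabelCol("jprp")
  .setFeaturesCol("normFeatures")
  .setElasticNetParam(1.0)
  .setRegParam(0.1)  // assumed regularization strength

// Fit the model on the normalized training data
val lrModel = lr.fit(normalizer.transform(trainingData))
```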
Some explanation of the output: “features” is the original input feature column; “normFeatures” is the same column after normalisation; “jprp” is the label column; finally, “prediction” is the output column that contains the predicted result.
Print out the summary of the model
Finally, print the coefficients and intercept for the linear regression model:
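A sketch of that printout, following the standard MLlib pattern and assuming lrModel is the fitted LinearRegressionModel:

```scala
// Print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// Summarize the model over the training set and print some metrics
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
```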