• Technical
Behind The Scenes

Spoiler alert! Kubrick’s scientific consultant Frederick Ordway once revealed that Kubrick had the props for the film destroyed because he didn’t want to ruin the illusion of 2001 for people.  If you prefer to believe that 2001 was real, stop reading now, as behind-the-scenes photos did survive.

2001 pioneered lots of special effects! It was filmed before CGI and relied on a combination of full-size props, big models, “in camera” trick photography, and even rotating sets (for scenes inside the space station and spaceship Discovery).  

Kubrick had tons of sand imported, washed, and painted (!) for the moon-surface scenes. This was before the first real moon landing (or, if you believe online conspiracy theories, Kubrick was employed by NASA to fake the footage for the moon landings! That rumour actually originated on alt.humor.best-of-usenet, ha ha).

The MLlib example in the previous blog, Fourth Contact with a Monolith, assumed that the data was already in a wide table format (all features as columns, a label for the category to learn, and each row a labelled example). Getting it into that shape requires some extra code (see below).

First (1), read in the rolled-up metrics data for 5-minute periods from Cassandra. Each row has host, bucket_time, and time columns, one service (metric) per row with its min, avg, and max values, and a state column (which may be null). This is the Cassandra schema:
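As a rough sketch, a table holding this data might be declared as follows (the keyspace, table name, and types are my assumptions; only the column names come from the description above):

```sql
-- Hypothetical reconstruction; names and types are guesses,
-- based only on the columns described in the text.
CREATE TABLE metrics.rollup_5min (
    host text,
    bucket_time timestamp,  -- start of the 5-minute bucket
    time timestamp,
    service text,           -- one metric per row
    min double,
    avg double,
    max double,
    state text,             -- may be null
    PRIMARY KEY ((host, bucket_time), time, service)
);
```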

(2) Take a subsample of the data (for a specified time window, to ensure that all the metrics are represented for the period) to reduce it to something we can handle easily; in theory the approach scales with more data, but it's better to start with a smaller data set. Then (3) clean the data: remove rows of dubious value, remove any rows that have state values, and drop the state column. This significantly reduces the number of columns (features) we have to work with in the final ML step. The DataFrame na() function gives access to the DataFrameNaFunctions; here's the documentation:
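To make step (3) concrete, here is a plain-Python sketch of the cleaning logic. The real job would use DataFrame filters and the na()/drop() methods; the dict-shaped rows here are just a stand-in for Cassandra rows:

```python
def clean(rows):
    """Step (3): drop rows that have a state value, and drop the
    state column from the rest. Rows are dicts standing in for
    Cassandra rows: host, time, service, min/avg/max, optional state."""
    cleaned = []
    for row in rows:
        if row.get("state") is not None:
            continue  # rows with any state value are removed
        r = dict(row)
        r.pop("state", None)  # drop the state column entirely
        cleaned.append(r)
    return cleaned

rows = [
    {"host": "h1", "time": 0, "service": "cpu", "avg": 0.5, "state": None},
    {"host": "h1", "time": 0, "service": "disk", "avg": 0.9, "state": "FAIL"},
]
print(clean(rows))  # [{'host': 'h1', 'time': 0, 'service': 'cpu', 'avg': 0.5}]
```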

https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html

https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrameNaFunctions.html

The pivot (4) is where the magic of converting the row-oriented data into column-oriented wide data happens. There will be one row per unique host+time combination (groupBy), and one column for each service_min/avg/max, i.e. three columns per service (via the pivot and agg methods). For convenience, we also convert the time Timestamp into a Long value.
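The pivot logic, simulated in plain Python on dict-shaped rows (a sketch of what the groupBy/pivot/agg combination produces, not the Spark API itself):

```python
from collections import defaultdict

def pivot(rows):
    """Step (4): turn one-service-per-row data into a wide table with
    one row per host+time and one column per service_min/avg/max."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["host"], row["time"])  # groupBy host + time
        for stat in ("min", "avg", "max"):
            wide[key][f'{row["service"]}_{stat}'] = row[stat]
    return dict(wide)

rows = [
    {"host": "h1", "time": 0, "service": "cpu", "min": 0.1, "avg": 0.5, "max": 0.9},
    {"host": "h1", "time": 0, "service": "mem", "min": 0.2, "avg": 0.4, "max": 0.6},
]
print(pivot(rows))  # one wide row for ('h1', 0) with 6 feature columns
```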

A function hasLongAsSLA(Row) (5) returns 1.0 if the example has a long SLA, and 0.0 otherwise. The collect.foreach (6) is a hack to iterate through all rows in the pivot table and populate a map with host_time keys and label values computed by hasLongAsSLA(). Note that this is not ideal and won't scale, as collect() pulls all the rows back to a single node. I've tried a few alternatives, some of which should work better in theory, but in practice they all still have unresolved issues.
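Steps (5) and (6) can be sketched in plain Python. Note that the actual SLA metric and threshold aren't given in this post, so the "sla_avg" column and 1000.0 cutoff below are placeholder assumptions:

```python
def has_long_sla(cols, threshold=1000.0):
    """Sketch of hasLongAsSLA (5): 1.0 if this example's SLA metric
    exceeds a threshold, else 0.0. 'sla_avg' and 1000.0 are placeholder
    assumptions; the real metric and cutoff aren't given here."""
    return 1.0 if cols.get("sla_avg", 0.0) > threshold else 0.0

def build_label_map(wide):
    """Sketch of the collect.foreach hack (6): everything is pulled
    back to one node and a host_time -> label map is built."""
    return {f"{host}_{t}": has_long_sla(cols) for (host, t), cols in wide.items()}

wide = {("h1", 0): {"sla_avg": 1500.0}, ("h1", 300): {"sla_avg": 10.0}}
print(build_label_map(wide))  # {'h1_0': 1.0, 'h1_300': 0.0}
```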

The lookupMapNextBucket user-defined function (UDF) (7) uses the map to determine whether the next bucket after the current one has a long SLA or not (returning -1 on error). User Defined Functions are Column-based functions for use in withColumn().

A withColumn() function (8) is used to add (or replace) the “label” column value for each row using the lookupMapNextBucket() UDF. That is, each row (example) is labelled 1.0 if a long SLA occurred in the next example (5 minutes ahead), and 0.0 otherwise.
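Steps (7) and (8), again as a plain-Python sketch of the logic rather than the actual Spark UDF/withColumn code:

```python
BUCKET_SECONDS = 300  # the 5-minute bucket width from the text

def lookup_next_bucket(label_map, host, t):
    """Sketch of lookupMapNextBucket (7): the label of the NEXT bucket
    for this host, or -1.0 if that bucket isn't in the map."""
    return label_map.get(f"{host}_{t + BUCKET_SECONDS}", -1.0)

def add_labels(wide, label_map):
    """Sketch of step (8): give each row a 'label' column saying whether
    a long SLA occurred 5 minutes ahead."""
    return {
        (host, t): {**cols, "label": lookup_next_bucket(label_map, host, t)}
        for (host, t), cols in wide.items()
    }

label_map = {"h1_0": 0.0, "h1_300": 1.0}
wide = {("h1", 0): {"cpu_avg": 0.5}}
print(add_labels(wide, label_map))  # {('h1', 0): {'cpu_avg': 0.5, 'label': 1.0}}
```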

This final labelled pivot table can either be saved to Cassandra for use as input data for the MLlib example in the previous blog, or just used directly (9).

Pre-processing CODE
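As a single end-to-end sketch of the pre-processing logic, with plain Python standing in for the Spark job (the "sla_avg" column and 1000.0 threshold are placeholder assumptions, as above):

```python
from collections import defaultdict

BUCKET_SECONDS = 300    # 5-minute buckets
SLA_THRESHOLD = 1000.0  # placeholder: the real cutoff isn't given here

def preprocess(rows):
    """End-to-end sketch of steps (3)-(8) on plain-Python rows
    (dicts with host, time, service, min, avg, max, state)."""
    # (3) clean: drop rows with a state value, and drop the state column
    clean = [{k: v for k, v in r.items() if k != "state"}
             for r in rows if r.get("state") is None]

    # (4) pivot: one row per host+time, one column per service_min/avg/max
    wide = defaultdict(dict)
    for r in clean:
        for stat in ("min", "avg", "max"):
            wide[(r["host"], r["time"])][f'{r["service"]}_{stat}'] = r[stat]

    # (5)+(6) label map: host_time -> 1.0 if that bucket had a long SLA
    label_map = {
        f"{h}_{t}": 1.0 if cols.get("sla_avg", 0.0) > SLA_THRESHOLD else 0.0
        for (h, t), cols in wide.items()
    }

    # (7)+(8) label each row from the NEXT bucket (-1.0 if it's missing)
    return {
        (h, t): {**cols,
                 "label": label_map.get(f"{h}_{t + BUCKET_SECONDS}", -1.0)}
        for (h, t), cols in wide.items()
    }

rows = [
    {"host": "h1", "time": 0, "service": "sla",
     "min": 1.0, "avg": 10.0, "max": 20.0, "state": None},
    {"host": "h1", "time": 300, "service": "sla",
     "min": 1.0, "avg": 2000.0, "max": 3000.0, "state": None},
    {"host": "h1", "time": 300, "service": "cpu",
     "min": 0.0, "avg": 0.5, "max": 1.0, "state": "FAIL"},
]
print(preprocess(rows))
```

The (h1, 0) row gets label 1.0 because the following bucket's SLA metric is long; the (h1, 300) row gets -1.0 because there is no bucket after it in the sample.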

A final spoiler. We’ve had a few contacts with a Monolith, but behind the scenes it was just a painted wooden block that required lots of careful treatment:

The alien slab became a tall, black monolith crafted from wood and painted black. Harry Lange, [one of the film’s three production designers], remembered a graphite mix being added to the paint in order to lend a particularly smooth sheen. Touching the immaculate surface on set with greasy fingers was forbidden. Between scenes, the 12-foot-high monolith was swaddled in thick layers of plastic sheeting. According to [Tony Masters, another production designer], “Keeping it clean was a nightmare. For one reason or another, it attracted dust. You’d put it on stage, and—foomp! It would be covered in dust. You’d think, ‘Oh, Christ! I wonder if Stanley will see that?’ And if anyone put their hand on it—‘Stop the shooting! Back to the paint shop for a respray!’ It was unbelievable, what went on to protect that thing.” In fact, several monoliths were constructed, because the heat from the lights tended to warp the wood or blister the paint, ruining their supposed alien perfection….


A final word from Kubrick, who was quoted as saying: “If anyone understands it [2001] on the first viewing, we’ve failed…”

https://en.wikiquote.org/wiki/2001:_A_Space_Odyssey_(film)

Kubrick is also quoted as saying: “I don’t like doing interviews. There is always the problem of being misquoted or, what’s even worse, of being quoted exactly.”

Cassandra and Spark have a lot of real (not stage prop!) complexity, power and usefulness.

So, to quote Kubrick inexactly: If anyone understands Cassandra and Spark on the first “viewing”, we’ve failed…