• Technical
Behind The Scenes

Spoiler alert! Kubrick’s scientific consultant Frederick Ordway once revealed that Kubrick had the props for the film destroyed because he didn’t want to ruin the illusion of 2001 for people.  If you prefer to believe that 2001 was real, stop reading now, as behind-the-scenes photos did survive.

2001 pioneered lots of special effects! It was filmed before CGI and relied on a combination of full-size props, big models, “in camera” trick photography, and even rotating sets (for scenes inside the space station and spaceship Discovery).  

Kubrick had tons of sand imported, washed, and painted (!) for the moon-surface scenes. This was before the first real moon landing (or, if you believe online conspiracy theories, Kubrick was employed by NASA to fake the footage for the moon landings! That rumour actually originated on alt.humor.best-of-usenet, ha ha).

The MLlib example in the previous blog, Fourth Contact with a Monolith, assumed that the data was already in a wide table format (all features as columns, a label for the category to learn, and each row a labelled example). Getting it into that shape requires some extra code (see below).

First (1), read in the rolled-up metrics data for 5-minute periods from Cassandra. Each row has host, bucket_time, and time columns, one service (metric) per row with its min, avg, and max values, and a state column (which may be null). This is the Cassandra schema:
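As a rough sketch, a table holding this data might be declared as follows (the keyspace, table name, and types are my assumptions; only the column names come from the description above):

```sql
-- Hypothetical reconstruction; names and types are guesses,
-- based only on the columns described in the text.
CREATE TABLE metrics.rollup_5min (
    host text,
    bucket_time timestamp,  -- start of the 5-minute bucket
    time timestamp,
    service text,           -- one metric per row
    min double,
    avg double,
    max double,
    state text,             -- may be null
    PRIMARY KEY ((host, bucket_time), time, service)
);
```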

(2) Take a subsample of the data (for a specified time window, to ensure that all the metrics are represented for the period) to reduce it to something we can handle easily; in theory the approach scales with more data, but it's better to start with a smaller data set. Then (3) clean the data: remove rows of dubious value, remove any rows that have state values, and drop the state column. This significantly reduces the number of columns (features) we have to work with in the final ML step. The DataFrame na() function gives access to the DataFrameNaFunctions; here's the documentation:
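To make step (3) concrete, here is a plain-Python sketch of the cleaning logic. The real job would use DataFrame filters and the na()/drop() methods; the dict-shaped rows here are just a stand-in for Cassandra rows:

```python
def clean(rows):
    """Step (3): drop rows that have a state value, and drop the
    state column from the rest. Rows are dicts standing in for
    Cassandra rows: host, time, service, min/avg/max, optional state."""
    cleaned = []
    for row in rows:
        if row.get("state") is not None:
            continue  # rows with any state value are removed
        r = dict(row)
        r.pop("state", None)  # drop the state column entirely
        cleaned.append(r)
    return cleaned

rows = [
    {"host": "h1", "time": 0, "service": "cpu", "avg": 0.5, "state": None},
    {"host": "h1", "time": 0, "service": "disk", "avg": 0.9, "state": "FAIL"},
]
print(clean(rows))  # [{'host': 'h1', 'time': 0, 'service': 'cpu', 'avg': 0.5}]
```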

https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html

https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrameNaFunctions.html

The pivot (4) is where the magic of converting the row-oriented data into column-oriented wide data happens. There will be one row per unique host+time combination (groupBy), and one column for each service_min/avg/max, i.e. three columns per service (via the pivot and agg methods). For convenience, we also convert the time Timestamp into a Long value.
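The pivot logic, simulated in plain Python on dict-shaped rows (a sketch of what the groupBy/pivot/agg combination produces, not the Spark API itself):

```python
from collections import defaultdict

def pivot(rows):
    """Step (4): turn one-service-per-row data into a wide table with
    one row per host+time and one column per service_min/avg/max."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["host"], row["time"])  # groupBy host + time
        for stat in ("min", "avg", "max"):
            wide[key][f'{row["service"]}_{stat}'] = row[stat]
    return dict(wide)

rows = [
    {"host": "h1", "time": 0, "service": "cpu", "min": 0.1, "avg": 0.5, "max": 0.9},
    {"host": "h1", "time": 0, "service": "mem", "min": 0.2, "avg": 0.4, "max": 0.6},
]
print(pivot(rows))  # one wide row for ('h1', 0) with 6 feature columns
```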

A function hasLongAsSLA(Row) (5) returns 1.0 if the example has a long SLA, and 0.0 otherwise. The collect.foreach (6) is a hack to iterate through all rows in the pivot table and populate a map with host_time keys and label values computed by hasLongAsSLA(). Note that this is not ideal and won't scale, as collect() pulls all the rows back to a single node. I've tried a few alternatives, some of which should work better in theory, but in practice they all still have unresolved issues.
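Steps (5) and (6) can be sketched in plain Python. Note that the actual SLA metric and threshold aren't given in this post, so the "sla_avg" column and 1000.0 cutoff below are placeholder assumptions:

```python
def has_long_sla(cols, threshold=1000.0):
    """Sketch of hasLongAsSLA (5): 1.0 if this example's SLA metric
    exceeds a threshold, else 0.0. 'sla_avg' and 1000.0 are placeholder
    assumptions; the real metric and cutoff aren't given here."""
    return 1.0 if cols.get("sla_avg", 0.0) > threshold else 0.0

def build_label_map(wide):
    """Sketch of the collect.foreach hack (6): everything is pulled
    back to one node and a host_time -> label map is built."""
    return {f"{host}_{t}": has_long_sla(cols) for (host, t), cols in wide.items()}

wide = {("h1", 0): {"sla_avg": 1500.0}, ("h1", 300): {"sla_avg": 10.0}}
print(build_label_map(wide))  # {'h1_0': 1.0, 'h1_300': 0.0}
```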

The lookupMapNextBucket user-defined function (UDF) (7) uses the map to determine whether the next bucket after the current one has a long SLA or not (returning -1 on error). User Defined Functions are Column-based functions for use in withColumn().

A withColumn() function (8) is used to add (or replace) the “label” column value for each row using the lookupMapNextBucket() UDF. That is, each row (example) is labelled 1.0 if a long SLA occurred in the next example (5 minutes ahead), and 0.0 otherwise.
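Steps (7) and (8), again as a plain-Python sketch of the logic rather than the actual Spark UDF/withColumn code:

```python
BUCKET_SECONDS = 300  # the 5-minute bucket width from the text

def lookup_next_bucket(label_map, host, t):
    """Sketch of lookupMapNextBucket (7): the label of the NEXT bucket
    for this host, or -1.0 if that bucket isn't in the map."""
    return label_map.get(f"{host}_{t + BUCKET_SECONDS}", -1.0)

def add_labels(wide, label_map):
    """Sketch of step (8): give each row a 'label' column saying whether
    a long SLA occurred 5 minutes ahead."""
    return {
        (host, t): {**cols, "label": lookup_next_bucket(label_map, host, t)}
        for (host, t), cols in wide.items()
    }

label_map = {"h1_0": 0.0, "h1_300": 1.0}
wide = {("h1", 0): {"cpu_avg": 0.5}}
print(add_labels(wide, label_map))  # {('h1', 0): {'cpu_avg': 0.5, 'label': 1.0}}
```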

This final labelled pivot table can either be saved to Cassandra for use as input data for the MLlib example in the previous blog, or just used directly (9).

Pre-processing CODE
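As a single end-to-end sketch of the pre-processing logic, with plain Python standing in for the Spark job (the "sla_avg" column and 1000.0 threshold are placeholder assumptions, as above):

```python
from collections import defaultdict

BUCKET_SECONDS = 300    # 5-minute buckets
SLA_THRESHOLD = 1000.0  # placeholder: the real cutoff isn't given here

def preprocess(rows):
    """End-to-end sketch of steps (3)-(8) on plain-Python rows
    (dicts with host, time, service, min, avg, max, state)."""
    # (3) clean: drop rows with a state value, and drop the state column
    clean = [{k: v for k, v in r.items() if k != "state"}
             for r in rows if r.get("state") is None]

    # (4) pivot: one row per host+time, one column per service_min/avg/max
    wide = defaultdict(dict)
    for r in clean:
        for stat in ("min", "avg", "max"):
            wide[(r["host"], r["time"])][f'{r["service"]}_{stat}'] = r[stat]

    # (5)+(6) label map: host_time -> 1.0 if that bucket had a long SLA
    label_map = {
        f"{h}_{t}": 1.0 if cols.get("sla_avg", 0.0) > SLA_THRESHOLD else 0.0
        for (h, t), cols in wide.items()
    }

    # (7)+(8) label each row from the NEXT bucket (-1.0 if it's missing)
    return {
        (h, t): {**cols,
                 "label": label_map.get(f"{h}_{t + BUCKET_SECONDS}", -1.0)}
        for (h, t), cols in wide.items()
    }

rows = [
    {"host": "h1", "time": 0, "service": "sla",
     "min": 1.0, "avg": 10.0, "max": 20.0, "state": None},
    {"host": "h1", "time": 300, "service": "sla",
     "min": 1.0, "avg": 2000.0, "max": 3000.0, "state": None},
    {"host": "h1", "time": 300, "service": "cpu",
     "min": 0.0, "avg": 0.5, "max": 1.0, "state": "FAIL"},
]
print(preprocess(rows))
```

The (h1, 0) row gets label 1.0 because the following bucket's SLA metric is long; the (h1, 300) row gets -1.0 because there is no bucket after it in the sample.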

A final spoiler. We’ve had a few contacts with a Monolith, but behind the scenes it was just a painted wooden block that required lots of careful treatment:

The alien slab became a tall, black monolith crafted from wood and painted black. Harry Lange, [one of the film’s three production designers], remembered a graphite mix being added to the paint in order to lend a particularly smooth sheen. Touching the immaculate surface on set with greasy fingers was forbidden. Between scenes, the 12-foot-high monolith was swaddled in thick layers of plastic sheeting. According to [Tony Masters, another production designer], “Keeping it clean was a nightmare. For one reason or another, it attracted dust. You’d put it on stage, and—foomp! It would be covered in dust. You’d think, ‘Oh, Christ! I wonder if Stanley will see that?’ And if anyone put their hand on it—‘Stop the shooting! Back to the paint shop for a respray!’ It was unbelievable, what went on to protect that thing.” In fact, several monoliths were constructed, because the heat from the lights tended to warp the wood or blister the paint, ruining their supposed alien perfection….


A final word from Kubrick, who was quoted as saying: “If anyone understands it [2001] on the first viewing, we’ve failed…”

https://en.wikiquote.org/wiki/2001:_A_Space_Odyssey_(film)

Kubrick is also quoted as saying: “I don’t like doing interviews. There is always the problem of being misquoted or, what’s even worse, of being quoted exactly.”

Cassandra and Spark have a lot of real (not stage prop!) complexity, power and usefulness.

So, to quote Kubrick inexactly: If anyone understands Cassandra and Spark on the first “viewing”, we’ve failed…