Paul Brebner

Technology Evangelist

Since learning to program on a VAX 11/780, Paul has gained extensive R&D and consulting experience in distributed systems, technology innovation, software architecture and engineering, software performance and scalability, grid and cloud computing, and data analytics and machine learning.

Paul is the Technology Evangelist at Instaclustr. He has been learning new scalable technologies, solving realistic problems, building applications, and blogging about Apache Cassandra, Spark, Zeppelin, and Kafka.

Paul has worked at UNSW, several tech start-ups, CSIRO, UCL (UK), and NICTA. Paul has helped pre-empt and solve significant software architecture and performance problems for clients including Defence and NBN Co. Paul has an MSc in Machine Learning and a BSc (Computer Science and Philosophy).

Paul's Articles

Kafka Connect and Elasticsearch vs. PostgreSQL Pipelines: Final Performance Results (Pipeline Series Part 9)

Monday 18th October 2021

In Part 6 and Part 7 of the pipeline series we took a different path in the pipe/tunnel and explored PostgreSQL and Apache Superset, mainly from a functional perspective—how can you get JSON data into PostgreSQL from Kafka Connect, and what does it look like in Superset. In Part 8, we ran some initial load tests and found out how the capacity of the original Elasticsearch pipeline compared with the PostgreSQL variant. These results were surprising (PostgreSQL 41,000 inserts/s vs. Elasticsearch 1,800 inserts/s), so in true “MythBusters” style we had another attempt to make them more comparable. 

Kafka Connect and Elasticsearch vs. PostgreSQL Pipelines: Initial Performance Results (Pipeline Series Part 8)

Wednesday 13th October 2021

In Part 6 and Part 7 of the pipeline series we took a different path in the pipe/tunnel and explored PostgreSQL and Apache Superset, mainly from a functional perspective—how can you get JSON data into PostgreSQL from Kafka Connect, and what does it look like in Superset. In this blog, we run some initial load tests and find out how the capacity of the original Elasticsearch pipeline compares with the PostgreSQL variant.

The PostgreSQL Boolean Three-Valued Logic Data Type

Tuesday 1st June 2021

In my previous PostgreSQL blog, we discovered what data types are available in PostgreSQL (a lot) and hopefully determined the definitive mapping from PostgreSQL to SQL/JDBC to Java data types. However, even armed with this information you have to be careful about type conversion/casting, and watch out for run-time errors, truncation, or loss of information.

PostgreSQL Data Types: Mappings to SQL, JDBC, and Java Data Types

Monday 24th May 2021

How do you use PostgreSQL from Java? With JDBC! (Java Database Connectivity). There’s a PostgreSQL JDBC Driver (PgJDBC for short) which allows Java programs to connect using standard, database independent, Java code. It’s an open source Pure Java (Type 4, which talks native PostgreSQL protocol) driver and is well documented.

Apache ZooKeeper Meets the Dining Philosophers

Sunday 9th May 2021

A ZooKeeper walks into a pub… (actually an Outback pub) The ZooKeeper notices a very rowdy crowd at a round table who appear to be fighting over forks, and she can’t avoid overhearing this conversation: Karl (Marx): “Ludwig, I am hungry please lend me a fork” Ludwig (Wittgenstein): “Karl, I don’t fully understand what you […]

Apache Kafka MirrorMaker 2 (MM2) Part 2: Practice

Thursday 18th March 2021

In part 1 of this blog series, we focused on MirrorMaker 2 theory (Kafka replication, architecture, components and terminology) and invented some MirrorMaker 2 rules. In this part, we will be more practical, and try out Instaclustr’s managed MirrorMaker 2 service and test the rules out with some experiments.

Apache Kafka MirrorMaker 2 (MM2) Part 1: Theory

Tuesday 16th March 2021

In this new two-part blog series we’ll turn our gaze to the newest version of MirrorMaker 2 (MM2), the Apache Kafka cross-cluster mirroring, or replication, technology. MirrorMaker 2 is built on top of the Kafka Connect framework for increased reliability and scalability and is suitable for more demanding geo-replication use cases including migration, backup, disaster […]

Scaling Kafka Connect Streaming Data Processing (Pipeline Series Part 5)

Monday 25th January 2021

In Part 4 of this blog series, we started exploring Kafka Connector task scalability by configuring a new scalable load generator for our real-time streaming data pipeline, discovering relevant metrics, and configuring Prometheus and Grafana monitoring. We are now ready to increase the load and scale the number of Kafka Connector tasks and demonstrate the […]

Getting to Know Apache Camel Kafka Connectors (Pipeline Series Part 3)

Thursday 17th December 2020

In Part 1 and Part 2 of this blog series we started a journey building a real-time pipeline to acquire, ingest, graph, and map public tidal data using Apache Kafka, Kafka Connect, Elasticsearch, and Kibana. In this blog, we resume that journey and take an Apache “Camel” (Kafka Connector) through the desert (or the Australian […]

Kafka Technology Advances: Kafka Summit 2020

Thursday 8th October 2020

The annual Kafka Summit 2020 went ahead this year (August 24-25) with a lot of topics. In the previous blog, I examined talks 1-3 from the perspective of challenging Kafka Use Cases. In this blog, I’ll focus on some of the more interesting Kafka technology advances from the remaining talks that I watched. Here is […]

Building a Low-Latency Distributed Stock Broker Application: Part 4

Monday 5th October 2020

In the fourth blog of the “Around the World” series we built a prototype of the application, designed to run in two georegions. Recently I re-watched “Star Trek: The Motion Picture” (The original 1979 Star Trek Film). I’d forgotten how much like “2001: A Space Odyssey” the vibe was (a drawn out quest to […]

Redis Java Clients and Client-Side Caching

Thursday 24th September 2020

Achieving sub-millisecond Redis latency with Java clients and client-side caching In my previous Redis blog, we discovered what Redis really is! It’s an open-source in-memory data structures server. And we discovered how fast it is! For a 6 node Instaclustr Managed Redis cluster latencies are under 20ms and throughput is in the millions of “gets” […]

It’s an In-Memory Key-Value Store! It’s a Database! It’s Redis!

Wednesday 9th September 2020

Look! Up in the sky! It’s an in-memory key-value store! It’s a database! It’s Redis! Faster than a speeding database! More powerful than an in-memory key-value store!  Able to leap tall performance barriers at a single bound! “Yes, it’s Redis—strange visitor from another planet who came to Earth with powers and abilities far beyond those […]

Taking Elasticsearch for a Spin around the Race Track (Q&A): Part 3

Tuesday 14th July 2020

Then may I set the world on wheels, when she can spin for her living. (Two Gentlemen of Verona, III, 1) The weary sun hath made a golden set, And by the bright track of his fiery car, Gives signal, of a goodly day to-morrow.  (Richard III, V, 3) Thy burning car never had scorch’d […]

Taking Elasticsearch to the Mechanics: Under the Hood Q&A (Part 2)

Wednesday 8th July 2020

Marry, that’s a bountiful answer that fits all questions (All’s Well That Ends Well, II, 2) In Part 1 of this multi-part Elasticsearch Blog I revealed the most interesting things I learnt after taking Elasticsearch for my first “Test Drive”, including that Elasticsearch comes well equipped with some clever-sounding computational linguistics analysis tricks including Stemming, […]

Taking Elasticsearch for a “Test Drive”: The Basics and Inexact Matching

Thursday 25th June 2020

“Search out thy wit for secret policies, and we will make thee famous through the world.” [Henry VI Part I, III, 3] Earlier in 2020 I decided to learn about a new addition to Instaclustr’s managed open source platform, the scalable full text search technology Elasticsearch—using OpenDistro for Elasticsearch, an Apache 2.0-licensed distribution of Elasticsearch […]

Building a Low-Latency Distributed Stock Broker Application: Part 3

Friday 24th April 2020

In the third blog of the “Around the World” series focusing on globally distributed storage, streaming, and search, we build a Stock Broker Application. 1. Place Your Bets! The London Stock Exchange How did Phileas Fogg make his fortune? Around the World in Eighty Days describes Phileas Fogg in this way: “Was Phileas Fogg […]

An Introduction to Cassandra Multi-Data Centers: Part 2

Friday 3rd April 2020

In this second blog of the “Around the World in (Approximately) 8 Data Centers” series we catch our first mode of transportation (Cassandra) and explore how it works to get us started on our journey to multiple destinations (Data Centers). 1. What Is a (Cassandra) Data Center? What does a Data Center (DC) look like? Here […]

An Introduction to Cassandra Multi-Data Centers: Part 1

Monday 9th March 2020

Quick! Grab your top hat, passport, carpetbag stuffed with (mainly) cash, and your valet (if you have one), and join with me on a wild journey around the world in approximately 8 data centers—a new blog series to explore the world of globally distributed storage, streaming, and search with Instaclustr Managed Open Source Technologies such […]

The Power of Kafka Partitions: How to Get the Most out of Your Kafka Cluster

Monday 6th January 2020

This blog provides an overview of the two fundamental concepts in Apache Kafka: Topics and Partitions. While developing and scaling our Anomalia Machina application we have discovered that distributed applications using Apache Kafka and Cassandra clusters require careful tuning to achieve close to linear scalability, and critical variables included the number of Apache Kafka topics […]

Cassandra Elastic Auto-Scaling Using Instaclustr’s Dynamic Cluster Resizing

Tuesday 3rd December 2019

This is the third and final part of a mini-series looking at the Instaclustr Provisioning API, including the new Open Service Broker.  In the last blog, we demonstrated a complete end to end example using the Instaclustr Provisioning API, which included dynamic Cassandra cluster resizing.  This blog picks up where we left off and explores […]

ApacheCon Berlin, 22-24 October 2019

Monday 2nd December 2019

ApacheCon Europe, October 22-24, 2019, Kulturbrauerei Berlin #ACEU19 What’s better than one ApacheCon? Another ApacheCon! This year there were two Apache Conferences, one in Las Vegas and then again in Berlin. They were similar but different. What were some differences between ApacheCon Berlin and Las Vegas? The location. In contrast to the hyper-real gambling […]

Instaclustr Provisioning API Demonstration: A Complete End-to-End Example

Monday 23rd September 2019

Overview An end-to-end demonstration of Instaclustr’s Provisioning API for any use case involving automated programmatic cluster provisioning, configuration, discovery, and de-provisioning (or a subset of these operations). 1. Provisioning Provisioning: Supply with food, drink, or equipment, especially for a journey. Provisioning is all about ensuring you have sufficient quantity of provisions (food, drink, etc.) sufficiently […]

Instaclustr Open Service Broker – A Complete End-to-End Example

Wednesday 11th September 2019

Introduction Instaclustr has recently launched the Instaclustr Service Broker, an implementation of the Open Service Broker (OSB) API for Instaclustr managed services (Apache Cassandra, Spark, Zeppelin, and Kafka).   Over a series of blogs I plan to try it out using the following “bottom-up” approach:  get a complete end-to-end Kubernetes workflow working to test and demonstrate […]

Geospatial Anomaly Detection (Terra-Locus Anomalia Machina) Part 2: Geohashes (2D)

Tuesday 18th June 2019

Massively Scalable Geospatial Anomaly Detection with Apache Kafka and Cassandra In this blog, we continue exploring how to build a scalable Geospatial Anomaly Detector. In the previous blog, we introduced the problem and tried an initial Cassandra data model with locations based on latitude and longitude. We now try another approach, Geohashes, to start with, […]

Anomalia Machina 8: Production Application Deployment With Kubernetes

Tuesday 5th March 2019

In the previous blog we explored deploying the Anomalia Machina application on Kubernetes with the help of AWS EKS. In the recent blogs (Anomalia Machina 5 and Anomalia Machina 6), we enhanced the observability of the Anomalia Machina Application using two Open Source technologies: Prometheus for distributed monitoring of metrics such as throughput and latency; […]

Anomalia Machina 7: Kubernetes Cluster Creation and Application Deployment

Monday 11th February 2019

Kubernetes—Greek: κυβερνήτης = Helmsman If you are a Greek hero about to embark on an epic aquatic quest (encountering one-eyed rock-throwing monsters, unpleasant weather, a detour to the underworld, tempting sirens, angry gods, etc.) then having a trusty helmsman is mandatory (even though the helmsman survived the Cyclops, like all of Odysseus’s companions, he eventually came […]

Anomalia Machina 6: Application Tracing with OpenTracing—Massively Scalable Anomaly Detection with Apache Kafka and Cassandra

Tuesday 15th January 2019

In the previous blog (Anomalia Machina 5 – Application Monitoring with Prometheus) we explored how to better understand an Open Source system using Prometheus for distributed metrics monitoring. In this blog we have a look at another way of increasing visibility into a system using OpenTracing for distributed tracing. 1. A History of Tracing Over […]

Anomalia Machina 5: Application Monitoring with Prometheus—Massively Scalable Anomaly Detection with Apache Kafka and Cassandra

Wednesday 19th December 2018

1. Introduction In order to scale Anomalia Machina we plan to run the application (load generator and detector pipeline) on multiple EC2 instances. We are working on using Kubernetes (AWS EKS) to automate this, and progress so far is described in this webinar. However, before we can easily run a Kubernetes deployed application at scale […]

Anomalia Machina 1: Massively Scalable Anomaly Detection With Apache Kafka and Cassandra

Friday 28th September 2018

anomalia—Latin (1) irregularity, anomaly machina—Latin (1) machine, tool, (2) scheme, plan, machination What do you get if you combine Anomalia and Machina? Machine Anomaly—A broken machine (Machina Anomalia) Irregular Machinations—Too political (Anomalia Machina, 2nd definition) Anomaly Machine! (Anomalia Machina, 1st definition) Let’s Build the Anomalia Machina! A Steampunk Anomalia Machina—possibly, I actually have no clue […]

Apache Kafka “Kongo” 6.3: Production Kafka Application Scaling on Instaclustr

Thursday 2nd August 2018

The goal of this blog is to scale the Kongo IoT application on Production Instaclustr Kafka clusters. We’ll compare various approaches including scale-out, scale-up, and multiple clusters. There are two versions to the story. In the Blue Pill version scaling everything goes according to plan and scaling is easy. If you are interested in the […]

Apache Kafka “Kongo” 6.2: Production Kongo on Instaclustr

Friday 29th June 2018

In this blog (parts 6.1 and 6.2) we deploy the Kongo IoT application to a production Kafka cluster, using Instaclustr’s Managed Apache Kafka service on AWS. In part 6.1 we explored Kafka cluster creation and how to deploy the Kongo code. Then we revisited the design choices made previously regarding how to best handle the […]

Apache Kafka “Kongo” 6.1: Production Kongo on Instaclustr

Friday 29th June 2018

In this blog we deploy the Kongo IoT application to a production Kafka cluster, using Instaclustr’s Managed Apache Kafka service on AWS. We explore Kafka cluster creation and how to deploy the Kongo code. Then we revisit the design choices made previously regarding how to best handle the high consumer fan out of the Kongo […]

Apache Kafka “Kongo” 5.3: Kongo Streams Example

Wednesday 20th June 2018

Introduction In the previous blog we tried a simple Kafka Streams application for Cluedo. It relied on a KTable to count the number of people in each room. In this blog, we’ll extend this idea and develop a more complex streams application to keep track of the weight of goods in trucks for our Kongo […]

Kongo 5.2: Apache Kafka Streams Examples

Tuesday 29th May 2018

In this blog, we’ll look at some simple Apache Kafka Streams examples using the murder mystery game Cluedo as a simple problem domain. Dr Black has been murdered in the Billiard Room with a Candlestick! Whodunnit?! There are six suspects and a mansion with multiple rooms. The suspects are: Miss Scarlet Professor Plum Mrs Peacock […]

Kongo 5.1: Apache Kafka Streams Introduction

Tuesday 29th May 2018

Abstract Apache Kafka Streams is a framework for stream data processing. In this blog, we’ll introduce Kafka Streams concepts and take a look at one of the DSL operations, Joins, in more detail. In the next blog, we’ll have a look at some more complete Kafka Streams examples based on the murder mystery game Cluedo. […]

Apache Kafka Connect Architecture Overview

Wednesday 9th May 2018

Kafka Connect is an API and ecosystem of 3rd party connectors that enables Apache Kafka to be scalable, reliable, and easily integrated with other heterogeneous systems (such as Cassandra, Spark, and Elassandra) without having to write any extra code. This blog is an overview of Kafka Connect Architecture with a focus on the main Kafka […]

“Kongo” Part 3. Apache Kafka: Kafkafying Kongo—Serialization, One or Many topics, Event Order Matters

Thursday 26th April 2018

Kafkafying: the transformation of a primitive monolithic program into a sophisticated scalable low-latency distributed streaming application (c.f. “An epidemic of a zombifying virus ravaged the country”) Steps for Kafkafying Kongo In the previous blog (“Kongo” Part 2: Exploring Apache Kafka application architecture: Event Types and Loose Coupling)  we made a few changes to the original […]

“Kongo” Part 2: Exploring Apache Kafka application architecture: Event Types and Loose Coupling

Thursday 5th April 2018

This is the second post in our series exploring designing and developing an example IoT application with Apache Kafka to illustrate typical design and implementation considerations and patterns. In the previous blog, we introduced our Instaclustr “Kongo” IoT Logistics Streaming Demo Application. The code for Version 1 of the Kongo application was designed as an initial […]

“Kongo” Part 1 Apache Kafka: IoT Logistics Streaming Demo Application

Thursday 15th March 2018

What’s a good name to give a demo IoT streaming application dealing with large scale logistics? How about a river… Maybe The “Amazon” application? That’s sort of taken. The Amazon is the longest river and has the most water flow, but what’s the 2nd ranking river? The Congo! The Congo is the 2nd biggest river […]

Exploring the Apache Kafka “Castle” Part B: Event Reprocessing

Thursday 18th January 2018

In this second part of the Apache Kafka Castle blog we contemplate the being or not being of Kafka Event Reprocessing, and speeding up time! Reprocessing Use Cases Reprocess: /riːˈprəʊsɛs/ verb Process (something, especially spent nuclear fuel) again or differently. Repeat event processing is called reprocessing (or sometimes replaying or rewinding), and some reprocessing use […]

Exploring the Apache Kafka “Castle” Part A: Architecture and Semantics

Friday 12th January 2018

NEWS FLASH Apache Kafka Coming Soon to Instaclustr’s Service Offering! (Source: Wikipedia) If you haven’t read Kafka’s “The Castle” (I haven’t) a few online observations are sufficient for a concise summary (and will save you the trouble of reading it): Time seems to have stopped in the village The story has no ending (Kafka died […]

Pick‘n’Mix: Cassandra, Spark, Zeppelin, Elassandra, Kibana, and Kafka

Tuesday 5th December 2017

Kafkaesque: \ käf-kə-ˈesk \ Marked by a senseless, disorienting, menacing, nightmarish complexity. One morning when I woke from troubled dreams, I decided to blog about something potentially Kafkaesque: Which Instaclustr managed open-source-as-a-service(s) can be used together (current and future)? Which combinations are actually possible? Which ones are realistically sensible? And which are nightmarishly Kafkaesque!? In previous blogs, […]

Spark Structured Streaming with DataFrames

Tuesday 28th November 2017

This blog provides an exploration of Spark Structured Streaming with DataFrames. The blog extends the previous Spark MLLib Instametrics data prediction blog example to make predictions from streaming data. We demonstrate a two-phase approach to debugging, starting with static DataFrames first, and then turning on streaming. Finally, we explain Spark structured streaming in more detail […]

A Luxury Voyage of (Data) Exploration by Apache Zeppelin

Thursday 9th November 2017

Data Exploration into the cutting-edge technology of Apache Zeppelin (Source: Shutterstock) The catastrophic crash of the Hindenburg in 1937 ended the era of luxury travel in the colossal fast ships of the air that were pushing the boundaries of air travel technology.  Zeppelins had many experimental innovations like an auto-pilot, were made from Duralumin girders […]

Behind The Scenes

Wednesday 25th October 2017

Spoiler alert! Kubrick’s scientific consultant Frederick Ordway once revealed that Kubrick had the props for the film destroyed because he didn’t want to ruin the illusion of 2001 for people.  If you prefer to believe that 2001 was real, stop reading now, as behind-the-scenes photos did survive. 2001 pioneered lots of special effects! It was […]

Fourth Contact With a Monolith

Friday 20th October 2017

“The thing’s hollow — it goes on forever — and — oh my God! — it’s full of stars!” It’s full of Spreadsheets! (DataFrames) (Source: Wikimedia Commons) Given that a dog, Laika, was the 1st astronaut to orbit the earth, it’s appropriate for a dog to travel through the wormhole. After travelling through the wormhole, […]

Third Contact with a Monolith: Part C—In the Pod

Friday 29th September 2017

A Simple Classification Problem: Will the Monolith React? Is It Safe?! Maybe a cautious approach to a bigger version of the Monolith (2km long) in a POD that is only 2m in diameter is advisable.   What do we know about how Monoliths react to stimuli? A simple classification problem consists of the category (label) “no […]

Third Contact With a Monolith—Beam Me Down Scotty

Wednesday 20th September 2017

Regression Analysis is (relatively) easy Hypothesis: Using only a subset of GC metrics we can compute linear regression functions using only heap space used to predict when the next GC occurs. To do this we don’t need access to all the metrics per host, just a subset. And we can extend it in the future to […]

Third contact with a Monolith—Long Range Sensor Scan

Thursday 14th September 2017

The Odyssey Continues: A Long Trip to Jupiter Earth to Mars distance = 0.52 AU (1.52-1AU, 78M km) Earth to Jupiter distance = 4.2 AU (5.2-1AU, 628M km) It’s a long way to Jupiter, would you like to: (a) sleep the whole way in suspended animation?  (bad choice, you don’t wake up) (b) be embodied […]

Hello Cassandra! A Java Client Example

Thursday 7th September 2017

This is the third (and final) part of my blog-series on creating a demonstration Cassandra cluster, connecting, and communicating. We landed on the moon and made Second Contact with the Monolith (CQL shell) in our last blog, but what can we do to understand the Monolith better? Let’s explore a Cassandra Java client program. Java Client […]

Consulting Cassandra: Second Contact with the Monolith

Wednesday 6th September 2017

(Source: Shutterstock) In the first part of this blog (Cluster Creation in Under Ten Minutes), I created a Cassandra cluster. In this part, we blast off to the Moon for 2nd contact. Consulting the Oracles Croesus: Hi Oracle.  How will my war with Cyrus the Persian go? Oracle: If you proceed, a great empire will be […]

Cassandra Cluster Creation in Under 10 Minutes

Tuesday 29th August 2017

(Source: Shutterstock – edited) Enough Information I watched the classic movie “2001: A Space Odyssey” for the nth time on the weekend. My previous favourite quote from HAL (the eventually paranoid and murderous ship AI) was: Dave: Open the pod bay doors, HAL. HAL: I’m sorry, Dave. […]
