Cassandra User Defined Functions & Aggregates

Overview

User Defined Functions (UDFs) and User Defined Aggregates (UDAs) have seen a number of improvements in Cassandra 3.x. In particular, the sandboxing of UDF code makes this functionality safer in a production environment and has led us to include Java UDF support in our Cassandra 3.x managed service offering.

UDFs and UDAs allow user-provided code to be executed on the server side, on the coordinator node. The code should be simple, have no dependencies, and use only input parameters that come from table data. Functions are sandboxed with a custom security manager that blocks access to things like the file system and Cassandra internals, where there is potential for malicious use. Java and JavaScript are supported out of the box. Other languages that are compatible with JSR 223 are also supported but may require additional libraries to be installed (see the README files in the lib/jsr223 directory of your Cassandra install). The custom security manager sandboxes pure Java most effectively, and Java also performs best, with the lowest invocation latency, so sticking to pure Java is recommended.

Use Cases

Generally, UDFs and UDAs are useful wherever there is an advantage in running code on the Cassandra nodes:

Reducing network usage. When a function significantly reduces the amount of data returned to the client, overall performance improves, which may mean the benefits of UDFs are more easily seen on larger clusters.
Simplifying and cleaning up client-side code.
Providing functionality that is familiar to SQL users, such as group or distinct.
Performing pre-aggregation for Spark jobs.

Setting Up

There are two aspects to setting up UDFs and UDAs: enabling UDFs in the Cassandra configuration and setting user permissions.

As UDFs can be misused, they are turned off by default. To enable them, set the following in cassandra.yaml:

enable_user_defined_functions: true

We recommend using only pure Java, so leave the following set to false:

enable_scripted_user_defined_functions: false

Instaclustr provisioned clusters will have both of these settings set as recommended.

In order to run a UDF as part of a query, a role needs EXECUTE permission on the function, e.g.

GRANT EXECUTE ON ALL FUNCTIONS IN KEYSPACE football TO football_admin;

You will also need CREATE or ALTER permissions in order to add and replace functions within a keyspace. e.g.

GRANT CREATE ON ALL FUNCTIONS IN KEYSPACE football to football_admin;

GRANT ALTER ON ALL FUNCTIONS IN KEYSPACE football to football_admin;

Throughout the rest of this blog I will be using examples based on a football dataset. Here is the table creation statement for reference.

CREATE TABLE football.games_by_team (
    league_id text,
    season text,
    round int,
    team text,
    against text,
    home boolean,
    goals_for int,
    goals_against int,
    PRIMARY KEY ((league_id, season), team, round));

You can also find the CQL used at this GitHub page.

User Defined Functions

UDFs are added to a keyspace using the CREATE FUNCTION statement, which takes the form:

CREATE ( OR REPLACE )? FUNCTION ( IF NOT EXISTS )?
      ( <keyspace> '.' )? <function-name>
      '(' <arg-name> <arg-type> ( ',' <arg-name> <arg-type> )* ')'
      ( CALLED | RETURNS NULL ) ON NULL INPUT
      RETURNS <type>
      LANGUAGE <language>
      AS <body>

Where:

Argument types are CQL types; refer to the table below.
CALLED | RETURNS NULL defines the behaviour when a null argument is encountered. With RETURNS NULL ON NULL INPUT the function is not executed and simply returns null. With CALLED ON NULL INPUT the function is called with the null value, so your code needs to handle nulls appropriately.
Language is the name of the language used for the body of the function, e.g. Java.
Body consists of the custom code for the function.
Return type must be a valid CQL type; refer to the table below.

CQL	JAVA
boolean	java.lang.Boolean
int	java.lang.Integer
bigint	java.lang.Long
float	java.lang.Float
double	java.lang.Double
inet	java.net.InetAddress
text	java.lang.String
ascii	java.lang.String
timestamp	java.util.Date
uuid	java.util.UUID
timeuuid	java.util.UUID
varint	java.math.BigInteger
decimal	java.math.BigDecimal
blob	java.nio.ByteBuffer
list<E>	java.util.List<E> where E is also a type from this list
set<E>	java.util.Set<E> where E is also a type from this list
map<K,V>	java.util.Map<K,V> where K and V are also types from this list
(user type)	com.datastax.driver.core.UDTValue
(tuple type)	com.datastax.driver.core.TupleValue

Example

Here we create a simple function that gives us the margin of a football game. Notice that the Math and Integer classes are referenced directly.

football_admin@cqlsh:football> USE football;
football_admin@cqlsh:football> CREATE OR REPLACE FUNCTION football.margin (goals_for int, goals_against int)
            ... RETURNS NULL ON NULL INPUT
            ... RETURNS int LANGUAGE java AS '
            ... return Integer.valueOf(Math.abs(goals_for - goals_against));
            ... ';
football_admin@cqlsh:football> SELECT margin(goals_for, goals_against) FROM games_by_team
            ... WHERE league_id='English Premier League' AND season='2015/16' and team = 'Tottenham Hotspur';

 football.margin(goals_for, goals_against)
-------------------------------------------
                                         1
                                         0
                                         0
                                         3
                                         0
                                         2
                                         3
                                         0
...
                                         3
                                         4
                                         0
                                         0
                                         1
                                         4

(38 rows)
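
Because a Java UDF body is ordinary Java, it can be tried outside Cassandra before it is deployed. The following is a minimal sketch, assuming a hypothetical class MarginUdf that wraps the same body in a static method. The explicit null check only imitates what RETURNS NULL ON NULL INPUT does on the server; it is not part of the UDF itself.

```java
class MarginUdf {
    // Same body as the football.margin UDF. A CQL int arrives as java.lang.Integer.
    public static Integer margin(Integer goalsFor, Integer goalsAgainst) {
        // Imitates RETURNS NULL ON NULL INPUT: the body is skipped and null is returned.
        if (goalsFor == null || goalsAgainst == null) {
            return null;
        }
        return Integer.valueOf(Math.abs(goalsFor - goalsAgainst));
    }

    public static void main(String[] args) {
        System.out.println(margin(3, 1));    // 2
        System.out.println(margin(0, 4));    // 4
        System.out.println(margin(null, 2)); // null
    }
}
```

If the function were declared CALLED ON NULL INPUT instead, the null check would have to live inside the UDF body.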

Aggregates

Aggregates provide a combined result based on all the rows matching the query. Cassandra already has a number of built-in aggregates, which live in the system keyspace:

Count
Min
Max
Avg
Sum

These of course can be combined with functions for practical benefit.

Examples

We can calculate the number of goals scored for a team:

football_admin@cqlsh> select sum(goals_for) FROM football.games_by_team WHERE league_id='English Premier League' AND season='2015/16' AND team = 'Aston Villa';

 system.sum(goals_for)
-----------------------
                    27

(1 rows)

We can combine max with our margin function to find the greatest winning margin for the season.

football_admin@cqlsh:football> select max(margin(goals_for, goals_against)) from games_by_team WHERE league_id='English Premier League' AND season='2015/16';

 system.max(football.margin(goals_for, goals_against))
-------------------------------------------------------
                                                     6

(1 rows)

User Defined Aggregates

You can also create custom aggregates. These will utilise a user defined state function and an optional final function. You can add them to your keyspace with create statements that have the following syntax:

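CREATE ( OR REPLACE )? AGGREGATE ( IF NOT EXISTS )?
      ( <keyspace> '.' )? <aggregate-name>
      '(' <arg-type> ( ',' <arg-type> )* ')'
      SFUNC <state-functionname>
      STYPE <state-type>
      ( FINALFUNC <final-functionname> )?
      ( INITCOND <init-cond> )?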

SFUNC: the state function, which is called once for every row returned. The return value of the state function becomes the state parameter for the next call.

STYPE: the type of the state parameter, which must be a valid CQL type.

FINALFUNC: an optional function called once after the state function has been called for every row. Its input is the return value of the last state function call.

INITCOND: the initial value of the state passed to the first state function call. The default value is null.
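
The way these four pieces fit together can be pictured as a fold over the matching rows. The following plain Java sketch is only a model of that description, not Cassandra's implementation; the class AggregateSketch, its aggregate method and the sample data are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

class AggregateSketch {
    // S plays the role of STYPE, R is a row value, T is the type the aggregate returns.
    public static <S, R, T> T aggregate(S initCond,
                                        BiFunction<S, R, S> sfunc,
                                        Function<S, T> finalFunc,
                                        List<R> rows) {
        S state = initCond;                  // INITCOND seeds the state
        for (R row : rows) {
            state = sfunc.apply(state, row); // SFUNC: its return value is the next state
        }
        return finalFunc.apply(state);       // FINALFUNC runs once, after the last row
    }

    public static void main(String[] args) {
        // Illustrative data only: goals scored in four games.
        List<Integer> goals = Arrays.asList(2, 0, 3, 1);
        String result = aggregate(0,
                (total, g) -> total + g,     // state function: running total
                total -> "total=" + total,   // final function: format the result
                goals);
        System.out.println(result);          // total=6
    }
}
```

When no FINALFUNC is declared, the aggregate simply returns the last state, which corresponds to passing an identity function here.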

Examples

In this example we create an aggregate for calculating the total goals scored per team. This does not require a final function. It simply returns the map from the last call to the state function.

football_admin@cqlsh:football> CREATE OR REPLACE FUNCTION group_and_sum_state (state map<text,int>, group text, score int)
            ... CALLED ON NULL INPUT
            ... RETURNS map<text,int>
            ... LANGUAGE java AS '
            ... state.put(group, score + state.getOrDefault(group, 0));
            ... return state;
            ... ';
football_admin@cqlsh:football> CREATE AGGREGATE IF NOT EXISTS group_and_sum (text, int)
            ... SFUNC group_and_sum_state
            ... STYPE map<text,int>
            ... INITCOND {};

Here you can see the UDA being used in a query. We get back a single row with a map showing the teams and their total goals scored across the season.

football_admin@cqlsh:football> select group_and_sum (team, goals_for) from football.games_by_team
            ... WHERE league_id='English Premier League' AND season='2015/16';

 football.group_and_sum(team, goals_for)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {'AFC Bournemouth': 45, 'Arsenal FC': 65, 'Aston Villa': 27, 'Chelsea FC': 59, 'Crystal Palace': 39, 'Everton FC': 59, 'Leicester City': 68, 'Liverpool FC': 63, 'Manchester City': 71, 'Manchester United': 49, 'Newcastle United': 44, 'Norwich City': 39, 'Southampton FC': 59, 'Stoke City': 41, 'Sunderland AFC': 48, 'Swansea City': 42, 'Tottenham Hotspur': 69, 'Watford FC': 40, 'West Bromwich Albion': 34, 'West Ham United': 65}

(1 rows)
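
Because group_and_sum_state is declared CALLED ON NULL INPUT, a row whose goals_for is null would reach the body, and unboxing the null score would throw a NullPointerException. The following is a minimal sketch of a null-tolerant variant, written as a plain Java method so it can be tried outside Cassandra. The class name GroupAndSumSketch, the sample rows and the decision to skip rows with missing values are assumptions, not part of the original function.

```java
import java.util.HashMap;
import java.util.Map;

class GroupAndSumSketch {
    // Plain-Java version of the state function body with null guards added.
    public static Map<String, Integer> groupAndSumState(Map<String, Integer> state,
                                                        String group, Integer score) {
        if (group == null || score == null) {
            return state; // assumption: rows with missing values are ignored
        }
        state.put(group, score + state.getOrDefault(group, 0));
        return state;
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>(); // plays the role of INITCOND {}
        // Illustrative rows only, not taken from the dataset.
        state = groupAndSumState(state, "Team A", 2);
        state = groupAndSumState(state, "Team B", 1);
        state = groupAndSumState(state, "Team A", null); // skipped
        state = groupAndSumState(state, "Team A", 3);
        System.out.println(state.get("Team A")); // 5
        System.out.println(state.get("Team B")); // 1
    }
}
```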

In the following example we create a mode aggregate to find the most common score. Here a final function is required, in which we traverse the map to find the mode. Care should be taken when using logic with loops: an expensive calculation repeated over a large partition can become very costly.

football_admin@cqlsh:football> CREATE OR REPLACE FUNCTION football.mode_state (state map<ascii,int>, goals_for int, goals_against int)
            ... CALLED ON NULL INPUT RETURNS map<ascii,int> LANGUAGE java AS '
            ... String key = new String (goals_for + " - " + goals_against);
            ... state.put(key, 1 + state.getOrDefault(key, 0));
            ... return state;
            ... ';
football_admin@cqlsh:football> CREATE OR REPLACE FUNCTION football.mode_final (state map<ascii,int>)
            ... CALLED ON NULL INPUT RETURNS ascii LANGUAGE JAVA AS '
            ... String mostCommon = null;
            ... int max_count = 0;
            ... for (String key : state.keySet()) {
            ... int value = state.get(key);
            ... if (value > max_count) { mostCommon = key;max_count = value; }}
            ... return mostCommon;
            ... ';
football_admin@cqlsh:football> CREATE AGGREGATE IF NOT EXISTS football.mode (int, int)
            ...   SFUNC mode_state
            ...   STYPE map<ascii,int>
            ...   FINALFUNC mode_final
            ...   INITCOND {};
football_admin@cqlsh:football> select football.mode(goals_for,goals_against) FROM football.games_by_team WHERE league_id='English Premier League' AND season='2015/16' AND team = 'Leicester City';

 football.mode(goals_for, goals_against)
-----------------------------------------
                                   1 - 0

(1 rows)
football_admin@cqlsh:football> select football.mode(goals_for,goals_against) FROM football.games_by_team WHERE league_id='English Premier League' AND season='2015/16';

 football.mode(goals_for, goals_against)
-----------------------------------------
                                   1 - 1

(1 rows)
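
The final function looks every key up a second time with state.get(key). A slightly tighter variant iterates over entrySet() instead. It is sketched here as a plain Java method so the logic can be checked outside Cassandra; the class name ModeFinalSketch and the sample counts are hypothetical. As in the original loop, ties are resolved by map iteration order and an empty map yields null.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ModeFinalSketch {
    // Returns the key with the highest count, or null if the map is empty.
    public static String modeFinal(Map<String, Integer> state) {
        String mostCommon = null;
        int maxCount = 0;
        for (Map.Entry<String, Integer> entry : state.entrySet()) {
            if (entry.getValue() > maxCount) {
                mostCommon = entry.getKey();
                maxCount = entry.getValue();
            }
        }
        return mostCommon;
    }

    public static void main(String[] args) {
        // Illustrative counts only, in the "goals_for - goals_against" key format.
        Map<String, Integer> state = new LinkedHashMap<>();
        state.put("1 - 0", 3);
        state.put("2 - 1", 5);
        state.put("0 - 0", 2);
        System.out.println(modeFinal(state)); // 2 - 1
    }
}
```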

Schema

If you require details on the functions and aggregates in your schema, you can query the system schema tables as follows:

football_admin@cqlsh:football> select function_name from system_schema.functions where keyspace_name = 'football';

 function_name
---------------------
 group_and_sum_state
              margin
          mode_final
          mode_state

(4 rows)
football_admin@cqlsh:football> select aggregate_name from system_schema.aggregates where keyspace_name = 'football';

 aggregate_name
----------------
  group_and_sum
           mode

(2 rows)

football_admin@cqlsh:football> select body from system_schema.functions where keyspace_name = 'football' AND function_name = 'margin';

 body
------------------------------------------------------------------
 \nreturn Integer.valueOf(Math.abs(goals_for - goals_against));\n

(1 rows)

Conclusions

UDFs and UDAs are a useful addition to your CQL tool belt. They enable you to enrich the out-of-the-box functionality of Cassandra. However, as they can be misused, appropriate care must be taken when using them. They are not a substitute for a well-designed data model, and consideration should be given to the overhead their use places on the coordinator node. They are best used in queries that hit small partitions; for large or multiple partitions we would stick to using Spark. Expect more to come on UDFs, as further features that build on this functionality, such as functional indexes, are planned.
