Using Open Table Format for Integration

ClickHouse clusters on the NetApp Instaclustr platform can be used with Open Table Formats, such as Apache Iceberg and Delta Lake, that provide a read-only, table-like interface to Apache Iceberg and Delta Lake. This enables you to execute ad-hoc queries directly on open table formats that are hosted in Object Storage solutions without the need for data ingestion.

This guide provides you with key preparation steps for querying from object store; and includes SELECT queries for supported table functions.

Key Preparation Before You Begin

1. Grant the ClickHouse Cluster Storage Access

Depending on the type of object store being queried and its access level, it is necessary to take one of the following steps:

  • If you are using a private AWS S3 Bucket, you will need to configure a Private S3 Bucket Integration to grant ClickHouse permission to access your bucket. This automatically attaches IAM policies that enable secure egress.You must also update your own managed bucket policy to allow ClickHouse instance role or specified principals to action the following:
    • s3:ListBucket
    • s3:GetObject
  • If you are using a public AWS S3 bucket in a different region from your cluster, you must enable egress to the region where the S3 bucket is located. This can be accomplished by utilizing the S3 Region Integration.
  • If you are using Azure Blob Storage, you need to enable a Domain Integration for the storage account’s domain to facilitate outbound (egress) traffic for the cluster.Depending on the security policies currently applied to your Azure Blob Storage, you may need to update these policies to permit inbound traffic from the ClickHouse instance network to enable successful querying.

2. Configure a Named Collection

This step is optional but recommended as you can run queries without named collections. Named Collections let you avoid entering the same credentials or connection details repeatedly. If you are using named collections, however, they must exist on all nodes when running distributed (i.e. icebergS3Cluster, icebergAzureCluster, deltaLakeS3Cluster) queries.

The following examples illustrate how to create a named collection based on the object stores you intend to use with ClickHouse.

Setting up a Named Collection for AWS S3

Setting up a Named Collection for Azure Blob Storage

3. Potential Change to Egress Firewall Rules

Clusters on the NetApp Instaclustr managed platform are secured through egress firewall rules to protect against data exfiltration. The integration you choose may trigger the need for one or more firewall whitelist rules to be added for access. Do not enable AWS S3 or Domain integrations without first understanding and accepting the associated change risk.

How to Perform Ad-Hoc Queries with Open Table Formats

This section explains how to query Open Table Formats using ClickHouse table functions.

The query structure is consistent across all scenarios. Only the table function name and parameters change based on:

  • Storage Provider AWS S3 or Azure Blob Storage.
  • Table Format Apache Iceberg or Delta Lake.
  • Query Scope single node or cluster-wide.
  • Supplying Credentials a named collection is used.

Use the 3 steps below to design your Open table query.

Step 1: Choose Your Query Scope

Single Node

Use a single node table function when running the query on one ClickHouse node.

Cluster-wide

Use a cluster-wide table function to distribute the query across all nodes in the cluster.

Step 2: Choose How Credentials Are Supplied

Using a Named Collection (Recommended)

Using a named collection for your storage will simplify queries by avoiding repeated credential configuration. See guidance above to Configure a Named Collection.

For single node queries:

For cluster-wide queries:

Without a Named Collection

Alternatively, the following example illustrates the query where credentials are input directly into the query.

For single node queries:

For cluster-wide queries:

Step 3: Select the Appropriate Table Function

Use the tables below to determine the correct table function for your configuration.

Table Functions for AWS S3

Table Format Single Node Function Cluster-Wide Function
Iceberg icebergS3 icebergS3Cluster
Delta Lake deltaLakeS3 deltaLakeS3Cluster
Example S3 Iceberg query (Cluster-wide, No Named Collection):

Example S3 Delta Lake query (Single Node, Named Collection):

Table Functions for Azure Blob Storage

Table Format Single Node Function Cluster-Wide Function
Iceberg icebergAzure icebergAzureCluster
Delta Lake deltaLakeAzure N/A
Example Azure Iceberg query (Cluster-wide, No Named Collection):

Example Azure Delta Lake query (Single Node, No Named Collection):

Following these steps completes the setup. You can now query open table formats using ClickHouse table functions.

 

For assistance with NetApp Instaclustr for ClickHouse support for Open Table Formats or any further inquiries, please contact Instaclustr Support.