Mastering ClickHouse best practices: Infrastructure and operational excellence

As organizations increasingly seek high-performance analytic database solutions, ClickHouse stands out as a leading choice for its speed and efficiency. To help you get the most out of ClickHouse, we’ve compiled a guide to best practices. These recommendations cover various aspects of ClickHouse deployment and usage, helping you keep your implementation both robust and efficient. This article is the first of a two-part blog series and covers topics relating to infrastructure and operational excellence.

ClickHouse infrastructure essentials

Unlocking CPU potential

ClickHouse is heavily multi-threaded and benefits significantly from multiple CPU cores. We recommend at least four cores to handle parallel data processing efficiently. Favor CPUs with high clock speeds and strong single-threaded performance per core, as ClickHouse thrives on both.

Modern CPUs with advanced vector extensions (like AVX2 or AVX-512) can significantly speed up ClickHouse operations, so ensure your hardware supports these extensions. Additionally, keeping the max_threads setting in line with the number of CPU cores helps optimize CPU utilization. ClickHouse derives its default from the core count, so this mainly matters if the setting has been overridden.
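
As a rough illustration of the two checks above, the following Python sketch looks for AVX2 and AVX-512 in the flags line of a Linux /proc/cpuinfo dump and builds a SET max_threads statement from a core count. The function names are hypothetical helpers written for this article, and the sketch assumes an x86 Linux host.

```python
import os


def vector_extensions(cpuinfo_text):
    """Report AVX2 / AVX-512F support from the text of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The line looks like "flags : fpu vme ... avx2 ..."
            flags = set(line.partition(":")[2].split())
            break
    return {"avx2": "avx2" in flags, "avx512f": "avx512f" in flags}


def max_threads_statement(cores=None):
    """Build a SET statement pinning max_threads to the given core count."""
    cores = cores or os.cpu_count() or 1
    return f"SET max_threads = {cores}"
```

On a real host you would pass the contents of /proc/cpuinfo to vector_extensions and run the generated statement in a ClickHouse session only if max_threads has drifted from the core count.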

The Instaclustr for ClickHouse service is designed with these recommendations in mind, allowing you to focus on getting the best from your ClickHouse deployments.

Mastering memory for peak performance

ClickHouse performs most operations directly in memory, making ample RAM crucial. This is even more important if you are using integrations such as those for AWS S3 or Apache Kafka®. These are high-throughput, streaming-heavy integrations that need plenty of memory to handle buffered writes and reads efficiently, so we suggest you consider higher-memory machines for them.

To prevent queries from hogging all the memory, use settings like max_memory_usage, max_bytes_before_external_group_by, and max_bytes_before_external_sort to limit memory usage per query (see Restrictions on Query Complexity for details). Similarly, an overall limit across all queries (max_memory_usage_for_all_queries; recent ClickHouse releases use the server-level max_server_memory_usage setting for this purpose) can prevent out-of-memory situations. However, be cautious: overly tight limits can cause queries to fail or to slow down as they spill to disk.
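
A minimal sketch of how such per-query limits might be derived from a host's RAM is shown below. The 50% per-query fraction is purely an illustrative assumption, and setting the external GROUP BY and sort thresholds to half of max_memory_usage is likewise an assumption that you should tune for your workload. The helper names are hypothetical.

```python
def query_memory_settings(total_ram_bytes, query_fraction=0.5):
    """Derive per-query memory limits from total RAM (fractions are assumptions)."""
    max_memory_usage = int(total_ram_bytes * query_fraction)
    # Spill GROUP BY / ORDER BY state to disk well before the hard limit is hit.
    spill_threshold = max_memory_usage // 2
    return {
        "max_memory_usage": max_memory_usage,
        "max_bytes_before_external_group_by": spill_threshold,
        "max_bytes_before_external_sort": spill_threshold,
    }


def settings_clause(settings):
    """Render a dict of settings as a SETTINGS clause for a query."""
    return "SETTINGS " + ", ".join(f"{name} = {value}" for name, value in settings.items())
```

The rendered clause can be appended to an individual query, which keeps the limits scoped to that query rather than changing them for every user.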

Depending on the frequency and nature of queries, adjust the maximum number of concurrent queries allowed using max_concurrent_queries. This helps prevent excessive memory usage and can improve overall query performance.

Use data skipping indexes to reduce the amount of data read and processed, saving memory. Also ensure memory overcommit is not disabled (cat /proc/sys/vm/overcommit_memory should print 0 or 1) so that the system can allocate more memory than is physically available.
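
The overcommit check above can be scripted. The sketch below assumes a Linux host; the function names are hypothetical, and check_host returns None where the /proc file does not exist.

```python
from pathlib import Path

OVERCOMMIT_PATH = Path("/proc/sys/vm/overcommit_memory")


def overcommit_ok(raw_value):
    """True for modes 0 (heuristic) and 1 (always allow); False for 2 (strict)."""
    return raw_value.strip() in {"0", "1"}


def check_host(path=OVERCOMMIT_PATH):
    """Check the running host; returns None if the file is not available."""
    if not path.exists():
        return None
    return overcommit_ok(path.read_text())
```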

Avoid using swap space, as frequent memory swaps can severely degrade performance; ensure you have sufficient physical memory to handle workloads.

Certain operations like GROUP BY, JOIN, and large array functions can be particularly memory-intensive, so understand their implications and size your system appropriately or adjust queries to manage memory usage.

Turbocharging disk I/O

For optimal I/O performance, use SSDs where possible. In cloud deployments, choose I/O-optimized instances. This ensures data can be written to and read from disk fast enough for queries to run efficiently. Lower-performance storage, such as magnetic HDDs or object storage, is better suited to cold storage tiers.

Adequate disk space planning is essential to accommodate data growth and temporary files generated during query execution and merges. Ensure you have more than enough disk space to handle your anticipated data volumes and operations.
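
One simple way to keep an eye on this headroom is a free-space check like the sketch below. The 20% minimum free fraction is an arbitrary assumption for illustration, not a ClickHouse recommendation, and the function names are hypothetical.

```python
import shutil


def has_headroom(total_bytes, free_bytes, min_free_fraction=0.2):
    """True if at least min_free_fraction of the volume is still free."""
    return total_bytes > 0 and free_bytes / total_bytes >= min_free_fraction


def check_path(path, min_free_fraction=0.2):
    """Apply the headroom check to the volume that holds the given path."""
    usage = shutil.disk_usage(path)
    return has_headroom(usage.total, usage.free, min_free_fraction)
```

Pointing check_path at the ClickHouse data directory from a monitoring job gives an early warning before merges or temporary files run out of space.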

The Instaclustr for ClickHouse managed service includes support for Tiered Storage, allowing you to use predefined patterns for offloading some data to remote storage.

Network secrets for fast and secure queries

Deploy high-speed network interfaces (10GbE or higher) to avoid network throughput bottlenecks, which is crucial for processing distributed queries and for replication between nodes. Nodes supported by Instaclustr for ClickHouse have such high-speed network interfaces.

Secure communication is also vital, so use encryption (TLS/SSL) for data in transit, especially if your ClickHouse servers are distributed across different data centers or cloud regions.

Keeping a cluster within a virtual private network without an Internet Gateway makes it inaccessible from the Internet and thereby more secure. NetApp Instaclustr allows you to provision Private Network Clusters, where the setup and configuration of such clusters is all managed for you.

Operational excellence

Version management and updates

It is essential to keep your ClickHouse version up to date. However, upgrading a cluster can be complex. There are several factors to consider carefully, including potential breaking changes in the new version, selecting the most appropriate version for the upgrade, thoroughly testing the new version before rollout, and minimizing human error. You may also need to ensure the cluster remains available throughout the upgrade process, especially if high availability is critical. You can achieve this by configuring your cluster with replicas for each shard and upgrading one node per shard at a time.

Additionally, if you have a Network Load Balancer set up, client connections will be automatically routed to the remaining replicas, preserving availability during the upgrade. On the other hand, it can be troublesome to keep up with the continuous stream of minor patch releases containing important bug fixes and critical security updates. You should therefore settle on an update frequency that is practical to maintain without falling too far behind.
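
The rolling approach described above (one node per shard at a time) can be sketched as a simple ordering problem. The function name and the shard and host names below are hypothetical; the sketch only plans the order and does not perform any upgrade.

```python
def upgrade_waves(shards):
    """Plan upgrade waves so each wave touches at most one replica per shard.

    shards maps a shard name to its list of replica hostnames.
    """
    for name, replicas in shards.items():
        if len(replicas) < 2:
            # With a single replica the shard would be offline during its upgrade.
            raise ValueError(f"shard {name!r} needs at least two replicas")
    depth = max((len(replicas) for replicas in shards.values()), default=0)
    waves = []
    for i in range(depth):
        waves.append([replicas[i] for replicas in shards.values() if i < len(replicas)])
    return waves
```

For a cluster with shards s1 (a1, a2) and s2 (b1, b2), the plan is to upgrade a1 and b1 together, verify them, and then upgrade a2 and b2, so every shard always has one replica serving queries.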

Instaclustr performs regular updates to our managed ClickHouse clusters so that customers can be confident about application security and reliability. At any given time, we support multiple ClickHouse versions in our managed fleet, with a full list available here.

Load balancing like a pro

Implement load balancing to distribute query loads evenly across cluster nodes, either through application logic or dedicated load balancers. This not only optimizes performance but also provides high availability by ensuring client requests are not forwarded to unhealthy or offline nodes. Load balancers add a layer of health checking, making node replacement or resize operations transparent to clients. (On a related note, ClickHouse drivers for most popular languages also support client-side load balancing and failover.)
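
To make the application-logic option concrete, here is a minimal round-robin balancer that skips nodes marked unhealthy. It is a sketch of the general technique, not the behavior of any particular ClickHouse driver or load balancer; the class name and node names are hypothetical, and the health status is assumed to be supplied by an external health check.

```python
import itertools


class RoundRobinBalancer:
    """Cycle through nodes in order, skipping any that are marked unhealthy."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._cycle = itertools.cycle(self._nodes)
        self._healthy = set(self._nodes)

    def mark(self, node, healthy):
        """Record the result of an external health check for a node."""
        if healthy:
            self._healthy.add(node)
        else:
            self._healthy.discard(node)

    def next_node(self):
        """Return the next healthy node, or raise if none are available."""
        for _ in range(len(self._nodes)):
            node = next(self._cycle)
            if node in self._healthy:
                return node
        raise RuntimeError("no healthy nodes available")
```

A client would call next_node before each query and call mark(node, False) when a connection attempt fails, which gives basic failover without a dedicated load balancer.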

With Instaclustr for ClickHouse, load balancing is another feature we provide, and you can enable it by simply checking a box when creating a new cluster. Read more about it here.

ClickHouse Keeper best practices

ClickHouse Keeper should be deployed as an ensemble with an odd number of nodes (typically 3 or 5) for high availability and fault tolerance through quorum-based decision making. Running ClickHouse Keeper on separate nodes from your ClickHouse servers isolates the coordination workload from query processing, which benefits performance and stability and is recommended for production workloads. With Instaclustr for ClickHouse, a cluster whose Keeper is deployed on independent nodes may be eligible for higher uptime SLAs.

Ensure ClickHouse Keeper deployments are kept up to date and compatible with your ClickHouse version, and maintain quorum by promptly repairing or replacing downed nodes. Use SSDs for the Keeper coordination service, as it is sensitive to disk write latency.
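
The reasoning behind the odd-number recommendation is plain majority arithmetic, sketched below with hypothetical helper names. A quorum is a strict majority of the ensemble, so adding a fourth node to a three-node ensemble raises the quorum size without letting you lose any more nodes.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, quorum_size(size), tolerated_failures(size))
```

Ensembles of 3 and 4 nodes both tolerate one failure, while 5 nodes tolerate two, which is why even sizes add cost without adding fault tolerance.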

Where to go from here

In this article, we’ve explored some of the essential best practices to optimize your ClickHouse deployment, providing you with valuable insights to maximize the performance and efficiency of your setup. However, we recognize that the implementation and management of these practices can be both complex and time-consuming.

So, why manage it all yourself? With the Instaclustr for ClickHouse Managed Platform, we have the expertise to handle these complexities for you, allowing you to focus on your core business needs. Our team ensures that you benefit from a carefully selected hardware and network configuration tailored to your specific requirements.

There is a lot we can take off your shoulders. To name a few examples, we take care of regular patching and migration management, so you do not have to worry about staying up to date with the latest updates and security fixes. Additionally, we provide comprehensive monitoring and alerting to keep your ClickHouse deployment running smoothly. Our out-of-the-box, click-and-go, production-ready setups mean you can get started quickly and efficiently. With Infrastructure as Code (IaC) support, we ensure that your infrastructure is managed consistently and reliably.

Let NetApp Instaclustr manage your ClickHouse deployment, so you can concentrate on what you do best: growing your business. Ready to implement these ClickHouse best practices?

Contact Instaclustr for expert-managed services today!
