As organizations continue to demand high-performance analytic database solutions, ClickHouse remains a top choice due to its exceptional speed and efficiency. To help you harness the full potential of ClickHouse, we’ve put together a comprehensive guide to best practices. This article is the second part of our two-part blog series, where we delve into crucial aspects of effectively and efficiently running your ClickHouse cluster. In this part, we will explore strategies for scalability, high availability, data management, and more. If you haven’t read the first part yet, you can check it out here.
Scalability and high availability
Sharding: The key to scalability
Choosing a sharding key that distributes data evenly across shards is crucial to avoiding performance-impairing hotspots. For example, if age is used as a sharding key, most users might fall into a few age ranges, making the data unevenly distributed.
To ensure consistency and efficiency, apply schema changes across all shards using the ON CLUSTER clause. Executing such queries on individual nodes involves redundant effort and increases the risk of unintended errors, which may lead to schema inconsistencies.
Use the distributed table engine for automatic row distribution across shards. Prefer local joins within a shard over distributed joins and co-locate frequently joined tables on the same shard to optimize performance.
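As a minimal sketch of these points, the example below creates a replicated local table on every node with ON CLUSTER and adds a Distributed table on top of it that spreads rows across shards by hashing a high-cardinality key. The cluster name (my_cluster), table names, and columns are assumptions for illustration only:

```sql
-- Replicated local table, created on every node of the cluster in one statement.
CREATE TABLE events_local ON CLUSTER my_cluster
(
    user_id  UInt64,
    event_ts DateTime,
    payload  String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY (user_id, event_ts);

-- Distributed table routing rows to shards by a hash of a high-cardinality key.
CREATE TABLE events ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, currentDatabase(), events_local, cityHash64(user_id));
```

Hashing a high-cardinality key such as user_id helps avoid the hotspot problem described above, because the hash spreads rows evenly regardless of the values’ natural distribution.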
Replicas: Your high availability heroes
Each shard should have multiple replicas (three are recommended) for high availability and to handle node failures, ideally with one replica per availability zone. With Instaclustr for ClickHouse, you decide the desired number of replicas, and the managed platform automatically distributes replica nodes across availability zones so that your data remains available and safe in the event of an unexpected partial outage.
Apply schema changes consistently across all replicas using the ON CLUSTER clause to avoid inconsistencies. Monitor replication lag using system tables such as system.replicas and address any significant lag promptly.
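For instance, a query along these lines surfaces replicas with a backlog or noticeable delay (the thresholds are illustrative, not recommendations):

```sql
-- Replicas that are falling behind: large queue or significant delay in seconds.
SELECT database, table, replica_name, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 60 OR queue_size > 100;
```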
Scaling ClickHouse: Go big or go home
ClickHouse’s architecture is designed to fully utilize all available CPU cores and memory on individual nodes. If your query performance is limited by CPU, RAM, or storage, you will likely benefit from scaling vertically. To achieve this, increase CPU, RAM, or storage capacity as needed. After scaling, tune the relevant ClickHouse settings (such as max_server_memory_usage_to_ram_ratio and cache_size_to_ram_max_ratio) to match the new hardware.
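As a quick sanity check after resizing a node, you can confirm that these server-level settings reflect the new hardware. The system.server_settings table is available in recent ClickHouse releases; on older versions, check config.xml directly:

```sql
-- Review memory-related server settings after a vertical scaling operation.
SELECT name, value, changed
FROM system.server_settings
WHERE name IN ('max_server_memory_usage_to_ram_ratio', 'cache_size_to_ram_max_ratio');
```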
Horizontal scaling is accomplished by means of sharding. Plan for sharding from the start, if possible, as it’s easier to expand a sharded setup than to shard a large, monolithic database later. The Instaclustr Managed Platform for ClickHouse enables you to easily initiate vertical or horizontal scaling, and the rest is all managed for you.
Ensure similar hardware and configurations across all cluster nodes to avoid bottlenecks and achieve uniform performance.
Data management and optimization
Crafting the perfect schema
For the primary key, select low-cardinality columns that are frequently used in query filters, and use the same columns (or a superset of them) for the ORDER BY clause. This approach optimizes indexing and sorting, leading to faster query execution and reduced resource consumption.
Use appropriate data types to minimize disk usage and improve performance, and the LowCardinality data type for string-type columns with limited distinct values. For instance, suppose you have a column for country codes, which are always two-character strings (like US, AU, etc.). In this case, instead of using a generic String type, you could use FixedString(2) or even LowCardinality(FixedString(2)) if the number of unique codes is small. Likewise, if a numeric column only contains values from 0 to 255, you should use UInt8 instead of Int32 or Int64.
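The sketch below combines both ideas: low-cardinality filter columns lead the ORDER BY, and compact types keep storage small. The table and column names are illustrative assumptions:

```sql
CREATE TABLE user_events
(
    country_code LowCardinality(FixedString(2)),  -- two-letter codes such as 'US', 'AU'
    event_type   LowCardinality(String),          -- small set of distinct values
    age          UInt8,                           -- values 0-255 fit in a single byte
    event_ts     DateTime,
    user_id      UInt64
)
ENGINE = MergeTree
ORDER BY (country_code, event_type, event_ts);   -- frequent, low-cardinality filters first
```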
Use secondary (data skipping) indexes sparingly and only after careful consideration: if a filtering condition is likely to read many data chunks (granules) from disk anyway, a data skipping index may incur unexpectedly high runtime overhead.
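If you do decide a skipping index is justified, it might look like the hypothetical example below; the index name, expression, type, and granularity are assumptions to adapt to your own data:

```sql
-- Hypothetical minmax skipping index on a frequently filtered column.
ALTER TABLE user_events
    ADD INDEX idx_user_id user_id TYPE minmax GRANULARITY 4;

-- Build the index for parts that already exist (new parts get it automatically).
ALTER TABLE user_events MATERIALIZE INDEX idx_user_id;
```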
Denormalize data to reduce the need for joins, which are generally more expensive in columnar databases. Design schemas to minimize joins, especially large many-to-many joins, and consider pre-joining data in ETL processes or using materialized views when necessary.
Test and iterate on your schema with real queries and data before finalizing it. Use materialized views to pre-aggregate data and accelerate common queries and choose the right compression codecs for your columns based on data patterns.
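As one hedged illustration of pre-aggregation, a materialized view can feed a SummingMergeTree target table so that a common daily rollup is maintained at insert time (all names here are assumptions):

```sql
-- Target table holding the pre-aggregated rollup.
CREATE TABLE daily_events_by_country
(
    event_date   Date,
    country_code LowCardinality(FixedString(2)),
    events       UInt64
)
ENGINE = SummingMergeTree
ORDER BY (event_date, country_code);

-- Materialized view that populates the rollup as new rows are inserted.
CREATE MATERIALIZED VIEW daily_events_by_country_mv
TO daily_events_by_country AS
SELECT
    toDate(event_ts) AS event_date,
    country_code,
    count() AS events
FROM user_events
GROUP BY event_date, country_code;
```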
TTL magic: Automate data management
Use TTL settings to automatically delete rows older than a certain age, aligning with data retention policies specific to your needs. TTL can also move less frequently accessed data to cheaper storage, optimizing costs. Create clear TTL rules and ensure they are:
- Defined based on your data lifecycle requirements
- Aligned with partitioning schemes for efficient data management
- Replicated across all replicas for consistency
Test and validate TTL settings before deploying them in production to ensure they work as expected and do not inadvertently delete or move data prematurely. Ensure data remains secure during and after TTL-based moves between storage media. The Instaclustr for ClickHouse managed service offers support for tiered storage, whereby you can use TTL rules to offload data to cloud object storage such as AWS S3 or Azure Blob Storage.
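A possible set of TTL rules, assuming a tiered storage policy with a volume named 'cold', might look like this; the intervals and volume name are placeholders to align with your own retention policy:

```sql
-- Move rows to cheaper storage after 3 months, delete them after 13 months.
ALTER TABLE user_events
    MODIFY TTL
        event_ts + INTERVAL 3 MONTH TO VOLUME 'cold',
        event_ts + INTERVAL 13 MONTH DELETE;
```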
Parts and merges: The hidden performance boosters
In ClickHouse, each insert creates a new part, and too many small parts increase overhead for background merges. This can result in significantly degraded write and query performance. Therefore, it is highly recommended to insert data in large batches.
Use the OPTIMIZE command to merge parts manually if automatic merges aren’t keeping up, and avoid having too many small parts, which can increase system load and reduce performance. Monitor parts using the system.parts table, aiming for an ideal number of parts per table and appropriate sizes.
Please note that the OPTIMIZE command can consume significant resources and should be run during off-peak hours. Avoid overusing this command; instead, investigate and address the root causes of excessive part creation.
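A quick way to spot tables accumulating parts, followed by a manual merge to be used sparingly and off-peak:

```sql
-- Tables with the most active parts.
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;

-- Expensive: rewrites data; treat as a last resort rather than routine maintenance.
OPTIMIZE TABLE user_events FINAL;
```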
Partitioning: Divide and conquer
Choose a partition key that reflects common data access patterns and distributes data evenly. Select partition granularity that suits your data volume and query patterns and use partitioning for efficient data retention management.
Align TTL policies with partitioning schemes to automate data management at the partition level. Avoid too many partitions, which can create excessive overhead, and design queries to take advantage of partition pruning for optimized performance.
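For example, monthly partitions (an assumption that suits many time-series workloads) let queries that filter on the partition key prune whole partitions:

```sql
CREATE TABLE page_views
(
    event_date Date,
    url        String,
    views      UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, url);

-- Only the partitions for the requested months are read.
SELECT url, sum(views) AS total_views
FROM page_views
WHERE event_date BETWEEN '2024-01-01' AND '2024-02-29'
GROUP BY url;
```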
Other considerations
In addition to the comprehensive best practices already outlined above, there are several other valuable strategies to consider for optimizing your ClickHouse deployment.
Monitor query performance, disk I/O, CPU and memory usage, and replication lag, as these are some of the crucial metrics to keep tabs on. Setting up alerting mechanisms ensures you are notified of potential issues before they become critical.
Regular backups are essential for data protection. Utilize ClickHouse’s built-in backup and restore features or integrate with external solutions to ensure data can be recovered in case of loss. Periodically test your backup and restore processes to verify that data recovery is successful during emergencies.
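As a minimal sketch of the built-in commands (available in recent ClickHouse releases), assuming a backup disk named 'backups' is defined in the server configuration:

```sql
-- Back up a table to a configured backup disk, then restore it from the same archive.
BACKUP TABLE default.user_events TO Disk('backups', 'user_events_2024_06.zip');

RESTORE TABLE default.user_events FROM Disk('backups', 'user_events_2024_06.zip');
```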
Security is another critical aspect. Implement robust access control measures using ClickHouse’s user management and permissions system to restrict access to sensitive data and operations. Ensure data at rest is encrypted, especially if storing sensitive information, and use TLS/SSL to encrypt data in transit to protect it from interception. Enabling and monitoring audit logging helps track access and changes to your data, aiding in identifying and responding to security incidents.
Regularly reviewing and optimizing your queries can lead to significant performance improvements. Use EXPLAIN to understand query execution plans, or inspect system.query_log and system.trace_log for a comprehensive analysis to identify bottlenecks and optimize inefficient query patterns. Materialized views can be employed to pre-aggregate and store the results of complex queries, greatly reducing execution time for repetitive and resource-intensive queries.
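For example, you might inspect a plan (including index usage) and then pull the slowest recent queries from the query log; the filter values are illustrative:

```sql
-- Show the query plan, including which primary key and skipping indexes are used.
EXPLAIN indexes = 1
SELECT count() FROM user_events WHERE country_code = 'US';

-- Slowest completed queries over the last day.
SELECT query_duration_ms, read_rows, memory_usage, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY query_duration_ms DESC
LIMIT 10;
```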
Managing configurations effectively is also important. Use version control for your ClickHouse configuration files to track changes, roll back to previous configurations, and collaborate with team members. Configuration management tools like Ansible, Puppet, or Chef can help deploy configurations consistently across your environment.
Data lifecycle management includes implementing data archiving strategies for long-term storage of historical data, moving older data to cheaper storage solutions while keeping it accessible for analysis if needed. Regularly clean up unused or obsolete data to free up resources and maintain optimal performance.
Maintaining up-to-date documentation for your ClickHouse deployment is vital. This should include architecture diagrams, configuration details, operational procedures, and troubleshooting guides. Providing training for your team on ClickHouse best practices, query optimization techniques, and operational procedures ensures that everyone is equipped to manage and use ClickHouse effectively.
ClickHouse is a fast-evolving project with a large and enthusiastic community of contributors, and numerous changes and fixes are introduced with each new version. To benefit from them, it is important to keep abreast of new releases and plan upgrades accordingly.
Last but not least, engaging with the ClickHouse community through forums, mailing lists, and conferences can provide valuable insights, tips, and support from other users and contributors. Additionally, consider subscribing to professional support services if you require guaranteed response times and expert assistance for your ClickHouse deployment.
Simplify your ClickHouse management with NetApp Instaclustr
In this two-part blog series (read part 1 here), we’ve covered a comprehensive set of best practices to optimize your ClickHouse deployment. From unlocking CPU potential and mastering memory to advanced sharding strategies and efficient scaling, these guidelines are designed to help you get the most out of your ClickHouse setup. However, we understand that implementing and administering these best practices can be overwhelming and time-consuming.
So, why manage it all yourself? With Instaclustr for ClickHouse, we aim to deliver the expertise to handle these complexities for you, allowing you to focus on your core business needs. Our team strives to ensure that you benefit from a curated and carefully selected hardware and network configuration tailored to your specific requirements.
There is a lot that we take off your shoulders—this includes:
- Taking care of regular patching and migration management, so you do not have to worry about staying up to date with the latest updates and security fixes.
- Providing comprehensive monitoring and alerting, backed by a 24×7 team of experts to keep your ClickHouse deployment running smoothly.
- Starting quickly and efficiently with our out-of-the-box, click-and-go, production-ready setups.
- Ensuring that your infrastructure is managed consistently and reliably with Infrastructure as Code (IaC) support.
Ready to simplify your ClickHouse management? Let us handle the technical complexities so you can focus on driving your business forward. Contact Instaclustr today and discover how we can help you achieve optimal performance and reliability with ClickHouse.