In the evolving landscape of data analytics and management, performance benchmarking is pivotal to understanding how different systems handle real-world workloads. At NetApp, we performance test each of the ClickHouse nodes we offer to provide insightful results and showcase the capabilities of our managed platform. This blog presents an overview of our testing methodology, the ClickBench method, and the results.
Testing methodology
1. Provisioning: We provisioned ClickHouse clusters using the Instaclustr Console.
2. Node selection: For each cluster we tested, we selected a different production node size from the available options.
3. Cluster initialization: Then we waited for the cluster to reach the running state, which took no more than 5 minutes.
4. Security configuration: By default, Instaclustr for ClickHouse clusters have firewalls configured to protect against data infiltration and exfiltration. We therefore configured an integration with the http://datasets.clickhouse.com domain, which allowed us to download the ClickBench “hits” data. See the support page for detailed information on how to do this.
5. Create table: Next we created a table to store the ClickBench data using the query found on GitHub.
6. Data loading: Then we inserted the hits data into the table using the following query:
INSERT INTO hits SELECT * FROM url('https://datasets.clickhouse.com/hits_compatible/hits.parquet', 'Parquet') SETTINGS enable_url_encoding = 0;
7. Execution: Finally, we ran the ClickBench queries, which can be found on GitHub, using a script. The query time for each was recorded, then processed to calculate the relative query time for each node size.
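The processing in step 7 can be sketched as follows. This is a minimal illustration, not the script we used: it assumes the ClickBench convention of comparing each query time to the fastest time recorded for that query, adding a 10 ms shift to both so that very short queries do not dominate, and then taking the geometric mean of the ratios. The function name `relative_times`, the node labels, and the timings are hypothetical.

```python
from math import exp, log


def relative_times(timings, shift=0.01):
    """Return {node: geometric mean of per-query ratios to the fastest node}.

    timings maps a node size to a list of query times in seconds, with the
    same query order for every node. shift (10 ms) is an assumed constant
    following the ClickBench summary convention.
    """
    nodes = list(timings)
    n_queries = len(timings[nodes[0]])
    # Fastest recorded time for each query across all node sizes.
    baseline = [min(timings[n][q] for n in nodes) for q in range(n_queries)]
    result = {}
    for n in nodes:
        ratios = [
            (timings[n][q] + shift) / (baseline[q] + shift)
            for q in range(n_queries)
        ]
        # Geometric mean, computed through logarithms.
        result[n] = exp(sum(log(r) for r in ratios) / len(ratios))
    return result


# Illustrative numbers only, not measured results.
example = {
    "small": [0.40, 2.00, 1.20],
    "large": [0.10, 0.50, 0.60],
}
scores = relative_times(example)
```

With this approach, the node size that is fastest on every query scores 1.0, and every other node size scores above 1.0 in proportion to how much slower it is.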
Limitations
The ClickBench benchmark, developed by ClickHouse, is designed to simulate typical workloads. By using a dataset derived from the traffic recordings of one of the largest web analytics platforms worldwide, the benchmark maintains realism and relevance. The dataset, while anonymized, retains essential data distributions, enabling accurate performance testing across various queries.
However, the following limitations should be noted:
- The dataset consists of a single flat table, rather than a set of related tables.
- The table contains exactly 99,997,497 records, which is relatively small for a production dataset and does not fully utilize the resources available for larger instances.
- The benchmark only tests single-node performance and does not utilize the entire ClickHouse cluster. Cluster-level results would be more helpful for making cluster sizing decisions.
- The benchmark runs queries sequentially and does not test workloads with concurrent requests, nor does it test for system capacity. Each query is run only a few times, which can introduce some variability in the results.
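To illustrate the last limitation, the following sketch runs a list of queries one at a time, repeats each a few times, and records both the best time and the spread between runs. It is an assumption-laden example rather than the benchmark's actual runner: `time_queries` and `fake_run_query` are hypothetical names, `tries=3` stands in for "a few times", and the query executor is passed in as a function so that any ClickHouse client could be substituted.

```python
import time


def time_queries(run_query, queries, tries=3):
    """Run each query sequentially `tries` times and summarize the timings.

    run_query is any callable that executes one SQL string; the client used
    to reach ClickHouse is deliberately left to the caller.
    """
    results = []
    for sql in queries:
        timings = []
        for _ in range(tries):
            start = time.perf_counter()
            run_query(sql)
            timings.append(time.perf_counter() - start)
        results.append(
            {
                "query": sql,
                "best": min(timings),
                # A large spread relative to the best time suggests a noisy result.
                "spread": max(timings) - min(timings),
            }
        )
    return results


def fake_run_query(sql):
    """Stand-in executor for demonstration; it sleeps instead of querying."""
    time.sleep(0.001)


report = time_queries(fake_run_query, ["SELECT 1", "SELECT 2"])
```

Because only one query runs at a time, a harness like this says nothing about behavior under concurrent load, which matches the limitation described above.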
Results
Our performance testing on the Instaclustr Managed Platform, using the ClickBench methodology, yielded the following results. Our findings demonstrate the efficiency of Instaclustr for ClickHouse, showing significant performance improvements as additional vCPUs are leveraged on larger nodes to reduce relative query times. However, as node size increases, the test becomes less challenging and the additional resources are not fully utilized, resulting in diminishing returns.
Sizing clusters is always workload dependent. Our results are useful for comparing the performance of these nodes when handling the ClickBench workload, which is only about 15 GB when compressed. Understanding comparative performance for this workload can inform sizing decisions. To better model real-world performance, a similar test could be repeated with a workload that more accurately reflects your production data.
Conclusion
In conclusion, our performance testing has demonstrated that moving to larger ClickHouse nodes on the Instaclustr platform can significantly improve performance. By methodically provisioning clusters and using the ClickBench benchmark, we have shown that larger node sizes reduce relative query times for this workload. However, it is important to note that as node size increases, the additional resources yield diminishing returns.
NetApp is committed to providing our customers with the insights and support needed to make informed decisions about their data infrastructure. Understanding performance characteristics for specific workloads is crucial for optimal cluster sizing, and our team is ready to assist you in tailoring your ClickHouse clusters to meet your production demands effectively.
Ready to harness the power of ClickHouse? Sign up here and elevate your open source data infrastructure strategy today.