Apache Kafka Tiered Storage
Apache Kafka’s Tiered Storage is a data management strategy that divides storage into two tiers: local and remote. Data moves asynchronously from the local tier to the remote tier based on configured retention values, allowing large volumes of data to be stored efficiently at reduced cost.
Background Knowledge
In a typical Apache Kafka cluster, each broker has local storage attached to it, and topic data is stored only in that local storage. This approach has a few issues. As the amount of data grows, the local storage must grow with it. Without tiered storage, a Kafka cluster’s storage can only be scaled by adding more broker nodes or replacing current nodes with higher-capacity ones, which is not cost-effective. Adding new nodes also requires copying a lot of data, making operations difficult and time-consuming.
With Tiered Storage in Apache Kafka, storage is split into two tiers: a local tier where the most recent data is stored, and a remote tier where historical data is archived. References to the remote data are kept in the broker, so Kafka can quickly retrieve it from the remote tier when needed. The tiered storage approach offers many benefits; a few noteworthy ones are:
- Cost Optimization: Data can be stored based on its specific needs. For example, high-performance, expensive storage can be used for latency-sensitive applications, and lower-cost, slower storage for less frequently accessed data. This reduces overall storage costs.
- Scalability: With Tiered Storage, a Kafka cluster can be scaled more efficiently by scaling compute and storage independently. Growing data volumes can be accommodated by expanding the remote storage rather than adding new broker nodes just for storage capacity. A node added for additional compute also takes less time to catch up, since the remote storage is shared across all nodes.
How it works
In the tiered storage approach, there are two tiers of storage: local and remote. Local is usually the faster, more expensive storage; remote can be slower, cost-effective storage. Enabling Tiered Storage on an Apache Kafka cluster does not change the way producers and consumers interact with the cluster; it only affects how data is retained and retrieved.
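As a concrete sketch, open-source Apache Kafka (3.6+, per KIP-405) exposes broker- and topic-level settings like the following to enable tiering. On a managed cluster these are typically handled by the provider, and the values shown here are illustrative, not recommendations:

```properties
# Broker-level: enable the remote log storage subsystem (Apache Kafka 3.6+)
remote.log.storage.system.enable=true

# Topic-level: enable tiering for a topic and bound how long data
# stays on local disk; older data is served from the remote tier.
remote.storage.enable=true
# ~1 day on local disk
local.retention.ms=86400000
# 7 days total retention across both tiers
retention.ms=604800000
```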
Data retention and segment management
When producers write to a Tiered Storage enabled topic, the data is first stored in local storage as normal, organized into segments on the brokers’ local disks. A Tiered Storage enabled topic has additional retention settings that define the retention threshold and how long data stays in each tier. Segment size also influences data retention capacity, since only closed (rolled) segments are eligible to move. Depending on the local retention settings, segments are transferred asynchronously to the remote storage, and the leader creates and saves the metadata of each remote object in an internal topic. This metadata is then used to build the remote references within the broker that keep track of the data.
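The interplay between local retention, upload, and local deletion can be illustrated with a small, simplified model. This is not Kafka’s actual implementation, and the names below are invented for the sketch; it only captures the rule that segments past the local threshold become upload candidates, and local copies are removed only after a successful upload:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    base_offset: int
    last_timestamp_ms: int   # timestamp of the newest record in the segment
    closed: bool             # only closed (rolled) segments are eligible to move
    uploaded: bool = False   # set once the segment is safely in remote storage

def past_local_retention(seg: Segment, now_ms: int, local_retention_ms: int) -> bool:
    """A closed segment is past the local threshold once its newest record is old enough."""
    return seg.closed and now_ms - seg.last_timestamp_ms > local_retention_ms

def segments_to_upload(segments, now_ms, local_retention_ms):
    """Closed segments past the local threshold that are not yet in remote storage."""
    return [s for s in segments
            if past_local_retention(s, now_ms, local_retention_ms) and not s.uploaded]

def segments_deletable_locally(segments, now_ms, local_retention_ms):
    """Local copies may be removed only AFTER a successful upload."""
    return [s for s in segments
            if past_local_retention(s, now_ms, local_retention_ms) and s.uploaded]
```

The model also shows why a backlog forms: if uploads stall (for example, due to network errors), segments stay in `segments_to_upload` and their local copies cannot be deleted, so local usage grows beyond the threshold.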
In situations such as high data ingestion rates or temporary errors like network connectivity issues, local storage may temporarily exceed the configured local retention threshold, resulting in additional data accumulating in the local tier. The log cleaner will not remove this data until it has been successfully uploaded to remote storage.
Increasing the local retention threshold won’t move segments already uploaded to remote storage back to local storage. This change only affects new data segments.
Data retrieval by consumer
When a consumer fetches from a Tiered Storage enabled topic, data available in local storage is served from the local disk. If the requested data sits in remote storage, the broker streams it into its in-memory buffer (and on-disk cache), then sends it back to the client.
Limitations of Tiered storage
The following are the limitations of Kafka’s Tiered Storage. For more information, please refer directly to the Kafka Tiered Storage Release Notes here.
- Tiered Storage is not supported for compacted topics.
- If you enable Tiered Storage for a topic, you cannot deactivate it yourself. Later versions of Kafka are expected to support this. For now, please reach out to our support team for help with this.
For help with setting up the remote storage, please refer here. To provision an Instaclustr for Apache Kafka cluster with Tiered Storage enabled, please see Creating an Apache Kafka Cluster.
Frequently Asked Questions
Q: Should I create 1 bucket per cluster I want to enable Tiered Storage for, or can I have multiple clusters’ tiered data being stored in the same S3 bucket?
A: We recommend creating a separate bucket for each cluster. However, storing all data in the same bucket is supported, since each cluster’s data goes into a folder with the cluster ID as the prefix and can therefore be easily identified as belonging to a specific cluster. You must ensure the prefix is not changed once created, as doing so could break tiered storage functionality.
Q: How are instance IAM profiles/roles updated to allow the writes to the S3 bucket?
A: This should be done automatically by our systems when tiered storage is enabled on your cluster.
Q: Can I restrict the policy so the bucket can only be accessed by the VPC endpoint of the cluster?
A: There are two options if you want to access the S3 bucket via VPC endpoint only:
- The recommended approach is to provision a private network cluster, in which case the endpoint is provisioned automatically.
- If you have a public network cluster, after provisioning with a regular S3 bucket you can contact our support team to add a backup S3 gateway, i.e. a new VPC endpoint in your VPC. Once that is done, the endpoint’s ID can be used in the bucket policy to restrict traffic.
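As an illustrative sketch of the second option, such a bucket policy can deny requests that do not arrive through the expected endpoint using AWS’s `aws:SourceVpce` condition key. The bucket name and endpoint ID below are placeholders; substitute your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessOutsideVpcEndpoint",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-tiered-storage-bucket",
        "arn:aws:s3:::example-tiered-storage-bucket/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpce": "vpce-0123456789abcdef0"
        }
      }
    }
  ]
}
```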
Q: Is it better to use the S3 Express One Zone tier or is S3 Standard or another S3 option preferred?
A: We advise all customers to use S3 Standard and save on storage costs. For data that you need to access faster, it is advisable to configure that topic’s local retention period so that the more frequently needed data is retained locally, allowing faster access, instead of choosing a more expensive S3 storage option that would apply to all data ever stored in the bucket (not just data for that one topic). Please also note that some of the other S3 storage options are either only accessible from one AZ (for example, S3 Express One Zone) or enable an additional form of tiering (for example, S3 Intelligent-Tiering), both of which would cause issues with how Kafka and Kafka Tiered Storage work.
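For example, a hot topic whose last three days of data must be read quickly could keep that window local while older data ages out to S3 Standard. These are upstream Kafka topic-level settings and the values are purely illustrative:

```properties
remote.storage.enable=true
# ~3 days of the "hot" window on local disk for fast reads
local.retention.ms=259200000
# ~30 days total retention; the remainder lives in S3 Standard
retention.ms=2592000000
```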
Please contact [email protected] for any further inquiries.