Technical | Cassandra | Wednesday 30th September 2020

Understanding the Impacts of the Native Transport Requests Change Introduced in Cassandra 3.11.5

By Adam Zegelin

Summary

Recently, Cassandra made changes to the Native Transport Requests (NTR) queue behaviour. Through our performance testing, we found the new NTR change to be good for clusters that have a constant load causing the NTR queue to block. Under the new mechanism the queue no longer blocks, but throttles the load based on a queue size setting, which by default is 10% of the heap.

Compared to the Native Transport Requests queue length limit, this improves how Cassandra handles traffic when queue capacity is reached. The “back pressure” mechanism more gracefully handles the overloaded NTR queue, resulting in a significant lift of operations without clients timing out. In summary, clusters with later versions of Cassandra can handle more load before hitting hard limits.

Introduction

At Instaclustr, we are responsible for managing the Cassandra versions that we release to the public. This involves performing a review of Cassandra release changes, followed by performance testing. In cases where major changes have been made in the behaviour of Cassandra, further research is required. So without further delay let’s introduce the change to be investigated.

Change:
  • Prevent client requests from blocking on executor task queue (CASSANDRA-15013)
Versions affected:
  • 3.11.5 and above

Background

Native Transport Requests

Native transport requests (NTR) are any requests made via the CQL Native Protocol. CQL Native Protocol is the way the Cassandra driver communicates with the server. This includes all reads, writes, schema changes, etc. There are a limited number of threads available to process incoming requests. When all threads are in use, some requests wait in a queue (pending). If the queue fills up, some requests are silently rejected (blocked). The server never replies, so this eventually causes a client-side timeout. The main way to prevent blocked native transport requests is to throttle load, so the requests are performed over a longer period.

Prior to 3.11.5

Prior to 3.11.5, Cassandra used the following configuration settings to set the size and throughput of the queue:

  • native_transport_max_threads is used to set the maximum number of threads for handling requests. Each thread pulls requests from the NTR queue.
  • cassandra.max_queued_native_transport_requests is used to set the queue size (default 128). Once the queue is full, the Netty threads are blocked waiting for the queue to have free space.

Once the NTR queue is full, requests are no longer accepted from any client. There is no strict ordering by which blocked Netty threads will process requests. Therefore, in 3.11.4, latency becomes random once all Netty threads are blocked.
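As a rough illustration of a count-bounded queue, the Python sketch below uses the standard library queue.Queue with the default capacity of 128 mentioned above. It is a toy model and not Cassandra's implementation: nothing drains the queue, and the function name fill_until_blocked is hypothetical.

```python
import queue

# Default of cassandra.max_queued_native_transport_requests, per the text above.
MAX_QUEUED_REQUESTS = 128


def fill_until_blocked(capacity, incoming):
    """Offer `incoming` requests to a bounded queue that nothing drains.

    Returns (accepted, would_block). A producer using a blocking put()
    would stall once the queue is full; put_nowait() makes that visible
    by raising queue.Full instead.
    """
    pending = queue.Queue(maxsize=capacity)
    accepted = 0
    would_block = 0
    for request_id in range(incoming):
        try:
            pending.put_nowait(request_id)
            accepted += 1
        except queue.Full:
            would_block += 1
    return accepted, would_block


accepted, would_block = fill_until_blocked(MAX_QUEUED_REQUESTS, 200)
print(accepted, would_block)
```

The point of the sketch is that the limit depends only on the number of queued requests, regardless of how large or small each one is.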

Figure 1: Native Transport Requests in Cassandra 3.11.4

Change After 3.11.5

In 3.11.5 and above, the NTR queue no longer blocks as previously described; instead, it is throttled. The limit is based on the heap size: native transport requests are limited by the total memory they occupy rather than by their number. Once the limit is reached, requests are paused.

  • native_transport_max_concurrent_requests_in_bytes is a global limit on in-flight NTRs, measured in bytes (default heapSize / 10).
  • native_transport_max_concurrent_requests_in_bytes_per_ip is a per-endpoint limit on in-flight NTRs, measured in bytes (default heapSize / 40).
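To make the defaults concrete, here is a minimal Python sketch of the arithmetic. The function name default_ntr_limits is hypothetical, integer division is an assumption, and the 8 GB heap is the value used for the test clusters later in this post.

```python
GIB = 1024 ** 3


def default_ntr_limits(heap_bytes):
    """Default byte limits as described above: heap/10 global, heap/40 per endpoint."""
    return {
        "native_transport_max_concurrent_requests_in_bytes": heap_bytes // 10,
        "native_transport_max_concurrent_requests_in_bytes_per_ip": heap_bytes // 40,
    }


limits = default_ntr_limits(8 * GIB)
for name, value in limits.items():
    print(f"{name}: {value} bytes (~{value / GIB:.2f} GiB)")
```

With an 8 GB heap this gives a global limit of roughly 0.8 GB and a per-endpoint limit of roughly 0.2 GB.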

Maxed Queue Behaviour

From previously conducted performance testing of 3.11.4 and 3.11.6, we noticed similar behaviour when the traffic pressure has not yet reached the point of saturation in the NTR queue. In this section, we will discuss the expected behaviour when saturation does occur and the breaking point is reached.

In 3.11.4, when the queue has been maxed, client requests will be refused. For example, an attempt to connect via cqlsh will yield an error (see Figure 2).

Figure 2: Timed out request (Cassandra 3.11.4, queue maxed out, client requests refused)

Alternatively, a client that tries to run a query may see a NoHostAvailableException.

A 3.11.4 cluster that previously experienced blocked NTRs will no longer see them blocked after an upgrade to 3.11.6. The reason is that 3.11.6 does not place a limit on the number of NTRs but rather on the total memory occupied by those NTRs. When the new size limit is reached, NTRs are paused. Default settings in 3.11.6 result in a much larger NTR queue than the small limit of 128 in 3.11.4 (in normal situations where the payload size is not extremely large).
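The idea of limiting by bytes rather than by count can be sketched with a small admission-control class. This is a simplified illustration and not Cassandra's actual code: the ByteBudget class, its method names, and the tiny 100-byte limit are all hypothetical.

```python
class ByteBudget:
    """Toy admission control: limit in-flight requests by total bytes, not count."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.in_flight_bytes = 0

    def try_admit(self, request_bytes):
        """Admit the request if it fits in the budget; otherwise signal a pause."""
        if self.in_flight_bytes + request_bytes > self.limit_bytes:
            return False  # the caller would pause reading from this connection
        self.in_flight_bytes += request_bytes
        return True

    def release(self, request_bytes):
        """Give the bytes back once the request has been processed."""
        self.in_flight_bytes -= request_bytes


budget = ByteBudget(limit_bytes=100)
results = [budget.try_admit(40) for _ in range(3)]  # third request exceeds 100 bytes
budget.release(40)                                  # one request completes
results.append(budget.try_admit(40))                # now there is room again
print(results)
```

In this model a paused request is not lost; it is simply admitted later, once earlier requests have released their share of the budget.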

Benchmarking Setup

This testing procedure requires the NTR queue on a cluster to be at max capacity with enough load to start blocking requests at a constant rate. In order to do this we used multiple test boxes to stress the cluster. This was achieved by using 12 active boxes to create multiple client connections to the test cluster. Once the cluster NTR queue is in constant contention, we monitored the performance using:

  • Client metrics: requests per second, latency from client perspective
  • NTR Queue metrics: Active Tasks, Pending Tasks, Currently Blocked Tasks, and Paused Connections.

For testing purposes we used two testing clusters with details provided in the table below:

| Cassandra | Cluster size | Instance Type | Cores | RAM | Disk |
|---|---|---|---|---|---|
| 3.11.4 | 3 | m5xl-1600-v2 | 4 | 16 GB | 1600 GB |
| 3.11.6 | 3 | m5xl-1600-v2 | 4 | 16 GB | 1600 GB |

Table 1: Cluster Details

To simplify the setup we disabled encryption and authentication. Multiple test instances were set up in the same region as the clusters. For testing purposes we used 12 KB blob payloads. To give each cluster node a balanced mixed load, we kept the number of test boxes generating write load equal to the number of instances generating read load. We ran the load against the cluster for 10 minutes to temporarily saturate the queue with read and write requests and cause contention for the Netty threads.

Our test script used cassandra-stress to generate the load; refer to Deep Diving cassandra-stress – Part 3 (Using YAML Profiles) for more information.

In the stressSpec.yaml, we used the following table definition and queries:

Write loads were generated with:

Read loads were generated by changing ops to

Comparison

3.11.4 Queue Saturation Test

The active NTR queue reached max capacity (at 128) and remained in contention under load. Pending NTR tasks remained above 128 throughout. At this point, timeouts were occurring when running 12 load instances to stress the cluster. Each node had 2 load instances performing reads and another 2 performing writes. 4 of the read load instances constantly logged NoHostAvailableExceptions as shown in the example below.

The client results we got from this stress run are shown in Table 2.

| Box | Op rate (op/s) | Latency mean (ms) | Latency median (ms) | Latency 95th percentile (ms) | Latency 99th percentile (ms) | Latency 99.9th percentile (ms) | Latency max (ms) |
|---|---|---|---|---|---|---|---|
| 1 | 700.00 | 2,862.20 | 2,078.30 | 7,977.60 | 11,291.10 | 19,495.10 | 34,426.80 |
| 2 | 651.00 | 3,054.50 | 2,319.50 | 8,048.90 | 11,525.90 | 19,528.70 | 32,950.50 |
| 3 | 620.00 | 3,200.90 | 2,426.40 | 8,409.60 | 12,599.70 | 20,367.50 | 34,158.40 |
| 4 | 607.00 | 3,312.80 | 2,621.40 | 8,304.70 | 11,769.20 | 19,730.00 | 31,977.40 |
| 5 | 568.00 | 3,529.80 | 3,011.50 | 8,216.60 | 11,618.20 | 19,260.20 | 32,698.80 |
| 6 | 553.00 | 3,627.10 | 3,028.30 | 8,631.90 | 12,918.50 | 20,115.90 | 34,292.60 |
| Writes | 3,699.00 | 3,264.55 | 2,580.90 | 8,264.88 | 11,953.77 | 19,749.57 | 34,426.80 |
| 7 | 469.00 | 4,296.50 | 3,839.90 | 9,101.60 | 14,831.10 | 21,290.30 | 35,634.80 |
| 8 | 484.00 | 4,221.50 | 3,808.40 | 8,925.50 | 11,760.80 | 20,468.20 | 34,863.10 |
| 9 | Crashed due to time out | | | | | | |
| 10 | Crashed due to time out | | | | | | |
| 11 | Crashed due to time out | | | | | | |
| 12 | Crashed due to time out | | | | | | |
| Reads | 953.00 | 4,259.00 | 3,824.15 | 9,092.80 | 14,800.40 | 21,289.48 | 35,634.80 |
| Summary | 4,652.00 | 3,761.78 | 3,202.53 | 8,678.84 | 13,377.08 | 20,519.52 | 35,634.80 |

Table 2: 3.11.4 Mixed Load Saturating The NTR Queue

* To calculate the total write operations, we summed the values from the 6 write instances. For max write latency we used the max value from all instances, and for the rest of the latency values we calculated the average of the results. Write results are summarised in the Table 2 “Writes” row. For the read results we did the same, and the results are recorded in the “Reads” row. The last row in the table summarises the results in the “Writes” and “Reads” rows.
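The aggregation described in this note can be sketched in a few lines of Python. The values are the op rate, mean latency, and max latency of write boxes 1 to 6 from Table 2; the function name summarise and the dictionary keys are hypothetical.

```python
from statistics import mean

# Write boxes 1-6 from Table 2: (op rate in op/s, latency mean in ms, latency max in ms)
write_boxes = [
    (700.0, 2862.2, 34426.8),
    (651.0, 3054.5, 32950.5),
    (620.0, 3200.9, 34158.4),
    (607.0, 3312.8, 31977.4),
    (568.0, 3529.8, 32698.8),
    (553.0, 3627.1, 34292.6),
]


def summarise(rows):
    """Sum the op rates, average the mean latencies, take the max of the max latencies."""
    op_rates, latency_means, latency_maxes = zip(*rows)
    return {
        "op_rate": sum(op_rates),
        "latency_mean": mean(latency_means),
        "latency_max": max(latency_maxes),
    }


summary = summarise(write_boxes)
print(summary)
```

The result matches the op rate, mean latency, and max latency reported in the “Writes” row of Table 2.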

The 6 write load instances finished normally, but the read instances struggled. Only 2 of the read load instances were able to send traffic through normally; the other clients received too many timeout errors, causing them to crash. We also observed that the Cassandra timeout metrics, under client-request-metrics, did not capture any of the client timeouts we saw.

Same Load on 3.11.6

Next, we proceeded to test 3.11.6 with the same load. Using the default NTR settings, all test instances were able to finish the stress test successfully.

| Box | Op rate (op/s) | Latency mean (ms) | Latency median (ms) | Latency 95th percentile (ms) | Latency 99th percentile (ms) | Latency 99.9th percentile (ms) | Latency max (ms) |
|---|---|---|---|---|---|---|---|
| 1 | 677.00 | 2,992.60 | 2,715.80 | 7,868.50 | 9,303.00 | 9,957.30 | 10,510.90 |
| 2 | 658.00 | 3,080.20 | 2,770.30 | 7,918.80 | 9,319.70 | 10,116.70 | 10,510.90 |
| 3 | 653.00 | 3,102.80 | 2,785.00 | 7,939.80 | 9,353.30 | 10,116.70 | 10,510.90 |
| 4 | 608.00 | 3,340.90 | 3,028.30 | 8,057.30 | 9,386.90 | 10,192.20 | 10,502.50 |
| 5 | 639.00 | 3,178.30 | 2,868.90 | 7,994.30 | 9,370.10 | 10,116.70 | 10,510.90 |
| 6 | 650.00 | 3,120.50 | 2,799.70 | 7,952.40 | 9,353.30 | 10,116.70 | 10,510.90 |
| Writes | 3,885.00 | 3,135.88 | 2,828.00 | 7,955.18 | 9,347.72 | 10,102.72 | 10,510.90 |
| 7 | 755.00 | 2,677.70 | 2,468.30 | 7,923.00 | 9,378.50 | 9,982.40 | 10,762.60 |
| 8 | 640.00 | 3,160.70 | 2,812.30 | 8,132.80 | 9,529.50 | 10,418.70 | 11,031.00 |
| 9 | 592.00 | 3,427.60 | 3,101.70 | 8,262.80 | 9,579.80 | 10,452.20 | 11,005.90 |
| 10 | 583.00 | 3,483.00 | 3,160.40 | 8,279.60 | 9,579.80 | 10,435.40 | 11,022.60 |
| 11 | 582.00 | 3,503.60 | 3,181.40 | 8,287.90 | 9,588.20 | 10,469.00 | 11,047.80 |
| 12 | 582.00 | 3,506.70 | 3,181.40 | 8,279.60 | 9,588.20 | 10,460.60 | 11,014.20 |
| Reads | 3,734.00 | 3,293.22 | 2,984.25 | 8,194.28 | 9,540.67 | 10,369.72 | 11,047.80 |
| Summary | 7,619.00 | 3,214.55 | 2,906.13 | 8,074.73 | 9,444.19 | 10,236.22 | 11,047.80 |

Table 3: 3.11.6 Mixed Load

Default Native Transport Requests (NTR) Setting Comparison

Taking the summary row from both versions (Table 2 and Table 3), we produced Table 4.

| Version | Op rate (op/s) | Latency mean (ms) | Latency median (ms) | Latency 95th percentile (ms) | Latency 99th percentile (ms) | Latency 99.9th percentile (ms) | Latency max (ms) |
|---|---|---|---|---|---|---|---|
| 3.11.4 | 4652 | 3761.775 | 3202.525 | 8678.839167 | 13377.08183 | 20519.52228 | 35634.8 |
| 3.11.6 | 7619 | 3214.55 | 2906.125 | 8074.733333 | 9444.191667 | 10236.21667 | 11047.8 |

Table 4: Mixed Load 3.11.4 vs 3.11.6


Figure 2: Latency 3.11.4 vs 3.11.6

Figure 2 shows the latencies from Table 4. From the results, 3.11.6 had slightly better average latency than 3.11.4 (a mean of 3,214.55 ms versus 3,761.78 ms). Furthermore, in the worst case where contention is high, 3.11.6 handled the latency of a request better than 3.11.4. This is shown by the difference in Latency Max (11,047.80 ms versus 35,634.80 ms). Not only did 3.11.6 have lower latency, but it was also able to process many more requests (7,619 op/s versus 4,652 op/s) because its queue was not blocked.
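To put the Table 4 differences in relative terms, the short Python sketch below computes the percentage change from 3.11.4 to 3.11.6 for three of the columns. The helper percent_change and the dictionary keys are hypothetical names; the numbers are taken directly from Table 4.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent (negative means a decrease)."""
    return (new - old) / old * 100.0


# Selected columns from Table 4.
v3_11_4 = {"op_rate": 4652.0, "latency_mean": 3761.775, "latency_max": 35634.8}
v3_11_6 = {"op_rate": 7619.0, "latency_mean": 3214.55, "latency_max": 11047.8}

changes = {
    metric: percent_change(v3_11_4[metric], v3_11_6[metric]) for metric in v3_11_4
}
for metric, change in changes.items():
    print(f"{metric}: {change:+.1f}%")
```

On these figures the op rate rises by roughly 64%, mean latency falls by roughly 15%, and max latency falls by roughly 69%.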

3.11.6 Queue Saturation Test

The default native_transport_max_concurrent_requests_in_bytes is set to 1/10 of the heap size. The Cassandra max heap size of our cluster is 8 GB, so the default limit for our queue is 0.8 GB. This turns out to be too large for this cluster size, as this configuration will run into CPU and other bottlenecks before we hit NTR saturation.

So we took the reverse approach to investigate full-queue behaviour: setting the queue size to a lower number. In cassandra.yaml, we added:

This means the global queue is throttled at 1MB. Once Cassandra was restarted and all nodes were online with the new settings, we ran the same mixed load on this cluster. The results are shown in Table 5.
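A back-of-the-envelope calculation shows why the default limit is hard to saturate while the 1MB limit is not. The Python sketch below estimates how many 12 KB payloads fit under each limit. It is only an approximation: it ignores per-request overhead beyond the blob itself, assumes 1 KB = 1024 bytes, and assumes that “1MB” means 1,000,000 bytes. The function name requests_that_fit is hypothetical.

```python
PAYLOAD_BYTES = 12 * 1024  # 12 KB blob payload used in the tests


def requests_that_fit(limit_bytes, payload_bytes=PAYLOAD_BYTES):
    """Approximate number of payloads that fit under a byte limit."""
    return limit_bytes // payload_bytes


default_limit = (8 * 1024 ** 3) // 10  # 1/10 of an 8 GB heap
reduced_limit = 1_000_000              # assumed byte value for the 1MB setting

print(requests_that_fit(default_limit))
print(requests_that_fit(reduced_limit))
```

Under these assumptions the default limit leaves room for tens of thousands of in-flight requests, whereas the 1MB limit leaves room for fewer than a hundred, which is in the same range as the old count limit of 128.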

| 3.11.6 | Op rate (op/s) | Latency mean (ms) | Latency median (ms) | Latency 95th percentile (ms) | Latency 99th percentile (ms) | Latency 99.9th percentile (ms) | Latency max (ms) |
|---|---|---|---|---|---|---|---|
| Write: Default setting | 3,885.00 | 3,135.88 | 2,828.00 | 7,955.18 | 9,347.72 | 10,102.72 | 10,510.90 |
| Write: 1MB setting | 2,105.00 | 5,749.13 | 3,471.82 | 16,924.02 | 26,172.45 | 29,681.68 | 31,105.00 |
| Read: Default setting | 3,734.00 | 3,293.22 | 2,984.25 | 8,194.28 | 9,540.67 | 10,369.72 | 11,047.80 |
| Read: 1MB setting | 5,395.00 | 2,263.13 | 1,864.55 | 5,176.47 | 8,074.73 | 9,693.03 | 15,183.40 |
| Summary: Default setting | 7,619.00 | 3,214.55 | 2,906.13 | 8,074.73 | 9,444.19 | 10,236.22 | 11,047.80 |
| Summary: 1MB setting | 7,500.00 | 4,006.13 | 2,668.18 | 11,050.24 | 17,123.59 | 19,687.36 | 31,105.00 |

Table 5: 3.11.6 native_transport_max_concurrent_requests_in_bytes default and 1MB setting

During the test, we observed a lot of paused connections and discarded requests (see Figure 3). For a full list of Instaclustr exposed metrics see our support documentation.

Figure 3: 3.11.6 Paused Connections and Discarded Requests

After setting native_transport_max_concurrent_requests_in_bytes to a lower number, we started to get paused connections and discarded requests. Write latency increased, resulting in fewer processed write operations, as shown in Table 5. The increased write latency is illustrated in Figure 4.

Figure 4: 3.11.6 Write Latency Under Different Settings

On the other hand, read latency decreased (see Figure 5), resulting in a higher number of read operations being processed.

Figure 5: 3.11.6 Read Latency Under Different Settings

Figure 6: 3.11.6 Operations Rate Under Different Settings

As illustrated in Figure 6, the total number of operations decreased slightly with the 1MB setting, but the difference is very small and the effects on reads and writes almost “cancel each other out”. However, when we look at each type of operation individually, we can see that rather than getting an equal share of the channel, as under the default setting of an “almost unlimited queue”, the lower queue size penalizes writes and favors reads. While our testing identified this outcome, further investigation will be required to determine exactly why this is the case.
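The shift in balance can be quantified as the share of total operations that were writes under each setting. The Python sketch below uses the op rates from Table 5; the function name write_share and the dictionary layout are hypothetical.

```python
# Op rates (op/s) from Table 5.
default_setting = {"write": 3885.0, "read": 3734.0}
one_mb_setting = {"write": 2105.0, "read": 5395.0}


def write_share(op_rates):
    """Fraction of all operations that were writes."""
    return op_rates["write"] / (op_rates["write"] + op_rates["read"])


default_share = write_share(default_setting)
one_mb_share = write_share(one_mb_setting)
print(f"default: {default_share:.1%} writes")
print(f"1MB:     {one_mb_share:.1%} writes")
```

Writes account for about half of all operations with the default setting and for a little over a quarter with the 1MB setting, even though the total op rate is nearly unchanged.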

Conclusion

In conclusion, the new NTR change offers an improvement over the previous NTR queue behaviour. Through our performance testing we found the change to be good for clusters that have a constant load causing the NTR queue to block. Under the new mechanism the queue no longer blocks, but throttles the load based on the amount of memory allocated to requests.

The results from testing indicated that the changed queue behaviour reduced latency and provided a significant lift in the number of operations without clients timing out. Clusters with our latest version of Cassandra can handle more load before hitting hard limits. For more information feel free to comment below or reach out to our Support team to learn more about changes to 3.11.6 or any of our other supported Cassandra versions.