Originally developed and open sourced by Uber, Cadence® is a workflow engine that greatly simplifies the development of complex, long-running automated business processes at scale.
The multi-region functionality of Cadence replicates workflow state across 2 Cadence clusters (a primary and a secondary/backup) and provides failover in the event of a full region failure. The replication of workflow state between the 2 clusters, as well as their health, can be monitored using the cross-cluster canary test suite (https://github.com/uber/cadence/tree/master/canary#cross-cluster).
This blog explores how the dynamic config property frontend.failoverCoolDown can help the cross-cluster canary operate correctly.
The cross-cluster canary service exercises the cross-cluster feature, which allows child workflows to be launched in the secondary/backup cluster. At each iteration, the canary attempts, with some small probability, to fail the target domain over to the secondary/backup cluster and back again. This ensures that both the primary and secondary/backup clusters are operating correctly and can communicate with each other.
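Under the hood, such a failover is just a domain update that changes the domain's active cluster (the CLI equivalent is cadence --domain <your-domain> domain update --active_cluster <target-cluster>). The Go sketch below shows roughly what this looks like with the Cadence Go client; note that the failoverDomain function name is ours, and the YARPC setup needed to construct the DomainClient is omitted for brevity:

    package example

    import (
        "context"

        "go.uber.org/cadence/.gen/go/shared"
        "go.uber.org/cadence/client"
    )

    // failoverDomain updates the domain so that targetCluster becomes its
    // active cluster, which is essentially what the cross-cluster canary does
    // on each failover attempt. Constructing the DomainClient (YARPC
    // dispatcher, service client) is omitted here.
    func failoverDomain(ctx context.Context, dc client.DomainClient, domain, targetCluster string) error {
        return dc.Update(ctx, &shared.UpdateDomainRequest{
            Name: &domain,
            ReplicationConfiguration: &shared.DomainReplicationConfiguration{
                ActiveClusterName: &targetCluster,
            },
        })
    }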
By default, the canary test suite executes 22 test workflows (https://github.com/uber/cadence/blob/master/canary/const.go#L67).
The full suite usually takes a few minutes to complete, so each canary iteration lasts a few minutes. If you set up the canary to execute the full test suite, the domain failover that occurs after each iteration will go through without any issue.
However, in certain circumstances, and depending on your business needs, you may want to run a trimmed-down version of the canary test suite (https://github.com/uber/cadence/tree/master/canary#configurations). Each iteration of the canary service will then complete much faster.
For example: in a multi-region scenario, if you were to run just 3 test workflows (instead of the default 22), the entire canary test suite would take approximately 20 seconds to complete. As a result, the cross-cluster canary will attempt to fail over the target domain every 20 seconds.
Hence, under these modified conditions, you will often see the following error pop up during domain failover:
    Domain update too frequent.
This happens because the dynamic config property frontend.failoverCoolDown defaults to 1 minute: at least 1 minute (i.e., 60 seconds) must elapse between 2 consecutive failovers of the same domain.
To elaborate, before failing over a domain, the Cadence server always checks when the last failover occurred for that domain. If it occurred too recently (in this case, less than 1 minute ago), the current domain failover operation fails with the error above (https://github.com/uber/cadence/blob/master/common/domain/handler.go#L525). This causes the trimmed-down canary test suite to fail at certain iterations, and if you have alerts set up on canary results to monitor your clusters' health, those alerts will fire as false positives.
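As a simplified, hand-written sketch (not the actual server code) of the guard in handler.go, the check boils down to comparing the time since the domain's last update against the configured cooldown:

    package example

    import (
        "errors"
        "time"
    )

    // errDomainUpdateTooFrequent mirrors the error message the Cadence
    // frontend returns when a failover lands inside the cooldown window.
    var errDomainUpdateTooFrequent = errors.New("Domain update too frequent.")

    // checkFailoverCoolDown rejects a domain update if the previous update to
    // the same domain happened less than failoverCoolDown ago. With the
    // default cooldown of 1 minute and a canary iteration of ~20 seconds,
    // every failover attempt after the first would be rejected.
    func checkFailoverCoolDown(lastUpdated time.Time, failoverCoolDown time.Duration) error {
        if time.Since(lastUpdated) < failoverCoolDown {
            return errDomainUpdateTooFrequent
        }
        return nil
    }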
The solution is simple. Just add the following property to your dynamic config file to override the default value of 1 minute, setting it to something shorter than the canary iteration time. For the modified scenario above, you can set:
    frontend.failoverCoolDown:
    - value: 10s
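A value of 10 seconds sits comfortably below the roughly 20-second iteration time of the trimmed-down suite, while still guarding against pathological back-to-back failovers. If you use the file-based dynamic config client, the change should be picked up at the next poll of the config file, with no server restart required.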
If you like the sound of this but don't want to go through the work of setting it up yourself, Instaclustr's Managed Cadence already supports this through its Multi-Region Cadence offering.
Ready to try out Cadence for yourself? Sign up today with a free trial and get started!