How do I tune Heartbeat on a heavily loaded system to avoid split-brain?
“No local heartbeat” or “Cluster node returning after partition” under heavy load is typically caused by too small a deadtime interval, or by an older version of Heartbeat. Make sure you’re running at least version 1.2.0.

Here is a suggestion for how to tune deadtime:
• Set deadtime to 60 seconds or higher.
• Set warntime to 1/4 to 1/2 of whatever you *want* your deadtime to be.
• Run your system under heavy load for a few weeks.
• Look at your logs for the longest time either system went without hearing a heartbeat. If you never saw a “late heartbeat” message, then your chosen deadtime is fine; use it.
• Otherwise, set your deadtime to 1.5-2 times that amount.
• Set warntime to keepalive*2.
• Continue to monitor logs for warnings about long heartbeat times.

If you don’t do this, you may get “Cluster node … returning after partition”, which will cause Heartbeat to restart on all machines in the cluster. This will almost certainly annoy you at a minimum.

Adding memory to the machine generally helps.
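
The arithmetic in the tuning steps can be sketched in a few lines of Python. This is only an illustration, not part of Heartbeat: the function names suggest_timings and to_ha_cf are hypothetical, and the sketch assumes you have already read the longest late-heartbeat interval out of your logs by hand and that all values are in seconds. It covers the case where late heartbeats were seen; if none were, your chosen deadtime stands.

```python
import math


def suggest_timings(longest_gap_s, keepalive_s, factor=2.0):
    """Apply the rule of thumb: deadtime = 1.5-2 x longest gap, warntime = keepalive * 2.

    longest_gap_s: longest interval (seconds) either node went without
                   hearing a heartbeat, taken from the logs by hand.
    keepalive_s:   the keepalive interval currently configured (seconds).
    factor:        safety multiplier; the suggested range is 1.5 to 2.
    """
    if longest_gap_s <= 0 or keepalive_s <= 0:
        raise ValueError("intervals must be positive")
    if not 1.5 <= factor <= 2.0:
        raise ValueError("factor should stay within the suggested 1.5-2 range")
    return {
        "keepalive": keepalive_s,
        "warntime": keepalive_s * 2,
        # Round up so the result is never shorter than the computed value.
        "deadtime": math.ceil(longest_gap_s * factor),
    }


def to_ha_cf(timings):
    """Render the values as ha.cf-style directive lines (hypothetical helper)."""
    return "\n".join(
        f"{name} {timings[name]}" for name in ("keepalive", "warntime", "deadtime")
    )


if __name__ == "__main__":
    # Example: longest observed gap of 12 s with a 2 s keepalive.
    print(to_ha_cf(suggest_timings(longest_gap_s=12, keepalive_s=2, factor=1.5)))
```

Treat the output as a starting point to paste into ha.cf and then verify by continuing to watch the logs, as the last step recommends.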