The oslo.messaging package now supports active call monitoring, which allows much longer timeout values for RPC calls without sacrificing the time it takes to detect a down service. Nova should use this feature, especially for activities known to be long-running, such as live migration.
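To illustrate the idea, here is a toy model of a caller that combines a long overall timeout with a short monitor timeout that is reset by heartbeats. It is only a sketch using simulated timestamps, not oslo.messaging code; the function name, the event format, and the parameter names are assumptions made for this example.

```python
class CallTimeout(Exception):
    """Raised when the simulated RPC call gives up waiting."""


def wait_for_reply(events, timeout, call_monitor_timeout=None):
    """Simulate a caller waiting for an RPC reply.

    events: list of (time_in_seconds, kind) tuples sorted by time, where
        kind is "heartbeat" or "reply" (hypothetical format).
    timeout: hard upper bound on the whole call.
    call_monitor_timeout: longest silence tolerated between heartbeats;
        None disables monitoring, so only `timeout` applies.
    Returns the time at which the reply arrived.
    """
    hard_deadline = timeout
    if call_monitor_timeout is None:
        soft_deadline = hard_deadline
    else:
        soft_deadline = min(call_monitor_timeout, hard_deadline)

    for when, kind in events:
        if when > soft_deadline:
            # Nothing arrived in time; stop waiting.
            break
        if kind == "reply":
            return when
        if kind == "heartbeat" and call_monitor_timeout is not None:
            # The server is alive: push the soft deadline out, but never
            # past the hard deadline.
            soft_deadline = min(when + call_monitor_timeout, hard_deadline)
    raise CallTimeout("gave up at t=%ss" % soft_deadline)


# A slow but healthy server: heartbeats every 15s, reply after 300s.
slow_but_alive = [(t, "heartbeat") for t in range(15, 300, 15)]
slow_but_alive.append((300, "reply"))

# A server that dies after two heartbeats and never replies.
dies_early = [(15, "heartbeat"), (30, "heartbeat")]
```

With `timeout=1800` and `call_monitor_timeout=60`, the slow call completes at 300s, while the dead server is detected 60s after its last heartbeat instead of after the full 1800s. With a plain 60s timeout and no monitoring, the slow call fails even though the server is healthy.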
*** Bug 1584268 has been marked as a duplicate of this bug. ***
This is merged upstream in Rocky. I'm not sure how best to test this in an automated way, because it requires an environment where operations run abnormally slowly, which is hard to synthesize. We have tested this with a one-off patch that introduces an artificial delay during setup; that delay would have caused a failure before this feature was introduced. Perhaps a one-time manual verification is the best we can do at the moment. The upstream hack to make the pre-migration setup call take longer is here: https://review.openstack.org/#/c/574482/2/nova/compute/manager.py A longer delay would likely be more useful for manual testing. Also, overriding the oslo.messaging log level during the run will generate "received heartbeat, reset timeout" messages, which can be used to verify that the heartbeat functionality is working.
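For the manual verification, a small helper along these lines could count the heartbeat messages in a captured log. This is a sketch only: the sample log lines are invented for illustration, and the only thing taken from the comment above is the message text itself. The match is case-insensitive in case the real message is capitalized differently.

```python
HEARTBEAT_MARKER = "received heartbeat, reset timeout"


def count_heartbeats(log_lines):
    """Return how many log lines contain the heartbeat message."""
    return sum(1 for line in log_lines if HEARTBEAT_MARKER in line.lower())


# Hypothetical log excerpt; real lines will have a different prefix.
sample_log = [
    "DEBUG example.logger [-] received heartbeat, reset timeout",
    "INFO example.logger [-] pre-migration setup still running",
    "DEBUG example.logger [-] Received heartbeat, reset timeout",
]

heartbeat_count = count_heartbeats(sample_log)
```

A non-zero count during a slow pre-migration call suggests that heartbeats are arriving and resetting the timeout; a count of zero suggests that either the log level was not overridden or monitoring is not active.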
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045