Description of problem:
Right now, RHEL HA (both the RHCS-based and Pacemaker-based stacks) requires networks with LAN-like latency, which we have defined as <= 2ms.
The primary latency constraint is in membership, which is handled by Corosync. In addition, plocks via GFS are latency sensitive.
For the context of this bug, we are not concerned about GFS over high latency links, just the core cluster infrastructure.
What we need to do is simulate high-latency links and test the HA stacks to determine the highest latency we can support without significant code or configuration (timeout) changes.
Then we can begin official QE testing at this higher latency and support links with up to this delay.
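One possible way to simulate such a link on a Linux test node is the netem queueing discipline from iproute2. The sketch below only builds the tc command lines for a sweep of delays; it does not run them. The interface name "eth1" and the delay values are hypothetical placeholders, not part of this request. Note that netem delays egress traffic only, so applying the same delay on both ends of a link adds roughly twice that value to the round-trip time.

```python
def netem_commands(interface, delay_ms):
    """Return (add, remove) tc command lines for a fixed egress delay.

    The commands are returned as argv lists and are NOT executed here;
    running them requires root on the test node.
    """
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    add = ["tc", "qdisc", "add", "dev", interface, "root",
           "netem", "delay", f"{delay_ms}ms"]
    remove = ["tc", "qdisc", "del", "dev", interface, "root"]
    return add, remove


if __name__ == "__main__":
    # Hypothetical sweep: interface name and delay values are assumptions.
    for delay in (2, 5, 10, 20, 50):
        add, remove = netem_commands("eth1", delay)
        print(" ".join(add))
        print(" ".join(remove))
```

Each delay step would be applied, the cluster exercised (membership changes, fencing, failover), and the qdisc removed before moving to the next step.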
For the time being, this bug should be considered TestOnly, but it first needs testing from a development perspective before QE can begin running more comprehensive tests.
The initial use case is to run stretch clusters with 2 sites and between 1 and 8 nodes at each site. The membership list should be configured so that the Totem token does not bounce back and forth across the high-latency link, but crosses it as few times as possible. For example, with nodes 1-8 on SiteA and nodes 9-16 on SiteB, the token only crosses the high-latency link between nodes 8 and 9 and between nodes 16 and 1.
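The effect of node ordering can be illustrated with a small sketch that counts how many hops of one token rotation cross between sites. The ring order is taken as a given input here; how it is derived from the cluster configuration is not modelled. The per-hop latencies (10 ms inter-site, 1 ms local) are made-up illustrative numbers, and the estimate covers propagation delay only, not processing time or retransmits.

```python
def wan_crossings(ring, site_of):
    """Count hops in one full token rotation whose endpoints are at different sites."""
    n = len(ring)
    return sum(
        1 for i in range(n)
        if site_of[ring[i]] != site_of[ring[(i + 1) % n]]
    )


def min_rotation_ms(ring, site_of, wan_ms, lan_ms):
    """Lower bound on one token rotation from per-hop propagation delay only."""
    crossings = wan_crossings(ring, site_of)
    return crossings * wan_ms + (len(ring) - crossings) * lan_ms


# Nodes 1-8 on SiteA, nodes 9-16 on SiteB, as in the example above.
site_of = {node: ("A" if node <= 8 else "B") for node in range(1, 17)}

grouped = list(range(1, 17))  # 1, 2, ..., 16
interleaved = [n for pair in zip(range(1, 9), range(9, 17)) for n in pair]  # 1, 9, 2, 10, ...

if __name__ == "__main__":
    print(wan_crossings(grouped, site_of))      # 2
    print(wan_crossings(interleaved, site_of))  # 16
    # Hypothetical latencies: 10 ms per inter-site hop, 1 ms per local hop.
    print(min_rotation_ms(grouped, site_of, 10, 1))      # 34
    print(min_rotation_ms(interleaved, site_of, 10, 1))  # 160
```

With the grouped ordering the token pays the inter-site delay twice per rotation regardless of cluster size, whereas an interleaved ordering pays it on every hop.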
If specific code changes are required to support this, the engineers testing this feature should file dependent bugs against their components (for example, a bug against Corosync).
Corosync with properly set timeouts (especially the token timeout) should be able to handle non-LAN conditions quite well. This really is TestOnly.
Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.