Description of problem: This is ultimately caused by glusterd's single-threaded processing model. From Atin Mukherjee:

"Sahina - This is the root cause. Since glusterd works in single thread e-poll, it can only process packets under a big lock one at a time. Now in this case if both the shutdown and volume status race with each other, glusterd say picks up v status first, gets into a big lock and send a rpc request assuming the peer is still not disconnected and wait for 10 minutes for rpc frame to bail out. On the other hand the rpc_clnt_disconnect packet which was queued up wasn't processed which resulted peer status to still show the status of the peer to be connected even though the node is down and after 10 minutes once the big lock was released due to the rpc frame bail out this packet was processed and the peer was marked disconnected."

The fencing pre-check therefore needs to ping the host to confirm that it is online, and then add a small delay, before triggering the volume status call.
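A rough sketch of what such a guard could look like, only to illustrate the idea. This is not the vdsm-gluster code: the names guarded_volume_status, ping_host, get_volume_status and delay_s are hypothetical, the 2-second delay is an arbitrary placeholder, and skipping the query when the host does not answer is an assumption about the desired behaviour. The ping flags assume Linux iputils.

```python
import subprocess
import time


def ping_host(host, timeout_s=1):
    """Return True if a single ICMP echo to `host` is answered.

    Assumes the Linux iputils `ping` flags (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def guarded_volume_status(host, get_volume_status, is_online=ping_host,
                          delay_s=2.0, sleep=time.sleep):
    """Hypothetical pre-check wrapper around a volume status query.

    `get_volume_status` is a caller-supplied callable that performs the
    real query. Returns None without querying when the host does not
    answer the ping (an assumed policy, not taken from the bug).
    """
    if not is_online(host):
        # Host looks down: do not send a request that may sit in
        # glusterd until the rpc frame bails out.
        return None
    # Small delay before triggering the volume status call.
    sleep(delay_s)
    return get_volume_status()
```

The reachability check and the sleep function are passed in as parameters so that the ordering (ping, then delay, then query) can be exercised without a real host or a real wait.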
This patch does not address the underlying race in Gluster, where a node reboot racing with a volume status request causes the node to still be shown as connected, so I am closing this bug as can't fix. Bug 1698519 is tracking the original issue, which cannot be fixed in vdsm-gluster.