Red Hat Bugzilla – Bug 1469182
Peering should be blocked if cluster network is unavailable
Last modified: 2018-02-22 07:27:03 EST
Description of problem:
If the cluster network is unavailable for some reason but the public network is up, an OSD may be marked down by its peers while still reporting itself as up. This can result in peering that cannot complete.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Take cluster network offline
2. Wait for OSD to be reported down by peers
3. Watch for peering blocked by OSD.X
Actual results: Peering is blocked forever
Expected results: pgs are remapped
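For step 3, one way to watch for the blocking OSD is to inspect the JSON printed by `ceph pg <pgid> query`. The sketch below is only an illustration: it assumes the output contains a `recovery_state` list whose entries may carry a `peering_blocked_by` array of objects with an `osd` field, and the helper name `blocking_osds` and the sample document (including OSD id 7) are made up for the example.

```python
import json


def blocking_osds(pg_query):
    """Collect OSD ids named under 'peering_blocked_by' in a pg query dict.

    Assumes the layout described above; entries without the key are skipped.
    """
    blocked = set()
    for state in pg_query.get("recovery_state", []):
        for entry in state.get("peering_blocked_by", []):
            blocked.add(entry["osd"])
    return sorted(blocked)


# Hypothetical sample, trimmed to the fields the helper reads.
sample = json.loads("""
{
  "state": "down+peering",
  "recovery_state": [
    {
      "name": "Started/Primary/Peering",
      "peering_blocked_by": [
        {"osd": 7, "comment": "illustrative entry"}
      ]
    },
    {"name": "Started"}
  ]
}
""")

print(blocking_osds(sample))  # prints [7]
```

In practice the dict would come from `json.loads` applied to the real command output rather than from the inline sample.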
(In reply to Tupper Cole from comment #0)
> Description of problem:
> If the cluster network is unavailable for some reason, but the public
> network is up an OSD may be marked down, but report itself as up. This can
> result in peering that cannot complete.
> Actual results:Peering is blocked forever
> Expected results:pgs are remapped
The cluster network is how OSDs talk to each other. If it's down, nothing will function - since the OSDs can't communicate with each other, writes cannot work.
So asking for pgs to be 'remapped' in this case does not make sense to me. Are you asking for better detection/clearer warnings when the cluster network is not working?
There are situations where a single host may lose connectivity to the cluster network but still have public network access.
To explain the scenario:
The customer has a five-rack cluster with a single top-of-rack switch per rack (the failure domain is the rack). The cluster and public networks are VLAN tagged and both run through the same switch. There is no issue if the switch dies; however, a single port shutting down leaves the public network up and the cluster network unavailable on one node.
This exact scenario caused an outage on an upstream cluster slated for migration to supported RHCS.
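The difference between the two failure cases can be written down as a toy model. This is not how Ceph decides OSD state; it only assumes, as a simplification, that peers judge a node by cluster-network reachability while the node's own "up" report travels over the public network. All names (`NodeLinks`, `peer_view`, `self_report`, `is_conflicted`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLinks:
    """Link state of one node on each network (toy model)."""
    public_up: bool
    cluster_up: bool


def peer_view(links):
    # Simplifying assumption: peers only see the node over the cluster network.
    return "up" if links.cluster_up else "down"


def self_report(links):
    # Simplifying assumption: the node can only claim "up" over the public network.
    return "up" if links.public_up else "down"


def is_conflicted(links):
    """True when peers and the node itself disagree about its state."""
    return peer_view(links) != self_report(links)


switch_dead = NodeLinks(public_up=False, cluster_up=False)
cluster_port_down = NodeLinks(public_up=True, cluster_up=False)

print(is_conflicted(switch_dead))        # prints False
print(is_conflicted(cluster_port_down))  # prints True
```

Under these assumptions a dead switch yields a consistent "down" from both sides, while losing only the cluster network yields the disagreement described in comment #0.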
Sounds like our existing cluster/client network heartbeats aren't working right then - needs investigation.