Bug 1469182
Summary: | Peering should be blocked if cluster network is unavailable | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Tupper Cole <tcole> |
Component: | RADOS | Assignee: | Greg Farnum <gfarnum> |
Status: | CLOSED NOTABUG | QA Contact: | Manohar Murthy <mmurthy> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 2.3 | CC: | ceph-eng-bugs, dzafman, jdurgin, kchai, tcole |
Target Milestone: | rc | ||
Target Release: | 4.* | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-04-29 21:51:01 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Tupper Cole
2017-07-10 15:44:49 UTC
(In reply to Tupper Cole from comment #0)
> Description of problem:
> If the cluster network is unavailable for some reason, but the public
> network is up an OSD may be marked down, but report itself as up. This can
> result in peering that cannot complete.
>
> Actual results: Peering is blocked forever
>
> Expected results: pgs are remapped

The cluster network is how OSDs talk to each other. If it's down, nothing will function: since the OSDs can't communicate with each other, writes cannot work. So asking for pgs to be 'remapped' in this case does not make sense to me. Are you asking for better detection/clearer warnings when the cluster network is not working?

There are situations where a single host may not have connectivity to the cluster network, but will still have public network access. To explain the scenario: the customer has a five-rack cluster with a single top-of-rack switch per rack (the failure domain is the rack). The cluster and public networks are VLAN tagged, both through the same switch. There is no issue if the switch dies; however, a single port shutting down leaves the public network up and the cluster network unavailable on one node. This exact scenario caused an outage on an upstream cluster slated for migration to supported RHCS.

Sounds like our existing cluster/client network heartbeats aren't working right then - needs investigation.

Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Closing since this doesn't seem to be an issue. Please reopen if it occurs again.
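For anyone trying to reproduce or monitor the scenario described in this thread (public network up, cluster network down on one node), here is a minimal sketch of an external check. It is not how Ceph's own heartbeats are implemented; it only probes a peer on both of its addresses and classifies the result. The function names, the sample addresses (taken from documentation-only ranges), and the port are all assumptions made for illustration.

```python
import socket
from enum import Enum


class LinkState(Enum):
    HEALTHY = "public and cluster networks reachable"
    CLUSTER_ONLY_DOWN = "cluster network down, public network up"
    PUBLIC_ONLY_DOWN = "public network down, cluster network up"
    ISOLATED = "both networks down"


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(public_ok, cluster_ok):
    """Map the two reachability results onto a LinkState."""
    if public_ok and cluster_ok:
        return LinkState.HEALTHY
    if public_ok:
        # The case from this bug: the node still looks alive on the
        # public side, but its peers cannot reach it on the cluster side.
        return LinkState.CLUSTER_ONLY_DOWN
    if cluster_ok:
        return LinkState.PUBLIC_ONLY_DOWN
    return LinkState.ISOLATED


def check_peer(public_addr, cluster_addr, port, probe=tcp_reachable):
    """Probe one peer on both of its addresses (hypothetical helper).

    `probe` is injectable so the logic can be exercised without a network.
    """
    return classify(probe(public_addr, port), probe(cluster_addr, port))


if __name__ == "__main__":
    # Hypothetical addresses and port; replace with a real peer's
    # public and cluster addresses and a port it listens on.
    state = check_peer("192.0.2.10", "198.51.100.10", 6800)
    print(state.value)
```

Run from a neighbouring host, a result of `CLUSTER_ONLY_DOWN` would correspond to the single-port failure described above, which is the state worth alerting on.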