Bug 1469182 - Peering should be blocked if cluster network is unavailable
Status: NEW
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 3.*
Assigned To: Josh Durgin
QA Contact: ceph-qe-bugs
Reported: 2017-07-10 11:44 EDT by Tupper Cole
Modified: 2018-02-22 07:27 EST
Type: Bug
Description Tupper Cole 2017-07-10 11:44:49 EDT
Description of problem:
If the cluster network is unavailable for some reason but the public network is up, an OSD may be marked down by its peers while still reporting itself as up. This can result in peering that cannot complete.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Take the cluster network offline.
2. Wait for the OSD to be reported down by its peers.
3. Watch for peering blocked by OSD.X.

Actual results: Peering is blocked forever.


Expected results: PGs are remapped.


Additional info:
Comment 2 Josh Durgin 2017-07-10 16:43:53 EDT
(In reply to Tupper Cole from comment #0)
> Description of problem:
> If the cluster network is unavailable for some reason, but the public
> network is up an OSD may be marked down, but report itself as up. This can
> result in peering that cannot complete. 
>
> Actual results:Peering is blocked forever
> 
> 
> Expected results:pgs are remapped

The cluster network is how OSDs talk to each other. If it's down, nothing will function - since the OSDs can't communicate with each other, writes cannot work.
So asking for pgs to be 'remapped' in this case does not make sense to me. Are you asking for better detection/clearer warnings when the cluster network is not working?
Comment 3 Tupper Cole 2017-07-11 08:46:50 EDT
There are situations where a single host may not have connectivity to the cluster network, but will still have public network access.

To explain the scenario: 

The customer has a five-rack cluster with a single top-of-rack switch per rack (the failure domain is the rack). The cluster and public networks are VLAN tagged, and both run through the same switch. There is no issue if the switch dies; however, a single port shutting down leaves the public network up and the cluster network unavailable on one node.

This exact scenario caused an outage on an upstream cluster slated for migration to supported RHCS.
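
To make the failure mode concrete, here is a toy reachability model of the scenario described in this comment. It is only an illustrative sketch, not Ceph code: the `Host` class, the `can_reach` rule, and the node and monitor names are all hypothetical, and it assumes each network depends on its own link on the node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    name: str
    # Hypothetical per-network link state; True means the link is usable.
    links: Dict[str, bool] = field(
        default_factory=lambda: {"public": True, "cluster": True}
    )


def can_reach(a: Host, b: Host, network: str) -> bool:
    """Two hosts can talk on a network only if both have a working link on it."""
    return a.links[network] and b.links[network]


def peers_reporting_down(target: Host, peers: List[Host]) -> List[str]:
    """Names of peers that cannot reach the target over the cluster network."""
    return [p.name for p in peers if not can_reach(p, target, "cluster")]


def reports_itself_up(target: Host, monitor: Host) -> bool:
    """The target can still contact the monitor over the public network."""
    return can_reach(target, monitor, "public")


hosts = [Host(f"node{i}") for i in range(1, 6)]
monitor = Host("mon")

# A single port failure takes out only the cluster network on one node.
faulty = hosts[0]
faulty.links["cluster"] = False

down_reports = peers_reporting_down(faulty, hosts[1:])
self_up = reports_itself_up(faulty, monitor)

print("peers reporting", faulty.name, "down:", down_reports)
print(faulty.name, "still reports itself up:", self_up)
```

In this model every peer reports the node down, while the node can still tell the monitor it is up, which is the contradiction behind the blocked peering.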
Comment 4 Josh Durgin 2017-07-18 20:37:17 EDT
Sounds like our existing cluster/client network heartbeats aren't working right then - needs investigation.
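
As a rough illustration of the behavior this bug asks for, the sketch below classifies an OSD from the result of heartbeat checks on both networks and only lets it claim to be up when both succeed. This is an assumption about one possible policy, not the existing Ceph heartbeat logic; `Verdict`, `heartbeat_verdict`, and `may_claim_up` are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    CLUSTER_NET_DOWN = "cluster network unreachable"
    PUBLIC_NET_DOWN = "public network unreachable"
    ISOLATED = "isolated"


def heartbeat_verdict(public_ok: bool, cluster_ok: bool) -> Verdict:
    """Combine the results of heartbeat checks on the two networks."""
    if public_ok and cluster_ok:
        return Verdict.HEALTHY
    if public_ok:
        return Verdict.CLUSTER_NET_DOWN
    if cluster_ok:
        return Verdict.PUBLIC_NET_DOWN
    return Verdict.ISOLATED


def may_claim_up(verdict: Verdict) -> bool:
    """Hypothetical policy: an OSD asserts 'up' only if both networks work."""
    return verdict is Verdict.HEALTHY


# The case from this bug: public network fine, cluster network lost.
verdict = heartbeat_verdict(public_ok=True, cluster_ok=False)
print(verdict.value, "| may claim up:", may_claim_up(verdict))
```

Under this policy the node from the scenario above would stay down instead of reasserting itself as up, so the affected PGs would not keep waiting on it.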
