Bug 1469182 - Peering should be blocked if cluster network is unavailable
Summary: Peering should be blocked if cluster network is unavailable
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 2.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 4.*
Assignee: Greg Farnum
QA Contact: Manohar Murthy
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-07-10 15:44 UTC by Tupper Cole
Modified: 2020-04-29 21:51 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-29 21:51:01 UTC
Embargoed:



Description Tupper Cole 2017-07-10 15:44:49 UTC
Description of problem:
If the cluster network is unavailable for some reason but the public network is up, an OSD may be marked down by its peers yet still report itself as up. This can result in peering that cannot complete.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Take the cluster network offline
2. Wait for the OSD to be reported down by its peers
3. Watch for peering blocked by OSD.X
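
The steps above could be scripted roughly as follows. This is only a sketch: the interface name `eth1.200` is a hypothetical placeholder for whatever carries the cluster network on the node under test, and the helper names `repro_commands` and `run` are made up for illustration. Because taking an interface down is disruptive, the script only prints the commands unless `dry_run=False` is passed.

```python
import subprocess


def repro_commands(cluster_iface):
    """Return the command lines for the reproduction steps.

    cluster_iface is a hypothetical name for the interface (or VLAN
    sub-interface) that carries the cluster network on the test node.
    """
    return [
        # Step 1: take the cluster network offline on this node only.
        ["ip", "link", "set", "dev", cluster_iface, "down"],
        # Step 2: check whether the node's OSDs have been marked down.
        ["ceph", "osd", "tree"],
        # Step 3: look for PGs stuck inactive and what is blocking them.
        ["ceph", "pg", "dump_stuck", "inactive"],
        ["ceph", "health", "detail"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(repro_commands("eth1.200"))  # assumed interface name, dry run
```

Steps 2 and 3 may need to be repeated over a few minutes, since a down report and any stuck peering take time to show up.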

Actual results: Peering is blocked forever


Expected results: PGs are remapped


Additional info:

Comment 2 Josh Durgin 2017-07-10 20:43:53 UTC
(In reply to Tupper Cole from comment #0)
> Description of problem:
> If the cluster network is unavailable for some reason, but the public
> network is up an OSD may be marked down, but report itself as up. This can
> result in peering that cannot complete. 
>
> Actual results:Peering is blocked forever
> 
> 
> Expected results:pgs are remapped

The cluster network is how OSDs talk to each other. If it's down, nothing will function - since the OSDs can't communicate with each other, writes cannot work.
So asking for pgs to be 'remapped' in this case does not make sense to me. Are you asking for better detection/clearer warnings when the cluster network is not working?

Comment 3 Tupper Cole 2017-07-11 12:46:50 UTC
There are situations where a single host may lose connectivity to the cluster network but still have public network access.

To explain the scenario: 

The customer has a five-rack cluster with a single top-of-rack switch per rack (the failure domain is the rack). The cluster and public networks are VLAN-tagged, and both run through the same switch. There is no issue if the switch dies; however, a single port shutting down leaves the public network up and the cluster network unavailable on one node.
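
The difference between the two failure modes can be sketched as a small model. It assumes, for illustration only, that each node reaches the public and cluster networks through separate ports on the same top-of-rack switch; `NodeLinks`, `reachable`, and `classify` are hypothetical names, not anything from Ceph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLinks:
    switch_up: bool        # the rack's single top-of-rack switch
    public_port_up: bool   # assumed port carrying the public network
    cluster_port_up: bool  # assumed port carrying the cluster network


def reachable(links):
    """Return (public_ok, cluster_ok) for one node."""
    public_ok = links.switch_up and links.public_port_up
    cluster_ok = links.switch_up and links.cluster_port_up
    return public_ok, cluster_ok


def classify(links):
    """Name the connectivity state of a node."""
    public_ok, cluster_ok = reachable(links)
    if public_ok and cluster_ok:
        return "healthy"
    if not public_ok and not cluster_ok:
        return "isolated"      # whole node drops out cleanly
    if public_ok:
        return "public only"   # the split state described above
    return "cluster only"


# Switch failure: the node loses both networks at once.
print(classify(NodeLinks(False, True, True)))
# Single port failure: public stays up, cluster is gone.
print(classify(NodeLinks(True, True, False)))
```

In this model a dead switch yields "isolated", which the rack failure domain is meant to absorb, while a single dead port yields "public only", the state that caused the outage.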

This exact scenario caused an outage on an upstream cluster slated for migration to supported RHCS.

Comment 4 Josh Durgin 2017-07-19 00:37:17 UTC
Sounds like our existing cluster/client network heartbeats aren't working right then - needs investigation.
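
For reference, a simplified picture of the behaviour one would expect from heartbeats on both networks: a peer treats an OSD as failed when replies on either network go stale. This is a sketch of the intended logic, not Ceph's actual code. The function name is hypothetical, and the 20-second grace value is an assumption chosen to resemble the usual `osd_heartbeat_grace` setting.

```python
HEARTBEAT_GRACE = 20.0  # seconds; assumed value for illustration


def peer_should_report_failure(now, last_rx_public, last_rx_cluster,
                               grace=HEARTBEAT_GRACE):
    """Decide whether a peer should report an OSD as failed.

    now, last_rx_public and last_rx_cluster are timestamps in seconds.
    The OSD counts as failed if the most recent heartbeat reply on
    EITHER network is older than the grace period.
    """
    cutoff = now - grace
    return last_rx_public < cutoff or last_rx_cluster < cutoff


# Public replies are fresh, cluster replies stopped 50 s ago: report it.
print(peer_should_report_failure(100.0, 99.0, 50.0))
# Both networks replied within the grace period: no report.
print(peer_should_report_failure(100.0, 99.0, 95.0))
```

If a check like this only looked at the public-network timestamp, the scenario in comment 3 would go unnoticed by peers, which is the kind of thing the investigation would need to rule out.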

Comment 6 Giridhar Ramaraju 2019-08-05 13:09:52 UTC
Updating the QA Contact to Hemant. Hemant will reroute these bugs to the appropriate QE Associate.

Regards,
Giri

Comment 9 Josh Durgin 2020-04-29 21:51:01 UTC
Closing since this doesn't seem to be an issue. Please reopen if it occurs again.

