Bug 1469182

Summary:	Peering should be blocked if cluster network is unavailable
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Tupper Cole <tcole>
Component:	RADOS	Assignee:	Greg Farnum <gfarnum>
Status:	CLOSED NOTABUG	QA Contact:	Manohar Murthy <mmurthy>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	2.3	CC:	ceph-eng-bugs, dzafman, jdurgin, kchai, tcole
Target Milestone:	rc
Target Release:	4.*
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-04-29 21:51:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Tupper Cole 2017-07-10 15:44:49 UTC

Description of problem:
If the cluster network is unavailable for some reason but the public network is up, an OSD may be marked down by its peers yet still report itself as up. This can result in peering that cannot complete.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Take cluster network offline
2. Wait for OSD to be reported down by peers
3. Watch for peering blocked by OSD.X

Actual results: Peering is blocked forever


Expected results: PGs are remapped


Additional info:

Comment 2 Josh Durgin 2017-07-10 20:43:53 UTC

(In reply to Tupper Cole from comment #0)
> Description of problem:
> If the cluster network is unavailable for some reason, but the public
> network is up an OSD may be marked down, but report itself as up. This can
> result in peering that cannot complete. 
>
> Actual results:Peering is blocked forever
> 
> 
> Expected results:pgs are remapped

The cluster network is how OSDs talk to each other. If it's down, nothing will function - since the OSDs can't communicate with each other, writes cannot work.
So asking for pgs to be 'remapped' in this case does not make sense to me. Are you asking for better detection/clearer warnings when the cluster network is not working?

Comment 3 Tupper Cole 2017-07-11 12:46:50 UTC

There are situations where a single host may not have connectivity to the cluster network but will still have public network access.

To explain the scenario: 

The customer has a five-rack cluster with a single top-of-rack switch per rack (the failure domain is the rack). The cluster and public networks are VLAN-tagged and both run through the same switch. There is no issue if the switch dies; however, a single port shutting down leaves the public network up and the cluster network unavailable on one node.

This exact scenario caused an outage on an upstream cluster slated for migration to supported RHCS.
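
To make the failure mode concrete, here is a minimal sketch (not Ceph code) of how the state of a host's two links maps to the symptom described in this bug. The `HostLinks` type, the `expected_symptom` function, and the symptom labels are hypothetical names invented for illustration. The only facts taken from this report are that OSDs talk to each other over the cluster network and that the affected OSD still had public network access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostLinks:
    """Link state of one OSD host (hypothetical model, not a Ceph type)."""
    public_up: bool
    cluster_up: bool


def expected_symptom(links: HostLinks) -> str:
    """Map link state to the symptom an operator would likely see."""
    if links.public_up and links.cluster_up:
        return "healthy"
    if not links.public_up and not links.cluster_up:
        # Whole top-of-rack switch down: the host is simply gone.
        return "isolated"
    if links.public_up:
        # Single-port failure from this report: peers cannot reach the OSD,
        # but the OSD can still use the public network and claim to be up.
        return "marked-down-but-claims-up"
    # Cluster network fine, public network lost.
    return "peers-ok-clients-cut-off"


if __name__ == "__main__":
    for links in (HostLinks(True, True), HostLinks(False, False), HostLinks(True, False)):
        print(links, "->", expected_symptom(links))
```

The third case, `HostLinks(True, False)`, is the one this bug is about. The full switch failure (second case) is the one the rack failure domain already covers.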

Comment 4 Josh Durgin 2017-07-19 00:37:17 UTC

Sounds like our existing cluster/client network heartbeats aren't working right then - needs investigation.
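
For reference while investigating, the following is a simplified sketch of the kind of check such heartbeats are meant to perform: a peer is considered failed if replies on either the public (front) or cluster (back) path have gone stale. This is an illustrative model, not the OSD's actual implementation. `PeerHeartbeat`, `is_peer_healthy`, `peers_to_report`, and the 20-second grace value are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed grace period for this example; the real value is a tunable setting.
GRACE_SECONDS = 20.0


@dataclass
class PeerHeartbeat:
    """Timestamps (in seconds) of the last heartbeat reply from one peer."""
    last_rx_front: float  # reply received over the public network
    last_rx_back: float   # reply received over the cluster network


def is_peer_healthy(hb: PeerHeartbeat, now: float, grace: float = GRACE_SECONDS) -> bool:
    """A peer is healthy only if BOTH paths replied within the grace period."""
    front_ok = (now - hb.last_rx_front) <= grace
    back_ok = (now - hb.last_rx_back) <= grace
    return front_ok and back_ok


def peers_to_report(peers: Dict[int, PeerHeartbeat], now: float,
                    grace: float = GRACE_SECONDS) -> List[int]:
    """Return the sorted IDs of peers that should be reported as failed."""
    return sorted(osd_id for osd_id, hb in peers.items()
                  if not is_peer_healthy(hb, now, grace))


if __name__ == "__main__":
    now = 1000.0
    peers = {
        1: PeerHeartbeat(last_rx_front=995.0, last_rx_back=995.0),  # both paths fresh
        2: PeerHeartbeat(last_rx_front=995.0, last_rx_back=900.0),  # cluster path stale
    }
    print(peers_to_report(peers, now))
```

In this model, OSD 2 matches the scenario from comment 3: its public path is fresh but its cluster path is stale, so it is reported as failed.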

Comment 6 Giridhar Ramaraju 2019-08-05 13:09:52 UTC

Updating the QA Contact to Hemant. Hemant will reroute them to the appropriate QE Associate.

Regards,
Giri

Comment 9 Josh Durgin 2020-04-29 21:51:01 UTC

Closing since this doesn't seem to be an issue. Please reopen if it occurs again.