Bug 1817850

Summary: [BAREMETAL] rook-ceph-operator does not reconcile when an OSD deployment is deleted during node replacement
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Pratik Surve <prsurve>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Itzhak <ikave>
Severity: high
Priority: unspecified
Version: 4.3
CC: madam, muagarwa, ocs-bugs, ratamir, shan
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2020-12-17 06:22:30 UTC
Type: Bug
Bug Blocks: 1800691

Comment 4 Travis Nielsen 2020-04-13 13:35:26 UTC
The operator does not currently start an orchestration when an OSD deployment is deleted. In the OCS 4.6 timeline we will look at that behavior when we convert the CephCluster controller to the controller-runtime framework. Until then, a step can simply be added to the node replacement guide to restart the operator.
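
As an illustration of that suggested workaround (not part of the original comment), restarting the operator typically amounts to deleting its pod and letting the deployment recreate it; the label selector and namespace below are assumptions based on a default OCS install:

$ oc delete pod -l app=rook-ceph-operator -n openshift-storage
$ oc get pods -l app=rook-ceph-operator -n openshift-storage -w   # wait for the new operator pod to reach Running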

@Pratik Would that address the issue for now? 
@Raz More details about why you see this as a blocker for 4.4?

Comment 7 Travis Nielsen 2020-04-24 18:18:34 UTC
The question of why this should be a blocker for 4.4 has not been answered. Removing the blocker flag and dropping this from 4.4. Per https://bugzilla.redhat.com/show_bug.cgi?id=1817850#c4, this improvement isn't expected to land until OCS 4.6.

Comment 9 Sébastien Han 2020-04-28 15:17:58 UTC
This will be part of release-4.6 once we branch it, so leaving this in POST until then.

Comment 10 Sébastien Han 2020-06-25 13:20:09 UTC
Moving to MODIFIED since the release-4.6 branch is there.

Comment 13 Itzhak 2020-10-22 12:32:04 UTC
I used a vSphere LSO 4.6 cluster to check the bug.
From what I understand, to verify the bug we don't need to perform all the steps of node replacement -
we only need to delete the OSD deployment and check that the OSD deployment and OSD pod are recreated.

Steps I did to verify the bug:

1. Delete the OSD deployment:
$ oc delete deployment rook-ceph-osd-0 -n openshift-storage

2. After approximately 30 seconds the OSD deployment and OSD pod were recreated,
   and Ceph health is OK (without restarting the rook-ceph-operator). A few commands for watching this are sketched below.
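
For reference, one way to watch the recreation and recheck health (not from the original comment; the label selectors and toolbox pod lookup are assumptions based on a default Rook/OCS deployment):

$ oc get deployment rook-ceph-osd-0 -n openshift-storage -w          # deployment reappears once the operator reconciles
$ oc get pods -l ceph-osd-id=0 -n openshift-storage                  # OSD pod should return to Running
$ TOOLS_POD=$(oc get pod -l app=rook-ceph-tools -n openshift-storage -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph health                 # expect HEALTH_OK once the OSD rejoins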


Additional info:

OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-10-20-172149
Kubernetes Version: v1.19.0+d59ce34

OCS version:
ocs-operator.v4.6.0-134.ci   OpenShift Container Storage   4.6.0-134.ci              Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-20-172149   True        False         20h     Cluster version is 4.6.0-0.nightly-2020-10-20-172149

Rook version:
rook: 4.6-67.afaf3353.release_4.6
go: go1.15.0

Ceph version:
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)

Comment 15 errata-xmlrpc 2020-12-17 06:22:30 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605