Bug 1817850

Summary: [BAREMETAL] rook-ceph-operator does not reconcile when an OSD deployment is deleted during node replacement
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Pratik Surve <prsurve>
Component: rook
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Itzhak <ikave>
Severity: high
Priority: unspecified
Version: 4.3
CC: madam, muagarwa, ocs-bugs, ratamir, shan
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2020-12-17 06:22:30 UTC
Type: Bug
Bug Blocks: 1800691

Comment 4 Travis Nielsen 2020-04-13 13:35:26 UTC
The operator does not currently start an orchestration when an OSD deployment is deleted. In the OCS 4.6 timeline we will look at that behavior when we convert the CephCluster controller to the controller-runtime framework. Until then, a step can simply be added to the node replacement guide to restart the operator.
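
As an illustration of that suggested workaround (not part of the original comment), restarting the operator typically amounts to deleting its pod and letting the deployment recreate it; the label selector and namespace below are assumptions based on a default OCS install:

$ oc delete pod -l app=rook-ceph-operator -n openshift-storage
$ oc get pods -l app=rook-ceph-operator -n openshift-storage -w   # wait for the new operator pod to reach Running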

@Pratik Would that address the issue for now? 
@Raz More details about why you see this as a blocker for 4.4?

Comment 7 Travis Nielsen 2020-04-24 18:18:34 UTC
The question of why this should be a blocker for 4.4 has not been answered. Removing the blocker flag and dropping this from 4.4. Per https://bugzilla.redhat.com/show_bug.cgi?id=1817850#c4, this improvement isn't expected to land until OCS 4.6.

Comment 9 Sébastien Han 2020-04-28 15:17:58 UTC
This will be part of release-4.6 once we branch it, so leaving this in POST until then.

Comment 10 Sébastien Han 2020-06-25 13:20:09 UTC
Moving to MODIFIED since the release-4.6 branch is there.

Comment 13 Itzhak 2020-10-22 12:32:04 UTC
I used a vSphere LSO 4.6 cluster to check the bug.
From what I understand, to verify the bug we don't need to perform all the steps of node replacement -
we only need to delete the OSD deployment and check that the OSD deployment and OSD pod are recreated.

Steps I did to verify the bug:

1. Delete the OSD deployment:
$ oc delete deployment rook-ceph-osd-0 -n openshift-storage

2. After approximately 30 seconds the OSD deployment and OSD pod were recreated,
   and Ceph health is OK (without restarting the rook-ceph-operator). A few commands for watching this are sketched below.
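
For reference, one way to watch the recreation and recheck health (not from the original comment; the label selectors and toolbox pod lookup are assumptions based on a default Rook/OCS deployment):

$ oc get deployment rook-ceph-osd-0 -n openshift-storage -w          # deployment reappears once the operator reconciles
$ oc get pods -l ceph-osd-id=0 -n openshift-storage                  # OSD pod should return to Running
$ TOOLS_POD=$(oc get pod -l app=rook-ceph-tools -n openshift-storage -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph health                 # expect HEALTH_OK once the OSD rejoins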


Additional info:

OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-10-20-172149
Kubernetes Version: v1.19.0+d59ce34

OCS version:
ocs-operator.v4.6.0-134.ci   OpenShift Container Storage   4.6.0-134.ci              Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-20-172149   True        False         20h     Cluster version is 4.6.0-0.nightly-2020-10-20-172149

Rook version:
rook: 4.6-67.afaf3353.release_4.6
go: go1.15.0

Ceph version:
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)

Comment 15 errata-xmlrpc 2020-12-17 06:22:30 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605