Bug 2216803

Summary: Rook ceph exporter pod remains stuck in terminating state when node is offline
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Aman Agrawal <amagrawa>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: CLOSED ERRATA
QA Contact: Itzhak <ikave>
Severity: high
Priority: unspecified
Version: 4.13
CC: ikave, muagarwa, nberry, odf-bz-bot, sapillai, sheggodu, tnielsen
Target Milestone: ---
Target Release: ODF 4.16.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.16.0-94
Doc Type: No Doc Update
Clones: 2227161 (view as bug list)
Last Closed: 2024-07-17 13:11:03 UTC
Type: Bug
Bug Blocks: 2227161

Comment 7 Itzhak 2024-04-08 12:54:59 UTC
I ran the test "test_check_pods_status_after_node_failure" again with the vSphere 4.14 cluster, and it passed: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/36067/. I also checked the console output to see that the rook-ceph pods weren't stuck in a terminating state. 


Cluster versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2024-04-04-140720
Kubernetes Version: v1.27.11+749fe1d

OCS version:
ocs-operator.v4.14.6-rhodf              OpenShift Container Storage   4.14.6-rhodf   ocs-operator.v4.14.5-rhodf              Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2024-04-04-140720   True        False         3h8m    Cluster version is 4.14.0-0.nightly-2024-04-04-140720

Rook version:
rook: v4.14.6-0.7522dc8ddafd09860f2314db3965ef97671cd138
go: go1.20.12

Ceph version:
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

Comment 12 Itzhak 2024-04-30 13:11:55 UTC
When testing with vSphere UPI 4.16, the test fails.
The failure differs from the previous error: after shutting down the worker node, the status of the rook-ceph pods that were not on the node did not change within 6 minutes.

So, in summary, the current issue with 4.16 vSphere UPI is the following:
1. Shutting down a worker node. 
2. Waiting for the status of the rook-ceph pods not in the node to change.

Actual result:
The status of the rook-ceph pods that were not on the node did not change within 6 minutes.

Expected result:
The status of the rook-ceph pods that were not on the node should change, or the pods should be deleted.
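The wait described above can be sketched as a generic poll-until-changed loop (the helper name, its wiring, and the 6-minute timeout shown in the comment are illustrative; the actual test lives in ocs-ci):

```shell
# wait_for_change polls a command once per second until its output differs
# from an initial value, or until the timeout (in seconds) elapses.
# Returns 0 when the output changes, 1 on timeout.
wait_for_change() {
  initial=$1; timeout=$2; shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    current=$("$@")
    if [ "$current" != "$initial" ]; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Hypothetical use against a pod's phase, with the test's 6-minute window:
#   wait_for_change Running 360 oc get pod <pod> -n openshift-storage \
#     -o jsonpath='{.status.phase}'
```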


Additional info:

Report portal link: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/all/20685/991918/991929/log.

Versions: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-020vup1cs33-t4a/j-020vup1cs33-t4a_20240423T010622/logs/test_report_1713834101.html

Comment 13 Santosh Pillai 2024-05-02 12:34:40 UTC
@ikave Comment 12 is a bit confusing for me. 

The BZ is about the `rook-ceph-exporter` pod being stuck in a terminating state when the node is offline. We fixed that by adding logic that removes any rook-ceph-exporter pod stuck in a terminating state.
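A minimal sketch of that cleanup, written as the equivalent manual workaround (the function name and sample column layout are illustrative; the actual fix is implemented inside the rook operator):

```shell
# select_stuck_exporters reads "oc get pods --no-headers"-style lines
# (NAME READY STATUS RESTARTS AGE) on stdin and prints the names of
# rook-ceph-exporter pods whose STATUS column is Terminating.
select_stuck_exporters() {
  awk '$1 ~ /^rook-ceph-exporter/ && $3 == "Terminating" { print $1 }'
}

# Hypothetical manual cleanup on a live cluster:
#   oc get pods -n openshift-storage --no-headers | select_stuck_exporters \
#     | xargs -r -n1 oc delete pod -n openshift-storage --grace-period=0 --force
```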

>>Expected result:
>> The status of the rook-ceph pods not in the node should change, or the pods should be deleted

What rook-ceph pods are you referring to?

Comment 14 Itzhak 2024-05-02 13:56:47 UTC
Yes, this is a different issue. 
The test checks that the status of the rook-ceph pods that are not on the node changes after shutting down a worker node.
This is the first step before checking the issue described in the BZ.

Comment 15 Mudit Agarwal 2024-05-07 05:49:30 UTC
Itzhak, please open a new issue for the recent failure.

Comment 17 Itzhak 2024-05-07 12:17:42 UTC
I raised a new BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2279538, regarding the issue mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2216803#c12. 
I will wait another few days to see if the error here appears again, and if not, I will close it.

Comment 18 Itzhak 2024-05-21 15:51:19 UTC
I reran the test "test_check_pods_status_after_node_failure" with AWS and vSphere 4.16, and it passed.
I also checked the console output to confirm the test proceeded as expected.
AWS test result: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-aws416/ikave-aws416_20240521T052423/logs/test_report_1716300838.html.

Therefore, I am moving the BZ to Verified.

Comment 23 Sunil Kumar Acharya 2024-06-18 06:45:26 UTC
Please update the RDT flag/text appropriately.

Comment 26 errata-xmlrpc 2024-07-17 13:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591