Bug 2216803 - Rook ceph exporter pod remains stuck in terminating state when node is offline
Summary: Rook ceph exporter pod remains stuck in terminating state when node is offline
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Santosh Pillai
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks: 2227161
 
Reported: 2023-06-22 16:15 UTC by Aman Agrawal
Modified: 2024-07-17 13:11 UTC
CC: 7 users

Fixed In Version: 4.16.0-94
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2227161
Environment:
Last Closed: 2024-07-17 13:11:03 UTC
Embargoed:




Links:
- Github red-hat-storage/rook pull 501 (open): Sync from upstream release-1.12 to downstream release-4.14 (last updated 2023-07-27 18:50:23 UTC)
- Github rook/rook pull 12575 (Merged): core: force delete rook-ceph-exporter pod (last updated 2023-07-27 14:59:23 UTC)
- Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:11:13 UTC)

Comment 7 Itzhak 2024-04-08 12:54:59 UTC
I ran the test "test_check_pods_status_after_node_failure" again with the vSphere 4.14 cluster, and it passed: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/36067/. I also checked the console output to see that the rook-ceph pods weren't stuck in a terminating state. 


Cluster versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2024-04-04-140720
Kubernetes Version: v1.27.11+749fe1d

OCS version:
ocs-operator.v4.14.6-rhodf              OpenShift Container Storage   4.14.6-rhodf   ocs-operator.v4.14.5-rhodf              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2024-04-04-140720   True        False         3h8m    Cluster version is 4.14.0-0.nightly-2024-04-04-140720

Rook version:
rook: v4.14.6-0.7522dc8ddafd09860f2314db3965ef97671cd138
go: go1.20.12

Ceph version:
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

Comment 12 Itzhak 2024-04-30 13:11:55 UTC
When testing with vSphere UPI 4.16, the test fails.
The failure is different from the previous error: after shutting down the worker node, the status of the rook-ceph pods not on that node did not change within 6 minutes.

So, in summary, the current issue with 4.16 vSphere UPI shows up in the following flow:
1. Shut down a worker node.
2. Wait for the status of the rook-ceph pods not on that node to change.

Actual result:
The status of the rook-ceph pods not on the node did not change within 6 minutes.

Expected result:
The status of the rook-ceph pods not on the node should change, or the pods should be deleted.


Additional info:

Report portal link: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/all/20685/991918/991929/log.

Versions: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-020vup1cs33-t4a/j-020vup1cs33-t4a_20240423T010622/logs/test_report_1713834101.html

Comment 13 Santosh Pillai 2024-05-02 12:34:40 UTC
@ikave Comment 12 is a bit confusing to me.

The BZ is about the `rook-ceph-exporter` pod stuck in a terminating state when the node is offline. We fixed that by adding changes that force delete any rook-ceph-exporter pod stuck in a terminating state.

>> Expected result:
>> The status of the rook-ceph pods not on the node should change, or the pods should be deleted.

What rook-ceph pods are you referring to?

Comment 14 Itzhak 2024-05-02 13:56:47 UTC
Yes, this is a different issue. 
The test checks that the status of the rook-ceph pods not on the node changes after shutting down a worker node.
This check is the first step, before verifying the issue described in this BZ.

Comment 15 Mudit Agarwal 2024-05-07 05:49:30 UTC
Itzhak, please open a new issue for the recent failure.

Comment 17 Itzhak 2024-05-07 12:17:42 UTC
I raised a new BZ, https://bugzilla.redhat.com/show_bug.cgi?id=2279538, for the issue mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2216803#c12.
I will wait a few more days to see if the error here appears again, and if not, I will close it.

Comment 18 Itzhak 2024-05-21 15:51:19 UTC
I reran the test "test_check_pods_status_after_node_failure" on AWS and vSphere 4.16, and it passed.
I also checked the console output to confirm that the process behaved as expected.
AWS test result: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-aws416/ikave-aws416_20240521T052423/logs/test_report_1716300838.html.

Therefore, I am moving the BZ to Verified.

Comment 23 Sunil Kumar Acharya 2024-06-18 06:45:26 UTC
Please update the RDT flag/text appropriately.

Comment 26 errata-xmlrpc 2024-07-17 13:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

