Bug 2216803 - Rook ceph exporter pod remains stuck in terminating state when node is offline
Summary: Rook ceph exporter pod remains stuck in terminating state when node is offline
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Santosh Pillai
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks: 2227161
 
Reported: 2023-06-22 16:15 UTC by Aman Agrawal
Modified: 2024-07-17 13:11 UTC
CC: 7 users

Fixed In Version: 4.16.0-94
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2227161
Environment:
Last Closed: 2024-07-17 13:11:03 UTC
Embargoed:




Links:
- Github red-hat-storage/rook pull 501 (open): Sync from upstream release-1.12 to downstream release-4.14 (last updated 2023-07-27 18:50:23 UTC)
- Github rook/rook pull 12575 (Merged): core: force delete rook-ceph-exporter pod (last updated 2023-07-27 14:59:23 UTC)
- Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:11:13 UTC)

Comment 7 Itzhak 2024-04-08 12:54:59 UTC
I ran the test "test_check_pods_status_after_node_failure" again with the vSphere 4.14 cluster, and it passed: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/36067/. I also checked the console output to see that the rook-ceph pods weren't stuck in a terminating state. 


Cluster versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2024-04-04-140720
Kubernetes Version: v1.27.11+749fe1d

OCS version:
ocs-operator.v4.14.6-rhodf              OpenShift Container Storage   4.14.6-rhodf   ocs-operator.v4.14.5-rhodf              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2024-04-04-140720   True        False         3h8m    Cluster version is 4.14.0-0.nightly-2024-04-04-140720

Rook version:
rook: v4.14.6-0.7522dc8ddafd09860f2314db3965ef97671cd138
go: go1.20.12

Ceph version:
ceph version 17.2.6-196.el9cp (cbbf2cfb549196ca18c0c9caff9124d83ed681a4) quincy (stable)

Comment 12 Itzhak 2024-04-30 13:11:55 UTC
When testing with vSphere UPI 4.16, the test fails.
The failure is different from the previous error: after shutting down the worker node, the status of the rook-ceph pods not on that node did not change within 6 minutes.

So, in summary, the current issue with 4.16 vSphere UPI shows up in the following flow:
1. Shut down a worker node.
2. Wait for the status of the rook-ceph pods not on that node to change.

Actual result:
The status of the rook-ceph pods not on the node did not change within 6 minutes.

Expected result:
The status of the rook-ceph pods not on the node should change, or the pods should be deleted.


Additional info:

Report portal link: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/all/20685/991918/991929/log.

Versions: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-020vup1cs33-t4a/j-020vup1cs33-t4a_20240423T010622/logs/test_report_1713834101.html

Comment 13 Santosh Pillai 2024-05-02 12:34:40 UTC
@ikave Comment 12 is a bit confusing to me.

The BZ is about the `rook-ceph-exporter` pod stuck in a terminating state when the node is offline. We fixed that by adding changes that force delete any rook-ceph-exporter pod stuck in a terminating state.

>> Expected result:
>> The status of the rook-ceph pods not on the node should change, or the pods should be deleted.

What rook-ceph pods are you referring to?

Comment 14 Itzhak 2024-05-02 13:56:47 UTC
Yes, this is a different issue. 
The test checks that the status of the rook-ceph pods not on the node changes after shutting down a worker node.
This check is the first step, before verifying the issue described in this BZ.

Comment 15 Mudit Agarwal 2024-05-07 05:49:30 UTC
Itzhak, please open a new issue for the recent failure.

Comment 17 Itzhak 2024-05-07 12:17:42 UTC
I raised a new BZ, https://bugzilla.redhat.com/show_bug.cgi?id=2279538, for the issue mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2216803#c12.
I will wait a few more days to see if the error here appears again, and if not, I will close it.

Comment 18 Itzhak 2024-05-21 15:51:19 UTC
I reran the test "test_check_pods_status_after_node_failure" on AWS and vSphere 4.16, and it passed.
I also checked the console output to confirm that the process behaved as expected.
AWS test result: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ikave-aws416/ikave-aws416_20240521T052423/logs/test_report_1716300838.html.

Therefore, I am moving the BZ to Verified.

Comment 23 Sunil Kumar Acharya 2024-06-18 06:45:26 UTC
Please update the RDT flag/text appropriately.

Comment 26 errata-xmlrpc 2024-07-17 13:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

