Bug 2233762 - exporter: crash of exporter daemons
Summary: exporter: crash of exporter daemons
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Dashboard
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 6.1z2
Assignee: avan
QA Contact: Sayalee
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2232226 2235257
 
Reported: 2023-08-23 10:07 UTC by avan
Modified: 2023-11-03 04:01 UTC
CC List: 8 users

Fixed In Version: ceph-17.2.6-115.el9cp
Doc Type: Bug Fix
Doc Text:
.ceph-exporter daemons no longer crash during an upgrade from 6.1z1 to 6.1z2
Previously, the output format of the `counter dump` and `counter schema` commands changed, and that change was delivered in 6.1z2. Due to this, during the upgrade from 6.1z1 to 6.1z2, some exporter daemons that were still in the queue to be upgraded and were using the old format crashed. With this fix, looping through the output of `counter dump` and `counter schema` is avoided if the new format is unsupported, and ceph-exporter daemons no longer crash.
Clone Of:
Environment:
Last Closed: 2023-10-12 16:34:36 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph pull 53089 0 None open exporter: skip the loop if counter schema items isn't an array 2023-08-23 10:08:38 UTC
Red Hat Issue Tracker RHCEPH-7257 0 None None None 2023-08-23 10:10:58 UTC
Red Hat Issue Tracker RHCSDASH-1068 0 None None None 2023-08-23 10:11:00 UTC
Red Hat Product Errata RHSA-2023:5693 0 None None None 2023-10-12 16:35:38 UTC

Description avan 2023-08-23 10:07:14 UTC
Description of problem:
https://bugzilla.redhat.com/show_bug.cgi?id=2232226
 
The crash occurs while the cluster is in an upgrade state (from 6.1 -> 6.1z2). Until the exporter itself is upgraded, it keeps using the old (object) format to fetch the counter dump/schema from Ceph daemons that have already been upgraded to the new version; once the exporter is upgraded, it stops crashing.

Comment 1 RHEL Program Management 2023-08-23 10:07:21 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 6 Mudit Agarwal 2023-08-28 12:32:29 UTC
Daniel, can you try on a fresh cluster? The automated upgrade test which we run with nightly tests (along with ocs-ci) passed with this build.

Comment 7 Daniel Horák 2023-08-28 12:51:21 UTC
(In reply to Mudit Agarwal from comment #6)
> Daniel, can you try on a fresh cluster? The automated upgrade test which we
> run with nightly tests (along with ocs-ci) passed with this build.

Mudit, you meant to try just the upgrade itself, without the pre/post upgrade tests?

Comment 8 Mudit Agarwal 2023-08-28 13:00:57 UTC
No, actually all the tests passed with this build. See https://jenkins.ceph.redhat.com/job/ocs-ci/2472/, so pre/post should also pass.
I was wondering if there was some issue with the cluster, whether it had some residue or maybe some of the pods were not upgraded, etc.

Comment 9 Daniel Horák 2023-08-28 13:47:56 UTC
(In reply to Mudit Agarwal from comment #8)
> No, actually all the tests passed with this build. See
> https://jenkins.ceph.redhat.com/job/ocs-ci/2472/, so pre/post should also
> pass.
> I was wondering if there was some issue with the cluster, whether it had
> some residue or maybe some of the pods were not upgraded, etc.

The upgrade was performed on a freshly deployed cluster and it happened the same way a second time. Should I destroy the existing cluster (from comment 5) and try it one more time?

Comment 10 Mudit Agarwal 2023-08-28 13:55:01 UTC
Yes, please

Comment 11 Daniel Horák 2023-08-31 13:09:49 UTC
(In reply to Mudit Agarwal from comment #8)
> No, actually all the tests passed with this build. See
> https://jenkins.ceph.redhat.com/job/ocs-ci/2472/, so pre/post should also
> pass.
> I was wondering if there was some issue with the cluster, whether it had
> some residue or maybe some of the pods were not upgraded, etc.

I didn't check the linked job before, but it actually failed the same way as I'm observing in my jobs - the upgrade itself passed, but a few of the post-upgrade/acceptance tests failed because of the following or a similar error:

> failed on teardown with "ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 2 daemons have recently crashed"

https://jenkins.ceph.redhat.com/job/ocs-ci/2472/testReport/tests.manage.mcg.test_bucket_creation/TestBucketCreation/test_bucket_creation_3_CLI_DEFAULT_BACKINGSTORE_/

And lots of the other tests were actually skipped because:

> Ceph health check failed at setup

for example: https://jenkins.ceph.redhat.com/job/ocs-ci/2472/testReport/tests.manage.pv_services.pvc_clone.test_pvc_to_pvc_clone/TestClone/test_pvc_to_pvc_clone_CephBlockPool_/

I've tried it multiple times and the behaviour is still the same - acceptance tests executed just after the upgrade start failing or are skipped because of the Ceph "HEALTH_WARN X daemons have recently crashed" warning.
I also tried performing just the upgrade and triggering the acceptance tests later (after a few hours), and it looks like this time it is progressing without any issue (the job is still running).

So there is definitely some issue; I'm not sure how closely it is related to the original one, but the symptoms are very similar.

Comment 15 Mudit Agarwal 2023-09-05 08:53:50 UTC
As mentioned by Avan in comment 11, there is a small window during which this issue will be observed.

ODF 4.13 still uses 6.1z1, which doesn't have this fix. Now, while you upgrade to ODF 4.14 (which has this fix), there may still be some exporter daemons in the upgrade queue which are still on the old release (4.13).

If the tests are triggered after some time, allowing the upgrade to complete for all these daemons, then we will not see this issue.

To avoid this issue:

1. Use a build in which the 6.1z1 shipped with 4.13 includes this fix (we don't have that build as of now).
2. Trigger the acceptance tests or the post-upgrade tests some time after the upgrade, not immediately.

I am moving this bug back to ON_QA because it is fixed from the Ceph side. If we still want to track this, the ODF bug can be moved to ASSIGNED until we have the fix in 4.13 as well.

Comment 20 errata-xmlrpc 2023-10-12 16:34:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security, enhancement, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:5693

