Description of problem: Red Hat Bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=2232226

The crashing issue occurs while the cluster is in an upgrade state (from 6.1 -> 6.1z2). The exporter keeps using the old format (object format) for fetching the counter dump/schema of the ceph daemons (which have already been upgraded to the new version) until the exporter itself is upgraded. Once it is upgraded, it stops crashing.
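For anyone who wants to see the mismatch for themselves, here is a minimal sketch (not the exporter's actual code) that pulls the counter dump/schema from a daemon's admin socket the same way the exporter does. The socket path is only an example and has to be adjusted for the host or toolbox pod in use:

import json
import subprocess

ASOK = "/var/run/ceph/ceph-osd.0.asok"  # example admin socket path, adjust per host

for cmd in ("counter dump", "counter schema"):
    out = subprocess.check_output(["ceph", "daemon", ASOK] + cmd.split())
    data = json.loads(out)
    # The top-level layout of this JSON is what changed between 6.1 and 6.1z2;
    # an exporter still expecting the old "object" layout fails to parse the
    # new one until the exporter itself is upgraded.
    print(cmd, "->", type(data).__name__, "with", len(data), "top-level entries")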
Daniel, can you try on a fresh cluster? The automated upgrade test which we run with the nightly tests (along with ocs-ci) passed with this build.
(In reply to Mudit Agarwal from comment #6)
> Daniel, can you try on a fresh cluster? The automated upgrade test which we
> run with the nightly tests (along with ocs-ci) passed with this build.

Mudit, did you mean to try just the upgrade itself, without the pre/post upgrade tests?
No, actually all the tests passed with this build. See https://jenkins.ceph.redhat.com/job/ocs-ci/2472/ so pre/post should also pass. I was wondering if there was some issue with the cluster, e.g. some residual state or maybe some of the pods were not upgraded.
(In reply to Mudit Agarwal from comment #8)
> No, actually all the tests passed with this build. See
> https://jenkins.ceph.redhat.com/job/ocs-ci/2472/ so pre/post should also
> pass. I was wondering if there was some issue with the cluster, e.g. some
> residual state or maybe some of the pods were not upgraded.

The upgrade was performed on a freshly deployed cluster, and it happened the same way a second time. Should I destroy the existing cluster (from comment 5) and try it one more time?
Yes, please
(In reply to Mudit Agarwal from comment #8)
> No, actually all the tests passed with this build. See
> https://jenkins.ceph.redhat.com/job/ocs-ci/2472/ so pre/post should also
> pass. I was wondering if there was some issue with the cluster, e.g. some
> residual state or maybe some of the pods were not upgraded.

I didn't check the linked job before, but it actually failed the same way as I'm observing in my jobs - the upgrade itself passed, but a few of the post-upgrade/acceptance tests failed because of the following or a similar error:

> failed on teardown with "ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 2 daemons have recently crashed"

https://jenkins.ceph.redhat.com/job/ocs-ci/2472/testReport/tests.manage.mcg.test_bucket_creation/TestBucketCreation/test_bucket_creation_3_CLI_DEFAULT_BACKINGSTORE_/

And lots of the other tests were actually skipped because:

> Ceph health check failed at setup

for example: https://jenkins.ceph.redhat.com/job/ocs-ci/2472/testReport/tests.manage.pv_services.pvc_clone.test_pvc_to_pvc_clone/TestClone/test_pvc_to_pvc_clone_CephBlockPool_/

I've tried it multiple times and the behaviour is still the same - acceptance tests executed just after the upgrade start failing or are skipped because of the ceph "HEALTH_WARN X daemons have recently crashed" state. I also tried performing just the upgrade and triggering the acceptance tests later (after a few hours), and this time it looks like it is progressing without any issue (the job is still running). So there is definitely some issue; I'm not sure how closely related it is to the original one, but the symptoms are very similar.
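In case it helps with triage, here is a rough sketch (not part of ocs-ci; the ceph crash commands are the standard CLI, run from the rook-ceph-tools pod) for listing the crashes behind that HEALTH_WARN and archiving them once they are confirmed to be the known exporter crashes:

import json
import subprocess

def ceph_json(*args):
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)

# List the crashes that are still "new", i.e. the ones producing the
# "N daemons have recently crashed" warning at test setup/teardown.
for crash in ceph_json("crash", "ls-new"):
    print(crash.get("crash_id"), crash.get("entity_name"), crash.get("timestamp"))

# Once confirmed these are the known exporter crashes from the upgrade window,
# archiving them clears the HEALTH_WARN so later tests are not skipped:
# subprocess.run(["ceph", "crash", "archive-all"], check=True)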
As mentioned by Avan in #comment11, there is a small window in which this issue will be observed. ODF 4.13 still uses 6.1z1, which doesn't have this fix; while you upgrade to ODF 4.14 (which has this fix), there may still be some exporter daemons in the upgrade queue that are still on the old version (4.13). If the tests are triggered after some time, allowing the upgrade to complete for all of these daemons, then we will not see this issue.

To avoid this issue:
1. Use a build which includes 6.1z2 with 4.13 (we don't have that build as of now), or
2. Trigger the acceptance tests or the post-upgrade tests some time after the upgrade rather than immediately (see the sketch below for one way to gate on this).

I am moving this bug back to ON_QA because it is fixed from the ceph side; if we still want to track it, the ODF bug can be moved to ASSIGNED until we have the fix in 4.13 as well.
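For option 2, one possible way to gate the test trigger (a hedged sketch, not something ocs-ci does today as far as I know; the timeout/interval values are arbitrary examples) is to poll "ceph versions" until every daemon reports the same version:

import json
import subprocess
import time

def wait_for_single_ceph_version(timeout=3600, interval=60):
    """Return True once `ceph versions` reports a single version for all daemons."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        versions = json.loads(subprocess.check_output(["ceph", "versions"]))
        # "overall" maps each running version string to a daemon count; a single
        # key means every daemon (including the exporters) has been upgraded.
        if len(versions.get("overall", {})) == 1:
            return True
        time.sleep(interval)
    return False

if wait_for_single_ceph_version():
    print("Upgrade complete; safe to trigger the post-upgrade/acceptance tests")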
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security, enhancement, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:5693