Description of problem (please be as detailed as possible and provide log snippets):
Ceph health is in WARN state because mon.a has recently crashed.

Version of all relevant components (if applicable):
openshift installer (4.9.0-0.nightly-2021-09-07-201519)
ocs-registry:4.9.0-129.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCS using ocs-ci.
2. Verify Ceph health.

Actual results:
sh-4.4$ ceph health
HEALTH_WARN 1 daemons have recently crashed
sh-4.4$

Expected results:
Ceph health should be HEALTH_OK.

Additional info:
sh-4.4$ ceph status
  cluster:
    id:     9fa8fddf-0463-4ad3-a128-0bd16b7361a0
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 55m)
    mgr: a(active, since 56m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 55m), 3 in (since 56m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 735 objects, 682 MiB
    usage:   1.5 GiB used, 298 GiB / 300 GiB avail
    pgs:     177 active+clean

  io:
    client:   938 B/s rd, 72 KiB/s wr, 1 op/s rd, 2 op/s wr

sh-4.4$ ceph health
HEALTH_WARN 1 daemons have recently crashed
sh-4.4$ ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-b8db6f4b5-4qss4 at 2021-09-08T08:48:47.214826Z
sh-4.4$
sh-4.4$ ceph crash ls
ID                                                                ENTITY  NEW
2021-09-08T08:48:47.214826Z_ecea9053-5dc0-43ef-96fa-5c6df187f588  mon.a    *
sh-4.4$

> pod status

$ oc get pods
NAME                                                              READY   STATUS      RESTARTS      AGE
csi-cephfsplugin-d5flb                                            3/3     Running     0             69m
csi-cephfsplugin-g474n                                            3/3     Running     0             69m
csi-cephfsplugin-provisioner-6f657488b6-5g8qg                     6/6     Running     0             69m
csi-cephfsplugin-provisioner-6f657488b6-n7jk2                     6/6     Running     0             69m
csi-cephfsplugin-rvrm9                                            3/3     Running     0             69m
csi-rbdplugin-8h9wh                                               3/3     Running     0             69m
csi-rbdplugin-provisioner-676f49f6f4-24gql                        6/6     Running     0             69m
csi-rbdplugin-provisioner-676f49f6f4-4jf65                        6/6     Running     0             69m
csi-rbdplugin-smsdc                                               3/3     Running     0             69m
csi-rbdplugin-zznvj                                               3/3     Running     0             69m
noobaa-core-0                                                     1/1     Running     0             63m
noobaa-db-pg-0                                                    1/1     Running     0             63m
noobaa-endpoint-59497f9777-lbhqq                                  1/1     Running     0             61m
noobaa-operator-6dbfdbdc99-bqqqk                                  1/1     Running     0             72m
ocs-metrics-exporter-6cc98866f4-2t8kd                             1/1     Running     0             72m
ocs-operator-868c5746f-cc2qc                                      1/1     Running     0             72m
odf-console-766cb86c59-n58hf                                      2/2     Running     0             72m
odf-operator-controller-manager-6854f4697-lhh52                   2/2     Running     0             72m
rook-ceph-crashcollector-compute-0-649f6f59f9-fs28c               1/1     Running     0             64m
rook-ceph-crashcollector-compute-1-67789d7888-wqzgb               1/1     Running     0             64m
rook-ceph-crashcollector-compute-2-647c4b678f-mrzw7               1/1     Running     0             64m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c9598457btl4r   2/2     Running     0             63m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-fcc95fdbwsc48   2/2     Running     0             63m
rook-ceph-mgr-a-866b66b6d8-rzx2c                                  2/2     Running     0             65m
rook-ceph-mon-a-b8db6f4b5-4qss4                                   2/2     Running     1 (63m ago)   69m
rook-ceph-mon-b-6865d9c76f-fnr97                                  2/2     Running     0             68m
rook-ceph-mon-c-5987b6c8f7-mjzjs                                  2/2     Running     0             67m
rook-ceph-operator-6989f694dd-jm4d2                               1/1     Running     0             72m
rook-ceph-osd-0-7665f6f9fc-vzbqv                                  2/2     Running     0             64m
rook-ceph-osd-1-5c7b78fd68-6c87h                                  2/2     Running     0             64m
rook-ceph-osd-2-b9f5ff4d5-bsksq                                   2/2     Running     0             64m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0gmbmx--1-d66mv        0/1     Completed   0             65m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0ffv6g--1-p2jg6        0/1     Completed   0             65m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0vbqqh--1-w6xmt        0/1     Completed   0             65m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-8744fc7gggk7   2/2     Running     0             63m
rook-ceph-tools-67bb846dc4-crrl8                                  1/1     Running     0             61m

> job link:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/5847/console

> must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthuq-odf/vavuthuq-odf_20210908T074723/logs/failed_testcase_ocs_logs_1631088683/test_deployment_ocs_logs/
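For triage, the crash record can be inspected from the toolbox pod and, once triaged, archived to clear the RECENT_CRASH warning. A minimal sketch using the crash ID and toolbox pod name from the output above; `ceph crash info` and `ceph crash archive` are standard Ceph commands, but archiving only silences the warning, it is not a fix for the underlying crash:

$ oc -n openshift-storage exec -it rook-ceph-tools-67bb846dc4-crrl8 -- bash
# Dump the full backtrace of the recorded crash
sh-4.4$ ceph crash info 2021-09-08T08:48:47.214826Z_ecea9053-5dc0-43ef-96fa-5c6df187f588
# After triage, archive the crash so it no longer counts toward RECENT_CRASH
sh-4.4$ ceph crash archive 2021-09-08T08:48:47.214826Z_ecea9053-5dc0-43ef-96fa-5c6df187f588
sh-4.4$ ceph health   # expected to return HEALTH_OK once the crash is archived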
Another occurrence:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1767/testReport/tests.ecosystem.deployment/test_deployment/test_deployment/

>           raise CephHealthException(f"Ceph cluster health is not OK. Health: {health}")
> E       ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed

rook-ceph-mon-a-85d47c76bf-gtm5k   2/2   Running   1 (85m ago)

In ceph health detail I see:
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-85d47c76bf-gtm5k at 2021-09-08T11:00:55.565478Z

Job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1767/

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-024vukv1en1cs33-t1/j-024vukv1en1cs33-t1_20210908T094825/logs/failed_testcase_ocs_logs_1631096407/test_deployment_ocs_logs/
Ceph BZ is already ON_QA
I have seen this with build odf-operator.v4.9.0-161.ci.

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/

HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-86d9c44d77-4qlw9 at 2021-10-04T01:18:38.484138Z

I can see it in the must gather (see the sketch below for where this lives in the unpacked archive):
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/failed_testcase_ocs_logs_1633307266/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/ceph/must_gather_commands/ceph_health_detail

Failing QE, as this was supposed to be fixed in v4.9.0-158.ci.
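For anyone digging into a similar must-gather, the relevant files sit at paths like these once the archive is unpacked (layout as in the links in this report; the wildcard stands in for the image digest directory):

# Health detail as captured by must-gather
$ cat ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-*/ceph/must_gather_commands/ceph_health_detail

# Previous (crashed) mon container log, where the assert backtrace is recorded
$ less ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-*/namespaces/openshift-storage/pods/rook-ceph-mon-a-*/mon/mon/logs/previous.log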
Petr, can you please check with v4.9.0-164.ci? There was a build issue because of which ODF build #158 didn't include the correct Ceph version. Sorry for the trouble.
The thing is that this issue is not reproducible every time. Trying here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-azure-ipi-fips-encryption-1az-rhcos-3m-3w-upgrade-ocp-nightly/3/console

I think we need more executions to see whether the issue shows up again; only then can we mark this as verified.
(In reply to Petr Balogh from comment #14)
> I have seen this with build: odf-operator.v4.9.0-161.ci
>
> Logs:
> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/
>
> HEALTH_WARN 1 daemons have recently crashed
> [WRN] RECENT_CRASH: 1 daemons have recently crashed
>     mon.a crashed on host rook-ceph-mon-a-86d9c44d77-4qlw9 at 2021-10-04T01:18:38.484138Z
>
> I can see from must gather:
> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/failed_testcase_ocs_logs_1633307266/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/ceph/must_gather_commands/ceph_health_detail
>
> Failing QE as I see it's supposed to be fixed here: v4.9.0-158.ci

The version of Ceph in that build does not have the patches:

> 2021-10-04T01:18:38.478139021Z /builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc: 856: FAILED ceph_assert(fs->mds_map.damaged.count(j.second.rank) == 0)
> 2021-10-04T01:18:38.481153214Z ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

From:
/ceph/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/deployment_1633307266/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/namespaces/openshift-storage/pods/rook-ceph-mon-a-86d9c44d77-4qlw9/mon/mon/logs/previous.log

Please retest.
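Before retesting, the Ceph build actually running in the cluster can be confirmed directly. A small sketch; `ceph versions` and `ceph --version` are standard commands, the mon pod name is taken from this report, and Rook names the mon container "mon":

# Versions reported by every running daemon, via the toolbox deployment
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph versions

# Or ask the mon binary in its own container
$ oc -n openshift-storage exec rook-ceph-mon-a-86d9c44d77-4qlw9 -c mon -- ceph --version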
Verified with build 4.9.0-194.ci.

> All operators are in Succeeded state

NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded

> cluster health is OK

2021-10-21 10:26:03  04:56:03 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-6bb55cd9f7-ckkxq -- ceph health
2021-10-21 10:26:03  04:56:03 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.

Job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/6911/consoleFull

Closing this bug for now; will reopen if we hit it again.
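For context, a minimal sketch of the kind of post-deployment health check ocs-ci performs, as seen in the log lines above; the label selector and retry policy here are illustrative assumptions, not ocs-ci's actual implementation:

# Resolve the toolbox pod and poll until the cluster reports HEALTH_OK
$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
$ for i in $(seq 1 30); do
>   HEALTH=$(oc -n openshift-storage exec "$TOOLS_POD" -- ceph health)
>   echo "$HEALTH"
>   [ "$HEALTH" = "HEALTH_OK" ] && break
>   sleep 10
> done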
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086