Bug 2002225
Summary: | [Tracker for Ceph BZ #2002398] [4.9.0-129.ci]: ceph health in WARN state due to mon.a crashed | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Vijay Avuthu <vavuthu>
Component: | ceph | Assignee: | Patrick Donnelly <pdonnell>
Status: | CLOSED ERRATA | QA Contact: | Vijay Avuthu <vavuthu>
Severity: | urgent | Docs Contact: |
Priority: | high | |
Version: | 4.9 | CC: | bniver, ebenahar, madam, mashetty, muagarwa, ocs-bugs, odf-bz-bot, pbalogh, pdonnell, sostapov, tdesala
Target Milestone: | --- | Keywords: | AutomationTriaged
Target Release: | ODF 4.9.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | v4.9.0-164.ci | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-12-13 17:46:04 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 2002398, 2002891 | |
Bug Blocks: | | |
Description
Vijay Avuthu 2021-09-08 10:01:53 UTC
Another occurrence: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1767/testReport/tests.ecosystem.deployment/test_deployment/test_deployment/

> raise CephHealthException(f"Ceph cluster health is not OK. Health: {health}")
E ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed

rook-ceph-mon-a-85d47c76bf-gtm5k 2/2 Running 1 (85m ago)

In ceph health detail I see:

HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-85d47c76bf-gtm5k at 2021-09-08T11:00:55.565478Z

Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1767/

Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-024vukv1en1cs33-t1/j-024vukv1en1cs33-t1_20210908T094825/logs/failed_testcase_ocs_logs_1631096407/test_deployment_ocs_logs/

The Ceph BZ is already ON_QA.

I have seen this with build: odf-operator.v4.9.0-161.ci

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/

HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-86d9c44d77-4qlw9 at 2021-10-04T01:18:38.484138Z

I can see it in the must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/failed_testcase_ocs_logs_1633307266/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/ceph/must_gather_commands/ceph_health_detail

Failing QA, as this was supposed to be fixed in v4.9.0-158.ci.

Petr, can you please check with v4.9.0-164.ci? There was a build issue because of which ODF build #158 didn't have the correct Ceph version. Sorry for the trouble.

The thing is that this issue is not reproducible all the time. Trying here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-azure-ipi-fips-encryption-1az-rhcos-3m-3w-upgrade-ocp-nightly/3/console

I think we need more executions to see whether this issue shows up again. Only then can we mark it as verified.
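
(For reference: the failure above comes from ocs-ci's post-deployment Ceph health validation. Below is a minimal sketch of an equivalent check, not the actual ocs-ci implementation; the namespace and the rook-ceph-tools label selector are assumptions about a typical Rook/ODF deployment.)

```python
# Minimal sketch of a post-deployment Ceph health check, similar in spirit to
# what ocs-ci does. Not the ocs-ci code; namespace and label selector are
# assumptions about a typical Rook/ODF installation.
import subprocess


class CephHealthException(Exception):
    """Raised when the Ceph cluster is not HEALTH_OK."""


def get_tools_pod(namespace="openshift-storage"):
    # Locate the rook-ceph-tools pod so the ceph CLI can be run inside it.
    out = subprocess.run(
        ["oc", "-n", namespace, "get", "pod",
         "-l", "app=rook-ceph-tools", "-o", "name"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def check_ceph_health(namespace="openshift-storage"):
    pod = get_tools_pod(namespace)
    health = subprocess.run(
        ["oc", "-n", namespace, "exec", pod, "--", "ceph", "health"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    if health != "HEALTH_OK":
        # e.g. "HEALTH_WARN 1 daemons have recently crashed"
        raise CephHealthException(
            f"Ceph cluster health is not OK. Health: {health}")


if __name__ == "__main__":
    check_ceph_health()
```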
(In reply to Petr Balogh from comment #14)
> I have seen this with build: odf-operator.v4.9.0-161.ci
>
> Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/
>
> HEALTH_WARN 1 daemons have recently crashed
> [WRN] RECENT_CRASH: 1 daemons have recently crashed
>     mon.a crashed on host rook-ceph-mon-a-86d9c44d77-4qlw9 at 2021-10-04T01:18:38.484138Z
>
> I can see it in the must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/failed_testcase_ocs_logs_1633307266/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/ceph/must_gather_commands/ceph_health_detail
>
> Failing QA, as this was supposed to be fixed in v4.9.0-158.ci.

The Ceph version in this run does not have the patches:

> 2021-10-04T01:18:38.478139021Z /builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc: 856: FAILED ceph_assert(fs->mds_map.damaged.count(j.second.rank) == 0)
> 2021-10-04T01:18:38.481153214Z ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

From: /ceph/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/deployment_1633307266/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/namespaces/openshift-storage/pods/rook-ceph-mon-a-86d9c44d77-4qlw9/mon/mon/logs/previous.log

Please retest.

Verified with build 4.9.0-194.ci

> All operators are in the Succeeded state

NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded

> cluster health is OK

2021-10-21 10:26:03 04:56:03 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-6bb55cd9f7-ckkxq -- ceph health
2021-10-21 10:26:03 04:56:03 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.

Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/6911/consoleFull

Closing this bug for now. Will reopen if we hit this bug again.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086
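
(For reference: since ODF build #158 shipped without the patched Ceph, it helps to confirm the running Ceph build and look at the recorded crash before judging a retest. Below is a rough sketch using the standard `ceph versions` and `ceph crash ls` commands; the pod discovery, label selector, and namespace are the same assumptions as in the earlier sketch.)

```python
# Sketch: confirm the running Ceph build and list recorded crashes before retesting.
# Uses the standard "ceph versions" and "ceph crash ls" commands; pod discovery,
# label selector, and namespace are assumptions about a typical Rook/ODF install.
import subprocess


def run_in_tools_pod(args, namespace="openshift-storage"):
    # Find the rook-ceph-tools pod and run the given ceph command inside it.
    pod = subprocess.run(
        ["oc", "-n", namespace, "get", "pod",
         "-l", "app=rook-ceph-tools", "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return subprocess.run(
        ["oc", "-n", namespace, "exec", pod, "--"] + args,
        check=True, capture_output=True, text=True,
    ).stdout


if __name__ == "__main__":
    # Ceph version reported by each running daemon type (mon/mgr/osd/mds).
    print(run_in_tools_pod(["ceph", "versions"]))
    # Crashes recorded by the crash module, the source of RECENT_CRASH warnings.
    print(run_in_tools_pod(["ceph", "crash", "ls"]))
```

In general, once a recorded crash has been triaged, `ceph crash archive <id>` (or `ceph crash archive-all`) clears the RECENT_CRASH warning so that health returns to HEALTH_OK.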