Description of problem (please be as detailed as possible and provide log snippets):
Ceph health is in WARN state because mon.a has recently crashed.

Version of all relevant components (if applicable):
openshift installer (4.9.0-0.nightly-2021-09-07-201519)
ocs-registry:4.9.0-129.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCS using ocs-ci.
2. Verify Ceph health.

Actual results:
sh-4.4$ ceph health
HEALTH_WARN 1 daemons have recently crashed
sh-4.4$

Expected results:
Ceph health should be HEALTH_OK.

Additional info:
sh-4.4$ ceph status
  cluster:
    id:     9fa8fddf-0463-4ad3-a128-0bd16b7361a0
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 55m)
    mgr: a(active, since 56m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 55m), 3 in (since 56m)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 735 objects, 682 MiB
    usage:   1.5 GiB used, 298 GiB / 300 GiB avail
    pgs:     177 active+clean

  io:
    client:   938 B/s rd, 72 KiB/s wr, 1 op/s rd, 2 op/s wr

sh-4.4$ ceph health
HEALTH_WARN 1 daemons have recently crashed
sh-4.4$ ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-b8db6f4b5-4qss4 at 2021-09-08T08:48:47.214826Z
sh-4.4$
sh-4.4$ ceph crash ls
ID                                                                ENTITY  NEW
2021-09-08T08:48:47.214826Z_ecea9053-5dc0-43ef-96fa-5c6df187f588  mon.a    *
sh-4.4$

> pod status

$ oc get pods
NAME                                                              READY   STATUS      RESTARTS      AGE
csi-cephfsplugin-d5flb                                            3/3     Running     0             69m
csi-cephfsplugin-g474n                                            3/3     Running     0             69m
csi-cephfsplugin-provisioner-6f657488b6-5g8qg                     6/6     Running     0             69m
csi-cephfsplugin-provisioner-6f657488b6-n7jk2                     6/6     Running     0             69m
csi-cephfsplugin-rvrm9                                            3/3     Running     0             69m
csi-rbdplugin-8h9wh                                               3/3     Running     0             69m
csi-rbdplugin-provisioner-676f49f6f4-24gql                        6/6     Running     0             69m
csi-rbdplugin-provisioner-676f49f6f4-4jf65                        6/6     Running     0             69m
csi-rbdplugin-smsdc                                               3/3     Running     0             69m
csi-rbdplugin-zznvj                                               3/3     Running     0             69m
noobaa-core-0                                                     1/1     Running     0             63m
noobaa-db-pg-0                                                    1/1     Running     0             63m
noobaa-endpoint-59497f9777-lbhqq                                  1/1     Running     0             61m
noobaa-operator-6dbfdbdc99-bqqqk                                  1/1     Running     0             72m
ocs-metrics-exporter-6cc98866f4-2t8kd                             1/1     Running     0             72m
ocs-operator-868c5746f-cc2qc                                      1/1     Running     0             72m
odf-console-766cb86c59-n58hf                                      2/2     Running     0             72m
odf-operator-controller-manager-6854f4697-lhh52                   2/2     Running     0             72m
rook-ceph-crashcollector-compute-0-649f6f59f9-fs28c               1/1     Running     0             64m
rook-ceph-crashcollector-compute-1-67789d7888-wqzgb               1/1     Running     0             64m
rook-ceph-crashcollector-compute-2-647c4b678f-mrzw7               1/1     Running     0             64m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c9598457btl4r   2/2     Running     0             63m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-fcc95fdbwsc48   2/2     Running     0             63m
rook-ceph-mgr-a-866b66b6d8-rzx2c                                  2/2     Running     0             65m
rook-ceph-mon-a-b8db6f4b5-4qss4                                   2/2     Running     1 (63m ago)   69m
rook-ceph-mon-b-6865d9c76f-fnr97                                  2/2     Running     0             68m
rook-ceph-mon-c-5987b6c8f7-mjzjs                                  2/2     Running     0             67m
rook-ceph-operator-6989f694dd-jm4d2                               1/1     Running     0             72m
rook-ceph-osd-0-7665f6f9fc-vzbqv                                  2/2     Running     0             64m
rook-ceph-osd-1-5c7b78fd68-6c87h                                  2/2     Running     0             64m
rook-ceph-osd-2-b9f5ff4d5-bsksq                                   2/2     Running     0             64m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0gmbmx--1-d66mv        0/1     Completed   0             65m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0ffv6g--1-p2jg6        0/1     Completed   0             65m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0vbqqh--1-w6xmt        0/1     Completed   0             65m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-8744fc7gggk7   2/2     Running     0             63m
rook-ceph-tools-67bb846dc4-crrl8                                  1/1     Running     0             61m

> job link:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/5847/console

> must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthuq-odf/vavuthuq-odf_20210908T074723/logs/failed_testcase_ocs_logs_1631088683/test_deployment_ocs_logs/
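For triage, the crash record can be inspected from the toolbox pod and, once triaged, archived to clear the RECENT_CRASH warning. A minimal sketch using the crash ID and toolbox pod name from the output above; `ceph crash info` and `ceph crash archive` are standard Ceph commands, but archiving only silences the warning, it is not a fix for the underlying crash:

$ oc -n openshift-storage exec -it rook-ceph-tools-67bb846dc4-crrl8 -- bash
# Dump the full backtrace of the recorded crash
sh-4.4$ ceph crash info 2021-09-08T08:48:47.214826Z_ecea9053-5dc0-43ef-96fa-5c6df187f588
# After triage, archive the crash so it no longer counts toward RECENT_CRASH
sh-4.4$ ceph crash archive 2021-09-08T08:48:47.214826Z_ecea9053-5dc0-43ef-96fa-5c6df187f588
sh-4.4$ ceph health   # expected to return HEALTH_OK once the crash is archived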
Another occurrence:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1767/testReport/tests.ecosystem.deployment/test_deployment/test_deployment/

>           raise CephHealthException(f"Ceph cluster health is not OK. Health: {health}")
> E       ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 1 daemons have recently crashed

rook-ceph-mon-a-85d47c76bf-gtm5k   2/2   Running   1 (85m ago)

In ceph health detail I see:
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-85d47c76bf-gtm5k at 2021-09-08T11:00:55.565478Z

Job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1767/

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-024vukv1en1cs33-t1/j-024vukv1en1cs33-t1_20210908T094825/logs/failed_testcase_ocs_logs_1631096407/test_deployment_ocs_logs/
Ceph BZ is already ON_QA
I have seen this with build odf-operator.v4.9.0-161.ci.

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/

HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.a crashed on host rook-ceph-mon-a-86d9c44d77-4qlw9 at 2021-10-04T01:18:38.484138Z

I can see it in the must gather (see the sketch below for where this lives in the unpacked archive):
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/failed_testcase_ocs_logs_1633307266/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/ceph/must_gather_commands/ceph_health_detail

Failing QE, as this was supposed to be fixed in v4.9.0-158.ci.
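For anyone digging into a similar must-gather, the relevant files sit at paths like these once the archive is unpacked (layout as in the links in this report; the wildcard stands in for the image digest directory):

# Health detail as captured by must-gather
$ cat ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-*/ceph/must_gather_commands/ceph_health_detail

# Previous (crashed) mon container log, where the assert backtrace is recorded
$ less ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-*/namespaces/openshift-storage/pods/rook-ceph-mon-a-*/mon/mon/logs/previous.log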
Petr, can you please check with v4.9.0-164.ci? There was a build issue because of which ODF build #158 didn't include the correct Ceph version. Sorry for the trouble.
The thing is that this issue is not reproducible every time. Trying here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-azure-ipi-fips-encryption-1az-rhcos-3m-3w-upgrade-ocp-nightly/3/console

I think we need more executions to see whether the issue shows up again; only then can we mark this as verified.
(In reply to Petr Balogh from comment #14)
> I have seen this with build: odf-operator.v4.9.0-161.ci
>
> Logs:
> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/
>
> HEALTH_WARN 1 daemons have recently crashed
> [WRN] RECENT_CRASH: 1 daemons have recently crashed
>     mon.a crashed on host rook-ceph-mon-a-86d9c44d77-4qlw9 at 2021-10-04T01:18:38.484138Z
>
> I can see from must gather:
> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/failed_testcase_ocs_logs_1633307266/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/ceph/must_gather_commands/ceph_health_detail
>
> Failing QE as I see it's supposed to be fixed here: v4.9.0-158.ci

The version of Ceph in that build does not have the patches:

> 2021-10-04T01:18:38.478139021Z /builddir/build/BUILD/ceph-16.2.0/src/mds/FSMap.cc: 856: FAILED ceph_assert(fs->mds_map.damaged.count(j.second.rank) == 0)
> 2021-10-04T01:18:38.481153214Z ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

From:
/ceph/ocsci-jenkins/openshift-clusters/j-002zi1c33-uon/j-002zi1c33-uon_20211004T001807/logs/deployment_1633307266/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-1a3ed74a00cd4bb1f0480fddf45ad4b6611584759f6510e284769f347ecfa270/namespaces/openshift-storage/pods/rook-ceph-mon-a-86d9c44d77-4qlw9/mon/mon/logs/previous.log

Please retest.
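Before retesting, the Ceph build actually running in the cluster can be confirmed directly. A small sketch; `ceph versions` and `ceph --version` are standard commands, the mon pod name is taken from this report, and Rook names the mon container "mon":

# Versions reported by every running daemon, via the toolbox deployment
$ oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph versions

# Or ask the mon binary in its own container
$ oc -n openshift-storage exec rook-ceph-mon-a-86d9c44d77-4qlw9 -c mon -- ceph --version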
Verified with build 4.9.0-194.ci.

> All operators are in Succeeded state

NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded

> cluster health is OK

2021-10-21 10:26:03  04:56:03 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-6bb55cd9f7-ckkxq -- ceph health
2021-10-21 10:26:03  04:56:03 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.

Job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/6911/consoleFull

Closing this bug for now; will reopen if we hit it again.
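For context, a minimal sketch of the kind of post-deployment health check ocs-ci performs, as seen in the log lines above; the label selector and retry policy here are illustrative assumptions, not ocs-ci's actual implementation:

# Resolve the toolbox pod and poll until the cluster reports HEALTH_OK
$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
$ for i in $(seq 1 30); do
>   HEALTH=$(oc -n openshift-storage exec "$TOOLS_POD" -- ceph health)
>   echo "$HEALTH"
>   [ "$HEALTH" = "HEALTH_OK" ] && break
>   sleep 10
> done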
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:5086