Description of problem (please be as detailed as possible and provide log snippets):
A new item was not created in the ceph crash list after killing an osd/mon/mgr daemon.

Version of all relevant components (if applicable):
OCS version: ocs-operator.v4.8.0-422.ci
OCP version: 4.8.0-0.nightly-2021-06-22-022125
Provider: vmware
Type: LSO
ceph versions:
sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 10
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Get the worker node where osd-0 is running -> compute-0
$ oc get pods -o wide
rook-ceph-osd-0-69b4d6d64b-v6mcg   2/2   Running   0   21h   10.130.2.107   compute-0   <none>   <none>

2. Log in to the worker node
$ oc debug node/compute-0
sh-4.4# chroot /host
sh-4.4# bash
[root@compute-0 /]#

3. Get the ceph-osd pid:
[root@compute-0 /]# pidof ceph-osd
811668

4. Kill the process:
[root@compute-0 /]# kill -11 811668

5. Check coredumpctl list on compute-0:
[root@compute-0 posted]# coredumpctl list
TIME                            PID   UID   GID SIG COREFILE  EXE
Thu 2021-06-24 10:00:05 UTC  811668   167   167  11 present   /usr/bin/ceph-osd

6. Check the content of the "/var/lib/rook/openshift-storage/crash/posted" dir:
[root@compute-0 posted]# pwd
/var/lib/rook/openshift-storage/crash/posted
[root@compute-0 posted]# ls
2021-06-24T10:00:02.339256Z_9b4866b6-6cd6-4821-9523-4005b96c2821
[root@compute-0 2021-06-24T10:00:02.339256Z_9b4866b6-6cd6-4821-9523-4005b96c2821]# pwd
/var/lib/rook/openshift-storage/crash/posted/2021-06-24T10:00:02.339256Z_9b4866b6-6cd6-4821-9523-4005b96c2821
[root@compute-0 2021-06-24T10:00:02.339256Z_9b4866b6-6cd6-4821-9523-4005b96c2821]# cat log | grep -i osd.0
   -10> 2021-06-24 09:59:57.865 7f105eff1700  5 osd.0 601 heartbeat osd_stat(store_statfs(0x189d620000/0x40000000/0x1900000000, data 0x1365b0ce/0x22960000, compress 0x0/0x0/0x0, omap 0x2fd85, meta 0x3ffd027b), peers [1,2] op hist [])
    -4> 2021-06-24 10:00:01.365 7f105eff1700  5 osd.0 601 heartbeat osd_stat(store_statfs(0x189d620000/0x40000000/0x1900000000, data 0x1365b0ce/0x22960000, compress 0x0/0x0/0x0, omap 0x2fd85, meta 0x3ffd027b), peers [1,2] op hist [])
 in thread 7f1086ac0dc0 thread_name:ceph-osd

7. Check the crash list [still empty after 20 min, Fail!!!]
$ oc rsh rook-ceph-tools-69867749f-jsltd
sh-4.4# ceph crash ls
sh-4.4#

Actual results:
A new item was not created in the ceph crash list.

Expected results:
A new item is created in the ceph crash list, for example:
$ oc rsh rook-ceph-tools-84bc476959-2d2ps
sh-4.4# ceph crash ls
ID                                                                ENTITY  NEW
2021-02-10_10:21:02.360184Z_93f26ef2-bdee-407f-9581-cb5d0faa931c  mon.b   *

Additional info:
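To automate the check in step 7 above, here is a minimal Python sketch of the polling the test performs. The tools pod name is taken from the transcript above; the `--format json` output and the `crash_id` field are assumptions based on the usual ceph CLI behaviour, not taken from this report:

import json
import subprocess
import time

# Pod name taken from step 7 above; adjust to the tools pod in your cluster.
TOOLS_POD = 'rook-ceph-tools-69867749f-jsltd'

def crash_ids():
    # List crash reports as JSON through the toolbox pod.
    out = subprocess.check_output(
        ['oc', 'rsh', TOOLS_POD, 'ceph', 'crash', 'ls', '--format', 'json'])
    return [entry['crash_id'] for entry in json.loads(out or b'[]')]

# Poll for up to 20 minutes for a new crash entry to appear.
deadline = time.time() + 20 * 60
while time.time() < deadline:
    ids = crash_ids()
    if ids:
        print('crash entries found:', ids)
        break
    time.sleep(30)
else:
    print('FAIL: ceph crash list is still empty after 20 minutes')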
Please include the logs :)
+ logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j059vuf1cs33-t1/j059vuf1cs33-t1_20210622T143745/logs/failed_testcase_ocs_logs_1624379492/test_coredump_check_for_ceph_daemon_crash_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/

This issue was reproduced on both VMware and AWS.
This issue was also reproduced on the following setup:
OCS version: v4.7.2-429.ci
OCP version: 4.7.0-0.nightly-2021-06-24-030912
Provider: VSPHERE6-UPI-ENCRYPTION-1AZ-RHCOS-VSAN-LSO-VMDK
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1196/testReport/tests.manage.z_cluster.test_coredump_check_for_ceph_daemon_crash/TestKillCephDaemon/test_coredump_check_for_ceph_daemon_crash/
This issue reproduces on OCS 4.7.2 and does not reproduce on OCS 4.7.1.
Moving to Ceph since I believe this is a Ceph issue introduced in 4.7.2. The main problem is that the mgr returns 0 even on errors, see:

[root@rook-ceph-crashcollector-compute-1-66bdfbd886-d2zcd /]# ceph -n client.crash crash post -i /var/lib/ceph/crash/2021-06-28T07\:47\:37.859766Z_6ffb119c-930e-4047-9cfa-a92af783cdd0/meta
malformed crash metadata: time data '2021-06-28T07:47:37.859766' does not match format '%Y-%m-%d %H:%M:%S.%f'
[root@rook-ceph-crashcollector-compute-1-66bdfbd886-d2zcd /]# echo $?
0

So the crash-collector script cannot detect the failure, since it only checks the return code. There are two things to fix: the return code, and the timestamp format error in the crash metadata. The timestamp used to be "2021-06-28 07:03:19.750123Z" and is now "2021-06-28T07:47:37.859766".
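For illustration, a minimal Python sketch of the parsing mismatch (this is not the actual mgr crash module code; the format string and timestamps are taken from the output above, and the trailing 'Z' appears to be stripped before parsing, judging by the error message):

from datetime import datetime

# Format string from the error message above.
FMT = '%Y-%m-%d %H:%M:%S.%f'

old_style = '2021-06-28 07:03:19.750123'   # pre-4.7.2 metadata timestamp (space separator)
new_style = '2021-06-28T07:47:37.859766'   # 4.7.2 metadata timestamp ('T' separator)

datetime.strptime(old_style, FMT)  # parses fine

try:
    datetime.strptime(new_style, FMT)
except ValueError as err:
    # ValueError: time data '2021-06-28T07:47:37.859766' does not match
    # format '%Y-%m-%d %H:%M:%S.%f'
    # 'ceph crash post' prints this error but still exits with status 0,
    # so the crash-collector, which only checks the exit code, never
    # notices the failure and never retries.
    print(err)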
logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1976980/
Did the manager change the way it handles errors?
The Ceph BZ is targeted for 4.3; AFAIK it won't be available in the 4.8 time frame. Moving this out of 4.8, please bring it back if someone thinks otherwise. Once 4.3 is released, this should get fixed.
*** Bug 1977952 has been marked as a duplicate of this bug. ***
Bringing it back to 4.8 as we are going to use the 4.2z2 sync with 4.8, which has the fix for Bug #1977271.
Bug fixed.

Setup:
OCP version: 4.8.0-0.nightly-2021-07-01-185624
OCS version: ocs-operator.v4.8.0-444.ci
LSO version: local-storage-operator.4.7.0-202102110027.p0
Provider: vmware

sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 10
    }
}

Test Procedure:
1. Get the worker node where osd-0 is running -> compute-1
$ oc get pods -o wide | grep osd-0
rook-ceph-osd-0-84fc47b944-fm64f   2/2   Running   0   12m   10.128.4.39   compute-1   <none>   <none>

2. Log in to the worker node
$ oc debug node/compute-1
sh-4.4# chroot /host
sh-4.4# bash
[root@compute-1 /]#

3. Get the ceph-osd pid:
[root@compute-1 /]# pidof ceph-osd
2296152

4. Kill the process:
[root@compute-1 /]# kill -11 2296152

5. Check coredumpctl list on compute-1:
[root@compute-1 /]# coredumpctl list
TIME                            PID   UID   GID SIG COREFILE  EXE
Tue 2021-07-06 07:36:01 UTC 2296152   167   167  11 present   /usr/bin/ceph-osd

6. Check the content of the "/var/lib/rook/openshift-storage/crash/posted" dir:
[root@compute-1 2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a]# pwd
/var/lib/rook/openshift-storage/crash/posted/2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a
[root@compute-1 2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a]# cat log | grep osd.0
    -1> 2021-07-06 07:35:58.915 7f5b0a369700  5 osd.0 438 heartbeat osd_stat(store_statfs(0x189c000000/0x40000000/0x1900000000, data 0x150bba85/0x23ff0000, compress 0x0/0x0/0x0, omap 0x3699e, meta 0x3ffc9662), peers [1,2] op hist [])

7. Check the crash list [the new entry is now present]:
$ oc rsh rook-ceph-tools-98ffbd8c-jkgfz
sh-4.4# ceph crash ls
ID                                                                ENTITY  NEW
2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a  osd.0   *
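As a sanity check on the fix, a small sketch that reads the posted crash metadata on the node and confirms its timestamp is back in the format the mgr accepts. The path is taken from step 6 above; the meta file being JSON with a 'timestamp' field, and the trailing 'Z' being stripped before parsing, are assumptions based on comment 6:

import json
from datetime import datetime

# Crash directory from step 6 above (hostPath-mounted crash dir on the node).
META = ('/var/lib/rook/openshift-storage/crash/posted/'
        '2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a/meta')

with open(META) as f:
    meta = json.load(f)

# Format the mgr expects, from the error message in comment 6.
ts = meta['timestamp'].rstrip('Z')
datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f')  # raises ValueError if still malformed
print('timestamp parses OK:', meta['timestamp'])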
(In reply to Oded from comment #16)
Step 7: Check the crash list
$ oc rsh rook-ceph-tools-98ffbd8c-jkgfz
sh-4.4# ceph crash ls
ID                                                                ENTITY  NEW
2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a  osd.0   *

Step 8: Check ceph status
sh-4.4# ceph status
  cluster:
    id:     266208d6-a2f3-4574-b41d-92285e6df3c1
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,e,g (age 24m)
    mgr: a(active, since 24m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 9m), 3 in (since 41h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 3.32k objects, 422 MiB
    usage:   4.7 GiB used, 295 GiB / 300 GiB avail
    pgs:     176 active+clean

  io:
    client:   1.3 KiB/s rd, 1.2 KiB/s wr, 2 op/s rd, 0 op/s wr

sh-4.4# ceph health detail
HEALTH_WARN 1 daemons have recently crashed
RECENT_CRASH 1 daemons have recently crashed
    osd.0 crashed on host rook-ceph-osd-0-84fc47b944-fm64f at 2021-07-06 07:35:58.994012Z

Step 9: Remove records of failures
sh-4.4# ceph crash archive-all

Step 10: Check ceph status
sh-4.4# ceph health detail
HEALTH_WARN Long heartbeat ping times on back interface seen, longest is 2295.210 msec; Long heartbeat ping times on front interface seen, longest is 2503.655 msec
OSD_SLOW_PING_TIME_BACK Long heartbeat ping times on back interface seen, longest is 2295.210 msec
    Slow heartbeat ping on back interface from osd.0 to osd.1 2295.210 msec
OSD_SLOW_PING_TIME_FRONT Long heartbeat ping times on front interface seen, longest is 2503.655 msec
    Slow heartbeat ping on front interface from osd.0 to osd.1 2503.655 msec
sh-4.4# ceph health detail
HEALTH_OK
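A small sketch of steps 9-10 as a scripted check. The pod name comes from the output above; the JSON layout of `ceph health detail --format json` (a top-level 'checks' map keyed by check name such as RECENT_CRASH) is an assumption based on common ceph CLI behaviour:

import json
import subprocess

TOOLS_POD = 'rook-ceph-tools-98ffbd8c-jkgfz'  # tools pod name from the output above

def ceph(*args):
    # Run a ceph command inside the toolbox pod and return its stdout.
    return subprocess.check_output(['oc', 'rsh', TOOLS_POD, 'ceph', *args])

# Step 9: archive all crash reports so they stop raising RECENT_CRASH.
ceph('crash', 'archive-all')

# Step 10: verify the RECENT_CRASH warning is gone from the health checks.
health = json.loads(ceph('health', 'detail', '--format', 'json'))
assert 'RECENT_CRASH' not in health.get('checks', {}), health['checks']
print('health status:', health.get('status'))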