Description of problem (please be as detailed as possible and provide log snippets):
A new item was not created in the ceph crash list after killing an osd/mon/mgr daemon.

Version of all relevant components (if applicable):
OCS version: ocs-operator.v4.8.0-422.ci
OCP version: 4.8.0-0.nightly-2021-06-22-022125
Provider: vmware
Type: LSO
ceph versions:
sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)": 10
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Get the worker node where osd-0 is running -> compute-0
$ oc get pods -o wide
rook-ceph-osd-0-69b4d6d64b-v6mcg   2/2   Running   0   21h   10.130.2.107   compute-0   <none>   <none>

2. Log in to the worker node
$ oc debug node/compute-0
sh-4.4# chroot /host
sh-4.4# bash
[root@compute-0 /]#

3. Get the ceph-osd pid:
[root@compute-0 /]# pidof ceph-osd
811668

4. Kill the process:
[root@compute-0 /]# kill -11 811668

5. Check coredumpctl list on compute-0:
[root@compute-0 posted]# coredumpctl list
TIME                            PID   UID   GID SIG COREFILE  EXE
Thu 2021-06-24 10:00:05 UTC  811668   167   167  11 present   /usr/bin/ceph-osd

6. Check the content of the "/var/lib/rook/openshift-storage/crash/posted" dir:
[root@compute-0 posted]# pwd
/var/lib/rook/openshift-storage/crash/posted
[root@compute-0 posted]# ls
2021-06-24T10:00:02.339256Z_9b4866b6-6cd6-4821-9523-4005b96c2821
[root@compute-0 2021-06-24T10:00:02.339256Z_9b4866b6-6cd6-4821-9523-4005b96c2821]# pwd
/var/lib/rook/openshift-storage/crash/posted/2021-06-24T10:00:02.339256Z_9b4866b6-6cd6-4821-9523-4005b96c2821
[root@compute-0 2021-06-24T10:00:02.339256Z_9b4866b6-6cd6-4821-9523-4005b96c2821]# cat log | grep -i osd.0
   -10> 2021-06-24 09:59:57.865 7f105eff1700  5 osd.0 601 heartbeat osd_stat(store_statfs(0x189d620000/0x40000000/0x1900000000, data 0x1365b0ce/0x22960000, compress 0x0/0x0/0x0, omap 0x2fd85, meta 0x3ffd027b), peers [1,2] op hist [])
    -4> 2021-06-24 10:00:01.365 7f105eff1700  5 osd.0 601 heartbeat osd_stat(store_statfs(0x189d620000/0x40000000/0x1900000000, data 0x1365b0ce/0x22960000, compress 0x0/0x0/0x0, omap 0x2fd85, meta 0x3ffd027b), peers [1,2] op hist [])
 in thread 7f1086ac0dc0 thread_name:ceph-osd

7. Check the crash list [still empty after 20 min, Fail!!!]
$ oc rsh rook-ceph-tools-69867749f-jsltd
sh-4.4# ceph crash ls
sh-4.4#

Actual results:
A new item was not created in the ceph crash list.

Expected results:
A new item is created in the ceph crash list, for example:
$ oc rsh rook-ceph-tools-84bc476959-2d2ps
sh-4.4# ceph crash ls
ID                                                                ENTITY  NEW
2021-02-10_10:21:02.360184Z_93f26ef2-bdee-407f-9581-cb5d0faa931c  mon.b   *

Additional info:
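To automate the check in step 7 above, here is a minimal Python sketch of the polling the test performs. The tools pod name is taken from the transcript above; the `--format json` output and the `crash_id` field are assumptions based on the usual ceph CLI behaviour, not taken from this report:

import json
import subprocess
import time

# Pod name taken from step 7 above; adjust to the tools pod in your cluster.
TOOLS_POD = 'rook-ceph-tools-69867749f-jsltd'

def crash_ids():
    # List crash reports as JSON through the toolbox pod.
    out = subprocess.check_output(
        ['oc', 'rsh', TOOLS_POD, 'ceph', 'crash', 'ls', '--format', 'json'])
    return [entry['crash_id'] for entry in json.loads(out or b'[]')]

# Poll for up to 20 minutes for a new crash entry to appear.
deadline = time.time() + 20 * 60
while time.time() < deadline:
    ids = crash_ids()
    if ids:
        print('crash entries found:', ids)
        break
    time.sleep(30)
else:
    print('FAIL: ceph crash list is still empty after 20 minutes')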
Please include the logs :)
+ logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j059vuf1cs33-t1/j059vuf1cs33-t1_20210622T143745/logs/failed_testcase_ocs_logs_1624379492/test_coredump_check_for_ceph_daemon_crash_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2d11d7a16502a1635705f98f62213e9504d9dc6fb3a883b2165eaf4a219c8da9/

This issue was reproduced on both VMware and AWS.
This issue was also reproduced on the following setup:
OCS version: v4.7.2-429.ci
OCP version: 4.7.0-0.nightly-2021-06-24-030912
Provider: VSPHERE6-UPI-ENCRYPTION-1AZ-RHCOS-VSAN-LSO-VMDK
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1196/testReport/tests.manage.z_cluster.test_coredump_check_for_ceph_daemon_crash/TestKillCephDaemon/test_coredump_check_for_ceph_daemon_crash/
This issue reproduces on OCS 4.7.2 and does not reproduce on OCS 4.7.1.
Moving to Ceph since I believe this is a Ceph issue introduced in 4.7.2. The main problem is that the mgr returns 0 even on errors, see:

[root@rook-ceph-crashcollector-compute-1-66bdfbd886-d2zcd /]# ceph -n client.crash crash post -i /var/lib/ceph/crash/2021-06-28T07\:47\:37.859766Z_6ffb119c-930e-4047-9cfa-a92af783cdd0/meta
malformed crash metadata: time data '2021-06-28T07:47:37.859766' does not match format '%Y-%m-%d %H:%M:%S.%f'
[root@rook-ceph-crashcollector-compute-1-66bdfbd886-d2zcd /]# echo $?
0

So the crash-collector script cannot detect the failure, since it only checks the return code. There are two things to fix: the return code, and the timestamp format error in the crash metadata. The timestamp used to be "2021-06-28 07:03:19.750123Z" and is now "2021-06-28T07:47:37.859766".
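For illustration, a minimal Python sketch of the parsing mismatch (this is not the actual mgr crash module code; the format string and timestamps are taken from the output above, and the trailing 'Z' appears to be stripped before parsing, judging by the error message):

from datetime import datetime

# Format string from the error message above.
FMT = '%Y-%m-%d %H:%M:%S.%f'

old_style = '2021-06-28 07:03:19.750123'   # pre-4.7.2 metadata timestamp (space separator)
new_style = '2021-06-28T07:47:37.859766'   # 4.7.2 metadata timestamp ('T' separator)

datetime.strptime(old_style, FMT)  # parses fine

try:
    datetime.strptime(new_style, FMT)
except ValueError as err:
    # ValueError: time data '2021-06-28T07:47:37.859766' does not match
    # format '%Y-%m-%d %H:%M:%S.%f'
    # 'ceph crash post' prints this error but still exits with status 0,
    # so the crash-collector, which only checks the exit code, never
    # notices the failure and never retries.
    print(err)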
logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1976980/
Did the manager change the way it handles errors?
The Ceph BZ is targeted for 4.3; AFAIK it won't be available in the 4.8 time frame. Moving this out of 4.8, please bring it back if someone thinks otherwise. Once 4.3 is released, this should get fixed.
*** Bug 1977952 has been marked as a duplicate of this bug. ***
Bringing it back to 4.8 as we are going to use the 4.2z2 sync with 4.8, which has the fix for Bug #1977271.
Bug fixed.

Setup:
OCP version: 4.8.0-0.nightly-2021-07-01-185624
OCS version: ocs-operator.v4.8.0-444.ci
LSO version: local-storage-operator.4.7.0-202102110027.p0
Provider: vmware

sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.11-183.el8cp (07313a094b455c7834e137d63da57830ef66b672) nautilus (stable)": 10
    }
}

Test Procedure:
1. Get the worker node where osd-0 is running -> compute-1
$ oc get pods -o wide | grep osd-0
rook-ceph-osd-0-84fc47b944-fm64f   2/2   Running   0   12m   10.128.4.39   compute-1   <none>   <none>

2. Log in to the worker node
$ oc debug node/compute-1
sh-4.4# chroot /host
sh-4.4# bash
[root@compute-1 /]#

3. Get the ceph-osd pid:
[root@compute-1 /]# pidof ceph-osd
2296152

4. Kill the process:
[root@compute-1 /]# kill -11 2296152

5. Check coredumpctl list on compute-1:
[root@compute-1 /]# coredumpctl list
TIME                            PID   UID   GID SIG COREFILE  EXE
Tue 2021-07-06 07:36:01 UTC 2296152   167   167  11 present   /usr/bin/ceph-osd

6. Check the content of the "/var/lib/rook/openshift-storage/crash/posted" dir:
[root@compute-1 2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a]# pwd
/var/lib/rook/openshift-storage/crash/posted/2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a
[root@compute-1 2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a]# cat log | grep osd.0
    -1> 2021-07-06 07:35:58.915 7f5b0a369700  5 osd.0 438 heartbeat osd_stat(store_statfs(0x189c000000/0x40000000/0x1900000000, data 0x150bba85/0x23ff0000, compress 0x0/0x0/0x0, omap 0x3699e, meta 0x3ffc9662), peers [1,2] op hist [])

7. Check the crash list [the new entry is now present]:
$ oc rsh rook-ceph-tools-98ffbd8c-jkgfz
sh-4.4# ceph crash ls
ID                                                                ENTITY  NEW
2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a  osd.0   *
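As a sanity check on the fix, a small sketch that reads the posted crash metadata on the node and confirms its timestamp is back in the format the mgr accepts. The path is taken from step 6 above; the meta file being JSON with a 'timestamp' field, and the trailing 'Z' being stripped before parsing, are assumptions based on comment 6:

import json
from datetime import datetime

# Crash directory from step 6 above (hostPath-mounted crash dir on the node).
META = ('/var/lib/rook/openshift-storage/crash/posted/'
        '2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a/meta')

with open(META) as f:
    meta = json.load(f)

# Format the mgr expects, from the error message in comment 6.
ts = meta['timestamp'].rstrip('Z')
datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f')  # raises ValueError if still malformed
print('timestamp parses OK:', meta['timestamp'])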
(In reply to Oded from comment #16)
Step 7: Check the crash list
$ oc rsh rook-ceph-tools-98ffbd8c-jkgfz
sh-4.4# ceph crash ls
ID                                                                ENTITY  NEW
2021-07-06_07:35:58.994012Z_ca52cded-f164-441e-8429-f065b352e18a  osd.0   *

Step 8: Check ceph status
sh-4.4# ceph status
  cluster:
    id:     266208d6-a2f3-4574-b41d-92285e6df3c1
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,e,g (age 24m)
    mgr: a(active, since 24m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 9m), 3 in (since 41h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 3.32k objects, 422 MiB
    usage:   4.7 GiB used, 295 GiB / 300 GiB avail
    pgs:     176 active+clean

  io:
    client:   1.3 KiB/s rd, 1.2 KiB/s wr, 2 op/s rd, 0 op/s wr

sh-4.4# ceph health detail
HEALTH_WARN 1 daemons have recently crashed
RECENT_CRASH 1 daemons have recently crashed
    osd.0 crashed on host rook-ceph-osd-0-84fc47b944-fm64f at 2021-07-06 07:35:58.994012Z

Step 9: Remove records of failures
sh-4.4# ceph crash archive-all

Step 10: Check ceph status
sh-4.4# ceph health detail
HEALTH_WARN Long heartbeat ping times on back interface seen, longest is 2295.210 msec; Long heartbeat ping times on front interface seen, longest is 2503.655 msec
OSD_SLOW_PING_TIME_BACK Long heartbeat ping times on back interface seen, longest is 2295.210 msec
    Slow heartbeat ping on back interface from osd.0 to osd.1 2295.210 msec
OSD_SLOW_PING_TIME_FRONT Long heartbeat ping times on front interface seen, longest is 2503.655 msec
    Slow heartbeat ping on front interface from osd.0 to osd.1 2503.655 msec
sh-4.4# ceph health detail
HEALTH_OK
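A small sketch of steps 9-10 as a scripted check. The pod name comes from the output above; the JSON layout of `ceph health detail --format json` (a top-level 'checks' map keyed by check name such as RECENT_CRASH) is an assumption based on common ceph CLI behaviour:

import json
import subprocess

TOOLS_POD = 'rook-ceph-tools-98ffbd8c-jkgfz'  # tools pod name from the output above

def ceph(*args):
    # Run a ceph command inside the toolbox pod and return its stdout.
    return subprocess.check_output(['oc', 'rsh', TOOLS_POD, 'ceph', *args])

# Step 9: archive all crash reports so they stop raising RECENT_CRASH.
ceph('crash', 'archive-all')

# Step 10: verify the RECENT_CRASH warning is gone from the health checks.
health = json.loads(ceph('health', 'detail', '--format', 'json'))
assert 'RECENT_CRASH' not in health.get('checks', {}), health['checks']
print('health status:', health.get('status'))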