Description of problem (please be as detailed as possible and provide log snippets):

OSDs crash in the long-running RDR cluster.

Version of all relevant components (if applicable):

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Keep the workload running for several weeks (one month in this scenario)
2. Observe the OSD crash

Additional info:

-> Did not perform any DR operations

```
ceph crash ls
ID                                                                ENTITY  NEW
2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520  osd.1   *
2023-07-14T20:05:04.597812Z_42744795-b51a-4169-82f6-f29639d2e150  osd.0   *

ceph crash info 2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7fa4b6eeadf0]",
        "/lib64/libc.so.6(+0x9c560) [0x7fa4b6f32560]",
        "pthread_mutex_lock()",
        "(PG::lock(bool) const+0x2b) [0x559763bee91b]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x45d) [0x559763bc5dfd]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x2a3) [0x5597640e7ad3]",
        "ceph-osd(+0xa89074) [0x5597640e8074]",
        "/lib64/libc.so.6(+0x9f802) [0x7fa4b6f35802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fa4b6ed5450]"
    ],
    "ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
    "crash_id": "2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520",
    "entity_name": "osd.1",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-osd",
    "stack_sig": "5c7afd3067dc17bd22ffd5987b09913e4018bf079244d12c2db1c472317a24d8",
    "timestamp": "2023-07-14T20:04:57.339567Z",
    "utsname_hostname": "rook-ceph-osd-1-5f946675bc-hhjwk",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.16.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu May 18 19:03:13 EDT 2023"
}
```

Coredumps for the nodes are included in the must-gather:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/must-gather.local.1781396938003127686/quay-io-rhceph-dev-ocs-must-gather-sha256-9ce39944596cbc4966404fb1ceb24be21093a708b1691e78453ab1b9a7a10f7b/ceph/

Complete must-gather logs:
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c1/
c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/
hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/hub/

Live setup is available for debugging:
c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25313/
c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25312/
hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25311/
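For reference, a minimal sketch of how the crash and coredumps above could be inspected, assuming the matching ceph-osd binary and debuginfo for the 17.2.6-70.0.TEST build are available; the core file name below is illustrative, not taken from the must-gather:

```
# Crash metadata straight from the cluster (toolbox pod)
ceph crash ls
ceph crash info 2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520

# Open a downloaded coredump against the matching ceph-osd binary
# (replace core.ceph-osd.example with the actual core file from the must-gather)
gdb /usr/bin/ceph-osd core.ceph-osd.example
(gdb) thread apply all bt    # backtraces for every thread
(gdb) bt full                # locals in the crashing thread
```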
Ref BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2098118#c73
@akupczyk We have been using build 6d74fefa15d1216867d1d112b47bb83c4913d28f, which has all the changes required as part of bluestore-rdr and doesn't require any configurables to be set explicitly. Please refer to this gchat for the related conversation - https://chat.google.com/room/AAAAqWkMm2s/Z1yykj7ae4Q Ref BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2119217#c62
@akupczyk @pdhange A similar OSD crash has been reported in https://bugzilla.redhat.com/show_bug.cgi?id=2150996, and the node logs also show similarities:
```
Jul 14 20:05:13.055775 compute-0 kubenswrapper[2594]: E0714 20:05:13.055770 2594 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"manager\" with CreateContainerConfigError: \"failed to sync configmap cache: timed out waiting for the condition\"" pod="openshift-storage/odf-operator-controller-manager-865ddd6bc9-2pxqj" podUID=428b9bed-0d53-485d-9a8e-b2a89e0f6f2d

Jul 14 20:05:14.931673 compute-0 kubenswrapper[2594]: I0714 20:05:14.924300 2594 prober.go:109] "Probe failed" probeType="Liveness" pod="openshift-storage/rook-ceph-osd-1-5f946675bc-hhjwk" podUID=83dd39d9-8a60-4128-a65f-4df39eb3e199 containerName="osd" probeResult=failure output="command timed out"

Jul 14 19:53:27.845679 compute-0 kubenswrapper[2594]: E0714 19:53:27.845158 2594 desired_state_of_world_populator.go:312] "Error processing volume" err="error processing PVC openshift-storage/ocs-deviceset-thin-csi-1-data-07lvbj: failed to fetch PVC from API server: Get \"https://api-int.kmanohar-clu2.qe.rh-ocs.com:6443/api/v1/namespaces/openshift-storage/persistentvolumeclaims/ocs-deviceset-thin-csi-1-data-07lvbj\": dial tcp: lookup api-int.kmanohar-clu2.qe.rh-ocs.com on 10.10.160.1:53: read udp 10.1.114.115:46108->10.10.160.1:53: i/o timeout" pod="openshift-storage/rook-ceph-osd-1-5f946675bc-hhjwk" volumeName="ocs-deviceset-thin-csi-1-data-07lvbj"
```
Could these two issues be related, or are these node log entries simply due to slowness in the disk?
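If disk slowness is a suspect, here is a minimal sketch of checks that could be run on the live setup to help rule it in or out; the node name is taken from the kubelet logs above, and the `oc debug` invocation is an assumption about how the node is reached in this environment:

```
# Cluster-side view of per-OSD commit/apply latency
ceph osd perf

# Any currently reported slow ops or health warnings
ceph health detail

# Device-level latency on the node hosting osd.1
# (compute-0 from the logs above; chroot assumes an OpenShift debug pod)
oc debug node/compute-0 -- chroot /host iostat -dx 5 3
```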
Prasanth - are you saying that the investigation in this bug cannot move forward without the logs you requested in the previous comment, or are you just unsure whether the crash is related to https://bugzilla.redhat.com/show_bug.cgi?id=2150996? I would like to know who owns the next action on this bug.
@pdhange As I mentioned in the description, coredumps were generated for the OSD crash:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/must-gather.local.1781396938003127686/quay-io-rhceph-dev-ocs-must-gather-sha256-9ce39944596cbc4966404fb1ceb24be21093a708b1691e78453ab1b9a7a10f7b/ceph/
Are you referring to those, or are additional logs required? Could you elaborate a little on this?