Bug 2224356

Summary: [RDR] Ceph OSD crashed on the long-running RDR cluster
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: kmanohar
Component: ceph
ceph sub component: RADOS
Assignee: Adam Kupczyk <akupczyk>
QA Contact: Elad <ebenahar>
Docs Contact:
Status: NEW
Severity: high
Priority: unspecified
CC: akupczyk, amagrawa, bniver, kramdoss, muagarwa, nojha, odf-bz-bot, pdhange, sostapov
Version: 4.13
Flags: kmanohar: needinfo? (akupczyk); kmanohar: needinfo? (akupczyk); kmanohar: needinfo? (pdhange); pdhange: needinfo? (kmanohar)
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description kmanohar 2023-07-20 14:06:15 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
 
OSDs crashed on the long-running RDR cluster

Version of all relevant components (if applicable):


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Keep the workload running for several weeks (one month in this scenario)
2. Observe the OSD crash (see the monitoring sketch after these steps)
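
For reference, this is roughly how new crashes can be watched for during such a long-running workload. This is only a sketch, assuming the ceph CLI is reachable (for example from the rook-ceph toolbox pod); `ceph crash ls-new` lists crashes that have not been archived yet.

```
# Sketch: poll for new (unarchived) crash reports once an hour.
# Assumes the ceph CLI is available, e.g. inside the rook-ceph toolbox pod.
while true; do
    date
    ceph crash ls-new || echo "ceph CLI not reachable"
    sleep 3600
done
```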


Additional info:
-> Did not perform any DR operations


ceph crash ls
ID                                                                ENTITY  NEW         
2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520  osd.1    *   
2023-07-14T20:05:04.597812Z_42744795-b51a-4169-82f6-f29639d2e150  osd.0    *  

ceph crash info 2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7fa4b6eeadf0]",
        "/lib64/libc.so.6(+0x9c560) [0x7fa4b6f32560]",
        "pthread_mutex_lock()",
        "(PG::lock(bool) const+0x2b) [0x559763bee91b]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x45d) [0x559763bc5dfd]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x2a3) [0x5597640e7ad3]",
        "ceph-osd(+0xa89074) [0x5597640e8074]",
        "/lib64/libc.so.6(+0x9f802) [0x7fa4b6f35802]",
        "/lib64/libc.so.6(+0x3f450) [0x7fa4b6ed5450]"
    ],
    "ceph_version": "17.2.6-70.0.TEST.bz2119217.el9cp",
    "crash_id": "2023-07-14T20:04:57.339567Z_cc9c176b-3991-4b92-aec1-dbf1dee98520",
    "entity_name": "osd.1",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-osd",
    "stack_sig": "5c7afd3067dc17bd22ffd5987b09913e4018bf079244d12c2db1c472317a24d8",
    "timestamp": "2023-07-14T20:04:57.339567Z",
    "utsname_hostname": "rook-ceph-osd-1-5f946675bc-hhjwk",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.16.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu May 18 19:03:13 EDT 2023"
}
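
The frames above (pthread_mutex_lock inside PG::lock, called from OSD::ShardedOpWQ::_process with a heartbeat handle) suggest an op worker thread that was stuck on a PG lock. A rough way to check whether the OSD's internal heartbeat/suicide timeouts fired around that time is sketched below; the log file name and message strings are assumptions and may need adjusting for the rook-ceph pod logs.

```
# Sketch: search the OSD log for op-thread heartbeat/suicide timeout messages
# around the crash time (log file name is a placeholder).
grep -E "heartbeat_map|timed out|suicide" ceph-osd.1.log | grep "2023-07-14T20:0"

# Current op worker thread timeouts (standard Ceph config options).
ceph config get osd osd_op_thread_timeout
ceph config get osd osd_op_thread_suicide_timeout
```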

Coredumps are included in the must-gather for the nodes:

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/must-gather.local.1781396938003127686/quay-io-rhceph-dev-ocs-must-gather-sha256-9ce39944596cbc4966404fb1ceb24be21093a708b1691e78453ab1b9a7a10f7b/ceph/
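
In case it speeds up the analysis, this is roughly how a fuller per-thread backtrace can be pulled out of one of those coredumps. It is only a sketch: the core file name is a placeholder, the binary path assumes the matching RHCS container image, and the corresponding ceph debuginfo packages need to be installed for usable symbols.

```
# Sketch: dump all thread backtraces from a ceph-osd coredump.
# "core.ceph-osd.example" is a placeholder for a core file from the must-gather.
gdb /usr/bin/ceph-osd core.ceph-osd.example \
    -ex "set pagination off" \
    -ex "thread apply all bt" \
    -ex "quit" > osd1-crash-threads.txt
```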


Complete must-gather logs:

c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c1/

c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/

hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/hub/


Live setup is available for debugging

c1 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25313/

c2 - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25312/

hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/25311/

Comment 2 kmanohar 2023-07-20 14:11:06 UTC
Ref BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2098118#c73

Comment 5 kmanohar 2023-07-26 08:17:55 UTC
@akupczyk 
We have been using build 6d74fefa15d1216867d1d112b47bb83c4913d28f, which has all the changes required as part of bluestore-rdr and doesn't require any configurables to be set explicitly.
Please refer to this gchat for the related conversation - https://chat.google.com/room/AAAAqWkMm2s/Z1yykj7ae4Q

Ref BZ - https://bugzilla.redhat.com/show_bug.cgi?id=2119217#c62
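
If it helps to rule out a version mismatch, the build actually running on the OSDs can be confirmed from the toolbox with standard commands (a sketch; output formats vary slightly by release):

```
# Sketch: confirm the running ceph build and summarize recorded crashes.
ceph versions      # OSDs should report 17.2.6-70.0.TEST.bz2119217.el9cp
ceph crash stat    # quick count of crashes recorded by the crash module
```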

Comment 6 kmanohar 2023-07-28 13:12:36 UTC
@akupczyk @pdhange 
A similar OSD crash has been reported in https://bugzilla.redhat.com/show_bug.cgi?id=2150996, and the node logs also show similarities:

```
Jul 14 20:05:13.055775 compute-0 kubenswrapper[2594]: E0714 20:05:13.055770    2594 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"manager\" with CreateContainerConfigError: \"failed to sync configmap cache: timed out waiting for the condition\"" pod="openshift-storage/odf-operator-controller-manager-865ddd6bc9-2pxqj" podUID=428b9bed-0d53-485d-9a8e-b2a89e0f6f2d
Jul 14 20:05:14.931673 compute-0 kubenswrapper[2594]: I0714 20:05:14.924300    2594 prober.go:109] "Probe failed" probeType="Liveness" pod="openshift-storage/rook-ceph-osd-1-5f946675bc-hhjwk" podUID=83dd39d9-8a60-4128-a65f-4df39eb3e199 containerName="osd" probeResult=failure output="command timed out"


Jul 14 19:53:27.845679 compute-0 kubenswrapper[2594]: E0714 19:53:27.845158    2594 desired_state_of_world_populator.go:312] "Error processing volume" err="error processing PVC openshift-storage/ocs-deviceset-thin-csi-1-data-07lvbj: failed to fetch PVC from API server: Get \"https://api-int.kmanohar-clu2.qe.rh-ocs.com:6443/api/v1/namespaces/openshift-storage/persistentvolumeclaims/ocs-deviceset-thin-csi-1-data-07lvbj\": dial tcp: lookup api-int.kmanohar-clu2.qe.rh-ocs.com on 10.10.160.1:53: read udp 10.1.114.115:46108->10.10.160.1:53: i/o timeout" pod="openshift-storage/rook-ceph-osd-1-5f946675bc-hhjwk" volumeName="ocs-deviceset-thin-csi-1-data-07lvbj"

```
Could these two issues be related, or are these node log entries simply due to slowness in the disk?
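
For the disk-slowness theory, one quick check from the Ceph side could look like the sketch below (standard commands only; the admin-socket call has to be run inside the osd pod or toolbox with socket access):

```
# Sketch: look for slow/blocked ops and high per-OSD latency that would point at slow disks.
ceph health detail | grep -i slow
ceph osd perf                         # per-OSD commit/apply latency
ceph daemon osd.1 dump_ops_in_flight  # requires access to osd.1's admin socket
```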

Comment 9 krishnaram Karthick 2023-08-11 13:57:07 UTC
Prasanth - are you saying that the investigation of this bug cannot move forward without the logs you requested in the previous comment, or are you just unsure whether the crash is related to https://bugzilla.redhat.com/show_bug.cgi?id=2150996?

I would like to know who owns the next action on this bug.

Comment 11 kmanohar 2023-08-17 08:02:09 UTC
@pdhange As I mentioned in the description, we have coredumps generated for the OSD crashes. Are you referring to those, or are additional logs required?

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/rbd-mirror/c2/must-gather.local.1781396938003127686/quay-io-rhceph-dev-ocs-must-gather-sha256-9ce39944596cbc4966404fb1ceb24be21093a708b1691e78453ab1b9a7a10f7b/ceph/

Could you elaborate a little on this?