Bug 2250227

Summary: [ODF-4.13.z][CEPH bug 2249814 tracker] Health Warn after upgrading to 4.13.5-6 - 1 daemons have recently crashed
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sunil Kumar Acharya <sheggodu>
Component: ceph
ceph sub component: CephFS
Assignee: Venky Shankar <vshankar>
QA Contact: Elad <ebenahar>
Status: NEW
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: bniver, muagarwa, sostapov
Version: 4.13
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: 
Category: ---
Target Upstream Version:
Embargoed:

Description Sunil Kumar Acharya 2023-11-17 06:56:46 UTC
This bug was initially created as a copy of Bug #2249844

I am copying this bug because: 



Description of problem (please be as detailed as possible and provide log
snippets):
After upgrading from 4.12 to 4.13.5-6 (both OCP and ODF were upgraded), the
cluster reports a Ceph health warning:

sh-5.1$ ceph status
  cluster:
    id:     68dc565f-f700-4312-93be-265b7ed15941
    health: HEALTH_WARN
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 78m)
    mgr: a(active, since 77m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 77m), 3 in (since 2h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 1.05k objects, 2.0 GiB
    usage:   5.9 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     185 active+clean
 
  io:
    client:   1.4 KiB/s rd, 134 KiB/s wr, 2 op/s rd, 2 op/s wr
 
sh-5.1$ ceph crash ls
ID                                                                ENTITY  NEW
2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a  mgr.a    *
sh-5.1$ ceph crash info 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f7c91f2bdf0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f7c91f7854c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7f7c92279a01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7f7c9228537c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7f7c922853e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7f7c92285649]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x170d39) [0x7f7c9256fd39]",
        "(SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3b) [0x7f7c926a7f4b]",
        "/lib64/libcephfs.so.2(+0xaaec7) [0x7f7c86c43ec7]",
        "/lib64/libcephfs.so.2(+0xacc59) [0x7f7c86c45c59]",
        "/lib64/libcephfs.so.2(+0xadf10) [0x7f7c86c46f10]",
        "/lib64/libcephfs.so.2(+0x929e8) [0x7f7c86c2b9e8]",
        "(DispatchQueue::entry()+0x53a) [0x7f7c9272defa]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x3bab31) [0x7f7c927b9b31]",
        "/lib64/libc.so.6(+0x9f802) [0x7f7c91f76802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f7c91f16450]"
    ],
    "ceph_version": "17.2.6-148.el9cp",
    "crash_id": "2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a",
    "entity_name": "mgr.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mgr",
    "stack_sig": "4cb0911c06087a31d9752535de90ba18fd7aab25c037945b2c61f584dcf6a6db",
    "timestamp": "2023-11-15T08:10:44.427601Z",
    "utsname_hostname": "rook-ceph-mgr-a-5d475468dd-wzhmt",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
}
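
For completeness: once the crash report has been triaged, the warning itself can be acknowledged from the toolbox pod. This is only a sketch using the standard crash commands; archiving clears the HEALTH_WARN but does not address whatever made mgr.a abort:

sh-5.1$ ceph crash archive 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
sh-5.1$ ceph crash ls-new   # should list nothing once the crash is archived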


Discussed here:
https://chat.google.com/room/AAAAREGEba8/fZvCCW1MQfU

Venky pointed out that this looks like the same issue as:
https://tracker.ceph.com/issues/63188
BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=2247174

Venky cloned the 7.0 BZ to a 6.1z4 target: https://bugzilla.redhat.com/show_bug.cgi?id=2249814
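
Not captured in the must-gather links below, but it may be worth confirming from the toolbox that all daemons finished the upgrade and whether any further crashes occur; a minimal sketch:

sh-5.1$ ceph versions     # per-daemon versions; everything should report the upgraded build (17.2.6-148.el9cp)
sh-5.1$ ceph crash stat   # summary of crash reports, to see if more than the one mgr.a crash accumulates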

Version of all relevant components (if applicable):
ODF 4.13.5-6 

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Trying to reproduce here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/32/

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install ODF 4.12 and OCP 4.12
2. Upgrade OCP to 4.13
3. Upgrade ODF to 4.13.5-6 build
4. After some time, the cluster reports the health warning (a quick check is sketched below)
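
For step 4, a quick check from the toolbox (a sketch assuming the default openshift-storage namespace and the rook-ceph-tools deployment, as used for the output above):

$ oc rsh -n openshift-storage deploy/rook-ceph-tools
sh-5.1$ ceph health detail   # expect "HEALTH_WARN 1 daemons have recently crashed" if the bug reproduces
sh-5.1$ ceph crash ls        # the mgr crash should appear here with a NEW marker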


Actual results:
Ceph reports HEALTH_WARN ("1 daemons have recently crashed") after the upgrade.

Expected results:
No health warning after the upgrade.

Additional info:
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-031vue1cslv33-uba/j-031vue1cslv33-uba_20231115T053551/logs/testcases_1700036781/j-031vue1cslv33-u/
Job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/31/