Bug 2250227 - [ODF-4.13.z][CEPH bug 2249814 tracker] Health Warn after upgrade to 4.13.5-6 - 1 daemons have recently crashed
Summary: [ODF-4.13.z][CEPH bug 2249814 tracker] Health Warn after upgrade to 4.13.5...
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-11-17 06:56 UTC by Sunil Kumar Acharya
Modified: 2023-11-17 06:57 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Sunil Kumar Acharya 2023-11-17 06:56:46 UTC
This bug was initially created as a copy of Bug #2249844

I am copying this bug because: 



Description of problem (please be as detailed as possible and provide log
snippets):
After upgrading from 4.12 to 4.13.5-6 (both OCP and ODF were upgraded),
we see a Ceph health warning:

sh-5.1$ ceph status
  cluster:
    id:     68dc565f-f700-4312-93be-265b7ed15941
    health: HEALTH_WARN
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 78m)
    mgr: a(active, since 77m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 77m), 3 in (since 2h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 1.05k objects, 2.0 GiB
    usage:   5.9 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     185 active+clean
 
  io:
    client:   1.4 KiB/s rd, 134 KiB/s wr, 2 op/s rd, 2 op/s wr
 
sh-5.1$ ceph crash ls
ID                                                                ENTITY  NEW
2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a  mgr.a    *
sh-5.1$ ceph crash info 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f7c91f2bdf0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f7c91f7854c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7f7c92279a01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7f7c9228537c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7f7c922853e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7f7c92285649]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x170d39) [0x7f7c9256fd39]",
        "(SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3b) [0x7f7c926a7f4b]",
        "/lib64/libcephfs.so.2(+0xaaec7) [0x7f7c86c43ec7]",
        "/lib64/libcephfs.so.2(+0xacc59) [0x7f7c86c45c59]",
        "/lib64/libcephfs.so.2(+0xadf10) [0x7f7c86c46f10]",
        "/lib64/libcephfs.so.2(+0x929e8) [0x7f7c86c2b9e8]",
        "(DispatchQueue::entry()+0x53a) [0x7f7c9272defa]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x3bab31) [0x7f7c927b9b31]",
        "/lib64/libc.so.6(+0x9f802) [0x7f7c91f76802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f7c91f16450]"
    ],
    "ceph_version": "17.2.6-148.el9cp",
    "crash_id": "2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a",
    "entity_name": "mgr.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mgr",
    "stack_sig": "4cb0911c06087a31d9752535de90ba18fd7aab25c037945b2c61f584dcf6a6db",
    "timestamp": "2023-11-15T08:10:44.427601Z",
    "utsname_hostname": "rook-ceph-mgr-a-5d475468dd-wzhmt",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
}


Discussed here:
https://chat.google.com/room/AAAAREGEba8/fZvCCW1MQfU

Venky pointed out that it smells like this issue:
https://tracker.ceph.com/issues/63188
BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=2247174

Venky cloned the 7.0 BZ to 6.1z4 target - https://bugzilla.redhat.com/show_bug.cgi?id=2249814
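
Once a build containing the 6.1z4 fix is available, the running Ceph build can be
re-checked from the toolbox to confirm the fix is picked up (the crash info above shows
17.2.6-148.el9cp); a possible check, for example:

sh-5.1$ ceph versions
sh-5.1$ ceph crash info 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a | grep ceph_version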

Version of all relevant components (if applicable):
ODF 4.13.5-6 

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
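If the crash is confirmed to be the known issue above, one possible interim mitigation
(not a fix) is to archive the crash report so the RECENT_CRASH warning clears; a sketch,
run from the toolbox pod:

sh-5.1$ ceph crash archive 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
sh-5.1$ ceph crash archive-all   # alternatively, archive all new crash reports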


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Trying to reproduce here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/32/

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install ODF 4.12 and OCP 4.12
2. Upgrade OCP to 4.13
3. Upgrade ODF to 4.13.5-6 build
4. After some time the health warning appears (see the check commands below)
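
The warning was observed from the Ceph toolbox pod. A minimal check sketch, assuming the
default openshift-storage namespace and the rook-ceph-tools toolbox deployment:

$ oc -n openshift-storage rsh deploy/rook-ceph-tools
sh-5.1$ ceph status       # expect HEALTH_WARN: 1 daemons have recently crashed
sh-5.1$ ceph crash ls     # lists the new mgr.a crash shown above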


Actual results:
Ceph reports HEALTH_WARN: 1 daemons have recently crashed (the mgr.a crash shown above)

Expected results:
No health warning after the upgrade

Additional info:
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-031vue1cslv33-uba/j-031vue1cslv33-uba_20231115T053551/logs/testcases_1700036781/j-031vue1cslv33-u/
Job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/31/

