Bug 2250226 - [ODF-4.14.z][CEPH bug 2249814 tracker] Health Warn after upgrade to 4.13.5-6 - 1 daemons have recently crashed
Summary: [ODF-4.14.z][CEPH bug 2249814 tracker] Health Warn after upgrade to 4.13.5...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.14.5
Assignee: Venky Shankar
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-11-17 06:55 UTC by Sunil Kumar Acharya
Modified: 2024-02-29 09:13 UTC
CC: 5 users

Fixed In Version: 4.14.5-3
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-02-29 09:13:12 UTC
Embargoed:




Links:
Red Hat Product Errata RHBA-2024:1043 (last updated 2024-02-29 09:13:14 UTC)

Description Sunil Kumar Acharya 2023-11-17 06:55:48 UTC
This bug was initially created as a copy of Bug #2249844

I am copying this bug because: 



Description of problem (please be as detailed as possible and provide log
snippets):
After upgrading from 4.12 to 4.13.5-6 (both OCP and ODF were upgraded),
we see a Ceph health warning:

sh-5.1$ ceph status
  cluster:
    id:     68dc565f-f700-4312-93be-265b7ed15941
    health: HEALTH_WARN
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,c (age 78m)
    mgr: a(active, since 77m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 77m), 3 in (since 2h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 185 pgs
    objects: 1.05k objects, 2.0 GiB
    usage:   5.9 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     185 active+clean
 
  io:
    client:   1.4 KiB/s rd, 134 KiB/s wr, 2 op/s rd, 2 op/s wr
 
sh-5.1$ ceph crash ls
ID                                                                ENTITY  NEW
2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a  mgr.a    *
sh-5.1$ ceph crash info 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f7c91f2bdf0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f7c91f7854c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7f7c92279a01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7f7c9228537c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7f7c922853e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7f7c92285649]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x170d39) [0x7f7c9256fd39]",
        "(SnapRealmInfo::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x3b) [0x7f7c926a7f4b]",
        "/lib64/libcephfs.so.2(+0xaaec7) [0x7f7c86c43ec7]",
        "/lib64/libcephfs.so.2(+0xacc59) [0x7f7c86c45c59]",
        "/lib64/libcephfs.so.2(+0xadf10) [0x7f7c86c46f10]",
        "/lib64/libcephfs.so.2(+0x929e8) [0x7f7c86c2b9e8]",
        "(DispatchQueue::entry()+0x53a) [0x7f7c9272defa]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x3bab31) [0x7f7c927b9b31]",
        "/lib64/libc.so.6(+0x9f802) [0x7f7c91f76802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f7c91f16450]"
    ],
    "ceph_version": "17.2.6-148.el9cp",
    "crash_id": "2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a",
    "entity_name": "mgr.a",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mgr",
    "stack_sig": "4cb0911c06087a31d9752535de90ba18fd7aab25c037945b2c61f584dcf6a6db",
    "timestamp": "2023-11-15T08:10:44.427601Z",
    "utsname_hostname": "rook-ceph-mgr-a-5d475468dd-wzhmt",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.40.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Nov 1 10:30:09 EDT 2023"
}
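
For reference, once the crash has been triaged (or a fixed build is running), the
HEALTH_WARN itself can be cleared by archiving the crash entry. This is standard
Ceph crash-module usage rather than a fix for the underlying bug; the crash ID is
the one reported above:

sh-5.1$ ceph crash ls-new     # crashes still counted by the health warning
sh-5.1$ ceph crash archive 2023-11-15T08:10:44.427601Z_b4fd4568-7eb7-4508-ab38-58e561dc809a
sh-5.1$ ceph health           # returns HEALTH_OK once no new crashes remain
sh-5.1$ ceph crash archive-all   # alternative: archive every pending crash at once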


Discussed here:
https://chat.google.com/room/AAAAREGEba8/fZvCCW1MQfU

Venky pointed out that it smells like this issue:
https://tracker.ceph.com/issues/63188
BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=2247174

Venky cloned the 7.0 BZ to 6.1z4 target - https://bugzilla.redhat.com/show_bug.cgi?id=2249814
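
Since the backtrace dies in SnapRealmInfo::decode inside ceph-mgr, a quick way to
check whether a given cluster is running a build that contains the fix is to compare
the running daemon versions against the fixed-in version on this BZ. A minimal
sketch using the standard Ceph CLI:

sh-5.1$ ceph versions                     # JSON map of running version per daemon type
sh-5.1$ ceph versions | grep -A2 '"mgr"'  # the crashing daemon here is ceph-mgr
# Compare the reported 17.2.6-NNN.el9cp build against the fixed-in version
# on this BZ (4.14.5-3; see also the RHCS trackers linked above).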

Version of all relevant components (if applicable):
ODF 4.13.5-6 

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Trying to reproduce here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/32/

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install ODF 4.12 and OCP 4.12
2. Upgrade OCP to 4.13
3. Upgrade ODF to 4.13.5-6 build
4. After some time the health warning appears (a verification sketch follows below)
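
A quick post-upgrade check for step 4 (a sketch; it assumes the rook-ceph toolbox
has been enabled in the openshift-storage namespace, which is not shown in this
report):

$ oc rsh -n openshift-storage deploy/rook-ceph-tools
sh-5.1$ ceph health detail    # the warning shows up as a RECENT_CRASH health check
sh-5.1$ ceph crash ls         # new (unarchived) crashes are flagged '*' in the NEW column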


Actual results:
The cluster reports HEALTH_WARN: "1 daemons have recently crashed"

Expected results:
HEALTH_OK, with no health warning after the upgrade


Additional info:
Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-031vue1cslv33-uba/j-031vue1cslv33-uba_20231115T053551/logs/testcases_1700036781/j-031vue1cslv33-u/
Job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-encryption-1az-rhcos-vsan-lso-vmdk-3m-3w-upgrade-ocp-ocs-auto/31/

Comment 2 Mudit Agarwal 2024-01-25 04:21:09 UTC
This is already fixed in RHCS 6.1z3

Comment 12 errata-xmlrpc 2024-02-29 09:13:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.5 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:1043

