Bug 2214499 - [Tracker for https://bugzilla.redhat.com/show_bug.cgi?id=2266035] ceph-client.admin crashed in ceph-exporter thread with "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Divyansh Kamboj
QA Contact: Nagendra Reddy
URL:
Whiteboard:
Duplicates: 2255648 2269122
Depends On: 2266035
Blocks:
 
Reported: 2023-06-13 07:19 UTC by Prasad Desala
Modified: 2024-11-15 04:25 UTC
CC: 20 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2266035
Environment:
Last Closed: 2024-07-17 13:10:53 UTC
Embargoed:




Links:
Github ceph ceph pull 55773 (open): exporter: handle exceptions gracefully (last updated 2024-05-09 07:30:38 UTC)
Red Hat Product Errata RHSA-2024:4591 (last updated 2024-07-17 13:10:58 UTC)

Description Prasad Desala 2023-06-13 07:19:22 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
===========================================================================
On an ODF cluster with the following cluster-level parameters enabled:
FIPS
Hugepages
KMS - vault
Cluster-wide encryption
Encryption in transit

The ceph-client.admin crashed in the ceph-exporter thread with the error message "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" while filling the cluster to the near full ratio.

Please note that the cluster had not reached the near full ratio at the time of the crash. The OSDs were filled up to 62.2%.

backtrace:

2023-06-12T20:38:44.709+0000 7f9898b80e80 -1 *** Caught signal (Aborted) **
 in thread 7f9898b80e80 thread_name:ceph-exporter

 ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f9899274df0]
 2: /lib64/libc.so.6(+0xa154c) [0x7f98992c154c]
 3: raise()
 4: abort()
 5: /lib64/libstdc++.so.6(+0xa1a01) [0x7f98994e5a01]
 6: /lib64/libstdc++.so.6(+0xad37c) [0x7f98994f137c]
 7: /lib64/libstdc++.so.6(+0xad3e7) [0x7f98994f13e7]
 8: /lib64/libstdc++.so.6(+0xad649) [0x7f98994f1649]
 9: ceph-exporter(+0x2a218) [0x557c40c98218]
 10: (boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]
 11: ceph-exporter(+0x683c7) [0x557c40cd63c7]
 12: (DaemonMetricCollector::dump_asok_metrics()+0x22c8) [0x557c40cbae98]
 13: ceph-exporter(+0x41230) [0x557c40caf230]
 14: ceph-exporter(+0x60b7d) [0x557c40cceb7d]
 15: ceph-exporter(+0xace0f) [0x557c40d1ae0f]
 16: (DaemonMetricCollector::main()+0x212) [0x557c40c9dc32]
 17: main()
 18: /lib64/libc.so.6(+0x3feb0) [0x7f989925feb0]
 19: __libc_start_main()
 20: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-06-09-152551
ODF: 4.13.0-218

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Ceph health went to a warning state; other than that, I did not observe any functional impact, at least from the positive scenarios.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Reporting at the first hit

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
===================
1) Create an ODF cluster with the following cluster-level parameters enabled: FIPS, Hugepages, KMS (Vault), cluster-wide encryption, and encryption in transit

2) Run automated system test - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/system_test/test_cluster_full_and_recovery.py

The automated test executes the following steps:
a) Create PVC1 [FS + RBD]
b) Verify new PVC1 [FS + RBD] on Bound state
c) Run FIO on PVC1_FS + PVC1_RBD
d) Calculate Checksum PVC1_FS + PVC1_RBD
e) Fill the cluster to “Full ratio” (usually 85%) using the benchmark-operator

Actual results:
===============
ceph-client.admin crashed in ceph-exporter thread with "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" while filling the cluster to the near full ratio.


Expected results:
=================
No crashes should be observed during IO write operations.

Comment 5 avan 2023-06-13 12:14:24 UTC
Hi @Prasad

Comment 15 Mudit Agarwal 2023-08-17 05:39:52 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2232226, but that bug is being hit during an upgrade from 4.13 to 4.14.
4.14 is using the ceph build which has the exporter changes, but I can see the same crash there.

{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f122f9a0df0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f122f9ed54c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7f122fc11a01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7f122fc1d37c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7f122fc1d3e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7f122fc1d649]",
        "ceph-exporter(+0x29767) [0x565474e04767]",
        "(boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x565474e18267]",
        "ceph-exporter(+0x65947) [0x565474e40947]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1de7) [0x565474e209e7]",
        "ceph-exporter(+0x45e20) [0x565474e20e20]",
        "ceph-exporter(+0x5caed) [0x565474e37aed]",
        "ceph-exporter(+0xab6df) [0x565474e866df]",
        "(DaemonMetricCollector::main()+0x212) [0x565474e0abf2]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7f122f98beb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-105.el9cp",
    "crash_id": "2023-08-14T23:05:06.978999Z_273bd10e-1d27-4e80-ab3c-838c6c5a9519",
    "entity_name": "client.admin",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-exporter",
    "stack_sig": "03972c98be910d1ce25645fdd11917d43497d8e45963b63cf072b005e7daee44",
    "timestamp": "2023-08-14T23:05:06.978999Z",
    "utsname_hostname": "rook-ceph-exporter-compute-0-68fdf6c8b5-rbdqb",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.25.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023"
}

We can have the setup available whenever you want, as this is always reproducible.

Comment 16 Blaine Gardner 2023-08-29 15:22:06 UTC
@athakkar it looks like this is still reproducible. Since this is a bug for 4.14, please take a look with appropriate priority.

Comment 17 Mudit Agarwal 2023-08-30 04:45:49 UTC
Can we try with 4.14.0-117? It should be fixed there.

Comment 28 Mudit Agarwal 2024-02-06 12:43:52 UTC
Marking it a blocker for now

Comment 30 Divyansh Kamboj 2024-02-13 05:20:56 UTC
From what I see, 4.15 already has the fix for the crash backported: https://github.com/red-hat-storage/rook/blob/3e3dba07d1fd6f730b856f2175de238fac6e6b5a/pkg/operator/ceph/cluster/nodedaemon/exporter.go#L123C1-L124C1
The cause looks similar, though.

Comment 31 Mudit Agarwal 2024-02-13 05:40:38 UTC
Nagendra, please try again with the latest ODF build

Comment 37 Divyansh Kamboj 2024-02-22 11:19:55 UTC
The issue is that ceph-exporter doesn't handle exceptions while parsing JSON, so we need to add try-catch blocks to mitigate these crashes and report the error in the log instead. I'm working on unit tests to verify that the fix works.
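
For illustration only, a minimal C++ sketch of the kind of guard described above, assuming Boost.JSON is available; the function name parse_asok_payload and the logging line are hypothetical and not the actual ceph-exporter code or the upstream patch. The idea is that a bad or partial payload is logged and skipped instead of aborting the whole process:

#include <boost/json.hpp>
#include <iostream>
#include <string>

namespace json = boost::json;

// Hypothetical sketch: parse one admin-socket payload defensively.
// Any parse or type-access exception is logged and an empty object is
// returned, so the caller skips this sample instead of crashing.
json::object parse_asok_payload(const std::string& payload)
{
    try {
        json::value v = json::parse(payload);  // throws on malformed JSON
        return v.as_object();                  // throws if the top level is not an object
    } catch (const std::exception& e) {
        std::cerr << "ceph-exporter: failed to parse asok metrics: "
                  << e.what() << std::endl;
        return json::object{};
    }
}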

Comment 40 Travis Nielsen 2024-02-28 15:08:39 UTC
(In reply to Divyansh Kamboj from comment #37)
> The issue is that ceph-exporter doesn't handle exceptions while parsing
> JSON, so we need to add try-catch blocks to mitigate these crashes and
> report the error in the log instead. I'm working on unit tests to verify
> that the fix works.

Why are we getting invalid JSON? Is there some other underlying issue?

Comment 41 Divyansh Kamboj 2024-03-01 08:28:26 UTC
> Why are we getting invalid json? Is there some other underlying issue?

The JSON is "invalid" only in a loose sense: the function that parses the JSON into `object` and `array` throws an exception when certain data points are not populated. Hence the try-catch blocks to catch those exceptions.
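
As a standalone illustration of that behaviour (the payload below is made up, not real exporter output): calling as_array() on a data point that is present but null throws, in the Boost.JSON version from the backtrace, std::invalid_argument via boost::json::detail::throw_invalid_argument, which is the frame seen in the crash. A try-catch around the access turns that into a handled error:

#include <boost/json.hpp>
#include <iostream>

namespace json = boost::json;

int main()
{
    // Made-up payload where an expected data point is present but null.
    json::value v = json::parse(R"({"name": "osd.0", "counters": null})");

    try {
        // as_array() on a null value throws (older Boost.JSON raises
        // std::invalid_argument via detail::throw_invalid_argument).
        const json::array& counters = v.as_object().at("counters").as_array();
        std::cout << "counters: " << counters.size() << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "unpopulated data point, skipping: " << e.what() << std::endl;
    }
    return 0;
}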

Comment 42 krishnaram Karthick 2024-04-01 10:30:24 UTC
Bug in NEW/ASSIGNED state. 
Moving the bug to 4.15.3 for a decision on RCA/FIX.

Comment 43 Brad Hubbard 2024-05-02 03:56:26 UTC
*** Bug 2269122 has been marked as a duplicate of this bug. ***

Comment 44 Brad Hubbard 2024-05-02 03:58:05 UTC
*** Bug 2255648 has been marked as a duplicate of this bug. ***

Comment 46 krishnaram Karthick 2024-05-02 11:34:40 UTC
Moving the bug to 4.15.4 as we have reached the limit on bugs intake for 4.15.3

Comment 48 Divyansh Kamboj 2024-05-09 12:09:47 UTC
Upstream PR https://github.com/ceph/ceph/pull/55773 has been merged; the patch is being backported to 7.1.
Relevant BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2266035

Comment 50 Divyansh Kamboj 2024-05-13 17:13:50 UTC
It just got backported to Ceph 7.1 downstream. AFAIK 4.16 uses 7.1, so can we mark it MODIFIED for 4.16? WDYT @sheggodu

Comment 67 errata-xmlrpc 2024-07-17 13:10:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 68 Red Hat Bugzilla 2024-11-15 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

