Bug 2214499 - ceph-client.admin crashed in ceph-exporter thread with "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" [NEEDINFO]
Summary: ceph-client.admin crashed in ceph-exporter thread with "throw_invalid_argumen...
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: avan
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-13 07:19 UTC by Prasad Desala
Modified: 2023-08-17 05:39 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
tnielsen: needinfo? (tdesala)



Description Prasad Desala 2023-06-13 07:19:22 UTC
Description of problem (please be as detailed as possible and provide log snippets):
===========================================================================
On an ODF cluster with the following cluster-level parameters enabled:
FIPS
Hugepages
KMS - vault
Cluster-wide encryption
Encryption in transit

The ceph-client.admin crashed in the ceph-exporter thread with the error message "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" while filling the cluster to the near full ratio.

Please note that the cluster had not reached the near full ratio at the time of the crash. The OSDs were filled up to 62.2%.

backtrace:

2023-06-12T20:38:44.709+0000 7f9898b80e80 -1 *** Caught signal (Aborted) **
 in thread 7f9898b80e80 thread_name:ceph-exporter

 ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f9899274df0]
 2: /lib64/libc.so.6(+0xa154c) [0x7f98992c154c]
 3: raise()
 4: abort()
 5: /lib64/libstdc++.so.6(+0xa1a01) [0x7f98994e5a01]
 6: /lib64/libstdc++.so.6(+0xad37c) [0x7f98994f137c]
 7: /lib64/libstdc++.so.6(+0xad3e7) [0x7f98994f13e7]
 8: /lib64/libstdc++.so.6(+0xad649) [0x7f98994f1649]
 9: ceph-exporter(+0x2a218) [0x557c40c98218]
 10: (boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]
 11: ceph-exporter(+0x683c7) [0x557c40cd63c7]
 12: (DaemonMetricCollector::dump_asok_metrics()+0x22c8) [0x557c40cbae98]
 13: ceph-exporter(+0x41230) [0x557c40caf230]
 14: ceph-exporter(+0x60b7d) [0x557c40cceb7d]
 15: ceph-exporter(+0xace0f) [0x557c40d1ae0f]
 16: (DaemonMetricCollector::main()+0x212) [0x557c40c9dc32]
 17: main()
 18: /lib64/libc.so.6(+0x3feb0) [0x7f989925feb0]
 19: __libc_start_main()
 20: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
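
For context, the frames above show the abort originating in DaemonMetricCollector::dump_asok_metrics() via boost::json::detail::throw_invalid_argument. Boost.JSON's checked accessors (as_object(), as_string(), as_double(), ...) raise std::invalid_argument through that helper when the value is not of the requested kind, in the Boost versions whose behavior matches the throw_invalid_argument frame seen here. The snippet below is only a minimal illustrative sketch of that failure mode, using a made-up counter payload; it is not the actual ceph-exporter code:

    // build (approx.): g++ -std=c++17 example.cpp -lboost_json
    #include <boost/json.hpp>
    #include <iostream>

    int main() {
        // Hypothetical admin-socket reply; the real perf-counter dump differs.
        boost::json::value v = boost::json::parse(R"({"counter": null})");
        try {
            // at() throws std::out_of_range if the key is missing;
            // as_double() calls detail::throw_invalid_argument (the frame in
            // the backtrace) because "counter" is null, not a double.
            double d = v.as_object().at("counter").as_double();
            std::cout << d << "\n";
        } catch (const std::invalid_argument& e) {
            // If such an exception is not caught, it propagates out of the
            // collector loop and the process aborts, as in the trace above.
            std::cerr << "invalid_argument: " << e.what() << "\n";
        }
    }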


Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-06-09-152551
ODF: 4.13.0-218

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Ceph health went to a warning state; other than that, I did not observe any functional impact, at least from the positive scenarios.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Reporting at the first hit.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
===================
1) Create an ODF cluster with the following cluster-level parameters enabled: FIPS, Hugepages, KMS - vault, cluster-wide encryption, and encryption in transit

2) Run automated system test - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/system_test/test_cluster_full_and_recovery.py

The automated test executes the steps below:
a) Create PVC1 [FS + RBD]
b) Verify new PVC1 [FS + RBD] on Bound state
c) Run FIO on PVC1_FS + PVC1_RBD
d) Calculate Checksum PVC1_FS + PVC1_RBD
e) Fill the cluster to “Full ratio” (usually 85%) using the benchmark-operator

Actual results:
===============
ceph-client.admin crashed in ceph-exporter thread with "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" while filling the cluster to the near full ratio.


Expected results:
=================
No crashes should be observed during IO write operations.

Comment 5 avan 2023-06-13 12:14:24 UTC
Hi @Prasad

Comment 15 Mudit Agarwal 2023-08-17 05:39:52 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2232226, but that bug is being hit during an upgrade from 4.13 to 4.14.
4.14 is using the ceph build which has the exporter changes, but I can see the same crash there.

{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f122f9a0df0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f122f9ed54c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7f122fc11a01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7f122fc1d37c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7f122fc1d3e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7f122fc1d649]",
        "ceph-exporter(+0x29767) [0x565474e04767]",
        "(boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x565474e18267]",
        "ceph-exporter(+0x65947) [0x565474e40947]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1de7) [0x565474e209e7]",
        "ceph-exporter(+0x45e20) [0x565474e20e20]",
        "ceph-exporter(+0x5caed) [0x565474e37aed]",
        "ceph-exporter(+0xab6df) [0x565474e866df]",
        "(DaemonMetricCollector::main()+0x212) [0x565474e0abf2]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7f122f98beb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-105.el9cp",
    "crash_id": "2023-08-14T23:05:06.978999Z_273bd10e-1d27-4e80-ab3c-838c6c5a9519",
    "entity_name": "client.admin",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-exporter",
    "stack_sig": "03972c98be910d1ce25645fdd11917d43497d8e45963b63cf072b005e7daee44",
    "timestamp": "2023-08-14T23:05:06.978999Z",
    "utsname_hostname": "rook-ceph-exporter-compute-0-68fdf6c8b5-rbdqb",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.25.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023"
}

We can provide the setup whenever you want, as this is always reproducible.

