Bug 2214499

Summary: ceph-client.admin crashed in ceph-exporter thread with "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]"
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Prasad Desala <tdesala>
Component: rook
Assignee: avan <athakkar>
Status: ASSIGNED
QA Contact: Prasad Desala <tdesala>
Severity: medium
Priority: unspecified
Version: 4.13
CC: athakkar, muagarwa, odf-bz-bot, sagrawal, srai, tnielsen
Target Milestone: ---
Flags: tnielsen: needinfo? (tdesala)
Target Release: ---
Hardware: Unspecified
OS: Unspecified

Description Prasad Desala 2023-06-13 07:19:22 UTC
Description of problem (please be as detailed as possible and provide log snippets):
===========================================================================
On an ODF cluster with the following cluster-level parameters enabled:
FIPS
Hugepages
KMS - vault
Cluster-wide encryption
Encryption in transit

The ceph-client.admin crashed in the ceph-exporter thread with the error message "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" while filling the cluster to the near full ratio.

Please note that the cluster had not reached the near full ratio at the time of the crash. The OSDs were filled up to 62.2%.

backtrace:

2023-06-12T20:38:44.709+0000 7f9898b80e80 -1 *** Caught signal (Aborted) **
 in thread 7f9898b80e80 thread_name:ceph-exporter

 ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f9899274df0]
 2: /lib64/libc.so.6(+0xa154c) [0x7f98992c154c]
 3: raise()
 4: abort()
 5: /lib64/libstdc++.so.6(+0xa1a01) [0x7f98994e5a01]
 6: /lib64/libstdc++.so.6(+0xad37c) [0x7f98994f137c]
 7: /lib64/libstdc++.so.6(+0xad3e7) [0x7f98994f13e7]
 8: /lib64/libstdc++.so.6(+0xad649) [0x7f98994f1649]
 9: ceph-exporter(+0x2a218) [0x557c40c98218]
 10: (boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]
 11: ceph-exporter(+0x683c7) [0x557c40cd63c7]
 12: (DaemonMetricCollector::dump_asok_metrics()+0x22c8) [0x557c40cbae98]
 13: ceph-exporter(+0x41230) [0x557c40caf230]
 14: ceph-exporter(+0x60b7d) [0x557c40cceb7d]
 15: ceph-exporter(+0xace0f) [0x557c40d1ae0f]
 16: (DaemonMetricCollector::main()+0x212) [0x557c40c9dc32]
 17: main()
 18: /lib64/libc.so.6(+0x3feb0) [0x7f989925feb0]
 19: __libc_start_main()
 20: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
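
For illustration, here is a minimal standalone sketch (an assumption about the failure mode, not the actual ceph-exporter code) of how a boost::json accessor can raise the detail::throw_invalid_argument frame seen above when an admin-socket reply does not have the expected shape; if such an exception escaped DaemonMetricCollector::dump_asok_metrics(), the process would abort with SIGABRT as reported:

// build: g++ -std=c++17 sketch.cpp -lboost_json
// (or include <boost/json/src.hpp> in one translation unit for header-only use)
#include <boost/json.hpp>
#include <iostream>

int main() {
    // Hypothetical perf-counter reply; "avgcount" is a string here instead of
    // the integer the caller expects.
    boost::json::value reply = boost::json::parse(
        R"({"num_sessions": {"avgcount": "not-a-number"}})");

    try {
        // at() and as_int64() throw std::invalid_argument (routed through
        // boost::json::detail::throw_invalid_argument) when the value does
        // not have the requested type.
        auto n = reply.at("num_sessions").at("avgcount").as_int64();
        std::cout << n << "\n";
    } catch (const std::exception& e) {
        // Catching it here only demonstrates the throw; an uncaught instance
        // terminates the process.
        std::cerr << "accessor threw: " << e.what() << "\n";
    }
}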


Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-06-09-152551
ODF: 4.13.0-218

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Ceph health went to a warning state; other than that, I did not observe any functional impact, at least from the positive scenarios.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Reporting at the first hit

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
===================
1) Create an ODF cluster with the following cluster-level parameters enabled: 1) FIPS 2) Hugepages 3) KMS - vault 4) cluster-wide encryption 5) encryption in transit

2) Run automated system test - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/system_test/test_cluster_full_and_recovery.py

The automated test executes the following steps:
a) Create PVC1 [FS + RBD]
b) Verify new PVC1 [FS + RBD] on Bound state
c) Run FIO on PVC1_FS + PVC1_RBD
d) Calculate Checksum PVC1_FS + PVC1_RBD
e) Fill the cluster to “Full ratio” (usually 85%) using the benchmark-operator

Actual results:
===============
ceph-client.admin crashed in the ceph-exporter thread with "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" while filling the cluster to the near-full ratio.


Expected results:
=================
No crashes should be observed during IO write operations.

Comment 5 avan 2023-06-13 12:14:24 UTC
Hi @Prasad

Comment 15 Mudit Agarwal 2023-08-17 05:39:52 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2232226, but that bug is being hit during an upgrade from 4.13 to 4.14.
4.14 is using the Ceph build that has the exporter changes, but I can see the same crash there.

{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f122f9a0df0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f122f9ed54c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7f122fc11a01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7f122fc1d37c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7f122fc1d3e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7f122fc1d649]",
        "ceph-exporter(+0x29767) [0x565474e04767]",
        "(boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x565474e18267]",
        "ceph-exporter(+0x65947) [0x565474e40947]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1de7) [0x565474e209e7]",
        "ceph-exporter(+0x45e20) [0x565474e20e20]",
        "ceph-exporter(+0x5caed) [0x565474e37aed]",
        "ceph-exporter(+0xab6df) [0x565474e866df]",
        "(DaemonMetricCollector::main()+0x212) [0x565474e0abf2]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7f122f98beb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-105.el9cp",
    "crash_id": "2023-08-14T23:05:06.978999Z_273bd10e-1d27-4e80-ab3c-838c6c5a9519",
    "entity_name": "client.admin",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-exporter",
    "stack_sig": "03972c98be910d1ce25645fdd11917d43497d8e45963b63cf072b005e7daee44",
    "timestamp": "2023-08-14T23:05:06.978999Z",
    "utsname_hostname": "rook-ceph-exporter-compute-0-68fdf6c8b5-rbdqb",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.25.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023"
}

We can have the setup whenever you want, as this is always reproducible.
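
For reference, a minimal sketch (an illustration only, not the actual ceph-exporter code or the planned fix) of the non-throwing boost::json accessors: the if_object()/if_contains()/is_int64() variants return null pointers or booleans that can be checked, so a missing or malformed counter can be skipped instead of aborting the exporter:

// build: g++ -std=c++17 guard.cpp -lboost_json
#include <boost/json.hpp>
#include <cstdint>
#include <iostream>
#include <optional>
#include <string_view>

// Hypothetical helper: fetch an integer counter, returning nothing instead of
// throwing when the reply does not have the expected shape.
std::optional<std::int64_t> counter_or_none(const boost::json::value& reply,
                                            std::string_view key) {
    const auto* obj = reply.if_object();        // nullptr if not a JSON object
    if (!obj) return std::nullopt;
    const auto* field = obj->if_contains(key);  // nullptr if the key is absent
    if (!field || !field->is_int64()) return std::nullopt;
    return field->as_int64();                   // safe: type checked above
}

int main() {
    auto reply = boost::json::parse(R"({"num_sessions": "oops"})");
    if (auto n = counter_or_none(reply, "num_sessions"))
        std::cout << *n << "\n";
    else
        std::cout << "counter missing or malformed, skipping\n";  // no abort
}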