Description of problem (please be as detailed as possible and provide log snippets):
===========================================================================
On an ODF cluster with the following cluster-level parameters enabled:
- FIPS
- Hugepages
- KMS - vault
- Cluster-wide encryption
- Encryption in transit

ceph-client.admin crashed in the ceph-exporter thread with the error message "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" while filling the cluster to the near-full ratio.

Please note that the cluster had not reached the near-full ratio at the time of the crash; the OSDs were filled up to 62.2%.

Backtrace:

2023-06-12T20:38:44.709+0000 7f9898b80e80 -1 *** Caught signal (Aborted) **
 in thread 7f9898b80e80 thread_name:ceph-exporter

 ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)
 1: /lib64/libc.so.6(+0x54df0) [0x7f9899274df0]
 2: /lib64/libc.so.6(+0xa154c) [0x7f98992c154c]
 3: raise()
 4: abort()
 5: /lib64/libstdc++.so.6(+0xa1a01) [0x7f98994e5a01]
 6: /lib64/libstdc++.so.6(+0xad37c) [0x7f98994f137c]
 7: /lib64/libstdc++.so.6(+0xad3e7) [0x7f98994f13e7]
 8: /lib64/libstdc++.so.6(+0xad649) [0x7f98994f1649]
 9: ceph-exporter(+0x2a218) [0x557c40c98218]
 10: (boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]
 11: ceph-exporter(+0x683c7) [0x557c40cd63c7]
 12: (DaemonMetricCollector::dump_asok_metrics()+0x22c8) [0x557c40cbae98]
 13: ceph-exporter(+0x41230) [0x557c40caf230]
 14: ceph-exporter(+0x60b7d) [0x557c40cceb7d]
 15: ceph-exporter(+0xace0f) [0x557c40d1ae0f]
 16: (DaemonMetricCollector::main()+0x212) [0x557c40c9dc32]
 17: main()
 18: /lib64/libc.so.6(+0x3feb0) [0x7f989925feb0]
 19: __libc_start_main()
 20: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Version of all relevant components (if applicable):
OCP: 4.13.0-0.nightly-2023-06-09-152551
ODF: 4.13.0-218

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Ceph health went to a warning state; other than that I did not observe any functional impact, at least from the positive scenarios.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Reporting at the first hit.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
===================
1) Create an ODF cluster with the following cluster-level parameters enabled:
   1) FIPS
   2) Hugepages
   3) KMS - vault
   4) Cluster-wide encryption
   5) Encryption in transit
2) Run the automated system test:
   https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/system_test/test_cluster_full_and_recovery.py

   The automated test executes the steps below:
   a) Create PVC1 [FS + RBD]
   b) Verify that the new PVC1 [FS + RBD] is in Bound state
   c) Run FIO on PVC1_FS + PVC1_RBD
   d) Calculate the checksum of PVC1_FS + PVC1_RBD
   e) Fill the cluster to the "Full ratio" (usually 85%) using the benchmark-operator

Actual results:
===============
ceph-client.admin crashed in the ceph-exporter thread with "throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x557c40cab267]" while filling the cluster to the near-full ratio.

Expected results:
=================
No crashes should be observed during IO write operations.
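Additional note on the backtrace: the frames show boost::json::detail::throw_invalid_argument being reached from DaemonMetricCollector::dump_asok_metrics(), i.e. a typed Boost.JSON accessor hit a value whose shape it did not expect and the exception was never caught. The following is a minimal standalone C++ sketch of that failure mode; the payload is made up for illustration and is not the actual admin-socket output.

// Failure-mode sketch: calling a typed Boost.JSON accessor on a value of the
// wrong kind throws, and an uncaught exception terminates the process with
// SIGABRT (the "Caught signal (Aborted)" seen in the backtrace).
// Build: link against Boost.JSON (e.g. -lboost_json), or include
// <boost/json/src.hpp> in one translation unit for header-only use.
#include <boost/json.hpp>
#include <iostream>

int main() {
    // Hypothetical "perf dump"-style payload in which one counter block is
    // unexpectedly null instead of being an object.
    boost::json::value doc = boost::json::parse(
        R"({"osd": {"op_latency": null}})");

    boost::json::object& root = doc.as_object();             // ok: top level is an object
    boost::json::object& osd  = root.at("osd").as_object();  // ok: "osd" is an object

    // Throws here: as_object() is called on a null value. Depending on the
    // Boost version this is std::invalid_argument (raised via
    // detail::throw_invalid_argument, as in the backtrace) or
    // boost::system::system_error. With no try/catch on the call path the
    // exception escapes, std::terminate() runs, and the daemon aborts.
    boost::json::object& lat = osd.at("op_latency").as_object();
    std::cout << boost::json::serialize(lat) << "\n";
    return 0;
}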
Hi @Prasad
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2232226, but that bug is being hit during upgrade from 4.13 to 4.14. 4.14 is using the ceph build which has the exporter changes, but I can see the same crash there:

{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f122f9a0df0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f122f9ed54c]",
        "raise()",
        "abort()",
        "/lib64/libstdc++.so.6(+0xa1a01) [0x7f122fc11a01]",
        "/lib64/libstdc++.so.6(+0xad37c) [0x7f122fc1d37c]",
        "/lib64/libstdc++.so.6(+0xad3e7) [0x7f122fc1d3e7]",
        "/lib64/libstdc++.so.6(+0xad649) [0x7f122fc1d649]",
        "ceph-exporter(+0x29767) [0x565474e04767]",
        "(boost::json::detail::throw_invalid_argument(char const*, boost::source_location const&)+0x37) [0x565474e18267]",
        "ceph-exporter(+0x65947) [0x565474e40947]",
        "(DaemonMetricCollector::dump_asok_metrics()+0x1de7) [0x565474e209e7]",
        "ceph-exporter(+0x45e20) [0x565474e20e20]",
        "ceph-exporter(+0x5caed) [0x565474e37aed]",
        "ceph-exporter(+0xab6df) [0x565474e866df]",
        "(DaemonMetricCollector::main()+0x212) [0x565474e0abf2]",
        "main()",
        "/lib64/libc.so.6(+0x3feb0) [0x7f122f98beb0]",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "17.2.6-105.el9cp",
    "crash_id": "2023-08-14T23:05:06.978999Z_273bd10e-1d27-4e80-ab3c-838c6c5a9519",
    "entity_name": "client.admin",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-exporter",
    "stack_sig": "03972c98be910d1ce25645fdd11917d43497d8e45963b63cf072b005e7daee44",
    "timestamp": "2023-08-14T23:05:06.978999Z",
    "utsname_hostname": "rook-ceph-exporter-compute-0-68fdf6c8b5-rbdqb",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.25.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023"
}

We can have the setup whenever you want, as this is always reproducible.
@athakkar it looks like this is still reproducible. Since this is a bug for 4.14, please take a look with appropriate priority.
Can we try with 4.14.0-117? It should be fixed there.
Marking it a blocker for now
From what I can see, 4.15 already has the fix for the crash backported: https://github.com/red-hat-storage/rook/blob/3e3dba07d1fd6f730b856f2175de238fac6e6b5a/pkg/operator/ceph/cluster/nodedaemon/exporter.go#L123C1-L124C1 The reason looks similar, though.
Nagendra, please try again with the latest ODF build
The issue is that ceph-exporter doesn't handle exceptions while parsing JSON, so we need to add try-catch blocks to mitigate these crashes and log an error instead. I'm working on the unit tests to verify that the fix works.
(In reply to Divyansh Kamboj from comment #37)
> The issue is that ceph-exporter doesn't handle exceptions while parsing
> JSON, so we need to add try-catch blocks to mitigate these crashes and log
> an error instead. I'm working on the unit tests to verify that the fix
> works.

Why are we getting invalid JSON? Is there some other underlying issue?
> Why are we getting invalid JSON? Is there some other underlying issue?

The JSON is "invalid", so to speak, but the function that parses the JSON into `object` and `array` throws an exception when certain data points are not populated, so try-catch blocks are needed to catch those exceptions.
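For illustration, here is a hedged sketch of what the described try-catch mitigation could look like. This is not the actual ceph-exporter patch; parse_perf_dump() and log_error() are made-up placeholder names.

// Mitigation sketch: wrap the Boost.JSON parse/accessor calls in try-catch so
// a malformed or partially populated perf-dump document is logged and skipped
// instead of aborting the exporter.
// Build: link against Boost.JSON (e.g. -lboost_json), or include
// <boost/json/src.hpp> in one translation unit for header-only use.
#include <boost/json.hpp>
#include <iostream>
#include <string>

// Placeholder logger standing in for whatever the exporter uses.
static void log_error(const std::string& msg) {
    std::cerr << "ceph-exporter: " << msg << std::endl;
}

// Returns an empty object instead of throwing when the payload cannot be
// parsed or does not have the expected top-level shape.
static boost::json::object parse_perf_dump(const std::string& payload) {
    try {
        boost::json::value doc = boost::json::parse(payload);
        return doc.as_object();
    } catch (const std::exception& e) {  // invalid_argument, system_error, ...
        log_error(std::string("failed to parse admin socket output: ") + e.what());
        return boost::json::object{};
    }
}

int main() {
    // A truncated/unexpected payload no longer brings the process down.
    boost::json::object counters = parse_perf_dump("{\"osd\": null");
    std::cout << "parsed " << counters.size() << " top-level counter blocks\n";
    return 0;
}

With this approach a truncated or partially populated perf dump results in a logged error and an empty metrics set for that daemon, rather than an exporter crash.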
Bug in NEW/ASSIGNED state. Moving the bug to 4.15.3 for a decision on RCA/FIX.
*** Bug 2269122 has been marked as a duplicate of this bug. ***
*** Bug 2255648 has been marked as a duplicate of this bug. ***
Moving the bug to 4.15.4 as we have reached the limit on bugs intake for 4.15.3
Upstream PR https://github.com/ceph/ceph/pull/55773 has been merged; getting the patch backported to 7.1.

Relevant BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2266035
It just got backported to Ceph 7.1 downstream; AFAIK 4.16 uses 7.1, so can we mark it MODIFIED for 4.16? wdyt @sheggodu
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days