.Cephadm commands run on the host from the Cephadm mgr module now have timeouts
Previously, one of the Cephadm commands would occasionally hang indefinitely, and it was difficult for users to notice and resolve the issue.
With this release, timeouts are introduced for the Cephadm commands that are run on the host from the Cephadm mgr module. If one of these commands hangs, it now eventually fails, and users are alerted with a health warning. The timeout is configurable with the `mgr/cephadm/default_cephadm_command_timeout` setting, and defaults to 900 seconds.
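The general idea of bounding a host command with a timeout can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Cephadm mgr module's actual code: the function name `run_with_timeout` and the constant `DEFAULT_TIMEOUT_SECONDS` are hypothetical, and the returned message merely stands in for the health warning.

```python
import asyncio
import sys

# Mirrors the documented 900-second default; the constant name is hypothetical.
DEFAULT_TIMEOUT_SECONDS = 900


async def run_with_timeout(argv, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Run argv as a subprocess and return (ok, message).

    If the command does not finish within `timeout` seconds, it is killed
    and a failure message is returned instead of waiting indefinitely.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, _err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # The command hung: stop it and report the failure to the caller.
        proc.kill()
        await proc.wait()
        return False, f"command {argv!r} timed out after {timeout}s"
    return proc.returncode == 0, out.decode().strip()
```

For example, `asyncio.run(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout=1))` returns a failed result after about one second rather than blocking for the full 30 seconds. On a real cluster, the timeout itself would be changed through the `mgr/cephadm/default_cephadm_command_timeout` setting, presumably with the usual `ceph config set mgr <option> <value>` form.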
Description of problem:
Sometimes orchestrator operations get stuck on nodes for various reasons.
Example:
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/usr/lib64/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
OSError: [Errno 28] No space left on device
Ceph does not report back to the user with any information; the operation gets stuck without any notification, even in the DEBUG logs.
This BZ is a downstream tracker for an ongoing effort to add timeouts, so that users know the operation was actually attempted but timed out because of a possible scenario x, y, or z.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2023:4473