Description of problem:
Observed on an RHCS 6.1 cluster (17.2.6-35.el9cp) where a shell script was executed to periodically restart the OSD services. One of the cephadm shell commands became stuck, which left the cephadm mgr module in a failed state. At that point the existing cephadm shell session still worked, but it was impossible to get into the cephadm container from a new session.

When the issue was first hit, the mgr module 'cephadm' was disabled and re-enabled, which brought the cluster health back to HEALTH_OK only for it to return to HEALTH_ERR shortly afterwards. Similar behaviour was observed when the mgr service was restarted.

Because one cephadm shell process was stuck, the subsequent cephadm commands triggered by the shell script piled up in a queue waiting for execution. Since the process was stuck only on the installer node, cephadm remained functional on all the other nodes in the cluster.

Shell script used for the periodic restart of the OSDs:

#!/usr/bin/env bash
for i in {0..15}
do
  echo "run number $i"
  sudo cephadm shell -- ceph orch restart osd.osd_spec_default
  sleep 3600
done

ceph status:

# ceph -s
  cluster:
    id:     2fc5c244-e854-11ed-a87c-78ac445e9f2e
    health: HEALTH_ERR
            Module 'cephadm' has failed:

  services:
    mon: 5 daemons, quorum dell-r640-061,dell-r640-077,dell-r640-075,dell-r640-076,dell-r640-070 (age 2w)
    mgr: dell-r640-061.hnndmh(active, since 3d), standbys: dell-r640-070.dmgjgg, dell-r640-075.femjjq
    mds: 1/1 daemons up, 1 standby
    osd: 143 osds: 143 up (since 4d), 143 in (since 2w)
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   35 pools, 2097 pgs
    objects: 1.82M objects, 3.4 TiB
    usage:   11 TiB used, 12 TiB / 23 TiB avail
    pgs:     2097 active+clean

ceph health detail:

# ceph health detail
HEALTH_ERR Module 'cephadm' has failed:
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed:
    Module 'cephadm' has failed:

Version-Release number of selected component (if applicable):
ceph version 17.2.6-35.el9cp
(e9423713abfccf04494c23555d3a9311e72e88cc) quincy (stable)

Error when a new cephadm shell command was executed:

# cephadm shell
^CTraceback (most recent call last):
  File "/usr/sbin/cephadm", line 9713, in <module>
    main()
  File "/usr/sbin/cephadm", line 9701, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 2162, in _infer_config
    return func(ctx)
  File "/usr/sbin/cephadm", line 2086, in _infer_fsid
    daemon_list = list_daemons(ctx, detail=False)
  File "/usr/sbin/cephadm", line 6500, in list_daemons
    out, err, code = call(
  File "/usr/sbin/cephadm", line 1829, in call
    stdout, stderr, returncode = async_run(run_with_timeout())
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 1869, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/lib64/python3.9/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f8c265178b0>
Traceback (most recent call last):
  File "/usr/lib64/python3.9/asyncio/base_subprocess.py", line 126, in __del__
  File "/usr/lib64/python3.9/asyncio/base_subprocess.py", line 104, in close
  File "/usr/lib64/python3.9/asyncio/unix_events.py", line 536, in close
  File "/usr/lib64/python3.9/asyncio/unix_events.py", line 560, in _close
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 751, in call_soon
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 515, in _check_closed
RuntimeError: Event loop is closed

ps -aux output displaying the cephadm processes running in the background:

# ps -aux | grep cephadm
root 1874221 0.0 0.0 119160 41524 pts/2 S+ May17 0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell
root 2231316 0.0 0.0 119168 41664 pts/0 T May17 0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell -- ceph orch restart osd.osd_spec_default
root 2231848 0.0 0.0 118416 41428 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.103 reset-failed
root 2231961 0.0 0.0 118444 41372 pts/0 Sl May17 0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell -- ceph orch restart osd.osd_spec_default
root 2232062 0.0 0.0 118444 41980 pts/0 Sl May17 0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell -- ceph orch restart osd.osd_spec_default
root 2232245 0.0 0.0 16760 8488 pts/0 S May17 0:00 sudo cephadm shell -- ceph orch restart osd.osd_spec_default
root 2232247 0.0 0.0 118440 42364 pts/0 Sl May17 0:00 /usr/bin/python3 -s /sbin/cephadm shell -- ceph orch restart osd.osd_spec_default
root 2232781 0.0 0.0 118412 40868 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.103 restart
root 2233514 0.0 0.0 118416 41712 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.111 reset-failed
root 2233582 0.0 0.0 118440 42424 pts/0 Sl May17 0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell -- ceph osd out
root 2233824 0.0 0.0 118412 42008 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.111 restart
root 2234017 0.0 0.0 118416 42460 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.119 reset-failed
root 2234203 0.0 0.0 118412 42596 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.119 restart
root 2234374 0.0 0.0 118416 41156 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.127 reset-failed
root 2234588 0.0 0.0 118412 41696 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.127 restart
root 2234757 0.0 0.0 118416 41812 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.135 reset-failed
root 2234946 0.0 0.0 118412 41936 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.135 restart
root 2235126 0.0 0.0 118412 41712 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.14 restart
root 2235298 0.0 0.0 118416 41740 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.142 reset-failed
root 2235471 0.0 0.0 118412 41220 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.142 restart
root 2235650 0.0 0.0 118416 42196 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.22 reset-failed
root 2235830 0.0 0.0 118412 41508 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.22 restart
root 2236003 0.0 0.0 118416 42792 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.30 reset-failed
root 2236309 0.0 0.0 118412 40748 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.30 restart
root 2236630 0.0 0.0 118412 41388 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.38 reset-failed
root 2237108 0.0 0.0 118412 41412 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.38 restart
root 2237296 0.0 0.0 118416 42068 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.47 reset-failed
root 2237718 0.0 0.0 118412 42808 ? Ssl May17 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 ls
root 2239305 0.0 0.0 118412 41236 ? Ssl May18 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 ls
root 2240174 0.0 0.0 118412 42436 ? Ssl May18 0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 ls
root 2249206 0.0 0.0 3092 996 pts/1 S+ 01:49 0:00 tail -f ceph-mgr.dell-r640-061.hnndmh.cephadm.log
root 2249274 0.0 0.0 3876 1984 pts/0 S+ 01:50 0:00 grep --color=auto cephadm

When the stuck process was killed, cephadm became responsive:

[root@dell-r640-061 2fc5c244-e854-11ed-a87c-78ac445e9f2e]# kill -9 2231315
[root@dell-r640-061 2fc5c244-e854-11ed-a87c-78ac445e9f2e]#
[2]+  Killed                  ./restart_osds.sh &> restart_osds_stdout.log  (wd: ~)
(wd now: /var/log/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e)
[root@dell-r640-061 2fc5c244-e854-11ed-a87c-78ac445e9f2e]# cephadm shell
Inferring fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e
Inferring config /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/mon.dell-r640-061/config
Using ceph image with id '76a87bb99dbb' and tag 'ceph-6.1-rhel-9-containers-candidate-33338-20230428094752' created on 2023-04-28 09:49:50 +0000 UTC
registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3c721ebb49fcc54a88f32b295b9fb6f2e6a8ccd8492454d745f34b9c117f8052
[ceph: root@dell-r640-061 /]# exit

How reproducible:
1/1

Steps to Reproduce:
1. Configure an RHCS 6.1 cluster.
2. Run a shell script which periodically restarts the OSD services and marks the OSDs out and in.
3. When a cephadm shell process gets stuck, the cephadm module remains in a failed state.

Actual results:

Expected results:
As cephadm shell uses asyncio for asynchronous execution, it waits for the first process in the queue to complete before moving on to the next one. If an erroneous command or process stays stuck in the executing state for a very long time, it should be terminated and/or ignored.
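Until such termination is implemented in cephadm itself, the restart script can defend itself against a hung `cephadm shell`. A minimal sketch, assuming GNU coreutils `timeout` is available; the 900-second deadline and the `run_with_deadline` helper are illustrative choices, not part of the original script:

```shell
#!/usr/bin/env bash
# Hypothetical hardening of the restart script above: bound each cephadm
# invocation with coreutils `timeout` so one hung run cannot stall the
# queue of later runs. DEADLINE_SECS is an assumed tunable.
DEADLINE_SECS=${DEADLINE_SECS:-900}

run_with_deadline() {
    # timeout sends SIGTERM at the deadline, SIGKILL 30s later,
    # and exits with status 124 when the deadline fires
    timeout --kill-after=30 "$DEADLINE_SECS" "$@"
}

# The original loop, with each run bounded (left commented out here so
# the helper can be sourced without starting the 16-hour loop):
#   for i in {0..15}; do
#       echo "run number $i"
#       run_with_deadline sudo cephadm shell -- ceph orch restart osd.osd_spec_default \
#           || echo "run $i timed out or failed" >&2
#       sleep 3600
#   done
```

A run that overshoots the deadline then fails with exit status 124 instead of blocking every subsequent iteration behind it.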
Additional info:
While cephadm shell stopped responding on the installer node, the client node was still functional.
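The manual `kill -9` workaround shown earlier can be scripted. A sketch, assuming the stuck process is identifiable by its stopped "T" state in `ps` (as PID 2231316 is in the listing above); `kill_stopped` is a hypothetical helper, not a cephadm command:

```shell
#!/usr/bin/env bash
# kill_stopped PATTERN: SIGKILL any process whose command line matches
# PATTERN and whose state starts with "T" (stopped), like the stuck
# cephadm shell in the ps output above. SIGKILL cannot be blocked, so it
# terminates even a stopped process.
kill_stopped() {
    local pat=$1 pid state
    for pid in $(pgrep -f "$pat"); do
        state=$(ps -o stat= -p "$pid" | tr -d ' ')
        case $state in
            T*)
                echo "killing stuck process $pid (state $state)"
                kill -9 "$pid"
                ;;
        esac
    done
}

# On the installer node this would be run as:
#   kill_stopped 'cephadm shell'
```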
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4473