Bug 2210205 - cephadm module failed due to a cephadm shell command stuck as a background process
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 6.1z1
Assignee: Adam King
QA Contact: Mohit Bisht
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2221020
Reported: 2023-05-26 04:59 UTC by Harsh Kumar
Modified: 2023-08-03 16:45 UTC
CC List: 7 users

Fixed In Version: ceph-17.2.6-84.el9cp
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-03 16:45:09 UTC
Embargoed:




Links
Red Hat Issue Tracker RHCEPH-6741 (last updated 2023-05-26 05:00:54 UTC)
Red Hat Product Errata RHBA-2023:4473 (last updated 2023-08-03 16:45:54 UTC)

Description Harsh Kumar 2023-05-26 04:59:41 UTC
Description of problem:
Observed on an RHCS 6.1 cluster (17.2.6-35.el9cp) where a shell script was executed to periodically restart the OSD services. One of the cephadm shell commands was found to be stuck, which caused the cephadm mgr module to enter a failed state.

At this point, the existing cephadm shell session continued to work fine, but it was impossible to get into the cephadm container from a new session.

When the issue was first hit, the mgr module 'cephadm' was disabled and re-enabled, which brought the cluster health back to HEALTH_OK, only for it to return to HEALTH_ERR shortly afterwards. Similar behaviour was observed when the mgr service was restarted.
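
For reference, the disable/enable cycle and mgr restart mentioned above correspond roughly to the following commands (shown here for illustration; they are not a transcript of the original session):

# ceph mgr module disable cephadm
# ceph mgr module enable cephadm
# ceph orch restart mgr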

Because one cephadm shell process was stuck, the subsequent cephadm commands triggered by the shell script piled up in a queue waiting for execution.

Since the process was stuck only on the installer node, cephadm remained functional on all the other nodes in the cluster.

shell script used for periodic restart of OSDs -
#!/usr/bin/env bash

for i in {0..15}
do
        echo "run number $i"
        sudo cephadm shell -- ceph orch restart osd.osd_spec_default
        sleep 3600
done

ceph status
# ceph -s
  cluster:
    id:     2fc5c244-e854-11ed-a87c-78ac445e9f2e
    health: HEALTH_ERR
            Module 'cephadm' has failed: 
 
  services:
    mon: 5 daemons, quorum dell-r640-061,dell-r640-077,dell-r640-075,dell-r640-076,dell-r640-070 (age 2w)
    mgr: dell-r640-061.hnndmh(active, since 3d), standbys: dell-r640-070.dmgjgg, dell-r640-075.femjjq
    mds: 1/1 daemons up, 1 standby
    osd: 143 osds: 143 up (since 4d), 143 in (since 2w)
    rgw: 4 daemons active (4 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   35 pools, 2097 pgs
    objects: 1.82M objects, 3.4 TiB
    usage:   11 TiB used, 12 TiB / 23 TiB avail
    pgs:     2097 active+clean

ceph health detail - 
# ceph health detail
HEALTH_ERR Module 'cephadm' has failed: 
[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 
    Module 'cephadm' has failed: 

Version-Release number of selected component (if applicable):
ceph version 17.2.6-35.el9cp (e9423713abfccf04494c23555d3a9311e72e88cc) quincy (stable)


Error when a new cephadm shell command was executed - 
# cephadm shell
^CTraceback (most recent call last):
  File "/usr/sbin/cephadm", line 9713, in <module>
    main()
  File "/usr/sbin/cephadm", line 9701, in main
    r = ctx.func(ctx)
  File "/usr/sbin/cephadm", line 2162, in _infer_config
    return func(ctx)
  File "/usr/sbin/cephadm", line 2086, in _infer_fsid
    daemon_list = list_daemons(ctx, detail=False)
  File "/usr/sbin/cephadm", line 6500, in list_daemons
    out, err, code = call(
  File "/usr/sbin/cephadm", line 1829, in call
    stdout, stderr, returncode = async_run(run_with_timeout())
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 1869, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/lib64/python3.9/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f8c265178b0>
Traceback (most recent call last):
  File "/usr/lib64/python3.9/asyncio/base_subprocess.py", line 126, in __del__
  File "/usr/lib64/python3.9/asyncio/base_subprocess.py", line 104, in close
  File "/usr/lib64/python3.9/asyncio/unix_events.py", line 536, in close
  File "/usr/lib64/python3.9/asyncio/unix_events.py", line 560, in _close
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 751, in call_soon
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 515, in _check_closed
RuntimeError: Event loop is closed

ps -aux output displaying the cephadm processes running in the background -
# ps -aux | grep cephadm
root     1874221  0.0  0.0 119160 41524 pts/2    S+   May17   0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell
root     2231316  0.0  0.0 119168 41664 pts/0    T    May17   0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell -- ceph orch restart osd.osd_spec_default
root     2231848  0.0  0.0 118416 41428 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.103 reset-failed
root     2231961  0.0  0.0 118444 41372 pts/0    Sl   May17   0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell -- ceph orch restart osd.osd_spec_default
root     2232062  0.0  0.0 118444 41980 pts/0    Sl   May17   0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell -- ceph orch restart osd.osd_spec_default
root     2232245  0.0  0.0  16760  8488 pts/0    S    May17   0:00 sudo cephadm shell -- ceph orch restart osd.osd_spec_default
root     2232247  0.0  0.0 118440 42364 pts/0    Sl   May17   0:00 /usr/bin/python3 -s /sbin/cephadm shell -- ceph orch restart osd.osd_spec_default
root     2232781  0.0  0.0 118412 40868 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.103 restart
root     2233514  0.0  0.0 118416 41712 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.111 reset-failed
root     2233582  0.0  0.0 118440 42424 pts/0    Sl   May17   0:00 /usr/bin/python3 -s /usr/sbin/cephadm shell -- ceph osd out
root     2233824  0.0  0.0 118412 42008 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.111 restart
root     2234017  0.0  0.0 118416 42460 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.119 reset-failed
root     2234203  0.0  0.0 118412 42596 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.119 restart
root     2234374  0.0  0.0 118416 41156 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.127 reset-failed
root     2234588  0.0  0.0 118412 41696 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.127 restart
root     2234757  0.0  0.0 118416 41812 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.135 reset-failed
root     2234946  0.0  0.0 118412 41936 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.135 restart
root     2235126  0.0  0.0 118412 41712 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.14 restart
root     2235298  0.0  0.0 118416 41740 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.142 reset-failed
root     2235471  0.0  0.0 118412 41220 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.142 restart
root     2235650  0.0  0.0 118416 42196 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.22 reset-failed
root     2235830  0.0  0.0 118412 41508 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.22 restart
root     2236003  0.0  0.0 118416 42792 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.30 reset-failed
root     2236309  0.0  0.0 118412 40748 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.30 restart
root     2236630  0.0  0.0 118412 41388 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.38 reset-failed
root     2237108  0.0  0.0 118412 41412 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.38 restart
root     2237296  0.0  0.0 118416 42068 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 unit --fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e --name osd.47 reset-failed
root     2237718  0.0  0.0 118412 42808 ?        Ssl  May17   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 ls
root     2239305  0.0  0.0 118412 41236 ?        Ssl  May18   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 ls
root     2240174  0.0  0.0 118412 42436 ?        Ssl  May18   0:00 /usr/bin/python3 /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/cephadm.0205ef93e6ef23d2ef858d4f8c05a1eb2d8b178f84cbe935a0b80303c88e5a63 --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:1e03745ce273a23b0a9a1d7a295a7324c373a4ca02ae230753ef108d88cdda35 --timeout 895 ls
root     2249206  0.0  0.0   3092   996 pts/1    S+   01:49   0:00 tail -f ceph-mgr.dell-r640-061.hnndmh.cephadm.log
root     2249274  0.0  0.0   3876  1984 pts/0    S+   01:50   0:00 grep --color=auto cephadm
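
As an aside (not part of the original report), the stuck shell shows up above with process state 'T' (stopped); such processes can be listed with a filter along these lines:

# ps -eo pid,stat,cmd | awk '$2 ~ /^T/ && /cephadm/'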

When the stuck process was killed, cephadm became responsive -
[root@dell-r640-061 2fc5c244-e854-11ed-a87c-78ac445e9f2e]# kill -9 2231315
[root@dell-r640-061 2fc5c244-e854-11ed-a87c-78ac445e9f2e]# 
[2]+  Killed                  ./restart_osds.sh &> restart_osds_stdout.log  (wd: ~)
(wd now: /var/log/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e)
[root@dell-r640-061 2fc5c244-e854-11ed-a87c-78ac445e9f2e]# cephadm shell
Inferring fsid 2fc5c244-e854-11ed-a87c-78ac445e9f2e
Inferring config /var/lib/ceph/2fc5c244-e854-11ed-a87c-78ac445e9f2e/mon.dell-r640-061/config
Using ceph image with id '76a87bb99dbb' and tag 'ceph-6.1-rhel-9-containers-candidate-33338-20230428094752' created on 2023-04-28 09:49:50 +0000 UTC
registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:3c721ebb49fcc54a88f32b295b9fb6f2e6a8ccd8492454d745f34b9c117f8052
[ceph: root@dell-r640-061 /]# exit

How reproducible:
1/1

Steps to Reproduce:
1. Configure an RHCS 6.1 cluster
2. Run a shell script which periodically restarts the OSD services and marks OSDs out and in (a sketch is shown below)
3. When a cephadm shell process gets stuck, the cephadm module remains in a failed state
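
A sketch of the kind of loop meant in step 2 (the OSD id, intervals, and iteration count are placeholders; the script actually used in this report is the one in the description above):

#!/usr/bin/env bash
# Illustrative reproduction loop: restart the OSD service, then cycle one OSD out and back in.
for i in {0..15}
do
        echo "run number $i"
        sudo cephadm shell -- ceph orch restart osd.osd_spec_default
        sudo cephadm shell -- ceph osd out 103
        sleep 60
        sudo cephadm shell -- ceph osd in 103
        sleep 3600
done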

Actual results:


Expected results:
As cephadm shell uses asyncio for asynchronous execution, it waits for the first process in the queue to complete before moving on to the next one. If an erroneous command or process is observed to be stuck in the execution state for a very long time, it should be terminated and/or ignored.
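
As a caller-side mitigation sketch only (this is not the cephadm-side fix; the 900-second limit and command below are illustrative), each invocation in a loop like the one above could be bounded with GNU coreutils timeout so that one hung cephadm shell does not block every subsequent run:

# Kill the cephadm shell if it runs longer than 900s; escalate to SIGKILL 30s later.
sudo timeout --kill-after=30 900 cephadm shell -- ceph orch restart osd.osd_spec_default \
        || echo "cephadm shell timed out or failed; continuing" >&2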


Additional info:
While cephadm shell stopped responding on the installer node, the client node was still functional.

Comment 13 errata-xmlrpc 2023-08-03 16:45:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4473

