This bug was initially created as a copy of Bug #2211324

I am copying this bug because:

Description of problem:
-----------------------
- After upgrading the cluster from RHCS 4.3z1 (baremetal) to RHCS 5.3z3 / RHCS 5.3z2, running the cephadm-preflight playbook to install the latest ceph-common and cephadm packages on the Ceph nodes stops the ceph.target service, which in turn stops all the Ceph services running on the host. This happens only when Ceph RPMs from the older version (RHCS 4.3z1), such as ceph-common, ceph-base, ceph-mon, and ceph-osd, still exist on the hosts (because the cluster was migrated from baremetal to containers).

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHCS 5.3

How reproducible:
-----------------
Every time.

Steps to Reproduce:
-------------------
1. Deploy an RHCS 4.3z1 baremetal cluster.
2. Convert the Ceph services to containerized.
3. Upgrade the cluster to RHCS 5.3z2 / RHCS 5.3z3.
4. Run the cephadm-preflight playbook to upgrade the ceph-common and cephadm packages on the hosts.

Actual results:
---------------
The Ceph packages are upgraded, but all Ceph services on the host are stopped.

Expected results:
-----------------
The Ceph packages should be upgraded and no services should be impacted.
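For context, the preflight run referred to in step 4 is normally executed from the admin node with cephadm-ansible. A minimal sketch of the invocation is shown below; the inventory file name, working directory, and extra-vars are assumptions based on the default cephadm-ansible layout, not taken from this report.

    # Sketch: run the preflight playbook from the node where cephadm-ansible is installed
    cd /usr/share/cephadm-ansible
    ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=rhcs"
    # On a cluster adopted from baremetal, this is the step that upgrades
    # ceph-common/cephadm on each host and, per this bug, also stops ceph.target.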
Observing a similar kind of issue when attempting an upgrade from 4.3z1 --> 5.3 (latest). After running the cephadm-preflight playbook, the Ceph services (mon, mgr, OSDs) failed on all the nodes, but ceph.target was still running. As a result, ceph commands hang.

[root@ceph-msaini-taooh8-node2 ~]# systemctl | grep ceph
  ceph-crash                    loaded active running  Ceph crash dump collector
● ceph-mgr                      loaded failed failed   Ceph Manager
● ceph-mon                      loaded failed failed   Ceph Monitor
  system-ceph\x2dcrash.slice    loaded active active   system-ceph\x2dcrash.slice
  system-ceph\x2dmds.slice      loaded active active   system-ceph\x2dmds.slice
  system-ceph\x2dmgr.slice      loaded active active   system-ceph\x2dmgr.slice
  system-ceph\x2dmon.slice      loaded active active   system-ceph\x2dmon.slice
  ceph.target                   loaded active active   ceph target allowing to start/stop all ceph*@.service instances at once

[root@ceph-msaini-taooh8-node5 ~]# systemctl status ceph.target
● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
   Loaded: loaded (/etc/systemd/system/ceph.target; enabled; vendor preset: enabled)
   Active: active since Wed 2023-12-20 14:58:47 EST; 4h 5min ago

Dec 20 14:58:47 ceph-msaini-taooh8-node5 systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.

======================== Upgrade logs ========================

[root@ceph-msaini-taooh8-node1-installer ~]# ceph --version
ceph version 14.2.22-128.el8cp (40a2bf9c4e79e39754d69a95cd51bd60991284be) nautilus (stable)

[root@ceph-msaini-taooh8-node1-installer ~]# ceph versions
{
    "mon": {
        "ceph version 14.2.22-128.el8cp (40a2bf9c4e79e39754d69a95cd51bd60991284be) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.22-128.el8cp (40a2bf9c4e79e39754d69a95cd51bd60991284be) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.22-128.el8cp (40a2bf9c4e79e39754d69a95cd51bd60991284be) nautilus (stable)": 12
    },
    "mds": {
        "ceph version 14.2.22-128.el8cp (40a2bf9c4e79e39754d69a95cd51bd60991284be) nautilus (stable)": 3
    },
    "rgw": {
        "ceph version 14.2.22-128.el8cp (40a2bf9c4e79e39754d69a95cd51bd60991284be) nautilus (stable)": 4
    },
    "overall": {
        "ceph version 14.2.22-128.el8cp (40a2bf9c4e79e39754d69a95cd51bd60991284be) nautilus (stable)": 25
    }
}

[root@ceph-msaini-taooh8-node1-installer ~]# ceph -s
  cluster:
    id:     07cd16a8-f925-4d09-a041-6d725b939582
    health: HEALTH_WARN
            1 pool(s) have non-power-of-two pg_num
            1 pools have too few placement groups
            3 pools have too many placement groups
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum ceph-msaini-taooh8-node3,ceph-msaini-taooh8-node2,ceph-msaini-taooh8-node1-installer (age 45m)
    mgr: ceph-msaini-taooh8-node1-installer(active, since 43m), standbys: ceph-msaini-taooh8-node2, ceph-msaini-taooh8-node3
    mds: cephfs:1 {0=ceph-msaini-taooh8-node2=up:active} 2 up:standby
    osd: 12 osds: 12 up (since 38m), 12 in (since 57m)
    rgw: 4 daemons active (ceph-msaini-taooh8-node5.rgw0, ceph-msaini-taooh8-node5.rgw1, ceph-msaini-taooh8-node6.rgw0, ceph-msaini-taooh8-node6.rgw1)

  data:
    pools:   13 pools, 676 pgs
    objects: 382 objects, 456 MiB
    usage:   13 GiB used, 227 GiB / 240 GiB avail
    pgs:     676 active+clean

  io:
    client:   2.5 KiB/s rd, 2 op/s rd, 0 op/s wr

[root@ceph-msaini-taooh8-node1-installer ~]# podman ps
CONTAINER ID  IMAGE                                                             COMMAND               CREATED         STATUS         PORTS  NAMES
b4bc2bbf0671  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.6  --path.procfs=/ro...  54 minutes ago  Up 54 minutes         node-exporter
288dbf3d1416  registry.redhat.io/rhceph/rhceph-4-rhel8:latest                                        49 minutes ago  Up 49 minutes         ceph-mon-ceph-msaini-taooh8-node1-installer
e02558859efb  registry.redhat.io/rhceph/rhceph-4-rhel8:latest                                        46 minutes ago  Up 46 minutes         ceph-mgr-ceph-msaini-taooh8-node1-installer
fdc68705313e  registry.redhat.io/rhceph/rhceph-4-rhel8:latest                                        30 minutes ago  Up 30 minutes         ceph-crash-ceph-msaini-taooh8-node1-installer

[root@ceph-msaini-taooh8-node1-installer ceph-ansible]# sudo ansible-playbook -i hosts infrastructure-playbooks/rolling_update.yml --extra-vars "health_osd_check_retries=50 health_osd_check_delay=30"

PLAY RECAP ***********************************************************************************************************
ceph-msaini-taooh8-node1-installer : ok=375  changed=59  unreachable=0  failed=0  skipped=633  rescued=0  ignored=0
ceph-msaini-taooh8-node2           : ok=370  changed=39  unreachable=0  failed=0  skipped=685  rescued=0  ignored=0
ceph-msaini-taooh8-node3           : ok=370  changed=39  unreachable=0  failed=0  skipped=690  rescued=0  ignored=0
ceph-msaini-taooh8-node4           : ok=252  changed=28  unreachable=0  failed=0  skipped=460  rescued=0  ignored=0
ceph-msaini-taooh8-node5           : ok=379  changed=38  unreachable=0  failed=0  skipped=625  rescued=0  ignored=0
ceph-msaini-taooh8-node6           : ok=368  changed=37  unreachable=0  failed=0  skipped=645  rescued=0  ignored=0
ceph-msaini-taooh8-node7           : ok=319  changed=38  unreachable=0  failed=0  skipped=495  rescued=0  ignored=0
localhost                          : ok=1    changed=1   unreachable=0  failed=0  skipped=1    rescued=0  ignored=0

[root@ceph-msaini-taooh8-node1-installer ceph-ansible]# ansible-playbook -vvvv infrastructure-playbooks/rolling_update.yml -i hosts

  stdout: |-
    {
        "mon": {
            "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 3
        },
        "mgr": {
            "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 3
        },
        "osd": {
            "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 12
        },
        "mds": {
            "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 3
        },
        "rgw": {
            "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 4
        },
        "rgw-nfs": {
            "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 1
        },
        "overall": {
            "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 26
        }
    }
  stdout_lines: <omitted>
META: ran handlers
META: ran handlers

PLAY RECAP ***********************************************************************************************************
ceph-msaini-taooh8-node1-installer : ok=372  changed=51  unreachable=0  failed=0  skipped=626  rescued=0  ignored=0
ceph-msaini-taooh8-node2           : ok=363  changed=27  unreachable=0  failed=0  skipped=676  rescued=0  ignored=0
ceph-msaini-taooh8-node3           : ok=364  changed=28  unreachable=0  failed=0  skipped=680  rescued=0  ignored=0
ceph-msaini-taooh8-node4           : ok=249  changed=21  unreachable=0  failed=0  skipped=453  rescued=0  ignored=0
ceph-msaini-taooh8-node5           : ok=375  changed=27  unreachable=0  failed=0  skipped=616  rescued=0  ignored=0
ceph-msaini-taooh8-node6           : ok=370  changed=27  unreachable=0  failed=0  skipped=629  rescued=0  ignored=0
ceph-msaini-taooh8-node7           : ok=317  changed=29  unreachable=0  failed=0  skipped=489  rescued=0  ignored=0
localhost                          : ok=1    changed=1   unreachable=0  failed=0  skipped=1    rescued=0  ignored=0

[root@ceph-msaini-taooh8-node1-installer ceph-ansible]# ceph --version
ceph version 14.2.22-128.el8cp (40a2bf9c4e79e39754d69a95cd51bd60991284be) nautilus (stable)

[root@ceph-msaini-taooh8-node1-installer ceph-ansible]# ceph versions
{
    "mon": {
        "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 3
    },
    "rgw": {
        "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 4
    },
    "rgw-nfs": {
        "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10-220.el8cp (380780920862a7326df3e00903e9912b85af7d30) pacific (stable)": 26
    }
}

[root@ceph-msaini-taooh8-node1-installer ceph-ansible]# podman ps
CONTAINER ID  IMAGE                                                                                                            COMMAND               CREATED         STATUS         PORTS  NAMES
6ca1e2071341  registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.3-rhel-8-containers-candidate-88814-20231215195330                         16 minutes ago  Up 16 minutes         ceph-mon-ceph-msaini-taooh8-node1-installer
f518b6b7588d  registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.3-rhel-8-containers-candidate-88814-20231215195330                         13 minutes ago  Up 13 minutes         ceph-mgr-ceph-msaini-taooh8-node1-installer
74a1b25bee9e  registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.3-rhel-8-containers-candidate-88814-20231215195330                         3 minutes ago   Up 3 minutes          ceph-crash-ceph-msaini-taooh8-node1-installer
38e14828d9ae  registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.6                                                 --path.procfs=/ro...  2 minutes ago   Up 2 minutes          node-exporter

[root@ceph-msaini-taooh8-node1-installer ceph-ansible]#
# systemctl | grep ceph
  ceph-crash                    loaded active running  Ceph crash dump collector
  ceph-mgr                      loaded active running  Ceph Manager
  ceph-mon                      loaded active running  Ceph Monitor
  system-ceph\x2dcrash.slice    loaded active active   system-ceph\x2dcrash.slice
  system-ceph\x2dmgr.slice      loaded active active   system-ceph\x2dmgr.slice
  system-ceph\x2dmon.slice      loaded active active   system-ceph\x2dmon.slice
  ceph-mgr.target               loaded active active   ceph target allowing to start/stop all ceph-mgr@.service instances at once
  ceph-mon.target               loaded active active   ceph target allowing to start/stop all ceph-mon@.service instances at once
  ceph.target                   loaded active active   ceph target allowing to start/stop all ceph*@.service instances at once
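Note the mismatch in the output above: ceph --version reports the locally installed client binary (still Nautilus, i.e. the RHCS 4 RPM), while ceph versions reports the daemons actually running in containers (already Pacific). This is consistent with the older RPMs still being present on the hosts. A quick way to see the mismatch side by side, as a sketch using standard commands:

    # Client binary provided by the host's RPMs:
    ceph --version
    rpm -q ceph-common ceph-base
    # Versions reported by the running (containerized) daemons:
    ceph versions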
[root@ceph-msaini-taooh8-node1-installer ceph-ansible]# ansible-playbook infrastructure-playbooks/cephadm-adopt.yml -i hosts

TASK [add ceph label for core component] *****************************************************************************
fatal: [ceph-msaini-taooh8-node2 -> ceph-msaini-taooh8-node1-installer]: FAILED! => changed=false
  cmd:
  - podman
  - run
  - --rm
  - --net=host
  - -v
  - /etc/ceph:/etc/ceph:z
  - -v
  - /var/lib/ceph:/var/lib/ceph:ro
  - -v
  - /var/run/ceph:/var/run/ceph:z
  - --entrypoint=ceph
  - registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.3-rhel-8-containers-candidate-88814-20231215195330
  - --cluster
  - ceph
  - orch
  - host
  - label
  - add
  - ceph-msaini-taooh8-node2
  - ceph
  delta: '0:00:01.795436'
  end: '2023-12-20 18:15:08.390207'
  msg: non-zero return code
  rc: 22
  start: '2023-12-20 18:15:06.594771'
  stderr: 'Error EINVAL: host ceph-msaini-taooh8-node2 does not exist'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

[root@ceph-msaini-taooh8-node1-installer ceph-ansible]# dnf install cephadm-ansible
Updating Subscription Management repositories.
Last metadata expiration check: 0:01:06 ago on Wed 20 Dec 2023 06:22:31 PM EST.
Dependencies resolved.
======================================================================================================================
 Package                               Architecture  Version            Repository                             Size
======================================================================================================================
Installing:
 cephadm-ansible                       noarch        1.17.0-1.el8cp     rhceph-5-tools-for-rhel-8-x86_64-rpms   32 k
Installing dependencies:
 ansible-collection-ansible-posix      noarch        1.2.0-1.el8cp.1    rhceph-5-tools-for-rhel-8-x86_64-rpms  131 k
 ansible-collection-community-general  noarch        4.0.0-1.1.el8cp.1  rhceph-5-tools-for-rhel-8-x86_64-rpms  1.5 M
 ansible-core                          x86_64        2.15.3-1.el8       rhel-8-for-x86_64-appstream-rpms       3.6 M
 mpdecimal                             x86_64        2.5.1-3.el8        rhel-8-for-x86_64-appstream-rpms        93 k
 python3.11                            x86_64        3.11.5-1.el8_9     rhel-8-for-x86_64-appstream-rpms        30 k
 python3.11-cffi                       x86_64        1.15.1-1.el8       rhel-8-for-x86_64-appstream-rpms       293 k
 python3.11-cryptography               x86_64        37.0.2-5.el8       rhel-8-for-x86_64-appstream-rpms       1.1 M
 python3.11-libs                       x86_64        3.11.5-1.el8_9     rhel-8-for-x86_64-appstream-rpms        10 M
 python3.11-pip-wheel                  noarch        22.3.1-4.el8       rhel-8-for-x86_64-appstream-rpms       1.4 M
 python3.11-ply                        noarch        3.11-1.el8         rhel-8-for-x86_64-appstream-rpms       135 k
 python3.11-pycparser                  noarch        2.20-1.el8         rhel-8-for-x86_64-appstream-rpms       147 k
 python3.11-pyyaml                     x86_64        6.0-1.el8          rhel-8-for-x86_64-appstream-rpms       214 k
 python3.11-setuptools-wheel           noarch        65.5.1-2.el8       rhel-8-for-x86_64-appstream-rpms       720 k
 sshpass                               x86_64        1.09-4.el8ap       labrepo                                 30 k

Transaction Summary
======================================================================================================================
Install  15 Packages

Total download size: 20 M
Installed size: 78 M
Is this ok [y/N]: y

[root@ceph-msaini-taooh8-node1-installer cephadm-ansible]# systemctl | grep ceph
  ceph-crash                    loaded active running  Ceph crash dump collector
● ceph-mgr                      loaded failed failed   Ceph Manager
● ceph-mon                      loaded failed failed   Ceph Monitor
  system-ceph\x2dcrash.slice    loaded active active   system-ceph\x2dcrash.slice
  system-ceph\x2dmgr.slice      loaded active active   system-ceph\x2dmgr.slice
  system-ceph\x2dmon.slice      loaded active active   system-ceph\x2dmon.slice
  ceph.target                   loaded active active   ceph target allowing to start/stop all ceph*@.service instances at once

[root@ceph-msaini-taooh8-node1-installer cephadm-ansible]# systemctl status ceph.target
● ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
   Loaded: loaded (/etc/systemd/system/ceph.target; enabled; vendor preset: enabled)
   Active: active since Wed 2023-12-20 14:58:46 EST; 3h 54min ago

Dec 20 14:58:46 ceph-msaini-taooh8-node1-installer systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.

[root@ceph-msaini-taooh8-node1-installer cephadm-ansible]# ceph -s

[root@ceph-msaini-taooh8-node1-installer cephadm-ansible]# systemctl -l status ceph-mgr
● ceph-mgr - Ceph Manager
   Loaded: loaded (/etc/systemd/system/ceph-mgr@.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2023-12-20 18:34:26 EST; 37min ago
 Main PID: 110855 (code=exited, status=143)

Dec 20 18:34:24 ceph-msaini-taooh8-node1-installer ceph-mgr-ceph-msaini-taooh8-node1-installer[110855]: 2023-12-20T18:34:24.431-0500 7f0a8fddb700  0 log_channel(cluster) log [DBG] : pgmap v677: 701 pgs: 701 active+clean; 456 MiB data, 2.3 G>
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer systemd[1]: Stopping Ceph Manager...
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mgr-ceph-msaini-taooh8-node1-installer[110855]: teardown: managing teardown after SIGTERM
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mgr-ceph-msaini-taooh8-node1-installer[110855]: teardown: Sending SIGTERM to PID 54
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mgr-ceph-msaini-taooh8-node1-installer[110855]: teardown: Waiting PID 54 to terminate .
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mgr-ceph-msaini-taooh8-node1-installer[110855]: teardown: Process 54 is terminated
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer sh[142533]: f518b6b7588de6ed1793a6f58a4fa9ca41df91f58a7543dd90d97508e6f612e5
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer systemd[1]: ceph-mgr: Main process exited, code=exited, status=143/n/a
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer systemd[1]: ceph-mgr: Failed with result 'exit-code'.
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer systemd[1]: Stopped Ceph Manager.

[root@ceph-msaini-taooh8-node1-installer cephadm-ansible]# systemctl -l status ceph-mon
● ceph-mon - Ceph Monitor
   Loaded: loaded (/etc/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2023-12-20 18:34:26 EST; 38min ago
 Main PID: 106377 (code=exited, status=143)

Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mon-ceph-msaini-taooh8-node1-installer[106377]: debug 2023-12-20T18:34:26.595-0500 7f4c0a23b880  1 rocksdb: close waiting for compaction thread to stop
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mon-ceph-msaini-taooh8-node1-installer[106377]: debug 2023-12-20T18:34:26.595-0500 7f4c0a23b880  1 rocksdb: close compaction thread to stopped
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mon-ceph-msaini-taooh8-node1-installer[106377]: debug 2023-12-20T18:34:26.595-0500 7f4c0a23b880  4 rocksdb: [db_impl/db_impl.cc:397] Shutdown: canceling all background work
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mon-ceph-msaini-taooh8-node1-installer[106377]: debug 2023-12-20T18:34:26.599-0500 7f4c0a23b880  4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mon-ceph-msaini-taooh8-node1-installer[106377]: teardown: Waiting PID 86 to terminate .
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer ceph-mon-ceph-msaini-taooh8-node1-installer[106377]: teardown: Process 86 is terminated
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer sh[142608]: 6ca1e2071341cf2fa0140bced76763b36ec0f17f55ddba50794aa25a1245099e
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer systemd[1]: ceph-mon: Main process exited, code=exited, status=143/n/a
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer systemd[1]: ceph-mon: Failed with result 'exit-code'.
Dec 20 18:34:26 ceph-msaini-taooh8-node1-installer systemd[1]: Stopped Ceph Monitor.

[root@ceph-msaini-taooh8-node1-installer cephadm-ansible]#
[root@ceph-msaini-taooh8-node1-installer cephadm-ansible]# rpm -qa | grep ceph
ceph-grafana-dashboards-14.2.22-128.el8cp.noarch
libcephfs2-16.2.10-208.el8cp.x86_64
cephadm-ansible-1.17.0-1.el8cp.noarch
python3-ceph-common-16.2.10-208.el8cp.x86_64
ceph-base-16.2.10-208.el8cp.x86_64
cephadm-16.2.10-220.el9cp.noarch
python3-ceph-argparse-16.2.10-208.el8cp.x86_64
python3-cephfs-16.2.10-208.el8cp.x86_64
ceph-common-16.2.10-208.el8cp.x86_64
ceph-selinux-16.2.10-208.el8cp.x86_64
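The rpm -qa output above still shows a mix of builds on the host (a leftover 14.2.22 package next to 16.2.10-208 and 16.2.10-220 packages). A rough way to audit a host for leftover pre-5.x Ceph RPMs before running the preflight playbook could look like the sketch below; the package name patterns are assumptions and may need to be extended.

    # Sketch: list installed Ceph-related RPMs with their versions
    rpm -qa 'ceph*' 'libcephfs*' 'librados*' 'librbd*' 'python3-ceph*' \
        --queryformat '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort
    # Anything still at 14.2.x is a leftover RHCS 4 package that the
    # preflight playbook will try to upgrade in place.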
The cephadm-preflight playbook passed and the ceph.target status remained active, as noted in comment #5.

Per comment #6 and an offline discussion with Teoman Onay, this BZ covers only the first step, which is fixed and working as expected.

QE will attempt to reproduce the issue where the Ceph services (mon, mgr, OSDs) failed on all the nodes. If that issue is reproducible, QE will raise a new BZ for it.

As the BZ fix is working as expected, marking this BZ as verified.
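For reference, the kind of post-preflight check described here can be expressed roughly as the sketch below; the exact commands QE ran are not recorded in this comment.

    # Sketch: confirm ceph.target is still active and no ceph units were left failed
    systemctl is-active ceph.target
    systemctl list-units --state=failed 'ceph*'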
(In reply to Manisha Saini from comment #10)
> Cephadm-preflight.yaml playbook was passing and the ceph.target status was
> active as updated in comment #5.
>
> As per comment #6 as offline discussion with Teoman Onay, this BZ was
> related to the first step only which is fixed and working as expected.
>
> QE will be reproducing the issue of the ceph services (mon,mgr,osd's) which
> got failed on all the nodes.
> If the issue is reproducible, QE will raise a new BZ for same.
>

Unable to reproduce the issue seen in comment #5. Upgrade was successful and all services were up and running post upgrade.

Detailed steps recorded - https://docs.google.com/document/d/1xhRCY-bSRWTrKXzibdI7SlQ9rASXdBquPvItCN_hcb4/edit

> As the BZ fix is working as expected, marking this BZ as verified.
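A minimal form of the post-upgrade verification mentioned above might look like the following sketch; it assumes the cluster has already been adopted by cephadm, which may not match the exact state captured in the linked document.

    # Sketch: all daemons should report the same target build and be running
    cephadm shell -- ceph versions
    cephadm shell -- ceph orch ps
    cephadm shell -- ceph -s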
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 Security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:0745