Description of problem:

After rebooting an HCI host running ceph-mon and ceph-osd containers, the ceph containers do not come back up.

Version-Release number of selected component (if applicable):

ceph-ansible-4.0.25-1.el8cp.src.rpm

How reproducible:

Reproduced multiple times consistently.

Steps to Reproduce:
1. Deploy Compute HCI nodes with ceph (mon, mgr, osd) services and openstack compute services
2. On all nodes run "echo c > /proc/sysrq-trigger"
3. On all nodes run "init 6"
4. On any node observe that the openstack containers start
5. On any node observe that the ceph containers fail to start

Actual results:

The ceph containers do not start, and the journal shows errors that the old container still holds the name the new container is trying to use (even after a reboot), e.g.:

  name "ceph-mon-dcn2-computehci2-1" is already in use by "6c0daedf9bec0645a6e174b3337a9c34f22eaa347f28a67d9edcfee2e66ffc61"

The containers with this problem can be seen by running:

  journalctl -xe | grep ceph | grep "is already in use by"

Expected results:

The ceph containers start just like the openstack containers.
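A quick way to spot the affected units after the reboot (a sketch; it assumes the ceph unit names follow the ceph-<daemon>@<instance> pattern shown later in this bug):

  # List failed ceph systemd units, then pull the name-conflict messages from the journal
  systemctl list-units --failed 'ceph-*'
  journalctl -xe | grep ceph | grep "is already in use by"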
Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.
WORKAROUND

On each node running ceph containers, operator intervention is required to get the ceph cluster back up and running. The intervention consists of removing the old container IDs with 'podman rm'. An example of the workaround is shown below.

(undercloud) [stack@site-undercloud-0 ~]$ openstack server list
+--------------------------------------+-----------------------+--------+------------------------+----------------+-------------+
| ID                                   | Name                  | Status | Networks               | Image          | Flavor      |
+--------------------------------------+-----------------------+--------+------------------------+----------------+-------------+
| 4196210b-c5e4-4fe6-9ada-2f82ce0e809b | dcn2-computehci2-0    | BUILD  |                        | overcloud-full | computehci2 |
| c047b25b-86a7-4bf2-b18d-27fe5053da05 | dcn2-computehci2-2    | BUILD  |                        | overcloud-full | computehci2 |
| 94d3afe2-d0f4-485e-893b-253ed80fe7ad | dcn2-computehci2-1    | BUILD  |                        | overcloud-full | computehci2 |
| f3597aaf-fa99-4494-9761-fe79d2bcf6d8 | dcn1-computehci1-0    | ACTIVE | ctlplane=192.168.34.27 | overcloud-full | computehci1 |
| aa5903cd-2fec-4d1d-8f16-d6c428b8ba0d | dcn1-computehci1-1    | ACTIVE | ctlplane=192.168.34.12 | overcloud-full | computehci1 |
| 7e27b2d0-d8ee-4f86-8a66-6363b5dc70df | dcn1-computehci1-2    | ACTIVE | ctlplane=192.168.34.57 | overcloud-full | computehci1 |
| 0f36ba61-30c3-4069-80cb-6139d5f1752c | central-controller0-1 | ACTIVE | ctlplane=192.168.24.25 | overcloud-full | control0    |
| 2d7c4ac8-e1e7-4052-98a7-e26d337edddf | central-controller0-2 | ACTIVE | ctlplane=192.168.24.85 | overcloud-full | control0    |
| 7acd9661-2b20-46e3-91de-64afacae2810 | central-controller0-0 | ACTIVE | ctlplane=192.168.24.16 | overcloud-full | control0    |
| 15ce76ee-c065-4945-81c6-0b4fa15c1ad6 | central-compute0-0    | ACTIVE | ctlplane=192.168.24.75 | overcloud-full | compute0    |
| c1e68e6d-8d97-4d93-a2b4-36ac65f4b8a9 | central-compute0-1    | ACTIVE | ctlplane=192.168.24.78 | overcloud-full | compute0    |
+--------------------------------------+-----------------------+--------+------------------------+----------------+-------------+
(undercloud) [stack@site-undercloud-0 ~]$
(undercloud) [stack@site-undercloud-0 ~]$ ssh 192.168.34.57
Warning: Permanently added '192.168.34.57' (ECDSA) to the list of known hosts.
This system is not registered to Red Hat Insights.
See https://cloud.redhat.com/
To register this system, run: insights-client --register
Last login: Mon Jul 20 04:36:34 2020 from 192.168.34.254
[heat-admin@dcn1-computehci1-2 ~]$
[heat-admin@dcn1-computehci1-2 ~]$ sudo su -
[root@dcn1-computehci1-2 ~]#
[root@dcn1-computehci1-2 ~]# podman ps | grep ceph
[root@dcn1-computehci1-2 ~]#
[root@dcn1-computehci1-2 ~]# for C in $(journalctl -xe | grep ceph | grep "is already in use by" | awk {'print $20'} | sort | uniq | sed s/\"//g | sed s/\\.//g); do echo $C ; podman rm $C; done
12e9794483503ee0aa9b6c138eb55b8c99df1002d40340ea97380163a8473904
12e9794483503ee0aa9b6c138eb55b8c99df1002d40340ea97380163a8473904
2726be4b62e1f4f243af50a910ed00921ae165d1004d7b588f7614eb8217195d
2726be4b62e1f4f243af50a910ed00921ae165d1004d7b588f7614eb8217195d
4fdb002927e55bad22df03890a7559679170007a4baf31198c6944d3e2d9c906
4fdb002927e55bad22df03890a7559679170007a4baf31198c6944d3e2d9c906
52d5c38d3bb76a33155af907e2db359a0d47fee49a1571b97e938dea96319e45
52d5c38d3bb76a33155af907e2db359a0d47fee49a1571b97e938dea96319e45
9e66567dd53640592480b0aa6ce137fa2f6585fdcc728134a7dd09d2fbdb98f2
9e66567dd53640592480b0aa6ce137fa2f6585fdcc728134a7dd09d2fbdb98f2
b030e09515f6bc13a23c2cbc392e11e6de48fc4a091b30a3540418768c30a2e9
b030e09515f6bc13a23c2cbc392e11e6de48fc4a091b30a3540418768c30a2e9
b6f1f9c12d89957f690ca934a029c96861831e371f0e03abaeb87560e9c2a5d8
b6f1f9c12d89957f690ca934a029c96861831e371f0e03abaeb87560e9c2a5d8
[root@dcn1-computehci1-2 ~]# podman ps | grep ceph
76f78c107786  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-27  1 second ago   Up 1 second ago   ceph-osd-12
60655b230ba4  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-27  2 seconds ago  Up 1 second ago   ceph-mgr-dcn1-computehci1-2
2a8bab43c3b7  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-27  2 seconds ago  Up 1 second ago   ceph-mon-dcn1-computehci1-2
906786e53078  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-27  2 seconds ago  Up 2 seconds ago  ceph-osd-1
c9f28583d688  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-27  2 seconds ago  Up 2 seconds ago  ceph-osd-4
dc3d2beae848  site-undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-27  2 seconds ago  Up 2 seconds ago  ceph-osd-9
[root@dcn1-computehci1-2 ~]#
[root@dcn1-computehci1-2 ~]# podman exec ceph-mon-dcn1-computehci1-2 ceph -s --cluster dcn1
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
Error: non zero exit code: 1: OCI runtime error
[root@dcn1-computehci1-2 ~]#
[root@dcn1-computehci1-2 ~]# podman exec -ti ceph-mon-dcn1-computehci1-2 /bin/bash
[root@dcn1-computehci1-2 /]#
[root@dcn1-computehci1-2 /]# ceph -s
  cluster:
    id:     80ceec70-c646-4ab9-98cf-f966afb8f177
    health: HEALTH_WARN
            too few PGs per OSD (12 < min 30)

  services:
    mon: 3 daemons, quorum dcn1-computehci1-2,dcn1-computehci1-0,dcn1-computehci1-1 (age 3m)
    mgr: dcn1-computehci1-1(active, since 7m), standbys: dcn1-computehci1-0, dcn1-computehci1-2
    osd: 15 osds: 15 up (since 3m), 15 in (since 16h)

  data:
    pools:   2 pools, 64 pgs
    objects: 282 objects, 391 MiB
    usage:   16 GiB used, 239 GiB / 255 GiB avail
    pgs:     64 active+clean

[root@dcn1-computehci1-2 /]#
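For reference, a more compact form of the cleanup loop used above (a sketch; the awk field number and the quote/dot stripping simply mirror the journal message format seen on these nodes):

  # Remove every stale container ID mentioned in the "is already in use by" errors
  for C in $(journalctl -xe | grep ceph | grep "is already in use by" \
             | awk '{print $20}' | sort -u | tr -d '".'); do
    podman rm "$C"
  done

Once the stale IDs are removed, the ceph containers come back up within seconds, as the `podman ps` output above shows.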
If you see something like the following in the logs of a node, then run `systemctl restart tripleo_cinder_volume.service` on that node:

  ERROR cinder.service [-] Manager for service cinder-volume dcn1-computehci1-2@tripleo_ceph is reporting problems, not sending heartbeat. Service will appear "down"
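A minimal check for that condition (a sketch; it assumes the cinder-volume messages are visible through the service's journal on the affected node):

  # Restart the service only if the heartbeat error is present
  journalctl -u tripleo_cinder_volume.service | grep -q "not sending heartbeat" \
    && systemctl restart tripleo_cinder_volume.service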
ceph-ansible-4.0.23-1.el8cp.noarch does not have this problem. We tested the following:

1. Deploy Compute HCI nodes with ceph (mon, mgr, osd) services and openstack compute services
2. On all nodes run "echo c > /proc/sysrq-trigger"
3. On all nodes run "init 6"
4. On any node observe that the openstack containers start
5. On any node observe that the ceph containers start too!

The only difference between the two tests was whether we deployed with ceph-ansible-4.0.23-1.el8cp.noarch or ceph-ansible-4.0.25-1.el8cp.src.rpm.
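To confirm which build a given environment was deployed with (a sketch; the package name is taken from this bug, run it wherever ceph-ansible is installed, e.g. the undercloud):

  rpm -q ceph-ansible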
System deployed with 4.0.25:

[root@dcn1-computehci1-0 system]# head ceph-mon\@.service
[Unit]
Description=Ceph Monitor
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/rm -f /%t/%n-pid /%t/%n-cid
ExecStartPre=/bin/sh -c '"$(command -v mkdir)" -p /etc/ceph /var/lib/ceph/mon'
ExecStart=/usr/bin/podman run --rm --name ceph-mon-%i \
  -d --conmon-pidfile /%t/%n-pid --cidfile /%t/%n-cid \
[root@dcn1-computehci1-0 system]#

System deployed with 4.0.23:

[root@dcn2-computehci2-2 system]# head ceph-mon\@.service
[Unit]
Description=Ceph Monitor
After=network.target

[Service]
EnvironmentFile=-/etc/environment
ExecStartPre=-/usr/bin/podman rm ceph-mon-%i
ExecStartPre=/bin/sh -c '"$(command -v mkdir)" -p /etc/ceph /var/lib/ceph/mon'
ExecStart=/usr/bin/podman run --rm --name ceph-mon-%i \
  --memory=7809m \
[root@dcn2-computehci2-2 system]#
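A hypothetical per-node workaround based on the difference above would be to re-add the 4.0.23-style "podman rm" pre-start step through a systemd drop-in (a sketch only; the drop-in path and the instance name are assumptions, and the actual fix ships in ceph-ansible, see the errata below):

  # Add back the stale-container cleanup before each start of the mon unit
  mkdir -p /etc/systemd/system/ceph-mon@.service.d
  printf '[Service]\nExecStartPre=-/usr/bin/podman rm ceph-mon-%%i\n' \
    > /etc/systemd/system/ceph-mon@.service.d/rm-stale.conf
  systemctl daemon-reload
  systemctl restart ceph-mon@$(hostname -s).service   # instance name assumed to be the short hostname

A drop-in only appends the extra ExecStartPre= entry, so the pid/cid cleanup added in 4.0.25 stays in place.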
The reason for what we see in comment #5 is the following commit:

https://github.com/ceph/ceph-ansible/commit/eb3f065d03ba45ed147b8154f2750c0e0533e029

I suspect that the missing "podman rm" pre-start step explains the issue.
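The failure mode is consistent with a stale container surviving the crash and keeping the name reserved (a sketch of how it shows up; the container name and ID are taken from the error in the description):

  # A stale (not running) container holds the name, so the unit's
  # "podman run --name ceph-mon-%i" fails until it is removed
  podman ps -a --filter name=ceph-mon-dcn2-computehci2-1
  podman rm 6c0daedf9bec0645a6e174b3337a9c34f22eaa347f28a67d9edcfee2e66ffc61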
A normal reboot is not enough to trigger this issue. To reproduce it you need to run "echo c > /proc/sysrq-trigger".
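For completeness, the crash trigger we used (destructive; a sketch, and on some systems the sysrq interface has to be enabled first):

  echo 1 > /proc/sys/kernel/sysrq    # allow all sysrq functions (may already be enabled)
  echo c > /proc/sysrq-trigger       # immediately crash the kernel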
The fix for bz 1834974 changed how unit files are created and seems to have introduced this issue.
Hi Dimitri, can you review the observations related to the restart of the mon containers?
Waiting on triage. This BZ is targeted for 4.1z1 Async.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 4.1 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3322
*** Bug 1876717 has been marked as a duplicate of this bug. ***