Description of problem:

Update of OSP13z12 failed during the Controller update:

2022-02-25 04:21:34 | TASK [Stop mons to make them consistent with systemd] **************************
2022-02-25 04:21:34 | Friday 25 February 2022 04:21:33 +0000 (0:00:00.450) 0:05:54.143 *******
2022-02-25 04:21:34 | fatal: [controller-2]: FAILED! => {"changed": true, "cmd": "docker stop ceph-mon-controller-2", "delta": "0:00:00.040103", "end": "2022-02-25 04:21:33.802947", "msg": "non-zero return code", "rc": 1, "start": "2022-02-25 04:21:33.762844", "stderr": "Error response from daemon: No such container: ceph-mon-controller-2", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller-2"], "stdout": "", "stdout_lines": []}

On a live environment we saw that ceph-mon could not start because it could not acquire the lock on its store:

lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK
COMMAND    PID USER  FD TYPE DEVICE SIZE/OFF     NODE NAME
ceph-mon  6769 ceph 7uW  REG  253,2        0 88511062 /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK

The process holding the lock belonged to a leftover ceph-mon container that did not show up in the docker ps output:

[root@controller-2 heat-admin]# ps -eo pid,lstart,cmd | grep 6455
  6455 Tue Feb 22 11:42:20 2022 /usr/bin/docker-containerd-shim-current f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /var/run/docker/libcontainerd/f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /usr/libexec/docker/docker-runc-current
988527 Fri Feb 25 13:03:14 2022 grep --color=auto 6455

After killing the rogue container's process (using kill -9 <pid>) we were able to start ceph-mon again.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
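The manual recovery described above (find the process holding the mon's store.db LOCK, kill it, restart the mon) can be sketched as a small shell helper. The function name pids_from_lsof is hypothetical; it parses lsof output read from stdin, which also lets it be exercised without a live lock file. The paths and container name below are taken from this report and will differ on other nodes.

```shell
# Hypothetical helper: print the unique PIDs from `lsof <file>` output
# read on stdin, skipping the COMMAND/PID header line.
pids_from_lsof() {
    awk 'NR > 1 { print $2 }' | sort -u
}

# Sketch of the recovery steps from the report (paths/names assumed,
# only run after confirming the PID really is a leftover ceph-mon):
#   lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK \
#     | pids_from_lsof | xargs -r kill -9
#   # then restart the mon (mechanism depends on the deployment)
```

Killing by parsed PID rather than by container name is the point here: the rogue container is invisible to docker ps, so docker stop/docker kill cannot reach it.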
We are currently testing whether setting:

exclude=docker-common-1.13.1-209.git7d71120.el7_9.x86_64 docker-client.x86_64 2:1.13.1-209.git7d71120.el7_9 docker-1.13.1-209.git7d71120.el7_9.x86_64 docker-rhel-push-plugin.x86_64 2:1.13.1-209.git7d71120.el7_9

in /etc/yum.conf on all overcloud nodes (so that docker is updated to the -208 build and not to -209) is a working workaround.
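For context, a minimal sketch of where that line would live in /etc/yum.conf (the exclude value is copied verbatim from the comment above; whether yum honors the arch suffixes and epoch prefixes in exactly this form is an assumption to verify, since exclude normally takes space-separated package name globs):

```ini
# /etc/yum.conf (fragment, sketch only -- not a tested configuration)
[main]
# Pin docker to the -208 build by excluding the -209 packages:
exclude=docker-common-1.13.1-209.git7d71120.el7_9.x86_64 docker-client.x86_64 2:1.13.1-209.git7d71120.el7_9 docker-1.13.1-209.git7d71120.el7_9.x86_64 docker-rhel-push-plugin.x86_64 2:1.13.1-209.git7d71120.el7_9
```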