Bug 2058636
| Summary: | [update] Cannot stop ceph-mon on controllers: No such container: ceph-mon-controller-2 | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sofer Athlan-Guyot <sathlang> | 
| Component: | openstack-tripleo-heat-templates | Assignee: | Mikolaj Ciecierski <mciecier> | 
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jason Grosso <jgrosso> | 
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 13.0 (Queens) | CC: | jgrosso, jlarriba, jpretori, kgilliga, kthakre, mburns, mciecier | 
| Target Milestone: | --- | Keywords: | Triaged, ZStream | 
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-05-23 14:15:25 UTC | Type: | Bug | 
| Regression: | --- | Mount Type: | --- | 
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2058644 | ||
| Bug Blocks: | |||
| We are currently testing whether setting exclude=docker-common-1.13.1-209.git7d71120.el7_9.x86_64 docker-client.x86_64 2:1.13.1-209.git7d71120.el7_9 docker-1.13.1-209.git7d71120.el7_9.x86_64 docker-rhel-push-plugin.x86_64 2:1.13.1-209.git7d71120.el7_9 in /etc/yum.conf on all overcloud nodes (so that docker is updated to 208 and not to 209) is a working workaround. |
Description of problem:

Update of OSP13z12 failed during the Controller update on:

2022-02-25 04:21:34 | TASK [Stop mons to make them consistent with systemd] **************************
2022-02-25 04:21:34 | Friday 25 February 2022 04:21:33 +0000 (0:00:00.450) 0:05:54.143 *******
2022-02-25 04:21:34 | fatal: [controller-2]: FAILED! => {"changed": true, "cmd": "docker stop ceph-mon-controller-2", "delta": "0:00:00.040103", "end": "2022-02-25 04:21:33.802947", "msg": "non-zero return code", "rc": 1, "start": "2022-02-25 04:21:33.762844", "stderr": "Error response from daemon: No such container: ceph-mon-controller-2", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller-2"], "stdout": "", "stdout_lines": []}

On a live environment we saw that ceph-mon could not start because it couldn't get a LOCK:

lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ceph-mon 6769 ceph 7uW REG 253,2 0 88511062 /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK

and that process was a leftover ceph-mon container that didn't show up in the docker ps output.

[root@controller-2 heat-admin]# ps -eo pid,lstart,cmd |grep 6455
6455 Tue Feb 22 11:42:20 2022 /usr/bin/docker-containerd-shim-current f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /var/run/docker/libcontainerd/f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /usr/libexec/docker/docker-runc-current
988527 Fri Feb 25 13:03:14 2022 grep --color=auto 6455

After killing (using kill -9 <pid>) the rogue container we were able to start ceph-mon again.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
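
The manual diagnosis described above (find the process holding the mon store LOCK, then check whether docker still knows the container) could be sketched roughly as follows. This is only an illustration: the function names `lock_holder_pids` and `stop_if_present` are hypothetical and are not part of tripleo-heat-templates or ceph-ansible, and the guard around `docker stop` is an assumption about how the failing task could be made tolerant of a missing container, not the fix that was shipped.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not taken from any shipped tool.

# Read `lsof <file>` output on stdin and print the unique PIDs holding the file.
# The header line is skipped because its second field ("PID") is not numeric.
lock_holder_pids() {
    awk '$2 ~ /^[0-9]+$/ { print $2 }' | sort -un
}

# Stop a container only if docker lists it (running or stopped).
# Assumption: a missing container should be reported, not treated as fatal,
# because the real process may be a leftover that docker no longer tracks.
stop_if_present() {
    local name="$1"
    if docker ps -a --format '{{.Names}}' | grep -qx -- "$name"; then
        docker stop "$name"
    else
        echo "container $name not known to docker; check for leftover processes" >&2
    fi
}

# Example usage on a controller (path and container name taken from the report):
#   lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK | lock_holder_pids
#   ps -o pid,ppid,lstart,cmd -p <pid>     # inspect the holder and its parent shim
#   stop_if_present ceph-mon-controller-2
```

Printing the parent PID with `ps -o ppid` is useful here because the lock holder (ceph-mon) and the leftover docker-containerd-shim are different processes, as the two PIDs in the report show.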