Bug 2058636 - [update] Cannot stop ceph-mon on controllers: No such container: ceph-mon-controller-2
Summary: [update] Cannot stop ceph-mon on controllers: No such container: ceph-mon-con...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Mikolaj Ciecierski
QA Contact: Jason Grosso
URL:
Whiteboard:
Depends On: 2058644
Blocks:
Reported: 2022-02-25 13:52 UTC by Sofer Athlan-Guyot
Modified: 2022-05-23 14:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-23 14:15:25 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | OSP-13196 | 0 | None | None | None | 2022-02-25 13:55:08 UTC

Internal Links: 2069570

Description Sofer Athlan-Guyot 2022-02-25 13:52:27 UTC
Description of problem:

Update of OSP13z12 failed during Controller update on:

  2022-02-25 04:21:34 | TASK [Stop mons to make them consistent with systemd] **************************
  2022-02-25 04:21:34 | Friday 25 February 2022  04:21:33 +0000 (0:00:00.450)       0:05:54.143 *******
  2022-02-25 04:21:34 | fatal: [controller-2]: FAILED! => {"changed": true, "cmd": "docker stop ceph-mon-controller-2", "delta": "0:00:00.040103", "end": "2022-02-25 04:21:33.802947", "msg": "non-zero return code", "rc": 1, "start": "2022-02-25 04:21:33.762844", "stderr": "Error response from daemon: No such container: ceph-mon-controller-2", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller-2"], "stdout": "", "stdout_lines": []}
  2022-02-25 04:21:34 |
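
For illustration, a guarded variant of the failing command could look like the sketch below. The function name stop_container_if_present is hypothetical and is not part of openstack-tripleo-heat-templates; the sketch only relies on docker ps and docker stop.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not the shipped task): stop a container
# only if the Docker daemon knows about it, so a missing container does not
# produce a non-zero return code.
stop_container_if_present() {
    name="$1"
    # List every container name the daemon knows, one per line, and look
    # for an exact, whole-line match.
    if docker ps -a --format '{{.Names}}' | grep -Fxq -- "$name"; then
        docker stop "$name"
    else
        echo "container $name not known to docker, nothing to stop" >&2
        return 0
    fi
}

# Example call, using the container name from the log above:
#   stop_container_if_present ceph-mon-controller-2
```

Such a guard would only avoid the task failure. It would not deal with a process that survives outside of what docker ps reports.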


On a live environment we saw that ceph-mon could not start because it could not acquire the LOCK file of its store, which another process still held:


  lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK 
  COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
  ceph-mon 6769 ceph    7uW  REG  253,2        0 88511062 /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK

and that process belonged to a leftover ceph-mon container that did not show up in the docker ps output:

  [root@controller-2 heat-admin]# ps -eo pid,lstart,cmd |grep 6455
  6455 Tue Feb 22 11:42:20 2022 /usr/bin/docker-containerd-shim-current f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /var/run/docker/libcontainerd/f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /usr/libexec/docker/docker-runc-current 
  988527 Fri Feb 25 13:03:14 2022 grep --color=auto 6455 

After killing the rogue container's process (using kill -9 <pid>), we were able to start ceph-mon again.
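
The manual investigation above can be sketched as two small helpers. Both function names are hypothetical. The report does not state how PID 6455 (the docker-containerd-shim) was derived from PID 6769 (the lock holder); looking up the parent PID is an assumption about that step.

```shell
#!/bin/sh
# Hypothetical helpers (assumptions, not from the report).

# Print the unique PIDs found in lsof output read on stdin.
# The first line is the lsof header and is skipped.
pids_from_lsof() {
    awk 'NR > 1 { print $2 }' | sort -u
}

# Print the parent PID of the given process, with padding removed.
parent_pid() {
    ps -o ppid= -p "$1" | tr -d ' '
}

# Possible use on a controller, with the path taken from the report:
#   lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK | pids_from_lsof
#   parent_pid <pid printed by the previous command>
```

Comparing the resulting PIDs against the containers listed by docker ps would show whether the lock holder belongs to a container the daemon has lost track of.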




Comment 2 Sofer Athlan-Guyot 2022-02-25 14:09:13 UTC
We are currently testing whether setting:

exclude=docker-common-1.13.1-209.git7d71120.el7_9.x86_64 docker-client.x86_64 2:1.13.1-209.git7d71120.el7_9 docker-1.13.1-209.git7d71120.el7_9.x86_64 docker-rhel-push-plugin.x86_64 2:1.13.1-209.git7d71120.el7_9

in /etc/yum.conf on all overcloud nodes (so that docker is updated to build 208 and not to 209) is a working workaround.
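
To check whether a node already carries the build that the exclude line is meant to hold back, something like the following sketch could be used. The function name is hypothetical, and the pattern simply matches the 1.13.1-209 version-release string quoted above.

```shell
#!/bin/sh
# Hypothetical check (an assumption, not part of the workaround itself):
# return 0 if the given package string, for example the output of
# `rpm -q docker`, is the 1.13.1-209 build named in the exclude line.
is_excluded_docker_build() {
    case "$1" in
        docker-*1.13.1-209.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use on an overcloud node, with package names from the exclude line:
#   for p in docker docker-common docker-client docker-rhel-push-plugin; do
#       is_excluded_docker_build "$(rpm -q "$p")" && echo "$p is at build 209"
#   done
```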

