Description of problem:

Update of OSP13z12 failed during the Controller update:

2022-02-25 04:21:34 | TASK [Stop mons to make them consistent with systemd] **************************
2022-02-25 04:21:34 | Friday 25 February 2022 04:21:33 +0000 (0:00:00.450) 0:05:54.143 *******
2022-02-25 04:21:34 | fatal: [controller-2]: FAILED! => {"changed": true, "cmd": "docker stop ceph-mon-controller-2", "delta": "0:00:00.040103", "end": "2022-02-25 04:21:33.802947", "msg": "non-zero return code", "rc": 1, "start": "2022-02-25 04:21:33.762844", "stderr": "Error response from daemon: No such container: ceph-mon-controller-2", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller-2"], "stdout": "", "stdout_lines": []}

On a live environment we saw that ceph-mon could not start because it could not acquire the lock on its store:

lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK
COMMAND    PID USER  FD TYPE DEVICE SIZE/OFF     NODE NAME
ceph-mon  6769 ceph 7uW  REG  253,2        0 88511062 /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK

The process holding the lock belonged to a leftover ceph-mon container that did not show up in the docker ps output:

[root@controller-2 heat-admin]# ps -eo pid,lstart,cmd | grep 6455
  6455 Tue Feb 22 11:42:20 2022 /usr/bin/docker-containerd-shim-current f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /var/run/docker/libcontainerd/f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /usr/libexec/docker/docker-runc-current
988527 Fri Feb 25 13:03:14 2022 grep --color=auto 6455

After killing the rogue container's process (using kill -9 <pid>) we were able to start ceph-mon again.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
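The manual recovery described above (find the process holding the mon's store.db LOCK, kill it, restart the mon) can be sketched as a small shell helper. The function name pids_from_lsof is hypothetical; it parses lsof output read from stdin, which also lets it be exercised without a live lock file. The paths and container name below are taken from this report and will differ on other nodes.

```shell
# Hypothetical helper: print the unique PIDs from `lsof <file>` output
# read on stdin, skipping the COMMAND/PID header line.
pids_from_lsof() {
    awk 'NR > 1 { print $2 }' | sort -u
}

# Sketch of the recovery steps from the report (paths/names assumed,
# only run after confirming the PID really is a leftover ceph-mon):
#   lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK \
#     | pids_from_lsof | xargs -r kill -9
#   # then restart the mon (mechanism depends on the deployment)
```

Killing by parsed PID rather than by container name is the point here: the rogue container is invisible to docker ps, so docker stop/docker kill cannot reach it.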
We are currently testing whether setting:

exclude=docker-common-1.13.1-209.git7d71120.el7_9.x86_64 docker-client.x86_64 2:1.13.1-209.git7d71120.el7_9 docker-1.13.1-209.git7d71120.el7_9.x86_64 docker-rhel-push-plugin.x86_64 2:1.13.1-209.git7d71120.el7_9

in /etc/yum.conf on all overcloud nodes (so that docker is updated to the -208 build and not to -209) is a working workaround.
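For context, a minimal sketch of where that line would live in /etc/yum.conf (the exclude value is copied verbatim from the comment above; whether yum honors the arch suffixes and epoch prefixes in exactly this form is an assumption to verify, since exclude normally takes space-separated package name globs):

```ini
# /etc/yum.conf (fragment, sketch only -- not a tested configuration)
[main]
# Pin docker to the -208 build by excluding the -209 packages:
exclude=docker-common-1.13.1-209.git7d71120.el7_9.x86_64 docker-client.x86_64 2:1.13.1-209.git7d71120.el7_9 docker-1.13.1-209.git7d71120.el7_9.x86_64 docker-rhel-push-plugin.x86_64 2:1.13.1-209.git7d71120.el7_9
```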