Bug 2058636

Summary: [update] Cannot stop ceph-mon on controllers: No such container: ceph-mon-controller-2
Product: Red Hat OpenStack
Reporter: Sofer Athlan-Guyot <sathlang>
Component: openstack-tripleo-heat-templates
Assignee: Mikolaj Ciecierski <mciecier>
Status: CLOSED CURRENTRELEASE
QA Contact: Jason Grosso <jgrosso>
Severity: urgent
Priority: urgent
Version: 13.0 (Queens)
CC: jgrosso, jlarriba, jpretori, kgilliga, kthakre, mburns, mciecier
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Last Closed: 2022-05-23 14:15:25 UTC
Type: Bug
Bug Depends On: 2058644    
Bug Blocks:    

Description Sofer Athlan-Guyot 2022-02-25 13:52:27 UTC
Description of problem:

Update of OSP13z12 failed during Controller update on:

  2022-02-25 04:21:34 | TASK [Stop mons to make them consistent with systemd] **************************
  2022-02-25 04:21:34 | Friday 25 February 2022  04:21:33 +0000 (0:00:00.450)       0:05:54.143 *******
  2022-02-25 04:21:34 | fatal: [controller-2]: FAILED! => {"changed": true, "cmd": "docker stop ceph-mon-controller-2", "delta": "0:00:00.040103", "end": "2022-02-25 04:21:33.802947", "msg": "non-zero return code", "rc": 1, "start": "2022-02-25 04:21:33.762844", "stderr": "Error response from daemon: No such container: ceph-mon-controller-2", "stderr_lines": ["Error response from daemon: No such container: ceph-mon-controller-2"], "stdout": "", "stdout_lines": []}
  2022-02-25 04:21:34 |
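
The task fails because `docker stop` returns a non-zero code when docker does not know the container. As an illustration only (this is not the actual tripleo task or the fix that was shipped), a stop step could first check whether docker knows the container. `stop_if_present` is a hypothetical helper name. Note that skipping the stop only avoids the task failure; it does not deal with a leftover process such as the one described below.

```shell
# Hypothetical helper: stop a container only when docker knows about it.
stop_if_present() {
    name="$1"
    # `docker inspect` exits non-zero when no such container exists.
    if ! docker inspect --type=container "$name" >/dev/null 2>&1; then
        echo "container $name is not known to docker; skipping stop" >&2
        return 0
    fi
    docker stop "$name"
}
```

Example: `stop_if_present ceph-mon-controller-2`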


On a live environment we saw that ceph-mon could not start because it could not acquire the LOCK file of its store:


  lsof /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK 
  COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
  ceph-mon 6769 ceph    7uW  REG  253,2        0 88511062 /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK

and that process was a leftover ceph-mon container that didn't show up in the docker ps output.

  [root@controller-2 heat-admin]# ps -eo pid,lstart,cmd |grep 6455
  6455 Tue Feb 22 11:42:20 2022 /usr/bin/docker-containerd-shim-current f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /var/run/docker/libcontainerd/f43eafd4eb15045747c17b470726eb3d42c5b1f70017cd0f619d2fc9c3695d7c /usr/libexec/docker/docker-runc-current 
  988527 Fri Feb 25 13:03:14 2022 grep --color=auto 6455 

After killing the rogue container (using kill -9 <pid>), we were able to start ceph-mon again.
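
The manual check above can be sketched as a small script. This is a minimal sketch that assumes `lsof` and `ps` are available on the node; `pids_from_lsof` and `show_lock_holders` are hypothetical helper names, not part of any existing tool. It only lists the processes holding the file, together with their parent PID and command line, and leaves the decision to kill anything to the operator.

```shell
# Print the unique PIDs found in `lsof` output read from stdin.
# The first line of lsof output is the column header, so it is skipped.
pids_from_lsof() {
    awk 'NR > 1 { print $2 }' | sort -u
}

# Print "PID PPID COMMAND" for each process holding the given file open.
show_lock_holders() {
    lock_file="$1"
    for pid in $(lsof "$lock_file" 2>/dev/null | pids_from_lsof); do
        ps -o pid=,ppid=,args= -p "$pid"
    done
}
```

Example: `show_lock_holders /var/lib/ceph/mon/ceph-controller-2/store.db/LOCK`. The parent PID in the output can then be compared with the `docker-containerd-shim-current` processes on the node.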




Comment 2 Sofer Athlan-Guyot 2022-02-25 14:09:13 UTC
We are currently testing whether setting:

exclude=docker-common-1.13.1-209.git7d71120.el7_9.x86_64 docker-client.x86_64 2:1.13.1-209.git7d71120.el7_9 docker-1.13.1-209.git7d71120.el7_9.x86_64 docker-rhel-push-plugin.x86_64 2:1.13.1-209.git7d71120.el7_9

in /etc/yum.conf on all overcloud nodes (so that docker is updated to 208 and not to 209) is a working workaround.
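
To see whether the exclude took effect on a node, the release number of the installed docker package can be checked after the update. The sketch below assumes the default `rpm -q` output format (name-version-release.arch); `docker_release_number` and `docker_release_ok` are hypothetical helper names.

```shell
# Extract the leading release number from a name-version-release.arch
# string, e.g. "209" from "docker-1.13.1-209.git7d71120.el7_9.x86_64".
docker_release_number() {
    rel="${1##*-}"               # keep what follows the last '-'
    printf '%s\n' "${rel%%.*}"   # keep what precedes the first '.'
}

# Succeed when the package string does NOT carry the unwanted release.
docker_release_ok() {
    unwanted="$1"
    nvra="$2"
    [ "$(docker_release_number "$nvra")" != "$unwanted" ]
}
```

Example: `docker_release_ok 209 "$(rpm -q docker)" || echo "docker 209 is installed on this node"`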