Bug 1846830

Summary: openstack overcloud ceph-upgrade run fails with error "stat: cannot stat '/var/run/ceph/ceph-mon.controller-3.localdomain.asok': No such file or directory"
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Sadique Puthen <sputhenp>
Component: Ceph-Ansible Assignee: Dimitri Savineau <dsavinea>
Status: CLOSED ERRATA QA Contact: Vasishta <vashastr>
Severity: high Docs Contact:
Priority: high    
Version: 3.2 CC: aschoen, ceph-eng-bugs, ceph-qe-bugs, dsavinea, fpantano, gabrioux, gfidente, gmeno, jhoylaer, johfulto, lbezdick, mburns, nthomas, ravsingh, rlondhe, rrasouli, sathlang, tchandra, tserlin, ykaul, yrabl
Target Milestone: z6   
Target Release: 3.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.2.44-1.el7cp Ubuntu: ceph-ansible_3.2.44-2redhat1 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-09-23 12:10:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1578730, 1877815    
Attachments: ansible log (flags: none)

Description Sadique Puthen 2020-06-15 02:03:00 UTC
Description of problem:


(Originally sent by email by Sadique Puthen <sputhenp> on Thu, Jun 11, 1:20 PM to rhos-tech, ceph-osp, Giulio, and John.)

I am running "openstack overcloud ceph-upgrade .." on the latest version of OSP-13 after running "openstack overcloud update run --nodes CephStorage". It fails with the error below.

2020-06-11 01:01:55,393 p=28354 u=mistral |  fatal: [172.16.0.53]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["docker", "exec", "ceph-mon-controller-3", "sh", "-c", "stat /var/run/ceph/ceph-mon.controller-3.asok || stat /var/run/ceph/ceph-mon.controller-3.localdomain.asok"], "delta": "0:00:00.114334", "end": "2020-06-11 05:01:55.382138", "msg": "non-zero return code", "rc": 1, "start": "2020-06-11 05:01:55.267804", "stderr": "stat: cannot stat '/var/run/ceph/ceph-mon.controller-3.asok': No such file or directory\nstat: cannot stat '/var/run/ceph/ceph-mon.controller-3.localdomain.asok': No such file or directory", "stderr_lines": ["stat: cannot stat '/var/run/ceph/ceph-mon.controller-3.asok': No such file or directory", "stat: cannot stat '/var/run/ceph/ceph-mon.controller-3.localdomain.asok': No such file or directory"], "stdout": "", "stdout_lines": []}

# docker exec ceph-mon-controller-3 sh -c stat /var/run/ceph/ceph-mon.controller-3.asok || stat /var/run/ceph/ceph-mon.controller-3.localdomain.asok
stat: missing operand
Try 'stat --help' for more information.
stat: cannot stat ‘/var/run/ceph/ceph-mon.controller-3.localdomain.asok’: No such file or directory
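
A note on the manual reproduction above: the argument to "sh -c" is not quoted, so only "stat" reaches the container shell (hence "stat: missing operand") and the second stat runs on the host. A quoted form of the same check, matching the failing Ansible task, would be:

# docker exec ceph-mon-controller-3 sh -c "stat /var/run/ceph/ceph-mon.controller-3.asok || stat /var/run/ceph/ceph-mon.controller-3.localdomain.asok"

Both paths still fail here because neither socket exists in the container.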

Neither mon .asok exists inside the container. It only has the asok for the ceph-mgr daemon.

# docker exec -it ceph-mon-controller-3 /bin/bash
# ls /var/run/ceph/
ceph-mgr.controller-3.asok

This was a freshly deployed OSP-13 environment, set up just to test the upgrade. The problem surfaced only during the ceph upgrade step. Your help is highly appreciated.
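
To cross-check which admin socket path the containerized mon is configured to create, something like the following should work from a controller (a diagnostic sketch only; the mon id "controller-3" is assumed from the container name above):

# docker exec ceph-mon-controller-3 ceph-conf --name mon.controller-3 --show-config-value admin_socket
# docker exec ceph-mon-controller-3 ls -l /var/run/ceph/

If a path is reported but missing from the listing, the daemon either has not created it yet or is not running inside that container.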
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Sadique Puthen 2020-06-15 02:04:44 UTC
Created attachment 1697281 [details]
ansible log

Comment 7 Sadique Puthen 2020-06-16 02:05:25 UTC
Here is the update.

1 - Deploy OSP13. Verify the .asok file is present for the mon on all controllers.
2 - Test uploading a glance image, create a VM, and verify the environment is working.
3 - Run the upgrade.
3.1 Run update prepare using "openstack overcloud update prepare \.." - SUCCESS. Verify the .asok file is present for the mon on all controllers.
3.2 Update the controllers with "openstack overcloud update run --nodes Controller" - SUCCESS, but the .asok file for the mon container has disappeared from all controllers.

At this time, "docker ps" shows the mon container is running and "ceph -s" shows 3 mons running, but the systemd status shows that the mon startup failed during the Controller update:

Jun 16 02:01:29 controller-3 systemd: Stopped Ceph Monitor.
Jun 16 02:01:29 controller-3 systemd: Starting Ceph Monitor...
Jun 16 02:01:29 controller-3 dockerd-current: time="2020-06-16T02:01:29.527339501Z" level=error msg="Handler for DELETE /v1.26/containers/ceph-mon-controller-3 returned error: You cannot remove a running container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. Stop the container before attempting removal or use -f"
Jun 16 02:01:29 controller-3 dockerd-current: time="2020-06-16T02:01:29.528166637Z" level=error msg="Handler for DELETE /v1.26/containers/ceph-mon-controller-3 returned error: You cannot remove a running container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. Stop the container before attempting removal or use -f"
Jun 16 02:01:29 controller-3 docker: Error response from daemon: You cannot remove a running container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. Stop the container before attempting removal or use -f
Jun 16 02:01:29 controller-3 systemd: Started Ceph Monitor.
Jun 16 02:01:29 controller-3 dockerd-current: time="2020-06-16T02:01:29.565847487Z" level=error msg="Handler for POST /v1.26/containers/create?name=ceph-mon-controller-3 returned error: Conflict. The container name \"/ceph-mon-controller-3\" is already in use by container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. You have to remove (or rename) that container to be able to reuse that name."
Jun 16 02:01:29 controller-3 dockerd-current: time="2020-06-16T02:01:29.566825193Z" level=error msg="Handler for POST /v1.26/containers/create returned error: Conflict. The container name \"/ceph-mon-controller-3\" is already in use by container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. You have to remove (or rename) that container to be able to reuse that name."
Jun 16 02:01:29 controller-3 docker: /usr/bin/docker-current: Error response from daemon: Conflict. The container name "/ceph-mon-controller-3" is already in use by container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. You have to remove (or rename) that container to be able to reuse that name..
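
The conflict above suggests the unit's pre-start cleanup could not remove the still-running ceph-mon-controller-3 container before trying to create a new one with the same name. A possible manual recovery on a single node (only a sketch; the unit name ceph-mon@controller-3 is assumed from a standard ceph-ansible 3.x deployment):

# systemctl stop ceph-mon@controller-3
# docker rm -f ceph-mon-controller-3
# systemctl start ceph-mon@controller-3
# docker exec ceph-mon-controller-3 ls /var/run/ceph/

This only works around one node and does not explain why the restart raced in the first place.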

Even though this error is present, the overall step 3.2 is reported as succeeded.

PLAY RECAP *********************************************************************
controller-1               : ok=314  changed=141  unreachable=0    failed=0   
controller-2               : ok=305  changed=138  unreachable=0    failed=0   
controller-3               : ok=305  changed=138  unreachable=0    failed=0   

Monday 15 June 2020  13:24:27 -0400 (0:00:00.043)       1:20:37.365 *********** 
=============================================================================== 

Updated nodes - Controller
Success

Can you help us understand why the mon restart failed during the controller update? I tried "docker restart <mon id>", but it did not help.
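
For reference, the restart sequence that produced the DELETE/POST conflicts comes from the systemd unit that ceph-ansible generates for the mon. It can be inspected on a controller (assuming the standard ceph-mon@<short hostname> unit name):

# systemctl cat ceph-mon@controller-3
# journalctl -u ceph-mon@controller-3 --since "2020-06-16 02:00:00"

In the ceph-ansible 3.x docker template, the unit typically has an ExecStartPre that removes any old ceph-mon-<hostname> container and an ExecStart that runs a new one with that fixed name, so a container that is never removed blocks the next start.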

Comment 29 errata-xmlrpc 2020-08-18 18:05:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 3.3 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3504

Comment 33 John Fulton 2020-09-08 22:59:01 UTC
*** Bug 1856711 has been marked as a duplicate of this bug. ***

Comment 35 Yogev Rabl 2020-09-14 13:23:39 UTC
*** Bug 1877815 has been marked as a duplicate of this bug. ***

Comment 41 Red Hat Bugzilla 2023-09-14 06:02:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.