Bug 1846830 - openstack overcloud ceph-upgrade run fails with error "stat: cannot stat '/var/run/ceph/ceph-mon.controller-3.localdomain.asok': No such file or directory"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z6
Target Release: 3.3
Assignee: Dimitri Savineau
QA Contact: Vasishta
URL:
Whiteboard:
Duplicates: 1856711
Depends On:
Blocks: 1578730 1877815
 
Reported: 2020-06-15 02:03 UTC by Sadique Puthen
Modified: 2023-12-15 18:09 UTC
CC List: 21 users

Fixed In Version: RHEL: ceph-ansible-3.2.44-1.el7cp Ubuntu: ceph-ansible_3.2.44-2redhat1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-23 12:10:52 UTC
Embargoed:


Attachments
ansible log (155.32 KB, application/x-bzip)
2020-06-15 02:04 UTC, Sadique Puthen


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 5444 0 None closed [skip ci] docker: Add Requires on docker service 2021-02-03 08:19:36 UTC
Red Hat Product Errata RHSA-2020:3504 0 None None None 2020-08-18 18:06:29 UTC
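
The linked ceph-ansible pull request's title indicates the fix adds a Requires dependency on the docker service to the container unit files, so that when docker stops or restarts, systemd stops and restarts the mon unit with it instead of leaving a stale container behind. As a rough sketch only (the real fix lands in ceph-ansible's systemd unit templates; the drop-in path and unit name here are assumptions), the equivalent manual change would be:

# Hypothetical drop-in mirroring the PR title "docker: Add Requires on docker service";
# the unit name and drop-in path are assumed, not taken from ceph-ansible.
mkdir -p /etc/systemd/system/ceph-mon@.service.d
cat > /etc/systemd/system/ceph-mon@.service.d/docker-dep.conf <<'EOF'
[Unit]
Requires=docker.service
After=docker.service
EOF
systemctl daemon-reload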

Description Sadique Puthen 2020-06-15 02:03:00 UTC
Description of problem:



I am running "openstack overcloud ceph-upgrade .." on the latest version of OSP-13 after running "openstack overcloud update run --nodes CephStorage". It fails with the error below.

2020-06-11 01:01:55,393 p=28354 u=mistral |  fatal: [172.16.0.53]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["docker", "exec", "ceph-mon-controller-3", "sh", "-c", "stat /var/run/ceph/ceph-mon.controller-3.asok || stat /var/run/ceph/ceph-mon.controller-3.localdomain.asok"], "delta": "0:00:00.114334", "end": "2020-06-11 05:01:55.382138", "msg": "non-zero return code", "rc": 1, "start": "2020-06-11 05:01:55.267804", "stderr": "stat: cannot stat '/var/run/ceph/ceph-mon.controller-3.asok': No such file or directory\nstat: cannot stat '/var/run/ceph/ceph-mon.controller-3.localdomain.asok': No such file or directory", "stderr_lines": ["stat: cannot stat '/var/run/ceph/ceph-mon.controller-3.asok': No such file or directory", "stat: cannot stat '/var/run/ceph/ceph-mon.controller-3.localdomain.asok': No such file or directory"], "stdout": "", "stdout_lines": []}

# docker exec ceph-mon-controller-3 sh -c stat /var/run/ceph/ceph-mon.controller-3.asok || stat /var/run/ceph/ceph-mon.controller-3.localdomain.asok
stat: missing operand
Try 'stat --help' for more information.
stat: cannot stat ‘/var/run/ceph/ceph-mon.controller-3.localdomain.asok’: No such file or directory
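
Note that the manual command above is not equivalent to the Ansible task: without quotes, sh -c receives only "stat" (hence "missing operand") and the host shell, not the container, evaluates the || branch. Quoting the compound command reproduces the task faithfully:

# Run both stat calls inside the container, as the Ansible task does:
docker exec ceph-mon-controller-3 sh -c "stat /var/run/ceph/ceph-mon.controller-3.asok || stat /var/run/ceph/ceph-mon.controller-3.localdomain.asok"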

Neither of the .asok files exists inside the container. It only has the asok for the ceph-mgr container.

# docker exec -it ceph-mon-controller-3 /bin/bash
# ls /var/run/ceph/
ceph-mgr.controller-3.asok

This was a freshly deployed OSP-13, set up just to test the upgrade. The problem surfaced only during the ceph upgrade step. Your help is highly appreciated.
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Sadique Puthen 2020-06-15 02:04:44 UTC
Created attachment 1697281 [details]
ansible log

Comment 7 Sadique Puthen 2020-06-16 02:05:25 UTC
Here is the update.

1 - Deploy OSP-13. Verify the .asok file is present for the mon on all controllers (a verification loop is sketched after these steps).
2 - Test uploading a glance image, create a VM, and verify the environment is working.
3 - Run the upgrade.
3.1 Run update prepare with "openstack overcloud update prepare \.." - SUCCESS. Verify the .asok file is present for the mon on all controllers.
3.2 Update the controllers with "openstack overcloud update run --nodes Controller" - SUCCESS. But the .asok file for the mon container has disappeared from all controllers.
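
For reference, the verification in steps 1 and 3.1 can be scripted; the controller hostnames and the heat-admin SSH user below are assumptions based on a typical OSP-13 overcloud:

# Check for the mon admin socket inside each controller's mon container.
for h in controller-1 controller-2 controller-3; do
    echo "== $h =="
    ssh heat-admin@$h "sudo docker exec ceph-mon-$h ls /var/run/ceph/"
done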

At this time, docker ps shows the mon container is running and ceph -s shows 3 mons running, but systemd status shows that the mon startup failed during the Controller update.

Jun 16 02:01:29 controller-3 systemd: Stopped Ceph Monitor.
Jun 16 02:01:29 controller-3 systemd: Starting Ceph Monitor...
Jun 16 02:01:29 controller-3 dockerd-current: time="2020-06-16T02:01:29.527339501Z" level=error msg="Handler for DELETE /v1.26/containers/ceph-mon-controller-3 returned error: You cannot remove a running container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. Stop the container before attempting removal or use -f"
Jun 16 02:01:29 controller-3 dockerd-current: time="2020-06-16T02:01:29.528166637Z" level=error msg="Handler for DELETE /v1.26/containers/ceph-mon-controller-3 returned error: You cannot remove a running container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. Stop the container before attempting removal or use -f"
Jun 16 02:01:29 controller-3 docker: Error response from daemon: You cannot remove a running container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. Stop the container before attempting removal or use -f
Jun 16 02:01:29 controller-3 systemd: Started Ceph Monitor.
Jun 16 02:01:29 controller-3 dockerd-current: time="2020-06-16T02:01:29.565847487Z" level=error msg="Handler for POST /v1.26/containers/create?name=ceph-mon-controller-3 returned error: Conflict. The container name \"/ceph-mon-controller-3\" is already in use by container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. You have to remove (or rename) that container to be able to reuse that name."
Jun 16 02:01:29 controller-3 dockerd-current: time="2020-06-16T02:01:29.566825193Z" level=error msg="Handler for POST /v1.26/containers/create returned error: Conflict. The container name \"/ceph-mon-controller-3\" is already in use by container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. You have to remove (or rename) that container to be able to reuse that name."
Jun 16 02:01:29 controller-3 docker: /usr/bin/docker-current: Error response from daemon: Conflict. The container name "/ceph-mon-controller-3" is already in use by container 6738219de56ce6715c8abce77529190ad8403bf886f8a63dcfd73e617a90c874. You have to remove (or rename) that container to be able to reuse that name..
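
The journal shows the restart racing with itself: the container removal the unit issues at startup fails because the old container is still running, and the subsequent create then hits a name conflict. A manual recovery sketch consistent with these messages (the ceph-mon@controller-3 unit name is inferred from the journal, not confirmed):

# Force-remove the stale container, then let systemd start a fresh one.
docker rm -f ceph-mon-controller-3
systemctl restart ceph-mon@controller-3
# The admin socket should reappear once the mon starts cleanly:
docker exec ceph-mon-controller-3 ls /var/run/ceph/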

Though this error is present, step 3.2 overall shows as succeeded.

PLAY RECAP *********************************************************************
controller-1               : ok=314  changed=141  unreachable=0    failed=0   
controller-2               : ok=305  changed=138  unreachable=0    failed=0   
controller-3               : ok=305  changed=138  unreachable=0    failed=0   

Monday 15 June 2020  13:24:27 -0400 (0:00:00.043)       1:20:37.365 *********** 
=============================================================================== 

Updated nodes - Controller
Success

Can you help us understand why the mon restart failed during a controller update? I tried docker restart <mon id>, but it did not help.
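
One plausible explanation for why docker restart does not help: it only bounces the existing container, so the cleanup and re-create that the systemd unit performs at startup (visible in the journal above) never run. Comparing the two views may confirm the divergence (unit name assumed as before):

# Compare what systemd and docker each believe about the mon:
systemctl status ceph-mon@controller-3
docker ps --filter name=ceph-mon-controller-3
journalctl -u ceph-mon@controller-3 -n 50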

Comment 29 errata-xmlrpc 2020-08-18 18:05:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 3.3 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3504

Comment 33 John Fulton 2020-09-08 22:59:01 UTC
*** Bug 1856711 has been marked as a duplicate of this bug. ***

Comment 35 Yogev Rabl 2020-09-14 13:23:39 UTC
*** Bug 1877815 has been marked as a duplicate of this bug. ***

Comment 41 Red Hat Bugzilla 2023-09-14 06:02:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

