Description of problem:
Upgrading an OSP11 environment with Ceph nodes (first deployed as OSP11 GA, then updated to the latest OSP11) can fail with the following error in /var/log/mistral/ceph-install-workflow.log:

2017-12-01 08:30:48,553 p=5871 u=mistral | fatal: [192.168.24.18]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["docker", "exec", "ceph-mon-controller-2", "stat", "/var/run/ceph/ceph-mon.controller-2.asok"], "delta": "0:00:00.076345", "end": "2017-12-01 13:30:47.681632", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-01 13:30:47.605287", "stderr": "stat: cannot stat '/var/run/ceph/ceph-mon.controller-2.asok': No such file or directory", "stderr_lines": ["stat: cannot stat '/var/run/ceph/ceph-mon.controller-2.asok': No such file or directory"], "stdout": "", "stdout_lines": []}

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 GA (with RHEL 7.3) including Ceph nodes
2. Update to the latest OSP11 (RHEL 7.4)
3. Reboot the overcloud nodes
4. Upgrade to OSP12

Actual results:
The major upgrade composable step fails with the same error as quoted in the description above (stat: cannot stat '/var/run/ceph/ceph-mon.controller-2.asok': No such file or directory).

Expected results:
The upgrade doesn't fail.
Additional info:
Inside the ceph-mon container we can see that the asok file is generated using the FQDN instead of the short name:

[root@controller-2 /]# ls /var/run/ceph/
ceph-mon.controller-2.localdomain.asok

This looks very similar to the issue reported in bug 1507888, but the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5 doesn't allow the upgrade to succeed.

Checking the hostname on the monitor nodes we can see:

[root@controller-0 ~]# cat /etc/hostname
controller-0.localdomain
[root@controller-0 ~]# hostname -s
controller-0
[root@controller-0 ~]# hostname -f
controller-0.localdomain
[root@controller-0 ~]# hostname
controller-0.localdomain
[root@controller-0 ~]# hostnamectl
   Static hostname: controller-0.localdomain
         Icon name: computer-vm
           Chassis: vm
        Machine ID: c5b3db13b6534c638ca71d99ea4bc1ea
           Boot ID: b6b1d21713f24804af83f2a46325bccd
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
            Kernel: Linux 3.10.0-693.11.1.el7.x86_64
      Architecture: x86-64

As a workaround, I tried setting the hostname manually to the short name:

[root@controller-2 ~]# hostname controller-2
[root@controller-2 ~]# echo controller-2 > /etc/hostname
[root@controller-2 ~]# hostname
controller-2

Repeat this on all monitor nodes (controllers).

^ This worked and allowed the ceph-ansible upgrade to complete.
Marius, do you have any clue whether the hostname of the nodes changed when upgrading from 11 to 12? The sockets are named after the output of 'hostname'; did that include the domain part in the initial deployment?
(In reply to Giulio Fidente from comment #1)
> Marius, do you have any clue if the hostname of the nodes changed when
> upgrading from 11 to 12?

I'm not sure about this. I'm going to reproduce it today and I'll get back with confirmation.

> The sockets will be named after the output of 'hostname', did that include
> the domain part in the initial deployment?

Yes, in the initial OSP11 GA deployment the output of 'hostname' includes the domain part.
Thanks Marius, more questions below.

(In reply to Marius Cornea from comment #2)
> (In reply to Giulio Fidente from comment #1)
> > Marius, do you have any clue if the hostname of the nodes changed when
> > upgrading from 11 to 12?
>
> I'm not sure about this. I'm going to reproduce this today and I'll get back
> with the confirmation.

This might not be necessary.

> > The sockets will be named after the output of 'hostname', did that include
> > the domain part in the initial deployment?
>
> Yes, in the initial OSP11 GA deployment the output of 'hostname' includes
> the domain part.

I wonder whether the workaround mentioned at [1] will work on upgrade too, then?

1. https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5
(In reply to Giulio Fidente from comment #3)
> I wonder if the workaround mentioned at [1] will work on upgrade too then?
>
> 1. https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5

No, it didn't work: in the initial deployment ceph-mon runs with -i $shortname, whereas if we set mon_use_fqdn: true, ceph-mon tries to run with -i $fqdn.
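To make the mismatch in the comment above concrete, here is a minimal sketch of the two monitor ids involved. It is only a model of the behaviour described in this thread, not ceph-ansible code: the helper names `mon_id` and `mon_data_dir` are hypothetical, and the data directory layout is an assumption based on Ceph's default `mon data` setting (/var/lib/ceph/mon/$cluster-$id).

```python
def mon_id(fqdn: str, mon_use_fqdn: bool) -> str:
    """Id passed to `ceph-mon -i`, following the behaviour described in this thread."""
    short = fqdn.split(".", 1)[0]
    return fqdn if mon_use_fqdn else short


def mon_data_dir(mon_id_value: str, cluster: str = "ceph") -> str:
    # Assumption: Ceph's default `mon data` layout, /var/lib/ceph/mon/$cluster-$id.
    return f"/var/lib/ceph/mon/{cluster}-{mon_id_value}"


fqdn = "controller-2.localdomain"
deployed = mon_id(fqdn, mon_use_fqdn=False)  # id used by the initial deployment
upgraded = mon_id(fqdn, mon_use_fqdn=True)   # id requested when mon_use_fqdn is true

# The two ids differ, so the monitor would be pointed at a different identity.
print(deployed, mon_data_dir(deployed))
print(upgraded, mon_data_dir(upgraded))
```

If that assumption holds, a monitor created with the short id cannot simply be restarted under the FQDN id, which would be consistent with the workaround from bug 1507888 not helping on upgrade.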
In the initial bug report Marius posted a possible workaround:

Trying as a workaround to set the hostname manually to the shortname:

[root@controller-2 ~]# hostname controller-2
[root@controller-2 ~]# echo controller-2 > /etc/hostname
[root@controller-2 ~]# hostname
controller-2

Repeat this for all monitor nodes (controllers).

^ This worked and allowed the ceph-ansible upgrade to complete.
*** Bug 1533283 has been marked as a duplicate of this bug. ***
I think one option to fix this across the board is to make the tasks at [1] and [2] in ceph-ansible check for a socket named with either the shortname or the FQDN.

1. https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mds/tasks/containerized.yml#L66
2. https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mon/tasks/docker/main.yml#L10
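The idea of accepting either name could look roughly like the sketch below. It is written in Python for illustration only; the real check would live in the ceph-ansible tasks linked above, and `find_mon_socket` is a hypothetical helper. The socket file name pattern (ceph-mon.<name>.asok) is taken from the error and the `ls` output in this report.

```python
import os
import tempfile
from typing import Optional


def find_mon_socket(run_dir: str, shortname: str, fqdn: str,
                    cluster: str = "ceph") -> Optional[str]:
    """Return the first existing monitor admin socket, trying both names."""
    for name in (shortname, fqdn):
        path = os.path.join(run_dir, f"{cluster}-mon.{name}.asok")
        if os.path.exists(path):
            return path
    return None


# Demo against a throwaway directory standing in for /var/run/ceph.
with tempfile.TemporaryDirectory() as run_dir:
    # Plain file used as a stand-in for the socket the daemon would create.
    open(os.path.join(run_dir, "ceph-mon.controller-2.localdomain.asok"), "w").close()
    print(find_mon_socket(run_dir, "controller-2", "controller-2.localdomain"))
```

With a check like this, the task would succeed whether the daemon named its socket after the short name or the FQDN, at the cost of no longer detecting which of the two is in use.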
I think we first need to look inside the container at the output of the hostname command. If the output of hostname, hostname -s, or hostname -f inside the container differs from what we have on the host (basically what Ansible sees), that might explain your problem.
(In reply to leseb from comment #11)
> I think we need to look inside the container to see the output of the
> hostname command first. If the output of hostname / -s / -f inside the
> container differs from what we have on the host (basically what ansible
> sees) then this might explain your problem

After further investigation with Guillaume, that does seem to be exactly the case: the ansible_hostname variable is set from the host to the shortname, while inside the container 'hostname' returns the name with the domain part.

I think Guillaume is looking into a fix in ceph-ansible: either set a known value for the socket name or, maybe, merge something like [1] to catch both names.

1. https://github.com/ceph/ceph-ansible/pull/2371
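As a rough illustration of the finding above, the sketch below compares the socket path the task checks (built from ansible_hostname) with the path the daemon creates (built from 'hostname' inside the container). The function names `socket_path` and `diagnose` are hypothetical; the values are the ones reported in this bug.

```python
def socket_path(name: str, cluster: str = "ceph") -> str:
    """Admin socket path for a monitor named `name`, as seen in this report."""
    return f"/var/run/ceph/{cluster}-mon.{name}.asok"


def diagnose(ansible_hostname: str, container_hostname: str) -> dict:
    """Compare the path the task stats with the path the daemon creates."""
    checked = socket_path(ansible_hostname)    # what the Ansible task looks for
    created = socket_path(container_hostname)  # what exists inside the container
    return {"checked": checked, "created": created, "match": checked == created}


# Values from this bug: short name on the Ansible side, FQDN inside the container.
result = diagnose("controller-2", "controller-2.localdomain")
print(result)
```

The two paths only agree when both sides report the same name, which is why setting the host's hostname to the short name made the check pass.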
The patch in https://github.com/ceph/ceph-container/pull/895 should fix this issue.
*** Bug 1544400 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0606