1519842 – OSP11 -> OSP12 upgrade: ceph-ansible might fail with: "stat: cannot stat '/var/run/ceph/ceph-mon.controller-2.asok': No such file or directory"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	ceph-ansible
Sub Component:
Version:	12.0 (Pike)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	z2
Target Release:	12.0 (Pike)
Assignee:	Giulio Fidente
QA Contact:	Yogev Rabl
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1533283 1544400
Depends On:	1541303
Blocks:

Reported:	2017-12-01 14:36 UTC by Marius Cornea
Modified:	2018-03-28 17:22 UTC
CC List:	14 users
Fixed In Version:	ceph-ansible-3.0.24-1.el7cp
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1541303
Environment:
Last Closed:	2018-03-28 17:21:38 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:0606	0	None	None	None	2018-03-28 17:22:42 UTC

Description Marius Cornea 2017-12-01 14:36:41 UTC

Description of problem:

Upgrading an OSP11 environment with Ceph nodes (first deployed as OSP11 GA, then updated to the latest OSP11) can fail with the following error in /var/log/mistral/ceph-install-workflow.log:

2017-12-01 08:30:48,553 p=5871 u=mistral |  fatal: [192.168.24.18]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["docker", "exec", "ceph-mon-controller-2", "stat", "/var/run/ceph/ceph-mon.controller-2.asok"], "delta": "0:00:00.076345", "end": "2017-12-01 13:30:47.681632", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-01 13:30:47.605287", "stderr": "stat: cannot stat '/var/run/ceph/ceph-mon.controller-2.asok': No such file or directory", "stderr_lines": ["stat: cannot stat '/var/run/ceph/ceph-mon.controller-2.asok': No such file or directory"], "stdout": "", "stdout_lines": []}

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11 GA (with RHEL 7.3) including Ceph nodes
2. Update to the latest OSP11 (RHEL 7.4)
3. Reboot overcloud nodes
4. Upgrade to OSP12

Actual results:
Major upgrade composable step fails with:
2017-12-01 08:30:48,553 p=5871 u=mistral |  fatal: [192.168.24.18]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["docker", "exec", "ceph-mon-controller-2", "stat", "/var/run/ceph/ceph-mon.controller-2.asok"], "delta": "0:00:00.076345", "end": "2017-12-01 13:30:47.681632", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-12-01 13:30:47.605287", "stderr": "stat: cannot stat '/var/run/ceph/ceph-mon.controller-2.asok': No such file or directory", "stderr_lines": ["stat: cannot stat '/var/run/ceph/ceph-mon.controller-2.asok': No such file or directory"], "stdout": "", "stdout_lines": []}

Expected results:
Upgrade doesn't fail.

Additional info:

Inside the ceph-mon container we can see that the asok file gets generated by using the fqdn instead of the short name:

[root@controller-2 /]# ls /var/run/ceph/ 
ceph-mon.controller-2.localdomain.asok

This issue looks very similar to the one reported in bug 1507888, but the workaround mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5 doesn't allow the upgrade to succeed.

Checking the hostname on the monitor nodes we can see:

[root@controller-0 ~]# cat /etc/hostname 
controller-0.localdomain
[root@controller-0 ~]# hostname -s
controller-0
[root@controller-0 ~]# hostname -f
controller-0.localdomain
[root@controller-0 ~]# hostname
controller-0.localdomain
[root@controller-0 ~]# hostnamectl 
   Static hostname: controller-0.localdomain
         Icon name: computer-vm
           Chassis: vm
        Machine ID: c5b3db13b6534c638ca71d99ea4bc1ea
           Boot ID: b6b1d21713f24804af83f2a46325bccd
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.4:GA:server
            Kernel: Linux 3.10.0-693.11.1.el7.x86_64
      Architecture: x86-64

As a workaround, I tried setting the hostname manually to the short name:

[root@controller-2 ~]# hostname controller-2
[root@controller-2 ~]# echo controller-2 > /etc/hostname
[root@controller-2 ~]# hostname
controller-2

Repeat this for all monitor nodes (controllers).

^ This worked and allowed the ceph-ansible upgrade to complete.
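
For repeating the workaround on several nodes, the manual commands above could be wrapped roughly as follows. This is only a sketch: the function names short_name and apply_short_hostname are made up for illustration, and it assumes it runs as root on the monitor node itself.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of ceph-ansible or TripleO.

# Strip the domain part from a hostname:
# controller-2.localdomain becomes controller-2.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Set both the running and the persistent hostname to the short name,
# mirroring the manual commands shown above. Must run as root.
apply_short_hostname() {
    short="$(short_name "$(hostname)")"
    hostname "$short"
    echo "$short" > /etc/hostname
}
```

Calling apply_short_hostname on each controller should leave 'hostname' returning the short name, as in the manual run above.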

Comment 1 Giulio Fidente 2017-12-04 09:55:10 UTC

Marius, do you have any clue if the hostname of the nodes changed when upgrading from 11 to 12?

The sockets will be named after the output of 'hostname'; did that include the domain part in the initial deployment?

Comment 2 Marius Cornea 2017-12-04 09:59:07 UTC

(In reply to Giulio Fidente from comment #1)
> Marius, do you have any clue if the hostname of the nodes changed when
> upgrading from 11 to 12?

I'm not sure about this. I'm going to reproduce this today and I'll get back with the confirmation. 

> The sockets will be named after the output of 'hostname', did that include
> the domain part in the initial deployment?

Yes, in the initial OSP11 GA deployment the output of 'hostname' includes the domain part.

Comment 3 Giulio Fidente 2017-12-04 10:11:48 UTC

thanks Marius, more questions below

(In reply to Marius Cornea from comment #2)
> (In reply to Giulio Fidente from comment #1)
> > Marius, do you have any clue if the hostname of the nodes changed when
> > upgrading from 11 to 12?
> 
> I'm not sure about this. I'm going to reproduce this today and I'll get back
> with the confirmation. 

this might not be necessary
 
> > The sockets will be named after the output of 'hostname', did that include
> > the domain part in the initial deployment?
> 
> Yes, in the initial OSP11 GA deployment the output of 'hostname' includes
> the domain part.

I wonder if the workaround mentioned at [1] will work on upgrade too then?

1. https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5

Comment 4 Marius Cornea 2017-12-04 14:28:10 UTC

(In reply to Giulio Fidente from comment #3)
> thanks Marius, more questions below
> 
> (In reply to Marius Cornea from comment #2)
> > (In reply to Giulio Fidente from comment #1)
> > > Marius, do you have any clue if the hostname of the nodes changed when
> > > upgrading from 11 to 12?
> > 
> > I'm not sure about this. I'm going to reproduce this today and I'll get back
> > with the confirmation. 
> 
> this might not be necessary
>  
> > > The sockets will be named after the output of 'hostname', did that include
> > > the domain part in the initial deployment?
> > 
> > Yes, in the initial OSP11 GA deployment the output of 'hostname' includes
> > the domain part.
> 
> I wonder if the workaround mentioned at [1] will work on upgrade too then?
> 
> 1. https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5

Nope, it didn't work: in the initial deployment ceph-mon runs with -i $shortname, whereas if we set mon_use_fqdn: true, ceph-mon tries to run with -i $fqdn.

Comment 6 Giulio Fidente 2018-01-23 17:24:45 UTC

In the initial bug report Marius posted a possible workaround:

Trying as a workaround to set the hostname manually to the shortname:

[root@controller-2 ~]# hostname controller-2
[root@controller-2 ~]# echo controller-2 > /etc/hostname
[root@controller-2 ~]# hostname
controller-2

Repeat this for all monitor nodes (controllers).

^ This worked and allowed the ceph-ansible upgrade to complete.

Comment 7 Giulio Fidente 2018-01-23 17:25:32 UTC

*** Bug 1533283 has been marked as a duplicate of this bug. ***

Comment 8 Giulio Fidente 2018-02-01 17:42:24 UTC

I think one option to fix this across the board is to make [1] and [2] in ceph-ansible check for a socket named with either the short name or the FQDN.

1. https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mds/tasks/containerized.yml#L66
2. https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-mon/tasks/docker/main.yml#L10
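
A rough shell sketch of what "check for either name" could look like, not the actual ceph-ansible change. The function name find_mon_socket and its arguments are hypothetical; the socket file name pattern is taken from the error message and the ls output in the description.

```shell
#!/bin/sh
# Sketch only: find_mon_socket is a hypothetical helper, not ceph-ansible code.
# Print the first existing monitor socket among the short-name and FQDN
# candidates; return non-zero if neither exists.
# $1: short hostname, $2: FQDN, $3: socket directory (default /var/run/ceph)
find_mon_socket() {
    dir="${3:-/var/run/ceph}"
    for name in "$1" "$2"; do
        sock="$dir/ceph-mon.$name.asok"
        # Existence check only, like the failing 'stat' task.
        if [ -e "$sock" ]; then
            printf '%s\n' "$sock"
            return 0
        fi
    done
    return 1
}
```

In a containerized deployment the same test would have to run inside the monitor container, for example through 'docker exec' as the failing task does.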

Comment 11 Sébastien Han 2018-02-02 09:00:16 UTC

I think we need to look inside the container to see the output of the hostname command first. If the output of hostname / -s / -f inside the container differs from what we have on the host (basically what ansible sees) then this might explain your problem
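
The comparison could be scripted along these lines. This is a sketch under assumptions: check_names and compare_hostname are hypothetical helper names, and the container is assumed to be named ceph-mon-<short hostname> as in the failing task from the description.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Report whether the name seen on the host matches the name the container
# reports. Returns 0 on match, 1 on mismatch.
check_names() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: host=$1 container=$2"
        return 1
    fi
}

# Compare one hostname variant between the host and the container.
# $1: container name, $2: optional hostname flag (-s or -f)
compare_hostname() {
    # $2 is intentionally unquoted so an empty flag passes no argument.
    check_names "$(hostname $2)" "$(docker exec "$1" hostname $2)"
}
```

Running compare_hostname ceph-mon-controller-2, then again with -s and with -f, on a monitor node would show which variants differ between host and container.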

Comment 12 Giulio Fidente 2018-02-02 18:02:50 UTC

(In reply to leseb from comment #11)
> I think we need to look inside the container to see the output of the
> hostname command first. If the output of hostname / -s / -f inside the
> container differs from what we have on the host (basically what ansible
> sees) then this might explain your problem

After further investigation with Guillaume, that seems to be exactly the case: the ansible_hostname variable is set on the host to the short name, while inside the container 'hostname' returns the name with the domain part.

I think Guillaume is looking into a fix in ceph-ansible: either setting a known value for the socket name or, maybe, merging something like [1] to catch both names.

1. https://github.com/ceph/ceph-ansible/pull/2371

Comment 13 Guillaume Abrioux 2018-02-05 15:53:05 UTC

the patch in https://github.com/ceph/ceph-container/pull/895 should fix this issue.

Comment 15 Giulio Fidente 2018-02-13 09:30:01 UTC

*** Bug 1544400 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2018-03-28 17:21:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0606