Bug 1507888 - Deployment with ceph and TLS everywhere fails with: "WorkflowTasks_Step2_Execution: ERROR "cannot stat '/var/run/ceph/ceph-mon.overcloud-controller-2.asok': No such file or directory""
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 12.0 (Pike)
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
Docs Contact: Derek
URL:
Whiteboard:
Duplicates: 1508038
Depends On: 1554444
Blocks:
 
Reported: 2017-10-31 11:44 UTC by Artem Hrechanychenko
Modified: 2018-09-17 14:55 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-7.0.3-21.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-28 17:14:53 UTC
Target Upstream Version:


Attachments
ceph-install-workflow (61.54 KB, text/plain)
2017-10-31 11:44 UTC, Artem Hrechanychenko


Links
System ID (Last Updated)
Red Hat Product Errata RHSA-2018:0602 (2018-03-28 17:15:58 UTC)
OpenStack gerrit 523375 (2018-01-18 11:02:46 UTC)
OpenStack gerrit 526323 (2018-01-18 11:03:16 UTC)
Launchpad 1733874 (2017-11-23 10:25:47 UTC)

Description Artem Hrechanychenko 2017-10-31 11:44:41 UTC
Created attachment 1345843
ceph-install-workflow

Description of problem:
Deployment with 3 controllers, 2 computes and 3 Ceph storage nodes failed:
https://rhos-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OT2_container_netiso_osp12_HA_TLS_everywhere/69/console

09:11:45 2017-10-31 09:11:33Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.AllNodesDeploySteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
09:11:45 
09:11:45  Stack overcloud CREATE_FAILED 
09:11:45 
09:11:45 overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
09:11:45   resource_type: OS::Mistral::ExternalResource
09:11:45   physical_resource_id: 07145ae1-c20f-4430-aed3-b59cbf56cbad
09:11:45   status: CREATE_FAILED
09:11:45   status_reason: |
09:11:45     resources.WorkflowTasks_Step2_Execution: ERROR
09:11:45 
09:11:45 
09:11:45 STDERR:
09:11:45 
09:11:45 Waiting for messages on queue 'babe085e-a82b-4463-885b-9bd5d5672a9f' with no timeout.
09:11:45 Heat Stack create failed.
09:11:45 Heat Stack create failed.

From the ceph-ansible install log:
(undercloud) [stack@undercloud-0 ~]$ sudo cat /var/log/mistral/ceph-install-workflow.log |grep fatal
2017-10-31 05:11:29,244 p=17581 u=mistral |  fatal: [192.168.24.20]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["docker", "exec", "ceph-mon-overcloud-controller-2", "stat", "/var/run/ceph/ceph-mon.overcloud-controller-2.asok"], "delta": "0:00:00.075543", "end": "2017-10-31 09:11:29.814950", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-10-31 09:11:29.739407", "stderr": "stat: cannot stat '/var/run/ceph/ceph-mon.overcloud-controller-2.asok': No such file or directory", "stderr_lines": ["stat: cannot stat '/var/run/ceph/ceph-mon.overcloud-controller-2.asok': No such file or directory"], "stdout": "", "stdout_lines": []}
(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks               |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| 44b8d372-1362-4fbc-a5a6-eb106849216e | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 29600d77-24a0-4aaa-b4cd-4145535c71a2 | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 75e9a780-6f9e-4eb7-97f4-34f4600241e9 | overcloud-cephstorage-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| ed58b681-2fa9-4b62-bf94-efdb498752f7 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| d5acdc1b-bf38-4476-925b-a8c7253256b6 | overcloud-compute-1     | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 18256f3a-f80a-4a77-96d4-486435a6af7f | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.24.17 |
| c203603b-c380-4e67-a4b6-43210b8af35a | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.24.6  |
| 463c0929-e3d6-463b-980d-780c1db16c98 | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.24.20 |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.20
The authenticity of host '192.168.24.20 (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:g4GGhrKma9sjlMVjulhwZBqPisG2AD54xH1flkzr/Ak.
ECDSA key fingerprint is MD5:22:b7:98:94:a1:4d:41:08:e1:9a:18:7f:48:a6:34:90.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.24.20' (ECDSA) to the list of known hosts.
[heat-admin@overcloud-controller-2 ~]$ ls /var/run/ceph/
ls: cannot open directory /var/run/ceph/: Permission denied
[heat-admin@overcloud-controller-2 ~]$ sudo ls /var/run/ceph/
[heat-admin@overcloud-controller-2 ~]$ sudo ls /var/run/ceph/
[heat-admin@overcloud-controller-2 ~]$ exit
logout
Connection to 192.168.24.20 closed.
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.17
The authenticity of host '192.168.24.17 (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:eV0Ks+0JBsXl7YDrF69IZ76RCU7fCb5krCLs89qXNro.
ECDSA key fingerprint is MD5:d0:f0:c4:10:65:8e:1a:c7:e1:e4:64:a0:77:97:bd:b7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.24.17' (ECDSA) to the list of known hosts.
[heat-admin@overcloud-controller-0 ~]$ sudo ls /var/run/ceph/
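
The fatal line grepped from ceph-install-workflow.log above carries the failing host, return code and stderr in a JSON payload. A minimal sketch of pulling those fields out, assuming the log keeps the "fatal: [host]: FAILED! => {json}" shape seen here; the sample line is abbreviated from the log, and the names FATAL_RE and parse_fatal are hypothetical helpers, not part of ceph-ansible or Mistral:

```python
import json
import re

# Abbreviated copy of the fatal line from the log above (most keys dropped).
SAMPLE = (
    "2017-10-31 05:11:29,244 p=17581 u=mistral |  fatal: [192.168.24.20]: FAILED! => "
    '{"attempts": 5, "rc": 1, "stderr": "stat: cannot stat '
    "'/var/run/ceph/ceph-mon.overcloud-controller-2.asok': No such file or directory\"}"
)

# Assumption: the result payload after "=>" is a single JSON object on one line.
FATAL_RE = re.compile(r"fatal: \[(?P<host>[^\]]+)\]: FAILED! => (?P<payload>\{.*\})\s*$")


def parse_fatal(line):
    """Return (host, rc, stderr) for a fatal line, or None if the line does not match."""
    match = FATAL_RE.search(line)
    if match is None:
        return None
    result = json.loads(match.group("payload"))
    return match.group("host"), result.get("rc"), result.get("stderr", "")


host, rc, stderr = parse_fatal(SAMPLE)
print(host, rc, stderr)
```

Here the host 192.168.24.20 maps to overcloud-controller-2 in the nova list output, which is the node whose monitor socket is missing.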

Version-Release number of selected component (if applicable):
puddle - 20171026.1 
ceph-ansible-3.0.7-1.el7cp.noarch
openstack-mistral-executor-5.1.1-0.20171020071242.063951f.el7ost.noarch
puppet-openstack_extras-11.3.1-0.20170906070209.b99c3a4.el7ost.noarch
openstack-nova-compute-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-neutron-openvswitch-11.0.2-0.20171020230401.el7ost.noarch
openstack-heat-engine-9.0.1-0.20171023060845.be1e2e9.el7ost.noarch
openstack-tripleo-heat-templates-7.0.3-0.20171023134948.el7ost.noarch
openstack-zaqar-5.0.1-0.20170929082616.2d42d4d.el7ost.noarch
openstack-nova-placement-api-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-nova-common-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-swift-object-2.15.2-0.20170927035729.0344d6e.el7ost.noarch
openstack-heat-common-9.0.1-0.20171023060845.be1e2e9.el7ost.noarch
openstack-mistral-api-5.1.1-0.20171020071242.063951f.el7ost.noarch
puppet-openstacklib-11.3.1-0.20170921022915.6e2b844.el7ost.noarch
openstack-keystone-12.0.1-0.20171012013909.5c9ccce.el7ost.noarch
openstack-neutron-common-11.0.2-0.20171020230401.el7ost.noarch
python-openstackclient-lang-3.12.0-1.el7ost.noarch
openstack-ironic-common-9.1.2-0.20171019051035.f0b0521.el7ost.noarch
python-openstackclient-3.12.0-1.el7ost.noarch
openstack-selinux-0.8.11-0.20171013192233.ce13ba7.el7ost.noarch
python-openstacksdk-0.9.17-1.el7ost.noarch
openstack-tripleo-image-elements-7.0.1-0.20171020101256.2e61e31.el7ost.noarch
openstack-mistral-common-5.1.1-0.20171020071242.063951f.el7ost.noarch
openstack-tripleo-ui-7.4.3-0.20171023133305.8616195.el7ost.noarch
openstack-tripleo-validations-7.4.2-0.20171016115241.c2c9bf2.el7ost.noarch
openstack-nova-scheduler-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-glance-15.0.1-0.20171017090105.06af2eb.el7ost.noarch
openstack-swift-account-2.15.2-0.20170927035729.0344d6e.el7ost.noarch
openstack-neutron-ml2-11.0.2-0.20171020230401.el7ost.noarch
openstack-swift-proxy-2.15.2-0.20170927035729.0344d6e.el7ost.noarch
openstack-heat-api-cfn-9.0.1-0.20171023060845.be1e2e9.el7ost.noarch
openstack-ironic-api-9.1.2-0.20171019051035.f0b0521.el7ost.noarch
openstack-tripleo-puppet-elements-7.0.1-0.20171020122223.82d7e6c.el7ost.noarch
openstack-nova-api-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-nova-conductor-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-neutron-11.0.2-0.20171020230401.el7ost.noarch
openstack-ironic-conductor-9.1.2-0.20171019051035.f0b0521.el7ost.noarch
openstack-tempest-17.0.0-1.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20171020071242.063951f.el7ost.noarch
openstack-ironic-inspector-6.0.1-0.20170920142417.77e2b1a.el7ost.noarch
openstack-tripleo-common-7.6.3-0.20171022171808.el7ost.noarch
openstack-swift-container-2.15.2-0.20170927035729.0344d6e.el7ost.noarch
openstack-puppet-modules-11.0.0-0.20170828113154.el7ost.noarch
openstack-tripleo-common-containers-7.6.3-0.20171022171808.el7ost.noarch
openstack-heat-api-9.0.1-0.20171023060845.be1e2e9.el7ost.noarch


How reproducible:
always


Comment 4 David Critch 2017-10-31 18:27:14 UTC
I have the same issue, though I'm not using TLS. It seems to be due to setting an overcloud domain. All things being equal, setting an overcloud domain causes the same error. 

The ceph-mon processes are being started with an FQDN so /var/run/ceph/ceph-mon.overcloud-controller-2.asok does not exist, but /var/run/ceph/ceph-mon.overcloud-controller-2.my.domain.asok does.

Without setting an overcloud domain, the short hostname version of the file is there and ceph-ansible is happy.
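
A minimal sketch of the mismatch described in this comment, assuming the monitor is named after the FQDN when FQDN naming is in effect and after the short hostname otherwise. The function name mon_asok_path is hypothetical and this is not ceph-ansible's code; it only illustrates why the stat check looks for a file that was never created:

```python
def mon_asok_path(hostname: str, mon_use_fqdn: bool, run_dir: str = "/var/run/ceph") -> str:
    """Return the admin socket path expected for a monitor named after the host."""
    # Assumption: with mon_use_fqdn the monitor name is the full hostname,
    # otherwise it is everything before the first dot.
    mon_name = hostname if mon_use_fqdn else hostname.split(".", 1)[0]
    return f"{run_dir}/ceph-mon.{mon_name}.asok"


fqdn = "overcloud-controller-2.my.domain"
# Path ceph-ansible checked in the failing run (short hostname):
print(mon_asok_path(fqdn, mon_use_fqdn=False))
# Path that actually exists when the monitor was started with the FQDN:
print(mon_asok_path(fqdn, mon_use_fqdn=True))
```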

Comment 5 Giulio Fidente 2017-10-31 18:52:51 UTC
hi David, thanks for your help.

I wonder if passing mon_use_fqdn to ceph-ansible helps; could you try creating a Heat environment file with the following contents:

parameter_defaults:
  CephAnsibleExtraConfig:
    mon_use_fqdn: true

and deploy the overcloud with the above to see if it passes?
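
A rough sketch of applying the suggestion above: it writes the environment file and prints a deploy command that includes it. The file name ceph-mon-fqdn.yaml and both helper names are hypothetical, and a real deployment would pass its other environment files as well; nothing is executed here, the command is only printed:

```python
import os
import tempfile

# Contents exactly as proposed in the comment above.
ENV_BODY = """\
parameter_defaults:
  CephAnsibleExtraConfig:
    mon_use_fqdn: true
"""


def write_env_file(directory, name="ceph-mon-fqdn.yaml"):
    """Write the environment file and return its path (file name is an assumption)."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write(ENV_BODY)
    return path


def deploy_command(env_files):
    """Build an 'openstack overcloud deploy' argument list with one -e per file."""
    cmd = ["openstack", "overcloud", "deploy", "--templates"]
    for env in env_files:
        cmd += ["-e", env]
    return cmd


workdir = tempfile.mkdtemp()
env_path = write_env_file(workdir)
print(" ".join(deploy_command([env_path])))
```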

Comment 6 David Critch 2017-11-01 03:05:12 UTC
Yup, that totally did it. Thanks Giulio!

Comment 8 Giulio Fidente 2017-11-01 10:44:22 UTC
David, thanks a lot for helping. I am trying to figure out whether we can enable the parameter from comment #5 conditionally when a cloud domain is set.

Can you tell me which parameters you use to set the cloud domain? Is it CloudDomain only?

Comment 9 David Critch 2017-11-01 13:09:48 UTC
No. With OSP12, you can set an 'overcloud_domain_name' in undercloud.conf.

I had CloudDomain in one of my env files - along w/ doing the nova/neutron changes and restarts - prior to 12 but I took it out since undercloud.conf now covers all that.

Comment 10 Artem Hrechanychenko 2017-11-01 13:16:54 UTC
The workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5 works for me.

Comment 11 David Critch 2017-11-01 21:23:12 UTC
FWIW, I tried `mon_use_fqdn: true` on a fresh deploy without setting the domain and that failed for the same reason. Would have been sweet if you could just set it and not worry about it.

Whether you've configured a domain the old or the new way, the net result will include a dhcp_domain entry in nova.conf and a dns_domain entry in neutron.conf. So you could conditionally enable it if:
crudini --get /etc/neutron/neutron.conf DEFAULT dns_domain
or
crudini --get /etc/nova/nova.conf DEFAULT dhcp_domain
is not 'localdomain'
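
The check proposed above can be sketched in Python with the standard configparser module instead of crudini. This is only an illustration of the rule stated in this comment; the helper names are hypothetical, and treating a missing or empty option as "no custom domain" is an assumption rather than something the comment specifies:

```python
import configparser


def configured_domain(conf_text, option):
    """Return the DEFAULT-section value of `option` from INI text, or None if unset."""
    # interpolation=None keeps '%' characters literal; strict=False tolerates
    # repeated options, which can occur in OpenStack configuration files.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.get("DEFAULT", option, fallback=None)


def should_use_fqdn(domain):
    """Apply the rule from the comment: enable when the domain is not 'localdomain'."""
    # Assumption: an unset or empty value counts as no custom domain.
    return domain not in (None, "", "localdomain")


neutron_conf = "[DEFAULT]\ndns_domain = example.com\n"
nova_conf = "[DEFAULT]\ndhcp_domain = localdomain\n"

print(should_use_fqdn(configured_domain(neutron_conf, "dns_domain")))
print(should_use_fqdn(configured_domain(nova_conf, "dhcp_domain")))
```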

Comment 12 Giulio Fidente 2017-11-02 09:52:01 UTC
(In reply to David Critch from comment #11)
> FWIW, I tried `mon_use_fqdn: true` on a fresh deploy without setting the
> domain and that failed for the same reason. Would have been sweet if you
> could just set it and not worry about it.

Yeah, it doesn't guess the configuration and expects the user to set the boolean when necessary.
 
> Whether you've configured a domain the old or the new way, the net result
> will include a dhcp_domain entry in nova.conf and a dns_domain entry in
> neutron.conf. So you could conditionally enable it if:
> crudini --get /etc/neutron/neutron.conf DEFAULT dns_domain
> or
> crudini --get /etc/nova/nova.conf DEFAULT dhcp_domain
> is not 'localdomain'

It's actually quickstart that passes, via CloudDomain, the same domain that is set in undercloud.conf, so I think we should be okay enabling the boolean based on CloudDomain.

Comment 13 Giulio Fidente 2017-11-02 10:08:36 UTC
*** Bug 1508038 has been marked as a duplicate of this bug. ***

Comment 20 Michele Baldessari 2017-12-04 11:08:16 UTC
Hasn't https://review.openstack.org/#/c/523375/ made things worse for everyone now? 

Now on a fresh deploy that worked until yesterday I get errors because
the short hostname is:
overcloud-novacompute-0.novalocal

And this happens even when I explicitly set my CloudDomain to localdomain in my env files. With the default way hostnames are assigned in our current undercloud configuration, if dhcp_domain on the undercloud is set to something different from CloudDomain, short hostnames suddenly become FQDNs.

Comment 21 Michele Baldessari 2017-12-04 13:42:32 UTC
Another side effect here is that a deployment from master now has overcloud nodes with hostnames ending in .novalocal even when .localdomain is specified as CloudDomain.

Comment 22 Giulio Fidente 2017-12-04 13:50:35 UTC
(In reply to Michele Baldessari from comment #20)
> Hasn't https://review.openstack.org/#/c/523375/ made things worse for
> everyone now? 
> 
> Now on a fresh deploy that worked until yesterday I get errors because
> the short hostname is:
> overcloud-novacompute-0.novalocal
> 
> And this happens even when I explicitly set my CloudDomain to localdomain in
> my env files. With the default way hostnames are assigned in our current
> undercloud configuration, if dhcp_domain on the undercloud is set to
> something different from CloudDomain, short hostnames suddenly become FQDNs.

I suppose that is because nova is deliberately using .novalocal when the setting is left to the default.

I suppose a better fix would have been to set it to '' instead, as we do for the overcloud [1], what do you think?

1. https://github.com/openstack/tripleo-heat-templates/blob/107b610923ba5d39f90c3a6a63bf2d3642e1b35d/puppet/services/nova-base.yaml#L223
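
To illustrate the reasoning in this comment, here is a simplified model of how a domain setting turns a short hostname into an FQDN. It assumes the domain is appended only when it is non-empty, which is why an empty value would leave short hostnames alone; advertised_hostname is a hypothetical name and this is a sketch, not nova's actual code:

```python
def advertised_hostname(short_name: str, dhcp_domain: str) -> str:
    """Model of the hostname a node receives for a given dhcp_domain value."""
    # Assumption: a non-empty domain is appended with a dot, an empty one is ignored.
    return f"{short_name}.{dhcp_domain}" if dhcp_domain else short_name


# Default-like domain: the "short" hostname ends up with a .novalocal suffix.
print(advertised_hostname("overcloud-novacompute-0", "novalocal"))
# Empty domain, as suggested above: the hostname stays short.
print(advertised_hostname("overcloud-novacompute-0", ""))
```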

Comment 33 Yogev Rabl 2018-03-16 15:04:16 UTC
Failed to deploy the overcloud with the error:
"Error: /Stage[main]/Tripleo::Certmonger::Ca::Crl/File[tripleo-ca-crl]: Could not evaluate: Could not retrieve file metadata for http://ipa-ca/ipa/crl/MasterCRL.bin: getaddrinfo: Name or service not known"

which is documented in the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1554444

Once the fix for that bug is backported, I'll be able to verify this one.

Comment 34 Giulio Fidente 2018-03-19 18:58:09 UTC
I am the assignee so I am not sure if I can move the BZ into VERIFIED myself but I tested this with:

ceph-ansible-3.0.27-1.el7cp.noarch

using a custom domain name (example.com) in neutron.conf and the following heat parameters:

  CloudDomain: example.com
  CloudName: overcloud.example.com
  CloudNameInternal: overcloud.internalapi.example.com
  CloudNameStorage: overcloud.storage.example.com
  CloudNameStorageManagement: overcloud.storagemgmt.example.com
  CloudNameCtlplane: overcloud.ctlplane.example.com

Comment 36 Yogev Rabl 2018-03-23 19:35:32 UTC
verified on openstack-tripleo-heat-templates-7.0.9-8.el7ost.noarch

Comment 39 errata-xmlrpc 2018-03-28 17:14:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0602

