Created attachment 1345843
ceph-install-workflow

Description of problem:
Deployment with 3ctrl+2comp+3 ceph failed
https://rhos-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OT2_container_netiso_osp12_HA_TLS_everywhere/69/console

09:11:45 2017-10-31 09:11:33Z [overcloud]: CREATE_FAILED Resource CREATE failed: resources.AllNodesDeploySteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
09:11:45
09:11:45 Stack overcloud CREATE_FAILED
09:11:45
09:11:45 overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
09:11:45   resource_type: OS::Mistral::ExternalResource
09:11:45   physical_resource_id: 07145ae1-c20f-4430-aed3-b59cbf56cbad
09:11:45   status: CREATE_FAILED
09:11:45   status_reason: |
09:11:45     resources.WorkflowTasks_Step2_Execution: ERROR
09:11:45
09:11:45
09:11:45 STDERR:
09:11:45
09:11:45 Waiting for messages on queue 'babe085e-a82b-4463-885b-9bd5d5672a9f' with no timeout.
09:11:45 Heat Stack create failed.
09:11:45 Heat Stack create failed.

from ceph-ansible install log

(undercloud) [stack@undercloud-0 ~]$ sudo cat /var/log/mistral/ceph-install-workflow.log |grep fatal
2017-10-31 05:11:29,244 p=17581 u=mistral | fatal: [192.168.24.20]: FAILED! => {"attempts": 5, "changed": true, "cmd": ["docker", "exec", "ceph-mon-overcloud-controller-2", "stat", "/var/run/ceph/ceph-mon.overcloud-controller-2.asok"], "delta": "0:00:00.075543", "end": "2017-10-31 09:11:29.814950", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2017-10-31 09:11:29.739407", "stderr": "stat: cannot stat '/var/run/ceph/ceph-mon.overcloud-controller-2.asok': No such file or directory", "stderr_lines": ["stat: cannot stat '/var/run/ceph/ceph-mon.overcloud-controller-2.asok': No such file or directory"], "stdout": "", "stdout_lines": []}

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks               |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| 44b8d372-1362-4fbc-a5a6-eb106849216e | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 29600d77-24a0-4aaa-b4cd-4145535c71a2 | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 75e9a780-6f9e-4eb7-97f4-34f4600241e9 | overcloud-cephstorage-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| ed58b681-2fa9-4b62-bf94-efdb498752f7 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| d5acdc1b-bf38-4476-925b-a8c7253256b6 | overcloud-compute-1     | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 18256f3a-f80a-4a77-96d4-486435a6af7f | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.24.17 |
| c203603b-c380-4e67-a4b6-43210b8af35a | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.24.6  |
| 463c0929-e3d6-463b-980d-780c1db16c98 | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.24.20 |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin.24.20
The authenticity of host '192.168.24.20 (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:g4GGhrKma9sjlMVjulhwZBqPisG2AD54xH1flkzr/Ak.
ECDSA key fingerprint is MD5:22:b7:98:94:a1:4d:41:08:e1:9a:18:7f:48:a6:34:90.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.24.20' (ECDSA) to the list of known hosts.
[heat-admin@overcloud-controller-2 ~]$ ls /var/run/ceph/
ls: cannot open directory /var/run/ceph/: Permission denied
[heat-admin@overcloud-controller-2 ~]$ sudo ls /var/run/ceph/
[heat-admin@overcloud-controller-2 ~]$ sudo ls /var/run/ceph/
[heat-admin@overcloud-controller-2 ~]$ exit
logout
Connection to 192.168.24.20 closed.
(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin.24.17
The authenticity of host '192.168.24.17 (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:eV0Ks+0JBsXl7YDrF69IZ76RCU7fCb5krCLs89qXNro.
ECDSA key fingerprint is MD5:d0:f0:c4:10:65:8e:1a:c7:e1:e4:64:a0:77:97:bd:b7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.24.17' (ECDSA) to the list of known hosts.
[heat-admin@overcloud-controller-0 ~]$ sudo ls /var/run/ceph/

Version-Release number of selected component (if applicable):
puddle - 20171026.1
ceph-ansible-3.0.7-1.el7cp.noarch
openstack-mistral-executor-5.1.1-0.20171020071242.063951f.el7ost.noarch
puppet-openstack_extras-11.3.1-0.20170906070209.b99c3a4.el7ost.noarch
openstack-nova-compute-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-neutron-openvswitch-11.0.2-0.20171020230401.el7ost.noarch
openstack-heat-engine-9.0.1-0.20171023060845.be1e2e9.el7ost.noarch
openstack-tripleo-heat-templates-7.0.3-0.20171023134948.el7ost.noarch
openstack-zaqar-5.0.1-0.20170929082616.2d42d4d.el7ost.noarch
openstack-nova-placement-api-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-nova-common-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-swift-object-2.15.2-0.20170927035729.0344d6e.el7ost.noarch
openstack-heat-common-9.0.1-0.20171023060845.be1e2e9.el7ost.noarch
openstack-mistral-api-5.1.1-0.20171020071242.063951f.el7ost.noarch
puppet-openstacklib-11.3.1-0.20170921022915.6e2b844.el7ost.noarch
openstack-keystone-12.0.1-0.20171012013909.5c9ccce.el7ost.noarch
openstack-neutron-common-11.0.2-0.20171020230401.el7ost.noarch
python-openstackclient-lang-3.12.0-1.el7ost.noarch
openstack-ironic-common-9.1.2-0.20171019051035.f0b0521.el7ost.noarch
python-openstackclient-3.12.0-1.el7ost.noarch
openstack-selinux-0.8.11-0.20171013192233.ce13ba7.el7ost.noarch
python-openstacksdk-0.9.17-1.el7ost.noarch
openstack-tripleo-image-elements-7.0.1-0.20171020101256.2e61e31.el7ost.noarch
openstack-mistral-common-5.1.1-0.20171020071242.063951f.el7ost.noarch
openstack-tripleo-ui-7.4.3-0.20171023133305.8616195.el7ost.noarch
openstack-tripleo-validations-7.4.2-0.20171016115241.c2c9bf2.el7ost.noarch
openstack-nova-scheduler-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-glance-15.0.1-0.20171017090105.06af2eb.el7ost.noarch
openstack-swift-account-2.15.2-0.20170927035729.0344d6e.el7ost.noarch
openstack-neutron-ml2-11.0.2-0.20171020230401.el7ost.noarch
openstack-swift-proxy-2.15.2-0.20170927035729.0344d6e.el7ost.noarch
openstack-heat-api-cfn-9.0.1-0.20171023060845.be1e2e9.el7ost.noarch
openstack-ironic-api-9.1.2-0.20171019051035.f0b0521.el7ost.noarch
openstack-tripleo-puppet-elements-7.0.1-0.20171020122223.82d7e6c.el7ost.noarch
openstack-nova-api-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-nova-conductor-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
openstack-neutron-11.0.2-0.20171020230401.el7ost.noarch
openstack-ironic-conductor-9.1.2-0.20171019051035.f0b0521.el7ost.noarch
openstack-tempest-17.0.0-1.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20171020071242.063951f.el7ost.noarch
openstack-ironic-inspector-6.0.1-0.20170920142417.77e2b1a.el7ost.noarch
openstack-tripleo-common-7.6.3-0.20171022171808.el7ost.noarch
openstack-swift-container-2.15.2-0.20170927035729.0344d6e.el7ost.noarch
openstack-puppet-modules-11.0.0-0.20170828113154.el7ost.noarch
openstack-tripleo-common-containers-7.6.3-0.20171022171808.el7ost.noarch
openstack-heat-api-9.0.1-0.20171023060845.be1e2e9.el7ost.noarch

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I have the same issue, though I'm not using TLS. It seems to be due to setting an overcloud domain. All things being equal, setting an overcloud domain causes the same error. The ceph-mon processes are being started with an FQDN so /var/run/ceph/ceph-mon.overcloud-controller-2.asok does not exist, but /var/run/ceph/ceph-mon.overcloud-controller-2.my.domain.asok does. Without setting an overcloud domain, the short hostname version of the file is there and ceph-ansible is happy.
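For illustration only, the path mismatch described above can be sketched in Python. This assumes Ceph's default admin socket naming (/var/run/ceph/$cluster-$name.asok) with the cluster named "ceph"; mon_asok_path is a hypothetical helper, not part of ceph-ansible, and "my.domain" is the placeholder domain used in the comment.

```python
import posixpath


def mon_asok_path(mon_name, cluster="ceph", run_dir="/var/run/ceph"):
    """Build the admin socket path for a monitor (hypothetical helper).

    Assumes the default pattern $run_dir/$cluster-mon.$mon_name.asok.
    """
    return posixpath.join(run_dir, f"{cluster}-mon.{mon_name}.asok")


short_name = "overcloud-controller-2"
fqdn = "overcloud-controller-2.my.domain"

# Path that the failing ceph-ansible task ran stat on (short hostname).
expected_by_ansible = mon_asok_path(short_name)
# Path that exists when the monitor is started with its FQDN.
created_by_mon = mon_asok_path(fqdn)

print(expected_by_ansible)
print(created_by_mon)
print(expected_by_ansible == created_by_mon)
```

The two paths differ only in the domain suffix of the monitor name, which is enough for the stat call in the log to fail.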
Hi David, thanks for your help. I wonder if passing mon_use_fqdn to ceph-ansible helps; could you try creating a Heat environment file with the following contents:

parameter_defaults:
  CephAnsibleExtraConfig:
    mon_use_fqdn: true

and deploy the overcloud with the above to see if it passes?
Yup, that totally did it. Thanks Giulio!
David, thanks a lot for helping. I am trying to figure out whether we can enable the parameter in comment #5 conditionally when a cloud domain is set. Can you tell me which parameters you use to set the cloud domain? Is it CloudDomain only?
No. With OSP12, you can set an 'overcloud_domain_name' in undercloud.conf. I had CloudDomain in one of my env files - along w/ doing the nova/neutron changes and restarts - prior to 12 but I took it out since undercloud.conf now covers all that.
w/a with https://bugzilla.redhat.com/show_bug.cgi?id=1507888#c5 works for me
FWIW, I tried `mon_use_fqdn: true` on a fresh deploy without setting the domain and that failed for the same reason. Would have been sweet if you could just set it and not worry about it.

Whether you've configured a domain the old or the new way, the net result will include a dhcp_domain entry in nova.conf and a dns_domain entry in neutron.conf. So you could conditionally enable it if:
  crudini --get /etc/neutron/neutron.conf DEFAULT dns_domain
or
  crudini --get /etc/nova/nova.conf DEFAULT dhcp_domain
is not 'localdomain'
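The check suggested above could look roughly like this in Python. It is a sketch under assumptions: the files are parsed with the standard-library configparser rather than crudini, the helper name has_custom_domain is made up, a missing or empty option is treated as "no custom domain", and the sample file contents are invented for the example.

```python
import configparser


def has_custom_domain(conf_text, option):
    """Return True if `option` in [DEFAULT] is set and is not 'localdomain'.

    Hypothetical helper; a missing or empty option counts as not custom
    (an assumption, the thread does not say how to treat that case).
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    value = parser.defaults().get(option, "").strip()
    return value not in ("", "localdomain")


# Invented sample contents standing in for neutron.conf and nova.conf.
neutron_conf = "[DEFAULT]\ndns_domain = example.com\n"
nova_conf = "[DEFAULT]\ndhcp_domain = localdomain\n"

# Enable mon_use_fqdn if either file carries a non-default domain.
enable_mon_use_fqdn = (
    has_custom_domain(neutron_conf, "dns_domain")
    or has_custom_domain(nova_conf, "dhcp_domain")
)
print(enable_mon_use_fqdn)
```

Real service configuration files may contain constructs that configparser handles differently from crudini, so this only models the decision, not a drop-in replacement.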
(In reply to David Critch from comment #11)
> FWIW, I tried `mon_use_fqdn: true` on a fresh deploy without setting the
> domain and that failed for the same reason. Would have been sweet if you could
> just set it and not worry about it.

Yeah, it doesn't guess the configuration and expects the user to set the boolean when necessary.

> Whether you've configured a domain the old or the new way, the net result
> will include a dhcp_domain entry in nova.conf and a dns_domain entry in
> neutron.conf. So you could conditionally enable it if:
> crudini --get /etc/neutron/neutron.conf DEFAULT dns_domain
> or
> crudini --get /etc/nova/nova.conf DEFAULT dhcp_domain
> is not 'localdomain'

It's actually quickstart passing via CloudDomain the same domain set in undercloud.conf, so I think we should be okay enabling the boolean based on CloudDomain.
*** Bug 1508038 has been marked as a duplicate of this bug. ***
Hasn't https://review.openstack.org/#/c/523375/ made things worse for everyone now?

Now on a fresh deploy that worked until yesterday I get errors because the short hostname is:
overcloud-novacompute-0.novalocal

And this even when I explicitly set my CloudDomain to localdomain in my env files. In our current undercloud configuration in the default way hostnames are assigned if dhcp_domain on the undercloud is set to something different than the CloudDomain, short hostnames become FQDNs all of a sudden.
Also another side-effect here is that a deployment from master now has overcloud nodes with hostnames ending in .novalocal even when .localdomain is specified as CloudDomain
(In reply to Michele Baldessari from comment #20)
> Hasn't https://review.openstack.org/#/c/523375/ made things worse for
> everyone now?
>
> Now on a fresh deploy that worked until yesterday I get errors because
> the short hostname is:
> overcloud-novacompute-0.novalocal
>
> And this even when I explicitly set my CloudDomain to localdomain in my env
> files. In our current undercloud configuration in the default way hostnames
> are assigned if dhcp_domain on the undercloud is set to something different
> than the CloudDomain, short hostnames become FQDNs all of a sudden.

I suppose that is because nova is deliberately using .novalocal when the setting is left to the default. I suppose a better fix would have been to set it to '' instead, as we do for the overcloud [1], what do you think?

1. https://github.com/openstack/tripleo-heat-templates/blob/107b610923ba5d39f90c3a6a63bf2d3642e1b35d/puppet/services/nova-base.yaml#L223
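To make the hostname behaviour under discussion concrete, here is a small Python sketch. It assumes the hostname handed to an instance is the server name with dhcp_domain appended whenever that option is non-empty, which matches what is reported in this thread; instance_hostname is a hypothetical function, not nova code.

```python
def instance_hostname(name, dhcp_domain):
    """Hostname handed to the instance (hypothetical model of the behaviour).

    A non-empty dhcp_domain is appended as a suffix; an empty one
    leaves the short name untouched.
    """
    return f"{name}.{dhcp_domain}" if dhcp_domain else name


name = "overcloud-novacompute-0"

# dhcp_domain left at 'novalocal': the "short" hostname carries a domain.
print(instance_hostname(name, "novalocal"))
# dhcp_domain set to '': the short hostname stays short.
print(instance_hostname(name, ""))
```

Under this model, setting the option to '' keeps the short hostname independent of whatever CloudDomain is configured.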
Failed to deploy the overcloud with the error:
"Error: /Stage[main]/Tripleo::Certmonger::Ca::Crl/File[tripleo-ca-crl]: Could not evaluate: Could not retrieve file metadata for http://ipa-ca/ipa/crl/MasterCRL.bin: getaddrinfo: Name or service not known"
which is documented in the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1554444
Once the fix for that bug is backported, I'll be able to verify this one.
I am the assignee so I am not sure if I can move the BZ into VERIFIED myself, but I tested this with:

ceph-ansible-3.0.27-1.el7cp.noarch

using a custom domain name (example.com) in neutron.conf and the following heat parameters:

CloudDomain: example.com
CloudName: overcloud.example.com
CloudNameInternal: overcloud.internalapi.example.com
CloudNameStorage: overcloud.storage.example.com
CloudNameStorageManagement: overcloud.storagemgmt.example.com
CloudNameCtlplane: overcloud.ctlplane.example.com
verified on openstack-tripleo-heat-templates-7.0.9-8.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:0602