Description of problem:
Deploying the overcloud either fails to deploy certain nodes, or reports STACK_COMPLETE but exits with a non-zero (1) return code due to a "No valid host was found. There are not enough hosts" 500 error. The error shows up in the heat-engine logs. However, ironic shows the nodes are available:

00:45:47 cmd:
00:45:47 source /home/stack/stackrc; instack-ironic-deployment --show-profile;
00:45:47
00:45:47 start:
00:45:47 2015-07-06 20:45:43.220294
00:45:47
00:45:47 end:
00:45:47 2015-07-06 20:45:47.442309
00:45:47
00:45:47 delta:
00:45:47 0:00:04.222015
00:45:47
00:45:47 stdout:
00:45:47 Preparing for deployment...
00:45:47 Querying assigned profiles ...
00:45:47
00:45:47 7d3f8c6f-7eb5-4609-b157-fe19c70f7fb6
00:45:47 "boot_option:local"
00:45:47
00:45:47 ff42ed8b-4663-43d5-a96d-4c4d630cb951
00:45:47 "boot_option:local"
00:45:47
00:45:47 99a67fa9-3dea-4ff6-a955-a4902ce3eae8
00:45:47 "boot_option:local"
00:45:47
00:45:47 a8914beb-b7ea-4f9d-8a06-5e7a741b6cf8
00:45:47 "boot_option:local"
00:45:47
00:45:47 458d482f-f855-4325-a02b-3f7b1deb113d
00:45:47 "boot_option:local"
00:45:47
00:45:47 DONE.
00:45:47
00:45:47 Prepared.

Version-Release number of selected component (if applicable):

[stack@host15 ~]$ rpm -qa | grep openstack
openstack-neutron-openvswitch-2015.1.0-10.el7ost.noarch
openstack-nova-api-2015.1.0-14.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-4.el7ost.noarch
openstack-tuskar-0.4.18-3.el7ost.noarch
openstack-nova-compute-2015.1.0-14.el7ost.noarch
openstack-nova-conductor-2015.1.0-14.el7ost.noarch
openstack-swift-account-2.3.0-1.el7ost.noarch
redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch
openstack-heat-api-2015.1.0-4.el7ost.noarch
openstack-ceilometer-central-2015.1.0-6.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-4.el7ost.noarch
openstack-ceilometer-api-2015.1.0-6.el7ost.noarch
openstack-ironic-api-2015.1.0-8.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-3.el7ost.noarch
openstack-nova-common-2015.1.0-14.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-5.el7ost.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
openstack-ceilometer-notification-2015.1.0-6.el7ost.noarch
openstack-ceilometer-collector-2015.1.0-6.el7ost.noarch
openstack-ironic-common-2015.1.0-8.el7ost.noarch
openstack-tempest-kilo-20150507.2.el7ost.noarch
openstack-swift-2.3.0-1.el7ost.noarch
openstack-neutron-ml2-2015.1.0-10.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-14.el7ost.noarch
openstack-nova-scheduler-2015.1.0-14.el7ost.noarch
openstack-swift-object-2.3.0-1.el7ost.noarch
openstack-nova-cert-2015.1.0-14.el7ost.noarch
openstack-dashboard-theme-2015.1.0-10.el7ost.noarch
openstack-tuskar-ui-extras-0.0.4-1.el7ost.noarch
openstack-nova-console-2015.1.0-14.el7ost.noarch
openstack-neutron-common-2015.1.0-10.el7ost.noarch
openstack-neutron-2015.1.0-10.el7ost.noarch
openstack-heat-engine-2015.1.0-4.el7ost.noarch
openstack-ceilometer-common-2015.1.0-6.el7ost.noarch
openstack-ironic-conductor-2015.1.0-8.el7ost.noarch
openstack-selinux-0.6.35-1.el7ost.noarch
openstack-swift-container-2.3.0-1.el7ost.noarch
openstack-puppet-modules-2015.1.7-5.el7ost.noarch
openstack-dashboard-2015.1.0-10.el7ost.noarch
openstack-swift-proxy-2.3.0-1.el7ost.noarch
python-django-openstack-auth-1.2.0-3.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-23.el7ost.noarch
openstack-glance-2015.1.0-6.el7ost.noarch
python-openstackclient-1.0.3-2.el7ost.noarch
openstack-ironic-discoverd-1.1.0-4.el7ost.noarch
openstack-ceilometer-alarm-2015.1.0-6.el7ost.noarch
openstack-keystone-2015.1.0-4.el7ost.noarch
openstack-tuskar-ui-0.3.0-8.el7ost.noarch
openstack-heat-common-2015.1.0-4.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch

[stack@host15 ~]$ rpm -qa | grep plugin
yum-rhn-plugin-2.0.1-5.el7.noarch
yum-plugin-priorities-1.1.31-29.el7.noarch
redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch

How reproducible:
Not always, but fairly often; more so with bare metal and HA deployments, which involve more nodes.

Steps to Reproduce:
1. Install ops-director from the poodle/puddle bits.
2. Run instack-ironic-deployment --show-profile to check that the overcloud nodes are registered and available.
3. Deploy the overcloud.

Actual results:
Return code 1 and/or the deploy fails, with some nodes left in the BUILD state.

Expected results:
The deploy passes.

Additional info:
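As a cross-check, the same information can be pulled straight from Ironic. This is a minimal sketch only, assuming the Kilo-era python-ironicclient API and a sourced stackrc; the environment-variable handling is illustrative:

    import os
    from ironicclient import client

    # Sketch: assumes `source ~/stackrc` has populated the OS_* variables.
    ironic = client.get_client(
        1,
        os_username=os.environ['OS_USERNAME'],
        os_password=os.environ['OS_PASSWORD'],
        os_tenant_name=os.environ['OS_TENANT_NAME'],
        os_auth_url=os.environ['OS_AUTH_URL'],
    )

    # Print the provision state and capabilities that Nova scheduling depends on.
    for node in ironic.node.list(detail=True):
        print(node.uuid, node.provision_state, node.maintenance,
              node.properties.get('capabilities'))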
Looked into this a bit this afternoon, and it looks like we are not setting the nodes to available when the unified CLI inspection command completes[1], but instead doing it in the middle of the deploy command[2]. In the instack scripts, we did this immediately after inspection[3]. The problem with doing it in the middle of the deploy command is that it takes a minute or so for the Nova scheduler to get updated[4], so this creates a race. Heat retrying the deploy mitigates this somewhat, but we still get spurious CI failures because we end up with a non-zero exit code. Looking at the CLI bulk introspection code, I do not see an obvious place to put the state transition, as we only have commands for starting and polling inspection. In any case, we should move the nodes to available some time before the deploy command; see the sketch below.

[1] https://github.com/rdo-management/python-rdomanager-oscplugin/blob/master/rdomanager_oscplugin/v1/baremetal.py#L123-L163
[2] https://github.com/rdo-management/python-rdomanager-oscplugin/blob/master/rdomanager_oscplugin/v1/overcloud_deploy.py#L359-L362
[3] https://github.com/rdo-management/instack-undercloud/blob/master/scripts/instack-ironic-deployment#L158
[4] https://bugs.launchpad.net/ironic/+bug/1248022
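For reference, the transition in question is Ironic's "provide" verb (manageable -> available). A minimal sketch of performing it as soon as inspection finishes rather than inside the deploy command; the bm_client handle and the loop are illustrative, not the actual oscplugin code:

    def provide_inspected_nodes(bm_client):
        # bm_client is assumed to be a Kilo-era python-ironicclient instance.
        for node in bm_client.node.list():
            if node.provision_state == 'manageable' and not node.maintenance:
                # 'provide' moves the node manageable -> available, so the Nova
                # scheduler can start tracking it well before 'overcloud deploy'.
                bm_client.node.set_provision_state(node.uuid, 'provide')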
We can add this to the end of the command that starts introspection. The tricky bit is that waiting for introspection to finish is currently optional; if we move the state change there, we have to wait for it to complete every time. So this will cause a slight regression by removing a small feature. A rough sketch of what that would look like follows.
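Concretely, something like the following at the end of the introspection command (a sketch only; the ironic_discoverd.client.get_status call and the polling loop are assumptions based on the discoverd 1.1 client, not the actual patch):

    import time
    from ironic_discoverd import client as discoverd_client

    def wait_and_provide(bm_client, node_uuids, auth_token, poll=10):
        # Wait for every node to finish introspection (no longer optional),
        # then move the successful ones to 'available'.
        pending = set(node_uuids)
        while pending:
            for uuid in list(pending):
                status = discoverd_client.get_status(uuid, auth_token=auth_token)
                if status['finished']:
                    pending.discard(uuid)
                    if not status['error']:
                        bm_client.node.set_provision_state(uuid, 'provide')
            if pending:
                time.sleep(poll)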
Midstream patch https://review.gerrithub.io/#/c/238962/
I am unable to reproduce this to fully verify the issue, but based on comment 5, the above review moves the provisioning-state change earlier in the process. Is there a way we can tell when the Nova scheduler has been updated?
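Not aware of a dedicated API for that, but one option would be to poll the aggregate hypervisor statistics until they reflect the newly available nodes. A hedged sketch assuming the Kilo novaclient; the expected_vcpus threshold and timeouts are illustrative:

    import time
    from novaclient import client as nova_client

    def scheduler_caught_up(nova, expected_vcpus, timeout=300, poll=10):
        # Poll the equivalent of 'nova hypervisor-stats' until the resource
        # tracker has picked up the Ironic nodes just moved to 'available'.
        deadline = time.time() + timeout
        while time.time() < deadline:
            stats = nova.hypervisor_stats.statistics()
            if stats.vcpus >= expected_vcpus:
                return True
            time.sleep(poll)
        return False

    # Example handle (credentials from stackrc):
    # nova = nova_client.Client(2, username, password, tenant_name, auth_url)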
Verified: python-rdomanager-oscplugin-0.0.8-41.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1549