Bug 1560690 - Re-deployment of overcloud fails with: ResourceInError: resources.OpenShiftWorker: Went to status ERROR due to "Message: Build of instance c7c5191b-59ed-44a0-8b2a-0f68e48e9a52 aborted: Failure prepping block device., Code: 500" [NEEDINFO]
Summary: Re-deployment of overcloud fails with: ResourceInError: resources.OpenShiftWo...
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 13.0 (Queens)
Assignee: Julia Kreger
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Keywords: Reopened, Triaged
Duplicates: 1571929 1608702
Depends On:
Blocks: 1214284 1687381
 
Reported: 2018-03-26 18:52 UTC by Alexander Chuzhoy
Modified: 2019-07-18 04:47 UTC (History)
CC: 35 users

Clone Of:
Clones: 1687381
Last Closed: 2019-05-24 20:26:00 UTC
jkreger: needinfo-
bfournie: needinfo? (dhill)


Attachments
nova and ironic logs (6.27 MB, application/x-gzip)
2018-03-26 18:57 UTC, Alexander Chuzhoy
neutron logs (2.40 MB, application/x-gzip)
2018-03-26 19:28 UTC, Alexander Chuzhoy
nova and ironic logs (2.30 MB, application/x-gzip)
2018-04-24 16:08 UTC, Alexander Chuzhoy
rhbz1560690-undercloud.log (88.25 KB, text/plain)
2018-05-19 08:48 UTC, Carlos Goncalves
sosreport (nova, ironic, instack, heat, cinder, neutron) (13.04 MB, application/x-xz)
2018-05-19 08:51 UTC, Carlos Goncalves


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:49 UTC
OpenStack gerrit 568268 None stable/queens: MERGED heat: Delete internal ports for ERROR-ed nodes (I965abefbb10c22a635c35e3a44d227045f33885c) 2018-05-16 12:33 UTC
Launchpad 1766301 None None None 2018-04-25 18:21 UTC

Description Alexander Chuzhoy 2018-03-26 18:52:45 UTC
Re-deployment of overcloud fails with: ResourceInError: resources.OpenShiftWorker: Went to status ERROR due to "Message: Build of instance c7c5191b-59ed-44a0-8b2a-0f68e48e9a52 aborted: Failure prepping block device., Code: 500"

Environment:

openstack-nova-conductor-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-ironic-api-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180302142656.397a98a.el7ost.noarch
openstack-nova-common-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
python2-ironicclient-2.2.0-0.20180211230646.683b7c6.el7ost.noarch
instack-undercloud-8.3.1-0.20180304032746.fc5704f.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch
python-ironic-lib-2.12.0-0.20180213172054.831c55b.el7ost.noarch
puppet-ironic-12.3.1-0.20180221115553.12ab03d.el7ost.noarch
openstack-nova-scheduler-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-ironic-common-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-0.20180220235748.de59d74.el7ost.noarch
puppet-nova-12.3.1-0.20180221110444.c11423e.el7ost.noarch
openstack-nova-compute-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
python-nova-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
openstack-nova-api-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-nova-placement-api-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
python-ironic-inspector-client-3.1.0-0.20180213173236.c82b59f.el7ost.noarch
openstack-ironic-conductor-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch


Steps to reproduce:
1. Successfully deployed a stack
2. Removed the stack.
3. Attempted to re-deploy.

Result:
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     six.reraise(self.type_, self.value, self.tb)
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1808, in prepare_networks_before_block_device_mapping
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     self.plug_vifs(instance, network_info)
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1460, in plug_vifs
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     self._plug_vifs(node, instance, network_info)
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1427, in _plug_vifs
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     raise exception.VirtualInterfacePlugException(msg)
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52] VirtualInterfacePlugException: Cannot attach VIF 72b500c7-da06-40be-924e-941f87008365 to the node 00b6bf71-2a7e-4e1c-8b9b-ce24c7a935b0 due to error: Unable to attach VIF 72b500c7-da06-40be-924e-941f87008365, not enough free physical ports. (HTTP 400)
2018-03-26 14:44:20.946 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52] 
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [req-027077ba-6b21-4d6f-a838-95dc541afd07 ea9228f2636848b3bc51439feb8ee159 7a80892307b343d7ac09445827216679 - default default] [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52] Build of instance c7c5191b-59ed-44a0-8b2a-0f68e48e9a52 aborted: Failure prepping block device.: BuildAbortException: Build of instance c7c5191b-59ed-44a0-8b2a-0f68e48e9a52 aborted: Failure prepping block device.
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52] Traceback (most recent call last):
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     filter_properties, request_spec)
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2053, in _build_and_run_instance
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     bdms=block_device_mapping)
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     self.force_reraise()
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     six.reraise(self.type_, self.value, self.tb)
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2005, in _build_and_run_instance
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     block_device_mapping) as resources:
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     return self.gen.next()
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2222, in _build_resources
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52]     reason=msg)
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52] BuildAbortException: Build of instance c7c5191b-59ed-44a0-8b2a-0f68e48e9a52 aborted: Failure prepping block device.
2018-03-26 14:44:24.263 5060 ERROR nova.compute.manager [instance: c7c5191b-59ed-44a0-8b2a-0f68e48e9a52] 



Note: the "Failure prepping block device." error appears together with: "Cannot attach VIF 72b500c7-da06-40be-924e-941f87008365 to the node 00b6bf71-2a7e-4e1c-8b9b-ce24c7a935b0 due to error: Unable to attach VIF 72b500c7-da06-40be-924e-941f87008365, not enough free physical ports."

Comment 2 Alexander Chuzhoy 2018-03-26 18:57 UTC
Created attachment 1413294 [details]
nova and ironic logs

Comment 3 Alexander Chuzhoy 2018-03-26 19:28 UTC
Created attachment 1413311 [details]
neutron logs

Comment 4 Bob Fournier 2018-03-26 20:59:00 UTC
The neutron l3_agent logs are full of these errors:

[...] increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 83d509e6c032450e8cb7aa246385a2dd
l3-agent.log:2018-03-21 08:50:40.232 28398 ERROR neutron.common.rpc [req-bba392f7-b308-4543-b9cb-de85e2d724c2 - - - - -] Timeout in RPC method get_service_plugin_list. Waiting for 6 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID fa19acc6923f478fa039c703ead62ac2
l3-agent.log:2018-03-21 09:00:46.570 28398 ERROR neutron.common.rpc [req-0536a495-7892-46f6-9d37-70ff11afa3f2 - - - - -] Timeout in RPC method get_service_plugin_list. Waiting for 42 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 2301c81877e649148fe731408d73f508
l3-agent.log:2018-03-21 09:11:28.913 28398 ERROR neutron.common.rpc [req-f4755bfb-deea-4685-bd3b-e2fd5140047b - - - - -] Timeout in RPC method get_service_plugin_list. Waiting for 37 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 83e22a0803904dfdaf5c8a893b2551d2

The neutron server logs have hundreds of these errors:

2018-03-26 03:44:15.687 3281 WARNING neutron.pecan_wsgi.controllers.root [req-68e614e3-1b38-43a9-aca0-d845f3f1036b 70e7c1e7b21a4d8bbf8e3b42b053eced 7a80892307b343d7ac09445827216679 - default default] No controller found for: floatingips - returning response code 404: PecanNotFound
2018-03-26 03:44:15.689 3281 INFO neutron.pecan_wsgi.hooks.translation [req-68e614e3-1b38-43a9-aca0-d845f3f1036b 70e7c1e7b21a4d8bbf8e3b42b053eced 7a80892307b343d7ac09445827216679 - default default] GET failed (client error): The resource could not be found.

Comment 5 Bob Fournier 2018-03-26 21:32:26 UTC
According to Ihar, those neutron logs just indicate the l3 plugin is not loaded in the undercloud: "what the warning says is that we received a request for /floatingips but no loaded plugin provides the get_floatingips method.

nova shouldn't issue those requests, so maybe it's an issue, just low priority.
nova seems to not check that the extension is enabled, which is the wrong way of using the api; they should first check it's on, then call it. I understand that l3 is on 99% of the time, but e.g. the undercloud is different. Definitely not the root cause though."


In the Ironic conductor logs we see a couple connection errors and a power off node failure:
2018-03-26 12:49:12.550 1878 ERROR ironic.drivers.modules.agent_client [req-36620f31-5221-44f4-a8a0-713a6043f7ac - - - - -] Error invoking agent command standby.power_off for node a1d2b4b1-17e9-416e-ad38-677f8a13cd30. Error: HTTPConnectionPool(host='10.19.4.86', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcd387b6710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)): ConnectionError: HTTPConnectionPool(host='10.19.4.86', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcd387b6710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2018-03-26 12:49:12.552 1878 WARNING ironic.drivers.modules.agent_base_vendor [req-36620f31-5221-44f4-a8a0-713a6043f7ac - - - - -] Failed to soft power off node a1d2b4b1-17e9-416e-ad38-677f8a13cd30 in at least 30 seconds. Error: Error invoking agent command standby.power_off for node a1d2b4b1-17e9-416e-ad38-677f8a13cd30. Error: HTTPConnectionPool(host='10.19.4.86', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcd387b6710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)): IronicException: Error invoking agent command standby.power_off for node a1d2b4b1-17e9-416e-ad38-677f8a13cd30. Error: HTTPConnectionPool(host='10.19.4.86', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcd387b6710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

Comment 6 Ian Main 2018-03-26 21:40:10 UTC
I was looking through the ironic state and found that the entry that was failing had a duplicate 'tenant_vif_port_id' for some reason.

# ironic port-show 7679f37a-c6a1-476c-9a4a-9ac53a3a5ff4

+-----------------------+------------------------------------------------------------------+
| Property              | Value                                                            |
+-----------------------+------------------------------------------------------------------+
| address               | 54:9f:35:f6:70:55                                                |
| created_at            | 2018-03-20T16:32:57+00:00                                        |
| extra                 | {}                                                               |
| internal_info         | {u'tenant_vif_port_id': u'2154ee13-0ba8-48cf-a45a-71e111ffece4'} |
| local_link_connection | {u'port_id': u'Gi1/0/5', u'switch_id': u'f8:b1:56:37:7e:b4'}     |
| node_uuid             | 6ae0110d-eef7-4e3d-b2db-414a2db39b67                             |
| physical_network      | None                                                             |
| portgroup_uuid        | None                                                             |
| pxe_enabled           | True                                                             |
| updated_at            | 2018-03-26T20:09:34+00:00                                        |
| uuid                  | 7679f37a-c6a1-476c-9a4a-9ac53a3a5ff4                             |
+-----------------------+------------------------------------------------------------------+


# ironic port-show a479cb4c-5239-4738-b0d9-253f21d35066
+-----------------------+------------------------------------------------------------------+
| Property              | Value                                                            |
+-----------------------+------------------------------------------------------------------+
| address               | 54:9f:35:f6:70:3b                                                |
| created_at            | 2018-03-20T16:31:42+00:00                                        |
| extra                 | {}                                                               |
| internal_info         | {u'tenant_vif_port_id': u'b578dd75-6707-49c8-81b3-30267c2310a1'} |
| local_link_connection | {u'port_id': u'Gi1/0/3', u'switch_id': u'f8:b1:56:37:7e:b4'}     |
| node_uuid             | 9ebf1192-2987-40b0-8028-38b923f2332a                             |
| physical_network      | None                                                             |
| portgroup_uuid        | None                                                             |
| pxe_enabled           | True                                                             |
| updated_at            | 2018-03-20T21:30:30+00:00                                        |
| uuid                  | a479cb4c-5239-4738-b0d9-253f21d35066                             |
+-----------------------+------------------------------------------------------------------+

One of these nodes is 'active' and the other 'available'. The error occurred while trying to set up the available node listed here.

Comment 7 Ian Main 2018-03-26 21:43:32 UTC
Sorry I grabbed the wrong entry there :)  The duplicate one is:

# ironic port-show 1db7c067-ef37-4052-a99d-0d7885424d82
+-----------------------+------------------------------------------------------------------+
| Property              | Value                                                            |
+-----------------------+------------------------------------------------------------------+
| address               | 54:9f:35:f6:70:2e                                                |
| created_at            | 2018-03-20T16:32:28+00:00                                        |
| extra                 | {}                                                               |
| internal_info         | {u'tenant_vif_port_id': u'2154ee13-0ba8-48cf-a45a-71e111ffece4'} |
| local_link_connection | {u'port_id': u'Gi1/0/2', u'switch_id': u'f8:b1:56:37:7e:b4'}     |
| node_uuid             | a1d2b4b1-17e9-416e-ad38-677f8a13cd30                             |
| physical_network      | None                                                             |
| portgroup_uuid        | None                                                             |
| pxe_enabled           | True                                                             |
| updated_at            | 2018-03-26T20:09:53+00:00                                        |
| uuid                  | 1db7c067-ef37-4052-a99d-0d7885424d82                             |
+-----------------------+------------------------------------------------------------------+

which has the same vif id as the first entry in the last post.  Sorry for the confusion!
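Ian's manual inspection can be automated. Below is a minimal sketch (the helper name is hypothetical; the input is the kind of port data shown in the `ironic port-show` output above, e.g. as fetched with `openstack baremetal port list --long -f json`) that flags ports sharing a `tenant_vif_port_id`:

```python
from collections import defaultdict

def find_duplicate_vifs(ports):
    """Return {vif_id: [port_uuid, ...]} for VIFs attached to >1 port.

    Each entry in `ports` is a dict with a 'uuid' and an 'internal_info'
    dict that may contain 'tenant_vif_port_id', mirroring the
    `ironic port-show` fields quoted in the comments above.
    """
    by_vif = defaultdict(list)
    for port in ports:
        vif = port.get("internal_info", {}).get("tenant_vif_port_id")
        if vif:
            by_vif[vif].append(port["uuid"])
    # Only VIFs attached to more than one physical port are a problem.
    return {vif: uuids for vif, uuids in by_vif.items() if len(uuids) > 1}
```

Run over the three ports shown in comments 6 and 7, this would report the VIF 2154ee13-... as attached to both 7679f37a-... and 1db7c067-....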

Comment 8 Harald Jensås 2018-03-26 22:18:14 UTC
In the logs there are errors both for attach and detach:

2018-03-26 14:44:20.930 9165 DEBUG wsme.api [req-f95d1c83-636e-4ee9-a307-cca2cb228405 b129778b3822410ea1199ee7beee87a4 7a80892307b343d7ac09445827216679 - default default] Client-side error: Unable to attach VIF 72b500c7-da06-40be-924e-941f87008365, not enough free physical ports.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 226, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 2924, in vif_attach
    task.driver.network.vif_attach(task, vif_info)

  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 551, in vif_attach
    port_like_obj = get_free_port_like_object(task, vif_id, physnets)

  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 175, in get_free_port_like_object
    raise exception.NoFreePhysicalPorts(vif=vif_id)

NoFreePhysicalPorts: Unable to attach VIF 72b500c7-da06-40be-924e-941f87008365, not enough free physical ports.
 format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222
2018-03-26 14:44:23.818 9165 DEBUG wsme.api [req-fafebeda-a61b-4fde-af8e-93722c3afdb3 b129778b3822410ea1199ee7beee87a4 7a80892307b343d7ac09445827216679 - default default] Client-side error: Unable to detach VIF 72b500c7-da06-40be-924e-941f87008365 from node 00b6bf71-2a7e-4e1c-8b9b-ce24c7a935b0 because it is not attached to it.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 226, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 2950, in vif_detach
    task.driver.network.vif_detach(task, vif_id)

  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 582, in vif_detach
    port_like_obj = self._get_port_like_obj_by_vif_id(task, vif_id)

  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 333, in _get_port_like_obj_by_vif_id
    raise exception.VifNotAttached(vif=vif_id, node=task.node.uuid)

VifNotAttached: Unable to detach VIF 72b500c7-da06-40be-924e-941f87008365 from node 00b6bf71-2a7e-4e1c-8b9b-ce24c7a935b0 because it is not attached to it.


For the NoFreePhysicalPorts error, it would be interesting to know which one.
https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/network/common.py#L114-L123

Edit: I was going to suggest some additional debug logging, but based on Ian's findings above it is likely L117: if _vif_attached(p, vif_id):
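For reference, the selection logic Harald points at boils down to something like the following simplified Python sketch (a toy model of `get_free_port_like_object`, not the actual ironic code): a stale `tenant_vif_port_id` makes a port look busy, so a node whose ports all carry leftover VIFs raises the error seen in the traceback.

```python
class NoFreePhysicalPorts(Exception):
    pass

class VifAlreadyAttached(Exception):
    pass

def get_free_port(ports, vif_id):
    """Pick a physical port for vif_id; simplified sketch of ironic's
    get_free_port_like_object()."""
    free = []
    for port in ports:
        attached = port["internal_info"].get("tenant_vif_port_id")
        if attached == vif_id:
            # Roughly the L117 check: this VIF is already attached.
            raise VifAlreadyAttached(vif_id)
        if attached is None:
            free.append(port)
    if not free:
        # Stale tenant_vif_port_id entries left over from a deleted
        # stack make every port look busy -> the HTTP 400 above.
        raise NoFreePhysicalPorts(vif_id)
    return free[0]
```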

Comment 9 Bob Fournier 2018-03-27 21:04:30 UTC
Proposed patch upstream https://review.openstack.org/#/c/556748/ will need to be backported.

Comment 10 Rachana Patel 2018-03-28 06:40:10 UTC
I also observed a similar error with redeployment last week, but today I am facing a similar error with a new deployment.

- deleted previous stack
- reimaged undercloud server
- deployed again

"Went to status ERROR due to "Message: Build of instance 7792026b-8c4d-4b44-81c3-102ccaec2a92 aborted: Failure prepping block device., Code: 500""

I am using "rhos-release 13   -p 2018-03-19.2".

Let me know if you need logs or access to servers or any other details.

Please update me if my error is not related to this bug and I should open a new BZ for it.

Comment 11 Alexander Chuzhoy 2018-03-28 13:16:03 UTC
The following worked for me.
As a W/A you can either:
remove all non-active entries from ironic and re-insert/introspect,
or
apply the patch from comment #9 and restart openstack-ironic-conductor.

Comment 12 Rachana Patel 2018-03-29 06:53:29 UTC
(In reply to Alexander Chuzhoy from comment #11)
> The following worked for me...
> As a W/A you can either:
> Removing all non-active entries from ironic and re-insert/introspect
> or
> Apply patch from comment #9 and restart openstack-ironic-conductor.

Tried both approaches:

1) remove all entries from ironic and re-insert/introspect
2) apply patch and restart service
 curl -4 -k https://review.openstack.org/changes/556748/revisions/9715f80915c124b47c4463d7af4d505c5bedab9b/patch?download|base64 -d|sudo patch -t -d /usr/lib/python2.7/site-packages/ -p1 

sudo systemctl stop openstack-ironic-conductor.service
sudo systemctl start openstack-ironic-conductor.service


Error:-
======
ResourceInError: resources.ComputeHCI: Went to status ERROR due to "Message: Build of instance d3a4342c-42ae-46af-ac70-6b670c2bc309 aborted: Failure prepping block device., Code: 500"


My testing is blocked due to this error.
(Let me know if you need logs or access to servers or any other details.

Please update me if my error is not related to this bug and I should open a new BZ for it.)

Comment 13 Alexander Chuzhoy 2018-03-29 21:16:04 UTC
Version: 
openstack-nova-compute-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
python-nova-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
puppet-neutron-12.3.1-0.20180222064632.07b93f1.el7ost.noarch
python-neutron-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
openstack-nova-api-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-nova-scheduler-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch
openstack-neutron-ml2-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
openstack-ironic-conductor-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch
instack-undercloud-8.3.1-0.20180304032746.fc5704f.el7ost.noarch
puppet-ironic-12.3.1-0.20180221115553.12ab03d.el7ost.noarch
openstack-neutron-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
python2-neutronclient-6.7.0-0.20180211221651.95d64ce.el7ost.noarch
puppet-nova-12.3.1-0.20180221110444.c11423e.el7ost.noarch
python2-ironicclient-2.2.0-0.20180211230646.683b7c6.el7ost.noarch
openstack-ironic-common-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch
python2-neutron-lib-1.13.0-0.20180211233639.dcf96cd.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-0.20180220235748.de59d74.el7ost.noarch
openstack-nova-conductor-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-ironic-api-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180302142656.397a98a.el7ost.noarch
openstack-nova-common-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-neutron-common-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
python-ironic-inspector-client-3.1.0-0.20180213173236.c82b59f.el7ost.noarch
openstack-neutron-openvswitch-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
python-ironic-lib-2.12.0-0.20180213172054.831c55b.el7ost.noarch
openstack-nova-placement-api-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch



So after re-installing the setup and applying the patch from comment #9, we saw the error appearing in the logs:

2018-03-29 20:57:48Z [openshift.OpenShiftMaster.0.OpenShiftMaster]: CREATE_IN_PROGRESS  state changed
2018-03-29 20:58:15Z [openshift.OpenShiftWorker.0.OpenShiftWorker]: CREATE_FAILED  ResourceInError: resources.OpenShiftWorker: Went to status ERROR due to "Message: Build of instance 7fccdbd
5-fdea-42d1-bb30-696f4c62f47a aborted: Failure prepping block device., Code: 500"
2018-03-29 20:58:15Z [openshift.OpenShiftWorker.0.OpenShiftWorker]: DELETE_IN_PROGRESS  state changed
2018-03-29 20:58:18Z [openshift.OpenShiftWorker.0.OpenShiftWorker]: DELETE_COMPLETE  state changed
2018-03-29 20:58:21Z [openshift.OpenShiftWorker.0.OpenShiftWorker]: CREATE_IN_PROGRESS  state changed




But the deployment continued and completed successfully.
I have successfully re-deployed 4 times by now.

Comment 15 Julia Kreger 2018-04-11 16:53:50 UTC
This seems like something occurring in nova's block device handling (which also requires network handling).

The node could also just be getting rescheduled, possibly.

1) Have the ports in ironic been checked for any legacy vifs that are no longer in use?  In particular on any presently undeployed nodes?
2) Are there current logs with this error available that can be reviewed? Last comment was from the 29th, the attachments appear to be the 26th.

-Julia

Comment 23 Julia Kreger 2018-04-20 06:04:38 UTC
After looking at the environment where this was encountered, and four successful stack deploys later, I'm fairly certain I understand what is going on.

In essence (and this is different from some of the other cases of a VIF being orphaned in Ironic's database): in Queens, any sort of scheduling/locking issue with obtaining a baremetal node, or trying to use an already scheduled node, will surface as this error.

With the configuration being used for the flavor:

+----------------------------+---------------------------------------------------------------------------------------+
| Field                      | Value                                                                                 |
+----------------------------+---------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                                                 |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                                     |
| access_project_ids         | None                                                                                  |
| disk                       | 40                                                                                    |
| id                         | 51bd7b31-94cb-43a6-8d53-ea9810f0bc14                                                  |
| name                       | baremetal                                                                             |
| os-flavor-access:is_public | True                                                                                  |
| properties                 | capabilities:boot_option='local', capabilities:profile='baremetal', cpu_arch='x86_64' |
| ram                        | 4096                                                                                  |
| rxtx_factor                | 1.0                                                                                   |
| swap                       |                                                                                       |
| vcpus                      | 1                                                                                     |
+----------------------------+---------------------------------------------------------------------------------------+

There was no use of the resource class. So nova was saying "Oh, it looks like you can fit two of the requested instances on this single baremetal node", which architecturally is impossible.

This was presenting as something like:

2018-04-19 19:49:06.832 18246 ERROR nova.virt.ironic.driver [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports. (HTTP 400)
2018-04-19 19:49:06.833 18246 ERROR nova.virt.ironic.driver [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] [instance: 964f0d93-e9c4-4067-af8e-cd6e25fb6b59] Error preparing deploy for instance 964f0d93-e9c4-4067-af8e-cd6e25fb6b59 on baremetal node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8.: VirtualInterfacePlugException: Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports. (HTTP 400)
2018-04-19 19:49:06.833 18246 ERROR nova.compute.manager [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] [instance: 964f0d93-e9c4-4067-af8e-cd6e25fb6b59] Failure prepping block device: VirtualInterfacePlugException: Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports. (HTTP 400)

Looking at the http transaction log:

192.168.24.1 - - [19/Apr/2018:19:49:00 -0400] "POST /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs HTTP/1.1" 204 - "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:04 -0400] "POST /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs HTTP/1.1" 400 177 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:05 -0400] "PATCH /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 HTTP/1.1" 200 1239 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:06 -0400] "DELETE /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs/7d0a6b40-50b3-489b-ae1e-0840e0608253 HTTP/1.1" 400 225 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:20 -0400] "POST /v1/nodes/67eeb698-cfd7-4f30-83c7-c6748f78da60/vifs HTTP/1.1" 204 - "-" "python-ironicclient"

When attempting to correlate the VIF-plugging API actions to the nova-compute service log, I discovered that the payloads of the POST operations were for two different VIFs.

The above PATCH operation is where the node is essentially "reserved"; however, due to the needs of BFV (boot from volume), we attempt to attach the VIF first. This is where things begin to go sideways. If we try to put two different VIFs on the same node, with no awareness that the node is already in use, then we will absolutely fail. With many nodes being deployed at the same time, it is conceivable that this cascades to the point where nova eventually gives up because it has retried three times.
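The race described above can be sketched with a toy model (hypothetical names; this is not nova or ironic code): a node exposes a fixed number of free physical ports, and two deploys that were both scheduled onto the same node race to attach their VIFs. The loser gets the "not enough free physical ports" HTTP 400 seen in the logs.

```python
class BaremetalNode:
    """Toy stand-in for an ironic node with a limited number of NIC ports."""

    def __init__(self, uuid, free_ports=1):
        self.uuid = uuid
        self.free_ports = free_ports
        self.vifs = []

    def attach_vif(self, vif_id):
        # Mirrors ironic's check: a VIF attach fails once all ports are taken.
        if len(self.vifs) >= self.free_ports:
            raise RuntimeError(
                "Unable to attach VIF %s, not enough free physical ports. "
                "(HTTP 400)" % vif_id)
        self.vifs.append(vif_id)


node = BaremetalNode("64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8")
node.attach_vif("vif-for-instance-1")      # first deploy wins the node
try:
    node.attach_vif("vif-for-instance-2")  # second deploy, same node: fails
except RuntimeError as exc:
    print(exc)
```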

There is likely something in the "placement" algorithm that creates hotspots, causing longer-lived undercloud deployments to prefer scheduling onto the same nodes, which may be why this is seemingly happening more frequently.

What should we be doing?

We should be setting the resource class fields on the flavor:

openstack flavor set baremetal --property "resources:CUSTOM_BAREMETAL"="1"
openstack flavor set baremetal --property "resources:DISK_GB"="0"
openstack flavor set baremetal --property "resources:MEMORY_MB"="0"
openstack flavor set baremetal --property "resources:VCPU"="0"

This results in the flavor looking like:

+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                      | Value                                                                                                                                                                                     |
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                                                                                                                                                     |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                                                                                                                                         |
| access_project_ids         | None                                                                                                                                                                                      |
| disk                       | 40                                                                                                                                                                                        |
| id                         | 36e063d9-bcb8-4f02-bb79-8edc8e8dc9ae                                                                                                                                                      |
| name                       | baremetal                                                                                                                                                                                 |
| os-flavor-access:is_public | True                                                                                                                                                                                      |
| properties                 | capabilities:boot_option='local', capabilities:profile='baremetal', cpu_arch='x86_64', resources:CUSTOM_BAREMETAL='1', resources:DISK_GB='0', resources:MEMORY_MB='0', resources:VCPU='0' |
| ram                        | 4096                                                                                                                                                                                      |
| rxtx_factor                | 1.0                                                                                                                                                                                       |
| swap                       |                                                                                                                                                                                           |
| vcpus                      | 2                                                                                                                                                                                         |
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Doing this results in better-looking behavior when requesting nodes:

192.168.24.1 - - [20/Apr/2018:01:02:07 -0400] "POST /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs HTTP/1.1" 204 - "-" "python-ironicclient"
192.168.24.1 - - [20/Apr/2018:01:02:08 -0400] "POST /v1/nodes/67eeb698-cfd7-4f30-83c7-c6748f78da60/vifs HTTP/1.1" 204 - "-" "python-ironicclient"
192.168.24.1 - - [20/Apr/2018:01:02:12 -0400] "PATCH /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 HTTP/1.1" 200 1245 "-" "python-ironicclient"
192.168.24.1 - - [20/Apr/2018:01:02:13 -0400] "PATCH /v1/nodes/67eeb698-cfd7-4f30-83c7-c6748f78da60 HTTP/1.1" 200 1246 "-" "python-ironicclient"

Note the two different node UUIDs being used up front. Before switching to resource classes for scheduling, I was seeing a 50% rate at which placement selected a single node for both instances. With only two instances being requested and two nodes currently available in the resource class, it is relatively easy for an instance that needs to be rescheduled to end up successfully rescheduled. However, with more nodes being requested at one time, the odds of three failures occurring for any given requested instance go up.
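As a rough back-of-the-envelope check (assuming, purely for illustration, that the scheduler picks uniformly at random among the free nodes, which the real placement behavior does not), the chance that a given request collides with another grows with the number of concurrent requests:

```python
def collision_chance(n_nodes, n_requests):
    """Chance a given request picks the same node as at least one other,
    if each of n_requests picks uniformly among n_nodes free nodes."""
    return 1 - (1 - 1.0 / n_nodes) ** (n_requests - 1)

# Two instances, two nodes: a 50% collision rate, matching the rate observed.
print(collision_chance(2, 2))    # 0.5

# A fully subscribed larger deployment collides even more often per request,
# so the odds of burning all three nova retries go up with scale.
print(collision_chance(10, 10))
```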

Can this be fixed without changing the flavors?

tl;dr: somewhere between "maybe" and "most likely not". This is a complex race that really only ironic can solve in its virt driver inside of nova, by adding additional checks and attempting to lock a node earlier. That said, I'm unsure at present whether we are able to, or will be able to, continue raising a particular exception to force rescheduling of the requested instance without a negative impact. If we do try to fix it, then we are ultimately working around the side effect of deprecated behavior, which is not ideal.

Where do we go from here?

It would be good to understand whether the environments where we have been encountering extremely similar issues are configured this way or the prior way. If this can be tested on a few of the more complex environments, that would help as well.

Comment 24 Dmitry Tantsur 2018-04-23 14:50:56 UTC
> Can this be fixed without changing the flavors?

Maybe it can, but it should not. We have to use the correct flavors, or we'll get broken one day anyway.

Comment 25 Julia Kreger 2018-04-23 17:05:30 UTC
I'm going to take a look at double-checking/fixing the locking behavior so we can hopefully avoid this gracefully. Opened Launchpad bug https://bugs.launchpad.net/nova/+bug/1766301

Comment 26 Alexander Chuzhoy 2018-04-24 16:08:18 UTC
Saw this error on a setup during the first attempt to deploy.

I did, however, delete the baremetal nodes and re-import them using the ipmi driver.

2018-04-24 11:24:35.603 25823 ERROR nova.compute.manager [instance: 3327c681-7b3b-446c-a771-0de451a4d72e]     reason=msg)
2018-04-24 11:24:35.603 25823 ERROR nova.compute.manager [instance: 3327c681-7b3b-446c-a771-0de451a4d72e] BuildAbortException: Build of instance 3327c681-7b3b-446c-a771-0de451a4d72e aborted: Failure prepping block device.
2018-04-24 11:24:35.603 25823 ERROR nova.compute.manager [instance: 3327c681-7b3b-446c-a771-0de451a4d72e]


Adding ironic and nova logs.

Comment 27 Alexander Chuzhoy 2018-04-24 16:08 UTC
Created attachment 1426153 [details]
nova and ironic logs

Comment 28 Leonid Natapov 2018-05-08 10:25:08 UTC
Happens to me too. Also getting messages: "No valid host was found. There are not enough hosts available., Code: 500"

3 controller, 1 compute topology. Reducing the number of overcloud nodes to 1 compute and 1 controller solves the problem.


-----


2018-05-08 10:11:53Z [overcloud.Controller.2.NodeAdminUserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:54Z [overcloud.Controller.1.NodeAdminUserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:54Z [overcloud.Controller.2.DeploymentActions]: CREATE_COMPLETE  state changed
2018-05-08 10:11:54Z [overcloud.Controller.1.RoleUserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:54Z [overcloud.Compute.0.NodeUserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:54Z [overcloud.Controller.2.UpdateConfig]: CREATE_COMPLETE  state changed
2018-05-08 10:11:54Z [overcloud.Compute.0.UserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:54Z [overcloud.Controller.0.DeploymentActions]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:54Z [overcloud.Controller.0.DeploymentActions]: CREATE_COMPLETE  state changed
2018-05-08 10:11:54Z [overcloud.Compute.0.UserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:55Z [overcloud.Controller.0.UpdateConfig]: CREATE_COMPLETE  state changed
2018-05-08 10:11:55Z [overcloud.Controller.2.NodeUserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:55Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:55Z [overcloud.Controller.1.ControllerUpgradeInitConfig]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:55Z [overcloud.Controller.1.ControllerUpgradeInitConfig]: CREATE_COMPLETE  state changed
2018-05-08 10:11:55Z [overcloud.Compute.0.UpdateConfig]: CREATE_COMPLETE  state changed
2018-05-08 10:11:55Z [overcloud.Controller.0.NodeAdminUserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:55Z [overcloud.Controller.2.RoleUserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:56Z [overcloud.Controller.1.RoleUserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:56Z [overcloud.Controller.1.NodeUserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:56Z [overcloud.Controller.2.NodeUserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:57Z [overcloud.Controller.0.NodeAdminUserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:57Z [overcloud.Controller.2.RoleUserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:57Z [overcloud.Controller.1.UpdateConfig]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:57Z [overcloud.Controller.0.UserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:57Z [overcloud.Controller.2.UserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:57Z [overcloud.Controller.0.UserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:57Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:58Z [overcloud.Controller.1.NodeUserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:58Z [overcloud.Controller.2.UserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:58Z [overcloud.Controller.1.UserData]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:58Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:58Z [overcloud.Controller.1.UserData]: CREATE_COMPLETE  state changed
2018-05-08 10:11:58Z [overcloud.Controller.1.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:11:59Z [overcloud.Controller.1.UpdateConfig]: CREATE_COMPLETE  state changed
2018-05-08 10:14:55Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e7d10312-a672-4cab-ace2-984116e8832c., Code: 500"
2018-05-08 10:14:55Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:14:58Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-08 10:15:01Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:15:04Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-08 10:15:04Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:15:06Z [overcloud.Controller.2.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2018-05-08 10:15:06Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:15:06Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-08 10:15:08Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE  state changed
2018-05-08 10:15:11Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:15:11Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:15:14Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-08 10:15:14Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:15:16Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-08 10:15:20Z [overcloud.Controller.2.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 22ba25e5-1336-4594-9a71-c93ccf43d1d1 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:15:20Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:15:22Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE  state changed
2018-05-08 10:15:25Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:15:28Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-08 10:15:28Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:15:28Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:15:30Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-08 10:15:31Z [overcloud.Controller.2.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-08 10:15:31Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:15:33Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE  state changed
2018-05-08 10:15:34Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:15:34Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:15:36Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-08 10:15:39Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:15:43Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:15:47Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:15:49Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 4db60876-5533-488f-8dd1-57b240faa6f3 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:15:49Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:15:52Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-08 10:15:57Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:16:08Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 885006cd-385d-4d17-8b0f-8cfd4a5506a0 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:16:08Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:16:10Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-08 10:16:19Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:16:29Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 4686fb22-6060-424a-b042-627cda0db173 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:16:29Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:16:31Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-08 10:16:49Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:17:03Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance cc9a7364-c101-4151-92d5-20b29cdcb666 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:17:03Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-08 10:17:06Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-08 10:17:41Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-08 10:17:52Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 828c2323-b6d3-4ee6-bf1e-82a5f81337fc aborted: Failure prepping block device., Code: 500"
2018-05-08 10:17:52Z [overcloud.Controller.0]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:17:52Z [overcloud.Controller.0]: CREATE_FAILED  ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:17:52Z [overcloud.Controller]: UPDATE_FAILED  Resource CREATE failed: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:17:53Z [overcloud.Controller]: CREATE_FAILED  resources.Controller: Resource CREATE failed: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:17:53Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.Controller: Resource CREATE failed: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device.,

 Stack overcloud CREATE_FAILED

overcloud.Controller.0.Controller:
  resource_type: OS::TripleO::ControllerServer
  physical_resource_id: 828c2323-b6d3-4ee6-bf1e-82a5f81337fc
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 828c2323-b6d3-4ee6-bf1e-82a5f81337fc aborted: Failure prepping block device., Code: 500"
Heat Stack create failed.
Heat Stack create failed.

Comment 29 Sai Sindhur Malleni 2018-05-08 15:06:23 UTC
Still seeing this on my puddle from 2018-05-02.5

Comment 30 Bob Fournier 2018-05-08 15:17:55 UTC
Per https://bugzilla.redhat.com/show_bug.cgi?id=1560690#c23

Please ensure you are setting the resource class fields on the flavor:

openstack flavor set baremetal --property "resources:CUSTOM_BAREMETAL"="1"
openstack flavor set baremetal --property "resources:DISK_GB"="0"
openstack flavor set baremetal --property "resources:MEMORY_MB"="0"
openstack flavor set baremetal --property "resources:VCPU"="0"

Comment 31 Sai Sindhur Malleni 2018-05-08 17:23:56 UTC
Hey Bob,

Why not have these properties set by default on the baremetal flavor that director sets up?

Comment 32 Bob Fournier 2018-05-08 17:46:55 UTC
In Sai's setup I confirmed that he changed the baremetal flavor to use the resource classes correctly:
(undercloud) [stack@c08-h26-r630 ~]$ openstack flavor show baremetal
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| Field                      | Value                                                                                                                                |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                                                                                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                                                                                    |
| access_project_ids         | None                                                                                                                                 |
| disk                       | 40                                                                                                                                   |
| id                         | 077a71a5-d336-4508-be14-28f224650c32                                                                                                 |
| name                       | baremetal                                                                                                                            |
| os-flavor-access:is_public | True                                                                                                                                 |
| properties                 | capabilities:boot_option='local', resources:CUSTOM_BAREMETAL='1', resources:DISK_GB='0', resources:MEMORY_MB='0', resources:VCPU='0' |
| ram                        | 4096                                                                                                                                 |
| rxtx_factor                | 1.0                                                                                                                                  |
| swap                       |                                                                                                                                      |
| vcpus                      | 1                                                                                                                                    |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
(undercloud) [stack@c08-h26-r630 ~]$ 

However, he still occasionally hits the scheduling issue; we see this in the nova logs:

nova/nova-compute.log:2018-05-08 16:57:07.491 22079 ERROR nova.compute.manager [req-e81cc5d3-81d9-436e-9166-ab12a649c20e 05f424ebba56487797be7f76cc3fc5b7 cb950f51a4dd41799b85f96c5cf712d7 - default default] [instance: cb7f12e1-7ab4-42a3-b13e-af18af69e093] Failure prepping block device: VirtualInterfacePlugException: Cannot attach VIF 693dec9a-1691-4dbe-83dd-afaecdb6e166 to the node c3c21b1f-6410-4397-bb3b-0ce385ffe232 due to error: Unable to attach VIF 693dec9a-1691-4dbe-83dd-afaecdb6e166, not enough free physical ports. (HTTP 400)

We are also seeing many of these errors in nova, related to the resource:
/nova-placement-api.log:2018-05-08 16:54:56.456 119895 DEBUG nova.api.openstack.placement.wsgi_wrapper [req-080364c1-e4da-4f6a-bd80-410b0e817ad9 f7e2a20676bf4948ab011404ba3d4353 cb950f51a4dd41799b85f96c5cf712d7 - default default] Placement API returning an error response: Unable to allocate inventory: Unable to create allocation for 'CUSTOM_BAREMETAL' on resource provider '40b677b5-fe1a-4006-ae87-d62082bbe43d'. The requested amount would exceed the capacity. call_func /usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi_wrapper.py:31
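The placement error above follows directly from the inventory model: with resources:CUSTOM_BAREMETAL='1', each ironic node is a resource provider with a CUSTOM_BAREMETAL capacity of exactly 1, so a second allocation against the same provider must be rejected. A minimal sketch of that check (hypothetical names; not the placement API code):

```python
class ResourceProvider:
    """Toy stand-in for a placement resource provider with fixed inventory."""

    def __init__(self, uuid, capacity=1):
        self.uuid = uuid
        self.capacity = capacity
        self.allocated = 0

    def allocate(self, amount=1):
        # Mirrors placement's capacity check for the CUSTOM_BAREMETAL class.
        if self.allocated + amount > self.capacity:
            raise ValueError(
                "Unable to create allocation for 'CUSTOM_BAREMETAL' on "
                "resource provider '%s'. The requested amount would exceed "
                "the capacity." % self.uuid)
        self.allocated += amount


rp = ResourceProvider("40b677b5-fe1a-4006-ae87-d62082bbe43d")
rp.allocate()        # first instance claims the node
try:
    rp.allocate()    # second instance against the same provider: rejected
except ValueError as exc:
    print(exc)
```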

Comment 33 Bob Fournier 2018-05-08 17:48:53 UTC
> Why not have these properties set by default on the baremetal flavor that director 
> sets up?

We have an RFE to validate they are set:
https://bugzilla.redhat.com/show_bug.cgi?id=1564207

Comment 34 Waldemar Znoinski 2018-05-08 17:49:44 UTC
I'm seeing this VIF exception in many deployments recently. It looks like the underlying problem is with ironic power commands (which cause a retry on a different baremetal node and hit the VIF issue along the way); more details in https://bugzilla.redhat.com/show_bug.cgi?id=1576073

Comment 43 Carlos Goncalves 2018-05-18 15:07:59 UTC
I'm facing the reported bug. I ran a clean undercloud+overcloud deployment (latest puddle, 2018-05-18.1) with infrared.


[stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed 
13   -p 2018-05-18.1

[stack@undercloud-0 ~]$ rpm -q openstack-heat-common
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch

overcloud_install.log:

[...]
2018-05-18 14:43:11Z [overcloud.Compute.1.UserData]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:43:11Z [overcloud.Compute.0.UserData]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:43:11Z [overcloud.Compute.1.UserData]: CREATE_COMPLETE  state changed
2018-05-18 14:43:11Z [overcloud.Compute.0.UserData]: CREATE_COMPLETE  state changed
2018-05-18 14:43:11Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:43:11Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:43:12Z [overcloud.Compute.0.UpdateConfig]: CREATE_COMPLETE  state changed
2018-05-18 14:44:16Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a0c9d982-beb1-4ee2-b732-00ffe0ed904d., Code: 500"
2018-05-18 14:44:16Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:44:18Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-18 14:44:22Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:44:23Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:44:23Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:44:24Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-18 14:44:24Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-18 14:44:24Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:44:26Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-18 14:44:28Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:44:31Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:44:33Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-18 14:44:33Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:44:35Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-18 14:44:35Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 10f0d584-975d-4285-9435-8e5ebc4aaaf4 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:44:35Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:44:37Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-18 14:44:43Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:44:45Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:44:47Z [overcloud.Controller.0.Controller]: CREATE_FAILED  ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-18 14:44:47Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:44:48Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 56fe683d-5049-445b-934b-4d6038787304 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:44:48Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:44:49Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE  state changed
2018-05-18 14:44:50Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-18 14:44:59Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:45:04Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 158cf5f9-5d90-484d-8c3e-3104e0e9cfb9 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:45:04Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:45:06Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:45:06Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-18 14:45:24Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:45:29Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 8f3273de-c2d8-43f3-b3db-d87a28a080c4 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:45:29Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS  state changed
2018-05-18 14:45:31Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE  state changed
2018-05-18 14:46:04Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS  state changed
2018-05-18 14:46:14Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED  ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 977fdece-cad0-47eb-a47f-5018c18e3fe8 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:14Z [overcloud.Compute.1]: CREATE_FAILED  Resource CREATE failed: ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:14Z [overcloud.Compute.1]: CREATE_FAILED  ResourceInError: resources[1].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:14Z [overcloud.Compute]: UPDATE_FAILED  Resource CREATE failed: ResourceInError: resources[1].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:15Z [overcloud.Compute]: CREATE_FAILED  resources.Compute: Resource CREATE failed: ResourceInError: resources[1].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:15Z [overcloud]: CREATE_FAILED  Resource CREATE failed: resources.Compute: Resource CREATE failed: ResourceInError: resources[1].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"

 Stack overcloud CREATE_FAILED 

overcloud.Compute.1.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: 977fdece-cad0-47eb-a47f-5018c18e3fe8
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 977fdece-cad0-47eb-a47f-5018c18e3fe8 aborted: Failure prepping block device., Code: 500"
Heat Stack create failed.

Comment 45 Julia Kreger 2018-05-18 15:31:51 UTC
We're going to need to examine a system where this has occurred with the fix applied, as the error being exposed is not the original error; we need to validate the configuration/settings and determine how it reached that end state.

Comment 47 Carlos Goncalves 2018-05-18 15:40:20 UTC
Julia, this has just happened on my baremetal@home. Please let me know what info you need. Alternatively, I can also give you access to the machine. In case the error is not the original error reported in comment #0, I can open a new rhbz.

Comment 48 Julia Kreger 2018-05-18 16:00:33 UTC
openstack flavor list
openstack flavor show
nova hypervisor-list
openstack server list

For each instance in nova:
openstack server show <instance_uuid>

for each baremetal node
openstack baremetal node show <node_uuid>
openstack baremetal node vif list <node_uuid>

And these three folders on the undercloud should do it:
/var/log/nova
/var/log/httpd
/var/log/ironic
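
The requested commands can be wrapped in a small collection helper (a sketch; the output directory, file names, and archive step are illustrative, and it assumes the undercloud stackrc is sourced):

```shell
#!/bin/bash
# Sketch: gather the state requested above into one directory for attachment.
collect_diagnostics() {
  local outdir=${1:-/tmp/rhbz1560690-diag}
  mkdir -p "$outdir"
  openstack flavor list > "$outdir/flavor-list.txt"
  nova hypervisor-list > "$outdir/hypervisor-list.txt"
  openstack server list > "$outdir/server-list.txt"
  # Per-instance detail for each instance in nova
  openstack server list -f value -c ID | while read -r uuid; do
    openstack server show "$uuid" > "$outdir/server-$uuid.txt"
  done
  # Per-node detail and VIF attachments for each baremetal node
  openstack baremetal node list -f value -c UUID | while read -r node; do
    openstack baremetal node show "$node"     > "$outdir/node-$node.txt"
    openstack baremetal node vif list "$node" > "$outdir/node-$node-vifs.txt"
  done
  # The three log directories requested above
  tar czf "$outdir/logs.tar.gz" /var/log/nova /var/log/httpd /var/log/ironic || true
}
```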

Comment 49 Julia Kreger 2018-05-18 16:09:58 UTC
Quick correction: I need the openstack flavor show output for each flavor in use.

Comment 50 Carlos Goncalves 2018-05-19 08:48 UTC
Created attachment 1438865 [details]
rhbz1560690-undercloud.log

Comment 51 Carlos Goncalves 2018-05-19 08:51 UTC
Created attachment 1438866 [details]
sosreport (nova, ironic, instack, heat, cinder, neutron)

In addition to the command outputs requested in comment #49, I'm also attaching a sosreport from the undercloud.

Comment 52 Bob Fournier 2018-05-21 00:55:23 UTC
Carlos - it looks like you are getting IPMI power-on failures.  

In ironic-conductor.log:
2018-05-18 10:44:08.580 24804 WARNING ironic.drivers.modules.ipmitool [req-545ca4e9-e609-4507-a169-bffb5a794b3a 868f66f3bd734efaa6332bcb6b2a6fab 23b75b4d1a164271969560c81d6883ad - default default] IPMI power action power on failed for node c3d408ac-d913-440c-b5a3-b70ae2b847c2 with error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 172.16.0.1 -L ADMINISTRATOR -p 6231 -U admin -R 12 -N 5 -f /tmp/tmpYJ0xip power on
Exit code: 1
Stdout: u''
Stderr: u'Set Chassis Power Control to Up/On failed: Command not supported in present state\n'.: ProcessExecutionError: Unexpected error while running command.
2018-05-18 10:44:08.595 24804 ERROR ironic.conductor.manager [req-545ca4e9-e609-4507-a169-bffb5a794b3a 868f66f3bd734efaa6332bcb6b2a6fab 23b75b4d1a164271969560c81d6883ad - default default] Error in deploy of node c3d408ac-d913-440c-b5a3-b70ae2b847c2: IPMI call failed: power on.: IPMIFailure: IPMI call failed: power on.

Are the IPMI credentials configured correctly for this node, or is there an issue with the BMC on this node?

Not being able to power on the node can result in this "Failure prepping block device" error.
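
The failure can be checked outside Ironic by re-running the same ipmitool invocation by hand. A sketch (the host, port, user, and password file mirror the command in the log above and are placeholders for this environment):

```shell
# Sketch of a manual BMC check mirroring Ironic's ipmitool invocation above.
check_bmc_power_on() {
  local host=$1 port=$2 user=$3 passfile=$4
  # If the BMC cannot even report status, suspect credentials or connectivity.
  ipmitool -I lanplus -H "$host" -p "$port" -U "$user" -f "$passfile" power status || return 1
  # "Command not supported in present state" on this step points at the BMC's
  # own state machine rather than at Ironic's configuration.
  ipmitool -I lanplus -H "$host" -p "$port" -U "$user" -f "$passfile" power on
}
```

If `power status` succeeds but `power on` still fails, the BMC itself (not the IPMI credentials) is the likely culprit.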

Comment 53 Omri Hochman 2018-05-21 12:54:32 UTC
Unable to reproduce this issue with the latest puddle from 05/18.  

To test this issue I ran a durability test over the weekend; the test deploys and deletes the overcloud stack 10 times.

Environment: 
-------------
openstack-heat-engine-10.0.1-0.20180411125640.el7ost.noarch
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch
python-heat-agent-1.5.4-0.20180308153305.ecf43c7.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-22.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
python2-heatclient-1.14.0-1.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180411125640.el7ost.noarch
puppet-heat-12.4.1-0.20180416203421.90e3fb0.el7ost.noarch
openstack-heat-api-10.0.1-0.20180411125640.el7ost.noarch



** Please re-open if the issue reproduces using this puddle.

Comment 54 Bob Fournier 2018-05-24 15:41:21 UTC
*** Bug 1571929 has been marked as a duplicate of this bug. ***

Comment 56 errata-xmlrpc 2018-06-27 13:48:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

Comment 57 John Fulton 2018-08-08 13:30:52 UTC
*** Bug 1608702 has been marked as a duplicate of this bug. ***

Comment 58 Donny Davis 2018-10-09 02:09:46 UTC
I am seeing this issue on OSP 13 when deploying with the networker role, and it resulted in a stack that could not be deleted (a completely different issue).

I added this to the networker flavor, and everything seems to be working. 

openstack flavor set --property --property "capabilities:boot_option"="local" --property "capabilities:profile"="networker" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" networker

Comment 59 Edu Alcaniz 2018-10-25 15:49:59 UTC
Correction: 

openstack flavor set  --property "capabilities:boot_option"="local" --property "capabilities:profile"="networker" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" networker
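
To confirm the properties took effect, a check along these lines can be used (a sketch; the flavor name and property spellings come from the command above, and the properties format assumed is the one OSC prints in Queens):

```shell
# Sketch: confirm a baremetal flavor zeroes the standard resource classes and
# requests one unit of the custom class, so scheduling uses CUSTOM_BAREMETAL.
verify_baremetal_flavor() {
  local flavor=$1 props
  props=$(openstack flavor show "$flavor" -f value -c properties)
  case "$props" in
    *"resources:CUSTOM_BAREMETAL='1'"*"resources:VCPU='0'"*|*"resources:VCPU='0'"*"resources:CUSTOM_BAREMETAL='1'"*)
      echo "ok: $flavor schedules via CUSTOM_BAREMETAL only" ;;
    *)
      echo "warning: $flavor still requests standard resources"
      return 1 ;;
  esac
}
```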

Comment 60 David Hill 2019-05-05 12:37:39 UTC
I can reproduce this easily when doing this:

1) Deploy an undercloud
2) Deploy an overcloud
3) Delete the overcloud
4) Deploy an overcloud
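
The delete/redeploy portion of that sequence can be scripted for repeated runs (a sketch; the deploy arguments, stack name, and iteration count are illustrative, not the exact command from this setup):

```shell
# Sketch: repeat the overcloud delete/redeploy cycle that triggers the failure.
redeploy_cycle() {
  local runs=${1:-2} i
  for i in $(seq 1 "$runs"); do
    echo "cycle $i: deleting overcloud"
    openstack overcloud delete overcloud -y || return 1
    echo "cycle $i: redeploying overcloud"
    openstack overcloud deploy --templates || return 1
  done
}
```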

The following errors are found in nova-compute.log:

2019-05-03 16:10:52.011 5283 ERROR ironicclient.common.http [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] Error contacting Ironic server: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 2 of 2: Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Instance failed to spawn: Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Traceback (most recent call last):
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2252, in _build_resources
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     yield resources
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2032, in _build_and_run_instance
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     block_device_info=block_device_info)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1044, in spawn
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     block_device_info=block_device_info)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 394, in _add_instance_info_to_node
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     retry_on_conflict=False)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 158, in call
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     return self._multi_getattr(client, method)(*args, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 336, in update
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     method=http_method)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 200, in _update
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     resp, body = self.api.json_request(method, url, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 645, in json_request
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     resp = self._http_request(url, method, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 247, in wrapper
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     return func(self, url, method, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 627, in _http_request
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     error_json.get('debuginfo'), method, url)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] 
2019-05-03 16:10:52.013 5283 INFO nova.compute.manager [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Terminating instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [req-c8308fc5-f76b-4063-90d7-d652fbb0f6a8 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.: BuildAbortException: Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] Traceback (most recent call last):
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1841, in _do_build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     filter_properties, request_spec)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2063, in _build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     bdms=block_device_mapping)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     self.force_reraise()
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     six.reraise(self.type_, self.value, self.tb)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2015, in _build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     block_device_mapping) as resources:
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     return self.gen.next()
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2235, in _build_resources
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     reason=msg)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] BuildAbortException: Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.
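
The HTTP 409 above means Nova hit a node that Ironic's conductor still had reserved from the previous teardown. One way to confirm the locks have cleared before redeploying is to poll the nodes' reservation field (a sketch; the poll interval and timeout are illustrative):

```shell
# Sketch: wait until no Ironic node is still reserved (locked) by a conductor.
wait_for_unlocked_nodes() {
  local timeout=${1:-300} interval=${2:-10} waited=0 locked node res
  while [ "$waited" -lt "$timeout" ]; do
    locked=0
    for node in $(openstack baremetal node list -f value -c UUID); do
      # reservation holds the conductor hostname while the node is locked
      res=$(openstack baremetal node show "$node" -f value -c reservation)
      if [ -n "$res" ] && [ "$res" != "None" ]; then
        locked=$((locked + 1))
      fi
    done
    if [ "$locked" -eq 0 ]; then
      return 0
    fi
    echo "$locked node(s) still locked by a conductor; retrying in ${interval}s"
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}
```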

Comment 61 Bob Fournier 2019-05-22 13:42:26 UTC
Hi Dave - can we get a sosreport or logs from your duplication method in https://bugzilla.redhat.com/show_bug.cgi?id=1560690#c60? We haven't seen that in our tests. Also, I'd like to create a separate bug for this - the "Failure prepping block device" is a generic error message and the original problem has been fixed. Thanks.

Comment 62 Bob Fournier 2019-05-24 20:26:00 UTC
Setting this back to closed and creating a new bug.  Note the "prepping block device" is a generic error message and not indicative that the same problem is occurring again.

Comment 63 Bob Fournier 2019-05-24 20:30:12 UTC
New bug is https://bugzilla.redhat.com/show_bug.cgi?id=1713790.

