Bug 1560690
Summary: Re-deployment of overcloud fails with: ResourceInError: resources.OpenShiftWorker: Went to status ERROR due to "Message: Build of instance c7c5191b-59ed-44a0-8b2a-0f68e48e9a52 aborted: Failure prepping block device., Code: 500"

Product: Red Hat OpenStack
Component: openstack-heat
Version: 13.0 (Queens)
Target Release: 13.0 (Queens)
Target Milestone: rc
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Alexander Chuzhoy <sasha>
Assignee: Julia Kreger <jkreger>
QA Contact: Alexander Chuzhoy <sasha>
Keywords: Reopened, Triaged
Flags: jkreger: needinfo-
CC: agurenko, ahrechan, bfournie, bschmaus, cgoncalves, dbecker, dhill, dondavis, ealcaniz, emacchi, gfidente, hjensas, imain, jkreger, jliberma, jschluet, jtaleric, klatouch, lnatapov, mburns, michele, mlammon, morazi, nsatsia, ohochman, racedoro, racpatel, rhel-osp-director-maint, rlondhe, sasha, sbaker, shardy, smalleni, sputhenp, srevivo
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-heat-10.0.1-0.20180411125640.el7ost
Doc Type: If docs needed, set a value
Clones: 1687381 (view as bug list)
Bug Blocks: 1214284, 1687381
Type: Bug
Last Closed: 2019-05-24 20:26:00 UTC
Description (Alexander Chuzhoy, 2018-03-26 18:52:45 UTC)
Created attachment 1413294 [details]
nova and ironic logs
Created attachment 1413311 [details]
neutron logs
The neutron l3_agent logs are full of these errors (the first entry is truncated in the capture):

...e_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 83d509e6c032450e8cb7aa246385a2dd
l3-agent.log:2018-03-21 08:50:40.232 28398 ERROR neutron.common.rpc [req-bba392f7-b308-4543-b9cb-de85e2d724c2 - - - - -] Timeout in RPC method get_service_plugin_list. Waiting for 6 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID fa19acc6923f478fa039c703ead62ac2
l3-agent.log:2018-03-21 09:00:46.570 28398 ERROR neutron.common.rpc [req-0536a495-7892-46f6-9d37-70ff11afa3f2 - - - - -] Timeout in RPC method get_service_plugin_list. Waiting for 42 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 2301c81877e649148fe731408d73f508
l3-agent.log:2018-03-21 09:11:28.913 28398 ERROR neutron.common.rpc [req-f4755bfb-deea-4685-bd3b-e2fd5140047b - - - - -] Timeout in RPC method get_service_plugin_list. Waiting for 37 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 83e22a0803904dfdaf5c8a893b2551d2

The neutron server logs have hundreds of these errors:

2018-03-26 03:44:15.687 3281 WARNING neutron.pecan_wsgi.controllers.root [req-68e614e3-1b38-43a9-aca0-d845f3f1036b 70e7c1e7b21a4d8bbf8e3b42b053eced 7a80892307b343d7ac09445827216679 - default default] No controller found for: floatingips - returning response code 404: PecanNotFound
2018-03-26 03:44:15.689 3281 INFO neutron.pecan_wsgi.hooks.translation [req-68e614e3-1b38-43a9-aca0-d845f3f1036b 70e7c1e7b21a4d8bbf8e3b42b053eced 7a80892307b343d7ac09445827216679 - default default] GET failed (client error): The resource could not be found.

According to Ihar, those neutron logs just indicate the l3 plugin is not loaded in the undercloud: "what the warning says is that we received a request for /floatingips but no loaded plugin provides the get_floatingips method. nova shouldn't issue those requests, so maybe it's an issue, just low priority. nova seems to not check that the extension is enabled, which is the wrong way of using the api; they should first check it's on, then call it. I understand that l3 is 99% of the time on, but e.g. the undercloud is different. Definitely not the root cause though."

In the Ironic conductor logs we see a couple of connection errors and a failure to power off a node:

2018-03-26 12:49:12.550 1878 ERROR ironic.drivers.modules.agent_client [req-36620f31-5221-44f4-a8a0-713a6043f7ac - - - - -] Error invoking agent command standby.power_off for node a1d2b4b1-17e9-416e-ad38-677f8a13cd30.
Error: HTTPConnectionPool(host='10.19.4.86', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcd387b6710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)): ConnectionError: HTTPConnectionPool(host='10.19.4.86', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcd387b6710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2018-03-26 12:49:12.552 1878 WARNING ironic.drivers.modules.agent_base_vendor [req-36620f31-5221-44f4-a8a0-713a6043f7ac - - - - -] Failed to soft power off node a1d2b4b1-17e9-416e-ad38-677f8a13cd30 in at least 30 seconds. Error: Error invoking agent command standby.power_off for node a1d2b4b1-17e9-416e-ad38-677f8a13cd30. Error: HTTPConnectionPool(host='10.19.4.86', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcd387b6710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)): IronicException: Error invoking agent command standby.power_off for node a1d2b4b1-17e9-416e-ad38-677f8a13cd30. Error: HTTPConnectionPool(host='10.19.4.86', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcd387b6710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))

I was looking through the ironic state and found that the entry that was failing had a duplicate 'tenant_vif_port_id' for some reason.
# ironic port-show 7679f37a-c6a1-476c-9a4a-9ac53a3a5ff4
+-----------------------+------------------------------------------------------------------+
| Property              | Value                                                            |
+-----------------------+------------------------------------------------------------------+
| address               | 54:9f:35:f6:70:55                                                |
| created_at            | 2018-03-20T16:32:57+00:00                                        |
| extra                 | {}                                                               |
| internal_info         | {u'tenant_vif_port_id': u'2154ee13-0ba8-48cf-a45a-71e111ffece4'} |
| local_link_connection | {u'port_id': u'Gi1/0/5', u'switch_id': u'f8:b1:56:37:7e:b4'}     |
| node_uuid             | 6ae0110d-eef7-4e3d-b2db-414a2db39b67                             |
| physical_network      | None                                                             |
| portgroup_uuid        | None                                                             |
| pxe_enabled           | True                                                             |
| updated_at            | 2018-03-26T20:09:34+00:00                                        |
| uuid                  | 7679f37a-c6a1-476c-9a4a-9ac53a3a5ff4                             |
+-----------------------+------------------------------------------------------------------+

# ironic port-show a479cb4c-5239-4738-b0d9-253f21d35066
+-----------------------+------------------------------------------------------------------+
| Property              | Value                                                            |
+-----------------------+------------------------------------------------------------------+
| address               | 54:9f:35:f6:70:3b                                                |
| created_at            | 2018-03-20T16:31:42+00:00                                        |
| extra                 | {}                                                               |
| internal_info         | {u'tenant_vif_port_id': u'b578dd75-6707-49c8-81b3-30267c2310a1'} |
| local_link_connection | {u'port_id': u'Gi1/0/3', u'switch_id': u'f8:b1:56:37:7e:b4'}     |
| node_uuid             | 9ebf1192-2987-40b0-8028-38b923f2332a                             |
| physical_network      | None                                                             |
| portgroup_uuid        | None                                                             |
| pxe_enabled           | True                                                             |
| updated_at            | 2018-03-20T21:30:30+00:00                                        |
| uuid                  | a479cb4c-5239-4738-b0d9-253f21d35066                             |
+-----------------------+------------------------------------------------------------------+

With one of them being 'active' and the other 'available'. The error occurred while trying to set up the available node listed here.
Sorry, I grabbed the wrong entry there :) The duplicate one is:

# ironic port-show 1db7c067-ef37-4052-a99d-0d7885424d82
+-----------------------+------------------------------------------------------------------+
| Property              | Value                                                            |
+-----------------------+------------------------------------------------------------------+
| address               | 54:9f:35:f6:70:2e                                                |
| created_at            | 2018-03-20T16:32:28+00:00                                        |
| extra                 | {}                                                               |
| internal_info         | {u'tenant_vif_port_id': u'2154ee13-0ba8-48cf-a45a-71e111ffece4'} |
| local_link_connection | {u'port_id': u'Gi1/0/2', u'switch_id': u'f8:b1:56:37:7e:b4'}     |
| node_uuid             | a1d2b4b1-17e9-416e-ad38-677f8a13cd30                             |
| physical_network      | None                                                             |
| portgroup_uuid        | None                                                             |
| pxe_enabled           | True                                                             |
| updated_at            | 2018-03-26T20:09:53+00:00                                        |
| uuid                  | 1db7c067-ef37-4052-a99d-0d7885424d82                             |
+-----------------------+------------------------------------------------------------------+

which has the same vif id as the first entry in the last post. Sorry for the confusion! In the logs there are errors both for attach and detach:

2018-03-26 14:44:20.930 9165 DEBUG wsme.api [req-f95d1c83-636e-4ee9-a307-cca2cb228405 b129778b3822410ea1199ee7beee87a4 7a80892307b343d7ac09445827216679 - default default] Client-side error: Unable to attach VIF 72b500c7-da06-40be-924e-941f87008365, not enough free physical ports.
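A duplicate like the one above can be spotted mechanically by grouping ports by their `tenant_vif_port_id`. A minimal sketch in Python; the sample data is hand-copied from the port-show output in this bug, and in practice you would build the list from the ironic API or CLI output:

```python
from collections import defaultdict

def find_duplicate_vifs(ports):
    """Group ironic ports by tenant_vif_port_id and return any VIF
    that appears on more than one port (which should never happen)."""
    by_vif = defaultdict(list)
    for port in ports:
        vif = port.get("internal_info", {}).get("tenant_vif_port_id")
        if vif:
            by_vif[vif].append(port["uuid"])
    return {vif: uuids for vif, uuids in by_vif.items() if len(uuids) > 1}

# Sample data mirroring the three port-show listings in this bug.
ports = [
    {"uuid": "7679f37a-c6a1-476c-9a4a-9ac53a3a5ff4",
     "internal_info": {"tenant_vif_port_id": "2154ee13-0ba8-48cf-a45a-71e111ffece4"}},
    {"uuid": "a479cb4c-5239-4738-b0d9-253f21d35066",
     "internal_info": {"tenant_vif_port_id": "b578dd75-6707-49c8-81b3-30267c2310a1"}},
    {"uuid": "1db7c067-ef37-4052-a99d-0d7885424d82",
     "internal_info": {"tenant_vif_port_id": "2154ee13-0ba8-48cf-a45a-71e111ffece4"}},
]
print(find_duplicate_vifs(ports))  # the one VIF attached to two ports
```

This reproduces the observation above: VIF 2154ee13-0ba8-48cf-a45a-71e111ffece4 shows up on two different ports.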
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 226, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 2924, in vif_attach
    task.driver.network.vif_attach(task, vif_info)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 551, in vif_attach
    port_like_obj = get_free_port_like_object(task, vif_id, physnets)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 175, in get_free_port_like_object
    raise exception.NoFreePhysicalPorts(vif=vif_id)
NoFreePhysicalPorts: Unable to attach VIF 72b500c7-da06-40be-924e-941f87008365, not enough free physical ports. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222

2018-03-26 14:44:23.818 9165 DEBUG wsme.api [req-fafebeda-a61b-4fde-af8e-93722c3afdb3 b129778b3822410ea1199ee7beee87a4 7a80892307b343d7ac09445827216679 - default default] Client-side error: Unable to detach VIF 72b500c7-da06-40be-924e-941f87008365 from node 00b6bf71-2a7e-4e1c-8b9b-ce24c7a935b0 because it is not attached to it.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 226, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 2950, in vif_detach
    task.driver.network.vif_detach(task, vif_id)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 582, in vif_detach
    port_like_obj = self._get_port_like_obj_by_vif_id(task, vif_id)
  File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/common.py", line 333, in _get_port_like_obj_by_vif_id
    raise exception.VifNotAttached(vif=vif_id, node=task.node.uuid)
VifNotAttached: Unable to detach VIF 72b500c7-da06-40be-924e-941f87008365 from node 00b6bf71-2a7e-4e1c-8b9b-ce24c7a935b0 because it is not attached to it.
For the NoFreePhysicalPorts error, it would be interesting to know which one. https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/network/common.py#L114-L123

Edit: I was going to suggest some additional debug logging, but based on Ian's findings above it is likely L117: if _vif_attached(p, vif_id):

Proposed upstream patch https://review.openstack.org/#/c/556748/ will need to be backported.

I also observed a similar error with redeployment last week, but today I am facing a similar error with a new deployment:
- deleted the previous stack
- reimaged the undercloud server
- deployed again

"Went to status ERROR due to "Message: Build of instance 7792026b-8c4d-4b44-81c3-102ccaec2a92 aborted: Failure prepping block device., Code: 500""

I am using "rhos-release 13 -p 2018-03-19.2". Let me know if you need logs, access to the servers, or any other details. Please let me know if my error is not related to this bug and I should open a new BZ for it.

The following worked for me... As a W/A you can either:
- remove all non-active entries from ironic and re-insert/introspect, or
- apply the patch from comment #9 and restart openstack-ironic-conductor.

(In reply to Alexander Chuzhoy from comment #11)
> The following worked for me...
> As a W/A you can either:
> Removing all non-active entries from ironic and re-insert/introspect
> or
> Apply patch from comment #9 and restart openstack-ironic-conductor.
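For context, the selection logic at those lines works roughly as follows. This is a standalone simplification, not the actual driver code (the function and exception names mirror the upstream ones, and the port dictionaries are hypothetical); it shows how a single stale tenant_vif_port_id is enough to make a node look fully occupied and raise NoFreePhysicalPorts:

```python
class VifAlreadyAttached(Exception):
    """Raised when the requested VIF is already bound to a port."""

class NoFreePhysicalPorts(Exception):
    """Raised when every port on the node already carries a VIF."""

def _vif_attached(port, vif_id):
    # The check suspected above (common.py L117), simplified: a port
    # counts as carrying this VIF if either field references it.
    return vif_id in (port["internal_info"].get("tenant_vif_port_id"),
                      port["extra"].get("vif_port_id"))

def get_free_port_like_object(ports, vif_id):
    """Return the first port with no VIF attached at all."""
    free = []
    for p in ports:
        if _vif_attached(p, vif_id):
            raise VifAlreadyAttached(vif_id)
        if not (p["internal_info"].get("tenant_vif_port_id")
                or p["extra"].get("vif_port_id")):
            free.append(p)
    if not free:
        # A single orphaned VIF on a one-port node lands here.
        raise NoFreePhysicalPorts(vif_id)
    return free[0]

# One port with a stale VIF left over from a previous deployment,
# and one clean port for comparison (hypothetical data):
stale = {"internal_info": {"tenant_vif_port_id": "old-vif"}, "extra": {}}
clean = {"internal_info": {}, "extra": {}}
```

With only the stale port available, attaching a new VIF fails with NoFreePhysicalPorts, which is why clearing the orphaned entries (the workaround above) unblocks the deploy.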
Tried both approaches:
1) remove all entries from ironic and re-insert/introspect
2) apply the patch and restart the service:

curl -4 -k https://review.openstack.org/changes/556748/revisions/9715f80915c124b47c4463d7af4d505c5bedab9b/patch?download | base64 -d | sudo patch -t -d /usr/lib/python2.7/site-packages/ -p1
sudo systemctl stop openstack-ironic-conductor.service
sudo systemctl start openstack-ironic-conductor.service

Error:
======
ResourceInError: resources.ComputeHCI: Went to status ERROR due to "Message: Build of instance d3a4342c-42ae-46af-ac70-6b670c2bc309 aborted: Failure prepping block device., Code: 500"

My testing is blocked due to this error. (Let me know if you need logs or access to the servers or any other details. Please let me know if my error is not related to this bug and I should open a new BZ for it.)

Version:
openstack-nova-compute-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
python-nova-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
puppet-neutron-12.3.1-0.20180222064632.07b93f1.el7ost.noarch
python-neutron-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
openstack-nova-api-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-nova-scheduler-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch
openstack-neutron-ml2-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
openstack-ironic-conductor-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch
instack-undercloud-8.3.1-0.20180304032746.fc5704f.el7ost.noarch
puppet-ironic-12.3.1-0.20180221115553.12ab03d.el7ost.noarch
openstack-neutron-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
python2-neutronclient-6.7.0-0.20180211221651.95d64ce.el7ost.noarch
puppet-nova-12.3.1-0.20180221110444.c11423e.el7ost.noarch
python2-ironicclient-2.2.0-0.20180211230646.683b7c6.el7ost.noarch
openstack-ironic-common-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch
python2-neutron-lib-1.13.0-0.20180211233639.dcf96cd.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-0.20180220235748.de59d74.el7ost.noarch
openstack-nova-conductor-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-ironic-api-10.1.2-0.20180302010846.7cd9deb.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180302142656.397a98a.el7ost.noarch
openstack-nova-common-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
openstack-neutron-common-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
python-ironic-inspector-client-3.1.0-0.20180213173236.c82b59f.el7ost.noarch
openstack-neutron-openvswitch-12.0.0-0.20180222093624.abb60c6.0rc2.el7ost.noarch
python-ironic-lib-2.12.0-0.20180213172054.831c55b.el7ost.noarch
openstack-nova-placement-api-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch

So after re-installing the setup and applying the patch from comment #9, we saw the error appearing in the logs:

2018-03-29 20:57:48Z [openshift.OpenShiftMaster.0.OpenShiftMaster]: CREATE_IN_PROGRESS state changed
2018-03-29 20:58:15Z [openshift.OpenShiftWorker.0.OpenShiftWorker]: CREATE_FAILED ResourceInError: resources.OpenShiftWorker: Went to status ERROR due to "Message: Build of instance 7fccdbd5-fdea-42d1-bb30-696f4c62f47a aborted: Failure prepping block device., Code: 500"
2018-03-29 20:58:15Z [openshift.OpenShiftWorker.0.OpenShiftWorker]: DELETE_IN_PROGRESS state changed
2018-03-29 20:58:18Z [openshift.OpenShiftWorker.0.OpenShiftWorker]: DELETE_COMPLETE state changed
2018-03-29 20:58:21Z [openshift.OpenShiftWorker.0.OpenShiftWorker]: CREATE_IN_PROGRESS state changed

But the deployment continued and completed successfully. Successfully re-deployed 4 times now.

This seems like something occurring in nova's block device handling (which also requires network handling). The node could also just be getting rescheduled, possibly.

1) Have the ports in ironic been checked for any legacy vifs that are no longer in use? In particular on any presently undeployed nodes?
2) Are there current logs with this error available that can be reviewed? The last comment was from the 29th; the attachments appear to be from the 26th.

-Julia

After looking at the environment where this was encountered, and four successful stack deploys later, I'm fairly certain I understand what is going on. In essence (and distinct from some of the other cases, where a vif was orphaned in Ironic's database), in Queens any sort of scheduling/locking issue with obtaining a baremetal node, or trying to use an already-scheduled node, will surface as this error. The configuration being used for the flavor:

+----------------------------+---------------------------------------------------------------------------------------+
| Field                      | Value                                                                                 |
+----------------------------+---------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                                                 |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                                     |
| access_project_ids         | None                                                                                  |
| disk                       | 40                                                                                    |
| id                         | 51bd7b31-94cb-43a6-8d53-ea9810f0bc14                                                  |
| name                       | baremetal                                                                             |
| os-flavor-access:is_public | True                                                                                  |
| properties                 | capabilities:boot_option='local', capabilities:profile='baremetal', cpu_arch='x86_64' |
| ram                        | 4096                                                                                  |
| rxtx_factor                | 1.0                                                                                   |
| swap                       |                                                                                       |
| vcpus                      | 1                                                                                     |
+----------------------------+---------------------------------------------------------------------------------------+

There was no use of the resource class. So what was occurring was that nova was saying "Oh, it looks like you can fit two of the requested instances on to this single baremetal node", which architecturally is impossible.
This was presenting as something like:

2018-04-19 19:49:06.832 18246 ERROR nova.virt.ironic.driver [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports. (HTTP 400)
2018-04-19 19:49:06.833 18246 ERROR nova.virt.ironic.driver [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] [instance: 964f0d93-e9c4-4067-af8e-cd6e25fb6b59] Error preparing deploy for instance 964f0d93-e9c4-4067-af8e-cd6e25fb6b59 on baremetal node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8.: VirtualInterfacePlugException: Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports. (HTTP 400)
2018-04-19 19:49:06.833 18246 ERROR nova.compute.manager [req-90f1e5e7-1ee0-4f1d-af88-a42f74b0a8e0 e9b4e6ab60ae40cc84ee5689c38608ef f3dccd2210514e3695c4d087d81a65a7 - default default] [instance: 964f0d93-e9c4-4067-af8e-cd6e25fb6b59] Failure prepping block device: VirtualInterfacePlugException: Cannot attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253 to the node 64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 due to error: Unable to attach VIF 7d0a6b40-50b3-489b-ae1e-0840e0608253, not enough free physical ports.
(HTTP 400)

Looking at the http transaction log:

192.168.24.1 - - [19/Apr/2018:19:49:00 -0400] "POST /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs HTTP/1.1" 204 - "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:04 -0400] "POST /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs HTTP/1.1" 400 177 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:05 -0400] "PATCH /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 HTTP/1.1" 200 1239 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:06 -0400] "DELETE /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs/7d0a6b40-50b3-489b-ae1e-0840e0608253 HTTP/1.1" 400 225 "-" "python-ironicclient"
192.168.24.1 - - [19/Apr/2018:19:49:20 -0400] "POST /v1/nodes/67eeb698-cfd7-4f30-83c7-c6748f78da60/vifs HTTP/1.1" 204 - "-" "python-ironicclient"

When attempting to correlate the VIF-plugging API actions to the nova-compute service log, I discovered that the payloads of the POST operations were for two different VIFs. The PATCH operation above is where the node is essentially "reserved"; however, due to the needs of BFV, we attempt to attach the VIF first. This is where things begin to go sideways. If we're trying to put two different VIFs on the same node, with no awareness that the node is already in use, then we will absolutely fail. With many nodes being deployed at the same time, it is conceivable that this cascades to the point where nova eventually gives up because it has retried three times. There is likely something in the "placement" algorithm that creates hotspots, which causes longer-lived undercloud deployments to prefer scheduling on to the same nodes more frequently, which may be why this is seemingly happening more often. What should we be doing?
We should be setting the resource class fields on the flavor:

openstack flavor set baremetal --property "resources:CUSTOM_BAREMETAL"="1"
openstack flavor set baremetal --property "resources:DISK_GB"="0"
openstack flavor set baremetal --property "resources:MEMORY_MB"="0"
openstack flavor set baremetal --property "resources:VCPU"="0"

Which results in the flavor looking like:

+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                      | Value                                                                                                                                                              |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                                                                                                                              |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                                                                                                                  |
| access_project_ids         | None                                                                                                                                                               |
| disk                       | 40                                                                                                                                                                 |
| id                         | 36e063d9-bcb8-4f02-bb79-8edc8e8dc9ae                                                                                                                               |
| name                       | baremetal                                                                                                                                                          |
| os-flavor-access:is_public | True                                                                                                                                                               |
| properties                 | capabilities:boot_option='local', capabilities:profile='baremetal', cpu_arch='x86_64', resources:CUSTOM_BAREMETAL='1', resources:DISK_GB='0', resources:MEMORY_MB='0', resources:VCPU='0' |
| ram                        | 4096                                                                                                                                                               |
| rxtx_factor                | 1.0                                                                                                                                                                |
| swap                       |                                                                                                                                                                    |
| vcpus                      | 2                                                                                                                                                                  |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Doing this results in better-looking behavior when requesting nodes:

192.168.24.1 - - [20/Apr/2018:01:02:07 -0400] "POST /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8/vifs HTTP/1.1" 204 - "-" "python-ironicclient"
192.168.24.1 - - [20/Apr/2018:01:02:08 -0400] "POST /v1/nodes/67eeb698-cfd7-4f30-83c7-c6748f78da60/vifs HTTP/1.1" 204 - "-" "python-ironicclient"
192.168.24.1 - - [20/Apr/2018:01:02:12 -0400] "PATCH /v1/nodes/64eea9c8-b69f-4a30-a05f-2c7b09f5b1d8 HTTP/1.1" 200 1245 "-" "python-ironicclient"
192.168.24.1 - -
[20/Apr/2018:01:02:13 -0400] "PATCH /v1/nodes/67eeb698-cfd7-4f30-83c7-c6748f78da60 HTTP/1.1" 200 1246 "-" "python-ironicclient"

Note the two different node UUIDs being used up front. Before switching to resource classes for scheduling, I was seeing a roughly 50% rate at which a single node would be selected by placement for both instances. With only two instances being requested and two nodes available in the resource class at present, it is relatively easy for the instance that needs to be rescheduled to end up successfully rescheduled. However, with more nodes being requested at one time, the odds of three failures occurring for any given requested instance go up.

Can this be fixed without changing the flavors? tl;dr: somewhere between maybe and most likely not. This is a complex race that really only ironic can solve in its virt driver inside of nova, by adding additional checks and attempting to lock a node earlier. That being said, I'm unsure at present whether we are able to, or will be able to, continue to raise a particular exception to force rescheduling of the requested instance without a negative impact. If we do try to fix it, then we're ultimately trying to work around the side effect of deprecated behavior, which is not ideal.

Where do we go from here? It would be good to understand whether the environments where we have been encountering extremely similar issues are configured this way or the prior way. If this can be tested on a few of those more complex environments, that would help as well.

> Can this be fixed without changing the flavors?
Maybe it can, but it should not. We have to use the correct flavors, or we'll get broken one day anyway.
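The scheduling difference described above can be illustrated with a toy placement-style capacity check. This is not the actual placement code, and the inventory numbers are illustrative assumptions, not taken from this environment; it only shows why zeroing the standard resources and requesting CUSTOM_BAREMETAL=1 makes double placement on one node impossible:

```python
def allocate(inventory, used, request):
    """Placement-style check: the request fits only if, for every
    resource class with a nonzero requested amount, used + requested
    stays within the node's inventory. On success, record the usage."""
    fits = all(used.get(rc, 0) + amt <= inventory.get(rc, 0)
               for rc, amt in request.items() if amt > 0)
    if fits:
        for rc, amt in request.items():
            used[rc] = used.get(rc, 0) + amt
    return fits

# One baremetal node (illustrative: it reports more CPU/RAM/disk
# than a single flavor-sized request consumes).
inventory = {"VCPU": 2, "MEMORY_MB": 8192, "DISK_GB": 80, "CUSTOM_BAREMETAL": 1}

# Old-style flavor: vcpus=1, ram=4096, disk=40, no resource class.
old_request = {"VCPU": 1, "MEMORY_MB": 4096, "DISK_GB": 40}
used = {}
print(allocate(inventory, used, old_request))  # True
print(allocate(inventory, used, old_request))  # True again: two instances "fit" on one node

# Resource-class flavor: standard resources zeroed out.
new_request = {"VCPU": 0, "MEMORY_MB": 0, "DISK_GB": 0, "CUSTOM_BAREMETAL": 1}
used = {}
print(allocate(inventory, used, new_request))  # True
print(allocate(inventory, used, new_request))  # False: the node is taken
```

With CUSTOM_BAREMETAL as the only counted resource, a node can hold exactly one instance regardless of how much spare CPU, RAM, or disk it reports, which is the behavior the flavor changes above aim for.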
I'm going to take a look at double-checking/fixing the locking behavior so we can hopefully gracefully avoid this. Opened launchpad https://bugs.launchpad.net/nova/+bug/1766301

Saw this error on a setup during the first attempt to deploy. I did delete the baremetal nodes and re-import them using the ipmi driver, though.

2018-04-24 11:24:35.603 25823 ERROR nova.compute.manager [instance: 3327c681-7b3b-446c-a771-0de451a4d72e] reason=msg)
2018-04-24 11:24:35.603 25823 ERROR nova.compute.manager [instance: 3327c681-7b3b-446c-a771-0de451a4d72e] BuildAbortException: Build of instance 3327c681-7b3b-446c-a771-0de451a4d72e aborted: Failure prepping block device.
2018-04-24 11:24:35.603 25823 ERROR nova.compute.manager [instance: 3327c681-7b3b-446c-a771-0de451a4d72e]

Adding ironic and nova logs.

Created attachment 1426153 [details]
nova and ironic logs
Happens to me too. Also getting messages: "No valid host was found. There are not enough hosts available., Code: 500". Topology: 3 controllers, 1 compute. Reducing the number of overcloud nodes to 1 compute and 1 controller solves the problem.

-----
2018-05-08 10:11:53Z [overcloud.Controller.2.NodeAdminUserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:54Z [overcloud.Controller.1.NodeAdminUserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:54Z [overcloud.Controller.2.DeploymentActions]: CREATE_COMPLETE state changed
2018-05-08 10:11:54Z [overcloud.Controller.1.RoleUserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:54Z [overcloud.Compute.0.NodeUserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:54Z [overcloud.Controller.2.UpdateConfig]: CREATE_COMPLETE state changed
2018-05-08 10:11:54Z [overcloud.Compute.0.UserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:54Z [overcloud.Controller.0.DeploymentActions]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:54Z [overcloud.Controller.0.DeploymentActions]: CREATE_COMPLETE state changed
2018-05-08 10:11:54Z [overcloud.Compute.0.UserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:55Z [overcloud.Controller.0.UpdateConfig]: CREATE_COMPLETE state changed
2018-05-08 10:11:55Z [overcloud.Controller.2.NodeUserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:55Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:55Z [overcloud.Controller.1.ControllerUpgradeInitConfig]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:55Z [overcloud.Controller.1.ControllerUpgradeInitConfig]: CREATE_COMPLETE state changed
2018-05-08 10:11:55Z [overcloud.Compute.0.UpdateConfig]: CREATE_COMPLETE state changed
2018-05-08 10:11:55Z [overcloud.Controller.0.NodeAdminUserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:55Z [overcloud.Controller.2.RoleUserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:56Z [overcloud.Controller.1.RoleUserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:56Z [overcloud.Controller.1.NodeUserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:56Z [overcloud.Controller.2.NodeUserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:57Z [overcloud.Controller.0.NodeAdminUserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:57Z [overcloud.Controller.2.RoleUserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:57Z [overcloud.Controller.1.UpdateConfig]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:57Z [overcloud.Controller.0.UserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:57Z [overcloud.Controller.2.UserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:57Z [overcloud.Controller.0.UserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:57Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:58Z [overcloud.Controller.1.NodeUserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:58Z [overcloud.Controller.2.UserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:58Z [overcloud.Controller.1.UserData]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:58Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:58Z [overcloud.Controller.1.UserData]: CREATE_COMPLETE state changed
2018-05-08 10:11:58Z [overcloud.Controller.1.Controller]: CREATE_IN_PROGRESS state changed
2018-05-08 10:11:59Z [overcloud.Controller.1.UpdateConfig]: CREATE_COMPLETE state changed
2018-05-08 10:14:55Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance e7d10312-a672-4cab-ace2-984116e8832c., Code: 500"
2018-05-08 10:14:55Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-08 10:14:58Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE state changed
2018-05-08 10:15:01Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-08 10:15:04Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-08 10:15:04Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-08 10:15:06Z [overcloud.Controller.2.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Unknown, Code: Unknown"
2018-05-08 10:15:06Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS state changed
2018-05-08 10:15:06Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE state changed
2018-05-08 10:15:08Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE state changed
2018-05-08 10:15:11Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-08 10:15:11Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS state changed
2018-05-08 10:15:14Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-08 10:15:14Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-08 10:15:16Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE state changed
2018-05-08 10:15:20Z [overcloud.Controller.2.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 22ba25e5-1336-4594-9a71-c93ccf43d1d1 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:15:20Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS state changed
2018-05-08 10:15:22Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE state changed
2018-05-08 10:15:25Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-08 10:15:28Z [overcloud.Compute.0.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-08 10:15:28Z [overcloud.Compute.0.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-08 10:15:28Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS state changed
2018-05-08 10:15:30Z [overcloud.Compute.0.NovaCompute]: DELETE_COMPLETE state changed
2018-05-08 10:15:31Z [overcloud.Controller.2.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-08 10:15:31Z [overcloud.Controller.2.Controller]: DELETE_IN_PROGRESS state changed
2018-05-08 10:15:33Z [overcloud.Controller.2.Controller]: DELETE_COMPLETE state changed
2018-05-08 10:15:34Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:15:34Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2018-05-08 10:15:36Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2018-05-08 10:15:39Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2018-05-08 10:15:43Z [overcloud.Controller.2.Controller]: CREATE_IN_PROGRESS state changed
2018-05-08 10:15:47Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-08 10:15:49Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 4db60876-5533-488f-8dd1-57b240faa6f3 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:15:49Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2018-05-08 10:15:52Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2018-05-08 10:15:57Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2018-05-08 10:16:08Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 885006cd-385d-4d17-8b0f-8cfd4a5506a0 aborted: Failure prepping block device., Code: 500"
2018-05-08 10:16:08Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2018-05-08 10:16:10Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2018-05-08 10:16:19Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state
changed 2018-05-08 10:16:29Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 4686fb22-6060-424a-b042-627cda0db173 aborted: Failure prepping block device., Code: 500" 2018-05-08 10:16:29Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed 2018-05-08 10:16:31Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed 2018-05-08 10:16:49Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed 2018-05-08 10:17:03Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance cc9a7364-c101-4151-92d5-20b29cdcb666 aborted: Failure prepping block device., Code: 500" 2018-05-08 10:17:03Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed 2018-05-08 10:17:06Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed 2018-05-08 10:17:41Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed 2018-05-08 10:17:52Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 828c2323-b6d3-4ee6-bf1e-82a5f81337fc aborted: Failure prepping block device., Code: 500" 2018-05-08 10:17:52Z [overcloud.Controller.0]: CREATE_FAILED Resource CREATE failed: ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500" 2018-05-08 10:17:52Z [overcloud.Controller.0]: CREATE_FAILED ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500" 2018-05-08 10:17:52Z [overcloud.Controller]: UPDATE_FAILED Resource CREATE failed: ResourceInError: resources[0].resources.Controller: Went to status ERROR 
due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500" 2018-05-08 10:17:53Z [overcloud.Controller]: CREATE_FAILED resources.Controller: Resource CREATE failed: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Code: 500" 2018-05-08 10:17:53Z [overcloud]: CREATE_FAILED Resource CREATE failed: resources.Controller: Resource CREATE failed: ResourceInError: resources[0].resources.Controller: Went to status ERROR due to "Message: Build of instance 6fc2fd23-b9d8-4b01-af03-6fe5fbdd2a86 aborted: Failure prepping block device., Stack overcloud CREATE_FAILED overcloud.Controller.0.Controller: resource_type: OS::TripleO::ControllerServer physical_resource_id: 828c2323-b6d3-4ee6-bf1e-82a5f81337fc status: CREATE_FAILED status_reason: | ResourceInError: resources.Controller: Went to status ERROR due to "Message: Build of instance 828c2323-b6d3-4ee6-bf1e-82a5f81337fc aborted: Failure prepping block device., Code: 500" Heat Stack create failed. Heat Stack create failed. Still seeing this on my puddle from 2018-05-02.5 Per https://bugzilla.redhat.com/show_bug.cgi?id=1560690#c23 Please ensure you are setting the resource class fields on the flavor: openstack flavor set baremetal --property "resources:CUSTOM_BAREMETAL"="1" openstack flavor set baremetal --property "resources:DISK_GB"="0" openstack flavor set baremetal --property "resources:MEMORY_MB"="0" openstack flavor set baremetal --property "resources:VCPU"="0" Hey Bob, Why not have these properties set by default on the baremetal flavor that director sets up? 
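The four flavor overrides above can be wrapped in a small helper. This is a hypothetical sketch, not part of the bug thread: the function name is invented, and it only echoes the `openstack flavor set` commands so they can be reviewed before being run against a real undercloud.

```shell
# Hypothetical dry-run helper: print the flavor-set commands that zero out
# the standard scheduling resources (VCPU/MEMORY_MB/DISK_GB) and request a
# single CUSTOM_BAREMETAL unit instead, for the given flavor name.
print_baremetal_flavor_cmds() {
  flavor="$1"
  for prop in "resources:CUSTOM_BAREMETAL=1" \
              "resources:DISK_GB=0" \
              "resources:MEMORY_MB=0" \
              "resources:VCPU=0"; do
    echo "openstack flavor set $flavor --property $prop"
  done
}

print_baremetal_flavor_cmds baremetal
```

Piping the output to `sh` on an undercloud would apply the properties; running it once per deployment flavor (baremetal, networker, and so on) matches the per-flavor advice that comes up later in this thread.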
In Sai's setup I confirmed that he updated the baremetal flavor to use the resource class correctly:

(undercloud) [stack@c08-h26-r630 ~]$ openstack flavor show baremetal
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| Field                      | Value                                                                                                                                |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                                                                                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                                                                                    |
| access_project_ids         | None                                                                                                                                 |
| disk                       | 40                                                                                                                                   |
| id                         | 077a71a5-d336-4508-be14-28f224650c32                                                                                                 |
| name                       | baremetal                                                                                                                            |
| os-flavor-access:is_public | True                                                                                                                                 |
| properties                 | capabilities:boot_option='local', resources:CUSTOM_BAREMETAL='1', resources:DISK_GB='0', resources:MEMORY_MB='0', resources:VCPU='0' |
| ram                        | 4096                                                                                                                                 |
| rxtx_factor                | 1.0                                                                                                                                  |
| swap                       |                                                                                                                                      |
| vcpus                      | 1                                                                                                                                    |
+----------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
(undercloud) [stack@c08-h26-r630 ~]$

However, he still occasionally hits the scheduling issue; we see this in the nova logs:

nova/nova-compute.log:2018-05-08 16:57:07.491 22079 ERROR nova.compute.manager [req-e81cc5d3-81d9-436e-9166-ab12a649c20e 05f424ebba56487797be7f76cc3fc5b7 cb950f51a4dd41799b85f96c5cf712d7 - default default] [instance: cb7f12e1-7ab4-42a3-b13e-af18af69e093] Failure prepping block device: VirtualInterfacePlugException: Cannot attach VIF 693dec9a-1691-4dbe-83dd-afaecdb6e166 to the node c3c21b1f-6410-4397-bb3b-0ce385ffe232 due to error: Unable to attach VIF 693dec9a-1691-4dbe-83dd-afaecdb6e166, not enough free physical ports.
(HTTP 400)

We are also seeing many of these errors in nova, related to the resource class:

/nova-placement-api.log:2018-05-08 16:54:56.456 119895 DEBUG nova.api.openstack.placement.wsgi_wrapper [req-080364c1-e4da-4f6a-bd80-410b0e817ad9 f7e2a20676bf4948ab011404ba3d4353 cb950f51a4dd41799b85f96c5cf712d7 - default default] Placement API returning an error response: Unable to allocate inventory: Unable to create allocation for 'CUSTOM_BAREMETAL' on resource provider '40b677b5-fe1a-4006-ae87-d62082bbe43d'. The requested amount would exceed the capacity. call_func /usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi_wrapper.py:31

> Why not have these properties set by default on the baremetal flavor that director sets up?

We have an RFE to validate they are set: https://bugzilla.redhat.com/show_bug.cgi?id=1564207

I'm seeing this VIF exception in many deployments recently. It looks like the underlying problem is with ironic power commands (which, by the way, cause ironic to retry with a different baremetal node and then hit the VIF issue); more details in https://bugzilla.redhat.com/show_bug.cgi?id=1576073

I'm facing the reported bug. I ran a clean undercloud+overcloud deployment (latest puddle, 2018-05-18.1) with infrared.

[stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
13 -p 2018-05-18.1
[stack@undercloud-0 ~]$ rpm -q openstack-heat-common
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch

overcloud_install.log:
[...]
2018-05-18 14:43:11Z [overcloud.Compute.1.UserData]: CREATE_IN_PROGRESS state changed
2018-05-18 14:43:11Z [overcloud.Compute.0.UserData]: CREATE_IN_PROGRESS state changed
2018-05-18 14:43:11Z [overcloud.Compute.1.UserData]: CREATE_COMPLETE state changed
2018-05-18 14:43:11Z [overcloud.Compute.0.UserData]: CREATE_COMPLETE state changed
2018-05-18 14:43:11Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-18 14:43:11Z [overcloud.Compute.0.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-18 14:43:12Z [overcloud.Compute.0.UpdateConfig]: CREATE_COMPLETE state changed
2018-05-18 14:44:16Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance a0c9d982-beb1-4ee2-b732-00ffe0ed904d., Code: 500"
2018-05-18 14:44:16Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2018-05-18 14:44:18Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2018-05-18 14:44:22Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2018-05-18 14:44:23Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:44:23Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-18 14:44:24Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE state changed
2018-05-18 14:44:24Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-18 14:44:24Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2018-05-18 14:44:26Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2018-05-18 14:44:28Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-18 14:44:31Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2018-05-18 14:44:33Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-18 14:44:33Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2018-05-18 14:44:35Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2018-05-18 14:44:35Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 10f0d584-975d-4285-9435-8e5ebc4aaaf4 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:44:35Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-18 14:44:37Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE state changed
2018-05-18 14:44:43Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-18 14:44:45Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2018-05-18 14:44:47Z [overcloud.Controller.0.Controller]: CREATE_FAILED ResourceInError: resources.Controller: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2018-05-18 14:44:47Z [overcloud.Controller.0.Controller]: DELETE_IN_PROGRESS state changed
2018-05-18 14:44:48Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 56fe683d-5049-445b-934b-4d6038787304 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:44:48Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-18 14:44:49Z [overcloud.Controller.0.Controller]: DELETE_COMPLETE state changed
2018-05-18 14:44:50Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE state changed
2018-05-18 14:44:59Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-18 14:45:04Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 158cf5f9-5d90-484d-8c3e-3104e0e9cfb9 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:45:04Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-18 14:45:06Z [overcloud.Controller.0.Controller]: CREATE_IN_PROGRESS state changed
2018-05-18 14:45:06Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE state changed
2018-05-18 14:45:24Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-18 14:45:29Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 8f3273de-c2d8-43f3-b3db-d87a28a080c4 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:45:29Z [overcloud.Compute.1.NovaCompute]: DELETE_IN_PROGRESS state changed
2018-05-18 14:45:31Z [overcloud.Compute.1.NovaCompute]: DELETE_COMPLETE state changed
2018-05-18 14:46:04Z [overcloud.Compute.1.NovaCompute]: CREATE_IN_PROGRESS state changed
2018-05-18 14:46:14Z [overcloud.Compute.1.NovaCompute]: CREATE_FAILED ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 977fdece-cad0-47eb-a47f-5018c18e3fe8 aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:14Z [overcloud.Compute.1]: CREATE_FAILED Resource CREATE failed: ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:14Z [overcloud.Compute.1]: CREATE_FAILED ResourceInError: resources[1].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:14Z [overcloud.Compute]: UPDATE_FAILED Resource CREATE failed: ResourceInError: resources[1].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:15Z [overcloud.Compute]: CREATE_FAILED resources.Compute: Resource CREATE failed: ResourceInError: resources[1].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., Code: 500"
2018-05-18 14:46:15Z [overcloud]: CREATE_FAILED Resource CREATE failed: resources.Compute: Resource CREATE failed: ResourceInError: resources[1].resources.NovaCompute: Went to status ERROR due to "Message: Build of instance cc92987b-7f6b-4b0a-9968-76d44a4e7f5f aborted: Failure prepping block device., C

Stack overcloud CREATE_FAILED

overcloud.Compute.1.NovaCompute:
  resource_type: OS::TripleO::ComputeServer
  physical_resource_id: 977fdece-cad0-47eb-a47f-5018c18e3fe8
  status: CREATE_FAILED
  status_reason: |
    ResourceInError: resources.NovaCompute: Went to status ERROR due to "Message: Build of instance 977fdece-cad0-47eb-a47f-5018c18e3fe8 aborted: Failure prepping block device., Code: 500"

Heat Stack create failed.
Heat Stack create failed.

We're going to need to examine a system where this has occurred with the fix, as the error being exposed is not the original error; we need to validate configuration/settings and determine how it reached that end state.

Julia, this has just happened on my baremetal@home. Please let me know which info you'd need. Alternatively I can also give you access to the machine. In case the error is not the original error reported in comment #0, I can open a new rhbz.

openstack flavor list
openstack flavor show
nova hypervisor-list
openstack server list

For each instance in nova:
openstack server show <instance_uuid>

For each baremetal node:
openstack baremetal node show <node_uuid>
openstack baremetal node vif list <node_uuid>

And these three folders on the undercloud should do it: /var/log/nova /var/log/httpd /var/log/ironic

Minor quick correction: I need the openstack flavor show output for each flavor in use.

Created attachment 1438865 [details]
rhbz1560690-undercloud.log
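The per-instance and per-node collection requested above can be scripted. This is a sketch under assumptions: the function name and UUIDs are hypothetical placeholders (on a real undercloud they would come from `openstack server list -f value -c ID` and `openstack baremetal node list -f value -c UUID`), and the script only echoes each command instead of executing it, so the list can be reviewed first.

```shell
# Dry-run sketch of the debug-data collection: emit one 'server show' per
# instance and a 'node show' plus 'node vif list' per baremetal node.
print_collection_cmds() {
  # Hypothetical UUIDs standing in for the real undercloud inventory.
  instances="11111111-1111-1111-1111-111111111111"
  nodes="22222222-2222-2222-2222-222222222222"

  for i in $instances; do
    echo "openstack server show $i"
  done

  for n in $nodes; do
    echo "openstack baremetal node show $n"
    echo "openstack baremetal node vif list $n"
  done
}

print_collection_cmds
```

Together with /var/log/nova, /var/log/httpd, and /var/log/ironic from the undercloud, this covers the data requested in the comment above.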
Created attachment 1438866 [details]
sosreport (nova, ironic, instack, heat, cinder, neutron)

In addition to the command outputs attached in comment #49, I'm also attaching a sosreport from the undercloud.

Carlos - it looks like you are getting IPMI power-on failures. In ironic-conductor.log:

2018-05-18 10:44:08.580 24804 WARNING ironic.drivers.modules.ipmitool [req-545ca4e9-e609-4507-a169-bffb5a794b3a 868f66f3bd734efaa6332bcb6b2a6fab 23b75b4d1a164271969560c81d6883ad - default default] IPMI power action power on failed for node c3d408ac-d913-440c-b5a3-b70ae2b847c2 with error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 172.16.0.1 -L ADMINISTRATOR -p 6231 -U admin -R 12 -N 5 -f /tmp/tmpYJ0xip power on
Exit code: 1
Stdout: u''
Stderr: u'Set Chassis Power Control to Up/On failed: Command not supported in present state\n'.: ProcessExecutionError: Unexpected error while running command.
2018-05-18 10:44:08.595 24804 ERROR ironic.conductor.manager [req-545ca4e9-e609-4507-a169-bffb5a794b3a 868f66f3bd734efaa6332bcb6b2a6fab 23b75b4d1a164271969560c81d6883ad - default default] Error in deploy of node c3d408ac-d913-440c-b5a3-b70ae2b847c2: IPMI call failed: power on.: IPMIFailure: IPMI call failed: power on.

Are the IPMI credentials configured correctly for this node, or is there an issue with the BMC on this node? Not being able to power on the node can result in this "Failure prepping block device" error.

Unable to reproduce this issue with the latest puddle from 05/18. To test it I ran a durability test over the weekend; the test deploys and deletes the overcloud stack 10 times.
Environment:
-------------
openstack-heat-engine-10.0.1-0.20180411125640.el7ost.noarch
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch
python-heat-agent-1.5.4-0.20180308153305.ecf43c7.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-22.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
python2-heatclient-1.14.0-1.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180411125640.el7ost.noarch
puppet-heat-12.4.1-0.20180416203421.90e3fb0.el7ost.noarch
openstack-heat-api-10.0.1-0.20180411125640.el7ost.noarch

** Please re-open if the issue reproduces using this puddle.

*** Bug 1571929 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086

*** Bug 1608702 has been marked as a duplicate of this bug. ***

I am seeing this issue on OSP 13 when deploying with the networker role, and it translated into a stack that could not be deleted (a completely different issue). I added this to the networker flavor, and everything seems to be working.
openstack flavor set --property --property "capabilities:boot_option"="local" --property "capabilities:profile"="networker" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" networker

Correction:

openstack flavor set --property "capabilities:boot_option"="local" --property "capabilities:profile"="networker" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" networker

I can reproduce this easily when doing this:

1) Deploy an undercloud
2) Deploy an overcloud
3) Delete the overcloud
4) Deploy an overcloud

The following errors are found in nova-compute.log:

2019-05-03 16:10:52.011 5283 ERROR ironicclient.common.http [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] Error contacting Ironic server: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 2 of 2: Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Instance failed to spawn: Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed.
(HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Traceback (most recent call last):
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2252, in _build_resources
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     yield resources
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2032, in _build_and_run_instance
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     block_device_info=block_device_info)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1044, in spawn
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     block_device_info=block_device_info)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 394, in _add_instance_info_to_node
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     retry_on_conflict=False)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 158, in call
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     return self._multi_getattr(client, method)(*args, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 336, in update
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     method=http_method)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 200, in _update
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     resp, body = self.api.json_request(method, url, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 645, in json_request
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     resp = self._http_request(url, method, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 247, in wrapper
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     return func(self, url, method, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 627, in _http_request
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     error_json.get('debuginfo'), method, url)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]
2019-05-03 16:10:52.013 5283 INFO nova.compute.manager [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Terminating instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [req-c8308fc5-f76b-4063-90d7-d652fbb0f6a8 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.: BuildAbortException: Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] Traceback (most recent call last):
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1841, in _do_build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     filter_properties, request_spec)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2063, in _build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     bdms=block_device_mapping)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     self.force_reraise()
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     six.reraise(self.type_, self.value, self.tb)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2015, in _build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     block_device_mapping) as resources:
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     return self.gen.next()
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2235, in _build_resources
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     reason=msg)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] BuildAbortException: Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.

Hi Dave - can we get an sosreport or logs with your duplication method in https://bugzilla.redhat.com/show_bug.cgi?id=1560690#c60? We haven't seen that in our tests. Also I'd like to create a separate bug for this - the "Failure prepping block device" is a generic error message and the original problem has been fixed. Thanks.

Setting this back to closed and creating a new bug.
Note the "prepping block device" is a generic error message and not indicative that the same problem is occurring again.