Bug 1713790

Summary: Re-deployment of overcloud fails with "Failure prepping block device"
Product: Red Hat OpenStack
Reporter: Bob Fournier <bfournie>
Component: openstack-ironic
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: mlammon
Severity: medium
Docs Contact:
Priority: medium
Version: 13.0 (Queens)
CC: bfournie, chshort, dhill, hjensas, jkreger, lbragsta, mburns
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-12-11 01:29:51 UTC
Type: Bug

Description Bob Fournier 2019-05-24 20:29:04 UTC
This is a problem Dave Hill reported in https://bugzilla.redhat.com/show_bug.cgi?id=1560690:

=================================

I can reproduce this easily when doing this:

1) Deploy an undercloud
2) Deploy an overcloud
3) Delete the overcloud
4) Deploy an overcloud

The following errors are found in nova-compute.log:

2019-05-03 16:10:52.011 5283 ERROR ironicclient.common.http [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] Error contacting Ironic server: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409). Attempt 2 of 2: Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Instance failed to spawn: Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Traceback (most recent call last):
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2252, in _build_resources
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     yield resources
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2032, in _build_and_run_instance
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     block_device_info=block_device_info)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1044, in spawn
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     block_device_info=block_device_info)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 394, in _add_instance_info_to_node
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     retry_on_conflict=False)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 158, in call
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     return self._multi_getattr(client, method)(*args, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 336, in update
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     method=http_method)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 200, in _update
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     resp, body = self.api.json_request(method, url, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 645, in json_request
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     resp = self._http_request(url, method, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 247, in wrapper
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     return func(self, url, method, **kwargs)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 627, in _http_request
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089]     error_json.get('debuginfo'), method, url)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Conflict: Node 5ec50799-249a-48a9-a338-5cdaa22a0860 is locked by host localhost.localdomain, please retry after the current operation is completed. (HTTP 409)
2019-05-03 16:10:52.012 5283 ERROR nova.compute.manager [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] 
2019-05-03 16:10:52.013 5283 INFO nova.compute.manager [req-d1b9ce6f-6573-4ab1-88ce-fde37a9fdd0b 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: e071b30d-aefe-4cc8-a262-e6b9215c2089] Terminating instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [req-c8308fc5-f76b-4063-90d7-d652fbb0f6a8 80b23670e885452cb5a4bec554515d71 16614237b794447694ff003b709f68c6 - default default] [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.: BuildAbortException: Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] Traceback (most recent call last):
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1841, in _do_build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     filter_properties, request_spec)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2063, in _build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     bdms=block_device_mapping)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     self.force_reraise()
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     six.reraise(self.type_, self.value, self.tb)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2015, in _build_and_run_instance
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     block_device_mapping) as resources:
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     return self.gen.next()
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2235, in _build_resources
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533]     reason=msg)
2019-05-03 16:10:53.793 5283 ERROR nova.compute.manager [instance: 9fc91b79-2e98-4b02-8a38-9cbb61aef533] BuildAbortException: Build of instance 9fc91b79-2e98-4b02-8a38-9cbb61aef533 aborted: Failure prepping block device.

Comment 1 Julia Kreger 2019-05-24 20:52:22 UTC
When you delete, how long are you waiting? I'm wondering if cleanup/undeployment is actually completing. It seems like Nova thinks it can use the node.

Is there any indication of what is actually occurring inside the ironic conductor, or of the node's current state?

Comment 2 Julia Kreger 2019-07-31 19:27:14 UTC
I discussed this issue with an associate, and essentially:

1) The deploy failed without ever performing a deployment.
2) This resulted in an orphaned vif.
3) When Nova gave up on the deployment it made only _2_ attempts to remove the vif, and the removal never happened because the physical machine was locked due to a power sync.

So to move past this:

1) Set [ironic]api_max_retries=30 (or 60, which should be the default; for some reason the deployment was only retrying twice). A sketch of the config and commands follows this list.
2) Restart the openstack-nova-compute service.
3) Ensure that there are no orphaned vifs left on unused baremetal nodes showing in available state, using "openstack baremetal node vif list" and "openstack baremetal node vif detach" as necessary.
4) Retry the deployment.
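For reference, a minimal sketch of steps 1-3 on the undercloud. The nova.conf path is the usual default and is an assumption here, and <node> / <vif-uuid> are placeholders to replace with real values:

# In /etc/nova/nova.conf:
[ironic]
api_max_retries = 60

# Then restart the compute service and check each unused "available" node for leftover vifs:
sudo systemctl restart openstack-nova-compute
openstack baremetal node vif list <node>
openstack baremetal node vif detach <node> <vif-uuid>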

Comment 3 David Hill 2019-08-01 10:28:58 UTC
Could it be that this line is wrong:

virt/ironic/client_wrapper.py:        max_retries = CONF.ironic.api_max_retries if retry_on_conflict else 1

Or maybe that line is right and the code below is wrong, so that with retry_on_conflict=False it would only try once (or twice, if there is a logic bug)? (See the toy illustration after the snippets below.)

        try:
            # FIXME(lucasagomes): The "retry_on_conflict" parameter was added
            # to basically causes the deployment to fail faster in case the
            # node picked by the scheduler is already associated with another
            # instance due bug #1341420.
            self.ironicclient.call('node.update', node.uuid, patch,
                                   retry_on_conflict=False)
...

        for attempt in range(0, last_attempt + 1):
            try:
                self.ironicclient.call("node.vif_attach", node.uuid,
                                       port_id, retry_on_conflict=False)
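
For what it's worth, here is a minimal, self-contained toy (not the actual nova or ironicclient code; the names Conflict, call_with_retries and update_locked_node are made up for illustration) showing how the quoted max_retries line, combined with a retry wrapper that makes max_retries + 1 attempts, would produce exactly the "Attempt 2 of 2" seen in the nova-compute log above:

class Conflict(Exception):
    """Stand-in for the HTTP 409 Conflict returned when a node is locked."""

API_MAX_RETRIES = 60  # stand-in for CONF.ironic.api_max_retries

def call_with_retries(func, retry_on_conflict=True):
    # Mirrors the quoted line: disabling conflict retries collapses the budget to 1.
    max_retries = API_MAX_RETRIES if retry_on_conflict else 1
    num_attempts = max_retries + 1  # N retries means N + 1 total attempts
    for attempt in range(1, num_attempts + 1):
        try:
            return func()
        except Conflict:
            print("Attempt %d of %d" % (attempt, num_attempts))
            if attempt == num_attempts:
                raise  # give up; with retry_on_conflict=False this is attempt 2

def update_locked_node():
    raise Conflict("Node ... is locked by host localhost.localdomain")

try:
    call_with_retries(update_locked_node, retry_on_conflict=False)
except Conflict:
    pass  # two attempts were made, matching "Attempt 2 of 2" above

If the real retry wrapper has that same +1 shape, retry_on_conflict=False yields two attempts rather than one, which matches the log but is still far too short for the conductor to release the lock.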

Comment 4 Julia Kreger 2019-08-01 10:49:47 UTC
That does seem to be part of the conundrum.

The issue they encountered was in two places.

The first is the initial setting of instance information on the node, which failed because the node was locked, and a locked node answers with a conflict.
The second is detaching the vif: the client_wrapper gave up after only two attempts when it _should_ keep trying for a while, because if that record is not removed it becomes orphaned. The detach is one of the last interactions with ironic, if not the very last, and no state change ever occurred that would let ironic know it should go clean up the record. :(

Comment 5 Bob Fournier 2019-10-18 15:41:06 UTC
Dave - is there any chance of retrying this at the customer site with the suggested change to api_max_retries?

Set [ironic]api_max_retries=30 (or 60, which should be the default; for some reason the deployment was only retrying twice).

Comment 6 David Hill 2019-10-31 13:24:19 UTC
I'll test this in my lab as soon as I have the chance. I have some other issues to deal with on other bugs these days. I'll keep the needinfo on this BZ pointed at me for the time being.

Comment 7 Bob Fournier 2019-12-11 01:29:51 UTC
Dave - closing this for now, as we don't really have anything to go on except the recommended api_max_retries config change. Please reopen if we can reproduce this.