Bug 1647005

Summary: Ansible Networking - cannot delete BM guests on Overcloud
Product: Red Hat OpenStack Reporter: Arkady Shtempler <ashtempl>
Component: openstack-tripleo-heat-templates Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED ERRATA QA Contact: Arkady Shtempler <ashtempl>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 14.0 (Rocky) CC: bfournie, cjanisze, dradez, jamsmith, jlibosva, mburns, michapma, shrjoshi
Target Milestone: beta Keywords: Triaged, ZStream
Target Release: 16.0 (Train on RHEL 8.1)   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.1-0.20191202212740.a4800ba.el8ost Doc Type: Known Issue
Doc Text:
The nova-compute ironic driver tries to update the BM node while the node is still being cleaned. Cleaning takes approximately five minutes, but nova-compute only retries the update for approximately two minutes. After the timeout, nova-compute gives up and puts the nova instance into the ERROR state. As a workaround, set the following configuration option for the nova-compute service: [ironic] api_max_retries = 180. As a result, nova-compute keeps trying to update the BM node for longer and eventually succeeds.
Last Closed: 2020-02-06 14:39:53 UTC Type: Bug
Attachments: Nova ERROR (no flags)

Description Arkady Shtempler 2018-11-06 13:50:38 UTC
Created attachment 1502461 [details]
Nova ERROR

I'm trying to delete a BM guest on the Overcloud using:
openstack server delete <BM-ID> 

Note: the above command doesn't print any error or warning message.

Then, after a few minutes, the Status field in the output of:
openstack server list
changes to ERROR for this specific BM instance.

There were no ERRORs in the Neutron server.log on any of the 3 controllers; the only ERROR I saw was in nova-compute.log (see attachment).
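
For reference, the reproduction in one place (a minimal sketch based on the commands above; <BM-ID> is whatever ID "openstack server list" reports for the bare metal instance):

    openstack server delete <BM-ID>   # returns with no error or warning
    # wait a few minutes
    openstack server list             # the instance now shows ERROR instead of disappearing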

Comment 1 Jakub Libosvar 2018-11-15 09:06:08 UTC
I troubleshot the issue, and from the logs it seems the cleaning just takes too much time and the nova-compute service gives up too early.

According to the conductor logs, deleting started at 15:51:24.789 and the node went to the "available" state at 15:55:32.908, i.e. roughly four minutes of cleaning.
Meanwhile, the nova ironic driver started to update "something" at 15:52:47.540 - so the node was still in transition between cleaning and available - and it stopped retrying at 15:54:51.352, about two minutes later, before the node moved to available.

As a workaround, we can bump api_max_retries on the ironic client side.
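
For reference, a minimal sketch of the workaround. The option name and the value 180 are the ones given in the Doc Text field above; where exactly nova-compute reads its configuration depends on the deployment (in TripleO it is normally applied through the overcloud configuration rather than edited by hand on the compute host), so treat this as an illustration:

    [ironic]
    # Keep retrying the ironic node update for longer than the default, so
    # nova-compute does not give up before cleaning (~5 minutes) finishes.
    api_max_retries = 180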

Given that we have a reasonable workaround for this issue, I'm lowering severity and removing blocker flag.

Comment 2 Bob Fournier 2018-12-04 15:49:14 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1563303, specifically the comment here about changing api_max_retries https://bugzilla.redhat.com/show_bug.cgi?id=1563303#c15

Comment 4 Jakub Libosvar 2019-02-20 08:07:33 UTC
We got another bug, 1678868, which could be a dup of this one, but the environment is different. I'm raising the priority and severity of this bug and assigning it to myself.

Comment 5 Jakub Libosvar 2019-02-21 08:59:56 UTC
*** Bug 1678868 has been marked as a duplicate of this bug. ***

Comment 6 Bob Fournier 2019-05-29 20:05:37 UTC
Jakub - it looks like the referenced patch [0] has merged, but it's not clear whether the patch in the Launchpad bug [1] is also needed. Can this bug move to POST, or is more work needed?

[0] https://review.opendev.org/#/c/636571/
[1] https://review.opendev.org/#/c/638119/

Comment 7 Jakub Libosvar 2019-05-30 06:11:15 UTC
(In reply to Bob Fournier from comment #6)
> Jakub - it looks like the referenced patch [0] has merged, but it's not
> clear whether the patch in the Launchpad bug [1] is also needed. Can this
> bug move to POST, or is more work needed?
> 
> [0] https://review.opendev.org/#/c/636571/
> [1] https://review.opendev.org/#/c/638119/

Whoops, the wrong patch is linked from this BZ. The correct one is in the LP: https://review.opendev.org/#/c/638119/ is what we need. Thanks for pointing that out.

Comment 8 Dan Radez 2019-08-27 15:24:43 UTC
I'd say let's move this to POST.
I think the issue is addressed; if it resurfaces, we'll address it again.

Comment 9 Dan Radez 2019-08-27 17:32:09 UTC
Sorry, I prematurely set this to ON_DEV; the patches haven't merged yet.

Comment 10 Bob Fournier 2019-11-26 18:17:54 UTC
Jakub - should this be backported to stable/train so it will be in OSP-16?

Comment 11 Jakub Libosvar 2019-11-27 13:08:35 UTC
(In reply to Bob Fournier from comment #10)
> Jakub - should this be backported to stable/train so it will be in OSP-16?

I requested a backport here: https://review.opendev.org/#/c/696300/

I'm not sure, though, whether it breaks the backporting policy, as it introduces a new config option. We'll see. If it gets accepted, I'll reschedule the bug to be included in OSP 16.

Comment 17 errata-xmlrpc 2020-02-06 14:39:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283