Bug 1647005 - Ansible Networking - cannot delete BM guests on Overcloud
Summary: Ansible Networking - cannot delete BM guests on Overcloud
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 16.0 (Train on RHEL 8.1)
Assignee: Jakub Libosvar
QA Contact: Arkady Shtempler
URL:
Whiteboard:
Duplicates: 1678868
Depends On:
Blocks:
Reported: 2018-11-06 13:50 UTC by Arkady Shtempler
Modified: 2022-09-06 15:55 UTC
CC List: 8 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.1-0.20191202212740.a4800ba.el8ost
Doc Type: Known Issue
Doc Text:
The nova-compute ironic driver tries to update a BM node while the node is being cleaned. Cleaning takes approximately five minutes, but nova-compute retries the update for only approximately two minutes. After the timeout, nova-compute gives up and puts the nova instance into the ERROR state. As a workaround, set the following configuration option for the nova-compute service:
[ironic]
api_max_retries = 180
As a result, nova-compute keeps retrying the BM node update for longer and eventually succeeds.
Clone Of:
Environment:
Last Closed: 2020-02-06 14:39:53 UTC
Target Upstream Version:
Embargoed:


Attachments
Nova ERROR (24.59 KB, text/plain)
2018-11-06 13:50 UTC, Arkady Shtempler


Links
System                  ID              Private  Priority  Status  Summary                                                    Last Updated
Launchpad               1816728         0        None      None    None                                                       2019-02-20 09:57:51 UTC
OpenStack gerrit        638119          0        'None'    MERGED  ml2-ansible: Set api_max_retries when net-ansible is used  2020-06-29 20:55:21 UTC
OpenStack gerrit        696300          0        'None'    MERGED  ml2-ansible: Set api_max_retries when net-ansible is used  2020-06-29 20:55:21 UTC
Red Hat Issue Tracker   OSP-18566       0        None      None    None                                                       2022-09-06 15:55:55 UTC
Red Hat Product Errata  RHEA-2020:0283  0        None      None    None                                                       2020-02-06 14:40:38 UTC

Description Arkady Shtempler 2018-11-06 13:50:38 UTC
Created attachment 1502461
Nova ERROR

I'm trying to delete a BM guest on the Overcloud using:
openstack server delete <BM-ID>

Note: The above command doesn't print any error or warning message.

After a few minutes, the Status field for this specific BM in the output of:
openstack server list
changes to ERROR.

There were no ERRORs in the Neutron server.log (3 controllers); the only ERROR I saw was in nova-compute.log (see attachment).

Comment 1 Jakub Libosvar 2018-11-15 09:06:08 UTC
I troubleshot the issue and, from the logs, it seems cleaning just takes too long and the nova-compute service gives up too early.

The conductor logs show that deletion started at 15:51:24.789 and the node went to the "available" state at 15:55:32.908.
The nova ironic driver started to update "something" at 15:52:47.540, while the node was still in a transitional state between cleaning and available, and it stopped trying at 15:54:51.352, before the node moved to available.
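
As a rough sanity check of the timeline above, the intervals can be computed from the four timestamps quoted in this comment. This is only a sketch: the helper and variable names are illustrative, and it assumes all four timestamps fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # time-of-day format used in the quoted log timestamps


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from start to end (same day assumed)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps taken from the comment above.
delete_started = "15:51:24.789"   # conductor logs the start of deletion
node_available = "15:55:32.908"   # node reaches the "available" state
retry_started = "15:52:47.540"    # nova ironic driver starts updating the node
retry_stopped = "15:54:51.352"    # nova ironic driver gives up

cleaning = seconds_between(delete_started, node_available)
retry_window = seconds_between(retry_started, retry_stopped)
shortfall = seconds_between(retry_stopped, node_available)

print(f"cleaning took      {cleaning:.3f} s")
print(f"driver retried for {retry_window:.3f} s")
print(f"gave up too early  {shortfall:.3f} s")
```

By these numbers, cleaning took a little over four minutes, the driver retried for about two minutes, and it gave up roughly 42 seconds before the node became available.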

As a workaround, we can bump api_max_retries on the ironic client side.
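
A back-of-the-envelope sketch of why a larger api_max_retries should help. It assumes the driver waits a fixed interval between attempts, and that the defaults are 60 retries at a 2-second interval (believed to be the nova defaults for api_max_retries and api_retry_interval in the [ironic] section, but please verify against your release). The function name is hypothetical, and the time spent in each request is ignored.

```python
def retry_budget_seconds(max_retries: int, retry_interval: float = 2.0) -> float:
    """Approximate how long the client keeps retrying before giving up.

    Assumption: a fixed sleep of retry_interval seconds between attempts,
    with the duration of each request itself ignored.
    """
    return max_retries * retry_interval


# From the timestamps in this comment: the driver started retrying at
# 15:52:47.540 and the node became available at 15:55:32.908.
NEEDED_SECONDS = 165.368

for retries in (60, 180):  # assumed default vs. the proposed workaround value
    budget = retry_budget_seconds(retries)
    verdict = "long enough" if budget >= NEEDED_SECONDS else "too short"
    print(f"api_max_retries={retries}: ~{budget:.0f} s budget, {verdict}")
```

Under these assumptions, 60 retries give about two minutes, which is consistent with the observed retry window, while 180 retries give about six minutes, which covers the cleaning time seen here with some margin.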

Given that we have a reasonable workaround for this issue, I'm lowering severity and removing blocker flag.

Comment 2 Bob Fournier 2018-12-04 15:49:14 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1563303, specifically the comment here about changing api_max_retries https://bugzilla.redhat.com/show_bug.cgi?id=1563303#c15

Comment 4 Jakub Libosvar 2019-02-20 08:07:33 UTC
We got another bug, 1678868, which could be a dup of this one, but the environment is different. I'm raising the priority and severity of this bug and assigning it to myself.

Comment 5 Jakub Libosvar 2019-02-21 08:59:56 UTC
*** Bug 1678868 has been marked as a duplicate of this bug. ***

Comment 6 Bob Fournier 2019-05-29 20:05:37 UTC
Jakub - it looks like the referenced patch [0] has merged, but it's not clear whether the patch in the Launchpad bug [1] is also needed. Can this bug move to POST, or is more work needed?

[0] https://review.opendev.org/#/c/636571/
[1] https://review.opendev.org/#/c/638119/

Comment 7 Jakub Libosvar 2019-05-30 06:11:15 UTC
(In reply to Bob Fournier from comment #6)
> Jakub - it looks like the referenced patch [0] has merged, but it's not clear
> whether the patch in the Launchpad bug [1] is also needed. Can this bug move
> to POST, or is more work needed?
> 
> [0] https://review.opendev.org/#/c/636571/
> [1] https://review.opendev.org/#/c/638119/

Whoops, the wrong patch is linked from this BZ. The correct one is in the LP: https://review.opendev.org/#/c/638119/ is what we need. Thanks for pointing that out.

Comment 8 Dan Radez 2019-08-27 15:24:43 UTC
I'd say let's move this to POST.
I think the issue is addressed; if it resurfaces, we'll address it again.

Comment 9 Dan Radez 2019-08-27 17:32:09 UTC
Sorry, I prematurely set this to ON_DEV; the patches haven't merged yet.

Comment 10 Bob Fournier 2019-11-26 18:17:54 UTC
Jakub - should this be backported to stable/train so it will be in OSP-16?

Comment 11 Jakub Libosvar 2019-11-27 13:08:35 UTC
(In reply to Bob Fournier from comment #10)
> Jakub - should this be backported to stable/train so it will be in OSP-16?

I requested a backport here: https://review.opendev.org/#/c/696300/

I'm not sure, though, whether it breaks the backporting policy, as it introduces a new config option. We'll see. If it gets accepted, I'll reschedule the bug to be included in OSP16.

Comment 17 errata-xmlrpc 2020-02-06 14:39:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283

