Bug 1647005

Summary: Ansible Networking - cannot delete BM guests on Overcloud
Product: Red Hat OpenStack Reporter: Arkady Shtempler <ashtempl>
Component: openstack-tripleo-heat-templates Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED ERRATA QA Contact: Arkady Shtempler <ashtempl>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 14.0 (Rocky) CC: bfournie, cjanisze, dradez, jamsmith, jlibosva, mburns, michapma, shrjoshi
Target Milestone: beta Keywords: Triaged, ZStream
Target Release: 16.0 (Train on RHEL 8.1)   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.1-0.20191202212740.a4800ba.el8ost Doc Type: Known Issue
Doc Text:
The nova-compute ironic driver tries to update the BM node while the node is still being cleaned. Cleaning takes approximately five minutes, but nova-compute only retries the update for approximately two minutes. After the timeout, nova-compute gives up and puts the nova instance into the ERROR state. As a workaround, set the following configuration option for the nova-compute service: [ironic] api_max_retries = 180. As a result, nova-compute keeps trying to update the BM node for longer and eventually succeeds.
Last Closed: 2020-02-06 14:39:53 UTC Type: Bug
Attachments: Nova ERROR (no flags)

Description Arkady Shtempler 2018-11-06 13:50:38 UTC
Created attachment 1502461 [details]
Nova ERROR

I'm trying to delete a BM guest on the Overcloud using:
openstack server delete <BM-ID> 

Note: the above command doesn't print any error or warning message.

Then, after a few minutes, the Status field in the output of:
openstack server list
changes to ERROR for this specific BM instance.

There were no ERRORs in the Neutron server.log on any of the 3 controllers; the only ERROR I saw was in nova-compute.log (see attachment).
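
For reference, the reproduction in one place (a minimal sketch based on the commands above; <BM-ID> is whatever ID "openstack server list" reports for the bare metal instance):

    openstack server delete <BM-ID>   # returns with no error or warning
    # wait a few minutes
    openstack server list             # the instance now shows ERROR instead of disappearing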

Comment 1 Jakub Libosvar 2018-11-15 09:06:08 UTC
I troubleshot the issue, and from the logs it seems the cleaning just takes too much time and the nova-compute service gives up too early.

According to the conductor logs, deleting started at 15:51:24.789 and the node went to the "available" state at 15:55:32.908, i.e. roughly four minutes of cleaning.
Meanwhile, the nova ironic driver started to update "something" at 15:52:47.540 - so the node was still in transition between cleaning and available - and it stopped retrying at 15:54:51.352, about two minutes later, before the node moved to available.

As a workaround, we can bump api_max_retries on the ironic client side.
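
For reference, a minimal sketch of the workaround. The option name and the value 180 are the ones given in the Doc Text field above; where exactly nova-compute reads its configuration depends on the deployment (in TripleO it is normally applied through the overcloud configuration rather than edited by hand on the compute host), so treat this as an illustration:

    [ironic]
    # Keep retrying the ironic node update for longer than the default, so
    # nova-compute does not give up before cleaning (~5 minutes) finishes.
    api_max_retries = 180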

Given that we have a reasonable workaround for this issue, I'm lowering severity and removing blocker flag.

Comment 2 Bob Fournier 2018-12-04 15:49:14 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1563303, specifically the comment here about changing api_max_retries https://bugzilla.redhat.com/show_bug.cgi?id=1563303#c15

Comment 4 Jakub Libosvar 2019-02-20 08:07:33 UTC
We got another bug, 1678868, which could be a dup of this one, but the environment is different. I'm raising the priority and severity of this bug and assigning it to myself.

Comment 5 Jakub Libosvar 2019-02-21 08:59:56 UTC
*** Bug 1678868 has been marked as a duplicate of this bug. ***

Comment 6 Bob Fournier 2019-05-29 20:05:37 UTC
Jakub - it looks like the referenced patch [0] has merged, but it's not clear whether the patch in the Launchpad bug [1] is also needed. Can this bug move to POST, or is more work needed?

[0] https://review.opendev.org/#/c/636571/
[1] https://review.opendev.org/#/c/638119/

Comment 7 Jakub Libosvar 2019-05-30 06:11:15 UTC
(In reply to Bob Fournier from comment #6)
> Jakub - it looks like the referenced patch [0] has merged, but it's not
> clear whether the patch in the Launchpad bug [1] is also needed. Can this
> bug move to POST, or is more work needed?
> 
> [0] https://review.opendev.org/#/c/636571/
> [1] https://review.opendev.org/#/c/638119/

Whoops, the wrong patch is linked from this BZ. The correct one is in the LP: https://review.opendev.org/#/c/638119/ is what we need. Thanks for pointing that out.

Comment 8 Dan Radez 2019-08-27 15:24:43 UTC
I'd say let's move this to POST.
I think the issue is addressed; if it resurfaces, we'll address it again.

Comment 9 Dan Radez 2019-08-27 17:32:09 UTC
Sorry, I prematurely set this to ON_DEV; the patches haven't merged yet.

Comment 10 Bob Fournier 2019-11-26 18:17:54 UTC
Jakub - should this be backported to stable/train so it will be in OSP-16?

Comment 11 Jakub Libosvar 2019-11-27 13:08:35 UTC
(In reply to Bob Fournier from comment #10)
> Jakub - should this be backported to stable/train so it will be in OSP-16?

I requested a backport here: https://review.opendev.org/#/c/696300/

I'm not sure, though, whether it breaks the backporting policy, as it introduces a new config option. We'll see. If it gets accepted, I'll reschedule the bug to be included in OSP 16.

Comment 17 errata-xmlrpc 2020-02-06 14:39:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283