Bug 1712561

Summary: Switching a node from manage to provide kicks off automated_clean but will not PXE boot if node is in maintenance state
Product: Red Hat OpenStack Reporter: Andreas Karis <akaris>
Component: openstack-tripleo-heat-templates Assignee: Dmitry Tantsur <dtantsur>
Status: CLOSED ERRATA QA Contact: mlammon
Severity: low Docs Contact:
Priority: low    
Version: 16.0 (Train) CC: bfournie, dtantsur, harsh.kotak, jkreger, jschluet, mburns, pmannidi
Target Milestone: beta Keywords: Reopened, Triaged, ZStream
Target Release: 16.0 (Train on RHEL 8.1)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.1-0.20191126041653.414d4d9.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-06 14:40:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Andreas Karis 2019-05-21 19:18:05 UTC
Description of problem:
Switching a node from manage to provide kicks off automated_clean but will not PXE boot if node is in maintenance state

Version-Release number of selected component (if applicable):
most recent OSP 13

How reproducible:
~~~
ironic node-set-maintenance compute1 true
ironic node-set-provision-state compute1 manage
ironic node-set-provision-state compute1 provide
~~~

~~~
(undercloud) [stack@director ~]$ sudo grep clean /etc/ironic -R | grep -v ':#'
/etc/ironic/ironic.conf:automated_clean=true
(...)
~~~

Steps to Reproduce:
1.
2.
3.

Actual results:
compute1 will go to clean, will be booted by ironic but fail on iPXE boot

Expected results:
compute1 should refuse to go to clean, should not boot, the user should be presented with an error message of some sort
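
Until the service itself refuses the transition, the expected behavior can be approximated on the client side. The following is only a sketch: the helper names `in_maintenance` and `safe_provide` are made up for illustration, and it assumes the `openstack baremetal` CLI is available on the undercloud instead of the older `ironic` client used in the reproducer.

```shell
#!/usr/bin/env bash
# Sketch: refuse to move a node to "provide" while it is in maintenance.
# Helper names are hypothetical; they are not part of any ironic tooling.

# Returns 0 when the value printed by the CLI means "in maintenance".
in_maintenance() {
    case "$1" in
        True|true) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads the maintenance flag first and only then triggers "provide",
# which is the step that starts automated cleaning.
safe_provide() {
    local node="$1" flag
    flag="$(openstack baremetal node show "$node" -f value -c maintenance)" || return 2
    if in_maintenance "$flag"; then
        echo "ERROR: $node is in maintenance; not starting automated cleaning" >&2
        return 1
    fi
    openstack baremetal node provide "$node"
}

# Usage (after sourcing this file): safe_provide compute1
```

This only guards the operator's own command. It does not change what the conductor does if another client requests the same transition.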

Additional info:

Comment 1 Dmitry Tantsur 2019-05-22 08:27:10 UTC
It was discussed as part of https://storyboard.openstack.org/#!/story/1563644 and the community wanted to keep the current behavior. We can try having this conversation again, but of course I cannot guarantee different results.

Comment 2 Bob Fournier 2019-06-03 12:56:08 UTC
Per Comment 1, this is as expected and the accepted upstream behavior.  Closing.

Comment 3 Andreas Karis 2019-06-03 13:12:34 UTC
Hi,

Can we keep this open and re-discuss with upstream? The current situation is *very* misleading for administrators. At least a warning message would be the minimum.

The logical behavior here would be that automated clean does *not* kick in when the node is in maintenance state and an error message is shown instead. With the current behavior, the node even PXE boots and after some amount of time goes into clean_failed, with no obvious reason for the administrator.

- Andreas

Comment 4 Bob Fournier 2019-06-03 13:23:51 UTC
OK, let's keep this open and we'll discuss this with upstream/investigate a warning message at minimum.

Comment 5 Andreas Karis 2019-06-03 13:26:56 UTC
I created a KCS. Upon inspection of the bug report and change reviews, this doesn't look trivial to fix. If there's nothing else we can do, then I'm fine with the knowledge base solution only. However, I perceive this as a (minor) issue, so if we can fix it by adding a warning message or otherwise making it easier for admins to understand what's going on, that would be appreciated!

Comment 6 Bob Fournier 2019-07-26 16:29:11 UTC
Andreas - I think our best bet is the KCS article as you indicated in Comment 5.  The state machine is designed to function like this and adding a warning message isn't possible since it would require querying the node before the action was taken.

Comment 7 Dmitry Tantsur 2019-07-29 09:08:41 UTC
As a last resort I'm going to propose a patch with an option to not start cleaning in maintenance. The default behavior will not change (as desired upstream), but we'll be able to change it for TripleO. If this approach is rejected, I'll have no option other than to close the bug.
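
The intended semantics of such an option can be sketched as a small decision function. This is an illustration only, not the proposed patch: `cleaning_decision` and its second argument are hypothetical names, and the real option name is not given in this report.

```shell
#!/usr/bin/env bash
# Sketch of the decision described in this comment.
# $1: "true" if the node is in maintenance, anything else otherwise.
# $2: hypothetical "allow cleaning in maintenance" setting; it defaults to
#     "true" so the upstream default behavior stays unchanged.
# Prints "clean" when cleaning may start and "refuse" otherwise.
cleaning_decision() {
    local maintenance="$1" allow="${2:-true}"
    if [ "$maintenance" = "true" ] && [ "$allow" != "true" ]; then
        echo "refuse"
    else
        echo "clean"
    fi
}
```

With the default, `cleaning_decision true` still prints `clean`, which matches the behavior reported in this bug. A deployment that sets the option to false gets `refuse` for nodes in maintenance only.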

Comment 10 Bob Fournier 2019-09-21 16:51:29 UTC
Fix has merged to master.

Comment 11 Dmitry Tantsur 2019-09-23 14:17:01 UTC
TripleO patch proposed

Comment 12 Bob Fournier 2019-09-26 13:26:29 UTC
As there are multiple fixes here, including an additional configuration parameter (and the bz severity is low), marking this for OSP-16.  Prior to 16 we will have to rely on the KCS article that Andreas created.

Comment 16 errata-xmlrpc 2020-02-06 14:40:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283