Bug 1433416

Summary: Ironic reports incorrect power state "power off" after a clean failed node is put into Maintenance
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: openstack-ironic
Assignee: Dmitry Tantsur <dtantsur>
Status: CLOSED WONTFIX
QA Contact: mlammon
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 10.0 (Newton)
CC: akrzos, dtantsur, jkilpatr, jtaleric, mburns, racedoro, rhel-osp-director-maint, smalleni, srevivo
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-18 15:31:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sai Sindhur Malleni 2017-03-17 15:35:22 UTC
Description of problem:
When Ironic automated cleaning is enabled and cleaning fails on a node, the node is put into the "clean failed" state and into maintenance. Ironic then loses track of the power state and reports "power off" even when the node is actually powered on. This causes problems that are very difficult to debug from an os-collect-config/os-net-config perspective, because the node that failed cleaning may still hold an IP address that the undercloud will reuse to provision new machines. The "clean failed" node keeps trying to fetch metadata from the undercloud, so the new node that receives the same IP address either gets no metadata or gets it only intermittently. This leads to failed deployments when machines are recycled from one workload to another.
If the Ironic power state were reported accurately, or at least as None (as discussed with Dmitry upstream), it would hint to the operator to power off the node via IPMI/console before attempting redeployment.
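
To make the mismatch visible, Ironic's reported power state can be compared against what the BMC itself says for any node that is in maintenance or in "clean failed". The following is only a rough sketch along those lines using the openstacksdk baremetal API and ipmitool; the "undercloud" cloud name and the ipmi_* driver_info keys are assumptions based on a typical IPMI-driven undercloud and will need adjusting per environment.

# Rough sketch: compare Ironic's reported power state with the BMC's answer
# for nodes in maintenance or "clean failed". Assumes an IPMI driver and a
# clouds.yaml entry named "undercloud" -- both assumptions, not givens.
import subprocess

import openstack

conn = openstack.connect(cloud='undercloud')

for node in conn.baremetal.nodes(details=True):
    if not (node.is_maintenance or node.provision_state == 'clean failed'):
        continue

    info = node.driver_info or {}
    try:
        out = subprocess.check_output(
            ['ipmitool', '-I', 'lanplus',
             '-H', info['ipmi_address'],
             '-U', info['ipmi_username'],
             '-P', info['ipmi_password'],
             'power', 'status'],
            universal_newlines=True)
        actual = out.strip()  # e.g. "Chassis Power is on"
    except (KeyError, subprocess.CalledProcessError) as exc:
        actual = 'unknown ({})'.format(exc)

    print('{}: ironic reports {!r}, BMC reports {!r}'.format(
        node.name or node.id, node.power_state, actual))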

Version-Release number of selected component (if applicable):
RHOSP 10, 2017-03-03.1 puddle

How reproducible:
100%, when reattempting deployment while some nodes from the previous deployment are in the "clean failed" state

Steps to Reproduce:
1. Enable Ironic automated cleaning
2. Deploy the overcloud with a certain node count
3. Delete the overcloud and confirm that at least one node goes into the "clean failed" state (a quick way to check node states is sketched below)
4. Reattempt deployment with a subset of the original nodes (the clean failed nodes are not used)
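
To confirm step 3, something along these lines can dump the relevant node fields after the overcloud delete (same openstacksdk setup and assumed "undercloud" cloud name as above):

# Rough sketch: dump provision/power/maintenance state after the overcloud
# delete so clean-failed nodes are easy to spot. "undercloud" is an assumed
# clouds.yaml entry name.
import openstack

conn = openstack.connect(cloud='undercloud')
for node in conn.baremetal.nodes(details=True):
    flag = '  <-- clean failed' if node.provision_state == 'clean failed' else ''
    print('{:<24} provision={:<14} power={:<10} maintenance={}{}'.format(
        node.name or node.id, node.provision_state,
        str(node.power_state), node.is_maintenance, flag))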
Actual results:
Deployments fail because the clean failed nodes are left powered on with the network configuration from the previous deployment

Expected results:
The power state of a clean failed node should be reported accurately, or set to None, to hint to the operator that this node needs attention before redeployment
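
Until the power state is reported accurately, the manual workaround is to recover each clean-failed node explicitly before reattempting deployment: take it out of maintenance, move it back to manageable, and force it off. A rough sketch with the openstacksdk baremetal proxy (method names and the "undercloud" cloud name are as recalled/assumed, so verify against the installed SDK):

# Rough sketch of a manual workaround: explicitly recover and power off every
# clean-failed node so a stale node cannot keep its old IP and keep polling
# the undercloud for metadata. Names are as recalled from the openstacksdk
# baremetal proxy; "undercloud" is an assumed clouds.yaml entry.
import openstack

conn = openstack.connect(cloud='undercloud')
for node in conn.baremetal.nodes(details=True):
    if node.provision_state != 'clean failed':
        continue
    if node.is_maintenance:
        conn.baremetal.unset_node_maintenance(node)
    # "manage" moves the node from "clean failed" back to "manageable".
    conn.baremetal.set_node_provision_state(node, 'manage')
    # Force the power off regardless of what Ironic currently reports.
    conn.baremetal.set_node_power_state(node, 'power off')
    print('recovered and powered off {}'.format(node.name or node.id))

The same steps should map roughly onto the CLI as "openstack baremetal node maintenance unset", "openstack baremetal node manage" and "openstack baremetal node power off" per node.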

Additional info:

Comment 2 Dmitry Tantsur 2017-04-26 09:07:02 UTC
I can propose a patch to set the power state to None when maintenance is set, though I don't know whether folks upstream will accept it. I'm also not sure whether it would be backportable to 10.
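
Not the actual patch, but purely to illustrate the behavior proposed above (the names below are simplified placeholders, not Ironic's real conductor code):

# Illustrative-only sketch: when a node enters maintenance, stop advertising a
# possibly-stale power state and report None instead, since Ironic no longer
# syncs power for that node. Placeholder classes, not the posted patch.
class Node(object):
    def __init__(self, power_state='power on'):
        self.power_state = power_state
        self.maintenance = False
        self.maintenance_reason = None

def set_maintenance(node, reason=None):
    node.maintenance = True
    node.maintenance_reason = reason
    # Proposed change: report the power state as unknown while in maintenance.
    node.power_state = None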

Comment 3 Dmitry Tantsur 2017-04-27 12:19:56 UTC
The patch is posted; let's see what people think.

Comment 4 Dmitry Tantsur 2017-07-18 15:31:36 UTC
Unfortunately, it does not seem like we have agreement upstream about changing the behavior. We can probably try again later. :(