Description  Sai Sindhur Malleni  2017-03-17 15:35:22 UTC
Description of problem:
When ironic automated cleaning is enabled and cleaning fails on a node, the node is put into the "clean failed" state and into maintenance. Ironic then loses track of the power state and reports "power off" even if the node is actually powered on. This causes problems that are very difficult to debug from an os-collect-config/os-net-config perspective, because the node that failed cleaning may still hold an IP address that the undercloud will reuse when provisioning new machines. The "clean failed" node keeps trying to fetch metadata from the undercloud, while the new node that is assigned the same IP address gets metadata only intermittently or not at all. This leads to failed deployments when machines are recycled from one workload to another.
If the ironic power state were reported correctly, or at least set to None as per the conversation with Dmitry upstream, it would hint the operator to power off the node via IPMI/console before attempting redeployment.
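As a workaround until this is fixed, something like the following sketch (using python-ironicclient) can be used to find clean failed nodes and force them off before reattempting deployment. The auth values are placeholders and this has not been validated against this exact puddle:

from ironicclient import client

# Placeholder undercloud credentials; replace with real values.
ironic = client.get_client(
    1,
    os_username='admin',
    os_password='REPLACE_ME',
    os_project_name='admin',
    os_auth_url='http://undercloud.example.com:5000/v3',
)

for node in ironic.node.list(detail=True):
    if node.provision_state != 'clean failed':
        continue
    # Ironic may report "power off" here even though the node is really on.
    print(node.uuid, node.provision_state, node.power_state)
    # Ask the conductor to power the node off so its stale IP/network config
    # cannot interfere with the next deployment.
    ironic.node.set_power_state(node.uuid, 'off')

The same thing can of course be done with ipmitool against the BMC directly, as suggested above.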
Version-Release number of selected component (if applicable):
RHOP 10 2017-03-03.1 puddle
How reproducible:
100%, when attempting redeployment while some nodes from the previous deployment are in the clean failed state
Steps to Reproduce:
1. Enable ironic cleaning
2. Deploy the overcloud with a certain node count
3. Delete the overcloud and check whether one of the nodes goes into the clean failed state (a verification sketch follows this list)
4. Reattempt deployment with a subset of the original nodes (clean failed nodes are not used)
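For step 3, a quick way to see the mismatch is to compare what ironic reports with what the BMC actually says. This is only a sketch: it assumes the ipmitool-based drivers (so that ipmi_address/ipmi_username are present in driver_info), and the passwords are placeholders:

import subprocess

from ironicclient import client

ironic = client.get_client(
    1,
    os_username='admin',
    os_password='REPLACE_ME',
    os_project_name='admin',
    os_auth_url='http://undercloud.example.com:5000/v3',
)

for node in ironic.node.list(detail=True):
    if node.provision_state != 'clean failed':
        continue
    bmc = node.driver_info.get('ipmi_address')
    user = node.driver_info.get('ipmi_username', 'admin')
    # The BMC password is masked in the API response, so supply it here.
    actual = subprocess.check_output(
        ['ipmitool', '-I', 'lanplus', '-H', bmc, '-U', user,
         '-P', 'REPLACE_ME', 'power', 'status'],
        universal_newlines=True).strip()
    # Typical result: ironic says "power off" while ipmitool reports
    # "Chassis Power is on".
    print(node.uuid, 'ironic:', node.power_state, '| bmc:', actual)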
Actual results:
Deployments fail because the clean failed nodes are left powered on, still carrying the network configuration from the previous deployment.
Expected results:
The power state of a clean failed node should be reported accurately, or set to None, to hint the operator to look at this node.
Additional info:
I can propose a patch to set the power state to None when maintenance is set, but I am not sure whether it would be accepted upstream, or whether it would be backportable to 10.