Created attachment 1166021
ironic, keystone, neutron logs
Description of problem:
Automated node cleaning fails with a tear-down error.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Enable automated_clean in ironic.conf
2. In a virtual environment, or if your SSD does not support secure erase, disable the erase_devices step by setting its priority to 0 (erase_devices_priority=0).
3. Restart ironic-conductor.
4. Move the node from the manageable state to available using the provide action.
Actual results:
The node state changes to cleaning.
The node is rebooted and then shut down.
The node state changes to clean failed and maintenance is set to True.
Output of ironic node-show xxxx-xxx:
"maintenance_reason | Failed to tear down from cleaning for node xxxx-xxxx"
Expected results:
Node cleaning finishes and the node state changes to available.
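For anyone reproducing this, the settings from steps 1 and 2 could be applied with a small script such as the sketch below. It assumes that automated_clean lives in the [conductor] section and erase_devices_priority in the [deploy] section of ironic.conf. The helper name enable_cleaning_without_erase is hypothetical, and configparser drops comments when it rewrites a file, so treat this as an illustration rather than a way to edit a production config.

```python
import configparser
import io


def enable_cleaning_without_erase(conf_text: str) -> str:
    """Return conf_text with automated cleaning on and erase_devices disabled.

    Hypothetical helper: the section names below are assumptions about
    where these options live in ironic.conf.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)

    # Make sure both sections exist before setting options in them.
    for section in ("conductor", "deploy"):
        if not parser.has_section(section):
            parser.add_section(section)

    # Step 1: enable automated cleaning.
    parser.set("conductor", "automated_clean", "true")
    # Step 2: a priority of 0 disables the erase_devices clean step.
    parser.set("deploy", "erase_devices_priority", "0")

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(enable_cleaning_without_erase("[conductor]\nautomated_clean = false\n"))
```

ironic-conductor still has to be restarted afterwards (step 3) for the change to take effect.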
A bug has also been opened in upstream Launchpad.
@rbartal I'm having trouble duplicating this problem - how did you set up your environment, and which node are you running the commands on?
@rbartal alternatively, can you provide me with an environment on which I can duplicate this bug?
The same problem happens in both virtual and bare-metal environments.
You can connect to my seal system to test it.
The failure happens when Ironic tries to list the Neutron ports to be torn down during cleaning.
- Ironic-conductor doesn't pass an auth token to python-neutronclient
- python-neutronclient tries to fetch one itself
- the auth_uri setting in ironic.conf is in the wrong format for python-neutronclient (it needs /auth added to the end, but this breaks other parts of ironic)
- if I put in a hack to add /auth, the authentication request fails anyway with the error "Expecting to find identity in auth - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error."
I'll continue to look into this tomorrow.
I was able to make cleaning succeed by adding the line
task.context.auth_token = keystone.get_admin_auth_token()
to ironic.dhcp.neutron.NeutronDHCPApi.delete_cleaning_ports, but that's clearly not a proper solution. I'll ask upstream where the auth token should come from.
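The idea behind that workaround can be sketched in isolation as follows. This is not the actual Ironic code: get_admin_auth_token here is a hypothetical stand-in for keystone.get_admin_auth_token(), the task object is faked with SimpleNamespace, and unlike the one-line hack above this variant only fills in a token when the context has none.

```python
from types import SimpleNamespace


def get_admin_auth_token():
    """Hypothetical stand-in for ironic's keystone.get_admin_auth_token()."""
    return "admin-token-placeholder"


def ensure_auth_token(task, fetch_token=get_admin_auth_token):
    """Give the task context an auth token if it does not carry one.

    Assumption: a missing token is what forces python-neutronclient to
    try (and fail) to authenticate on its own during tear-down.
    """
    if not getattr(task.context, "auth_token", None):
        task.context.auth_token = fetch_token()
    return task.context.auth_token


# A cleaning task started by the conductor has no user token in its context.
task = SimpleNamespace(context=SimpleNamespace(auth_token=None))
ensure_auth_token(task)

# A context that already carries a token is left untouched.
user_task = SimpleNamespace(context=SimpleNamespace(auth_token="user-token"))
ensure_auth_token(user_task)
```

In the real workaround the assignment sits at the start of delete_cleaning_ports, so the token is in place before the Neutron ports are listed.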
There's a patch currently in review to completely rework the way we use Keystone, which should fix this: https://review.openstack.org/#/c/236982/
I believe we've fixed this in the Newton (OSP10) release. Cleaning did work for me in overcloud a few weeks ago. Could you please retest it?
This cleaning scenario was tested on RHOS 10 (Newton) and passed.
The erase_devices step was disabled for the test because the disk on my machine does not support secure erase.
The Ironic RPMs in this RHOS 10 puddle are:
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.