Bug 1344004

Summary: Node cleaning fails with 'Failed to tear down from cleaning for node UUID'
Product: Red Hat OpenStack
Reporter: Raviv Bar-Tal <rbartal>
Component: openstack-ironic
Assignee: Miles Gould <mgould>
Status: CLOSED ERRATA
QA Contact: Raviv Bar-Tal <rbartal>
Severity: unspecified
Priority: unspecified
Version: 9.0 (Mitaka)
CC: dnavale, dtantsur, jdonohue, jschluet, kbasil, mburns, mgould, mlammon, rbartal, rhel-osp-director-maint, srevivo
Target Milestone: beta
Keywords: Triaged, Upstream
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-ironic-6.1.1-0.20160907120305.0acdfca.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, 'ironic-conductor' did not correctly pass the authentication token to the 'python-neutronclient'. As a result, automatic node cleaning failed with a tear down error. With this update, OpenStack Baremetal Provisioning (ironic) was migrated to use the 'keystoneauth' sessions rather than directly constructing Identity service client objects. As a result, nodes can now be successfully torn down after cleaning.
Last Closed: 2016-12-14 15:36:33 UTC
Type: Bug
Bug Blocks: 1335596, 1409892    
Attachments:
ironic, keystone, neutron logs

Description Raviv Bar-Tal 2016-06-08 13:24:00 UTC
Created attachment 1166021 [details]
ironic, keystone, neutron logs

Description of problem:
Automatic node cleaning fails with a tear-down error.

Version-Release number of selected component (if applicable):


How reproducible:
always 

Steps to Reproduce:
1. Enable automated_clean in ironic.conf
2. In a virtual environment, or if your SSD does not support secure erase, disable the erase_devices step by setting its priority to 0 (erase_devices_priority=0); see the sketch after this list.
3. Restart ironic-conductor.
4. Move the node from the manageable state to available ('provide').
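
For steps 1, 2 and 4, a minimal sketch (the option group names follow the standard ironic configuration layout; treat them as an assumption for your exact release):

[conductor]
# Run the automated cleaning steps when nodes are unprovisioned or first provided
automated_clean = True

[deploy]
# Disable disk erasure entirely (priority 0 = step skipped), e.g. for virtual
# disks or SSDs without secure-erase support
erase_devices_priority = 0

and for step 4, with <node-uuid> as a placeholder:

ironic node-set-provision-state <node-uuid> provide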

Actual results:

Node state changes to 'cleaning'.
The node is rebooted, then shut down.
Node state changes to 'clean failed' and maintenance is set to True.
'ironic node-show xxxx-xxx' output includes:
"maintenance_reason | Failed to tear down from cleaning for node xxxx-xxxx"

Expected results:
Node cleaning finishes and the node state changes to 'available'.

Additional info:
Bug opened in the upstream Launchpad.

Comment 1 Miles Gould 2016-06-20 13:09:51 UTC
@rbartal I'm having trouble duplicating this problem - how did you set up your environment, and which node are you running the commands on?

Comment 2 Miles Gould 2016-06-21 13:35:26 UTC
@rbartal alternatively, can you provide me with an environment on which I can duplicate this bug?

Comment 3 Raviv Bar-Tal 2016-06-23 09:06:44 UTC
Hi Miles,
The same problem happens in both virtual and bare-metal environments;
you can connect to my seal system to test it.
Raviv

Comment 4 Miles Gould 2016-06-27 17:29:06 UTC
The failure happens when Ironic tries to list the Neutron ports to be torn down during cleaning.

 - Ironic-conductor doesn't pass an auth token to python-neutronclient (sketched after this list)
 - python-neutronclient tries to fetch one itself
 - the auth_uri setting in ironic.conf is in the wrong format for python-neutronclient (it needs /auth added to the end, but this breaks other parts of ironic)
 - if I put in a hack to add /auth, the authentication request fails anyway with the error "Expecting to find identity in auth - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error."
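
To make the failing call concrete, here is a hypothetical sketch of the situation described above (keyword arguments from the Mitaka-era python-neutronclient; the endpoint, URL and values are illustrative, not taken from the attached logs):

from neutronclient.v2_0 import client as neutron_client

# ironic-conductor constructed the client without a token, so the client
# attempted to authenticate on its own from the configured credentials:
neutron = neutron_client.Client(
    endpoint_url='http://192.0.2.10:9696',  # illustrative Neutron endpoint
    token=None,                             # no auth token passed in
    auth_url='http://192.0.2.10:5000')      # auth_uri format the client cannot use as-is
ports = neutron.list_ports()                # fails with the error quoted above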

I'll continue to look into this tomorrow.

Comment 5 Miles Gould 2016-06-28 13:15:35 UTC
I was able to make cleaning succeed by adding the line

task.context.auth_token = keystone.get_admin_auth_token()

to ironic.dhcp.neutron.NeutronDHCPApi.delete_cleaning_ports, but that's clearly not a proper solution. I'll ask upstream where the auth token should come from.
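
For context, a rough sketch of where that hack sat (module and class structure abbreviated; this is not the actual upstream code):

# In ironic/dhcp/neutron.py, inside NeutronDHCPApi:
def delete_cleaning_ports(self, task):
    from ironic.common import keystone
    # Hack: fetch an admin token explicitly, because the task context
    # carries no auth token during automated cleaning. This bypasses the
    # request's own credentials, which is why it is not a proper fix.
    task.context.auth_token = keystone.get_admin_auth_token()
    # ... original cleaning-port deletion logic continues here ...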

Comment 6 Miles Gould 2016-06-28 15:37:50 UTC
There's a patch currently in review to completely rework the way we use Keystone, which should fix this: https://review.openstack.org/#/c/236982/
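
For reference, the keystoneauth-session style that the patch moves toward looks roughly like this (a minimal sketch using public keystoneauth1 and python-neutronclient APIs; the URL and credentials are illustrative):

from keystoneauth1.identity import v3
from keystoneauth1 import session
from neutronclient.v2_0 import client as neutron_client

# Build an auth plugin and a session once...
auth = v3.Password(auth_url='http://192.0.2.10:5000/v3',
                   username='ironic', password='secret',
                   project_name='service',
                   user_domain_name='Default',
                   project_domain_name='Default')
sess = session.Session(auth=auth)

# ...then let each client obtain and refresh tokens from the session,
# instead of ironic constructing clients and passing tokens by hand.
neutron = neutron_client.Client(session=sess)
ports = neutron.list_ports()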

Comment 7 Dmitry Tantsur 2016-09-06 15:32:17 UTC
Hi!

I believe we've fixed this in the Newton (OSP 10) release. Cleaning did work for me in the overcloud a few weeks ago. Could you please retest it?

Comment 8 Raviv Bar-Tal 2016-09-15 14:59:52 UTC
This cleaning scenario was tested on RHOS 10 (Newton) and passed.
The erase_devices step was disabled for the test, as the disks on my machine do not support secure erase.

The ironic RPMs in this RHOS 10 puddle are:
openstack-ironic-api-6.1.1-0.20160907120305.0acdfca.el7ost.noarch
openstack-ironic-inspector-4.1.1-0.20160906074601.0276422.el7ost.noarch
openstack-ironic-conductor-6.1.1-0.20160907120305.0acdfca.el7ost.noarch
openstack-ironic-common-6.1.1-0.20160907120305.0acdfca.el7ost.noarch

Comment 13 errata-xmlrpc 2016-12-14 15:36:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html