Bug 1244056

Summary:	ironic allowed available host to stay powered on, blocking deployment of active hosts
Product:	Red Hat OpenStack	Reporter:	Dan Sneddon <dsneddon>
Component:	rhosp-director	Assignee:	Lucas Alvares Gomes <lmartins>
Status:	CLOSED DUPLICATE	QA Contact:	Shai Revivo <srevivo>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.0 (Kilo)	CC:	dsneddon, dtantsur, hbrock, lmartins, mburns, rhel-osp-director-maint
Target Milestone:	---
Target Release:	11.0 (Ocata)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-11-25 13:50:15 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Dan Sneddon 2015-07-17 01:36:39 UTC

Description of problem:
I deployed an overcloud, which failed in postdeploy. Upon further inspection, some of the VLANs hadn't come up. When I tried to bring the VLANs up manually, I got a duplicate IP error. It turned out that one of the nodes used in a previous deployment was still powered on, even though it was in the available state.

Version-Release number of selected component (if applicable):
Poodle from 2015-06-16, late in the day
openstack-ironic-api.noarch       2015.1.0-9.el7ost       @rhelosp-7.0-poodle   
openstack-ironic-common.noarch    2015.1.0-9.el7ost       @rhelosp-7.0-poodle   
openstack-ironic-conductor.noarch 2015.1.0-9.el7ost       @rhelosp-7.0-poodle   
openstack-ironic-discoverd.noarch 1.1.0-5.el7ost          @rhelosp-7.0-director-poodle
python-ironic-discoverd.noarch    1.1.0-5.el7ost          @rhelosp-7.0-director-poodle
python-ironicclient.noarch        0.5.1-9.el7ost          @rhelosp-7.0-poodle   


How reproducible:
Not sure

Steps to Reproduce:
1. Install undercloud
2. Discover nodes
3. Deploy, then delete stack
4. Stack delete fails; delete the stack again (one of the nodes somehow doesn't get powered down)
5. Redeploy.

Actual results:
If the node from the previous deployment is not powered down but is not chosen for the next deployment, it can interfere with the deployment.

Expected results:
If a node is available, but not active, it should be powered down, right? Otherwise, how can we protect our deployment from stale nodes with old IP addresses?

Additional info:
Here is a before-and-after. I noticed that one node was available but powered on, so I set it powered off.

[stack@host01 ~]$ ironic node-list
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provision State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| 843766dd-e4f3-45a8-aeb6-1b7d3d8972d7 | None | None                                 | power off   | available       | False       |
| 182b83d2-bcd4-41bd-9b32-071f25296494 | None | eca8f403-15e5-4fb2-bf19-788dcadca392 | power on    | active          | False       |
| d0d937c4-bee9-465f-8424-a9df7c2e4d96 | None | 9dac558b-434f-4118-8230-eb1586911ba8 | power on    | active          | False       |
| 3b955ad3-3ec2-411e-b627-40655d9353bd | None | None                                 | power on    | available       | False       |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
[stack@host01 ~]$ ironic node-set-power-state 3b955ad3-3ec2-411e-b627-40655d9353bd off
[stack@host01 ~]$ ironic node-list
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provision State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| 843766dd-e4f3-45a8-aeb6-1b7d3d8972d7 | None | None                                 | power off   | available       | False       |
| 182b83d2-bcd4-41bd-9b32-071f25296494 | None | eca8f403-15e5-4fb2-bf19-788dcadca392 | power on    | active          | False       |
| d0d937c4-bee9-465f-8424-a9df7c2e4d96 | None | 9dac558b-434f-4118-8230-eb1586911ba8 | power on    | active          | False       |
| 3b955ad3-3ec2-411e-b627-40655d9353bd | None | None                                 | power off   | available       | False       |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
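A minimal sketch of how the check done by eye above could be automated: it parses the table printed by `ironic node-list` and reports nodes that are available but still powered on. The column names are taken from the output above; the helper names `parse_node_list` and `stale_powered_on` are hypothetical and not part of any Ironic tooling.

```python
# Hypothetical helpers (not part of ironic or python-ironicclient).
SAMPLE = """\
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provision State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| 843766dd-e4f3-45a8-aeb6-1b7d3d8972d7 | None | None                                 | power off   | available       | False       |
| 182b83d2-bcd4-41bd-9b32-071f25296494 | None | eca8f403-15e5-4fb2-bf19-788dcadca392 | power on    | active          | False       |
| d0d937c4-bee9-465f-8424-a9df7c2e4d96 | None | 9dac558b-434f-4118-8230-eb1586911ba8 | power on    | active          | False       |
| 3b955ad3-3ec2-411e-b627-40655d9353bd | None | None                                 | power on    | available       | False       |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
"""


def parse_node_list(table_text):
    """Turn the ASCII table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders and anything that is not a row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first row holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def stale_powered_on(nodes):
    """UUIDs of nodes that are available (unassigned) but still powered on."""
    return [
        node["UUID"]
        for node in nodes
        if node["Provision State"] == "available"
        and node["Power State"] == "power on"
    ]


if __name__ == "__main__":
    for uuid in stale_powered_on(parse_node_list(SAMPLE)):
        # Print the same command that was run by hand in this report.
        print("ironic node-set-power-state %s off" % uuid)
```

Run against the first listing above, this prints the power-off command for node 3b955ad3-3ec2-411e-b627-40655d9353bd only.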

Comment 3 Dmitry Tantsur 2015-07-17 07:24:58 UTC

Hi! Right, ironic usually powers nodes off. Failure to do so may indicate a problem with the driver and/or hardware. Was it on a VM or bare metal? What was the hardware vendor and which driver did you use? Could you please provide the related ironic-conductor logs?

Comment 7 Dan Sneddon 2015-08-21 18:46:25 UTC

(In reply to Dmitry Tantsur from comment #3)
> Hi! Right, ironic usually powers nodes off. Failure to do so may indicate
> some problems with driver and/or hardware. Was it on VM or bare metal? What
> was the hardware vendor and which driver did you use? Could you please
> provide related ironic-conductor logs?

This was on bare metal. The hosts were Dell PowerEdge R320 running DRAC 7 firmware 1.57.57 (Build 04).

This seems to happen incredibly rarely. I think it might only be happening on Dell, but I don't have confirmation.

Comment 8 Lucas Alvares Gomes 2015-08-28 15:33:46 UTC

Hi @Dan,

Yeah, the node should be powered off once it's torn down. The ironic-conductor tells the driver to tear down [1], and the driver then powers off the node [2][3].

What's odd about this error is that the node is still powered on and there's no error registered for it, right? As you can see in [1], if any exception is raised while the node is being torn down (it is caught by the "except Exception" clause), we mark the node as ERROR and fill in the node's last_error field as well.

So I'm guessing that all the calls we made to the BMC during the tear down of that node succeeded, but the BMC just didn't power the node off (even though it returned success)?
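To illustrate the guess above, here is a heavily simplified sketch of the tear-down flow. It is not the Ironic conductor code: `FakeBMC`, `Node`, and `tear_down` are hypothetical stand-ins, and the state names are plain strings. It only shows why a BMC that acknowledges a power-off request without acting on it would leave the node available with an empty last_error.

```python
class FakeBMC:
    """Hypothetical BMC: accepts power-off requests but may not act on them."""

    def __init__(self, honors_power_off=True, raises=False):
        self.honors_power_off = honors_power_off
        self.raises = raises
        self.powered_on = True

    def power_off(self):
        if self.raises:
            raise RuntimeError("BMC unreachable")
        if self.honors_power_off:
            self.powered_on = False
        # Returns normally either way, so the caller sees "success".


class Node:
    """Hypothetical stand-in for an Ironic node record."""

    def __init__(self, bmc):
        self.bmc = bmc
        self.provision_state = "deleting"
        self.last_error = None


def tear_down(node):
    """Simplified tear down: power off, record an error only on exception."""
    try:
        node.bmc.power_off()
    except Exception as exc:
        node.provision_state = "error"
        node.last_error = "Failed to tear down. Error: %s" % exc
    else:
        node.provision_state = "available"


if __name__ == "__main__":
    node = Node(FakeBMC(honors_power_off=False))
    tear_down(node)
    print(node.provision_state, node.last_error, node.bmc.powered_on)
```

In this sketch only a raised exception produces an ERROR state; a silently ignored request is indistinguishable from a successful one unless the power state is read back afterwards.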

...

Now one thing about the "Actual results":

> Actual results:
> If the node from the previous deployment is not powered down but is not
> chosen for the next deployment, it can interfere with the deployment.

The power state of the node shouldn't matter to either Ironic or the Nova scheduler when picking a node for deployment. Nova [4] treats a node as usable when its provision state is AVAILABLE or NOSTATE (for backward compatibility), its power state is not ERROR or NOSTATE, and the node is not in maintenance.
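The Nova-side check described above can be sketched roughly as follows. This is a paraphrase written for illustration, not a copy of the driver code in [4]; the string values for the states and the use of None for NOSTATE are assumptions, and the function name is hypothetical.

```python
# Assumed state values for illustration; None stands in for NOSTATE.
BAD_POWER_STATES = ("error", None)            # ERROR or NOSTATE
GOOD_PROVISION_STATES = ("available", None)   # AVAILABLE or NOSTATE


def node_resources_unavailable(node):
    """True if the node should not be offered to the scheduler."""
    return bool(
        node["maintenance"]
        or node["power_state"] in BAD_POWER_STATES
        or node["provision_state"] not in GOOD_PROVISION_STATES
    )


if __name__ == "__main__":
    stale = {
        "maintenance": False,
        "power_state": "power on",        # left on from a previous deployment
        "provision_state": "available",
    }
    # "power on" is not a bad power state, so the node is still schedulable.
    print(node_resources_unavailable(stale))
```

Under these assumptions a powered-on node in the available state passes the check, which is consistent with the statement that the power state does not affect node selection.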

For Ironic we don't care about the power state when deploying a node because we are going to reboot it anyway.

Can you do a simple test please?

Power the nodes on before doing a deployment and see whether Nova or Ironic complains about it, i.e.:

$ ironic node-set-power-state <node uuid or name> on

And then deploy.


[1] https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L786-L798

[2] https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/iscsi_deploy.py#L694-L705

[3] https://github.com/openstack/ironic/blob/2a07b0cbf529b8fdce0178d96022bc2eceb2cac5/ironic/conductor/utils.py#L110-L129

[4] https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L209-L213

Comment 9 Dan Sneddon 2015-08-28 17:12:54 UTC

(In reply to Lucas Alvares Gomes from comment #8)

The problem happens when one node is powered on and it has a config from a previous deployment. In this case, the rogue node will be using IP addresses that will conflict with the newly deployed nodes in the next deployment.

I think what we need to do in this case is make Ironic & Nova care about the power state of available nodes. Nodes that are not active should not be allowed to stay on. Otherwise, we don't know if they are interfering with active nodes.

Comment 10 Lucas Alvares Gomes 2015-08-31 16:51:52 UTC

(In reply to Dan Sneddon from comment #9)
> (In reply to Lucas Alvares Gomes from comment #8)
> 
> The problem happens when one node is powered on and it has a config from a
> previous deployment. In this case, the rogue node will be using IP addresses
> that will conflict with the newly deployed nodes in the next deployment.
> 
> I think what we need to do in this case is make Ironic & Nova care about the
> power state of available nodes. Nodes that are not active should not be
> allowed to stay on. Otherwise, we don't know if they are interfering with
> active nodes.

That's a good point. Ironic has a configuration option to sync the power state of the node, so after the deletion the node will be set to power off and Ironic will make sure the node stays in that power state [1] (see the force_power_state_during_sync configuration option). But we currently disable that because it conflicts with the HA configuration (Pacemaker) for node fencing.
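A rough sketch of the decision such a periodic power sync makes, assuming the behaviour described above. It is not the Ironic implementation: `sync_power_state` and its `set_power` callback are hypothetical, and only the option name force_power_state_during_sync comes from the sample configuration in [1].

```python
def sync_power_state(recorded, actual, force_power_state_during_sync, set_power):
    """Return the power state to keep on record after one sync pass.

    recorded  -- power state stored for the node (e.g. "power off")
    actual    -- power state reported by the hardware
    set_power -- hypothetical callback that changes the hardware state
    """
    if actual == recorded:
        return recorded  # nothing to reconcile
    if force_power_state_during_sync:
        # Push the hardware back to what is on record.
        set_power(recorded)
        return recorded
    # Option disabled: accept whatever the hardware reports.
    return actual


if __name__ == "__main__":
    calls = []
    # Forced sync: a node recorded as off but found on is powered off again.
    print(sync_power_state("power off", "power on", True, calls.append), calls)
    # Option disabled (as in the deployment discussed here): the node stays on.
    print(sync_power_state("power off", "power on", False, calls.append), calls)
```

The second case also shows why the option interacts with fencing: with forcing enabled, a power change made out of band by Pacemaker would be reverted on the next sync pass.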

So the only thing I can think of that will fix this is using CLEANING in Ironic. With cleaning, Ironic boots the node after deletion (and also on the manageable -> available transition) and erases its disks to remove all data from the previous tenant. But to do that we would need to use IPA as the deploy/clean ramdisk, which doesn't seem to be targeted for this release of ospd.

[1] https://github.com/openstack/ironic/blob/master/etc/ironic/ironic.conf.sample#L497-L501

Comment 12 Mike Burns 2016-04-07 20:43:53 UTC

This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 15 Dmitry Tantsur 2016-10-14 15:15:24 UTC

Hi!

We should really restart the conversation with the TripleO folks about re-enabling cleaning. I think this is the only real solution for such problems. Lucas, WDYT?

Comment 16 Dmitry Tantsur 2016-11-25 13:50:15 UTC


*** This bug has been marked as a duplicate of bug 1398657 ***