Bug 1244056
Summary: | ironic allowed available host to stay powered on, blocking deployment of active hosts | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Dan Sneddon <dsneddon> |
Component: | rhosp-director | Assignee: | Lucas Alvares Gomes <lmartins> |
Status: | CLOSED DUPLICATE | QA Contact: | Shai Revivo <srevivo> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 7.0 (Kilo) | CC: | dsneddon, dtantsur, hbrock, lmartins, mburns, rhel-osp-director-maint |
Target Milestone: | --- | | |
Target Release: | 11.0 (Ocata) | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-11-25 13:50:15 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Dan Sneddon
2015-07-17 01:36:39 UTC
Hi! Right, ironic usually powers nodes off. Failure to do so may indicate a problem with the driver and/or hardware. Was it on a VM or bare metal? What was the hardware vendor and which driver did you use? Could you please provide the related ironic-conductor logs?

(In reply to Dmitry Tantsur from comment #3)
> Hi! Right, ironic usually powers nodes off. Failure to do so may indicate
> a problem with the driver and/or hardware. Was it on a VM or bare metal? What
> was the hardware vendor and which driver did you use? Could you please
> provide the related ironic-conductor logs?

This was on bare metal. The hosts were Dell PowerEdge R320 running DRAC 7 firmware 1.57.57 (Build 04). This seems to happen incredibly rarely. I think it might only be happening on Dell, but I don't have confirmation.

Hi @Dan,

Yeah, the node should be powered off once it's torn down. The ironic-conductor tells the driver to tear down [1], and the driver then powers off the node [2][3].

What's odd about this error is that the node is still powered on and there's no error registered for it, right? As you can see in [1], if any exception is raised (except Exception) while the node is being torn down, we mark the node as ERROR and fill in the node's last_error field as well. So I'm guessing that all the calls we made to the BMC during the tear down of that node succeeded, but the BMC just didn't do it (even though it returned success)?

...

Now one thing about the "Actual results":

> Actual results:
> If the node from the previous deployment is not powered down but is not
> chosen for the next deployment, it can interfere with the deployment.

The power state of the node shouldn't matter for either Ironic or the Nova scheduler when picking a node for deployment. Nova [4] checks that the node's provision state is AVAILABLE or NOSTATE (for backward compatibility), that the power state is not ERROR or NOSTATE, and that the node is not in maintenance.
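The Nova-side check described above can be sketched roughly as follows. This is a minimal illustration, not the code at [4]: the `Node` dataclass, the `is_schedulable` name, and the string values used for the states are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed state values for illustration; the real constants live in Ironic/Nova.
AVAILABLE = "available"
ERROR = "error"
NOSTATE = None
POWER_ON = "power on"
POWER_OFF = "power off"


@dataclass
class Node:
    """Hypothetical stand-in for the node fields the check looks at."""
    provision_state: Optional[str]
    power_state: Optional[str]
    maintenance: bool = False


def is_schedulable(node: Node) -> bool:
    """Return True if the node passes the three conditions from the comment."""
    good_provision_states = (AVAILABLE, NOSTATE)
    bad_power_states = (ERROR, NOSTATE)
    return (
        not node.maintenance
        and node.provision_state in good_provision_states
        and node.power_state not in bad_power_states
    )
```

Under these assumptions, a node that is available and powered on passes the check exactly like one that is powered off, which matches the point that the scheduler does not distinguish between the two.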
For Ironic we don't care about the power state when deploying a node because we are going to reboot it anyway.

Can you do a simple test please? Power the nodes on before doing a deployment and see whether Nova or Ironic complains about it, i.e.:

$ ironic node-set-power-state <node uuid or name> on

And then deploy.

[1] https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L786-L798
[2] https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/iscsi_deploy.py#L694-L705
[3] https://github.com/openstack/ironic/blob/2a07b0cbf529b8fdce0178d96022bc2eceb2cac5/ironic/conductor/utils.py#L110-L129
[4] https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L209-L213

(In reply to Lucas Alvares Gomes from comment #8)

The problem happens when one node is powered on and it has a config from a previous deployment. In this case, the rogue node will be using IP addresses that will conflict with the newly deployed nodes in the next deployment.

I think what we need to do in this case is make Ironic & Nova care about the power state of available nodes. Nodes that are not active should not be allowed to stay on. Otherwise, we don't know if they are interfering with active nodes.

(In reply to Dan Sneddon from comment #9)
> (In reply to Lucas Alvares Gomes from comment #8)
>
> The problem happens when one node is powered on and it has a config from a
> previous deployment. In this case, the rogue node will be using IP addresses
> that will conflict with the newly deployed nodes in the next deployment.
>
> I think what we need to do in this case is make Ironic & Nova care about the
> power state of available nodes. Nodes that are not active should not be
> allowed to stay on. Otherwise, we don't know if they are interfering with
> active nodes.

That's a good point.
Ironic has a configuration option to sync the power state of the node, so after deletion the node will be set to power off and Ironic will make sure the node stays in that power state [1] (see the force_power_state_during_sync configuration option). But we currently disable that because it conflicts with the HA configuration (pacemaker) for node fencing.

So the only thing I can think of that will fix this is using CLEANING in Ironic. With cleaning, Ironic will boot up the node after deletion (and also on the manageable -> available transition) and erase the disks of that node to remove all the data from the previous tenant. But to do that we would need to use IPA as the deploy/clean ramdisk (which doesn't seem to be targeted for this release version of ospd).

[1] https://github.com/openstack/ironic/blob/master/etc/ironic/ironic.conf.sample#L497-L501

This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.

Hi! We should really restart the conversation with the TripleO folks about re-enabling cleaning. I think this is the only real solution for such problems. Lucas, WDYT?

*** This bug has been marked as a duplicate of bug 1398657 ***
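For readers following the force_power_state_during_sync discussion in the thread, the trade-off can be sketched as a small reconciliation function. This is only an illustration of the option's documented meaning (force the hardware to the recorded state, or update the record from the hardware); the function name `reconcile_power_state` and its return shape are hypothetical and are not Ironic's conductor code.

```python
from typing import Optional, Tuple


def reconcile_power_state(
    db_state: str,
    hw_state: str,
    force_power_state_during_sync: bool,
) -> Tuple[str, Optional[str]]:
    """Return (new_db_state, power_action) for one sync pass.

    power_action is the state to push to the hardware, or None if the
    hardware is left alone. Hypothetical helper for illustration only.
    """
    if db_state == hw_state:
        # Nothing to reconcile.
        return db_state, None
    if force_power_state_during_sync:
        # Keep the recorded state and force the hardware back to it.
        return db_state, db_state
    # Option disabled: accept what the hardware reports and update the record.
    return hw_state, None
```

With the option disabled, as described in the thread, a node recorded as "power off" but found powered on simply has its record updated, so nothing turns it back off.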