Description of problem:
In a compute node scale-out scenario, we are not able to bring up the nodes during deployment (stack update). We can manually power on the nodes using 'ironic node-set-power-state <uuid> on', but for the same node the ipmitool command is not able to bring the node up. We are able to successfully introspect the nodes.

Version-Release number of selected component (if applicable):
RHOS 8

How reproducible:
Always

Actual results:
Nodes do not come up during deployment.

Expected results:
Nodes should come up during deployment.
What is your hardware model? It looks like Dell, right? Could you check (via the iDRAC web interface) that IPMI access is properly enabled and that the user you are using is allowed to use it? Have you tried the pxe_drac driver?
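As a quick sanity check, you can query the chassis power state directly with ipmitool and act on the result. This is only a sketch: the BMC address and credentials are placeholders you would substitute from your iDRAC configuration, so the actual call is shown commented out and its output is simulated.

```shell
# Hypothetical BMC address and credentials -- substitute your iDRAC values:
# status=$(ipmitool -I lanplus -H 192.0.2.10 -U root -P <password> chassis power status)
# For illustration, simulate the output ipmitool prints:
status="Chassis Power is off"
case "$status" in
  *"is on")  echo "node is powered on" ;;
  *"is off") echo "node is powered off" ;;
esac
```

If this reports the node as off while Ironic can power it on, that points at a credentials or interface (lan vs. lanplus) mismatch rather than a BMC fault.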
Yes, it's Dell. IPMI access is properly enabled, and there is only one user, the administrator, who has full access to it. I have not tried the pxe_drac driver.
Hi,

I took a quick look at the logs, and so far the only error I can find in the recent (29/01/17) ironic-conductor logs is:

2017-01-29 03:55:30.435 48368 WARNING ironic.drivers.modules.agent_base_vendor [-] Failed to soft power off node 99bbebec-9afd-451a-a4a6-ac9491a8ce31 in at least 30 seconds. Error: RetryError[Attempts: 7, Value: power on]

This indicates that the IPA ramdisk could not power itself off, and Ironic fell back to a hard power off after all the attempts had failed.

Later in the logs you can also see this:

2017-01-29 16:20:13.499 48368 DEBUG ironic.common.states [req-29646631-9d47-45ac-b516-5b3e1d821808 ] Exiting old state 'active' in response to event 'delete' on_exit /usr/lib/python2.7/site-packages/ironic/common/states.py:199

This means that Ironic actually deployed the image but later received a request to delete the instance, which I assume is Heat rolling back after <something> failed in the stack update process.

I also could not find any logs indicating that power controlling the nodes with ipmitool actually failed. Can you please test "ironic node-set-power-state <node UUID/name> {on, off}" and let us know if it works? We invoke the same method in the driver's API when we need to power the node off/on as part of the deployment.

...

So, two theories:

1) Due to the failure of the soft power off, Ironic hard powered the node off, and that corrupted the data that had been written to the disk, so the deployed image never came up. It would be good to understand why the soft power off actually failed. One suggestion for debugging this is to enable Ironic to collect the logs from the IPA ramdisk; please take a look at: http://docs.openstack.org/developer/ironic/deploy/troubleshooting.html?highlight=retrieve#retrieving-logs-from-the-deploy-ramdisk

2) Some other component failed in the stack update process, and that led Heat to roll back and unprovision the nodes in Ironic.
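For reference, newer Ironic releases can collect the IPA ramdisk logs automatically; on RHOS 8 the linked troubleshooting guide is the authoritative procedure, but if your Ironic version supports it, the ironic.conf fragment would look roughly like the sketch below (option names assume a Newton-or-later Ironic; verify them against your release before use).

```ini
[agent]
# Collect ramdisk logs when a deployment fails (assumed option names;
# check your Ironic release's sample config before applying).
deploy_logs_collect = on_failure
deploy_logs_storage_backend = local
deploy_logs_local_path = /var/log/ironic/deploy_logs
```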
I'm still skimming the logs to see if I can find something about it.
Yes, I am able to bring up the node using 'ironic node-set-power-state <node UUID> on'. Also, please note that a failed scale-out operation was carried out earlier: the deployment itself was successful, but 'nova hypervisor-list' did not show the newly scaled nodes, and all the Ironic nodes were in maintenance mode. They were deleted from the undercloud using 'nova delete <instance_id>' and 'ironic node-delete <node_id>' and then introspected again to carry on with the deployment, after which this issue is seen. Now when I boot the node using Ironic, it boots from the local disk and loads the previously deployed OS.
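Since the node is booting the previously deployed OS from local disk, it may be worth checking which boot device Ironic has set for it. A sketch follows; the UUID is taken from the conductor log above purely as an example, the CLI verbs assume the Liberty-era python-ironicclient, and the real calls are shown commented out since they need a live Ironic API.

```shell
# Run on the undercloud (illustrative UUID):
# ironic node-get-boot-device 99bbebec-9afd-451a-a4a6-ac9491a8ce31
# ironic node-set-boot-device 99bbebec-9afd-451a-a4a6-ac9491a8ce31 pxe
# Simulate the check on the reported boot device:
boot_device="disk"
if [ "$boot_device" != "pxe" ]; then
  echo "boot device is $boot_device; node will load the old local OS"
fi
```

If the boot device is stuck on disk, setting it back to pxe before redeploying would avoid booting the stale image.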
> Yes. I am able to bring up the node using 'ironic node-set-power-state <node UUID> on'. So, does this mean that this bug self-healed, and now you're facing https://bugzilla.redhat.com/show_bug.cgi?id=1418566 instead? If you can power nodes on/off, it's no longer an ipmitool problem. If you do experience problems with power management, please try the pxe_drac driver instead of the ipmitool-based one.
We were able to bring up the node using ipmitool yesterday, and we think this is more a problem with Heat than with ipmitool / Ironic.
Ok, I'm closing this bug in favor of https://bugzilla.redhat.com/show_bug.cgi?id=1418566, let's continue there.