Bug 1417914 - ipmitool not able to bring up the node
Summary: ipmitool not able to bring up the node
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ipmitool
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: async
Target Release: ---
Assignee: Lon Hohberger
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-31 12:20 UTC by Chaitanya
Modified: 2020-04-15 15:11 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-02 10:21:12 UTC
Target Upstream Version:


Description Chaitanya 2017-01-31 12:20:54 UTC
Description of problem:
In a compute node scale-out scenario, the nodes are not powered on during deployment (stack update). We can power on a node manually with 'ironic node-set-power-state <uuid> on', but the ipmitool command fails to power on that same node.

We are able to successfully introspect the nodes.

Version-Release number of selected component (if applicable):
RHOS 8

How reproducible:
Always


Actual results:
Nodes not coming up during deployment.

Expected results:
Nodes should come up during deployment.

Comment 2 Dmitry Tantsur 2017-01-31 13:25:26 UTC
What is your hardware model? Looks like Dell, right? Could you check (via the iDRAC web interface) that IPMI access is properly enabled and that the user you use is allowed to use it?

Have you tried the pxe_drac driver?

Comment 3 Chaitanya 2017-01-31 13:32:52 UTC
Yes, it's Dell. IPMI access is properly enabled, and there is only one user, the administrator, who has full access to it.

I have not tried the pxe_drac driver.

Comment 4 Lucas Alvares Gomes 2017-01-31 14:09:03 UTC
Hi,

So, I took a quick look at the logs, and so far the only error I can find in the recent (29/01/17) ironic conductor logs is:

2017-01-29 03:55:30.435 48368 WARNING ironic.drivers.modules.agent_base_vendor [-] Failed to soft power off node 99bbebec-9afd-451a-a4a6-ac9491a8ce31 in at least 30 seconds. Error: RetryError[Attempts: 7, Value: power on]

This indicates that the IPA ramdisk could not power itself off, so Ironic fell back to a hard power off after all the attempts had failed.
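The soft-then-hard power off behaviour described here can be sketched roughly as follows. This is only an illustration, not Ironic's actual code: the function name, the two callbacks, and the wait interval are hypothetical, and the default of 7 attempts is taken from the "Attempts: 7" in the warning above.

```python
import time
from typing import Callable


def soft_then_hard_power_off(
    get_power_state: Callable[[], str],   # hypothetical: returns "power on" or "power off"
    hard_power_off: Callable[[], None],   # hypothetical: forces the node off via the BMC
    attempts: int = 7,                    # matches "Attempts: 7" in the warning
    wait_seconds: float = 5.0,            # assumed interval, not taken from the logs
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Wait for the node to power itself off; force it off if it does not.

    Returns "soft" if the node went off by itself, "hard" if it had to be forced.
    """
    for _ in range(attempts):
        if get_power_state() == "power off":
            return "soft"
        sleep(wait_seconds)
    # The node still reports "power on" after every attempt: force it off.
    hard_power_off()
    return "hard"
```

A "hard" result corresponds to the situation in the warning: the ramdisk never powered the node off, so the power was cut from outside.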

Later in the logs you can also see this:

2017-01-29 16:20:13.499 48368 DEBUG ironic.common.states [req-29646631-9d47-45ac-b516-5b3e1d821808 ] Exiting old state 'active' in response to event 'delete' on_exit /usr/lib/python2.7/site-packages/ironic/common/states.py:199

This means that Ironic actually deployed the image but later received a request to delete the instance, which I assume is Heat rolling back after <something> failed in the stack update process.

I also couldn't find any relevant logs indicating that power control of the nodes with ipmitool actually failed. Can you please test "ironic node-set-power-state <node UUID/name> {on, off}" and let us know if it works? We invoke the same method in the driver's API when we need to power the node off or on as part of the deployment.
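To cross-check the BMC independently of Ironic, something like the sketch below could be used to run ipmitool directly. The host, user, and password values are placeholders, the helper names are hypothetical, and the use of the lanplus interface is an assumption about the setup; the exact arguments Ironic passes may differ.

```python
import subprocess
from typing import List


def ipmitool_power_cmd(host: str, user: str, password: str, action: str) -> List[str]:
    """Build an ipmitool command line for a power action on a remote BMC."""
    if action not in ("status", "on", "off"):
        raise ValueError("unsupported power action: %s" % action)
    # -I lanplus is an assumption (IPMI v2.0 over LAN); adjust for your BMC.
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, "power", action]


def run_power_cmd(host: str, user: str, password: str, action: str) -> str:
    """Run the command and return whatever ipmitool printed."""
    cmd = ipmitool_power_cmd(host, user, password, action)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    # Placeholder values; replace with the node's real BMC address and credentials.
    print(run_power_cmd("192.0.2.10", "root", "changeme", "status"))
```

If the direct ipmitool call and "ironic node-set-power-state" behave the same way for a node, the power interface itself is unlikely to be the problem.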

...

So, two theories:

1) Due to the failure of the soft power off, Ironic hard powered the node off, and that corrupted the data that was written to the disk, so the deployed image never came up. It would be good to understand why the soft power off attempt actually failed. One suggestion for debugging it is to enable Ironic to collect the logs from the IPA ramdisk; please take a look at: http://docs.openstack.org/developer/ironic/deploy/troubleshooting.html?highlight=retrieve#retrieving-logs-from-the-deploy-ramdisk

2) Some other component failed in the stack update process, and that led Heat to roll back and unprovision the nodes in Ironic. I'm still skimming the logs to see if I can find something about it.

Comment 5 Chaitanya 2017-01-31 15:36:10 UTC
Yes. I am able to bring up the node using 'ironic node-set-power-state <node UUID> on'.

Also, please note that an earlier scale-out operation failed: the deployment itself was successful, but 'nova hypervisor-list' did not show the newly scaled nodes. In addition, all the ironic nodes were in maintenance mode. They were deleted from the undercloud using 'nova delete <instance_id>' and 'ironic node-delete <node_id>' and then introspected again to carry on with the deployment, after which this issue is seen.

Now when I boot the node using ironic, it boots from the local disk and loads the previously deployed OS.

Comment 8 Dmitry Tantsur 2017-02-02 10:08:35 UTC
> Yes. I am able to bring up the node using 'ironic node-set-power-state <node UUID> on'. 

So, does it mean that this bug self-healed, and now you're facing https://bugzilla.redhat.com/show_bug.cgi?id=1418566 instead? If you can power on/off nodes, it's no longer an ipmitool problem.

If you do experience problems with power management, please try pxe_drac drivers instead of ipmitool ones.

Comment 9 Chaitanya 2017-02-02 10:17:00 UTC
We were able to bring up the node using ipmitool yesterday, and we think this is more of a problem with heat than with ipmitool / ironic.

Comment 10 Dmitry Tantsur 2017-02-02 10:21:12 UTC
Ok, I'm closing this bug in favor of https://bugzilla.redhat.com/show_bug.cgi?id=1418566; let's continue there.

