Bug 1235255 - ironic doesn't sync the power status of newly registered nodes
Summary: ironic doesn't sync the power status of newly registered nodes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: z5
Target Release: 7.0 (Kilo)
Assignee: Lucas Alvares Gomes
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-06-24 12:19 UTC by Udi Kalifon
Modified: 2017-02-22 18:19 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-03 15:30:27 UTC
Target Upstream Version:
Embargoed:



Description Udi Kalifon 2015-06-24 12:19:17 UTC
Description of problem:
I reprovisioned my bare metal host and I'm installing the latest puddle (2015-06-22) from scratch on it. After running the import command for the bare metal nodes I saw that 3 nodes don't have a power status:

]$ openstack baremetal import --json instackenv.json
/usr/lib/python2.7/site-packages/novaclient/v1_1/__init__.py:30: UserWarning: Module novaclient.v1_1 is deprecated (taken as a basis for novaclient.v2). The preferable way to get client class or object you can find in novaclient.client module.
  warnings.warn("Module novaclient.v1_1 is deprecated (taken as a basis for "

]$ ironic node-list
+----------+------+---------------+-------------+-----------------+---------+
| UUID     | Name | Instance UUID | Power State | Provision State | Maint.. |
+----------+------+---------------+-------------+-----------------+---------+
| 85a6a... | None | None          | None        | available       | False   |
| 0f53a... | None | None          | None        | available       | False   |
| 1cec6... | None | None          | power off   | available       | False   |
| 28c30... | None | None          | power off   | available       | False   |
| ce232... | None | None          | power off   | available       | False   |
| 0a009... | None | None          | None        | available       | False   |
| 5ca8e... | None | None          | power off   | available       | False   |
+----------+------+---------------+-------------+-----------------+---------+

To try and work around the problem, I deleted all the nodes with "ironic node-delete" and imported the nodes again. It was only partially successful, and now I have 2 nodes instead of 3 that don't have a power status:

]$ ironic node-list
+----------+------+---------------+-------------+-----------------+---------+
| UUID     | Name | Instance UUID | Power State | Provision State | Maint.. |
+----------+------+---------------+-------------+-----------------+---------+
| 3d082... | None | None          | power off   | available       | False   |
| de580... | None | None          | None        | available       | False   |
| 960b4... | None | None          | power off   | available       | False   |
| 341e8... | None | None          | power off   | available       | False   |
| 105a1... | None | None          | power off   | available       | False   |
| 144f9... | None | None          | None        | available       | False   |
| 49b7d... | None | None          | power off   | available       | False   |
+----------+------+---------------+-------------+-----------------+---------+

I had no problem determining, by using the management consoles, that these nodes are really turned OFF. These are the same nodes and the same instackenv.json file that I use all the time and it hasn't changed.

The nodes have been sitting there for well over an hour and their power state still hasn't synced.

Trying manually from the command line I get:
~]$ ipmitool -I lanplus -H 10.35.160.16 -L ADMINISTRATOR -U admin -R 12 -N 5 -P *****
ipmitool: lanplus.c:2191: ipmi_lanplus_send_payload: Assertion `session->v2_data.session_state == LANPLUS_STATE_OPEN_SESSION_RECEIEVED' failed.
Aborted (core dumped)

~]$ echo $?
134
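[Editorial aside, not part of the original report: an exit status of 134 is the shell's encoding of a process killed by SIGABRT, i.e. 128 + 6, which matches the failed assert() in lanplus.c. The ironic-conductor log below reports the same crash as exit code -6, which is Python's convention for a child killed by signal 6. A minimal sketch of the encoding, without ipmitool:]

```shell
# Abort a subshell with SIGABRT (signal 6), the same signal a failed
# assert() raises; the parent shell reports the death as 128 + 6 = 134.
sh -c 'kill -s ABRT $$'
echo "exit status: $?"
```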


Version-Release number of selected component (if applicable):
ipmitool-1.8.13-8.el7_1.x86_64


How reproducible:
100%


Additional info:
From Lucas:
I can see the errors coming from ipmitool there:

Jun 24 13:29:24 puma01 ironic-conductor: Command: ipmitool -I lanplus -H 10.35.160.16 -L ADMINISTRATOR -U admin -R 12 -N 5 -f /tmp/tmp4261H8 power status
Jun 24 13:29:24 puma01 ironic-conductor: Exit code: -6                          
Jun 24 13:29:24 puma01 ironic-conductor: Stdout: u''                            
Jun 24 13:29:24 puma01 ironic-conductor: Stderr: u"ipmitool: lanplus.c:2177: ipmi_lanplus_send_payload: Assertion `session->v2_data.session_state == LANPLUS_STATE_PRESESSION' failed.\n".
Jun 24 13:29:24 puma01 ironic-conductor: 2015-06-24 13:29:24.171 31942 WARNING ironic.conductor.manager [-] During sync_power_state, could not get power state for node 2a4d7a55-f186-42fa-a84f-3a4ef24fc5cc. Error: IPMI call failed: power status..

Googling a bit, it seems this is a known ipmitool error:

https://bugzilla.redhat.com/show_bug.cgi?id=514257
https://www.redhat.com/archives/cluster-devel/2012-October/msg00079.html

Additional info:
I have ipmitool-1.8.13-8.el7_1.x86_64:
]$ ipmitool -V
ipmitool version 1.8.13

The servers are all HP ProLiant DL170e G6.

Comment 3 Lucas Alvares Gomes 2015-06-24 13:30:18 UTC
Hi,

Some additional information here. Trying the power command using IPMI version 1.5 instead of 2.0 (-I lan instead of -I lanplus) it works.

$ ipmitool -I lanplus -H X.X.X.X -L ADMINISTRATOR -U *** -R 12 -N 5 -P *** power status
ipmitool: lanplus.c:2177: ipmi_lanplus_send_payload: Assertion `session->v2_data.session_state == LANPLUS_STATE_PRESESSION' failed.
Aborted (core dumped)

$ ipmitool -I lan -H X.X.X.X -L ADMINISTRATOR -U *** -R 12 -N 5 -P *** power status
Chassis Power is on

...

Perhaps this fix should go upstream as well, to make Ironic more resilient to failures.
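[Editorial aside: the approach Ironic ended up taking (comment #4) is a configurable protocol version rather than automatic fallback, but the fallback idea can be sketched generically. The helper and the fake probe below are hypothetical, purely for illustration:]

```shell
# try_interfaces: run a probe command once per interface name and print
# the first interface for which the probe succeeds.
try_interfaces() {
    probe=$1; shift
    for iface in "$@"; do
        if "$probe" "$iface" >/dev/null 2>&1; then
            echo "$iface"
            return 0
        fi
    done
    return 1
}

# Hypothetical probe simulating this bug: IPMI 2.0 (lanplus) fails,
# IPMI 1.5 (lan) works. In real use the probe would wrap something like
# "ipmitool -I $1 -H $HOST -U $USER -P $PASS power status".
fake_probe() { [ "$1" = "lan" ]; }

try_interfaces fake_probe lanplus lan   # prints: lan
```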

Comment 4 Lucas Alvares Gomes 2015-06-25 10:38:52 UTC
I have put a patch upstream that allows users to tell the ipmitool driver to use IPMI 1.5 (-I lan) instead of IPMI 2.0 (-I lanplus). Let me know if that works for you:

The patch is: https://review.openstack.org/195157

You can set this on the failing nodes by doing:

$ ironic node-update <node uuid> add driver_info/ipmi_protocol_version="1.5"
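[Editorial aside: when several nodes are affected, the failing UUIDs can be pulled out of the `ironic node-list` table and updated in a loop. The helper below is a sketch, not part of the patch; it assumes the default table layout shown earlier, with Power State as the fourth data column:]

```shell
# nodes_without_power_state: read `ironic node-list` table output on
# stdin and print the UUID of every node whose Power State is "None".
nodes_without_power_state() {
    awk -F'|' '
        /\|/ && $2 !~ /UUID/ {
            gsub(/ /, "", $2); gsub(/ /, "", $5)
            if ($5 == "None") print $2
        }'
}

# Example on a table fragment like the one above
# (normally: ironic node-list | nodes_without_power_state):
printf '%s\n' \
  '| 85a6a | None | None | None      | available | False |' \
  '| 1cec6 | None | None | power off | available | False |' \
  | nodes_without_power_state          # prints: 85a6a

# The failing nodes could then be switched to IPMI 1.5 with:
#   for uuid in $(ironic node-list | nodes_without_power_state); do
#       ironic node-update "$uuid" add driver_info/ipmi_protocol_version="1.5"
#   done
```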

Comment 5 Ofer Blaut 2015-06-25 14:18:18 UTC
Workaround I used to recover a host with IPMI issues:

1. ironic node-show 170b0a88-20c1-4b4f-a94b-1740714b3bf8
2. ipmitool -I lan -H 10.35.160.78 -U XXXX -P XXXX mc reset cold
3. ipmitool -I lan -H 10.35.160.78 -U XXXX -P XXXX chassis reset
4. ironic node-set-maintenance 170b0a88-20c1-4b4f-a94b-1740714b3bf8 False
5. ironic node-set-power-state 170b0a88-20c1-4b4f-a94b-1740714b3bf8 off

Comment 6 Lucas Alvares Gomes 2015-07-06 15:49:02 UTC
So, with that workaround you were able to use the normal -I lanplus interface, right?

I wonder if I need to backport the https://review.openstack.org/195157 for this or not.

Comment 8 Udi Kalifon 2015-09-13 10:14:03 UTC
The workaround doesn't work for me. I set the node to use "lan" instead of "lanplus" and took it out of maintenance, and the power state was still "none". I gave it time (1 minute) and the node automatically went back to maintenance mode, so I found myself back at square one. The workaround from comment #5 also never worked for me. If there is an update to the ipmitool package that might help, please provide a link to the RPM and I will try it on my bare metal setup.

Comment 9 Lucas Alvares Gomes 2015-09-14 12:56:41 UTC
Hi @Udi,

So if ipmitool with can manage the node independent of the IPMI protocol version and all I think we have few things to do. I would suggest that we first investigate whether this is a problem with the tool itself (ipmitool in this case) by trying a new version of it, or by trying alternative tools such as FreeIPMI [1] or pyghmi [2] (Ironic already has a driver that works with pyghmi; it goes by the name "pxe_ipminative"), and see if it works for that machine. We can also try updating the firmware of that node to see if the problem goes away.

From the Ironic POV, with a node using the pxe_ipmitool driver, it's doing the right thing: since it cannot manage the power state of the node, it puts the node in maintenance so instances don't get scheduled onto that node until it's fixed.

What type of machines are those? Dell, HP, ...? I would suggest you try another driver such as pxe_drac (for Dell machines), pxe_ilo (for HP iLO machines), or pxe_ipminative (generic IPMI, but it doesn't rely on the ipmitool utility). There is not much we can do in Ironic if the tool used by the current driver doesn't work with that machine.

[1] http://www.gnu.org/software/freeipmi/
[2] https://github.com/stackforge/pyghmi/

Hope that helps,
Lucas

Comment 10 Lucas Alvares Gomes 2015-09-14 12:59:54 UTC
(In reply to Lucas Alvares Gomes from comment #9)
> Hi @Udi,
> 
> So if ipmitool with can manage the node independent of the IPMI protocol
> version and all I think we have few things to do. I would suggest that we

I mean, if ipmitool can not manage the node independent of the IPMI protocol version used, I think we have a few things to do.

Comment 11 Udi Kalifon 2015-09-16 08:41:34 UTC
I tested again with the new version of ipmitool from http://download.devel.redhat.com/brewroot/work/tasks/2648/9832648/ipmitool-1.8.15-4.el7.x86_64.rpm

The result: the node is still stuck in power state "none", and it took about 5 minutes to be moved to maintenance mode (with the old ipmitool it took about 1 minute, so the situation is now actually a bit worse).

From the command line, I tried the "chassis status" command using the lanplus protocol and it didn't work. I got this error on the console:

> Error: no response from RAKP 1 message
> Error: no response from RAKP 1 message
> Error: no response from RAKP 1 message
> Error: no response from RAKP 1 message
Set Session Privilege Level to ADMINISTRATOR failed
Error: Unable to establish IPMI v2 / RMCP+ session

I tried again from the command line with the "lan" protocol (version 1.5) and was successful, so I changed the protocol version to 1.5 in ironic. The node, however, was still stuck in power state "none" and went back to maintenance mode (this time in a very reasonable time of less than 1 minute).

So to summarize: the node (which is an HP Proliant DL170e G6 in my case) is manageable via ipmitool (both the old and the new package) using protocol version 1.5 only. There is also no problem controlling the node via its web management interface. The problem is with ironic only. I will try the pxe_ipminative driver as well and let you know the results.

Comment 12 Udi Kalifon 2015-09-16 11:18:02 UTC
Lucas, I tried to enable the pxe_ipminative driver by adding it to the "enabled_drivers" line in ironic.conf. When I restarted the ironic-conductor service, it complained that it couldn't find such a driver. I can see that there is a file called "ipminative.py" in /usr/lib/python2.7/site-packages/ironic/drivers/modules. What am I doing wrong?

Comment 13 Lucas Alvares Gomes 2015-09-17 13:20:16 UTC
Hi @Udi,

When you tried to use version 1.5 in Ironic, did you apply [1] locally? We haven't merged the backport for supporting different IPMI protocol versions in OSP yet.

...

Re pxe_ipminative: this driver depends on a library called pyghmi [2] (see comment #9). This library needs to be installed on the node running the ironic-conductor so Ironic can load the driver.

You can do a "yum install python-pyghmi"; the lib is packaged in Fedora [3] and should be fine on RHEL too. If not, you can get it from pip to test [4].

[1] https://code.engineering.redhat.com/gerrit/#/c/53823/
[2] https://github.com/stackforge/pyghmi/
[3] http://paste.openstack.org/show/466280/
[4] https://pypi.python.org/pypi/pyghmi

Comment 14 Udi Kalifon 2015-09-17 14:53:59 UTC
I applied the patch (I really didn't know until now that I have to apply it myself) and was able to set the protocol version to 1.5 and confirm that the workaround works. Thanks for clearing that point up :)

As for the pxe_ipminative: I was now able to add it to enabled_drivers, and the service doesn't crash or complain any more. I don't know why it didn't work yesterday. By the way, the pyghmi package is installed by default. I will play with this driver a bit now.

Comment 15 Lucas Alvares Gomes 2015-09-17 15:15:26 UTC
(In reply to Udi from comment #14)
> I applied the patch (I really didn't know until now that I have to apply it
> myself) and was able to set the protocol version to 1.5 and confirm that the
> workaround works. Thanks for clearing that point up :)
> 

Right, that's good! I just rebased the patch (it was having some merge conflicts). Waiting for the acks so we can merge it.

> As for the pxe_ipminative - I was now able to add it to enabled_drivers and
> the service doesn't crash or complain any more. Don't know why it didn't
> work yesterday. By the way the pyghmi package is installed by default. I
> will play with this driver a bit now.

Cool! Thanks for trying it out

Comment 16 Udi Kalifon 2015-10-20 09:34:38 UTC
Sometimes you hit this issue after something is already deployed on the node, and then you can't delete it from nova or from ironic (and can't delete the stack either). A useful workaround to disassociate an ironic node from its nova instance is:

ironic node-update c45dbe12-56ab-4548-a4b7-5135501db259 remove instance_uuid

(NOTE: don't replace "instance_uuid" with the actual uuid from nova, leave it literally as "instance_uuid")

Comment 17 Udi Kalifon 2015-10-22 10:52:45 UTC
There is a serious regression in director 7.1 regarding this bug. Now, when you apply the patch from comment #4 and take the failed node out of maintenance, it somehow puts ALL the nodes into maintenance mode. There is no reason why a failure in one node should propagate to the entire system, and it's really very difficult to get out of this situation when it happens.

Comment 18 Lucas Alvares Gomes 2015-11-30 14:34:52 UTC
(In reply to Udi from comment #17)
> There is a serious regression in director 7.1 regarding this bug. Now, when
> you apply the patch from comment #4 and take the failed node out of
> maintenance - it somehow puts ALL the nodes into maintenance mode. No reason
> why a failure in one node should propagate to the entire system, and it's
> really very difficult to get yourself out of this situation when it happens.

Hi Udi,

The patch hasn't been updated, so it's quite odd that we had this regression. Can you confirm whether it's caused by that patch or by something else in the environment you are running? Do you have other patches applied?

Comment 19 Udi Kalifon 2015-11-30 16:50:52 UTC
(In reply to Lucas Alvares Gomes from comment #18)
> Hi Udi,
> 
> The patch haven't been updated so it's quite odd that we had this
> regression. Can you confirm that it's caused by that patch or something else
> in the environment you are running ? Do you have other patches applied ?

The regression is that now this patch doesn't work. So, if you need to fall back to IPMI 1.5, you will end up with all nodes in maintenance. I didn't manually apply any other patches.

It is very likely that this regression is a result of some other changes, but without this patch you can't fall back to IPMI 1.5 at all anyway. I should also check this patch on the latest puddles. Why is this patch not merged?

Comment 21 Lucas Alvares Gomes 2016-10-03 15:30:27 UTC
(In reply to Udi from comment #19)
> (In reply to Lucas Alvares Gomes from comment #18)
> > Hi Udi,
> > 
> > The patch haven't been updated so it's quite odd that we had this
> > regression. Can you confirm that it's caused by that patch or something else
> > in the environment you are running ? Do you have other patches applied ?
> 
> The regression is that now this patch doesn't work. So, if you need to fall
> back to ipmi 1.5 - you will end up will all nodes in maintenance. I didn't
> apply manually any other patches.
> 
> It is very likely that this regression is a result of some other changes,
> but without this patch you can't fall back to ipmi 1.5 at all anyways. I
> should also check this patch on the latest puddles. Why is this patch not
> merged?

Hi Udi, 

I'm closing this bug because I won't be able to backport the patch to configure the IPMI version to OSP 7.0. Do you see this problem happening on newer versions of Ironic (the configurable IPMI protocol version should be present in OSP 8 and newer)? If so, please re-open this ticket.

