1460915 – The pxe_ilo driver ignores power requests under certain conditions with HP BL460

Summary: The pxe_ilo driver ignores power requests under certain conditions with HP BL460

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	python-proliantutils
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	z7
Target Release:	10.0 (Newton)
Assignee:	Dmitry Tantsur
QA Contact:	mlammon
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1473267 1534806 1537724

Reported:	2017-06-13 06:53 UTC by PURANDHAR SAIRAM MANNIDI
Modified:	2022-08-16 12:56 UTC
CC List:	20 users
Fixed In Version:	python-proliantutils-2.2.0-3.el7ost
Doc Type:	Bug Fix
Doc Text:	Prior to this update, certain HPE hardware would not process power requests when another task was running on the BMC. Consequently, deployment or introspection could fail, as the node would never power on. With this update, the `proliantutils` library now retries power actions multiple times, until the target power state is reached. As a result, power actions work correctly even if another task is temporarily blocking the BMC.
Clone Of:
Clones:	1534806 1537724
Environment:
Last Closed:	2018-02-27 16:43:33 UTC
Target Upstream Version:
Embargoed:

Attachments
ironic-conductor.log (2.79 MB, text/plain) 2017-06-13 06:53 UTC, PURANDHAR SAIRAM MANNIDI
OA logs, attempt 2, show all (110.33 KB, application/x-gzip) 2017-10-23 17:12 UTC, Andrew Ludwar

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1725204	None	None	None	2017-10-20 09:15:00 UTC
OpenStack gerrit	519967	None	MERGED	Retry power on operation for Blade servers	2020-05-17 09:13:09 UTC
Red Hat Issue Tracker	OSP-4643	None	None	None	2022-08-16 12:56:54 UTC
Red Hat Product Errata	RHBA-2018:0365	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 Bug Fix and Enhancement Advisory	2018-02-27 21:42:55 UTC

Description PURANDHAR SAIRAM MANNIDI 2017-06-13 06:53:03 UTC

Created attachment 1287174
ironic-conductor.log

Description of problem:
Ironic fails to complete the deployment even though disk creation succeeds and the node shuts down.

Version-Release number of selected component (if applicable):
RH OSP 10

How reproducible:
Always


Steps to Reproduce:
1. Import baremetal nodes with the pxe_ilo driver and the appropriate profile. Configure deploy images and introspect the nodes.
2. Run openstack overcloud deploy
3. Controllers were able to finish and the provisioning state is active but the compute node is going into deploy_failed

Actual results:
Compute node deployment fails with the error "iLO failed to change state to power on within 12 sec"

Expected results:
Compute node deployment succeeds

Additional info:
Increasing the power_wait timeout from 2 to 20 did not help. In the iLO web UI, the power state stayed off for a long time before the error message appeared.
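
A minimal sketch of why a longer wait alone may not help, assuming the driver sends a single power request and then only polls for the new state. The function name, the `retries` and `wait` parameters, and the reading of the 12-second figure as 6 polls at 2 seconds each are assumptions made for illustration; they are not taken from the ironic or proliantutils code.

```python
import time


def wait_for_power_state(get_power_state, target, retries=6, wait=2.0,
                         sleep=time.sleep):
    """Poll until the node reports `target` or the retry budget is spent.

    `get_power_state` is a hypothetical callable standing in for the BMC
    query. The power request itself is never re-sent here, so a request
    that the enclosure dropped can never succeed, however long we wait.
    """
    for _ in range(retries):
        if get_power_state() == target:
            return True
        sleep(wait)
    # Error text modelled on the message quoted in this report.
    raise TimeoutError("failed to change state to %s within %g sec"
                       % (target, retries * wait))


# A node whose power-on request was ignored keeps reporting "OFF".
try:
    wait_for_power_state(lambda: "OFF", "ON", sleep=lambda seconds: None)
except TimeoutError as exc:
    message = str(exc)
    print(message)
```

Under these assumptions, raising `wait` only postpones the same error, which would match the observation above.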

Comment 2 Dmitry Tantsur 2017-06-13 11:00:23 UTC

Hi!

I think the immediate actions to try are the following:

1. try updating the firmware on nodes and resetting the iLO,
2. if it does not help, try using pxe_ipmitool instead of pxe_ilo,
3. if it does not help, try increasing the timeout to something ridiculous (10 minutes), and see if it works.

In the meantime, I'll try to figure out if there are any known limitations in these models.

Comment 3 Andrew Ludwar 2017-06-13 22:43:17 UTC

We've switched to using the pxe_ipmitool instead of pxe_ilo, and this has resolved the issue.

It appears that, with the pxe_ilo driver, a power request is denied when the enclosure is in a busy state.

The theory is that the pxe_ilo driver now loses a race most of the time because of how it requests the power state change. If the request were delayed a bit (or retried), I believe it would work. Instead, it seems to be triggered too quickly after the deploy ramdisk powers off. I see the following in the iLO logs...

iLO Event Log:
276171		06/13/2017 21:00	06/13/2017 21:00	1	Power-On signal sent to host server by: OSPctl.

Integrated Management Log:
33		Rack Infrastructure	06/13/2017 21:00	06/13/2017 21:00	1	Server Blade Enclosure Power Request Denied: Enclosure Busy 

The system is busy with something when the power request is made, so it ignores the request. We have already shown that the same command works after a brief delay. However, there don't appear to be any tunables to adjust this, so we are proceeding with the pxe_ipmitool driver for now.
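
The "delay or retry" idea could be sketched roughly as follows. This is only an illustration: `set_power_on`, `get_power_state`, and the `FakeBusyEnclosure` test double are hypothetical names, not proliantutils or ironic APIs, and the retry count and delay are arbitrary.

```python
import time


def power_on_with_retry(set_power_on, get_power_state, attempts=5, delay=2.0):
    """Re-issue the power-on request until the node reports "ON".

    Both callables are hypothetical stand-ins for whatever the BMC
    client exposes. Returns the number of requests that were needed.
    """
    for attempt in range(1, attempts + 1):
        set_power_on()       # the enclosure may silently drop this request
        time.sleep(delay)    # give the BMC time to act before checking
        if get_power_state() == "ON":
            return attempt
    raise RuntimeError("node did not power on after %d attempts" % attempts)


class FakeBusyEnclosure:
    """Test double that ignores the first `busy_for` power-on requests."""

    def __init__(self, busy_for):
        self.busy_for = busy_for
        self.state = "OFF"

    def set_power_on(self):
        if self.busy_for > 0:
            self.busy_for -= 1   # request denied: enclosure busy
        else:
            self.state = "ON"

    def get_power_state(self):
        return self.state


bmc = FakeBusyEnclosure(busy_for=2)
attempts_used = power_on_with_retry(bmc.set_power_on, bmc.get_power_state,
                                    delay=0)
print(attempts_used)  # two denied requests, then one that succeeds
```

The difference from a plain wait is that the request itself is sent again on each pass, so a request dropped while the enclosure was busy gets another chance.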

Dmitry, thank you very much for your help. I believe we can close this BZ.

Comment 4 Dmitry Tantsur 2017-06-14 09:09:08 UTC

I'm glad that this worked for you. I'll keep this bug open, if you don't mind. I'd like to follow up with the iLO team about it. I'll close it if I cannot get substantial attention from them.

Dropping the priority, as we have a simple workaround.

Comment 5 Dmitry Tantsur 2017-06-16 09:02:13 UTC

From the iLO developer upstream:

09:49 <Nisha_Agarwal> [04:50:05] pmannidi, dtantsur|afk this looks an issue in firmware.
09:49 <Nisha_Agarwal> [04:50:44] pmannidi, dtantsur|afk the hardware team here(who deals with ilo and enclosure) need certain details
09:49 <Nisha_Agarwal> [04:56:18] pmannidi, dtantsur|afk the bugzilla doesnt allwo me to edit the bug as i dont have login to it. could you get "the result of the OA command – “show all”" from customer's system? apart from that complete conductor logs would be required.

Comment 7 jzaher 2017-08-31 15:43:13 UTC

I feel that this BZ should be reopened.  It was opened initially to track the issues present in pxe_ilo.  The customer moved forward with pxe_ipmitool as a workaround -- but that is not their permanent solution.  They expect this to be fixed.  While the fix will ultimately be done on HP's side, there's value in tracking it on our side, as Dmitry suggested on 6/14.

Comment 8 Dmitry Tantsur 2017-09-04 07:48:40 UTC

Sure, we can reopen it when we find someone to reproduce the issue and provide the logs, etc.

Comment 9 Andrew Ludwar 2017-09-18 20:13:41 UTC

The customer has come across this issue again and is available to send us the required logs. Re-opening the bug and asking for the previous ironic conductor debug logs as well as the HP blade center OA "show all" output, and whatever other diagnostic information we can pull from the blade center.

Comment 15 Dmitry Tantsur 2017-10-20 09:15:01 UTC

Reported upstream, we'll ping them on IRC as well.

Comment 18 Nisha 2017-10-23 16:26:26 UTC

Hello, we still don't see the info about "Show All".

This is what firmware team says:
"I don’t see the SHOW ALL from OA. I can only see “SHOW SYSLOG SERVER ALL”. This is not enough to troubleshoot this issue."

Could you please provide this information so that the issue can be troubleshot?

Regards
Nisha

Comment 20 Andrew Ludwar 2017-10-23 17:12:09 UTC

Created attachment 1342331
OA logs, attempt 2, show all

Comment 21 Andrew Ludwar 2017-10-23 17:13:08 UTC

Added new attachment with SHOW ALL from OA.

Comment 22 Nisha 2017-10-25 19:05:52 UTC

Hi, 

Is it possible for the customer to add "deploy_forces_oob_reboot" to driver_info, set it to True, and see if the issue goes away while using pxe_ilo?
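
For illustration, the change could be expressed as a JSON Patch document, assuming the node is updated through a JSON Patch style request. The sample node, its address (a documentation-range placeholder), and the small `apply_add_ops` helper are hypothetical and exist only to show the resulting driver_info.

```python
import json

# JSON Patch (RFC 6902) operation adding the flag under driver_info.
patch = [
    {
        "op": "add",
        "path": "/driver_info/deploy_forces_oob_reboot",
        "value": True,
    }
]


def apply_add_ops(node, operations):
    """Apply "add" operations to a nested dict.

    A tiny stand-in for a real JSON Patch library; it handles only the
    "add" operation on dictionary paths.
    """
    for op in operations:
        if op["op"] != "add":
            raise ValueError("unsupported operation: %s" % op["op"])
        *parents, leaf = op["path"].lstrip("/").split("/")
        target = node
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = op["value"]
    return node


# Placeholder node record; the values are made up for the example.
node = {"driver": "pxe_ilo", "driver_info": {"ilo_address": "192.0.2.10"}}
apply_add_ops(node, patch)
print(json.dumps(node["driver_info"], sort_keys=True))
```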

Regards
Nisha

Comment 23 jzaher 2017-11-02 21:48:05 UTC

(In reply to Nisha from comment #22)

> Is it possible for customer to add "deploy_forces_oob_reboot" to driver_info
> and set it to True and see if the issue goes away while using pxe_ilo?

Hello, Nisha --

I spoke with the customer this afternoon and they confirmed that they have tried those settings and are still seeing the problem.

Thanks,
-joe.-

Comment 26 Nisha 2017-11-07 05:13:32 UTC

Hi Joe,

Thanks for the response. 
We have spoken to the firmware team here and they do not see any difference between RIS power on and ipmitool power on implementations.

The IML pasted in https://bugs.launchpad.net/proliantutils/+bug/1725204 and the shared conductor logs are not collected at the same time.

I am sorry but we would need to ask for the logs again. It would help us to investigate the issue further if you could provide following:
- The IML logs and the ironic conductor logs with the pxe_ipmitool driver (both collected at the same time).
- The IML logs and the ironic conductor logs with the pxe_ilo driver (both collected at the same time).

Please collect OA logs also at the same time.

Please collect the logs on the same server for both the drivers so that they can be compared.

One more thing I see in the shared conductor logs is "iLO failed to change state to power on within 12 sec". This time appears to have been set by the customer through the config variable "power_state_change_timeout" as 12 secs. Could they use the default value of 30 secs and see whether the longer wait resolves the issue? In any case, please provide the above logs for further triaging.

Regards
Nisha

Comment 28 Nisha 2017-11-15 08:08:41 UTC

Hello,

I have raised the patch https://review.openstack.org/#/c/519967/ against proliantutils. However, we cannot test this workaround fix, as we have not been able to reproduce the issue in-house so far. Is it possible for the customer to test the patch and confirm whether it works for them? We cannot merge/release the fix in proliantutils unless it is tested.

Please note that the fix provided in this patch is still a workaround fix and the best which could be done as of now.

Regards
Nisha

Comment 33 Dmitry Tantsur 2017-12-04 11:43:51 UTC

Hi Nisha,

Do you plan on merging the patch upstream? We can backport it then, but I have some reservations about shipping something that your team has not accepted.

Thanks!

Comment 34 Dmitry Tantsur 2017-12-04 16:24:06 UTC

Hi all!

Nisha confirmed on IRC that the fix will be merged, if it proves to fix the problem.

Can someone who reproduces the problem please confirm that? Then we can proceed with backports and everything.

Thanks.

Comment 37 Andrew Ludwar 2018-01-04 14:32:31 UTC

Hello,

Our customer has tested the proposed upstream fix in their environment and has confirmed that it has solved their issue. With this new code, the issue appears to be fixed.

Thank you very much for your efforts!

Comment 38 Bob Fournier 2018-01-04 14:38:31 UTC

Thanks Andrew.  With your confirmation the upstream patch should be able to be merged and we can then backport it to OSP-10.

Comment 39 Dmitry Tantsur 2018-01-04 14:46:10 UTC

Thanks!

Comment 40 Bob Fournier 2018-01-16 02:07:49 UTC

Upstream patch is merged, downstream patch is https://code.engineering.redhat.com/gerrit/#/c/124744/.

Comment 50 errata-xmlrpc 2018-02-27 16:43:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0365

Comment 51 RegisJ 2018-06-15 13:32:17 UTC

Hello, we have a ProLiant BL460c Gen9 with the latest driver installed on it, running Red Hat OpenStack Platform 12. The package we have is: python-proliantutils-2.4.0-3.el7ost.noarch
We encountered the issue described in this ticket during the introspection step:
Failed to get power state for node 0b94a3b2-62bc-4c00-9b78-d087d6c55cb4.

When we switch to pxe_ipmitool, it works well.

