Created attachment 1287174 [details]
ironic-conductor.log

Description of problem:
Ironic fails to complete the deployment even though disk creation is successful, and it shuts down the node.

Version-Release number of selected component (if applicable):
RH OSP 10

How reproducible:
Always

Steps to Reproduce:
1. Import baremetal nodes with the pxe_ilo driver and the appropriate profile. Configure the deploy images and introspect the nodes.
2. Run openstack overcloud deploy.
3. The controllers finish and their provisioning state is active, but the compute node goes into deploy_failed.

Actual results:
Compute node deployment fails.

Expected results:
Compute node deployment failing with error "iLO failed to change state to power on within 12 sec"

Additional info:
Increased the power_wait timeout from 2 to 20; still facing the same issue. As observed from the iLO web UI, the power state stayed off for a long time until the error message appeared.
Hi!

I think the immediate actions to try are the following:
1. try updating the firmware on the nodes and resetting the iLO,
2. if that does not help, try using pxe_ipmitool instead of pxe_ilo,
3. if that does not help, try increasing the timeout to something ridiculous (10 minutes), and see if it works.

In the meantime, I'll try to figure out if there are any known limitations in these models.
We've switched to using pxe_ipmitool instead of pxe_ilo, and this has resolved the issue. It appears that the pxe_ilo driver has a problem when a power request is denied because the enclosure is in a busy state.

The theory is that we are now losing a race most of the time because of how the pxe_ilo driver requests the power state change. If the request were delayed a bit (or retried), I believe it would work. But it seems to be triggered too quickly after the deploy ramdisk powers off.

I see the following in the iLO logs...

iLO Event Log:
276171 06/13/2017 21:00 06/13/2017 21:00 1 Power-On signal sent to host server by: OSPctl.

Integrated Management Log:
33 Rack Infrastructure 06/13/2017 21:00 06/13/2017 21:00 1 Server Blade Enclosure Power Request Denied: Enclosure Busy

The system is busy with something when the power request is made and ignores it. We have already shown that after a brief delay the same command works. But there don't appear to be any tunables available to tweak this, so we are proceeding with the pxe_ipmitool driver for now.

Dmitry, thank you very much for your help. I believe we can close this BZ.
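To illustrate the "delayed a bit (or retried)" idea from the comment above, here is a minimal Python sketch of a power-on retry loop. It is not the pxe_ilo driver code and not the fix that was later proposed; `PowerRequestDenied`, `send_power_on` and `get_power_state` are hypothetical names introduced only for this example, and the attempt count and delay are arbitrary.

```python
import time


class PowerRequestDenied(Exception):
    """Hypothetical error for a rejected request (e.g. "Enclosure Busy")."""


def power_on_with_retry(send_power_on, get_power_state,
                        attempts=5, delay=10.0, sleep=time.sleep):
    """Re-send the power-on request until the node reports "on".

    send_power_on and get_power_state are hypothetical callables standing in
    for whatever the driver uses; they are not ironic or proliantutils APIs.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        try:
            send_power_on()
        except PowerRequestDenied:
            # Enclosure busy: do not give up, wait and try again.
            pass
        # Give the hardware time to act before checking the state.
        sleep(delay)
        if get_power_state() == "on":
            return attempt
    raise TimeoutError(f"node still off after {attempts} power-on attempts")
```

The point of the sketch is that a denied request is followed by a wait and a fresh request, rather than by polling for a state change that the enclosure has already refused.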
I'm glad that this worked for you. I'll keep this bug open, if you don't mind. I'd like to follow-up with the iLO team about it. I'll close it, if I cannot get substantial attention from them. Dropping the priority, as we have a simple workaround.
From the iLO developer upstream:

09:49 <Nisha_Agarwal> [04:50:05] pmannidi, dtantsur|afk this looks an issue in firmware.
09:49 <Nisha_Agarwal> [04:50:44] pmannidi, dtantsur|afk the hardware team here(who deals with ilo and enclosure) need certain details
09:49 <Nisha_Agarwal> [04:56:18] pmannidi, dtantsur|afk the bugzilla doesnt allwo me to edit the bug as i dont have login to it. could you get "the result of the OA command – “show all”" from customer's system? apart from that complete conductor logs would be required.
I feel that this BZ should be reopened. It was opened initially to track the issues present in pxe_ilo. The customer moved forward with pxe_ipmitool as a workaround, but that is not their permanent solution. They expect this to be fixed. While the fix will ultimately be done on HP's side, there's value in tracking it on our side, as Dmitry suggested on 6/14.
Sure, we can reopen it when we find someone to reproduce the issue and provide the logs, etc.
The customer has come across this issue again and is available to send us the required logs. Re-opening the bug and asking for the previous ironic conductor debug logs as well as the HP blade center OA "show all" output, and whatever other diagnostic information we can pull from the blade center.
Reported upstream, we'll ping them on IRC as well.
Hello,

We still don't see the info about "Show All". This is what the firmware team says:

"I don’t see the SHOW ALL from OA. I can only see “SHOW SYSLOG SERVER ALL”. This is not enough to troubleshoot this issue."

Could you please provide this information so that the issue can be troubleshot?

Regards
Nisha
Created attachment 1342331 [details] OA logs, attempt 2, show all
Added new attachment with SHOW ALL from OA.
Hi,

Is it possible for the customer to add "deploy_forces_oob_reboot" to driver_info, set it to True, and see if the issue goes away while using pxe_ilo?

Regards
Nisha
(In reply to Nisha from comment #22)
> Is it possible for customer to add "deploy_forces_oob_reboot" to driver_info
> and set it to True and see if the issue goes away while using pxe_ilo?

Hello, Nisha,

I spoke with the customer this afternoon and they confirmed that they have tried those settings and are still seeing the problem.

Thanks,
-joe.-
Hi Joe,

Thanks for the response. We have spoken to the firmware team here and they do not see any difference between the RIS power on and ipmitool power on implementations.

The IML pasted in https://bugs.launchpad.net/proliantutils/+bug/1725204 and the shared conductor logs were not collected at the same time. I am sorry, but we need to ask for the logs again. It would help us investigate the issue further if you could provide the following:
- The IML logs and the ironic conductor logs with the pxe_ipmitool driver (both collected at the same time).
- The IML logs and the ironic conductor logs with the pxe_ilo driver (both collected at the same time).

Please collect the OA logs at the same time as well. Please collect the logs on the same server for both drivers so that they can be compared.

One more thing I see in the shared conductor logs is "iLO failed to change state to power on within 12 sec". This time appears to have been set by the customer in the config variable "power_state_change_timeout" as 12 secs. Could they use the default value of 30 secs and see if that wait helps resolve the issue?

In any case, please provide the above logs for further triaging.

Regards
Nisha
Hello,

I have raised the patch https://review.openstack.org/#/c/519967/ against proliantutils. But we cannot test this workaround fix, as we have not been able to reproduce the issue in-house so far.

Is it possible for the customer to test the patch and confirm whether it works for them? We cannot merge/release the fix in proliantutils unless it is tested.

Please note that the fix provided in this patch is still a workaround and the best that can be done as of now.

Regards
Nisha
Hi Nisha,

Do you plan on merging the patch upstream? We can backport it then, but I have some reservations about shipping something that your team has not accepted.

Thanks!
Hi all! Nisha confirmed on IRC that the fix will be merged, if it proves to fix the problem. Can someone who reproduces the problem please confirm that? Then we can proceed with backports and everything. Thanks.
Hello, Our customer has tested the proposed upstream fix in their environment and has confirmed that it has solved their issue. With this new code, the issue appears to be fixed. Thank you very much for your efforts!
Thanks Andrew. With your confirmation, the upstream patch can be merged and we can then backport it to OSP-10.
Thanks!
Upstream patch is merged, downstream patch is https://code.engineering.redhat.com/gerrit/#/c/124744/.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0365
Hello,

We have a ProLiant BL460c Gen9 with the latest driver installed on it. We are running Red Hat OpenStack Platform 12. The package we have is:

python-proliantutils-2.4.0-3.el7ost.noarch

We encountered the issue described in this ticket during the introspection step:

Failed to get power state for node 0b94a3b2-62bc-4c00-9b78-d087d6c55cb4.

When we use pxe_ipmitool, it works well.