Description of problem: When using ilo driver and after a password update, ironic keeps using the old password to attempt to login to the ilo. Only tries with the new one after (if) ironic_conductor is restarted. This is not desirable and the behavior is different when using the ipmi_driver, which correctly retries with the updated password without requiring a restart of ironic_container. Version-Release number of selected component (if applicable): RHOSP 16.2.5 (undercloud) [stack.lab ~]$ sudo podman inspect ironic_conductor | jq .[].Config.Labels | egrep "batch|version|release" "batch": "16.2_20230526.1", "io.buildah.version": "1.27.3", "release": "5", "version": "16.2.5" (undercloud) [stack.lab ~]$ How reproducible: Always reproducible when using ilo_driver Steps to Reproduce: 1. Deploy undercloud 2. Import 1 HP server with ilo 3. Get current user, password and ip from the DB: ~~~ (undercloud) [stack.lab ~]$ date ; sudo podman exec -u root mysql /usr/bin/mysql -uroot -NEe "select driver_info from ironic.nodes;" | grep -v row | jq . |grep ilo Tue Jul 11 17:48:57 CEST 2023 "ilo_username": "root", "ilo_password": "oldpassword", "ilo_address": "10.0.0.100" (undercloud) [stack.lab ~]$ ~~~ 4. Test current user and password using ipmitool ~~~ (undercloud) [stack.lab ~]$ date ; ipmitool -I lanplus -H 10.0.0.100 -U root -P oldpassword chassis status | grep Power Tue Jul 11 17:49:18 CEST 2023 Unable to Get Channel Cipher Suites System Power : on Power Overload : false Power Interlock : inactive Main Power Fault : false Power Control Fault : false Power Restore Policy : previous Last Power Event : (undercloud) [stack.lab ~]$ ~~~ 5. Get current power status and driver using openstack cli ~~~ (undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c power_state -c driver --fit Tue Jul 11 17:49:49 CEST 2023 +-------------+----------+ | Field | Value | +-------------+----------+ | driver | ilo | | power_state | power on | +-------------+----------+ (undercloud) [stack.lab ~]$ ~~~ 6. Changed the password on the ilo at ~17:50 CEST and I could see the error on the log right away: ~~~ (undercloud) [stack.lab ~]$ sudo tail /var/log/containers/ironic/ironic-conductor.log |grep ERROR 2023-07-11 17:52:26.601 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed. (undercloud) [stack.lab ~]$ ~~~ 7. I try a couple of times until eventually power status switches to None: ~~~ (undercloud) [stack.lab ~]$ while [ TRUE ] ;do date ; openstack baremetal node show hpdl380.local162.lab -c power_state | grep -q None && break > sleep 10 > done ; date ; openstack baremetal node show hpdl380.local162.lab -c power_state Tue Jul 11 17:53:59 CEST 2023 Tue Jul 11 17:54:12 CEST 2023 Tue Jul 11 17:54:24 CEST 2023 Tue Jul 11 17:54:36 CEST 2023 Tue Jul 11 17:54:38 CEST 2023 +-------------+-------+ | Field | Value | +-------------+-------+ | power_state | None | +-------------+-------+ (undercloud) [stack.lab ~]$ ~~~ 8. Update password in ironic: ~~~ (undercloud) [stack.lab ~]$ date ; openstack baremetal node set --driver-info ilo_password=newpassword hpdl380.local162.lab Tue Jul 11 17:57:43 CEST 2023 (undercloud) [stack.lab ~]$ ~~~ 9. Confirm password is updated in the db: ~~~ (undercloud) [stack.lab ~]$ date ; sudo podman exec -u root mysql /usr/bin/mysql -uroot -NEe "select driver_info from ironic.nodes;" | grep -v row | jq -r '.ilo_password' Tue Jul 11 17:57:47 CEST 2023 newpassword (undercloud) [stack.lab ~]$ ~~~ 10. Waited some minutes and the password was still set to None: (undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c power_state ~~~ Tue Jul 11 18:05:01 CEST 2023 +-------------+-------+ | Field | Value | +-------------+-------+ | power_state | None | +-------------+-------+ (undercloud) [stack.lab ~]$ ~~~ 11. Looking at the node details, I can see that ironic set maintenance to true on the node after many failed logins: ~~~ (undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c maintenance -c maintenance_reason --fit Tue Jul 11 18:06:13 CEST 2023 +--------------------+------------------------------------------------------------------------------------------------------+ | Field | Value | +--------------------+------------------------------------------------------------------------------------------------------+ | maintenance | True | | maintenance_reason | During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3, node | | | state None does not match expected state 'power on'. Updating DB state to 'None' Switching node to | | | maintenance mode. Error: iLO get_power_status failed, error: Login failed. | +--------------------+------------------------------------------------------------------------------------------------------+ (undercloud) [stack.lab ~]$ ~~~ 12. In the log, even after the password change, ironic kept failing to login to the ilo: ~~~ (undercloud) [stack.lab ~]$ sudo grep ERROR /var/log/containers/ironic/ironic-conductor.log | tail -3 2023-07-11 17:56:25.926 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed. 2023-07-11 18:01:25.935 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed. 2023-07-11 18:06:25.923 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed. (undercloud) [stack.lab ~]$ ~~~ 13. Unsetting maintenance mode to see if that makes ironic retry login with the new password: ~~~ (undercloud) [stack.lab ~]$ date ; openstack baremetal node maintenance unset hpdl380.local162.lab Tue Jul 11 18:11:26 CEST 2023 (undercloud) [stack.lab ~]$ (undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c maintenance -c maintenance_reason --fit Tue Jul 11 18:11:31 CEST 2023 +--------------------+-------+ | Field | Value | +--------------------+-------+ | maintenance | False | | maintenance_reason | None | +--------------------+-------+ (undercloud) [stack.lab ~]$ ~~~ 14. But it kept trying with the old password and eventually maintenace mode is set again: ~~~ (undercloud) [stack.lab ~]$ sudo tail -5 /var/log/containers/ironic/ironic-conductor.log 2023-07-11 18:12:26.649 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed. 2023-07-11 18:12:26.649 7 DEBUG ironic.conductor.task_manager [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] Upgrading shared lock on node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 for power state sync to an exclusive one (shared lock was held 1.50 seconds) upgrade_lock /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:362 2023-07-11 18:12:26.668 7 DEBUG ironic.conductor.task_manager [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] Node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 successfully reserved for power state sync (took 0.02 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:339 2023-07-11 18:12:26.692 7 ERROR ironic.conductor.manager [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: iLO get_power_status failed, error: Login failed.: ironic.common.exception.IloOperationError: iLO get_power_status failed, error: Login failed. 2023-07-11 18:12:26.702 7 DEBUG ironic.conductor.task_manager [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] Successfully released exclusive lock for power state sync on node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 (lock was held 0.03 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:418 (undercloud) [stack.lab ~]$ (undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c maintenance -c maintenance_reason --fit Tue Jul 11 18:19:13 CEST 2023 +--------------------+------------------------------------------------------------------------------------------------------+ | Field | Value | +--------------------+------------------------------------------------------------------------------------------------------+ | maintenance | True | | maintenance_reason | During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3, node | | | state None does not match expected state 'None'. Updating DB state to 'None' Switching node to | | | maintenance mode. Error: iLO get_power_status failed, error: Login failed. | +--------------------+------------------------------------------------------------------------------------------------------+ (undercloud) [stack.lab ~]$ 15. Restarting the ironic_conductor container ~~~ (undercloud) [stack.lab ~]$ date ; sudo podman restart ironic_conductor Tue Jul 11 18:20:33 CEST 2023 516a2cedca2502bad00fb0913c76853ec88659cbada9f8a9fbc119cb980b53db (undercloud) [stack.lab ~]$ ~~~ 16. And eventually power status goes back to normal: ~~~ (undercloud) [stack.lab ~]$ while [ TRUE ] ;do date ; openstack baremetal node show hpdl380.local162.lab -c power_state | grep -q None || break > sleep 10 > done ; date ; openstack baremetal node show hpdl380.local162.lab -c power_state Tue Jul 11 18:24:11 CEST 2023 Tue Jul 11 18:24:23 CEST 2023 Tue Jul 11 18:24:35 CEST 2023 Tue Jul 11 18:24:47 CEST 2023 Tue Jul 11 18:24:59 CEST 2023 Tue Jul 11 18:25:11 CEST 2023 Tue Jul 11 18:25:24 CEST 2023 Tue Jul 11 18:25:36 CEST 2023 Tue Jul 11 18:25:48 CEST 2023 Tue Jul 11 18:25:50 CEST 2023 +-------------+----------+ | Field | Value | +-------------+----------+ | power_state | power on | +-------------+----------+ (undercloud) [stack.lab ~]$ ~~~ Actual results: Ironic keeps retrying login with old password Expected results: Ironic should read the updated password from the db and retry with it after a password change Additional info: This does not happen when using ipmi_driver
Could you please provide a full baremetal node show: openstack baremetal node show -fyaml hpdl380.local162.lab It is not clear which power-interface is being used on this node, but if it is ipmi then I may see the issue. There is an update function to sync ipmi_password with the current value of ilo_password, which may not be triggered on baremetal node set. If this is the case, you'll need to sync the password values manually: openstack baremetal node set --driver-info ilo_password=newpassword --driver-info ipmi_password=newpassword hpdl380.local162.lab
(In reply to Steve Baker from comment #1) > Could you please provide a full baremetal node show: > > openstack baremetal node show -fyaml hpdl380.local162.lab > ~~~ (undercloud) [stack.lab ~]$ openstack baremetal node show -fyaml hpdl380.local162.lab allocation_uuid: null automated_clean: null bios_interface: ilo boot_interface: ilo-pxe chassis_uuid: null clean_step: {} conductor: director.local162.lab conductor_group: '' console_enabled: false console_interface: ilo created_at: '2023-07-11T15:38:19+00:00' deploy_interface: iscsi deploy_step: {} description: null driver: ilo driver_info: deploy_kernel: file:///var/lib/ironic/httpboot/agent.kernel deploy_ramdisk: file:///var/lib/ironic/httpboot/agent.ramdisk ilo_address: 10.0.0.100 ilo_password: '******' ilo_username: root rescue_kernel: file:///var/lib/ironic/httpboot/agent.kernel rescue_ramdisk: file:///var/lib/ironic/httpboot/agent.ramdisk driver_internal_info: {} extra: {} fault: power failure inspect_interface: inspector inspection_finished_at: null inspection_started_at: null instance_info: {} instance_uuid: null last_error: 'During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3, node state None does not match expected state ''power on''. Updating DB state to ''None'' Switching node to maintenance mode. Error: iLO get_power_status failed, error: Login failed.' maintenance: true maintenance_reason: 'During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3, node state None does not match expected state ''power on''. Updating DB state to ''None'' Switching node to maintenance mode. Error: iLO get_power_status failed, error: Login failed.' management_interface: ilo name: hpdl380.local162.lab network_interface: flat owner: null power_interface: ilo power_state: null properties: cpu_arch: x86_64 protected: false protected_reason: null provision_state: manageable provision_updated_at: '2023-07-11T15:38:25+00:00' raid_config: {} raid_interface: no-raid rescue_interface: agent reservation: null resource_class: baremetal storage_interface: noop target_power_state: null target_provision_state: null target_raid_config: {} traits: [] updated_at: '2023-07-11T16:32:47+00:00' uuid: 658b192e-abe6-48c8-ab07-5577ea3dc3b3 vendor_interface: no-vendor (undercloud) [stack.lab ~]$ ~~~ > It is not clear which power-interface is being used on this node, It's an ilo (see for example steps #5 for ilo driver and step #6 for the the ilo.power error in the log). This comes from 'pm_type: "ilo"' in the nodes.yaml. > but if it is ipmi then I may see the issue. > There is an update function to sync > ipmi_password with the current value of ilo_password, which may not be > triggered on baremetal node set. No it's the other way around for me. As much as I tried, I could not reproduce the problem when using ipmi driver. That is: when using ipmi ironic retries with the **new** password after I do "baremetal node set --driver-info ipmi_password=...". It just takes some minutes, but it resolves on its own without a container restart. When using ilo, however, it stays broken and retrying continuously with the old password until ironic_conductor is restarted, or until I put the old password back both in the ilo and in ironic. > If this is the case, you'll need to sync > the password values manually: > > openstack baremetal node set --driver-info ilo_password=newpassword > --driver-info ipmi_password=newpassword hpdl380.local162.lab It's not the same case, so I don't think this command applies. I don't have ipmi information on driver_info, only ilo_*.
We've run into a similar issue with the redfish hardware type interfaces. Actually, I need to add a patch to detect some ilos and reject the ilo driver on them and instead indicate "use redfish". Anyhow! A glance at the ironic code path doesn't appear to use caching, and we always read from the database when launching new tasks. The challenge is the proliantutils library, internally, has caching, and that is what is breaking things here. It appears, at least looking at Wallaby branch upstream, and newer, they have mirrored the logic to spawn a new cached client and checked upon username, password, and address. However in the Train release of proliantutils, it had no such guard, and only relied upon the address of the BMC. It appears they fixed this in proliantutils 2.9.2, in upstream change number 694448. Looks like we can just cherry-pick the patch into place.
Awesome, thanks for the update Julia. There's a workaround in place at the moment, so we can wait for the fix.