Bug 2222048 - ironic keeps using old ilo password after update
Summary: ironic keeps using old ilo password after update
Keywords:
Status: MODIFIED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-proliantutils
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: z6
: 16.2 (Train on RHEL 8.4)
Assignee: Julia Kreger
QA Contact: James E. LaBarre
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-07-11 16:53 UTC by Eric Nothen
Modified: 2023-07-31 19:58 UTC (History)
7 users (show)

Fixed In Version: python-proliantutils-2.9.4-17.1.20230620223248.5bc7569.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-26521 0 None None None 2023-07-11 16:57:02 UTC

Description Eric Nothen 2023-07-11 16:53:54 UTC
Description of problem:

When using ilo driver and after a password update, ironic keeps using the old password to attempt to login to the ilo. Only tries with the new one after (if) ironic_conductor is restarted.

This is not desirable and the behavior is different when using the ipmi_driver, which correctly retries with the updated password without requiring a restart of ironic_container.

Version-Release number of selected component (if applicable):
RHOSP 16.2.5

(undercloud) [stack.lab ~]$ sudo podman inspect ironic_conductor | jq .[].Config.Labels | egrep "batch|version|release"
  "batch": "16.2_20230526.1",
  "io.buildah.version": "1.27.3",
  "release": "5",
  "version": "16.2.5"
(undercloud) [stack.lab ~]$ 


How reproducible:
Always reproducible when using ilo_driver

Steps to Reproduce:
1. Deploy undercloud

2. Import 1 HP server with ilo

3. Get current user, password and ip from the DB:
~~~
(undercloud) [stack.lab ~]$ date ; sudo podman exec -u root mysql /usr/bin/mysql -uroot -NEe "select driver_info from ironic.nodes;" | grep -v row | jq . |grep ilo
Tue Jul 11 17:48:57 CEST 2023
  "ilo_username": "root",
  "ilo_password": "oldpassword",
  "ilo_address": "10.0.0.100"
(undercloud) [stack.lab ~]$ 
~~~

4. Test current user and password using ipmitool
~~~
(undercloud) [stack.lab ~]$ date ; ipmitool -I lanplus -H 10.0.0.100 -U root -P oldpassword chassis status | grep Power
Tue Jul 11 17:49:18 CEST 2023
Unable to Get Channel Cipher Suites
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : previous
Last Power Event     : 
(undercloud) [stack.lab ~]$ 
~~~

5. Get current power status and driver using openstack cli
~~~
(undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c power_state -c driver --fit
Tue Jul 11 17:49:49 CEST 2023
+-------------+----------+
| Field       | Value    |
+-------------+----------+
| driver      | ilo      |
| power_state | power on |
+-------------+----------+
(undercloud) [stack.lab ~]$ 
~~~

6. Changed the password on the ilo at ~17:50 CEST and I could see the error on the log right away:
~~~
(undercloud) [stack.lab ~]$ sudo tail /var/log/containers/ironic/ironic-conductor.log |grep ERROR
2023-07-11 17:52:26.601 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed.
(undercloud) [stack.lab ~]$ 
~~~

7. I try a couple of times until eventually power status switches to None:
~~~
(undercloud) [stack.lab ~]$ while [ TRUE ] ;do date ; openstack baremetal node show hpdl380.local162.lab -c power_state | grep -q None && break
> sleep 10
> done ; date ; openstack baremetal node show hpdl380.local162.lab -c power_state
Tue Jul 11 17:53:59 CEST 2023
Tue Jul 11 17:54:12 CEST 2023
Tue Jul 11 17:54:24 CEST 2023
Tue Jul 11 17:54:36 CEST 2023
Tue Jul 11 17:54:38 CEST 2023
+-------------+-------+
| Field       | Value |
+-------------+-------+
| power_state | None  |
+-------------+-------+
(undercloud) [stack.lab ~]$ 
~~~

8. Update password in ironic:
~~~
(undercloud) [stack.lab ~]$ date ; openstack baremetal node set --driver-info ilo_password=newpassword hpdl380.local162.lab
Tue Jul 11 17:57:43 CEST 2023
(undercloud) [stack.lab ~]$ 
~~~

9. Confirm password is updated in the db:
~~~
(undercloud) [stack.lab ~]$ date ; sudo podman exec -u root mysql /usr/bin/mysql -uroot -NEe "select driver_info from ironic.nodes;" | grep -v row | jq -r '.ilo_password'
Tue Jul 11 17:57:47 CEST 2023
newpassword
(undercloud) [stack.lab ~]$ 
~~~

10. Waited some minutes and the password was still set to None:
(undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c power_state
~~~
Tue Jul 11 18:05:01 CEST 2023
+-------------+-------+
| Field       | Value |
+-------------+-------+
| power_state | None  |
+-------------+-------+
(undercloud) [stack.lab ~]$ 
~~~

11. Looking at the node details, I can see that ironic set maintenance to true on the node after many failed logins:
~~~
(undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c maintenance -c maintenance_reason --fit
Tue Jul 11 18:06:13 CEST 2023
+--------------------+------------------------------------------------------------------------------------------------------+
| Field              | Value                                                                                                |
+--------------------+------------------------------------------------------------------------------------------------------+
| maintenance        | True                                                                                                 |
| maintenance_reason | During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3, node    |
|                    | state None does not match expected state 'power on'. Updating DB state to 'None' Switching node to   |
|                    | maintenance mode. Error: iLO get_power_status failed, error: Login failed.                           |
+--------------------+------------------------------------------------------------------------------------------------------+
(undercloud) [stack.lab ~]$ 
~~~

12. In the log, even after the password change, ironic kept failing to login to the ilo:
~~~
(undercloud) [stack.lab ~]$ sudo grep ERROR /var/log/containers/ironic/ironic-conductor.log | tail -3
2023-07-11 17:56:25.926 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed.
2023-07-11 18:01:25.935 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed.
2023-07-11 18:06:25.923 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed.
(undercloud) [stack.lab ~]$ 
~~~

13. Unsetting maintenance mode to see if that makes ironic retry login with the new password:
~~~
(undercloud) [stack.lab ~]$ date ; openstack baremetal node maintenance unset hpdl380.local162.lab 
Tue Jul 11 18:11:26 CEST 2023
(undercloud) [stack.lab ~]$ 
(undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c maintenance -c maintenance_reason --fit
Tue Jul 11 18:11:31 CEST 2023
+--------------------+-------+
| Field              | Value |
+--------------------+-------+
| maintenance        | False |
| maintenance_reason | None  |
+--------------------+-------+
(undercloud) [stack.lab ~]$
~~~

14. But it kept trying with the old password and eventually maintenace mode is set again:
~~~
(undercloud) [stack.lab ~]$ sudo tail -5 /var/log/containers/ironic/ironic-conductor.log
2023-07-11 18:12:26.649 7 ERROR ironic.drivers.modules.ilo.power [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] iLO get_power_state failed for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 with error: Login failed..: proliantutils.exception.IloLoginFailError: Login failed.
2023-07-11 18:12:26.649 7 DEBUG ironic.conductor.task_manager [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] Upgrading shared lock on node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 for power state sync to an exclusive one (shared lock was held 1.50 seconds) upgrade_lock /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:362
2023-07-11 18:12:26.668 7 DEBUG ironic.conductor.task_manager [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] Node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 successfully reserved for power state sync (took 0.02 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:339
2023-07-11 18:12:26.692 7 ERROR ironic.conductor.manager [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: iLO get_power_status failed, error: Login failed.: ironic.common.exception.IloOperationError: iLO get_power_status failed, error: Login failed.
2023-07-11 18:12:26.702 7 DEBUG ironic.conductor.task_manager [req-ed74e79e-c989-49ea-819b-9158566fe90a - - - - -] Successfully released exclusive lock for power state sync on node 658b192e-abe6-48c8-ab07-5577ea3dc3b3 (lock was held 0.03 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:418
(undercloud) [stack.lab ~]$ 
(undercloud) [stack.lab ~]$ date ; openstack baremetal node show hpdl380.local162.lab -c maintenance -c maintenance_reason --fit
Tue Jul 11 18:19:13 CEST 2023
+--------------------+------------------------------------------------------------------------------------------------------+
| Field              | Value                                                                                                |
+--------------------+------------------------------------------------------------------------------------------------------+
| maintenance        | True                                                                                                 |
| maintenance_reason | During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3, node    |
|                    | state None does not match expected state 'None'. Updating DB state to 'None' Switching node to       |
|                    | maintenance mode. Error: iLO get_power_status failed, error: Login failed.                           |
+--------------------+------------------------------------------------------------------------------------------------------+
(undercloud) [stack.lab ~]$ 

15. Restarting the ironic_conductor container
~~~
(undercloud) [stack.lab ~]$ date ; sudo podman restart ironic_conductor
Tue Jul 11 18:20:33 CEST 2023
516a2cedca2502bad00fb0913c76853ec88659cbada9f8a9fbc119cb980b53db
(undercloud) [stack.lab ~]$ 
~~~

16. And eventually power status goes back to normal:
~~~
(undercloud) [stack.lab ~]$ while [ TRUE ] ;do date ; openstack baremetal node show hpdl380.local162.lab -c power_state | grep -q None || break
> sleep 10
> done ; date ; openstack baremetal node show hpdl380.local162.lab -c power_state
Tue Jul 11 18:24:11 CEST 2023
Tue Jul 11 18:24:23 CEST 2023
Tue Jul 11 18:24:35 CEST 2023
Tue Jul 11 18:24:47 CEST 2023
Tue Jul 11 18:24:59 CEST 2023
Tue Jul 11 18:25:11 CEST 2023
Tue Jul 11 18:25:24 CEST 2023
Tue Jul 11 18:25:36 CEST 2023
Tue Jul 11 18:25:48 CEST 2023
Tue Jul 11 18:25:50 CEST 2023
+-------------+----------+
| Field       | Value    |
+-------------+----------+
| power_state | power on |
+-------------+----------+
(undercloud) [stack.lab ~]$ 
~~~

Actual results:
Ironic keeps retrying login with old password

Expected results:
Ironic should read the updated password from the db and retry with it after a password change

Additional info:
This does not happen when using ipmi_driver

Comment 1 Steve Baker 2023-07-11 22:07:06 UTC
Could you please provide a full baremetal node show:

  openstack baremetal node show -fyaml hpdl380.local162.lab

It is not clear which power-interface is being used on this node, but if it is ipmi then I may see the issue. There is an update function to sync ipmi_password with the current value of ilo_password, which may not be triggered on baremetal node set. If this is the case, you'll need to sync the password values manually:

  openstack baremetal node set --driver-info ilo_password=newpassword --driver-info ipmi_password=newpassword hpdl380.local162.lab

Comment 2 Eric Nothen 2023-07-12 07:38:06 UTC
(In reply to Steve Baker from comment #1)
> Could you please provide a full baremetal node show:
> 
>   openstack baremetal node show -fyaml hpdl380.local162.lab
> 

~~~
(undercloud) [stack.lab ~]$ openstack baremetal node show -fyaml hpdl380.local162.lab
allocation_uuid: null
automated_clean: null
bios_interface: ilo
boot_interface: ilo-pxe
chassis_uuid: null
clean_step: {}
conductor: director.local162.lab
conductor_group: ''
console_enabled: false
console_interface: ilo
created_at: '2023-07-11T15:38:19+00:00'
deploy_interface: iscsi
deploy_step: {}
description: null
driver: ilo
driver_info:
  deploy_kernel: file:///var/lib/ironic/httpboot/agent.kernel
  deploy_ramdisk: file:///var/lib/ironic/httpboot/agent.ramdisk
  ilo_address: 10.0.0.100
  ilo_password: '******'
  ilo_username: root
  rescue_kernel: file:///var/lib/ironic/httpboot/agent.kernel
  rescue_ramdisk: file:///var/lib/ironic/httpboot/agent.ramdisk
driver_internal_info: {}
extra: {}
fault: power failure
inspect_interface: inspector
inspection_finished_at: null
inspection_started_at: null
instance_info: {}
instance_uuid: null
last_error: 'During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3,
  node state None does not match expected state ''power on''. Updating DB state to
  ''None'' Switching node to maintenance mode. Error: iLO get_power_status failed,
  error: Login failed.'
maintenance: true
maintenance_reason: 'During sync_power_state, max retries exceeded for node 658b192e-abe6-48c8-ab07-5577ea3dc3b3,
  node state None does not match expected state ''power on''. Updating DB state to
  ''None'' Switching node to maintenance mode. Error: iLO get_power_status failed,
  error: Login failed.'
management_interface: ilo
name: hpdl380.local162.lab
network_interface: flat
owner: null
power_interface: ilo
power_state: null
properties:
  cpu_arch: x86_64
protected: false
protected_reason: null
provision_state: manageable
provision_updated_at: '2023-07-11T15:38:25+00:00'
raid_config: {}
raid_interface: no-raid
rescue_interface: agent
reservation: null
resource_class: baremetal
storage_interface: noop
target_power_state: null
target_provision_state: null
target_raid_config: {}
traits: []
updated_at: '2023-07-11T16:32:47+00:00'
uuid: 658b192e-abe6-48c8-ab07-5577ea3dc3b3
vendor_interface: no-vendor
(undercloud) [stack.lab ~]$ 
~~~


> It is not clear which power-interface is being used on this node, 

It's an ilo (see for example steps #5 for ilo driver and step #6 for the the ilo.power error in the log). This comes from 'pm_type: "ilo"' in the nodes.yaml.


> but if it is ipmi then I may see the issue. 
> There is an update function to sync
> ipmi_password with the current value of ilo_password, which may not be
> triggered on baremetal node set.

No it's the other way around for me. As much as I tried, I could not reproduce the problem when using ipmi driver. That is: when using ipmi ironic retries with the **new** password after I do "baremetal node set --driver-info ipmi_password=...". It just takes some minutes, but it resolves on its own without a container restart.

When using ilo, however, it stays broken and retrying continuously with the old password until ironic_conductor is restarted, or until I put the old password back both in the ilo and in ironic.


> If this is the case, you'll need to sync
> the password values manually:
> 
>   openstack baremetal node set --driver-info ilo_password=newpassword
> --driver-info ipmi_password=newpassword hpdl380.local162.lab

It's not the same case, so I don't think this command applies. I don't have ipmi information on driver_info, only ilo_*.

Comment 3 Julia Kreger 2023-07-12 13:53:18 UTC
We've run into a similar issue with the redfish hardware type interfaces. Actually, I need to add a patch to detect some ilos and reject the ilo driver on them and instead indicate "use redfish".

Anyhow!

A glance at the ironic code path doesn't appear to use caching, and we always read from the database when launching new tasks. The challenge is the proliantutils library, internally, has caching, and that is what is breaking things here.

It appears, at least looking at Wallaby branch upstream, and newer, they have mirrored the logic to spawn a new cached client and checked upon username, password, and address. However in the Train release of proliantutils, it had no such guard, and only relied upon the address of the BMC.

It appears they fixed this in proliantutils 2.9.2, in upstream change number 694448.

Looks like we can just cherry-pick the patch into place.

Comment 4 Eric Nothen 2023-07-12 14:02:44 UTC
Awesome, thanks for the update Julia.

There's a workaround in place at the moment, so we can wait for the fix.


Note You need to log in before you can comment on or make changes to this bug.