Description of problem: after overcloud deletion Ironic node Provisioning State randomly goes to clean failed (undercloud) [stack@director templates]$ openstack baremetal node list +--------------------------------------+-----------+---------------+-------------+--------------------+-------------+ | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | +--------------------------------------+-----------+---------------+-------------+--------------------+-------------+ | 2b1fab4a-fdc9-4d68-8f0f-7bf6c06c2430 | control1 | None | power off | available | False | | 58aecd47-7279-42fd-a912-a319b75831c9 | control2 | None | power off | clean failed | False | 2018-01-14 12:41:30.449 1544 ERROR ironic.drivers.modules.agent_base_vendor InstanceDeployFailure: Error rebooting node 58aecd47-7279-42fd-a912-a319b75831c9 after deploy. Error: DRAC operation failed. Reason: Unfinished config jobs found: [Job(id='JID_159042805063', name='ConfigBIOS:BIOS.Setup.1-1', start_time='TIME_NOW', until_time='TIME_NA', message='Task successfully scheduled.', status='Scheduled', percent_complete='0')]. Make sure they are completed before retrying. Version-Release number of selected component (if applicable): How reproducible: randomly Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
From the ironic/ironic-conductor.log I see thousands of these 2018-01-03 16:19:20.447 1531 WARNING dracclient.wsman [req-204fe8d0-42d4-4bae-9097-3f7e47f16446 - - - - -] A SSLError error occurred while communicating with 172.30.7.12, attempt 1 of 3: SSLError: EOF occurred in violation of protocol (_ssl.c:579) that have been occurring over the last 2 weeks. Although concerning, I don't think that's the root cause as I do see a successful cleaning occurring during that time: 2018-01-05 00:35:14.936 1516 INFO ironic.conductor.task_manager [req-4d096830-5c86-4680-b936-9eb1d3375464 - - - - -] Node 76b3e881-7ce7-4b4e-928b-aa9796036038 moved to provision state "available" from state "cleaning"; target provision state is "None" I also see lots of connection errors on 1/12 and 1/13 but I also don't think that this is source the problem as that may have been a temporary issue: 2018-01-03 14:54:12.465 1549 ERROR oslo.messaging._drivers.impl_rabbit [-] [3e23fd7e-512a-47af-9eaf-6c216b348eb5] AMQP server on 192.168.233.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 4 seconds. Client port: None: error: [Errno 111] ECONNREFUSED 2018-01-03 14:54:16.476 1549 ERROR ironic.common.rpc_service [-] Service error occurred when cleaning up the RPC manager. Error: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.168.233.1' ([Errno 111] ECONNREFUSED)"): DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.168.233.1' ([Errno 111] ECONNREFUSED)") 2018-01-03 14:54:16.476 1549 ERROR ironic.common.rpc_service Traceback (most recent call last): I think the problem are these DRAC "Unfinished config jobs" errors which cause cleaning to fail: e20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Failed to prepare node e57b5854-9c6d-48fb-825d-81acde677663 for cleaning: DRAC operation failed. Reason: Unfinished config jobs found: [Job(id='JID_149634821696', name='ConfigBIOS:BIOS.Setup.1-1', start_time='TIME_NOW', until_time='TIME_NA', message='Task successfully scheduled.', status='Scheduled', percent_complete='0')]. Make sure they are completed before retrying.: DracOperationError: DRAC operation failed. Reason: Unfinished config jobs found: [Job(id='JID_149634821696', name='ConfigBIOS:BIOS.Setup.1-1', start_time='TIME_NOW', until_time='TIME_NA', message='Task successfully scheduled.', status='Scheduled', percent_complete='0')]. Make sure they are completed before retrying. There are multiple of these errors that occurred on 1/13 and 1/14.
This appears to be upstream bug [0] and more generally [1]. [0] https://bugs.launchpad.net/ironic/+bug/1691808 [1] https://bugs.launchpad.net/python-dracclient/+bug/1697558 And also, as you indicated the Red Hat article https://access.redhat.com/solutions/3018101 is relevant here. Can you indicate the rpm versions of ironic* and python-dracclient that are installed?
Please find below details , so only option is to reboot / reset server $ grep -i ironic sos_commands/rpm/sh_-c_rpm_--nodigest_-qa_--qf_NAME_-_VERSION_-_RELEASE_._ARCH_INSTALLTIME_date_awk_-F_printf_-59s_s_n_1_2_sort_-f openstack-ironic-api-9.1.2-3.el7ost.noarch Wed Jan 3 10:58:42 2018 openstack-ironic-common-9.1.2-3.el7ost.noarch Wed Jan 3 10:58:34 2018 openstack-ironic-conductor-9.1.2-3.el7ost.noarch Wed Jan 3 10:58:49 2018 openstack-ironic-inspector-6.0.0-3.el7ost.noarch Wed Jan 3 10:59:00 2018 puppet-ironic-11.3.0-2.el7ost.noarch Wed Jan 3 10:19:04 2018 python-ironic-inspector-client-2.1.0-1.el7ost.noarch Wed Jan 3 10:19:11 2018 python-ironic-lib-2.10.0-1.el7ost.noarch Wed Jan 3 10:58:32 2018 python-ironicclient-1.17.0-1.el7ost.noarch Wed Jan 3 10:19:13 2018 $ grep -i drac sos_commands/rpm/sh_-c_rpm_--nodigest_-qa_--qf_NAME_-_VERSION_-_RELEASE_._ARCH_INSTALLTIME_date_awk_-F_printf_-59s_s_n_1_2_sort_-f python-dracclient-1.3.1-1.el7ost.noarch Wed Jan 3 10:58:32 2018
Feedback from a Dell engineer... ===== Well, I took a look on github in the 1.3.1 tag, and the code is present there: https://github.com/openstack/python-dracclient/blob/1.3.1/dracclient/client.py So it should be in that RPM (python-dracclient-1.3.1-1.el7ost.noarch) It looks to me like there are pending config jobs, and that’s what is causing cleaning to fail. I would have the customer check and make sure that the lifecycle controller is not in recovery mode on that node. If it is in recovery mode, take it out of recovery mode. Go in to the iDRAC GUI and look in the job queue what is the percentage completion of the pending job? If it is stuck at some percentage that’s greater than 0 and less than 100, then try deleting the config job through the GUI. If unable to delete it then the power cord will have to be pulled on the server, wait 1 minute, plug it back in, log back into the iDRAC GUI and delete the job. Hope this helps!
Just to clarify, they need to look at the job queue, find the pending job, and delete it. If it’s stuck at xx% and they can’t delete it, then they will need to do the unplug/hard reset.
>after job clear do we need to manually set ironic node state like ># ironic node-set-provision-state UUID manage Yes, set it to manage and back to provide.
since the Director had been configured for a clean-up of every node state change, it will start the cleaning process. below, changing the node to provide/available state will trigger the cleaning process which is then failed. The reason of failure when the cleaning process has been "DRAC operation failed. Reason: list index out of range" or sometime the other message that DRAC Job queue (while there is no job actually in the DRAC queue)... that is where the problem is. [stack@director ~]$ source stackrc (undercloud) [stack@director ~]$ openstack baremetal node list +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ | c386a46e-c5a6-4607-bf1d-b1582c97d4b2 | control1 | None | power off | available | False | | f444cd09-1fa9-46b2-8625-a2c18e6a0b14 | control2 | None | power off | available | False | | 77c2e906-db9a-4bda-b6dd-c464461bb944 | control3 | None | power off | available | False | | 64318762-c04e-4f44-88f8-cafa6997aae7 | compute1 | None | power off | available | False | | 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 | compute2 | None | power on | manageable | False | | 265de704-e549-4b6e-bec6-ff46a5cdeb67 | compute3 | None | power off | available | False | | 3ddab490-0ebc-4fe6-8a89-105ce1e0c441 | compute4 | None | power off | available | False | | 01ffb5c8-95fc-4d84-9ade-e2874e004d4f | network1 | None | power off | available | False | +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ (undercloud) [stack@director ~]$ ironic node-set-provision-state 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 provide (undercloud) [stack@director ~]$ openstack baremetal node list +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ | c386a46e-c5a6-4607-bf1d-b1582c97d4b2 | control1 | None | power off | available | False | | f444cd09-1fa9-46b2-8625-a2c18e6a0b14 | control2 | None | power off | available | False | | 77c2e906-db9a-4bda-b6dd-c464461bb944 | control3 | None | power off | available | False | | 64318762-c04e-4f44-88f8-cafa6997aae7 | compute1 | None | power off | available | False | | 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 | compute2 | None | power on | cleaning | False | | 265de704-e549-4b6e-bec6-ff46a5cdeb67 | compute3 | None | power off | available | False | | 3ddab490-0ebc-4fe6-8a89-105ce1e0c441 | compute4 | None | power off | available | False | | 01ffb5c8-95fc-4d84-9ade-e2874e004d4f | network1 | None | power off | available | False | +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ (undercloud) [stack@director ~]$ openstack baremetal node list +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ | UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance | +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ | c386a46e-c5a6-4607-bf1d-b1582c97d4b2 | control1 | None | power off | available | False | | f444cd09-1fa9-46b2-8625-a2c18e6a0b14 | control2 | None | power off | available | False | | 77c2e906-db9a-4bda-b6dd-c464461bb944 | control3 | None | power off | available | False | | 64318762-c04e-4f44-88f8-cafa6997aae7 | compute1 | None | power off | available | False | | 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 | compute2 | None | power off | clean failed | True | | 265de704-e549-4b6e-bec6-ff46a5cdeb67 | compute3 | None | power off | available | False | | 3ddab490-0ebc-4fe6-8a89-105ce1e0c441 | compute4 | None | power off | available | False | | 01ffb5c8-95fc-4d84-9ade-e2874e004d4f | network1 | None | power off | available | False | +--------------------------------------+----------+---------------+-------------+--------------------+-------------+ (undercloud) [stack@director ~]$ openstack baremetal node show compute2 +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Field | Value | +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | boot_interface | None | | chassis_uuid | None | | clean_step | {} | | console_enabled | False | | console_interface | None | | created_at | 2018-01-17T14:47:48+00:00 | | deploy_interface | None | | driver | pxe_drac | | driver_info | {u'drac_password': u'******', u'drac_username': u'root', u'deploy_ramdisk': u'24f54577-75e6-4939-ac28-13c5cfa330ed', u'deploy_kernel': u'44464311-dd9f-40c5-bfe2-338a6d2ee6f8', u'drac_address': u'172.30.7.16'} | | driver_internal_info | {u'agent_erase_devices_iterations': 1, u'agent_erase_devices_zeroize': True, u'drac_boot_device': {u'boot_device': u'pxe', u'persistent': False}, u'agent_continue_if_ata_erase_failed': False} | | extra | {u'hardware_swift_object': u'extra_hardware-2512c630-4a7d-49c5-b5d0-3dd3292c1e35'} | | inspect_interface | None | | inspection_finished_at | None | | inspection_started_at | None | | instance_info | {} | | instance_uuid | None | | last_error | None | | maintenance | True | | maintenance_reason | Failed to prepare node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 for cleaning: DRAC operation failed. Reason: list index out of range | | management_interface | None | | name | compute2 | | network_interface | flat | | power_interface | None | | power_state | power off | | properties | {u'memory_mb': u'262144', u'cpu_arch': u'x86_64', u'local_gb': u'1675', u'cpus': u'56', u'capabilities': u'profile:compute,cpu_hugepages:true,cpu_txt:true,cpu_vt:true,boot_option:local,cpu_aes:true,cpu_hugepages_1g:true,boot_mode:bios'} | | provision_state | clean failed | | provision_updated_at | 2018-01-19T07:23:38+00:00 | | raid_config | {} | | raid_interface | None | | reservation | None | | resource_class | baremetal | | storage_interface | noop | | target_power_state | None | | target_provision_state | available | | target_raid_config | {} | | updated_at | 2018-01-19T07:23:38+00:00 | | uuid | 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 | | vendor_interface | None | +------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ (undercloud) [stack@director ~]$"
Anil, can you look in /var/log/ironic/ironic-conductor.log on the director node and see if you can find a stack trace for the "list index out of range" error above?
new sos report on collabshell /cases/02011619/sosreport-20180122-161746/director.starhub.com below from ironic-conductor.log === 2018-01-18 19:31:28.723 1544 ERROR ironic.drivers.modules.drac.power [req-badb737d-a1c7-454c-a236-6c50b9c483d0 - - - - -] DRAC driver failed to get power state for node f444cd09-1fa9-46b2-8625-a2c18e6a0b14. Reason: Invalid response received. Status code: "401", reason: "Unauthorized".: WSManInvalidResponse: Invalid response received. Status code: "401", reason: "Unauthorized" 2018-01-18 19:31:28.725 1544 WARNING ironic.conductor.manager [req-badb737d-a1c7-454c-a236-6c50b9c483d0 - - - - -] During sync_power_state, could not get power state for node f444cd09-1fa9-46b2-8625-a2c18e6a0b14, attempt 1 of 3. Error: DRAC operation failed. Reason: Invalid response received. Status code: "401", reason: "Unauthorized".: DracOperationError: DRAC operation failed. Reason: Invalid response received. Status code: "401", reason: "Unauthorized" 2018-01-18 22:41:52.428 1544 WARNING ironic.conductor.utils [req-44214a0e-9a54-4296-b022-368555ceeeae 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter. 2018-01-18 22:42:29.376 1544 ERROR ironic.drivers.modules.drac.management [req-44214a0e-9a54-4296-b022-368555ceeeae 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range 2018-01-18 23:02:00.356 1544 WARNING ironic.conductor.utils [req-615ee86d-f1e6-475d-8ce2-70908b8527eb 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter. 2018-01-18 23:02:31.611 1544 ERROR ironic.drivers.modules.drac.management [req-615ee86d-f1e6-475d-8ce2-70908b8527eb 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range 2018-01-18 23:22:07.962 1544 WARNING ironic.conductor.utils [req-92a46a31-f267-4d7c-bba3-feb6e1821723 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter. 2018-01-18 23:22:09.325 1544 WARNING dracclient.wsman [req-92a46a31-f267-4d7c-bba3-feb6e1821723 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] A SSLError error occurred while communicating with 172.30.7.16, attempt 1 of 3: SSLError: EOF occurred in violation of protocol (_ssl.c:579) 2018-01-18 23:22:42.137 1544 ERROR ironic.drivers.modules.drac.management [req-92a46a31-f267-4d7c-bba3-feb6e1821723 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range 2018-01-19 15:22:51.881 1544 INFO ironic.conductor.task_manager [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 moved to provision state "cleaning" from state "manageable"; target provision state is "available" 2018-01-19 15:22:52.919 1544 INFO ironic.drivers.modules.network.flat [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Adding cleaning network to node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 2018-01-19 15:22:54.000 1544 INFO ironic.common.neutron [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Successfully created ports for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 in network 8d63ecb8-58ed-42dd-baf5-e3e00d69ec7e. 2018-01-19 15:22:55.368 1544 WARNING ironic.conductor.utils [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter. 2018-01-19 15:22:56.776 1544 WARNING dracclient.wsman [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] A SSLError error occurred while communicating with 172.30.7.16, attempt 1 of 3: SSLError: EOF occurred in violation of protocol (_ssl.c:579) 2018-01-19 15:23:30.113 1544 ERROR ironic.drivers.modules.drac.management [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Failed to prepare node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 for cleaning: DRAC operation failed. Reason: list index out of range: DracOperationError: DRAC operation failed. Reason: list index out of range 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager Traceback (most recent call last): 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 936, in _do_node_clean 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager prepare_result = task.driver.deploy.prepare_cleaning(task) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager result = f(*args, **kwargs) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 600, in prepare_cleaning 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager task, manage_boot=True) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py", line 982, in prepare_inband_cleaning 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager manager_utils.node_power_action(task, states.REBOOT) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager return f(*args, **kwargs) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 267, in node_power_action 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager fields.NotificationStatus.ERROR, new_state) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager self.force_reraise() 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager six.reraise(self.type_, self.value, self.tb) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 254, in node_power_action 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager task.driver.power.reboot(task) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager result = f(*args, **kwargs) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager return f(*args, **kwargs) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 180, in reboot 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager _set_power_state(task.node, target_power_state) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 100, in _set_power_state 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager _commit_boot_list_change(node) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 76, in _commit_boot_list_change 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager boot_device['persistent']) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/management.py", line 98, in set_boot_device 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager current_boot_device = _get_boot_device(node, drac_boot_devices) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/management.py", line 74, in _get_boot_device 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager raise exception.DracOperationError(error=exc) 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager DracOperationError: DRAC operation failed. Reason: list index out of range 2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager 2018-01-19 15:23:37.148 1544 WARNING ironic.conductor.utils [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Not going to change node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 power state because current state = requested state = 'power off'.: DracOperationError: DRAC operation failed. Reason: list index out of range 2018-01-19 15:23:37.150 1544 INFO ironic.drivers.modules.network.flat [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Removing ports from cleaning network for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 2018-01-19 15:23:38.200 1544 INFO ironic.common.neutron [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Successfully removed node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 neutron ports. 2018-01-19 15:23:38.215 1544 INFO ironic.conductor.task_manager [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 moved to provision state "clean failed" from state "cleaning"; target provision state is "available === fresh sos === to provision state "cleaning" from state "manageable"; target provision state is "available" 2018-01-22 16:14:53.977 1555 INFO ironic.drivers.modules.network.flat [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Adding cleaning network to node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 2018-01-22 16:14:55.148 1555 INFO ironic.common.neutron [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Successfully created ports for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 in network 8d63ecb8-58ed-42dd-baf5-e3e00d69ec7e. 2018-01-22 16:14:57.502 1555 WARNING ironic.conductor.utils [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter. 2018-01-22 16:15:28.190 1555 ERROR ironic.drivers.modules.drac.management [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Failed to prepare node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 for cleaning: DRAC operation failed. Reason: list index out of range: DracOperationError: DRAC operation failed. Reason: list index out of range 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager Traceback (most recent call last): 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 936, in _do_node_clean 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager prepare_result = task.driver.deploy.prepare_cleaning(task) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager result = f(*args, **kwargs) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 600, in prepare_cleaning 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager task, manage_boot=True) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py", line 982, in prepare_inband_cleaning 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager manager_utils.node_power_action(task, states.REBOOT) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager return f(*args, **kwargs) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 267, in node_power_action 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager fields.NotificationStatus.ERROR, new_state) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager self.force_reraise() 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager six.reraise(self.type_, self.value, self.tb) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 254, in node_power_action 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager task.driver.power.reboot(task) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager result = f(*args, **kwargs) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager return f(*args, **kwargs) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 180, in reboot 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager _set_power_state(task.node, target_power_state) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 100, in _set_power_state 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager _commit_boot_list_change(node) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 76, in _commit_boot_list_change 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager boot_device['persistent']) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/management.py", line 98, in set_boot_device 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager current_boot_device = _get_boot_device(node, drac_boot_devices) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/management.py", line 74, in _get_boot_device 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager raise exception.DracOperationError(error=exc) 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager DracOperationError: DRAC operation failed. Reason: list index out of range 2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager 2018-01-22 16:15:34.100 1555 WARNING ironic.conductor.utils [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Not going to change node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 power state because current state = requested state = 'power off'.: DracOperationError: DRAC operation failed. Reason: list index out of range
To debug further, I need the contents of DCIM_BootConfigSetting and DCIM_BootSourceSetting from the iDRAC on the node having the issue. You can get this by installing the wsmancli RPM: # yum install wsmancli -y Then running the following commands: # wsman enumerate http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_BootConfigSetting -P 443 -h <IPAddress> -u root -p <iDRACPassword> -V -c ~/.ssh/id_rsa.pub > bcs.txt # wsman enumerate http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_BootSourceSetting -P 443 -h <IPAddress> -u root -p <iDRACPassword> -V -c ~/.ssh/id_rsa.pub > bss.txt Please attach the resulting bcs.txt and bss.txt files. Thanks!
Created attachment 1385929 [details] bcs.txt
Created attachment 1385930 [details] bss.txt
Hey Anil, that iDRAC is in a very strange state. In DCIM_BootConfigSetting, it is listing only vFlash as a possible boot device, and it's not set as the next boot device. It should be listing several entries: IPL, BCV, UEFI, OneTime, in addition to vFlash. The best thing to do is hard-reset the iDRAC by unplugging the server from all power, waiting for 1 minute, and then plugging it back in. Next, I would log in to the iDRAC GUI and bring up the virtual console. Power the server up and wait for it to go through the firmware initialization and try to boot. Power the server back off. Rerun the above command with DCIM_BootConfigSetting and verify you get back multiple entries as indicated above. If you do, then you should be good to go. If you don't, then you have a hardware problem, and I would proceed with hardware diagnostics to determine the issue. You could try querying DCIM_PhysicalDiskView to see if the iDRAC can see any disks. You could also try querying DCIM_ControllerView to see if the iDRAC can see a disk controller. In addition, it's very strange that no entries are returned from the DCIM_BootSourceSetting query above. Are you sure you entered it correctly? A hard-reset should also solve this problem. Hope this helps!
Hey Chris, So actually reset, hard reset and power cycle by removing all power cables, and also run a successful update from Lifecycle Controller, but only updating the network NIC firmware since there was no newer iDRAC or Lifecycle Controller firmware that time. It still has the same problem. Also Cu contact Dell Support but they cannot find any problem with the hardware, and functionality wise via web and Java Console the iDRAC is working properly. We can power on, power off, graceful shutdown, reboot, vFlash working (able to install Ubuntu using virtual DVD on it), can alter boot sequence etc.
Are you leaving the vFlash card in the card slot? If so, remove it. Boot the node into the lifecycle controller. Make sure the node is set to Legacy Boot and not UEFI boot. Examine the boot source list. Judging from the above, it sounds like it will show only vflash. Check to see if the other boot sources are listed as disabled. Enable the PXE boot source and the hard disk boot source. If PXE boot is not listed as an option, then examine the NICs. One NIC port should be set to PXE boot. If none are set to PXE boot then enable PXE booting on the appropriate NIC port. Go back and look at the boot sources again. Make sure the PXE boot source is enabled.
I'm closing this as its working with pxe_ipmitool but leaving the NEEDINFO on to get the requested data for Chris.