Bug 1534551 - Ironic node using dracclient goes to clean failed with "Unfinished config jobs found"
Summary: Ironic node using dracclient goes to clean failed with "Unfinished config job...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 12.0 (Pike)
Hardware: x86_64
OS: Unspecified
high
high
Target Milestone: ---
: ---
Assignee: RHOS Maint
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-01-15 13:24 UTC by Anil Dhingra
Modified: 2021-03-11 16:54 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-15 15:05:04 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
bcs.txt (1.86 KB, text/plain)
2018-01-25 04:45 UTC, Anil Dhingra
no flags Details
bss.txt (1.46 KB, text/plain)
2018-01-25 04:46 UTC, Anil Dhingra
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1697558 0 None None None 2018-01-15 19:16:48 UTC
OpenStack gerrit 487234 0 'None' MERGED Simplify get Lifecycle Controller version 2021-02-03 14:26:54 UTC

Description Anil Dhingra 2018-01-15 13:24:18 UTC
Description of problem:
after overcloud deletion Ironic node Provisioning State randomly goes to clean failed 

(undercloud) [stack@director templates]$ openstack baremetal node list
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name      | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
| 2b1fab4a-fdc9-4d68-8f0f-7bf6c06c2430 | control1  | None          | power off   | available          | False       |
| 58aecd47-7279-42fd-a912-a319b75831c9 | control2  | None          | power off   | clean failed       | False       |

2018-01-14 12:41:30.449 1544 ERROR ironic.drivers.modules.agent_base_vendor InstanceDeployFailure: Error rebooting node 58aecd47-7279-42fd-a912-a319b75831c9 after deploy. Error: DRAC operation failed. Reason: Unfinished config jobs found: [Job(id='JID_159042805063', name='ConfigBIOS:BIOS.Setup.1-1', start_time='TIME_NOW', until_time='TIME_NA', message='Task successfully scheduled.', status='Scheduled', percent_complete='0')]. Make sure they are completed before retrying.

Version-Release number of selected component (if applicable):


How reproducible:
randomly

Steps to Reproduce:
1.
2.
3.

Actual results:




Expected results:


Additional info:

Comment 2 Bob Fournier 2018-01-15 18:42:57 UTC
From the ironic/ironic-conductor.log I see thousands of these 
2018-01-03 16:19:20.447 1531 WARNING dracclient.wsman [req-204fe8d0-42d4-4bae-9097-3f7e47f16446 - - - - -] A SSLError error occurred while  communicating with 172.30.7.12, attempt 1 of 3: SSLError: EOF occurred in violation of protocol (_ssl.c:579)

that have been occurring over the last 2 weeks. Although concerning, I don't think that's the root cause as I do see a successful cleaning occurring during that time:
2018-01-05 00:35:14.936 1516 INFO ironic.conductor.task_manager [req-4d096830-5c86-4680-b936-9eb1d3375464 - - - - -] Node 76b3e881-7ce7-4b4e-928b-aa9796036038 moved to provision state "available" from state "cleaning"; target provision state is "None"

I also see lots of connection errors on 1/12 and 1/13 but I also don't think that this is source the problem as that may have been a temporary issue:
2018-01-03 14:54:12.465 1549 ERROR oslo.messaging._drivers.impl_rabbit [-] [3e23fd7e-512a-47af-9eaf-6c216b348eb5] AMQP server on 192.168.233.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 4 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2018-01-03 14:54:16.476 1549 ERROR ironic.common.rpc_service [-] Service error occurred when cleaning up the RPC manager. Error: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.168.233.1' ([Errno 111] ECONNREFUSED)"): DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.168.233.1' ([Errno 111] ECONNREFUSED)")
2018-01-03 14:54:16.476 1549 ERROR ironic.common.rpc_service Traceback (most recent call last):

I think the problem are these DRAC "Unfinished config jobs" errors which cause cleaning to fail:
e20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Failed to prepare node e57b5854-9c6d-48fb-825d-81acde677663 for cleaning: DRAC operation failed. Reason: Unfinished config jobs found: [Job(id='JID_149634821696', name='ConfigBIOS:BIOS.Setup.1-1', start_time='TIME_NOW', until_time='TIME_NA', message='Task successfully scheduled.', status='Scheduled', percent_complete='0')]. Make sure they are completed before retrying.: DracOperationError: DRAC operation failed. Reason: Unfinished config jobs found: [Job(id='JID_149634821696', name='ConfigBIOS:BIOS.Setup.1-1', start_time='TIME_NOW', until_time='TIME_NA', message='Task successfully scheduled.', status='Scheduled', percent_complete='0')]. Make sure they are completed before retrying.

There are multiple of these errors that occurred on 1/13 and 1/14.

Comment 3 Bob Fournier 2018-01-15 19:08:44 UTC
This appears to be upstream bug [0] and more generally [1].

[0] https://bugs.launchpad.net/ironic/+bug/1691808
[1] https://bugs.launchpad.net/python-dracclient/+bug/1697558

And also, as you indicated the Red Hat article https://access.redhat.com/solutions/3018101 is relevant here.

Can you indicate the rpm versions of ironic* and python-dracclient that are installed?

Comment 4 Anil Dhingra 2018-01-16 03:47:15 UTC
Please find below details , so only option is to reboot / reset server

$ grep -i ironic sos_commands/rpm/sh_-c_rpm_--nodigest_-qa_--qf_NAME_-_VERSION_-_RELEASE_._ARCH_INSTALLTIME_date_awk_-F_printf_-59s_s_n_1_2_sort_-f 
openstack-ironic-api-9.1.2-3.el7ost.noarch                  Wed Jan  3 10:58:42 2018
openstack-ironic-common-9.1.2-3.el7ost.noarch               Wed Jan  3 10:58:34 2018
openstack-ironic-conductor-9.1.2-3.el7ost.noarch            Wed Jan  3 10:58:49 2018
openstack-ironic-inspector-6.0.0-3.el7ost.noarch            Wed Jan  3 10:59:00 2018
puppet-ironic-11.3.0-2.el7ost.noarch                        Wed Jan  3 10:19:04 2018
python-ironic-inspector-client-2.1.0-1.el7ost.noarch        Wed Jan  3 10:19:11 2018
python-ironic-lib-2.10.0-1.el7ost.noarch                    Wed Jan  3 10:58:32 2018
python-ironicclient-1.17.0-1.el7ost.noarch                  Wed Jan  3 10:19:13 2018

$ grep -i drac sos_commands/rpm/sh_-c_rpm_--nodigest_-qa_--qf_NAME_-_VERSION_-_RELEASE_._ARCH_INSTALLTIME_date_awk_-F_printf_-59s_s_n_1_2_sort_-f 

python-dracclient-1.3.1-1.el7ost.noarch                     Wed Jan  3 10:58:32 2018

Comment 5 Bob Fournier 2018-01-16 15:17:23 UTC
Feedback from a Dell engineer...

=====

Well, I took a look on github in the 1.3.1 tag, and the code is present there:
https://github.com/openstack/python-dracclient/blob/1.3.1/dracclient/client.py
 
So it should be in that RPM (python-dracclient-1.3.1-1.el7ost.noarch)
 
It looks to me like there are pending config jobs, and that’s what is causing cleaning to fail.  I would have the customer check and make sure that the lifecycle controller is not in recovery mode on that node.  If it is in recovery mode, take it out of recovery mode.
 
Go in to the iDRAC GUI and look in the job queue what is the percentage completion of the pending job?  If it is stuck at some percentage that’s greater than 0 and less than 100, then try deleting the config job through the GUI.  If unable to delete it then the power cord will have to be pulled on the server, wait 1 minute, plug it back in, log back into the iDRAC GUI and delete the job.

Hope this helps!

Comment 6 Bob Fournier 2018-01-17 16:13:44 UTC

Just to clarify, they need to look at the job queue, find the pending job, and delete it.  If it’s stuck at xx% and they can’t delete it, then they will need to do the unplug/hard reset.

Comment 8 Bob Fournier 2018-01-19 02:07:14 UTC
>after job clear do we need to manually set ironic node state like 
># ironic node-set-provision-state UUID manage

Yes, set it to manage and back to provide.

Comment 9 Anil Dhingra 2018-01-19 09:28:02 UTC
since the Director had been configured for a clean-up of every node state change, it will start the cleaning process. below, changing the node to provide/available state will trigger the cleaning process which is then failed.
The reason of failure when the cleaning process has been "DRAC operation failed. Reason: list index out of range" or sometime the other message that DRAC Job queue (while there is no job actually in the DRAC queue)... that is where the problem is.

[stack@director ~]$ source stackrc
(undercloud) [stack@director ~]$ openstack baremetal node list
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name     | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
| c386a46e-c5a6-4607-bf1d-b1582c97d4b2 | control1 | None          | power off   | available          | False       |
| f444cd09-1fa9-46b2-8625-a2c18e6a0b14 | control2 | None          | power off   | available          | False       |
| 77c2e906-db9a-4bda-b6dd-c464461bb944 | control3 | None          | power off   | available          | False       |
| 64318762-c04e-4f44-88f8-cafa6997aae7 | compute1 | None          | power off   | available          | False       |
| 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 | compute2 | None          | power on    | manageable         | False       |
| 265de704-e549-4b6e-bec6-ff46a5cdeb67 | compute3 | None          | power off   | available          | False       |
| 3ddab490-0ebc-4fe6-8a89-105ce1e0c441 | compute4 | None          | power off   | available          | False       |
| 01ffb5c8-95fc-4d84-9ade-e2874e004d4f | network1 | None          | power off   | available          | False       |
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
(undercloud) [stack@director ~]$ ironic node-set-provision-state 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 provide
(undercloud) [stack@director ~]$ openstack baremetal node list
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name     | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
| c386a46e-c5a6-4607-bf1d-b1582c97d4b2 | control1 | None          | power off   | available          | False       |
| f444cd09-1fa9-46b2-8625-a2c18e6a0b14 | control2 | None          | power off   | available          | False       |
| 77c2e906-db9a-4bda-b6dd-c464461bb944 | control3 | None          | power off   | available          | False       |
| 64318762-c04e-4f44-88f8-cafa6997aae7 | compute1 | None          | power off   | available          | False       |
| 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 | compute2 | None          | power on    | cleaning           | False       |
| 265de704-e549-4b6e-bec6-ff46a5cdeb67 | compute3 | None          | power off   | available          | False       |
| 3ddab490-0ebc-4fe6-8a89-105ce1e0c441 | compute4 | None          | power off   | available          | False       |
| 01ffb5c8-95fc-4d84-9ade-e2874e004d4f | network1 | None          | power off   | available          | False       |
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
(undercloud) [stack@director ~]$ openstack baremetal node list
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name     | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
| c386a46e-c5a6-4607-bf1d-b1582c97d4b2 | control1 | None          | power off   | available          | False       |
| f444cd09-1fa9-46b2-8625-a2c18e6a0b14 | control2 | None          | power off   | available          | False       |
| 77c2e906-db9a-4bda-b6dd-c464461bb944 | control3 | None          | power off   | available          | False       |
| 64318762-c04e-4f44-88f8-cafa6997aae7 | compute1 | None          | power off   | available          | False       |
| 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 | compute2 | None          | power off   | clean failed       | True        |
| 265de704-e549-4b6e-bec6-ff46a5cdeb67 | compute3 | None          | power off   | available          | False       |
| 3ddab490-0ebc-4fe6-8a89-105ce1e0c441 | compute4 | None          | power off   | available          | False       |
| 01ffb5c8-95fc-4d84-9ade-e2874e004d4f | network1 | None          | power off   | available          | False       |
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
(undercloud) [stack@director ~]$ openstack baremetal node show compute2
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                  | Value                                                                                                                                                                                                                                        |
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| boot_interface         | None                                                                                                                                                                                                                                         |
| chassis_uuid           | None                                                                                                                                                                                                                                         |
| clean_step             | {}                                                                                                                                                                                                                                           |
| console_enabled        | False                                                                                                                                                                                                                                        |
| console_interface      | None                                                                                                                                                                                                                                         |
| created_at             | 2018-01-17T14:47:48+00:00                                                                                                                                                                                                                    |
| deploy_interface       | None                                                                                                                                                                                                                                         |
| driver                 | pxe_drac                                                                                                                                                                                                                                     |
| driver_info            | {u'drac_password': u'******', u'drac_username': u'root', u'deploy_ramdisk': u'24f54577-75e6-4939-ac28-13c5cfa330ed', u'deploy_kernel': u'44464311-dd9f-40c5-bfe2-338a6d2ee6f8', u'drac_address': u'172.30.7.16'}                             |
| driver_internal_info   | {u'agent_erase_devices_iterations': 1, u'agent_erase_devices_zeroize': True, u'drac_boot_device': {u'boot_device': u'pxe', u'persistent': False}, u'agent_continue_if_ata_erase_failed': False}                                              |
| extra                  | {u'hardware_swift_object': u'extra_hardware-2512c630-4a7d-49c5-b5d0-3dd3292c1e35'}                                                                                                                                                           |
| inspect_interface      | None                                                                                                                                                                                                                                         |
| inspection_finished_at | None                                                                                                                                                                                                                                         |
| inspection_started_at  | None                                                                                                                                                                                                                                         |
| instance_info          | {}                                                                                                                                                                                                                                           |
| instance_uuid          | None                                                                                                                                                                                                                                         |
| last_error             | None                                                                                                                                                                                                                                         |
| maintenance            | True                                                                                                                                                                                                                                         |
| maintenance_reason     | Failed to prepare node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 for cleaning: DRAC operation failed. Reason: list index out of range                                                                                                             |
| management_interface   | None                                                                                                                                                                                                                                         |
| name                   | compute2                                                                                                                                                                                                                                     |
| network_interface      | flat                                                                                                                                                                                                                                         |
| power_interface        | None                                                                                                                                                                                                                                         |
| power_state            | power off                                                                                                                                                                                                                                    |
| properties             | {u'memory_mb': u'262144', u'cpu_arch': u'x86_64', u'local_gb': u'1675', u'cpus': u'56', u'capabilities': u'profile:compute,cpu_hugepages:true,cpu_txt:true,cpu_vt:true,boot_option:local,cpu_aes:true,cpu_hugepages_1g:true,boot_mode:bios'} |
| provision_state        | clean failed                                                                                                                                                                                                                                 |
| provision_updated_at   | 2018-01-19T07:23:38+00:00                                                                                                                                                                                                                    |
| raid_config            | {}                                                                                                                                                                                                                                           |
| raid_interface         | None                                                                                                                                                                                                                                         |
| reservation            | None                                                                                                                                                                                                                                         |
| resource_class         | baremetal                                                                                                                                                                                                                                    |
| storage_interface      | noop                                                                                                                                                                                                                                         |
| target_power_state     | None                                                                                                                                                                                                                                         |
| target_provision_state | available                                                                                                                                                                                                                                    |
| target_raid_config     | {}                                                                                                                                                                                                                                           |
| updated_at             | 2018-01-19T07:23:38+00:00                                                                                                                                                                                                                    |
| uuid                   | 2512c630-4a7d-49c5-b5d0-3dd3292c1e35                                                                                                                                                                                                         |
| vendor_interface       | None                                                                                                                                                                                                                                         |
+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
(undercloud) [stack@director ~]$"

Comment 10 Chris Dearborn 2018-01-19 16:20:12 UTC
Anil, can you look in /var/log/ironic/ironic-conductor.log on the director node and see if you can find a stack trace for the "list index out of range" error above?

Comment 11 Anil Dhingra 2018-01-23 01:38:32 UTC
new sos report on collabshell

/cases/02011619/sosreport-20180122-161746/director.starhub.com

below from ironic-conductor.log
===
2018-01-18 19:31:28.723 1544 ERROR ironic.drivers.modules.drac.power [req-badb737d-a1c7-454c-a236-6c50b9c483d0 - - - - -] DRAC driver failed to get power state for node f444cd09-1fa9-46b2-8625-a2c18e6a0b14. Reason: Invalid response received. Status code: "401", reason: "Unauthorized".: WSManInvalidResponse: Invalid response received. Status code: "401", reason: "Unauthorized"
2018-01-18 19:31:28.725 1544 WARNING ironic.conductor.manager [req-badb737d-a1c7-454c-a236-6c50b9c483d0 - - - - -] During sync_power_state, could not get power state for node f444cd09-1fa9-46b2-8625-a2c18e6a0b14, attempt 1 of 3. Error: DRAC operation failed. Reason: Invalid response received. Status code: "401", reason: "Unauthorized".: DracOperationError: DRAC operation failed. Reason: Invalid response received. Status code: "401", reason: "Unauthorized"
2018-01-18 22:41:52.428 1544 WARNING ironic.conductor.utils [req-44214a0e-9a54-4296-b022-368555ceeeae 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter.
2018-01-18 22:42:29.376 1544 ERROR ironic.drivers.modules.drac.management [req-44214a0e-9a54-4296-b022-368555ceeeae 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range
2018-01-18 23:02:00.356 1544 WARNING ironic.conductor.utils [req-615ee86d-f1e6-475d-8ce2-70908b8527eb 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter.
2018-01-18 23:02:31.611 1544 ERROR ironic.drivers.modules.drac.management [req-615ee86d-f1e6-475d-8ce2-70908b8527eb 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range
2018-01-18 23:22:07.962 1544 WARNING ironic.conductor.utils [req-92a46a31-f267-4d7c-bba3-feb6e1821723 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter.

2018-01-18 23:22:09.325 1544 WARNING dracclient.wsman [req-92a46a31-f267-4d7c-bba3-feb6e1821723 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] A SSLError error occurred while  communicating with 172.30.7.16, attempt 1 of 3: SSLError: EOF occurred in violation of protocol (_ssl.c:579)
2018-01-18 23:22:42.137 1544 ERROR ironic.drivers.modules.drac.management [req-92a46a31-f267-4d7c-bba3-feb6e1821723 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range
2018-01-19 15:22:51.881 1544 INFO ironic.conductor.task_manager [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 moved to provision state "cleaning" from state "manageable"; target provision state is "available"
2018-01-19 15:22:52.919 1544 INFO ironic.drivers.modules.network.flat [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Adding cleaning network to node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35
2018-01-19 15:22:54.000 1544 INFO ironic.common.neutron [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Successfully created ports for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 in network 8d63ecb8-58ed-42dd-baf5-e3e00d69ec7e.
2018-01-19 15:22:55.368 1544 WARNING ironic.conductor.utils [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter.
2018-01-19 15:22:56.776 1544 WARNING dracclient.wsman [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] A SSLError error occurred while  communicating with 172.30.7.16, attempt 1 of 3: SSLError: EOF occurred in violation of protocol (_ssl.c:579)
2018-01-19 15:23:30.113 1544 ERROR ironic.drivers.modules.drac.management [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Failed to prepare node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 for cleaning: DRAC operation failed. Reason: list index out of range: DracOperationError: DRAC operation failed. Reason: list index out of range
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager Traceback (most recent call last):
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 936, in _do_node_clean
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     prepare_result = task.driver.deploy.prepare_cleaning(task)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     result = f(*args, **kwargs)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 600, in prepare_cleaning
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     task, manage_boot=True)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py", line 982, in prepare_inband_cleaning
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     manager_utils.node_power_action(task, states.REBOOT)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     return f(*args, **kwargs)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 267, in node_power_action
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     fields.NotificationStatus.ERROR, new_state)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     self.force_reraise()
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     six.reraise(self.type_, self.value, self.tb)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 254, in node_power_action
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     task.driver.power.reboot(task)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     result = f(*args, **kwargs)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     return f(*args, **kwargs)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 180, in reboot
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     _set_power_state(task.node, target_power_state)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 100, in _set_power_state
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     _commit_boot_list_change(node)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 76, in _commit_boot_list_change
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     boot_device['persistent'])
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/management.py", line 98, in set_boot_device
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     current_boot_device = _get_boot_device(node, drac_boot_devices)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/management.py", line 74, in _get_boot_device
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager     raise exception.DracOperationError(error=exc)
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager DracOperationError: DRAC operation failed. Reason: list index out of range
2018-01-19 15:23:30.123 1544 ERROR ironic.conductor.manager
2018-01-19 15:23:37.148 1544 WARNING ironic.conductor.utils [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Not going to change node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 power state because current state = requested state = 'power off'.: DracOperationError: DRAC operation failed. Reason: list index out of range
2018-01-19 15:23:37.150 1544 INFO ironic.drivers.modules.network.flat [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Removing ports from cleaning network for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35
2018-01-19 15:23:38.200 1544 INFO ironic.common.neutron [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Successfully removed node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 neutron ports.
2018-01-19 15:23:38.215 1544 INFO ironic.conductor.task_manager [req-01213009-97f0-447f-8fc6-f43ec7a377ad 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 moved to provision state "clean failed" from state "cleaning"; target provision state is "available

=== fresh sos ===

 to provision state "cleaning" from state "manageable"; target provision state is "available"
2018-01-22 16:14:53.977 1555 INFO ironic.drivers.modules.network.flat [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Adding cleaning network to node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35
2018-01-22 16:14:55.148 1555 INFO ironic.common.neutron [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Successfully created ports for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 in network 8d63ecb8-58ed-42dd-baf5-e3e00d69ec7e.
2018-01-22 16:14:57.502 1555 WARNING ironic.conductor.utils [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] The reboot method of pxe_drac doesn't support 'timeout' parameter.
2018-01-22 16:15:28.190 1555 ERROR ironic.drivers.modules.drac.management [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] DRAC driver failed to get next boot mode for node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35. Reason: list index out of range.: IndexError: list index out of range
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Failed to prepare node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 for cleaning: DRAC operation failed. Reason: list index out of range: DracOperationError: DRAC operation failed. Reason: list index out of range
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager Traceback (most recent call last):
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 936, in _do_node_clean
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     prepare_result = task.driver.deploy.prepare_cleaning(task)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     result = f(*args, **kwargs)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 600, in prepare_cleaning
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     task, manage_boot=True)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py", line 982, in prepare_inband_cleaning
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     manager_utils.node_power_action(task, states.REBOOT)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     return f(*args, **kwargs)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 267, in node_power_action
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     fields.NotificationStatus.ERROR, new_state)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     self.force_reraise()
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     six.reraise(self.type_, self.value, self.tb)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 254, in node_power_action
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     task.driver.power.reboot(task)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     result = f(*args, **kwargs)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     return f(*args, **kwargs)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 180, in reboot
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     _set_power_state(task.node, target_power_state)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 100, in _set_power_state
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     _commit_boot_list_change(node)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/power.py", line 76, in _commit_boot_list_change
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     boot_device['persistent'])
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/management.py", line 98, in set_boot_device
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     current_boot_device = _get_boot_device(node, drac_boot_devices)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/drac/management.py", line 74, in _get_boot_device
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager     raise exception.DracOperationError(error=exc)
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager DracOperationError: DRAC operation failed. Reason: list index out of range
2018-01-22 16:15:28.201 1555 ERROR ironic.conductor.manager
2018-01-22 16:15:34.100 1555 WARNING ironic.conductor.utils [req-ae0cfbd0-b023-4ea0-81b2-738f5da9cbab 39eed5a9cda243b18cb0778c1f20b33f f1ccaa6b493f4e60bcdbe06c37453d93 - default default] Not going to change node 2512c630-4a7d-49c5-b5d0-3dd3292c1e35 power state because current state = requested state = 'power off'.: DracOperationError: DRAC operation failed. Reason: list index out of range

Comment 12 Chris Dearborn 2018-01-24 16:58:30 UTC
To debug further, I need the contents of DCIM_BootConfigSetting and DCIM_BootSourceSetting from the iDRAC on the node having the issue.  You can get this by installing the wsmancli RPM:
# yum install wsmancli -y

Then running the following commands:

# wsman enumerate http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_BootConfigSetting -P 443 -h <IPAddress> -u root -p <iDRACPassword> -V -c ~/.ssh/id_rsa.pub > bcs.txt
# wsman enumerate http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_BootSourceSetting -P 443 -h <IPAddress> -u root -p <iDRACPassword> -V -c ~/.ssh/id_rsa.pub > bss.txt

Please attach the resulting bcs.txt and bss.txt files.

Thanks!

Comment 13 Anil Dhingra 2018-01-25 04:45:28 UTC
Created attachment 1385929 [details]
bcs.txt

Comment 14 Anil Dhingra 2018-01-25 04:46:04 UTC
Created attachment 1385930 [details]
bss.txt

Comment 15 Chris Dearborn 2018-01-25 20:21:58 UTC
Hey Anil, that iDRAC is in a very strange state.  In DCIM_BootConfigSetting, it is listing only vFlash as a possible boot device, and it's not set as the next boot device.  It should be listing several entries: IPL, BCV, UEFI, OneTime, in addition to vFlash.

The best thing to do is hard-reset the iDRAC by unplugging the server from all power, waiting for 1 minute, and then plugging it back in.  Next, I would log in to the iDRAC GUI and bring up the virtual console.  Power the server up and wait for it to go through the firmware initialization and try to boot.  Power the server back off.  Rerun the above command with DCIM_BootConfigSetting and verify you get back multiple entries as indicated above.  If you do, then you should be good to go.  If you don't, then you have a hardware problem, and I would proceed with hardware diagnostics to determine the issue.  You could try querying DCIM_PhysicalDiskView to see if the iDRAC can see any disks.  You could also try querying DCIM_ControllerView to see if the iDRAC can see a disk controller.

In addition, it's very strange that no entries are returned from the DCIM_BootSourceSetting query above.  Are you sure you entered it correctly?  A hard-reset should also solve this problem.

Hope this helps!

Comment 16 rohit londhe 2018-02-01 03:40:22 UTC
Hey Chris,

So actually reset, hard reset and power cycle by removing all power cables, and also run a successful update from Lifecycle Controller, but only updating the network NIC firmware since there was no newer iDRAC or Lifecycle Controller firmware that time. It still has the same problem.

Also Cu contact Dell Support but they cannot find any problem with the hardware, and functionality wise via web and Java Console the iDRAC is working properly. We can power on, power off, graceful shutdown, reboot, vFlash working (able to install Ubuntu using virtual DVD on it),  can alter boot sequence etc.

Comment 18 Chris Dearborn 2018-02-01 14:46:03 UTC
Are you leaving the vFlash card in the card slot?  If so, remove it.

Boot the node into the lifecycle controller.  Make sure the node is set to Legacy Boot and not UEFI boot.  Examine the boot source list.  Judging from the above, it sounds like it will show only vflash.  Check to see if the other boot sources are listed as disabled.  Enable the PXE boot source and the hard disk boot source.  If PXE boot is not listed as an option, then examine the NICs.  One NIC port should be set to PXE boot.  If none are set to PXE boot then enable PXE booting on the appropriate NIC port.  Go back and look at the boot sources again.  Make sure the PXE boot source is enabled.

Comment 19 Bob Fournier 2018-02-15 15:05:04 UTC
I'm closing this as its working with pxe_ipmitool but leaving the NEEDINFO on to get the requested data for Chris.


Note You need to log in before you can comment on or make changes to this bug.