Description of problem:

I'm unable to introspect OVB-provided nodes with the pxe_ipmitool configuration that was working for 16.0 and older.

Version-Release number of selected component (if applicable):

openstack-tripleo-puppet-elements-11.2.2-0.20200311084936.a6fef08.el8ost.noarch
python3-tripleoclient-12.3.2-0.20200311145124.a13fe7d.el8ost.noarch
puppet-tripleo-11.4.1-0.20200311163439.cfd3f34.el8ost.noarch
python3-tripleo-common-11.3.3-0.20200311121708.7be309b.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200311121708.7be309b.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200311051343.4461ca1.el8ost.noarch
ansible-tripleo-ipsec-9.2.1-0.20200311073016.0c8693c.el8ost.noarch
ansible-role-tripleo-modify-image-1.1.1-0.20200311081746.bb6f78d.el8ost.noarch
puppet-ironic-15.4.1-0.20200311084019.2f91ad1.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200311183433.9ca7403.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200311165442.ee04497.el8ost.noarch
python3-ironic-inspector-client-3.7.0-0.20200310220030.d95a4cd.el8ost.noarch
tripleo-ansible-0.4.2-0.20200311125545.00cb5fc.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200311145124.a13fe7d.el8ost.noarch
python3-ironicclient-3.1.1-0.20200310190222.8188c01.el8ost.noarch
openstack-ironic-python-agent-builder-1.1.1-0.20200311071127.b32f4ea.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200311121708.7be309b.el8ost.noarch

How reproducible:

Steps to Reproduce:
1. Deploy the undercloud
2. openstack overcloud node import env.json

Example env.json file:

{
  "nodes": [
    {
      "pm_type": "pxe_ipmitool",
      "mac": ["fa:16:3e:b2:a6:d0"],
      "cpu": 4,
      "memory": 8192,
      "disk": 60,
      "arch": "x86_64",
      "pm_user": "admin",
      "pm_password": "password",
      "pm_addr": "192.168.1.15",
      "capabilities": "boot_option:local",
      "name": "aschultz-baremetal-0"
    },
    {
      "pm_type": "pxe_ipmitool",
      "mac": ["fa:16:3e:4d:ef:17"],
      "cpu": 4,
      "memory": 8192,
      "disk": 60,
      "arch": "x86_64",
      "pm_user": "admin",
      "pm_password": "password",
      "pm_addr": "192.168.1.9",
      "capabilities": "boot_option:local",
      "name": "aschultz-baremetal-1"
    },
    {
      "pm_type": "pxe_ipmitool",
      "mac": ["fa:16:3e:45:66:9b"],
      "cpu": 4,
      "memory": 8192,
      "disk": 60,
      "arch": "x86_64",
      "pm_user": "admin",
      "pm_password": "password",
      "pm_addr": "192.168.1.27",
      "capabilities": "boot_option:local",
      "name": "aschultz-baremetal-2"
    },
    {
      "pm_type": "pxe_ipmitool",
      "mac": ["fa:16:3e:a5:df:e9"],
      "cpu": 4,
      "memory": 8192,
      "disk": 60,
      "arch": "x86_64",
      "pm_user": "admin",
      "pm_password": "password",
      "pm_addr": "192.168.1.38",
      "capabilities": "boot_option:local",
      "name": "aschultz-baremetal-3"
    }
  ]
}

3. Attempt introspection: openstack overcloud node introspect --all-manageable --provide

Actual results:

(undercloud) [cloud-user@undercloud ~]$ openstack overcloud node introspect --all-manageable --provide
Waiting for introspection to finish...
Waiting for messages on queue 'tripleo' with no timeout.
Introspection of node completed:f0e6e84f-4cab-4bf7-9042-c68a583ba01f. Status:FAILED. Errors:Failed to set boot device to PXE: Gateway Timeout (HTTP 504)
Introspection of node completed:8318615d-8764-45cd-8024-dfd615c67391. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID 3eb52f5a9a5d4f2aa4a49ecb589875ea (HTTP 500)
Introspection of node completed:e4746a57-0e4f-44c3-a7b4-06c90a2c82f7. Status:FAILED. Errors:Failed to set boot device to PXE: Gateway Timeout (HTTP 504)
Introspection of node completed:120ce0a3-2e03-4364-a102-909f0494a5b3. Status:FAILED. Errors:Failed to set boot device to PXE: Gateway Timeout (HTTP 504)
Retrying 4 nodes that failed introspection. Attempt 1 of 3
Introspection of node completed:8318615d-8764-45cd-8024-dfd615c67391. Status:FAILED. Errors:Failed to set boot device to PXE: Node 8318615d-8764-45cd-8024-dfd615c67391 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)
Introspection of node completed:f0e6e84f-4cab-4bf7-9042-c68a583ba01f. Status:FAILED. Errors:Failed to set boot device to PXE: Node f0e6e84f-4cab-4bf7-9042-c68a583ba01f is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)
Introspection of node completed:e4746a57-0e4f-44c3-a7b4-06c90a2c82f7. Status:FAILED. Errors:Failed to set boot device to PXE: Node e4746a57-0e4f-44c3-a7b4-06c90a2c82f7 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)
Introspection of node completed:120ce0a3-2e03-4364-a102-909f0494a5b3. Status:FAILED. Errors:Failed to set boot device to PXE: Node 120ce0a3-2e03-4364-a102-909f0494a5b3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)
Retrying 4 nodes that failed introspection. Attempt 2 of 3
Introspection of node completed:8318615d-8764-45cd-8024-dfd615c67391. Status:FAILED. Errors:Failed to set boot device to PXE: Node 8318615d-8764-45cd-8024-dfd615c67391 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)
Introspection of node completed:e4746a57-0e4f-44c3-a7b4-06c90a2c82f7. Status:FAILED. Errors:Failed to set boot device to PXE: Node e4746a57-0e4f-44c3-a7b4-06c90a2c82f7 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)
Introspection of node completed:120ce0a3-2e03-4364-a102-909f0494a5b3. Status:FAILED. Errors:Failed to set boot device to PXE: Node 120ce0a3-2e03-4364-a102-909f0494a5b3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)
Introspection of node completed:f0e6e84f-4cab-4bf7-9042-c68a583ba01f. Status:FAILED. Errors:Failed to set boot device to PXE: Gateway Timeout (HTTP 504)
Retrying 4 nodes that failed introspection. Attempt 3 of 3
Introspection of node completed:120ce0a3-2e03-4364-a102-909f0494a5b3. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID f2bb6b46422c4c538a81d6bf87e369c4 (HTTP 500)
Introspection of node completed:8318615d-8764-45cd-8024-dfd615c67391. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID fce7fad47f7f4fcd8134089536f90070 (HTTP 500)
Introspection of node completed:e4746a57-0e4f-44c3-a7b4-06c90a2c82f7. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID c8c65b0812784d8b9aafa84ef1a121c0 (HTTP 500)
Introspection of node completed:f0e6e84f-4cab-4bf7-9042-c68a583ba01f. Status:FAILED. Errors:Failed to set boot device to PXE: Node f0e6e84f-4cab-4bf7-9042-c68a583ba01f is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)
Retry limit reached with 4 nodes still failing introspection
{'result': 'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=501aec29-29b9-4fe5-aa11-0e81ef780855] -> Workflow failed due to message status.
Status:FAILED Message:Retry limit reached with 4 nodes still failing introspection\n [wf_ex_id=33742d50-9fa1-48f6-9e89-7c61529e0e93, idx=0]: Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 4 nodes still failing introspection\n', 'introspection_attempt': 2, 'introspected_nodes': {'8318615d-8764-45cd-8024-dfd615c67391': {'error': 'Failed to set boot device to PXE: Node 8318615d-8764-45cd-8024-dfd615c67391 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)', 'finished': True, 'finished_at': '2020-03-13T22:00:14', 'links': [{'href': 'http://192.168.24.2:13050/v1/introspection/8318615d-8764-45cd-8024-dfd615c67391', 'rel': 'self'}], 'started_at': '2020-03-13T21:58:59', 'state': 'error', 'uuid': '8318615d-8764-45cd-8024-dfd615c67391'}, '120ce0a3-2e03-4364-a102-909f0494a5b3': {'error': 'Failed to set boot device to PXE: Node 120ce0a3-2e03-4364-a102-909f0494a5b3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)', 'finished': True, 'finished_at': '2020-03-13T22:00:24', 'links': [{'href': 'http://192.168.24.2:13050/v1/introspection/120ce0a3-2e03-4364-a102-909f0494a5b3', 'rel': 'self'}], 'started_at': '2020-03-13T21:58:59', 'state': 'error', 'uuid': '120ce0a3-2e03-4364-a102-909f0494a5b3'}, 'e4746a57-0e4f-44c3-a7b4-06c90a2c82f7': {'error': 'Failed to set boot device to PXE: Node e4746a57-0e4f-44c3-a7b4-06c90a2c82f7 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)', 'finished': True, 'finished_at': '2020-03-13T22:00:29', 'links': [{'href': 'http://192.168.24.2:13050/v1/introspection/e4746a57-0e4f-44c3-a7b4-06c90a2c82f7', 'rel': 'self'}], 'started_at': '2020-03-13T21:58:59', 'state': 'error', 'uuid': 'e4746a57-0e4f-44c3-a7b4-06c90a2c82f7'}, 'f0e6e84f-4cab-4bf7-9042-c68a583ba01f': {'error': 'Failed to set boot device to PXE: Node f0e6e84f-4cab-4bf7-9042-c68a583ba01f is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)', 'finished': True, 'finished_at': '2020-03-13T22:00:19', 'links': [{'href': 'http://192.168.24.2:13050/v1/introspection/f0e6e84f-4cab-4bf7-9042-c68a583ba01f', 'rel': 'self'}], 'started_at': '2020-03-13T21:58:59', 'state': 'error', 'uuid': 'f0e6e84f-4cab-4bf7-9042-c68a583ba01f'}}, 'failed_introspection': ['8318615d-8764-45cd-8024-dfd615c67391', '120ce0a3-2e03-4364-a102-909f0494a5b3', 'e4746a57-0e4f-44c3-a7b4-06c90a2c82f7', 'f0e6e84f-4cab-4bf7-9042-c68a583ba01f'], 'status': 'RUNNING', 'message': 'Retrying 4 nodes that failed introspection. Attempt 2 of 3 ', 'node_uuids': ['8318615d-8764-45cd-8024-dfd615c67391', '120ce0a3-2e03-4364-a102-909f0494a5b3', 'e4746a57-0e4f-44c3-a7b4-06c90a2c82f7', 'f0e6e84f-4cab-4bf7-9042-c68a583ba01f']} Exception introspecting nodes: {'result': 'Failure caused by error in tasks: send_message\n\n send_message [task_ex_id=501aec29-29b9-4fe5-aa11-0e81ef780855] -> Workflow failed due to message status. Status:FAILED Message:Retry limit reached with 4 nodes still failing introspection\n [wf_ex_id=33742d50-9fa1-48f6-9e89-7c61529e0e93, idx=0]: Workflow failed due to message status. 
Status:FAILED Message:Retry limit reached with 4 nodes still failing introspection\n', 'introspection_attempt': 2, 'introspected_nodes': {'8318615d-8764-45cd-8024-dfd615c67391': {'error': 'Failed to set boot device to PXE: Node 8318615d-8764-45cd-8024-dfd615c67391 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)', 'finished': True, 'finished_at': '2020-03-13T22:00:14', 'links': [{'href': 'http://192.168.24.2:13050/v1/introspection/8318615d-8764-45cd-8024-dfd615c67391', 'rel': 'self'}], 'started_at': '2020-03-13T21:58:59', 'state': 'error', 'uuid': '8318615d-8764-45cd-8024-dfd615c67391'}, '120ce0a3-2e03-4364-a102-909f0494a5b3': {'error': 'Failed to set boot device to PXE: Node 120ce0a3-2e03-4364-a102-909f0494a5b3 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)', 'finished': True, 'finished_at': '2020-03-13T22:00:24', 'links': [{'href': 'http://192.168.24.2:13050/v1/introspection/120ce0a3-2e03-4364-a102-909f0494a5b3', 'rel': 'self'}], 'started_at': '2020-03-13T21:58:59', 'state': 'error', 'uuid': '120ce0a3-2e03-4364-a102-909f0494a5b3'}, 'e4746a57-0e4f-44c3-a7b4-06c90a2c82f7': {'error': 'Failed to set boot device to PXE: Node e4746a57-0e4f-44c3-a7b4-06c90a2c82f7 is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)', 'finished': True, 'finished_at': '2020-03-13T22:00:29', 'links': [{'href': 'http://192.168.24.2:13050/v1/introspection/e4746a57-0e4f-44c3-a7b4-06c90a2c82f7', 'rel': 'self'}], 'started_at': '2020-03-13T21:58:59', 'state': 'error', 'uuid': 'e4746a57-0e4f-44c3-a7b4-06c90a2c82f7'}, 'f0e6e84f-4cab-4bf7-9042-c68a583ba01f': {'error': 'Failed to set boot device to PXE: Node f0e6e84f-4cab-4bf7-9042-c68a583ba01f is locked by host undercloud.localdomain, please retry after the current operation is completed. (HTTP 409)', 'finished': True, 'finished_at': '2020-03-13T22:00:19', 'links': [{'href': 'http://192.168.24.2:13050/v1/introspection/f0e6e84f-4cab-4bf7-9042-c68a583ba01f', 'rel': 'self'}], 'started_at': '2020-03-13T21:58:59', 'state': 'error', 'uuid': 'f0e6e84f-4cab-4bf7-9042-c68a583ba01f'}}, 'failed_introspection': ['8318615d-8764-45cd-8024-dfd615c67391', '120ce0a3-2e03-4364-a102-909f0494a5b3', 'e4746a57-0e4f-44c3-a7b4-06c90a2c82f7', 'f0e6e84f-4cab-4bf7-9042-c68a583ba01f'], 'status': 'RUNNING', 'message': 'Retrying 4 nodes that failed introspection. Attempt 2 of 3 ', 'node_uuids': ['8318615d-8764-45cd-8024-dfd615c67391', '120ce0a3-2e03-4364-a102-909f0494a5b3', 'e4746a57-0e4f-44c3-a7b4-06c90a2c82f7', 'f0e6e84f-4cab-4bf7-9042-c68a583ba01f']}

Expected results:
Successful introspection
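Additional info: after a failed run like the one above, the node and introspection state can be checked with the standard clients. The commands below are illustrative and not captured from this environment:

$ openstack baremetal node list -c UUID -c Name -c "Provisioning State" -c "Power State"
$ openstack baremetal introspection list
$ openstack baremetal introspection status <node-uuid>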
Created attachment 1669999 [details] ironic-logs.tgz
Can we capture the status of messaging? RabbitMQ logs and configuration etc. as well?

/var/log/containers/ironic/app.log
----------------------------------
2020-03-13 17:58:33.976 23 ERROR wsme.api [req-6d087f77-535a-4e7c-b6b2-e0fa31ea38fa e622bfddedf047df9de28b34976edd89 1f9c2ccd412d40b095393639ef2fdde3 - default default] Server-side error: "Timed out waiting for a reply to message ID 9adff0f843414aca8db8f0e97c7eb936". Detail:
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
    return self._queues[msg_id].get(block=True, timeout=timeout)
  File "/usr/lib64/python3.6/queue.py", line 172, in get
    raise Empty
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/api/controllers/v1/node.py", line 233, in put
    topic=topic)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/rpcapi.py", line 645, in set_boot_device
    device=device, persistent=persistent)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
    transport_options=self.transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
    call_monitor_timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
    'to message ID %s' % msg_id)
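For reference, one way to capture the messaging state on the undercloud. The container name and paths below are the usual TripleO defaults rather than values taken from this environment; confirm with podman ps first:

$ sudo podman ps --format '{{.Names}}' | grep -i rabbit
$ sudo podman exec rabbitmq rabbitmqctl status
$ sudo podman exec rabbitmq rabbitmqctl list_queues name messages consumers
# logs and rendered config normally live under the standard container paths
$ sudo tar czf rabbitmq-debug.tgz /var/log/containers/rabbitmq /var/lib/config-data/puppet-generated/rabbitmq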
Also FYI the pxe_ipmitool driver hasn't existed for a few releases. It only works for you because there is tripleo-specific code that replaces it with the "ipmi" driver.
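That remapping is easy to confirm on the registered nodes; the driver ironic actually uses should show up as "ipmi". The node name is taken from the env.json above and the output is what would be expected, not captured from this environment:

$ openstack baremetal node show aschultz-baremetal-0 -f value -c driver
ipmi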
2020-03-13 17:56:57.238 7 DEBUG ironic.conductor.task_manager [req-1b2041b8-d8bb-4f09-95c7-32bd5058722b 91417b2cccb8407aa51511a4ac60bdaa 4fde87ff8d29484898fed971c187a3ff - default default] Successfully released exclusive lock for provision action manage on node 120ce0a3-2e03-4364-a102-909f0494a5b3 (lock was held 127.02 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356

2020-03-13 18:02:06.897 7 DEBUG ironic.conductor.task_manager [req-c34397f6-06ee-4767-a6d6-9b01a40f7de2 e622bfddedf047df9de28b34976edd89 1f9c2ccd412d40b095393639ef2fdde3 - default default] Successfully released exclusive lock for setting boot device on node 120ce0a3-2e03-4364-a102-909f0494a5b3 (lock was held 257.93 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356

2020-03-13 18:07:32.827 7 DEBUG ironic.conductor.task_manager [req-b581812a-7383-4656-94bf-f0fea6c12806 e622bfddedf047df9de28b34976edd89 1f9c2ccd412d40b095393639ef2fdde3 - default default] Successfully released exclusive lock for setting boot device on node 120ce0a3-2e03-4364-a102-909f0494a5b3 (lock was held 260.23 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356

This is ... super slow. Since the boot device API is synchronous (a historical deficiency of the ironic API), oslo.messaging times out on the RPC call while the conductor is still holding the lock.
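A quick way to quantify how long the conductor is holding those locks across all nodes; the log path is assumed to be the standard containerized location:

$ sudo grep -o 'lock was held [0-9.]* sec' /var/log/containers/ironic/ironic-conductor.log | sort -k4 -n | tail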
Even the simplest ipmitool operations take 2 minutes:

2020-03-13 18:04:52.063 7 DEBUG oslo_concurrency.processutils [req-2198bcc4-ca21-49e3-a9f8-4f0c8821fa24 - - - - -] CMD "ipmitool -I lanplus -H 192.168.1.27 -L ADMINISTRATOR -U admin -v -R 12 -N 5 -f /tmp/tmpa6i246j2 power status" returned: 0 in 126.460s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409

2020-03-13 18:04:52.092 7 DEBUG oslo_concurrency.processutils [req-2198bcc4-ca21-49e3-a9f8-4f0c8821fa24 - - - - -] CMD "ipmitool -I lanplus -H 192.168.1.9 -L ADMINISTRATOR -U admin -v -R 12 -N 5 -f /tmp/tmp11c9_ax6 power status" returned: 0 in 126.500s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409

2020-03-13 18:04:52.103 7 DEBUG oslo_concurrency.processutils [req-2198bcc4-ca21-49e3-a9f8-4f0c8821fa24 - - - - -] CMD "ipmitool -I lanplus -H 192.168.1.15 -L ADMINISTRATOR -U admin -v -R 12 -N 5 -f /tmp/tmpeyvy7o34 power status" returned: 0 in 126.525s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409

This may cause the observed behavior.
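Note the -R 12 -N 5 in those commands (12 retries with a 5 second timeout each), so a BMC that is slow or dropping packets can plausibly burn a couple of minutes per command, roughly in line with the ~126 s above. The same call can be timed by hand from the undercloud to take ironic out of the picture; the address and credentials below come from the env.json in the description:

$ time ipmitool -I lanplus -H 192.168.1.27 -L ADMINISTRATOR -U admin -P password -R 12 -N 5 power status
# compare against a run with ipmitool's default retry settings:
$ time ipmitool -I lanplus -H 192.168.1.27 -U admin -P password power status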
After more investigation, it does appear to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1813889.
A possible workaround:

$ cat env.yaml
parameter_defaults:
  ExtraConfig:
    ironic::drivers::ipmi::command_retry_timeout: 10
    ironic::drivers::ipmi::min_command_interval: 2

$ grep custom_env_files undercloud.conf
custom_env_files = /home/cloud-user/env.yaml
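For completeness, the hieradata only takes effect after the undercloud is reinstalled, and the rendered values can be checked afterwards. The config-data path below is the usual containerized location; adjust if yours differs:

$ openstack undercloud install
$ sudo grep -E 'command_retry_timeout|min_command_interval' \
    /var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf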
The workaround doesn't completely resolve it: even though the IPMI operations now complete faster, DHCP is still not being provided to the nodes for introspection.
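For anyone debugging the same symptom, the DHCP path on the undercloud can be checked with something like the following. The dnsmasq container name is assumed to be the TripleO default and br-ctlplane the default provisioning bridge; adjust for your setup:

$ sudo podman logs --tail 50 ironic_inspector_dnsmasq
# watch for DHCPDISCOVER/DHCPOFFER while the nodes PXE boot
$ sudo tcpdump -ni br-ctlplane port 67 or port 68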
We're tracking the inspector dnsmasq DHCP issue in https://bugzilla.redhat.com/show_bug.cgi?id=1814616.
See Dmitry's comment - https://bugzilla.redhat.com/show_bug.cgi?id=1813889#c28. A fix has been made to pyghmi that will work with ipmitool-1.8.18-14 - https://opendev.org/x/pyghmi/commit/ec4b503edb5422046f9b0bac6dcd84d47c178fe5. Since vbmc is installed via pip, this should take care of the issue; the same applies to OVB. The inspector dnsmasq issue is still open and is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1814616.
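On an already-deployed environment, the fixed pyghmi can be picked up without rebuilding images by upgrading it on whatever host or instance serves the BMCs and restarting the BMC processes. This is a rough sketch; the exact service to restart depends on whether virtualbmc or the OVB bmc instance is in use:

$ pip install --upgrade pyghmi
$ pip show pyghmi | grep -i '^version'
# then restart the BMC processes (vbmcd for virtualbmc, or the OVB bmc services)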
We'll have to rebuild the bmc images used everywhere, as it seems they already have an older version of pyghmi baked in for OVB. At least that was the case when I spun up a brand-new environment.
Can we close this out now, since ipmitool is working? The inspector dnsmasq issue is being tracked separately.
Closing this as a dup, I think we've resolved the two issues. *** This bug has been marked as a duplicate of bug 1813889 ***