Description of problem:
Introspection is failing after a scale up is performed. This is seen in the logs when an introspection is performed:

STDOUT:
Waiting for introspection to finish...
Introspection of node completed:82c00b53-c17b-467e-864d-47d57c4b38eb. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID e38a3a305549458e9e588f4a0d7a5d17 (HTTP 500)
Retrying 1 nodes that failed introspection. Attempt 1 of 3
Introspection of node completed:82c00b53-c17b-467e-864d-47d57c4b38eb. Status:FAILED. Errors:Failed to set boot device to PXE: Node 82c00b53-c17b-467e-864d-47d57c4b38eb is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409)
Retrying 1 nodes that failed introspection. Attempt 2 of 3
Introspection of node completed:82c00b53-c17b-467e-864d-47d57c4b38eb. Status:FAILED. Errors:Failed to set boot device to PXE: Node 82c00b53-c17b-467e-864d-47d57c4b38eb is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409)
Retrying 1 nodes that failed introspection. Attempt 3 of 3
Introspection of node completed:82c00b53-c17b-467e-864d-47d57c4b38eb. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID 020dbfdd32484ef9a91b3456998a856c (HTTP 500)
Retry limit reached with 1 nodes still failing introspection

STDERR:
Waiting for messages on queue 'tripleo' with no timeout.
Introspection completed with errors: Retry limit reached with 1 nodes still failing introspection

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200424.n.0

How reproducible:
Every time one of the df rfe scale up jobs is performed, e.g. DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup

Steps to Reproduce:
1. Execute the jenkins job: DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup
Actual results:
Introspection fails with the error seen above.

Expected results:
Introspection is successful.

Additional info:
Logs from a failing test: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/rfe/job/DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/3/
I'm seeing the same issue. I imported a new node using ipmitool after I had already deployed a stack. When I attempted to introspect it, I immediately got:

$ openstack overcloud node introspect hp-dl360-g9-02 --provide
Waiting for introspection to finish...
Waiting for messages on queue 'tripleo' with no timeout.
Introspection of node completed:a574065d-febc-47fd-8990-8a5aeaeddbe2. Status:FAILED. Errors:Failed to set boot device to PXE: Gateway Timeout (HTTP 504)
Retrying 1 nodes that failed introspection. Attempt 1 of 3
Introspection of node completed:a574065d-febc-47fd-8990-8a5aeaeddbe2. Status:FAILED. Errors:Failed to set boot device to PXE: Node a574065d-febc-47fd-8990-8a5aeaeddbe2 is locked by host hardprov-dl360-g9-01.snedlab.lab.eng.rdu2.redhat.com, please retry after the current operation is completed. (HTTP 409)
Retrying 1 nodes that failed introspection. Attempt 2 of 3
Introspection of node completed:a574065d-febc-47fd-8990-8a5aeaeddbe2. Status:FAILED. Errors:Failed to set boot device to PXE: Node a574065d-febc-47fd-8990-8a5aeaeddbe2 is locked by host hardprov-dl360-g9-01.snedlab.lab.eng.rdu2.redhat.com, please retry after the current operation is completed. (HTTP 409)
Retrying 1 nodes that failed introspection. Attempt 3 of 3
Introspection of node completed:a574065d-febc-47fd-8990-8a5aeaeddbe2. Status:FAILED. Errors:Failed to set boot device to PXE: Gateway Timeout (HTTP 504)
Retry limit reached with 1 nodes still failing introspection
Introspection completed with errors: Retry limit reached with 1 nodes still failing introspection

ironic_inspector.log:

containers/ironic-inspector/ironic-inspector.log:2020-05-01 10:30:16.732 7 INFO ironic_inspector.introspect [-] [node: a574065d-febc-47fd-8990-8a5aeaeddbe2 state starting] The following attributes will be used for look up: {'bmc_address': ['10.9.103.29'], 'mac': ['48:df:37:12:0e:14']}
containers/ironic-inspector/ironic-inspector.log:2020-05-01 10:30:26.724 7 INFO eventlet.wsgi.server [req-ea534936-545a-4e52-ac3c-191466d69ba4 8b55a2650e584ea5811c51ec24e33a0d bec9c7aab90547aeb255ab9ef7564ecd - default default] fe32:dead:beef::2,fe32:dead:beef::4 "GET /v1/introspection/a574065d-febc-47fd-8990-8a5aeaeddbe2 HTTP/1.1" status: 200 len: 492 time: 0.0066347
containers/ironic-inspector/ironic-inspector.log:2020-05-01 10:31:40.945 7 ERROR ironic_inspector.utils [-] [node: a574065d-febc-47fd-8990-8a5aeaeddbe2 state starting] Failed to set boot device to PXE: Timed out waiting for a reply to message ID f665caeb4bc34df2bd5be342f7a22703 (HTTP 500): ironicclient.common.apiclient.exceptions.InternalServerError: Timed out waiting for a reply to message ID f665caeb4bc34df2bd5be342f7a22703 (HTTP 500)
containers/ironic-inspector/ironic-inspector.log:2020-05-01 10:31:40.946 7 ERROR ironic_inspector.node_cache [-] [node: a574065d-febc-47fd-8990-8a5aeaeddbe2 state starting] Processing the error event because of an exception <class 'ironic_inspector.utils.Error'>: Failed to set boot device to PXE: Timed out waiting for a reply to message ID f665caeb4bc34df2bd5be342f7a22703 (HTTP 500) raised by ironic_inspector.introspect._background_introspect: ironic_inspector.utils.Error: Failed to set boot device to PXE: Timed out waiting for a reply to message ID f665caeb4bc34df2bd5be342f7a22703 (HTTP 500)

I do see calls logged in ironic, but not at the same timestamp as above:

containers/ironic/ironic-conductor.log:2020-05-01 10:30:40.943 7 DEBUG ironic.conductor.manager [req-dd4537be-0acf-4557-bea4-2bd4e4ccd79b 7eac58e2c3ac44169b007f2d339589e0 9de13d993d53445aaff2a73a8b49e832 - default default] RPC set_boot_device called for node a574065d-febc-47fd-8990-8a5aeaeddbe2 with device pxe set_boot_device /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3049

Not sure if this is related to the openstacksdk change (https://review.opendev.org/#/c/672179/), but that's the last change in inspector to use set_node_boot_device - https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/introspect.py#L118.
I've also seen the same issue with a 500 error:

2020-05-01 13:16:42.169 7 ERROR ironic_inspector.utils [-] [node: c87251c5-7804-473e-bea9-0f0772d1d870 state starting] Failed to set boot device to PXE: Timed out waiting for a reply to message ID 7366b9d106254a26bb929aeb4943fa21 (HTTP 500):

So we're getting either 500 or 504 errors when attempting the set_boot_device command.
Seeing this in ironic/app.log for the particular message that is timing out:

2020-05-01 13:16:42.167 27 ERROR wsme.api [req-2905f9ce-be9e-491e-89fc-d5e2aee09a41 7eac58e2c3ac44169b007f2d339589e0 9de13d993d53445aaff2a73a8b49e832 - default default] Server-side error: "Timed out waiting for a reply to message ID 7366b9d106254a26bb929aeb4943fa21". Detail:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
    return self._queues[msg_id].get(block=True, timeout=timeout)
  File "/usr/lib64/python3.6/queue.py", line 172, in get
    raise Empty
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/api/controllers/v1/node.py", line 233, in put
    topic=topic)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/rpcapi.py", line 645, in set_boot_device
    device=device, persistent=persistent)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
    transport_options=self.transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
    call_monitor_timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
    'to message ID %s' % msg_id)
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 7366b9d106254a26bb929aeb4943fa21
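For clarity, the traceback shows the ironic API service blocking on a per-message reply queue and converting queue.Empty into a MessagingTimeout once the RPC timeout elapses; the conductor is still busy running the slow BMC command, so no reply ever arrives. A minimal stdlib sketch of that pattern (the names here are illustrative stand-ins, not the oslo.messaging API):

```python
import queue


class MessagingTimeout(Exception):
    """Stand-in for oslo_messaging.exceptions.MessagingTimeout."""


def wait_for_reply(replies, msg_id, timeout):
    # amqpdriver waits on a reply queue keyed by message ID; if the
    # conductor's reply does not arrive within `timeout`, queue.Empty
    # is raised and re-raised as MessagingTimeout -- which the API
    # surfaces as the HTTP 500 seen by ironic-inspector above.
    try:
        return replies.get(block=True, timeout=timeout)
    except queue.Empty:
        raise MessagingTimeout(
            'Timed out waiting for a reply to message ID %s' % msg_id)


# No reply is ever posted, so the wait times out:
q = queue.Queue()
try:
    wait_for_reply(q, '7366b9d106254a26bb929aeb4943fa21', timeout=0.1)
except MessagingTimeout as exc:
    print(exc)
```

This is why a later retry then hits HTTP 409: the client-side wait gave up, but the conductor still holds the node lock while the slow operation finishes.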
So it looks like this is a similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1813889: ipmitool commands are taking up to 2 minutes, which causes the inspector timeout. Seeing lots of:

2020-05-04 11:16:53.205 7 DEBUG ironic.common.utils [req-eb49faaa-94bd-4f0e-badd-064272ba1ebc - - - - -] Command stderr is: "Unable to Get Channel Cipher Suites " execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:79

It's not clear why this only occurs on scale up. The first time a set of nodes is introspected and the stack deployed, it works fine. It's only when bringing in another node afterwards that we get these ipmitool timeouts. I am seeing this on a baremetal setup; I believe the initial report of this bug is from a virtual setup.
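One way to confirm that slow BMC commands are the bottleneck is to time them directly from the undercloud. A hedged helper sketch; the echo command below is only a placeholder for an actual ipmitool invocation (e.g. a `power status` call against the node's BMC address), which is an assumption about how you would probe it:

```python
import subprocess
import time


def run_timed(cmd, warn_after=60.0):
    """Run a command and return (elapsed_seconds, stdout).

    The inspector's RPC wait gives up well before the ~2 minutes a
    broken "Get Channel Cipher Suites" negotiation can take, so any
    command near `warn_after` would explain the MessagingTimeout above.
    """
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    if elapsed > warn_after:
        print('SLOW (%.1fs): %s' % (elapsed, ' '.join(cmd)))
    return elapsed, result.stdout


# Placeholder command; in practice pass the real ipmitool call here:
elapsed, out = run_timed(['echo', 'Chassis Power is on'])
```

Running this against the affected node before and after the scale up would show whether the BMC round-trip itself regressed or only the conductor-side handling did.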
Versions used by ironic_conductor (undercloud):

[stack@hardprov-dl360-g9-01 log]$ sudo podman exec -it ironic_conductor /bin/bash
()[ironic@hardprov-dl360-g9-01 /]$ rpm -qa | grep ipmi
ipmitool-1.8.18-14.el8.x86_64
()[ironic@hardprov-dl360-g9-01 /]$ rpm -qa | grep pyg
python3-pyghmi-1.0.22-3.el8ost.noarch
I was seeing the introspection error in multiple scale up jobs. However, with the latest build (RHOS-16.1-RHEL-8-20200428.n.0) those same jobs pass without the introspection error. With that build I still see the introspection error in the replace-controller stage of this job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/controller_replacement/job/DFG-df-controller_replacement-16.1-virthost-3cont_3comp_3ceph-yes_UC_SSL-yes_OC_SSL-ceph-ipv4-geneve-replace_controller-RHELOSP-31864/4/ I searched all my logs and don't see pyghmi being installed.
Thanks David - I verified in the logs that you are seeing the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1813889, where the ipmitool commands take 2 minutes to complete because of "Unable to Get Channel Cipher Suites":

2020-04-30 15:56:21.645 7 DEBUG ironic.common.utils [req-d964b773-c1df-474e-b9d0-fffdcb912b08 0272b8c63c7c41b5ac5b3b613a268498 2b656bb559154893ba9c3fe97122a1ab - default default] Command stderr is: "Unable to Get Channel Cipher Suites " execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:79

This was resolved with the latest pyghmi in https://bugzilla.redhat.com/show_bug.cgi?id=1813889; from the Trello card associated with this bug [1] - "update: the issue with slow ipmitool is gone from OSP CI as we use the always the newest pyghmi with vbmc (from pip), as of this moment it's pyghmi-1.5.13"

Is it possible that the replace-controller stage is using an older version of pyghmi which needs to be updated via pip?

[1] https://trello.com/c/WVFOGaUG/1393-cixbz1813889ospphase1osp161python-virtualbmcosp161-rhel82-ipmitool-commands-via-vbmc-virtualbmc-take-too-long-and-cause-overclou
The Infrared task [vbmc : Install package from pypi] in the replace-controller stage does create a virtual environment using Python 2.7.5:

created virtual environment CPython2.7.5.final.0-64 in 358ms
  creator CPython2Posix(dest=/root/.virtualenvs/vbmc, clear=False, global=True)
  seeder FromAppData(download=False, pip=latest, setuptools=latest, wheel=latest, via=copy, app_data_dir=/root/.local/share/virtualenv/seed-app-data/v1.0.1)
  activators PythonActivator,CShellActivator,FishActivator,PowerShellActivator,BashActivator

This is the pyghmi version installed in the virtual environment:

Collecting pyghmi===1.4.1
  Using cached pyghmi-1.4.1-py2.py3-none-any.whl (196 kB)

which is older than the version referenced above.
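A quick sanity check for whether a given vbmc virtualenv carries a fixed pyghmi; the 1.5.13 threshold comes from the Trello note quoted above, and the comparison helper is only an illustrative sketch (it assumes plain numeric versions):

```python
def version_tuple(version):
    """'1.4.1' -> (1, 4, 1); assumes plain dotted-numeric versions."""
    return tuple(int(part) for part in version.split('.'))


def pyghmi_too_old(installed, fixed='1.5.13'):
    # Tuple comparison is element-wise, so this correctly orders
    # e.g. 1.4.1 < 1.5.13 without string-comparison pitfalls.
    return version_tuple(installed) < version_tuple(fixed)


print(pyghmi_too_old('1.4.1'))   # the venv's pin above -> True
print(pyghmi_too_old('1.5.13'))  # the known-good version -> False
```

So the `pyghmi===1.4.1` pin shown above predates the fix and would need to be bumped for the replace-controller stage.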
David - we have a fix in Ironic that will get past the issue introduced in ipmitool. Please retest with the latest compose. Marking as MODIFIED.
Yes, scale up jobs now pass with no introspection error. Passing logs using build RHOS-16.1-RHEL-8-20200604.n.1:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/rfe/job/DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/15/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148