Bug 1813468
Summary: introspection fails on 16.1 with Failed to set boot device to PXE: Gateway Timeout (HTTP 504) with OVB
Product: Red Hat OpenStack
Component: openstack-ironic
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Reporter: Alex Schultz <aschultz>
Assignee: RHOS Maint <rhos-maint>
QA Contact: Alistair Tonner <atonner>
CC: bfournie, dtantsur, hjensas, jpretori, mburns, sbaker
Type: Bug
Last Closed: 2020-03-29 20:28:05 UTC
Bug Depends On: 1813889

Description
Alex Schultz
2020-03-13 22:10:06 UTC
Created attachment 1669999
ironic-logs.tgz

Can we capture the status of messaging? RabbitMQ logs and configuration, etc., as well?

/var/log/containers/ironic/app.log
----------------------------------
2020-03-13 17:58:33.976 23 ERROR wsme.api [req-6d087f77-535a-4e7c-b6b2-e0fa31ea38fa e622bfddedf047df9de28b34976edd89 1f9c2ccd412d40b095393639ef2fdde3 - default default] Server-side error: "Timed out waiting for a reply to message ID 9adff0f843414aca8db8f0e97c7eb936". Detail:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
    return self._queues[msg_id].get(block=True, timeout=timeout)
  File "/usr/lib64/python3.6/queue.py", line 172, in get
    raise Empty
queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/api/controllers/v1/node.py", line 233, in put
    topic=topic)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/rpcapi.py", line 645, in set_boot_device
    device=device, persistent=persistent)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
    transport_options=self.transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
    transport_options=transport_options)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
    call_monitor_timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
    'to message ID %s' % msg_id)

Also FYI the pxe_ipmitool driver hasn't existed for a few releases. It only works for you because there is tripleo-specific code that replaces it with the "ipmi" driver.

2020-03-13 17:56:57.238 7 DEBUG ironic.conductor.task_manager [req-1b2041b8-d8bb-4f09-95c7-32bd5058722b 91417b2cccb8407aa51511a4ac60bdaa 4fde87ff8d29484898fed971c187a3ff - default default] Successfully released exclusive lock for provision action manage on node 120ce0a3-2e03-4364-a102-909f0494a5b3 (lock was held 127.02 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356
2020-03-13 18:02:06.897 7 DEBUG ironic.conductor.task_manager [req-c34397f6-06ee-4767-a6d6-9b01a40f7de2 e622bfddedf047df9de28b34976edd89 1f9c2ccd412d40b095393639ef2fdde3 - default default] Successfully released exclusive lock for setting boot device on node 120ce0a3-2e03-4364-a102-909f0494a5b3 (lock was held 257.93 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356
2020-03-13 18:07:32.827 7 DEBUG ironic.conductor.task_manager [req-b581812a-7383-4656-94bf-f0fea6c12806 e622bfddedf047df9de28b34976edd89 1f9c2ccd412d40b095393639ef2fdde3 - default default] Successfully released exclusive lock for setting boot device on node 120ce0a3-2e03-4364-a102-909f0494a5b3 (lock was held 260.23 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356

This is ... super slow. Since the boot device API is synchronous (a historical deficiency of the ironic API), oslo.messaging times out after waiting that long.

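To spot these slow operations in a larger conductor log, the "lock was held N sec" lines can be filtered with a short script. This is only a minimal sketch and not part of ironic: the function name, the 60-second threshold and the regular expression are my own, and it assumes the message format shown in the excerpt above.

```python
import re

# Assumed message format, taken from the conductor log lines in this report:
# "... released exclusive lock for <action> on node <uuid> (lock was held <N> sec)"
LOCK_RE = re.compile(
    r"released exclusive lock for (?P<action>.+?) on node "
    r"(?P<node>[0-9a-f-]{36}) \(lock was held (?P<secs>[0-9.]+) sec\)"
)


def slow_locks(lines, threshold=60.0):
    """Yield (action, node, seconds) for locks held longer than threshold seconds."""
    for line in lines:
        match = LOCK_RE.search(line)
        if match and float(match.group("secs")) > threshold:
            yield match.group("action"), match.group("node"), float(match.group("secs"))
```

Passing an open log file (or any iterable of lines) to `slow_locks` lists only the operations that held the node lock for longer than the threshold.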
Even the simplest ipmitool operations take 2 minutes:

2020-03-13 18:04:52.063 7 DEBUG oslo_concurrency.processutils [req-2198bcc4-ca21-49e3-a9f8-4f0c8821fa24 - - - - -] CMD "ipmitool -I lanplus -H 192.168.1.27 -L ADMINISTRATOR -U admin -v -R 12 -N 5 -f /tmp/tmpa6i246j2 power status" returned: 0 in 126.460s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
2020-03-13 18:04:52.092 7 DEBUG oslo_concurrency.processutils [req-2198bcc4-ca21-49e3-a9f8-4f0c8821fa24 - - - - -] CMD "ipmitool -I lanplus -H 192.168.1.9 -L ADMINISTRATOR -U admin -v -R 12 -N 5 -f /tmp/tmp11c9_ax6 power status" returned: 0 in 126.500s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
2020-03-13 18:04:52.103 7 DEBUG oslo_concurrency.processutils [req-2198bcc4-ca21-49e3-a9f8-4f0c8821fa24 - - - - -] CMD "ipmitool -I lanplus -H 192.168.1.15 -L ADMINISTRATOR -U admin -v -R 12 -N 5 -f /tmp/tmpeyvy7o34 power status" returned: 0 in 126.525s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409

This may cause the observed behavior.

After more investigation, it does seem to be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1813889.

A possible workaround:

$ cat env.yaml
parameter_defaults:
  ExtraConfig:
    ironic::drivers::ipmi::command_retry_timeout: 10
    ironic::drivers::ipmi::min_command_interval: 2
$ grep custom_env_files undercloud.conf
custom_env_files = /home/cloud-user/env.yaml

The workaround doesn't completely resolve it: even though the IPMI bits function faster, DHCP is not being provided to the nodes for introspection.

We're tracking the inspector dnsmasq DHCP issue in https://bugzilla.redhat.com/show_bug.cgi?id=1814616.

See Dmitry's comment - https://bugzilla.redhat.com/show_bug.cgi?id=1813889#c28. A fix has been made to pyghmi that will work with ipmitool-1.8.18-14 - https://opendev.org/x/pyghmi/commit/ec4b503edb5422046f9b0bac6dcd84d47c178fe5.
As vbmc is installed via pip, this should take care of the issue; the same goes for OVB. The inspector dnsmasq issue is still open and tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1814616.

We'll have to rebuild the BMC images used everywhere, as it seems they already have an older version of pyghmi baked in for OVB. At least that was the case when I spun up a brand new environment.

Can we close this out now that ipmitool is working? The inspector dnsmasq issue is being tracked separately.

Closing this as a dup; I think we've resolved the two issues.

*** This bug has been marked as a duplicate of bug 1813889 ***
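
For reference, the two settings in the env.yaml workaround above appear to map onto the `-R` and `-N` arguments visible in the logged ipmitool commands. The sketch below illustrates that relationship as I understand it; it is not ironic's code. The function name is hypothetical, and the default values of 60 and 5 are an assumption, chosen because they reproduce the `-R 12 -N 5` seen in the log.

```python
def ipmitool_timing_args(command_retry_timeout=60, min_command_interval=5):
    """Return the ipmitool timing flags implied by the two ipmi settings.

    Assumption (not confirmed in this report): the retry count is
    command_retry_timeout // min_command_interval, with a minimum of 1,
    and -N is min_command_interval itself.
    """
    retries = max(command_retry_timeout // min_command_interval, 1)
    return ["-R", str(retries), "-N", str(min_command_interval)]


# Assumed defaults: reproduces the "-R 12 -N 5" in the logged commands.
print(ipmitool_timing_args())
# Values from the env.yaml workaround.
print(ipmitool_timing_args(command_retry_timeout=10, min_command_interval=2))
```

Under that assumption, the workaround values would reduce both the number of retries and the interval between them, which is consistent with the IPMI operations completing faster after it was applied.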