Created attachment 1780485 [details] 4.7 conductor log This problem is being seen with OCP 4.7, when using 4.6 on the same setup the problem does not occur. When the conductor attempts to send the deploy_steps to the ramdisk the command fails with a connection error: 2021-05-06 14:28:36.648 1 ERROR ironic.drivers.modules.agent_client [req-42a8d11d-55a1-47cc-a4e3-9bf347aba0e5 bootstrap-user - - - -] Failed to connect to the agent running on node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 for invoking command deploy.execute_deploy_step. Error: HTTPSConnectionPool(host='172.22.0.15', port=9999): Read timed out. (read timeout=60): requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.22.0.15', port=9999): Read timed out. (read timeout=60)^[[00m The conductor logs this as an HTTP 409 error: 2021-05-06 14:28:46.253 1 ERROR ironic.drivers.modules.agent_client [req-42a8d11d-55a1-47cc-a4e3-9bf347aba0e5 bootstrap-user - - - -] Agent command deploy.execute_deploy_step for node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 failed. Expected 2xx HTTP status code, got 409.^[[00m 2021-05-06 14:28:46.254 1 ERROR ironic.conductor.utils [req-42a8d11d-55a1-47cc-a4e3-9bf347aba0e5 bootstrap-user - - - -] Node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 failed deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'}. Error: Agent API for node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 returned HTTP status code 409 with error: Agent is busy: executing command execute_deploy_step: ironic.common.exception.AgentAPIError: Agent API for node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 returned HTTP status code 409 with error: Agent is busy: executing command execute_deploy_step^[[00m On the ramdisk we see the command received OK but an ssl error is logged: May 06 13:46:31 master-0 ironic-python-agent[3208]: 2021-05-06 13:45:29.397 3208 DEBUG root [-] Executing deploy step {'interface': 'deploy', 'step': 'write_image', 'args': {'image_info': {'id': 'rhcos-47.83.202103251640-0-compressed.x86_64.qcow2', 'urls': ['http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2'], 'disk_format': None, 'container_format': None, 'stream_raw_images': True, 'checksum': 'http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2.md5sum', 'node_uuid': 'a7046d77-3d11-400f-99cb-c1a20cb86678'}, 'configdrive': <large config drive> May 06 13:46:32 master-0 rsyslogd[3212]: message too long (312054) with configured size 8096, begin of message is: 2021-05-06 13:45:29.397 3208 DEBUG root [-] Executing deploy step {'interface': [v8.1911.0-6.el8 try https://www.rsyslog.com/e/2445 ] May 06 13:46:32 master-0 ironic-python-agent[3208]: 2021-05-06 13:46:31.668 3208 INFO eventlet.wsgi.server [-] Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 591, in handle_one_response write(b''.join(towrite)) File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 537, in write wfile.writelines(towrite) File "/usr/lib64/python3.6/socket.py", line 604, in write return self._sock.send(b) File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 183, in send super(GreenSSLSocket, self).send, data, flags) File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 150, in _call_trampolining return func(*a, **kw) File "/usr/lib64/python3.6/ssl.py", line 903, in send return self._sslobj.write(data) File "/usr/lib64/python3.6/ssl.py", line 601, in write return self._sslobj.write(data) BrokenPipeError: [Errno 32] Broken pipe The ramdisk successfully writes the image: May 06 13:46:46 master-0 ironic-python-agent[3208]: 2021-05-06 13:46:46.076 3208 INFO ironic_python_agent.extensions.standby [-] Writing image with command: /bin/bash /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/../shell/write_image.sh /tmp/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2 /dev/sda May 06 13:48:21 master-0 ironic-python-agent[3208]: 2021-05-06 13:47:18.864 3208 INFO root [-] Deploy step completed: {'interface': 'deploy', 'step': 'write_image', 'args': {'image_info': {'id': 'rhcos-47.83.202103251640-0-compressed.x86_64.qcow2', 'urls': ['http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2'], 'disk_format': None, 'container_format': None, 'stream_raw_images': True, 'checksum': 'http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2.md5sum', 'node_uuid': 'a7046d77-3d11-400f-99cb-c1a20cb86678'}, 'configdrive': ... However the conductor has already failed the deployment. Is not clear why the deploy_steps command always fails in 4.7 and works in 4.6 but one difference between 4.6 and 4.7 is that the configdrive in the deploy_steps command is much larger. See the difference in the 2 attached ramdisk files. One other datapoint that may point to this being an iLO version problem. According to Michael Zamot: "We have an almost identical cluster that doesn't have this issue on either version, the only difference I can think of is ILO 2.30 vs 2.14"
Created attachment 1780486 [details] ramdisk 4.7 log
Created attachment 1780487 [details] ramdisk 4.6 log
*** Bug 1957980 has been marked as a duplicate of this bug. ***
Michael downgraded iLO on the passing cluster but it still passes. The only difference between the 2 clusters is one has 5 nodes (passes) and one has 19 (fails on 4.7 only).
To clarify on the connection failure - Ironic times out when sending the deploy_steps, the ramdisk is taking longer than 60 seconds to process the deploy_steps command. Ironic then attempts a retry which results in the HTTP 409 response back from the ramdisk: The ramdisk gets the deploy_steps command here: May 06 13:45:29 master-0 ironic-python-agent[3208]: 2021-05-06 13:45:27.558 3208 DEBUG root [-] Executing command: deploy.execute_deploy_step with args: {'step': {'interface': 'deploy', 'step': 'write_image', 'args': {'image_info': {'id': 'rhcos-47.83.202103251640-0-compressed.x86_64.qcow2', 'urls': ['http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2 After 60 seconds it receives the retry from ironic: May 06 13:46:34 master-0 rsyslogd[3212]: message too long (9588) with configured size 8096, begin of message is: 2021-05-06 13:46:32.241 3208 DEBUG root [-] Executing command: deploy.execute_de [v8.1911.0-6.el8 try https://www.rsyslog.com/e/2445 ] The ramdisk sends the 409 back for the 2nd command: May 06 13:46:35 master-0 ironic-python-agent[3208]: 2021-05-06 13:46:33.973 3208 ERROR root [-] Tried to execute deploy.execute_deploy_step, agent is still executing Command name: execute_deploy_step, params: {'step': {'interface': 'deploy', 'step': 'write_image', 'args': The timeout for ramdisk commands is configurable in Ironic (https://github.com/openstack/ironic/blob/master/ironic/conf/agent.py#L114) but not easily with Metal3. Need to figure out why the ramdisk is taking too long to process the deploy_steps command and doesn't complete in 60 seconds.
I'm working on a mitigation that will check the list of executing commands before retrying.
An observation so far: something gets stuck inside the agent for the whole 60 seconds until Ironic hits timeout and aborts the request. The retry fails because the work is actually in progress. The patch linked above is a mitigation for the latter: we will detect that the command is running and avoid a failure. We still need to understand why the 1 minute pause appeared in 4.7. The most likely suspect is the introduction of TLS.
Actually I've just remembered that ramdisk TLS was already in place in 4.6, so TLS alone cannot be the cause of the issue.
Looking at the logs that Michael provided it looks like the fix is working also. On the ironic-python-agent we still see the ssl error in eventlet 60 seconds after the execute_deploy_step command: cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:08:48 master-1 ironic-python-agent[3342]: 2021-05-10 17:08:46.669 3342 DEBUG root [-] Executing command: deploy.execute_deploy_step with args: ... cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:09:51 master-1 rsyslogd[3343]: message too long (312338) with configured size 8096, begin of message is: 2021-05-10 17:08:48.675 3342 DEBUG root [-] Executing deploy step {'interface': [v8.1911.0-6.el8 try https://www.rsyslog.com/e/2445 ] cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:09:51 master-1 ironic-python-agent[3342]: 2021-05-10 17:09:51.582 3342 INFO eventlet.wsgi.server [-] Traceback (most recent call last): ... cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: return self._sslobj.write(data) cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: BrokenPipeError: [Errno 32] Broken pipe But on Ironic now that is being handled without an error: log-bundle-20210510212033/bootstrap/pods/1db251fe4475.log:2021-05-10 21:09:51.411 1 DEBUG ironic.drivers.modules.agent_client [req-9b7a1e9e-ceaf-4d04-9921-33a0964b2 288 bootstrap-user - - - -] Agent command deploy.execute_deploy_step for node d70c228b-41e9-47cd-9bb7-a52d2ceed591 returned result None, error None, HTTP status cod e unknown _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:212
Follow-up: https://storyboard.openstack.org/#!/story/2008904
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438