Bug 1957976

Summary: Ironic execute_deploy_steps command to ramdisk times out, resulting in a failed deployment in 4.7
Product: OpenShift Container Platform Reporter: Bob Fournier <bfournie>
Component: Bare Metal Hardware ProvisioningAssignee: Dmitry Tantsur <dtantsur>
Bare Metal Hardware Provisioning sub component: ironic QA Contact: Amit Ugol <augol>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: high CC: dcritch, derekh, dtantsur, mmethot, mzamot, tsedovic
Version: 4.7Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: When sending large packets from Ironic to the ramdisk, particularly if the the mtu is not confiugred correctly, connections failures can result. Consequence: The deployment will fail because Ironic did not get a response from the ramdisk. Fix: Ironic will query the ramdisk to determine the last command that it ran to work around the connection error. Result: The deployment will succeed.
Story Points: ---
Clone Of:
: 1958965 (view as bug list) Environment:
Last Closed: 2021-07-27 23:07:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1958965    
Attachments:
Description Flags
4.7 conductor log
none
ramdisk 4.7 log
none
ramdisk 4.6 log none

Description Bob Fournier 2021-05-06 21:33:34 UTC
Created attachment 1780485 [details]
4.7 conductor log

This problem is being seen with OCP 4.7, when using 4.6 on the same setup the problem does not occur.

When the conductor attempts to send the deploy_steps to the ramdisk the command fails with a connection error:
2021-05-06 14:28:36.648 1 ERROR ironic.drivers.modules.agent_client [req-42a8d11d-55a1-47cc-a4e3-9bf347aba0e5 bootstrap-user - - - -] Failed to connect to the agent running on node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 for invoking command deploy.execute_deploy_step. Error: HTTPSConnectionPool(host='172.22.0.15', port=9999): Read timed out. (read timeout=60): requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='172.22.0.15', port=9999): Read timed out. (read timeout=60)^[[00m

The conductor logs this as an HTTP 409 error:
2021-05-06 14:28:46.253 1 ERROR ironic.drivers.modules.agent_client [req-42a8d11d-55a1-47cc-a4e3-9bf347aba0e5 bootstrap-user - - - -] Agent command deploy.execute_deploy_step for node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 failed. Expected 2xx HTTP status code, got 409.^[[00m
2021-05-06 14:28:46.254 1 ERROR ironic.conductor.utils [req-42a8d11d-55a1-47cc-a4e3-9bf347aba0e5 bootstrap-user - - - -] Node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 failed deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'}. Error: Agent API for node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 returned HTTP status code 409 with error: Agent is busy: executing command execute_deploy_step: ironic.common.exception.AgentAPIError: Agent API for node 5b913ebb-6c0e-44a0-842f-6b3e42e0fd85 returned HTTP status code 409 with error: Agent is busy: executing command execute_deploy_step^[[00m

On the ramdisk we see the command received OK but an ssl error is logged:

May 06 13:46:31 master-0 ironic-python-agent[3208]: 2021-05-06 13:45:29.397 3208 DEBUG root [-] Executing deploy step {'interface': 'deploy', 'step': 'write_image', 'args': {'image_info': {'id': 'rhcos-47.83.202103251640-0-compressed.x86_64.qcow2', 'urls': ['http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2'], 'disk_format': None, 'container_format': None, 'stream_raw_images': True, 'checksum': 'http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2.md5sum', 'node_uuid': 'a7046d77-3d11-400f-99cb-c1a20cb86678'}, 'configdrive': <large config drive>
May 06 13:46:32 master-0 rsyslogd[3212]: message too long (312054) with configured size 8096, begin of message is: 2021-05-06 13:45:29.397 3208 DEBUG root [-] Executing deploy step {'interface':  [v8.1911.0-6.el8 try https://www.rsyslog.com/e/2445 ]
May 06 13:46:32 master-0 ironic-python-agent[3208]: 2021-05-06 13:46:31.668 3208 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
                                                      File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 591, in handle_one_response
                                                        write(b''.join(towrite))
                                                      File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 537, in write
                                                        wfile.writelines(towrite)
                                                      File "/usr/lib64/python3.6/socket.py", line 604, in write
                                                        return self._sock.send(b)
                                                      File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 183, in send
                                                        super(GreenSSLSocket, self).send, data, flags)
                                                      File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 150, in _call_trampolining
                                                        return func(*a, **kw)
                                                      File "/usr/lib64/python3.6/ssl.py", line 903, in send
                                                        return self._sslobj.write(data)
                                                      File "/usr/lib64/python3.6/ssl.py", line 601, in write
                                                        return self._sslobj.write(data)
                                                    BrokenPipeError: [Errno 32] Broken pipe


The ramdisk successfully writes the image:
May 06 13:46:46 master-0 ironic-python-agent[3208]: 2021-05-06 13:46:46.076 3208 INFO ironic_python_agent.extensions.standby [-] Writing image with command: /bin/bash /usr/lib/python3.6/site-packages/ironic_python_agent/extensions/../shell/write_image.sh /tmp/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2 /dev/sda
May 06 13:48:21 master-0 ironic-python-agent[3208]: 2021-05-06 13:47:18.864 3208 INFO root [-] Deploy step completed: {'interface': 'deploy', 'step': 'write_image', 'args': {'image_info': {'id': 'rhcos-47.83.202103251640-0-compressed.x86_64.qcow2', 'urls': ['http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2'], 'disk_format': None, 'container_format': None, 'stream_raw_images': True, 'checksum': 'http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2.md5sum', 'node_uuid': 'a7046d77-3d11-400f-99cb-c1a20cb86678'}, 'configdrive': ...

However the conductor has already failed the deployment.

Is not clear why the deploy_steps command always fails in 4.7 and works in 4.6 but one difference between 4.6 and 4.7 is that the configdrive in the deploy_steps command is much larger.  See the difference in the 2 attached ramdisk files.


One other datapoint that may point to this being an iLO version problem. According to Michael Zamot:
"We have an almost identical cluster that doesn't have this issue on either version, the only difference I can think of is ILO
2.30 vs 2.14"

Comment 1 Bob Fournier 2021-05-06 21:35:04 UTC
Created attachment 1780486 [details]
ramdisk 4.7 log

Comment 2 Bob Fournier 2021-05-06 21:36:01 UTC
Created attachment 1780487 [details]
ramdisk 4.6 log

Comment 3 Bob Fournier 2021-05-06 22:03:13 UTC
*** Bug 1957980 has been marked as a duplicate of this bug. ***

Comment 4 Bob Fournier 2021-05-07 00:45:10 UTC
Michael downgraded iLO on the passing cluster but it still passes. The only difference between the 2 clusters is one has 5 nodes (passes) and one has 19 (fails on 4.7 only).

Comment 5 Bob Fournier 2021-05-07 01:52:50 UTC
To clarify on the connection failure - Ironic times out when sending the deploy_steps, the ramdisk is taking longer than 60 seconds to process the deploy_steps command. Ironic then attempts a retry which results in the HTTP 409 response back from the ramdisk:

The ramdisk gets the deploy_steps command here:
May 06 13:45:29 master-0 ironic-python-agent[3208]: 2021-05-06 13:45:27.558 3208 DEBUG root [-] Executing command: deploy.execute_deploy_step with args: {'step': {'interface': 'deploy', 'step': 'write_image', 'args': {'image_info': {'id': 'rhcos-47.83.202103251640-0-compressed.x86_64.qcow2', 'urls': ['http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2

After 60 seconds it receives the retry from ironic:
May 06 13:46:34 master-0 rsyslogd[3212]: message too long (9588) with configured size 8096, begin of message is: 2021-05-06 13:46:32.241 3208 DEBUG root [-] Executing command: deploy.execute_de [v8.1911.0-6.el8 try https://www.rsyslog.com/e/2445 ]

The ramdisk sends the 409 back for the 2nd command:
May 06 13:46:35 master-0 ironic-python-agent[3208]: 2021-05-06 13:46:33.973 3208 ERROR root [-] Tried to execute deploy.execute_deploy_step, agent is still executing Command name: execute_deploy_step, params: {'step': {'interface': 'deploy', 'step': 'write_image', 'args': 


The timeout for ramdisk commands is configurable in Ironic (https://github.com/openstack/ironic/blob/master/ironic/conf/agent.py#L114) but not easily with Metal3. Need to figure out why the ramdisk is taking too long to process the deploy_steps command and doesn't complete in 60 seconds.

Comment 6 Dmitry Tantsur 2021-05-07 08:35:56 UTC
I'm working on a mitigation that will check the list of executing commands before retrying.

Comment 7 Dmitry Tantsur 2021-05-07 13:09:07 UTC
An observation so far: something gets stuck inside the agent for the whole 60 seconds until Ironic hits timeout and aborts the request. The retry fails because the work is actually in progress. The patch linked above is a mitigation for the latter: we will detect that the command is running and avoid a failure.

We still need to understand why the 1 minute pause appeared in 4.7. The most likely suspect is the introduction of TLS.

Comment 8 Dmitry Tantsur 2021-05-07 13:28:38 UTC
Actually I've just remembered that ramdisk TLS was already in place in 4.6, so TLS alone cannot be the cause of the issue.

Comment 11 Bob Fournier 2021-05-10 22:35:26 UTC
Looking at the logs that Michael provided it looks like the fix is working also. On the ironic-python-agent we still see the ssl error in eventlet 60 seconds after the execute_deploy_step command:

cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:08:48 master-1 ironic-python-agent[3342]: 2021-05-10 17:08:46.669 3342 DEBUG root [-] Executing command: deploy.execute_deploy_step with args: 
...
cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:09:51 master-1 rsyslogd[3343]: message too long (312338) with configured size 8096, begin of message is: 2021-05-10 17:08:48.675 3342 DEBUG root [-] Executing deploy step {'interface':  [v8.1911.0-6.el8 try https://www.rsyslog.com/e/2445 ]
cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:09:51 master-1 ironic-python-agent[3342]: 2021-05-10 17:09:51.582 3342 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
...
cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz:                                                         return self._sslobj.write(data)
cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz:                                                     BrokenPipeError: [Errno 32] Broken pipe

But on Ironic now that is being handled without an error:
log-bundle-20210510212033/bootstrap/pods/1db251fe4475.log:2021-05-10 21:09:51.411 1 DEBUG ironic.drivers.modules.agent_client [req-9b7a1e9e-ceaf-4d04-9921-33a0964b2
288 bootstrap-user - - - -] Agent command deploy.execute_deploy_step for node d70c228b-41e9-47cd-9bb7-a52d2ceed591 returned result None, error None, HTTP status cod
e unknown _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:212

Comment 13 Dmitry Tantsur 2021-05-18 11:00:50 UTC
Follow-up: https://storyboard.openstack.org/#!/story/2008904

Comment 16 errata-xmlrpc 2021-07-27 23:07:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438