Bug 1957976
Summary: | Ironic execute_deploy_steps command to ramdisk times out, resulting in a failed deployment in 4.7 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Bob Fournier <bfournie> | ||||||||
Component: | Bare Metal Hardware Provisioning | Assignee: | Dmitry Tantsur <dtantsur> | ||||||||
Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Amit Ugol <augol> | ||||||||
Status: | CLOSED ERRATA | Docs Contact: | |||||||||
Severity: | urgent | ||||||||||
Priority: | high | CC: | dcritch, derekh, dtantsur, mmethot, mzamot, tsedovic | ||||||||
Version: | 4.7 | Keywords: | Triaged | ||||||||
Target Milestone: | --- | ||||||||||
Target Release: | 4.8.0 | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||
Doc Text: |
Cause: When sending large packets from Ironic to the ramdisk, particularly if the the mtu is not confiugred correctly, connections failures can result.
Consequence: The deployment will fail because Ironic did not get a response from the ramdisk.
Fix: Ironic will query the ramdisk to determine the last command that it ran to work around the connection error.
Result: The deployment will succeed.
|
Story Points: | --- | ||||||||
Clone Of: | |||||||||||
: | 1958965 (view as bug list) | Environment: | |||||||||
Last Closed: | 2021-07-27 23:07:08 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1958965 | ||||||||||
Attachments: |
|
Description
Bob Fournier
2021-05-06 21:33:34 UTC
Created attachment 1780486 [details]
ramdisk 4.7 log
Created attachment 1780487 [details]
ramdisk 4.6 log
*** Bug 1957980 has been marked as a duplicate of this bug. *** Michael downgraded iLO on the passing cluster but it still passes. The only difference between the 2 clusters is one has 5 nodes (passes) and one has 19 (fails on 4.7 only). To clarify on the connection failure - Ironic times out when sending the deploy_steps, the ramdisk is taking longer than 60 seconds to process the deploy_steps command. Ironic then attempts a retry which results in the HTTP 409 response back from the ramdisk: The ramdisk gets the deploy_steps command here: May 06 13:45:29 master-0 ironic-python-agent[3208]: 2021-05-06 13:45:27.558 3208 DEBUG root [-] Executing command: deploy.execute_deploy_step with args: {'step': {'interface': 'deploy', 'step': 'write_image', 'args': {'image_info': {'id': 'rhcos-47.83.202103251640-0-compressed.x86_64.qcow2', 'urls': ['http://172.22.0.2:80/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-compressed.x86_64.qcow2 After 60 seconds it receives the retry from ironic: May 06 13:46:34 master-0 rsyslogd[3212]: message too long (9588) with configured size 8096, begin of message is: 2021-05-06 13:46:32.241 3208 DEBUG root [-] Executing command: deploy.execute_de [v8.1911.0-6.el8 try https://www.rsyslog.com/e/2445 ] The ramdisk sends the 409 back for the 2nd command: May 06 13:46:35 master-0 ironic-python-agent[3208]: 2021-05-06 13:46:33.973 3208 ERROR root [-] Tried to execute deploy.execute_deploy_step, agent is still executing Command name: execute_deploy_step, params: {'step': {'interface': 'deploy', 'step': 'write_image', 'args': The timeout for ramdisk commands is configurable in Ironic (https://github.com/openstack/ironic/blob/master/ironic/conf/agent.py#L114) but not easily with Metal3. Need to figure out why the ramdisk is taking too long to process the deploy_steps command and doesn't complete in 60 seconds. I'm working on a mitigation that will check the list of executing commands before retrying. An observation so far: something gets stuck inside the agent for the whole 60 seconds until Ironic hits timeout and aborts the request. The retry fails because the work is actually in progress. The patch linked above is a mitigation for the latter: we will detect that the command is running and avoid a failure. We still need to understand why the 1 minute pause appeared in 4.7. The most likely suspect is the introduction of TLS. Actually I've just remembered that ramdisk TLS was already in place in 4.6, so TLS alone cannot be the cause of the issue. Looking at the logs that Michael provided it looks like the fix is working also. On the ironic-python-agent we still see the ssl error in eventlet 60 seconds after the execute_deploy_step command: cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:08:48 master-1 ironic-python-agent[3342]: 2021-05-10 17:08:46.669 3342 DEBUG root [-] Executing command: deploy.execute_deploy_step with args: ... cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:09:51 master-1 rsyslogd[3343]: message too long (312338) with configured size 8096, begin of message is: 2021-05-10 17:08:48.675 3342 DEBUG root [-] Executing deploy step {'interface': [v8.1911.0-6.el8 try https://www.rsyslog.com/e/2445 ] cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: May 10 17:09:51 master-1 ironic-python-agent[3342]: 2021-05-10 17:09:51.582 3342 INFO eventlet.wsgi.server [-] Traceback (most recent call last): ... cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: return self._sslobj.write(data) cb860729-639f-4a82-b236-845dcbd72c4c_d6dc5b34-7d40-43bb-9780-a430c186d4ba_2021-05-10-21-12-11.tar.gz: BrokenPipeError: [Errno 32] Broken pipe But on Ironic now that is being handled without an error: log-bundle-20210510212033/bootstrap/pods/1db251fe4475.log:2021-05-10 21:09:51.411 1 DEBUG ironic.drivers.modules.agent_client [req-9b7a1e9e-ceaf-4d04-9921-33a0964b2 288 bootstrap-user - - - -] Agent command deploy.execute_deploy_step for node d70c228b-41e9-47cd-9bb7-a52d2ceed591 returned result None, error None, HTTP status cod e unknown _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:212 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438 |