Bug 1850213 - Extended connectivity outages with agent while downloading an image can collide with the next heartbeat operation
Summary: Extended connectivity outages with agent while downloading an image can colli...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.4
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
: 4.6.0
Assignee: Julia Kreger
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-06-23 18:20 UTC by Julia Kreger
Modified: 2020-08-11 14:06 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-11 14:06:54 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 737595 0 None NEW Handle "agent is busy" errors during deployment 2020-08-11 13:38:15 UTC

Description Julia Kreger 2020-06-23 18:20:19 UTC
Description of problem:

Deployments with connectivity issues can have downloads still be in-flight causing deployment to fail with "agent is busy" errors.

How reproducible:

Can only be reproduced where retries are triggering causing file downloads to fail, in a race condition between the file download and the network fabric blocking packets.

In other words, not easily reproducible except in environments where it is being encountered.

Additional info:

What is occurring is the agent is heart beating to the conductor service to signal it is online. If connectivity breaks between the initial image download/write request and the next heartbeat in a sufficiently long enough time where the first command is still in-progress attempting to download the files.

As such, there is no completed command status for ironic to determine the next step, and it treats the agent as if this is the first heartbeat and sends it down the path of performing the file download, which was already in progress.

From the conductor logging:

2020-06-22 18:48:04.823 1 DEBUG ironic.conductor.task_manager [-] Upgrading shared lock on node feb99f1f-853d-4184-8c86-1a54e6731633 for heartbeat to an exclusive one (shared lock was held 0.01 seconds) upgrade_lock /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:300^[[00m
2020-06-22 18:48:06.141 1 ERROR ironic.drivers.modules.agent_client [req-2b1b38ca-6357-4074-9e29-8e837e198c8e - - - - -] Failed to connect to the agent running on node 0f7b66c6-0dfa-4c69-ac65-a8c74e91b141 for invoking command standby.prepare_image. Error: HTTPConnectionPool(host='REDACTED', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe04b00d358>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',)): requests.exceptions.ConnectionError: HTTPConnectionPool(host='REDACTED', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe04b00d358>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))^[[00m
2020-06-22 18:48:06.142 1 DEBUG ironic.drivers.modules.agent_client [req-2b1b38ca-6357-4074-9e29-8e837e198c8e - - - - -] Executing agent command standby.prepare_image for node 0f7b66c6-0dfa-4c69-ac65-a8c74e91b141 _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:93^[[00m
2020-06-22 18:48:06.150 1 DEBUG ironic.drivers.modules.agent_client [req-aeb44145-3c59-470f-ab2a-e2a16979feff - - - - -] Agent command standby.prepare_image for node feb99f1f-853d-4184-8c86-1a54e6731633 returned result None, error None, HTTP status code 500 _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:131^[[00m
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_client [req-aeb44145-3c59-470f-ab2a-e2a16979feff - - - - -] Agent command standby.prepare_image for node feb99f1f-853d-4184-8c86-1a54e6731633 failed. Expected 2xx HTTP status code, got 500.^[[00m
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor [req-aeb44145-3c59-470f-ab2a-e2a16979feff - - - - -] Asynchronous exception: Node failed to deploy. Exception: Agent API for node feb99f1f-853d-4184-8c86-1a54e6731633 returned HTTP status code 500 with error: Command execution failed: agent is busy for node feb99f1f-853d-4184-8c86-1a54e6731633: ironic.common.exception.AgentAPIError: Agent API for node feb99f1f-853d-4184-8c86-1a54e6731633 returned HTTP status code 500 with error: Command execution failed: agent is busy
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor Traceback (most recent call last):
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 389, in heartbeat
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor     self.continue_deploy(task)
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python3.6/site-packages/ironic_lib/metrics.py", line 60, in wrapped
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor     result = f(*args, **kwargs)
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor     return f(*args, **kwargs)
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/agent.py", line 255, in continue_deploy
2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor     self._client.prepare_image(node, image_info)

From the agent logs, trimmed to 200 characters per line.

Jun 22 14:56:14 master-0.betadev4.REDACTED ironic-python-agent[3217]: 2020-06-22 14:56:14.330 3217 DEBUG root [-] Executing command: standby.prepare_image with args: {'image_info': {'id': 'rhcos-44.81
Jun 22 14:56:15 master-0.betadev4.REDACTED ironic-python-agent[3217]: 2020-06-22 14:56:14.935 3217 INFO root [-] Command standby.prepare_image completed: Command name: prepare_image, params: {'image_i
Jun 22 14:58:08 master-0.betadev4.REDACTED ironic-python-agent[3217]: 2020-06-22 14:58:07.669 3217 DEBUG root [-] Executing command: standby.prepare_image with args: {'image_info': {'id': 'rhcos-44.81
Jun 22 14:58:08 master-0.betadev4.REDACTED ironic-python-agent[3217]: 2020-06-22 14:58:08.306 3217 ERROR root [-] Tried to execute standby.prepare_image, agent is still executing Command name: prepare

The indicator for the agent is still executing command is an internal threading lock. As the thread is still running, and we see no signs that the download from the initial request has reached a completed state, we can see where things go sideways. Ironic fails the deployment erroneously at this point.

Comment 1 Dmitry Tantsur 2020-06-26 07:45:19 UTC
Could you paste the whole ramdisk and conductor logs? This statement does not seem correct from reading the IPA code:

> As such, there is no completed command status for ironic to determine the next step

The command should be recorded in RUNNING state and no duplicate should be run.


Note You need to log in before you can comment on or make changes to this bug.