Description of problem: Deployments with connectivity issues can have downloads still be in-flight causing deployment to fail with "agent is busy" errors. How reproducible: Can only be reproduced where retries are triggering causing file downloads to fail, in a race condition between the file download and the network fabric blocking packets. In other words, not easily reproducible except in environments where it is being encountered. Additional info: What is occurring is the agent is heart beating to the conductor service to signal it is online. If connectivity breaks between the initial image download/write request and the next heartbeat in a sufficiently long enough time where the first command is still in-progress attempting to download the files. As such, there is no completed command status for ironic to determine the next step, and it treats the agent as if this is the first heartbeat and sends it down the path of performing the file download, which was already in progress. From the conductor logging: 2020-06-22 18:48:04.823 1 DEBUG ironic.conductor.task_manager [-] Upgrading shared lock on node feb99f1f-853d-4184-8c86-1a54e6731633 for heartbeat to an exclusive one (shared lock was held 0.01 seconds) upgrade_lock /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:300^[[00m 2020-06-22 18:48:06.141 1 ERROR ironic.drivers.modules.agent_client [req-2b1b38ca-6357-4074-9e29-8e837e198c8e - - - - -] Failed to connect to the agent running on node 0f7b66c6-0dfa-4c69-ac65-a8c74e91b141 for invoking command standby.prepare_image. Error: HTTPConnectionPool(host='REDACTED', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe04b00d358>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',)): requests.exceptions.ConnectionError: HTTPConnectionPool(host='REDACTED', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe04b00d358>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))^[[00m 2020-06-22 18:48:06.142 1 DEBUG ironic.drivers.modules.agent_client [req-2b1b38ca-6357-4074-9e29-8e837e198c8e - - - - -] Executing agent command standby.prepare_image for node 0f7b66c6-0dfa-4c69-ac65-a8c74e91b141 _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:93^[[00m 2020-06-22 18:48:06.150 1 DEBUG ironic.drivers.modules.agent_client [req-aeb44145-3c59-470f-ab2a-e2a16979feff - - - - -] Agent command standby.prepare_image for node feb99f1f-853d-4184-8c86-1a54e6731633 returned result None, error None, HTTP status code 500 _command /usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_client.py:131^[[00m 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_client [req-aeb44145-3c59-470f-ab2a-e2a16979feff - - - - -] Agent command standby.prepare_image for node feb99f1f-853d-4184-8c86-1a54e6731633 failed. Expected 2xx HTTP status code, got 500.^[[00m 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor [req-aeb44145-3c59-470f-ab2a-e2a16979feff - - - - -] Asynchronous exception: Node failed to deploy. Exception: Agent API for node feb99f1f-853d-4184-8c86-1a54e6731633 returned HTTP status code 500 with error: Command execution failed: agent is busy for node feb99f1f-853d-4184-8c86-1a54e6731633: ironic.common.exception.AgentAPIError: Agent API for node feb99f1f-853d-4184-8c86-1a54e6731633 returned HTTP status code 500 with error: Command execution failed: agent is busy 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor Traceback (most recent call last): 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 389, in heartbeat 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor self.continue_deploy(task) 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor File "/usr/lib/python3.6/site-packages/ironic_lib/metrics.py", line 60, in wrapped 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor result = f(*args, **kwargs) 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor return f(*args, **kwargs) 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/agent.py", line 255, in continue_deploy 2020-06-22 18:48:06.150 1 ERROR ironic.drivers.modules.agent_base_vendor self._client.prepare_image(node, image_info) From the agent logs, trimmed to 200 characters per line. Jun 22 14:56:14 master-0.betadev4.REDACTED ironic-python-agent[3217]: 2020-06-22 14:56:14.330 3217 DEBUG root [-] Executing command: standby.prepare_image with args: {'image_info': {'id': 'rhcos-44.81 Jun 22 14:56:15 master-0.betadev4.REDACTED ironic-python-agent[3217]: 2020-06-22 14:56:14.935 3217 INFO root [-] Command standby.prepare_image completed: Command name: prepare_image, params: {'image_i Jun 22 14:58:08 master-0.betadev4.REDACTED ironic-python-agent[3217]: 2020-06-22 14:58:07.669 3217 DEBUG root [-] Executing command: standby.prepare_image with args: {'image_info': {'id': 'rhcos-44.81 Jun 22 14:58:08 master-0.betadev4.REDACTED ironic-python-agent[3217]: 2020-06-22 14:58:08.306 3217 ERROR root [-] Tried to execute standby.prepare_image, agent is still executing Command name: prepare The indicator for the agent is still executing command is an internal threading lock. As the thread is still running, and we see no signs that the download from the initial request has reached a completed state, we can see where things go sideways. Ironic fails the deployment erroneously at this point.
Could you paste the whole ramdisk and conductor logs? This statement does not seem correct from reading the IPA code: > As such, there is no completed command status for ironic to determine the next step The command should be recorded in RUNNING state and no duplicate should be run.