1660979 – podman: 'failed to pull image' should be a hard error for container agents (RHEL8)

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	rc
Target Release:	8.4
Assignee:	Oyvind Albrigtsen
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:	1682136 1743687
Blocks:

Reported:	2018-12-19 18:09 UTC by Ken Gaillot
Modified:	2023-09-14 04:44 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-11-23 09:21:37 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Description Ken Gaillot 2018-12-19 18:09:55 UTC

Description of problem: If "docker pull" fails during the docker agent's start action (and presumably the equivalent pull in the podman and rkt agents), the agent returns OCF_ERR_GENERIC. That is a soft error, so pacemaker retries the start on the same node, even though the failure is most likely not recoverable there.

Steps to Reproduce:
1. Configure a container resource with an invalid image name in pacemaker.

Actual results: Agent returns OCF_ERR_GENERIC, and pacemaker repeatedly retries start on same node (up to 1,000,000 times by default), getting same error each time.

Expected results: Agent returns OCF_ERR_ARGS or OCF_ERR_INSTALLED, indicating a problem with the local node, so pacemaker retries on another node.
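
The difference between the actual and expected results can be illustrated with a toy model. This is only a sketch of the behaviour described in this report, not Pacemaker's scheduler: the names `ErrorClass` and `start_attempts`, the node names, and the `retry_limit` parameter are all hypothetical, and the model assumes every start attempt fails with the same class of error. What happens after the retry limit is reached on the first node is deliberately not modelled.

```python
from enum import Enum


class ErrorClass(Enum):
    """How the cluster treats a failed start (illustrative names)."""
    SOFT = "soft"    # e.g. OCF_ERR_GENERIC: retry on the same node
    HARD = "hard"    # e.g. OCF_ERR_ARGS / OCF_ERR_INSTALLED: move to another node
    FATAL = "fatal"  # e.g. OCF_ERR_CONFIGURED: do not try any other node


def start_attempts(error_class, nodes, retry_limit):
    """Return the nodes on which a start is attempted, in order.

    Toy model: every attempt is assumed to fail with `error_class`.
    `retry_limit` stands in for the default limit quoted in this
    report (1,000,000); pass a small number to keep the list short.
    """
    if not nodes:
        return []
    if error_class is ErrorClass.SOFT:
        # Soft error: the same node is retried until the limit is hit.
        return [nodes[0]] * retry_limit
    if error_class is ErrorClass.HARD:
        # Hard error: each node is tried once, then the node is skipped.
        return list(nodes)
    # Fatal error: one failure rules out every node.
    return [nodes[0]]


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3"]
    for cls in ErrorClass:
        print(cls.value, start_attempts(cls, cluster, retry_limit=4))
```

With a soft error all attempts land on the first node, which is the long recovery path reported under "Actual results"; with a hard error each node is tried once.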

Additional info: The proper exit status is debatable. If the root cause is transient (e.g. network connectivity blip to a remote image repository), then the current OCF_ERR_GENERIC status is appropriate. If the root cause is a typo in the image name (or otherwise complete unavailability), then it really should be OCF_ERR_CONFIGURED (fatal on all nodes). If the root cause is that a local image hasn't been built, then OCF_ERR_ARGS or OCF_ERR_INSTALLED as recommended here makes sense.
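
The mapping argued for in the paragraph above can be written down as a small lookup table. This is a sketch for illustration only: the numeric values are the standard OCF exit codes, but `PullFailureCause`, `IDEAL_EXIT`, and `ideal_exit` are hypothetical names, and a real agent may have no reliable way to tell these root causes apart.

```python
from enum import Enum, IntEnum


class OcfExit(IntEnum):
    """Standard OCF exit codes discussed in this report."""
    OCF_ERR_GENERIC = 1     # soft error
    OCF_ERR_ARGS = 2        # hard error on this node
    OCF_ERR_INSTALLED = 5   # hard error on this node
    OCF_ERR_CONFIGURED = 6  # fatal on all nodes


class PullFailureCause(Enum):
    """Hypothetical root causes of a failed image pull."""
    TRANSIENT = "transient network problem"
    BAD_IMAGE_NAME = "typo or image completely unavailable"
    LOCAL_IMAGE_MISSING = "local image has not been built"


# Exit status that would be ideal if the agent knew the root cause.
# For LOCAL_IMAGE_MISSING the report accepts either OCF_ERR_ARGS or
# OCF_ERR_INSTALLED; OCF_ERR_INSTALLED is picked here arbitrarily.
IDEAL_EXIT = {
    PullFailureCause.TRANSIENT: OcfExit.OCF_ERR_GENERIC,
    PullFailureCause.BAD_IMAGE_NAME: OcfExit.OCF_ERR_CONFIGURED,
    PullFailureCause.LOCAL_IMAGE_MISSING: OcfExit.OCF_ERR_INSTALLED,
}


def ideal_exit(cause):
    """Look up the ideal exit status for a known root cause."""
    return IDEAL_EXIT[cause]


if __name__ == "__main__":
    for cause in PullFailureCause:
        code = ideal_exit(cause)
        print(f"{cause.value}: {code.name} ({int(code)})")
```

Because the root cause is usually not known when the pull fails, the agent has to return one status for all three cases, and the choice becomes a question of which wrong answer does the least harm.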

I'm leaning toward ARGS/INSTALLED. Using GENERIC when inappropriate leads to a very long recovery time, and using CONFIGURED when inappropriate makes the container unrecoverable without manual intervention. Using ARGS/INSTALLED when inappropriate will either jump too easily to another node or require manual intervention after all nodes fail, and that seems the least harmful scenario.

I didn't go through the container agents to see if other exit statuses could be re-evaluated, but that might be worthwhile.

Comment 6 Patrik Hagara 2019-08-20 13:35:26 UTC

qa_ack+

Reproducer in bug description.

Comment 13 Oyvind Albrigtsen 2020-11-23 09:21:37 UTC

Closing, because the bug to add the return code was closed: the code could not be added without causing flaky behavior.

Comment 14 Red Hat Bugzilla 2023-09-14 04:44:03 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
