Bug 1660979 - podman: 'failed to pull image' should be a hard error for container agents (RHEL8) [NEEDINFO]
Summary: podman: 'failed to pull image' should be a hard error for container agents (R...
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: resource-agents
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: 8.4
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1682136 1743687
Blocks:
Reported: 2018-12-19 18:09 UTC by Ken Gaillot
Modified: 2020-11-23 09:21 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-23 09:21:37 UTC
Type: Bug
Target Upstream Version:
michele: needinfo? (dciabrin)



Description Ken Gaillot 2018-12-19 18:09:55 UTC
Description of problem: If "docker pull" fails during the docker agent's start action (and presumably the equivalent pull in the podman and rkt agents), the agent returns OCF_ERR_GENERIC. That is a soft error, so pacemaker retries the start on the same node, even though the error is most likely not recoverable on that node.

Steps to Reproduce:
1. Configure a container resource with an invalid image name in pacemaker.

Actual results: The agent returns OCF_ERR_GENERIC, and pacemaker repeatedly retries the start on the same node (up to 1,000,000 times by default), getting the same error each time.

Expected results: The agent returns OCF_ERR_ARGS or OCF_ERR_INSTALLED, indicating a problem with the local node, so pacemaker retries on another node.
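
For illustration only, here is a minimal Python sketch of how the exit statuses discussed in this report relate to where pacemaker retries. The numeric values are the standard OCF ones; the enum, the table, and the recovery_for helper are hypothetical names invented for this sketch and are not part of pacemaker or resource-agents.

```python
from enum import IntEnum


class OCF(IntEnum):
    """OCF exit statuses named in this report (standard numeric values)."""
    ERR_GENERIC = 1
    ERR_ARGS = 2
    ERR_INSTALLED = 5
    ERR_CONFIGURED = 6


# Recovery behavior as characterized in this report; the wording of the
# strings is illustrative, not pacemaker output.
RECOVERY = {
    OCF.ERR_GENERIC: "retry on same node",
    OCF.ERR_ARGS: "retry on another node",
    OCF.ERR_INSTALLED: "retry on another node",
    OCF.ERR_CONFIGURED: "no retry on any node",
}


def recovery_for(rc: int) -> str:
    """Return the recovery behavior for an agent exit status (hypothetical helper)."""
    return RECOVERY[OCF(rc)]
```

With this table, the actual result corresponds to recovery_for(1) and the expected result to recovery_for(2) or recovery_for(5).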


Additional info: The proper exit status is debatable. If the root cause is transient (e.g. a network connectivity blip to a remote image repository), then the current OCF_ERR_GENERIC status is appropriate. If the root cause is a typo in the image name (or the image is otherwise completely unavailable), then it really should be OCF_ERR_CONFIGURED (fatal on all nodes). If the root cause is that a local image hasn't been built, then OCF_ERR_ARGS or OCF_ERR_INSTALLED, as recommended here, makes sense.

I'm leaning toward ARGS/INSTALLED. Using GENERIC when inappropriate leads to very long recovery time, and using CONFIGURED when inappropriate makes the container unrecoverable without manual intervention. Using ARGS/INSTALLED when inappropriate will either jump too easily to another node or require manual intervention after all nodes fail, and that seems the least harmful scenario.
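
The reasoning above can be sketched as a lookup from root cause to exit status, with a fallback for the case where the cause is unknown. This is only a sketch under stated assumptions: the cause labels and the pull_failure_status function are hypothetical, the sketch assumes the cause is already known (which an agent may not be able to determine reliably), and picking INSTALLED rather than ARGS is an arbitrary choice between the two statuses suggested here.

```python
# OCF exit statuses named in this report (standard OCF numeric values).
OCF_ERR_GENERIC = 1
OCF_ERR_INSTALLED = 5
OCF_ERR_CONFIGURED = 6

# Hypothetical cause labels invented for this sketch; the agents do not
# expose anything like them.
CAUSE_TO_STATUS = {
    "transient": OCF_ERR_GENERIC,              # e.g. network blip to a remote repository
    "bad_image_name": OCF_ERR_CONFIGURED,      # fatal on all nodes
    "local_image_missing": OCF_ERR_INSTALLED,  # problem with the local node
}


def pull_failure_status(cause: str) -> int:
    """Map a pull-failure cause to an exit status.

    Unknown causes fall back to OCF_ERR_INSTALLED, the option this report
    argues is least harmful when chosen inappropriately.
    """
    return CAUSE_TO_STATUS.get(cause, OCF_ERR_INSTALLED)
```

The fallback is the point of the proposal: when the cause cannot be determined, a node-local hard error is preferred over GENERIC or CONFIGURED.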

I didn't go through the container agents to see if other exit statuses could be re-evaluated, but that might be worthwhile.

Comment 6 Patrik Hagara 2019-08-20 13:35:26 UTC
qa_ack+

Reproducer in bug description.

Comment 13 Oyvind Albrigtsen 2020-11-23 09:21:37 UTC
Closing, as the bz to add the rc code was closed because the code could not be added without flaky behavior.

