Bug 1660979 - podman: 'failed to pull image' should be a hard error for container agents (RHEL8) [NEEDINFO]
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: resource-agents
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: 8.3
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1743687 1682136
Blocks:
Reported: 2018-12-19 18:09 UTC by Ken Gaillot
Modified: 2020-01-06 10:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
michele: needinfo? (dciabrin)


Description Ken Gaillot 2018-12-19 18:09:55 UTC
Description of problem: If "docker pull" fails in the docker agent's start action (and presumably the equivalent pull in the podman and rkt agents), the agent returns OCF_ERR_GENERIC. That is a soft error, so pacemaker retries the start on the same node, even though the error is most likely not recoverable there.

Steps to Reproduce:
1. Configure a container resource with an invalid image name in pacemaker.

Actual results: The agent returns OCF_ERR_GENERIC, and pacemaker repeatedly retries the start on the same node (up to 1,000,000 times by default), getting the same error each time.

Expected results: Agent returns OCF_ERR_ARGS or OCF_ERR_INSTALLED, indicating a problem with the local node, so pacemaker retries on another node.


Additional info: The proper exit status is debatable. If the root cause is transient (e.g. network connectivity blip to a remote image repository), then the current OCF_ERR_GENERIC status is appropriate. If the root cause is a typo in the image name (or otherwise complete unavailability), then it really should be OCF_ERR_CONFIGURED (fatal on all nodes). If the root cause is that a local image hasn't been built, then OCF_ERR_ARGS or OCF_ERR_INSTALLED as recommended here makes sense.
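To make the three cases concrete, here is a minimal sketch of the mapping from root cause to exit status. It is written in Python purely for illustration (the agents themselves are shell scripts), and the PullFailure categories and the exit_status_for helper are hypothetical: the agents do not classify pull failures this way today, and in practice they may not be able to tell the causes apart. Only the numeric OCF exit codes are standard. Returning OCF_ERR_INSTALLED for an unknown cause is an assumption that follows the ARGS/INSTALLED recommendation in this report.

```python
from enum import Enum

# Standard OCF exit codes relevant to this report.
OCF_ERR_GENERIC = 1      # soft error
OCF_ERR_ARGS = 2         # hard error (local to the node)
OCF_ERR_INSTALLED = 5    # hard error (local to the node)
OCF_ERR_CONFIGURED = 6   # fatal error (all nodes)


class PullFailure(Enum):
    """Hypothetical root-cause categories; not part of any agent."""
    TRANSIENT = "transient"                   # e.g. network blip to the registry
    BAD_IMAGE_NAME = "bad image name"         # typo or image unavailable anywhere
    LOCAL_IMAGE_MISSING = "local image missing"
    UNKNOWN = "unknown"                       # cause cannot be determined


def exit_status_for(cause: PullFailure) -> int:
    """Return the exit status this report argues for, per root cause."""
    if cause is PullFailure.TRANSIENT:
        return OCF_ERR_GENERIC
    if cause is PullFailure.BAD_IMAGE_NAME:
        return OCF_ERR_CONFIGURED
    # LOCAL_IMAGE_MISSING and UNKNOWN: assume the problem is local to this node.
    return OCF_ERR_INSTALLED
```

For example, exit_status_for(PullFailure.UNKNOWN) yields OCF_ERR_INSTALLED, which is the case the agent would hit most often if it cannot diagnose why the pull failed.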

I'm leaning toward ARGS/INSTALLED because it seems the least harmful choice when it turns out to be wrong. Using GENERIC when inappropriate leads to a very long recovery time, and using CONFIGURED when inappropriate makes the container unrecoverable without manual intervention. Using ARGS/INSTALLED when inappropriate will either move to another node too readily or require manual intervention after all nodes fail.
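The recovery consequences of each choice can be sketched as a small model. This is a deliberate simplification in Python, not pacemaker's actual scheduling logic: it ignores fail counts, migration-threshold, and constraints, and the start_candidates helper and node names are hypothetical. It only shows which nodes remain eligible for the next start attempt after a failure of each error class.

```python
# Standard OCF exit codes and the error class pacemaker assigns to each.
OCF_ERR_GENERIC = 1
OCF_ERR_ARGS = 2
OCF_ERR_INSTALLED = 5
OCF_ERR_CONFIGURED = 6

ERROR_CLASS = {
    OCF_ERR_GENERIC: "soft",      # retry, the same node stays eligible
    OCF_ERR_ARGS: "hard",         # do not retry on the failed node
    OCF_ERR_INSTALLED: "hard",    # do not retry on the failed node
    OCF_ERR_CONFIGURED: "fatal",  # do not start on any node
}


def start_candidates(status, failed_node, nodes):
    """Hypothetical helper: nodes still eligible after a failed start."""
    kind = ERROR_CLASS[status]
    if kind == "soft":
        return list(nodes)
    if kind == "hard":
        return [n for n in nodes if n != failed_node]
    return []  # fatal: manual intervention needed


nodes = ["node1", "node2", "node3"]
print(start_candidates(OCF_ERR_GENERIC, "node1", nodes))     # all three nodes
print(start_candidates(OCF_ERR_INSTALLED, "node1", nodes))   # node2 and node3
print(start_candidates(OCF_ERR_CONFIGURED, "node1", nodes))  # no nodes
```

With a hard error, each failing node drops out in turn, so a genuinely bad image name ends with no eligible nodes only after every node has been tried once, rather than after a long run of retries on one node.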

I didn't go through the container agents to see if other exit statuses could be re-evaluated, but that might be worthwhile.

Comment 6 Patrik Hagara 2019-08-20 13:35:26 UTC
qa_ack+

Reproducer in bug description.

