Bug 1972743

Summary: resource agent bails out when podman fails to start container under heavy load
Product: Red Hat Enterprise Linux 8
Reporter: Damien Ciabrini <dciabrin>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Priority: unspecified
Version: 8.4
CC: agk, cfeist, cluster-maint, dabarzil, fdinitto, jmarcian, phagara
Target Milestone: rc
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: resource-agents-4.1.1-97.el8
Cloned to: 1973035
Last Closed: 2021-11-09 17:27:32 UTC
Type: Bug
Bug Blocks: 1973035

Description Damien Ciabrini 2021-06-16 14:28:11 UTC
Description of problem:
Due to the race described in [1], podman may fail to create and run a container under heavy load. When that happens, the resource agent receives an error code from podman and stops immediately [2].

Jun 01 23:44:12 controller-0 pacemaker-controld[3284]:  notice: Requesting local execution of start operation for haproxy-bundle-podman-0 on controller-0                                                                             
Jun 01 23:44:16 controller-0 podman(haproxy-bundle-podman-0)[8338]: INFO: running container haproxy-bundle-podman-0 for the first time                                                                                                
Jun 01 23:44:19 controller-0 podman(haproxy-bundle-podman-0)[8867]: ERROR: Error: OCI runtime error: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: process_linux.go:422: setting cgroup config for procHooks process caused: Unit libpod-35c4a915bdc87f8a0c9c33eb43cba5f4470b5683882ed34d7a02bad6beb2bb53.scope not found.                                                                                         
Jun 01 23:44:19 controller-0 podman(haproxy-bundle-podman-0)[8896]: ERROR: podman failed to launch container
Jun 01 23:44:19 controller-0 pacemaker-execd[3278]:  notice: haproxy-bundle-podman-0_start_0[7089] error output [ ocf-exit-reason:podman failed to launch container ]                                                                 
Jun 01 23:44:19 controller-0 pacemaker-controld[3284]:  notice: Result of start operation for haproxy-bundle-podman-0 on controller-0: error                                                                                          
Jun 01 23:44:19 controller-0 pacemaker-controld[3284]:  notice: controller-0-haproxy-bundle-podman-0_start_0:103 [ ocf-exit-reason:podman failed to launch container\n ]                                                              
Jun 01 23:44:19 controller-0 pacemaker-attrd[3279]:  notice: Setting fail-count-haproxy-bundle-podman-0#start_0[controller-0]: (unset) -> INFINITY                                                                                    
Jun 01 23:44:19 controller-0 pacemaker-attrd[3279]:  notice: Setting last-failure-haproxy-bundle-podman-0#start_0[controller-0]: (unset) -> 1622591059                                                                                


Since a start failure blocks a resource on a host by default, we end up with one or several containers missing from the cluster, as pacemaker will not try to restart them until "pcs resource cleanup" is invoked manually.
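For reference, the manual recovery looks roughly like this (a sketch using standard pcs commands; the resource name matches the logs above):

```shell
# Inspect the recorded start failure (fail-count was set to INFINITY)
pcs resource failcount show haproxy-bundle-podman-0

# Clear the failure so pacemaker attempts the start again
pcs resource cleanup haproxy-bundle-podman-0
```

Until that cleanup is run, the container stays down on the affected node, which is why the agent should absorb the transient podman error itself.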

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1972209
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967128

Version-Release number of selected component (if applicable):
https://bugzilla.redhat.com/show_bug.cgi?id=1972209

How reproducible:
Random

Steps to Reproduce:
1. Deploy an Openstack HA control plane
2. Reboot one node in the control plane

Actual results:
During the node restart, pacemaker fails to restart all of its resources because podman could not create a container.

Expected results:
The resource agent should retry container creation until the start operation times out, to overcome the race in "podman run".

Additional info:
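The retry behavior described above can be sketched as a small POSIX-shell helper (illustrative only, assuming a shell-based OCF agent; function and variable names are hypothetical, not the actual resource-agents patch):

```shell
#!/bin/sh
# Illustrative sketch only -- not the actual resource-agents fix.
# Retry a flaky command until it succeeds; in the agent this would wrap
# the "podman run" invocation. No internal deadline is needed because
# pacemaker enforces the start operation's timeout and kills the agent
# when it expires, so the loop cannot run forever.
retry_until_success() {
    until "$@"; do
        # transient failure (e.g. the podman/systemd cgroup race):
        # pause briefly, then try again
        sleep 1
    done
}

# In the agent, roughly:
#   retry_until_success podman run -d --name "$CONTAINER" ... "$IMAGE"
```

With this in place, the occasional "OCI runtime error ... scope not found" failure is retried instead of being reported to pacemaker as a fatal start error.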

Comment 8 Julia Marciano 2021-08-30 22:31:27 UTC
Verified, based on https://bugzilla.redhat.com/show_bug.cgi?id=1973035#c4.

Comment 10 errata-xmlrpc 2021-11-09 17:27:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: resource-agents security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4139