Bug 1972743

Summary: resource agent bails out when podman fails to start container under heavy load
Product: Red Hat Enterprise Linux 8
Reporter: Damien Ciabrini <dciabrin>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Priority: unspecified
Version: 8.4
CC: agk, cfeist, cluster-maint, dabarzil, fdinitto, jmarcian, phagara
Target Milestone: rc
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: resource-agents-4.1.1-97.el8
Cloned to: 1973035
Last Closed: 2021-11-09 17:27:32 UTC
Type: Bug
Bug Blocks: 1973035

Description Damien Ciabrini 2021-06-16 14:28:11 UTC
Description of problem:
Due to the race described in [1], podman may fail to create and run a container under heavy load. When that happens, the resource agent receives an error code from podman and stops immediately [2].

Jun 01 23:44:12 controller-0 pacemaker-controld[3284]:  notice: Requesting local execution of start operation for haproxy-bundle-podman-0 on controller-0                                                                             
Jun 01 23:44:16 controller-0 podman(haproxy-bundle-podman-0)[8338]: INFO: running container haproxy-bundle-podman-0 for the first time                                                                                                
Jun 01 23:44:19 controller-0 podman(haproxy-bundle-podman-0)[8867]: ERROR: Error: OCI runtime error: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: process_linux.go:422: setting cgroup config for procHooks process caused: Unit libpod-35c4a915bdc87f8a0c9c33eb43cba5f4470b5683882ed34d7a02bad6beb2bb53.scope not found.                                                                                         
Jun 01 23:44:19 controller-0 podman(haproxy-bundle-podman-0)[8896]: ERROR: podman failed to launch container
Jun 01 23:44:19 controller-0 pacemaker-execd[3278]:  notice: haproxy-bundle-podman-0_start_0[7089] error output [ ocf-exit-reason:podman failed to launch container ]                                                                 
Jun 01 23:44:19 controller-0 pacemaker-controld[3284]:  notice: Result of start operation for haproxy-bundle-podman-0 on controller-0: error                                                                                          
Jun 01 23:44:19 controller-0 pacemaker-controld[3284]:  notice: controller-0-haproxy-bundle-podman-0_start_0:103 [ ocf-exit-reason:podman failed to launch container\n ]                                                              
Jun 01 23:44:19 controller-0 pacemaker-attrd[3279]:  notice: Setting fail-count-haproxy-bundle-podman-0#start_0[controller-0]: (unset) -> INFINITY                                                                                    
Jun 01 23:44:19 controller-0 pacemaker-attrd[3279]:  notice: Setting last-failure-haproxy-bundle-podman-0#start_0[controller-0]: (unset) -> 1622591059                                                                                


Since a start failure blocks a resource on a host by default, we end up with one or several containers missing from the cluster, as pacemaker will not try to restart them until "pcs resource cleanup" is invoked manually.
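For reference, the manual recovery looks roughly like this (a sketch using standard pcs commands; the resource name matches the logs above):

```shell
# Inspect the recorded start failure (fail-count was set to INFINITY)
pcs resource failcount show haproxy-bundle-podman-0

# Clear the failure so pacemaker attempts the start again
pcs resource cleanup haproxy-bundle-podman-0
```

Until that cleanup is run, the container stays down on the affected node, which is why the agent should absorb the transient podman error itself.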

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1972209
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1967128

Version-Release number of selected component (if applicable):
https://bugzilla.redhat.com/show_bug.cgi?id=1972209

How reproducible:
Random

Steps to Reproduce:
1. Deploy an Openstack HA control plane
2. Reboot one node in the control plane

Actual results:
During the node restart, pacemaker fails to restart all of its resources because podman could not create a container.

Expected results:
The resource agent should retry container creation until the start operation times out, to overcome the race in "podman run".

Additional info:
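The retry behavior described above can be sketched as a small POSIX-shell helper (illustrative only, assuming a shell-based OCF agent; function and variable names are hypothetical, not the actual resource-agents patch):

```shell
#!/bin/sh
# Illustrative sketch only -- not the actual resource-agents fix.
# Retry a flaky command until it succeeds; in the agent this would wrap
# the "podman run" invocation. No internal deadline is needed because
# pacemaker enforces the start operation's timeout and kills the agent
# when it expires, so the loop cannot run forever.
retry_until_success() {
    until "$@"; do
        # transient failure (e.g. the podman/systemd cgroup race):
        # pause briefly, then try again
        sleep 1
    done
}

# In the agent, roughly:
#   retry_until_success podman run -d --name "$CONTAINER" ... "$IMAGE"
```

With this in place, the occasional "OCI runtime error ... scope not found" failure is retried instead of being reported to pacemaker as a fatal start error.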

Comment 8 Julia Marciano 2021-08-30 22:31:27 UTC
Verified, based on https://bugzilla.redhat.com/show_bug.cgi?id=1973035#c4.

Comment 10 errata-xmlrpc 2021-11-09 17:27:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: resource-agents security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4139