Bug 1859134

Summary: Switch to periodic process reaper for collecting zombie processes
Product: OpenShift Container Platform Reporter: Andrew McDermott <amcdermo>
Component: Routing Assignee: Stephen Greene <sgreene>
Status: CLOSED ERRATA QA Contact: Arvind iyengar <aiyengar>
Severity: high Docs Contact:
Priority: medium    
Version: 4.6 CC: aiyengar, akretzsc, alchan, aos-bugs, bbennett, sgreene
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1885688 Environment:
Last Closed: 2020-10-27 16:16:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1885688    

Description Andrew McDermott 2020-07-21 10:25:29 UTC
We need the change in https://github.com/openshift/router/pull/111, which will fix https://github.com/openshift/router/pull/78.

"We were getting races when reloading haproxy via the reload-haproxy script because we had our own process reaper (StartReaper). Occasionally the reload would report no child processes and this happened when StartReaper had already reaped the reload script that we were independently waiting on elsewhere."

Comment 1 Andrew McDermott 2020-07-30 10:10:47 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 2 mfisher 2020-08-18 19:55:58 UTC
Target reset to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.

Comment 3 Andrew McDermott 2020-09-10 11:50:58 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 7 Arvind iyengar 2020-10-09 08:28:28 UTC
Verified in the "4.6.0-0.nightly-2020-10-08-210814" release. With this payload, the periodic reaper errors do not occur. For reference, the unpatched version produces messages like these in the router logs:
------
$ oc -n openshift-ingress logs router-default-689c5d5bb7-w5bbh --tail 10
E1009 05:15:33.741661       1 haproxy.go:416] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
E1009 05:30:42.298061       1 limiter.go:165] error reloading router: waitid: no child processes
------

Comment 10 errata-xmlrpc 2020-10-27 16:16:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196