Bug 1700314

Summary: jenkins-slave produce process defunct [ Jenkins "SLAVE" ]
Product: OpenShift Container Platform
Reporter: Madhusudan Upadhyay <maupadhy>
Component: ImageStreams
Assignee: Gabe Montero <gmontero>
Status: CLOSED ERRATA
QA Contact: XiuJuan Wang <xiuwang>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.9.0
CC: aos-bugs, fgrosjea, gmontero, jokerman, mmccomas, vbobade, wzheng
Target Milestone: ---
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Long-running Jenkins agent/slave pods can experience the same defunct-process phenomenon previously observed with the Jenkins master.
Consequence: Many defunct processes show up in process listings until the pod is terminated.
Fix: Employ `dumb-init`, as in the openshift/jenkins master image, to clean up the defunct processes that occur during Jenkins job processing.
Result: Process listings within agent/slave pods, and on the hosts where those pods reside, no longer include defunct processes.
Story Points: ---
Clone Of:
Cloned to: 1705123, 1707447, 1707448, 1718379
Environment:
Last Closed: 2019-07-05 06:58:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1705123, 1707447, 1707448, 1718379    

Description Madhusudan Upadhyay 2019-04-16 09:55:00 UTC
Description of problem:

Since OpenShift was upgraded from v3.7 to v3.9, jenkins-slave pods produce defunct processes.
jenkins : openshift3/jenkins-slave-base-rhel7:v3.9.68
jenkins-slave : openshift3/jenkins-slave-base-rhel7:v3.9.68

What happens?

Example snippet of the error:
~~~~
I have this process:
default    9077   9072  0 14:32 ?        00:00:00 sleep 3
Its parent (PPID 9072) is:
default    9072   9070  0 14:32 ?        00:00:00 sh -c { while [ -d '/var/lib/jenkins/workspace/.../.../' -a \! -f '/var/lib/jenkins/workspace/.../.../jenkins-result.txt' ]; do touch '/var/lib/jenkins/workspace/.../..../jenkins-log.txt'; sleep 3; done } & jsc=durable-..... ; JENKINS_SERVER_COOKIE=$jsc '/var/lib/jenkins/workspace/.../...' > '/var/lib/jenkins/workspace/.../.../jenkins-log.txt' 2>&1; echo $? > '/var/lib/jenkins/workspace/.../..../jenkins-result.txt'; wait

After the parent exits, PID 9077 becomes:
default    9077      1  0 14:32 ?        00:00:00 [sleep]  
~~~~
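
The listing shows the classic orphan-then-zombie sequence: the `sh -c` wrapper (PID 9072) spawned by Jenkins' durable-task machinery exits while `sleep 3` is still running, so the sleep is reparented to PID 1; when it exits, nothing calls wait() on it and it stays defunct. A minimal sketch that reproduces the reaping gap outside of OpenShift (the image name is illustrative; any base image that ships `ps` works):
~~~
# PID 1 here ends up being 'sleep 300' (via exec), which never calls wait(),
# so the orphaned 'sleep 1' becomes defunct as soon as it exits.
docker run -d --name zombie-demo registry.access.redhat.com/rhel7 \
    sh -c '(sleep 1 &); exec sleep 300'
sleep 5
docker exec zombie-demo ps -ef   # shows: [sleep] <defunct> with PPID 1
docker rm -f zombie-demo
~~~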

Version-Release number of selected component (if applicable):

OpenShift v3.9

How reproducible:

It happens every day: the more applications are deployed, the more defunct processes appear.
For example, today at 16:00 I had 113 defunct processes.
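
For reference, a count like that can be gathered with a one-liner (assuming a standard procps `ps` on the node or in the pod):
~~~
# zombie processes have a state (STAT) beginning with Z
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/' | wc -l
~~~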



Actual results:

jenkins-slave pods produce defunct processes



Additional info:

There is also an associated upstream issue (it covers the master; this bug tracks the slave):
~~~
- Related GitHub issue:
  https://github.com/openshift/release/issues/432
~~~

- I also understand that this is a PID 1 issue: the well-known zombie-reaping problem that occurs when the container has no proper init process to wait() on orphaned children. Please let me know if any additional information is needed.
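
The eventual fix (per the Doc Text above and the verification below) wraps the agent entrypoint in `dumb-init`, which runs as PID 1, forwards signals, and reaps any orphaned children it adopts. Repeating the earlier sketch with dumb-init as PID 1 (assuming an image that has dumb-init installed), no zombie survives:
~~~
docker run -d --name reap-demo <image-with-dumb-init> \
    /usr/bin/dumb-init -- sh -c '(sleep 1 &); exec sleep 300'
sleep 5
docker exec reap-demo ps -ef   # no <defunct> entry: dumb-init reaped the orphan
docker rm -f reap-demo
~~~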

A few additional clarifications:

[ Questions ]
- the jenkins master pod logs and any slave/agent pod logs in question
- if you get a process listing like the one in the email, make sure it is clear which pod it came from
- if it is an agent/slave pod, are they using the k8s plugin to launch it
- and confirm whether they are using the client plugin, and get the pipeline code if possible

[ Answers ]

- Added as private to preserve customer sensitive data.

Comment 20 Gabe Montero 2019-05-13 14:51:06 UTC
Turns out we will need https://github.com/openshift/ocp-build-data/pull/122 to merge before we can start getting slave builds at osbs/brew with dumb-init

Comment 21 Gabe Montero 2019-06-10 14:18:19 UTC
looks like we have new images on brew for this

Comment 22 XiuJuan Wang 2019-06-12 05:56:41 UTC
Verified with 
openshift3/jenkins-slave-maven-rhel7:v3.9 (v3.9.82)
openshift3/jenkins-slave-nodejs-rhel7:v3.9 (v3.9.82)
openshift3/jenkins-2-rhel7:v3.9 (v3.9.82)

Steps:
1. Create a Jenkins server and maven/nodejs pipeline buildconfigs.
2. Log in to the Jenkins console and set the maven/nodejs pod idle time to 30 minutes.
3. Trigger the maven and nodejs pipeline builds.
4. Rsh into the slave pods when the idle time is almost up.
The dumb-init process reaps exited children, so no defunct processes exist.
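
One way to spot-check for leftover zombies (not part of the original verification steps) is to filter on process state directly:
~~~
# empty output means no defunct processes in the agent pod
oc -n xiu rsh maven-c7tnh ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
~~~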


maven-c7tnh                           1/1       Running     0          27m
nodejs-bgj5c                          1/1       Running     0          28m

# oc -n xiu rsh maven-c7tnh 
sh-4.2$ ps -ef 
UID         PID   PPID  C STIME TTY          TIME CMD
default       1      0  0 05:28 ?        00:00:00 /usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client 47e9f7e9ac1b98d05da6227ee6e0f599080f1ad35ba70881fad70b84c0245096 maven-c7tnh
default       8      1  1 05:28 ?        00:00:16 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -cp /home/jenkins/remoting.jar hudson.re
default     253      0  0 05:51 ?        00:00:00 /bin/sh
default     262    253  0 05:52 ?        00:00:00 ps -ef
sh-4.2$ exit
# oc rsh  -n xiu nodejs-bgj5c 
sh-4.2$ ps -ef 
UID         PID   PPID  C STIME TTY          TIME CMD
default       1      0  0 05:27 ?        00:00:00 /usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client 67e18a79f35d805216d7af10c13a54a29616639caa2830e362f7f0e8e3272051 nodejs-bgj5c
default       8      1  1 05:27 ?        00:00:18 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -cp /home/jenkins/remoting.jar hudson.re
default     416      0  0 05:52 ?        00:00:00 /bin/sh
default     425    416  0 05:52 ?        00:00:00 ps -ef

Comment 24 errata-xmlrpc 2019-07-05 06:58:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1642