Description of problem:
Since OpenShift was upgraded from v3.7 to v3.9, jenkins-slave pods produce defunct processes.

jenkins: openshift3/jenkins-slave-base-rhel7:v3.9.68
jenkins-slave: openshift3/jenkins-slave-base-rhel7:v3.9.68

What happens? Example snippet of the error:
~~~~
I have this process:
default   9077  9072  0 14:32 ?  00:00:00 sleep 3

Its PPID (9072) is:
default   9072  9070  0 14:32 ?  00:00:00 sh -c { while [ -d '/var/lib/jenkins/workspace/.../.../' -a \! -f '/var/lib/jenkins/workspace/.../.../jenkins-result.txt' ]; do touch '/var/lib/jenkins/workspace/.../..../jenkins-log.txt'; sleep 3; done } & jsc=durable-..... ; JENKINS_SERVER_COOKIE=$jsc '/var/lib/jenkins/workspace/.../...' > '/var/lib/jenkins/workspace/.../.../jenkins-log.txt' 2>&1; echo $? > '/var/lib/jenkins/workspace/.../..../jenkins-result.txt'; wait

When the parent exits, PID 9077 becomes:
default   9077     1  0 14:32 ?  00:00:00 [sleep]
~~~~

Version-Release number of selected component (if applicable):
OpenShift v3.9

How reproducible:
It happens every day. The more applications are deployed, the more defunct processes are present. For example, today at 16:00 there were 113 defunct processes.

Actual results:
jenkins-slave pods produce defunct processes.

Additional info:
There is also an associated issue; however, that one tracks the master image and this bug needs to track the slave:
~~~
- Related GitHub issue: https://github.com/openshift/release/issues/432
~~~
I also understand that this is a PID 1 issue: the common zombie-reaping problem that occurs when a container's PID 1 is not a properly functioning init process. Please let me know in case any additional information is needed.

A few additional clarifications:

[ Questions ]
- the Jenkins master pod logs and any slave/agent pod logs in question
- if you get a process listing like the one in the email, make sure it is clear which pod it came from
- if it is an agent/slave pod, are they using the k8s plugin to launch it?
- confirm whether they are using the client plugin, and get the pipeline code if possible

[ Answers ]
- Added as private to preserve customer-sensitive data.
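For anyone hitting this, one quick way to gauge the scale of the problem is to count zombie entries in a pod's process table. A minimal sketch, assuming `<namespace>` and `<slave-pod>` are placeholders for your own environment (they are not values from this bug):
~~~~
# Count defunct (zombie) processes inside a slave pod.
# Zombies have process state "Z" in the STAT column.
oc -n <namespace> rsh <slave-pod> ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
~~~~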
Turns out we will need https://github.com/openshift/ocp-build-data/pull/122 to merge before we can start getting slave builds in OSBS/Brew with dumb-init.
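For context, dumb-init is a minimal init that runs as PID 1, forwards signals to its child, and reaps orphaned zombie processes. A minimal sketch of the entrypoint pattern (the exact build change is in the PR above; the binary and script paths match the `ps -ef` output in the verification below):
~~~~
# dumb-init becomes PID 1 and runs the JNLP startup script as its child.
# Orphaned children (e.g. the leftover `sleep 3` from the durable-task
# wrapper) get reparented to dumb-init and reaped when they exit,
# instead of lingering as <defunct>.
/usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client
~~~~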
Looks like we have new images on Brew for this.
Verified with:
- openshift3/jenkins-slave-maven-rhel7:v3.9 (v3.9.82)
- openshift3/jenkins-slave-nodejs-rhel7:v3.9 (v3.9.82)
- openshift3/jenkins-2-rhel7:v3.9 (v3.9.82)

Steps:
1. Create a Jenkins server and maven/nodejs pipeline buildconfigs.
2. Log in to the Jenkins console and set the maven/nodejs pod idle timeout to 30 minutes.
3. Trigger the maven and nodejs pipeline builds.
4. Rsh into each slave pod when the idle time is almost up.

The dumb-init process has cleaned up the defunct processes; no defunct processes exist.
~~~~
maven-c7tnh    1/1    Running    0    27m
nodejs-bgj5c   1/1    Running    0    28m

# oc -n xiu rsh maven-c7tnh
sh-4.2$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
default      1     0  0 05:28 ?        00:00:00 /usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client 47e9f7e9ac1b98d05da6227ee6e0f599080f1ad35ba70881fad70b84c0245096 maven-c7tnh
default      8     1  1 05:28 ?        00:00:16 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -cp /home/jenkins/remoting.jar hudson.re
default    253     0  0 05:51 ?        00:00:00 /bin/sh
default    262   253  0 05:52 ?        00:00:00 ps -ef
sh-4.2$ exit

# oc rsh -n xiu nodejs-bgj5c
sh-4.2$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
default      1     0  0 05:27 ?        00:00:00 /usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client 67e18a79f35d805216d7af10c13a54a29616639caa2830e362f7f0e8e3272051 nodejs-bgj5c
default      8     1  1 05:27 ?        00:00:18 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -cp /home/jenkins/remoting.jar hudson.re
default    416     0  0 05:52 ?        00:00:00 /bin/sh
default    425   416  0 05:52 ?        00:00:00 ps -ef
~~~~
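To re-check for regressions, a small loop like the following can be used; the namespace and pod names here are the ones from this verification run and would differ in another environment:
~~~~
# Count zombie processes in each slave pod (0 is expected with the fix).
for p in maven-c7tnh nodejs-bgj5c; do
  n=$(oc -n xiu rsh "$p" ps -eo stat= | grep -c '^Z')
  echo "$p: $n defunct processes"
done
~~~~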
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1642