Description of problem:
Prevent Jenkins from leaking zombie processes. The Jenkins image is currently producing many zombie processes on different nodes:

1241190+  3872 113792  0 16:38 ?  00:00:00 [sh] <defunct>
1241190+ 14955 113792  0 16:42 ?  00:00:00 [sh] <defunct>
1241190+ 14958 113792  0 16:42 ?  00:00:00 [sleep] <defunct>
1241190+ 33500 113792  0 13:05 ?  00:00:00 [sh] <defunct>
1241190+ 46567 113792  0 13:08 ?  00:00:00 [sleep] <defunct>
1200910+ 52376 124068  0 16:54 ?  00:00:00 [sh] <defunct>
1200910+ 60636 124068  0 16:58 ?  00:00:00 [node] <defunct>
1200910+ 60676 124068  0 16:58 ?  00:00:00 [node] <defunct>
1241190+ 69142 113792  0 14:35 ?  00:00:00 [git-remote-http] <defunct>
1241190+ 84828 113792  0 16:16 ?  00:00:00 [sh] <defunct>
1200910+ 87575 124068  0 17:08 ?  00:00:00 [node] <defunct>
1200910+ 89941 124068  0 17:09 ?  00:00:00 [sh] <defunct>

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Zombie processes seen on the host/node

Expected results:
No zombie processes should be seen

Additional info:
Upstream issue https://github.com/openshift/jenkins/issues/421
PR https://github.com/openshift/jenkins/pull/422 is driving the fix here. Also reported with https://github.com/openshift/jenkins/issues/421 and https://github.com/openshift/release/issues/432
The jenkins repo PR has merged. The centos image should be available on dockerhub within the hour. A rhel image should be available on the internal ci.openshift registry within the hour as well, with the brew-pulp registries presumably updated along their normal cadence (I think at least once a day, but I don't recall precisely). Moving this to MODIFIED accordingly.

Dongbo Yan - given the nature of this change (changing the entry point of the jenkins image to leverage dumb-init, which in turn launches jenkins and monitors it and its child pids), Ben and I want to make sure sufficient regression testing is done. Would it be possible for you to document the specific test cases you plan to go after, so we can then review and iterate over them?

Also, I've cc:ed Steve Kuznetsov from the CI/CD team. They saw this first hand as well. Steve - if you have a moment and think there are any quick pointers for reproducing the zombies that you can provide to QE, please do so.

Thanks
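For anyone reviewing the entry point change, the idea behind an init-style wrapper can be sketched roughly as follows. This is a minimal illustration in Python, not dumb-init's actual code: the function name `run_and_reap` is hypothetical, and signal forwarding (which dumb-init also does) is left out. The wrapper starts the real command as a child and then keeps calling wait, so every exited child is reaped instead of lingering as `<defunct>`.

```python
import os


def run_and_reap(argv):
    """Fork and exec argv, then reap exited children until the main child ends.

    Returns the main child's exit code (128 + signal number if it was killed
    by a signal). Hypothetical helper for illustration only.
    """
    main_pid = os.fork()
    if main_pid == 0:
        # Child: replace this process with the requested command.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # exec failed

    while True:
        # Blocks until any child exits. When this process is PID 1 in the
        # container, orphaned processes are re-parented to it, so they are
        # reaped here too.
        pid, status = os.wait()
        if pid != main_pid:
            continue  # an adopted or secondary child was reaped; keep going
        if os.WIFEXITED(status):
            return os.WEXITSTATUS(status)
        return 128 + os.WTERMSIG(status)
```

With a wrapper like this as PID 1, calling something like `run_and_reap(["/usr/libexec/s2i/run"])` would keep the process table clean even if the Java process never waits on the shells it forks.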
No specific pointers but I would assume any pipeline that makes heavy use of the Jenkins Client Plugin would spawn the requisite `oc` zombies.
gotcha - thanks Steve
Also, Dongbo Yan, any pipeline that leverages `sh '<any linux command>'` essentially does the same thing as the client plugin (forks a process from the jenkins java pid).
Hi Gabe, Ben,

I picked some test cases in https://url.corp.redhat.com/66560cf (internal only); please review, thanks.

I found that the latest centos image includes the change, and I cannot reproduce this bug with the centos image. However, the latest v3.9 rhel image does not include the change.
Hey Dongbo Yan - looks like a nice broad regression to me. It complements the regression test cases in our nightly test runs. And OCP-15384 does capture the best *potential* offender from our existing test cases via the use of the client plugin.

That said, I'd like to offer one new addition, specifically targeted to this scenario. Consider this series of steps:
1) take a look at https://stackoverflow.com/questions/25172425/create-zombie-process and the examples of how to create a simple program that creates zombies
2) build / compile the sample into a command that you place on your jenkins system
3) then create a simple pipeline job that runs that program
4) then verify the zombie is cleaned up via our use of dumb-init

Let me know if that is something you could take on. If not, we can probably still get by, but it would be ideal if we had something like this new test case I mentioned.

And sounds good re: your centos testing. As for where to find the rhel image for testing, yeah, I don't see it on the brew-pulp registry nor registry.reg-aws.openshift.com. They both have the old entrypoint. Can you link up with the CD team to see where a jenkins rhel image with https://github.com/openshift/jenkins/pull/422 would be located?
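In case a compiler is not handy on the jenkins system, the zombie-creating program from steps 1) and 2) could also be sketched in Python. This is only an illustration under assumptions: the helper names (`make_zombie`, `process_state`, `wait_for_state`) are made up for this example, and reading `/proc/<pid>/stat` assumes Linux. The parent forks a child that exits immediately and deliberately does not wait for it, so the child shows up with state `Z` until it is reaped.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only), e.g. 'Z'."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so take the first field after the last ')'.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie():
    """Fork a child that exits at once, without waiting for it.

    Returns the child's PID; the child stays a zombie until it is reaped.
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    return pid


def wait_for_state(pid, wanted, timeout=5.0):
    """Poll until the process reaches the wanted state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) == wanted:
            return True
        time.sleep(0.01)
    return False
```

The zombie remains only while its parent is alive and not calling wait; once the parent exits, the child is handed to whichever process adopts it, and it disappears if that process reaps it.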
Hi Gabe,

I used that simple program to create a zombie process, but the zombie process is cleaned up once the program finishes, whether or not dumb-init is present. So I don't think that program can be used for this test. I would prefer to add step 4 (verify the zombie is cleaned up via our use of dumb-init) to OCP-15384; I can always reproduce the issue with that scenario. What do you think?
Sounds good - thanks for diving into it
actual result:
sh-4.2$ ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /usr/bin/dumb-init -- /usr/libexec/s2i/run
    5 ?        Ssl   12:38 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -Xmx256m
  583 ?        Ss     0:00 /bin/sh
 1123 ?        S      0:00 sh -c { while [ -d '/var/lib/jenkins/jobs/dyan7-sample-pipeline-openshift-client-plugin/workspace@tmp/durable-9a73072a' -a \! -f
 1126 ?        S      0:00 /bin/sh -xe /var/lib/jenkins/jobs/dyan7-sample-pipeline-openshift-client-plugin/workspace@tmp/durable-9a73072a/script.sh
 1127 ?        Sl     0:00 oc --server=https://172.30.0.1:443 --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt --namespace=dyan7
 1190 ?        R+     0:00 ps ax
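If this check needs to be repeated across several runs, spotting zombies in captured `ps` output could be automated along these lines. This is a rough sketch: the function name `zombie_lines` is hypothetical, and it relies only on `ps` marking zombie entries with a trailing `<defunct>`, as in the listing from the original report.

```python
def zombie_lines(ps_output):
    """Return the lines of ps output that describe defunct (zombie) processes."""
    return [
        line.strip()
        for line in ps_output.splitlines()
        if line.rstrip().endswith("<defunct>")
    ]


# Sample lines copied from this report: one from the original description,
# two from the verification run with dumb-init as PID 1.
before = "1241190+ 3872 113792 0 16:38 ? 00:00:00 [sh] <defunct>\n"
after = (
    "1 ? Ss 0:00 /usr/bin/dumb-init -- /usr/libexec/s2i/run\n"
    "1190 ? R+ 0:00 ps ax\n"
)
```

An empty result for the output captured inside the jenkins pod corresponds to the expected result of this bug: no zombie processes.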
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489
Yes, this fix was included in the 3.9 image that was just released. Given the nature of the fix and the changes it entails, there are no plans to backport it to 3.7 at this time.