Bug 1528548 - Prevent jenkins to leak zombie processes
Summary: Prevent jenkins to leak zombie processes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Gabe Montero
QA Contact: Dongbo Yan
URL:
Whiteboard:
Depends On:
Blocks: 1562348
Reported: 2017-12-22 05:47 UTC by Jaspreet Kaur
Modified: 2021-12-10 15:30 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Docker has a known "zombie process" phenomenon that impacted the OpenShift Jenkins image.
Consequence: Operating-system-level resources could be exhausted as these zombie processes accumulated.
Fix: The OpenShift Jenkins image now uses one of the Docker image init implementations to launch Jenkins and to monitor and handle any zombie child processes.
Result: Zombie processes should no longer accumulate.
Clone Of:
Clones: 1562348
Environment:
Last Closed: 2018-03-28 14:16:17 UTC
Target Upstream Version:
Embargoed:



Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2018:0489  0        None      None    None     2018-03-28 14:16:58 UTC

Description Jaspreet Kaur 2017-12-22 05:47:23 UTC
Description of problem: Prevent Jenkins from leaking zombie processes. The Jenkins image is currently producing many zombie processes on different nodes:

1241190+   3872 113792  0 16:38 ?        00:00:00 [sh] <defunct>
1241190+  14955 113792  0 16:42 ?        00:00:00 [sh] <defunct>
1241190+  14958 113792  0 16:42 ?        00:00:00 [sleep] <defunct>
1241190+  33500 113792  0 13:05 ?        00:00:00 [sh] <defunct>
1241190+  46567 113792  0 13:08 ?        00:00:00 [sleep] <defunct>
1200910+  52376 124068  0 16:54 ?        00:00:00 [sh] <defunct>
1200910+  60636 124068  0 16:58 ?        00:00:00 [node] <defunct>
1200910+  60676 124068  0 16:58 ?        00:00:00 [node] <defunct>
1241190+  69142 113792  0 14:35 ?        00:00:00 [git-remote-http] <defunct>
1241190+  84828 113792  0 16:16 ?        00:00:00 [sh] <defunct>
1200910+  87575 124068  0 17:08 ?        00:00:00 [node] <defunct>
1200910+  89941 124068  0 17:09 ?        00:00:00 [sh] <defunct>
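
To see which parent is accumulating the zombies, a listing like the one above can be grouped by parent PID. The following is a minimal Python sketch, assuming the listing is in `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD), which the report does not state explicitly. The function name `defunct_by_parent` is illustrative only.

```python
from collections import Counter

# Three lines copied from the listing above; the column layout is
# assumed to be `ps -ef` (UID PID PPID C STIME TTY TIME CMD).
SAMPLE = """\
1241190+   3872 113792  0 16:38 ?        00:00:00 [sh] <defunct>
1241190+  14958 113792  0 16:42 ?        00:00:00 [sleep] <defunct>
1200910+  60636 124068  0 16:58 ?        00:00:00 [node] <defunct>
"""


def defunct_by_parent(ps_output: str) -> Counter:
    """Count <defunct> entries per parent PID in `ps -ef` style output."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and processes that are not defunct.
        if len(fields) < 8 or fields[-1] != "<defunct>":
            continue
        counts[int(fields[2])] += 1  # third column is the parent PID
    return counts


if __name__ == "__main__":
    for ppid, count in defunct_by_parent(SAMPLE).most_common():
        print(f"parent {ppid}: {count} defunct children")
```

For the three sample lines, parent 113792 has two defunct children and parent 124068 has one.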


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: Zombie processes are seen on the host/node


Expected results: No zombie processes should be seen


Additional info: Upstream issue:

https://github.com/openshift/jenkins/issues/421

Comment 3 Gabe Montero 2018-01-04 16:33:36 UTC
The jenkins repo PR has merged.  The centos image should be available on dockerhub within the hour.  A rhel image should be available on the internal ci.openshift registry within the hour as well, with the brew-pulp registries presumably updated along their normal cadence (I think at least once a day but don't recall precisely).

Going to move to modified as such.

Dongbo Yan - given the nature of this change (changing the entry point of the jenkins image to leverage dumb-init, which in turn launches jenkins and monitors it and its child PIDs), Ben and I want to make sure sufficient regression testing is done.  Would it be possible for you to document the specific test cases you plan to go after, and we can then review and iterate over them?

Also, I've cc:ed Steve Kuznetsov from the CI/CD team.  They saw this first hand as well.  Steve - if you have a moment and think there are any quick pointers for reproducing the zombies that you can provide to QE, please do so.

Thanks

Comment 7 Steve Kuznetsov 2018-01-04 21:04:30 UTC
No specific pointers but I would assume any pipeline that makes heavy use of the Jenkins Client Plugin would spawn the requisite `oc` zombies.

Comment 8 Gabe Montero 2018-01-04 21:22:42 UTC
gotcha - thanks Steve

Comment 9 Gabe Montero 2018-01-04 21:24:05 UTC
Also Dongbo Yan, any pipeline that leverages `sh '<any linux command>'` essentially does the same thing as the client plugin (forks a process from the jenkins java pid).

Comment 11 Dongbo Yan 2018-01-10 09:28:01 UTC
Hi Gabe, Ben,
I picked some test cases in https://url.corp.redhat.com/66560cf (internal only); please review. Thanks.

I found that the latest centos image includes the change, and I cannot reproduce this bug with the centos image.
However, the latest v3.9 rhel image does not include the change.

Comment 12 Gabe Montero 2018-01-10 22:31:25 UTC
Hey Dongbo Yan - looks like a nice broad regression to me.  It complements the regression test cases in our nightly test runs.  And OCP-15384 does capture the best *potential* offender from our existing test cases via the use of the client plugin.

That said, I'd like to offer one new addition, specifically targeted to this scenario.  Consider this series of steps:

1) take a look at https://stackoverflow.com/questions/25172425/create-zombie-process and the examples of how to create a simple program that creates zombies

2) build / compile the sample into a command that you place on your jenkins system

3) then create a simple pipeline job that runs that program

4) then verify the zombie is cleaned up via our use of dumb-init

Let me know if that is something you could take on.  If not, we can probably still get by, but it would be ideal if we had something like this new test case I mentioned. 
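
The idea behind steps 1) through 4) can be sketched in Python as a stand-in for the compiled sample. This is only an illustration under stated assumptions: it is Linux-only because it reads `/proc`, the helper names (`process_state`, `make_zombie`, `wait_for_state`) are hypothetical, and it reaps the child itself with `waitpid` rather than relying on dumb-init, so it shows the mechanism rather than the image's behavior.

```python
import os
import time


def process_state(pid: int):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state field is the first token after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie() -> int:
    """Fork a child that exits at once; the parent deliberately does not wait."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers
    return pid  # parent: the child stays a zombie until someone waits on it


def wait_for_state(pid: int, wanted: str, timeout: float = 5.0) -> bool:
    """Poll until the process reaches the wanted state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) == wanted:
            return True
        time.sleep(0.01)
    return False


if __name__ == "__main__":
    child = make_zombie()
    print("zombie observed:", wait_for_state(child, "Z"))
    os.waitpid(child, 0)  # reaping removes the zombie entry
    print("state after reaping:", process_state(child))
```

In a pipeline test, the reaping step would be left out and the check would instead confirm that the defunct entry disappears from the process table on its own.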

And sounds good re: your centos testing.

As for where to find the rhel image for testing, yeah, I don't see it on the brew-pulp registry nor registry.reg-aws.openshift.com.  They both have the old entrypoint.

Can you link up with the CD team to see where a jenkins rhel image with https://github.com/openshift/jenkins/pull/422 would be located?

Comment 13 Dongbo Yan 2018-01-11 09:30:55 UTC
Hi, Gabe
I have used that simple program to create a zombie process, but the zombie process is killed once the program finishes, whether or not dumb-init is present. So I don't think that program can be used for this test.

I would prefer to add step 4 (verify the zombie is cleaned up via our use of dumb-init) to OCP-15384; I can always reproduce the issue with that scenario.
What do you think?

Comment 14 Gabe Montero 2018-01-11 14:21:33 UTC
Sounds good - thanks for diving into it

Comment 16 Dongbo Yan 2018-01-17 11:59:02 UTC
actual result:
sh-4.2$ ps ax
   PID TTY      STAT   TIME COMMAND
     1 ?        Ss     0:00 /usr/bin/dumb-init -- /usr/libexec/s2i/run
     5 ?        Ssl   12:38 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -Xmx256m
   583 ?        Ss     0:00 /bin/sh
  1123 ?        S      0:00 sh -c { while [ -d '/var/lib/jenkins/jobs/dyan7-sample-pipeline-openshift-client-plugin/workspace@tmp/durable-9a73072a' -a \! -f 
  1126 ?        S      0:00 /bin/sh -xe /var/lib/jenkins/jobs/dyan7-sample-pipeline-openshift-client-plugin/workspace@tmp/durable-9a73072a/script.sh
  1127 ?        Sl     0:00 oc --server=https://172.30.0.1:443 --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt --namespace=dyan7
  1190 ?        R+     0:00 ps ax
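
A check like the one above can be automated by parsing the `ps ax` output. The sketch below is a minimal Python example, assuming the default `ps ax` column order (PID, TTY, STAT, TIME, COMMAND). The function name `check_ps_ax` is illustrative, and the sample uses only the untruncated lines from the output above.

```python
# Untruncated lines copied from the `ps ax` output above.
SAMPLE = """\
   PID TTY      STAT   TIME COMMAND
     1 ?        Ss     0:00 /usr/bin/dumb-init -- /usr/libexec/s2i/run
   583 ?        Ss     0:00 /bin/sh
  1190 ?        R+     0:00 ps ax
"""


def check_ps_ax(ps_output: str, init_name: str = "dumb-init"):
    """Return (pid1_is_init, zombie_pids) for `ps ax` style output."""
    pid1_is_init = False
    zombie_pids = []
    for line in ps_output.splitlines():
        # Split into PID, TTY, STAT, TIME and the rest as COMMAND.
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header, shell prompt, or malformed line
        pid, stat, command = int(fields[0]), fields[2], fields[4]
        if pid == 1 and init_name in command.split()[0]:
            pid1_is_init = True
        if stat.startswith("Z"):  # STAT beginning with Z marks a zombie
            zombie_pids.append(pid)
    return pid1_is_init, zombie_pids


if __name__ == "__main__":
    is_init, zombies = check_ps_ax(SAMPLE)
    print("PID 1 is dumb-init:", is_init)
    print("zombie PIDs:", zombies)
```

For the sample lines, PID 1 is recognized as dumb-init and the zombie list is empty, which matches the result reported in this comment.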

Comment 20 errata-xmlrpc 2018-03-28 14:16:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

Comment 21 Gabe Montero 2018-04-03 14:26:00 UTC
Yes, this fix was included in the 3.9 image that was just released.

Given the nature of the fix and the changes it entails, there are no plans to backport it to 3.7 at this time.

