Bug 1021601

Summary: [performance] Fail to create scalable app when more and more apps are deployed onto a node, and the failure leaves a zombie process
Product: OpenShift Container Platform Reporter: Johnny Liu <jialiu>
Component: Containers Assignee: Brenton Leanhardt <bleanhar>
Status: CLOSED EOL QA Contact: libra bugs <libra-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 2.2.0 CC: jgoulding, libra-onpremise-devel, rthrashe
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-13 22:37:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Johnny Liu 2013-10-21 15:39:26 UTC
Description of problem:
When scalable apps are created one by one, creation fails at the 1962nd app, and every such failure leaves a zombie (defunct) lsof process behind.

[root@node ~]# ps -ef|grep lsof
root      43987  83136 13 03:44 ?        00:01:44 [lsof] <defunct>
root      49196  83136 29 03:51 ?        00:01:48 [lsof] <defunct>
root      53657  80400  0 03:57 pts/2    00:00:00 grep lsof
root      84013  83136  0 Sep24 ?        00:01:48 [lsof] <defunct>
root      88537  83136  0 Sep24 ?        00:01:43 [lsof] <defunct>
root      93748  83136  0 Sep24 ?        00:01:51 [lsof] <defunct>
root      98268  83136  0 Sep24 ?        00:01:56 [lsof] <defunct>
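
The listing above can be summarized per parent process. The following is a minimal Ruby sketch, not part of the original report, that groups defunct entries from `ps -ef` text by parent PID. The helper name `zombies_by_parent` is hypothetical, and the sketch assumes the standard eight-column `ps -ef` layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD).

```ruby
# Hypothetical helper: group defunct (zombie) PIDs by their parent PID.
# Assumes `ps -ef` style input with eight whitespace-separated columns.
def zombies_by_parent(ps_output)
  result = Hash.new { |hash, key| hash[key] = [] }
  ps_output.each_line do |line|
    # Split into at most 8 fields so the CMD column keeps its spaces.
    fields = line.split(nil, 8)
    next unless fields.length == 8 && fields[7].include?('<defunct>')
    ppid = fields[2].to_i
    pid = fields[1].to_i
    result[ppid] << pid
  end
  result
end

sample = <<PS
root      43987  83136 13 03:44 ?        00:01:44 [lsof] <defunct>
root      49196  83136 29 03:51 ?        00:01:48 [lsof] <defunct>
root      53657  80400  0 03:57 pts/2    00:00:00 grep lsof
PS

zombies_by_parent(sample).each do |ppid, pids|
  puts "parent #{ppid}: #{pids.length} zombie(s) #{pids.inspect}"
end
```

In the listing above every zombie has parent PID 83136, which appears to match the process ID shown as `#83136` in the mcollective log lines further down.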


Version-Release number of selected component (if applicable):
1.2/2013-08-23.3

How reproducible:
Always

Steps to Reproduce:
1. Create scalable apps one by one on the same node.

Actual results:
Creating the 1962nd app failed.


The following error is seen in the mcollective log:
E, [2013-09-25T03:55:19.766496 #83136] ERROR -- : openshift.rb:171:in `rescue in with_container_from_args' /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/shell_exec.rb:161:in `select'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/shell_exec.rb:161:in `block in read_results'
/opt/rh/ruby193/root/usr/share/ruby/timeout.rb:69:in `timeout'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/shell_exec.rb:159:in `read_results'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/shell_exec.rb:124:in `block (2 levels) in oo_spawn'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/shell_exec.rb:93:in `pipe'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/shell_exec.rb:93:in `block in oo_spawn'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/shell_exec.rb:92:in `pipe'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/shell_exec.rb:92:in `oo_spawn'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/model/v2_cart_model.rb:763:in `addresses_bound?'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/model/v2_cart_model.rb:703:in `create_private_endpoints'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/model/v2_cart_model.rb:251:in `block in configure'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/utils/cgroups.rb:70:in `with_no_cpu_limits'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/model/v2_cart_model.rb:237:in `configure'
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.9.14.3/lib/openshift-origin-node/model/application_container.rb:97:in `configure'
/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:586:in `block in oo_configure'
/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:166:in `with_container_from_args'
/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:585:in `oo_configure'
/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:93:in `execute_action'
/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.rb:65:in `cartridge_do_action'
/opt/rh/ruby193/root/usr/share/ruby/mcollective/rpc/agent.rb:86:in `handlemsg'
/opt/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb:126:in `block (2 levels) in dispatch'
/opt/rh/ruby193/root/usr/share/ruby/timeout.rb:69:in `timeout'
/opt/rh/ruby193/root/usr/share/ruby/mcollective/agents.rb:125:in `block in dispatch'
I, [2013-09-25T03:55:19.766735 #83136]  INFO -- : openshift.rb:100:in `execute_action' Finished executing action [configure] (-1)
I, [2013-09-25T03:55:19.766917 #83136]  INFO -- : openshift.rb:73:in `cartridge_do_action' cartridge_do_action failed (-1)
------
execution expired
------)

After some debugging of the code, I found the following:
As the number of processes on the node grows, lsof takes longer and longer to verify that the listening port is available. Eventually the check exceeds the timeout, the configure action fails, and the timed-out lsof is left behind as a zombie process.
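
To illustrate the zombie half of the problem, here is a minimal Ruby sketch of running an external command under a timeout while making sure the child is always reaped. It is not the openshift-origin-node implementation of `oo_spawn`. The helper name `run_with_timeout` and its return values are hypothetical. The point is only that a child which is killed (or abandoned) after a timeout stays in the process table as `<defunct>` until the parent calls `Process.wait` on it.

```ruby
require 'timeout'

# Hypothetical helper, not the openshift-origin-node API.
# Runs cmd (an array of program and arguments), captures stdout, and
# gives up after `limit` seconds. The child is reaped in every code path.
def run_with_timeout(cmd, limit)
  reader, writer = IO.pipe
  pid = Process.spawn(*cmd, out: writer, err: File::NULL)
  writer.close # the parent keeps only the read end
  begin
    output = Timeout.timeout(limit) { reader.read }
    [:ok, output]
  rescue Timeout::Error
    begin
      Process.kill('KILL', pid) # stop the slow child
    rescue Errno::ESRCH
      # the child already exited on its own
    end
    [:timeout, nil]
  ensure
    reader.close
    # Reaping here is what prevents a <defunct> entry from being left behind.
    Process.wait(pid)
  end
end

status, out = run_with_timeout(['echo', 'hello'], 5)
puts "#{status}: #{out.inspect}"

status, _ = run_with_timeout(['sleep', '30'], 0.5)
puts status
```

This sketch only addresses the leftover zombie. It does nothing about the underlying slowdown of the port check as the process count grows, which the comments below treat as a separate issue. Note also that on the success path the final `Process.wait` is not covered by the timeout, so a child that closes stdout but keeps running would still block the caller.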

Expected results:
No error is seen, and no zombie process is generated.

Additional info:

Comment 2 Andy Grimm 2013-12-18 16:31:37 UTC
I suspect this bug is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1018009

Comment 3 Andy Grimm 2013-12-18 16:32:42 UTC
sorry, I should have said the reason for the _zombie_ processes is the same as the bug mentioned in the previous comment.  The overall bug here is a separate issue.

Comment 4 Johnny Liu 2013-12-19 02:09:37 UTC
(In reply to Andy Grimm from comment #3)
> sorry, I should have said the reason for the _zombie_ processes is the same
> as the bug mentioned in the previous comment.  The overall bug here is a
> separate issue.

Yeah, agreed. This bug is mainly used to track the ose-1.2 issue.

QE has already opened BZ#1044432 to track the ose-2.0 issue. I guess you have already noticed that bug.

Comment 5 Rory Thrasher 2017-01-13 22:37:24 UTC
OpenShift Enterprise v2 has officially reached EoL.  This product is no longer supported and bugs will be closed.

Please look into the replacement enterprise-grade container option, OpenShift Container Platform v3.  https://www.openshift.com/container-platform/

More information can be found here: https://access.redhat.com/support/policy/updates/openshift/