Bug 1044831

Summary: [performance] Failed to start all apps when restarting a node with a large number of apps
Product: OpenShift Container Platform Reporter: Gaoyun Pei <gpei>
Component: Containers Assignee: Brenton Leanhardt <bleanhar>
Status: CLOSED EOL QA Contact: libra bugs <libra-bugs>
Severity: low Docs Contact:
Priority: low    
Version: 2.2.0 CC: anli, libra-onpremise-devel, rthrashe
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-13 22:45:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Gaoyun Pei 2013-12-19 05:02:47 UTC
Description of problem:
After creating 1755 scalable PHP apps on a node and rebooting it, not all of the apps started correctly. Some of them are unavailable.

Version-Release number of selected component (if applicable):
2.0/2013-11-26.1
(Before rebooting the node, the package rubygem-openshift-origin-node was updated to the latest version: rubygem-openshift-origin-node-1.17.5-3.el6op.noarch.rpm)

How reproducible:
always

Steps to Reproduce:
1. Create as many scalable PHP apps as possible on a node, then reboot the node.

In the end, 1600 apps were available after the node restarted, which means 155 apps failed to start.

The log shows 68 errors when starting apps. The errors in /var/log/openshift/node/platform.log look like this:
...
December 18 18:41:19 INFO Shell command '/sbin/runuser -s /bin/sh 5295dacf78e213a7d600247e -c "exec /usr/bin/runcon 'unconfined_u:system_r:openshift_t:s0:c1,c187' /bin/sh -c \"set -e; /var/lib/openshift/5295dacf78e213a7d600247e/haproxy/bin/control start \""' ran. rc=0 out=/opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application.rb:330:in `kill': Operation not permitted (Errno::EPERM)
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application.rb:330:in `stop'
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application_group.rb:135:in `block in stop_all'
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application_group.rb:131:in `each'
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/application_group.rb:131:in `stop_all'
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/controller.rb:74:in `run'
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons.rb:139:in `block in run'
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/cmdline.rb:105:in `call'
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons/cmdline.rb:105:in `catch_exceptions'
        from /opt/rh/ruby193/root/usr/share/gems/gems/daemons-1.0.10/lib/daemons.rb:138:in `run'
        from /var/lib/openshift/5295dacf78e213a7d600247e/haproxy/usr/bin/haproxy_ctld_daemon.rb:21:in `<main>'
ERROR: there is already one or more instance(s) of the program running
HAProxy instance is started
...
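
To see which apps hit this error, the log can be scanned for the EPERM message on "control start" lines. The following is only a rough sketch: the function names are hypothetical, and the assumption that each failure appears as a single line containing both the app's /var/lib/openshift/<id>/haproxy/bin/control path and the Errno::EPERM text is based solely on the excerpt above.

```python
import re
from collections import Counter

# Both patterns are copied from the log excerpt in this report. Treating them
# as the signature of a failed start is an assumption of this sketch.
GEAR_RE = re.compile(r"/var/lib/openshift/([0-9a-f]{24})/haproxy/bin/control start")
EPERM_MARKER = "Operation not permitted (Errno::EPERM)"


def gears_with_eperm(lines):
    """Count, per 24-character ID, the 'control start' lines that report EPERM."""
    hits = Counter()
    for line in lines:
        if EPERM_MARKER not in line:
            continue
        match = GEAR_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


def scan_platform_log(path="/var/log/openshift/node/platform.log"):
    """Apply gears_with_eperm to the node log named in this report."""
    with open(path, errors="replace") as handle:
        return gears_with_eperm(handle)
```

Calling len(scan_platform_log()) on the node would give the number of distinct apps with at least one such error, which could then be compared against the list of unavailable apps.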


Logging into some of the unavailable apps and checking their processes shows the following.
Some of them have no httpd process:
[app1722-name1722.scalability.com 52af1c0d78e213ff39000b04]\> ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
2721     21972     1  0 10:03 ?        00:00:06 haproxy_ctld.rb                                                                              
2721     24602     1  0 Dec18 ?        00:00:07 /usr/sbin/haproxy -f /var/lib/openshift/52af1c0d78e213ff39000b04/haproxy//conf/haproxy.cfg
2721     29035 29015  1 12:54 ?        00:00:00 sshd: 52af1c0d78e213ff39000b04@pts/5
2721     29038 29035  5 12:54 pts/5    00:00:00 /bin/bash --init-file /usr/bin/rhcsh -i
2721     29352 29038  0 12:54 pts/5    00:00:01 ps -ef


Others seem to be stuck in the "start" process:
[app1709-name1709.scalability.com 52aeead378e213ff39000837]\> ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
2708      8247     1  0 Dec18 ?        00:00:00 /bin/bash -e /var/lib/openshift/52aeead378e213ff39000837/haproxy/bin/control start
2708     27363 27326  0 12:53 ?        00:00:00 sshd: 52aeead378e213ff39000837@pts/5
2708     27366 27363  6 12:53 pts/5    00:00:00 /bin/bash --init-file /usr/bin/rhcsh -i
2708     27686 27366  0 12:53 pts/5    00:00:01 ps -ef
2708     27687  8247  0 12:53 ?        00:00:00 /bin/bash -e /var/lib/openshift/52aeead378e213ff39000837/haproxy/bin/control start
2708     27688 27687  0 12:53 ?        00:00:00 /bin/bash -e /var/lib/openshift/52aeead378e213ff39000837/haproxy/bin/control start
2708     27689 27687  0 12:53 ?        00:00:00 cut -f 3 -d ,
2708     27690 27688  0 12:53 ?        00:00:00 scl enable ruby193 ruby /usr/bin/oo-gear-registry web 
2708     27692 27690  0 12:53 ?        00:00:00 /bin/bash /var/tmp/sclstWYWd
2708     27703 27692  0 12:53 ?        00:00:00 ruby /usr/bin/oo-gear-registry web
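
The two failure patterns above could be told apart automatically from captured `ps -ef` output. This is a minimal sketch, not part of any OpenShift tooling: the function name and the three labels are hypothetical, and the rules (a lingering "haproxy/bin/control start" command means stuck, no command containing "httpd" means the web server is missing) are assumptions drawn only from the two listings in this report.

```python
def classify_gear(ps_output):
    """Classify an app from the text of `ps -ef` run inside it."""
    commands = []
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0] == "UID":
            continue  # skip the header row and anything that is not a process line
        commands.append(fields[7].strip())

    if any("haproxy/bin/control start" in cmd for cmd in commands):
        return "stuck_in_start"
    if not any("httpd" in cmd for cmd in commands):
        return "no_httpd"
    return "looks_ok"
```

Applied to the two listings above, the first would be labelled "no_httpd" and the second "stuck_in_start".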


After restarting the unavailable apps, they return to normal and become available again.


Actual results:

Expected results:
All the apps should be started after the node reboots.

Additional info:

Comment 4 Rory Thrasher 2017-01-13 22:45:36 UTC
OpenShift Enterprise v2 has officially reached EoL.  This product is no longer supported and bugs will be closed.

Please look into the replacement enterprise-grade container option, OpenShift Container Platform v3.  https://www.openshift.com/container-platform/

More information can be found here: https://access.redhat.com/support/policy/updates/openshift/