Bug 1077353

Summary: multiple nodejs processes running in a gear
Product: OpenShift Online
Reporter: Andy Grimm <agrimm>
Component: Image
Assignee: Ben Parees <bparees>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 2.x
CC: agrimm, anli, bparees, dmcphers, jgoulding, mfojtik, wzheng
Target Milestone: ---
Keywords: UpcomingRelease
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2014-10-10 00:46:58 UTC
Type: Bug
Bug Blocks: 1116817

Description Andy Grimm 2014-03-17 19:27:24 UTC
Description of problem:

I saw three cases today where a gear had multiple nodejs supervisor processes running.  In each case the second instance's child process kept dying because it could not bind to port 8080.  The second instance kept retrying, consuming the gear's entire CPU quota.
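
For illustration only, here is a minimal Python sketch of the failure mode described above: once one process is listening on a port, a second attempt to bind the same port fails with EADDRINUSE. The function names are hypothetical, and an OS-assigned port stands in for 8080 so the sketch does not collide with a real service.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to listen on host:port. Return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


def second_bind_errno():
    """Bind a port once, then try again the way a duplicate instance would."""
    first, _ = try_bind(0)  # port 0: let the OS pick a free port
    try:
        port = first.getsockname()[1]
        second, err = try_bind(port)
        if second is not None:
            second.close()
        return err
    finally:
        first.close()


if __name__ == "__main__":
    print(second_bind_errno() == errno.EADDRINUSE)
```

A supervisor that restarts its child immediately after each such failure would loop indefinitely, which is consistent with the CPU usage reported here.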

Version-Release number of selected component (if applicable):

openshift-origin-cartridge-nodejs-1.22.4-1.el6oso.noarch

Comment 1 Michal Fojtik 2014-03-17 20:25:24 UTC
Andy: Do you have more details? Do the apps use hot_deploy?

Comment 2 Michal Fojtik 2014-04-11 10:40:26 UTC
Andy, ping? ;-)

Comment 4 Andy Grimm 2014-04-16 19:40:59 UTC
It looks like two of the apps where I'm currently seeing this got unidled twice concurrently.  It's not clear what happened with the third; it was started at 19:40:44 and restarted at 19:41:49.  Maybe the first set of processes didn't die?  

The upcoming fix for BZ 1061926 may fix at least two of these three occurrences.
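
As a general illustration of how two concurrent start requests can be serialized, here is a minimal Python sketch using an exclusive, non-blocking file lock. This is not claimed to be what the fix for BZ 1061926 does; the function name and lock path are hypothetical.

```python
import fcntl
import os


def try_start_lock(lock_path):
    """Try to take an exclusive lock before starting the app.

    Returns an open file descriptor holding the lock, or None if another
    start is already in progress. Closing the descriptor releases the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A caller would start the supervisor only when a descriptor is returned, and close it once the pid file has been written.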

Comment 5 Ben Parees 2014-06-27 20:29:46 UTC
It looks like this could happen if someone removed the pid file and then issued a restart (the cart logic will just start another instance if the pidfile is not found).

A number of our carts share this logic, but nodejs may be the only one that auto-restarts due to the bind failure.

I will look into making the "is started" checking more robust.
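
To make the weakness concrete, here is a minimal Python sketch of an "is started" check that trusts the pid file alone. It is an assumption about the general shape of such a check, not the cartridge's actual code, and the function name is hypothetical.

```python
import os


def is_running_pidfile_only(pidfile):
    """Naive check: report 'running' only if the pid file names a live process."""
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        # A missing or unreadable pid file is treated as "not running",
        # even if the process itself is still alive.
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

With this logic, deleting the pid file makes a live process look stopped, so a restart launches a second instance alongside the first.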

Comment 6 Ben Parees 2014-07-01 14:48:35 UTC
Adding logic to recreate the pid file if it does not exist, prior to checking if the process is started.

https://github.com/openshift/origin-server/pull/5562
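
The idea can be sketched in Python as follows. This is not the code from the pull request; it is a rough illustration that assumes a Linux /proc filesystem and a supervisor process that can be recognized by a substring of its command line. The function names and the proc_root parameter are hypothetical.

```python
import os


def find_pid_by_cmdline(marker, uid, proc_root="/proc"):
    """Return the pid of a process owned by uid whose command line contains marker."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        proc_dir = os.path.join(proc_root, entry)
        try:
            if os.stat(proc_dir).st_uid != uid:
                continue
            with open(os.path.join(proc_dir, "cmdline"), "rb") as fh:
                # Arguments in /proc/<pid>/cmdline are NUL-separated.
                cmdline = fh.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # the process exited while we were scanning
        if marker in cmdline:
            return int(entry)
    return None


def ensure_pidfile(pidfile, marker, proc_root="/proc"):
    """Recreate a missing pid file from the process table.

    Returns True if a pid file exists after the call, False otherwise.
    """
    if os.path.exists(pidfile):
        return True
    pid = find_pid_by_cmdline(marker, os.getuid(), proc_root)
    if pid is None:
        return False
    with open(pidfile, "w") as fh:
        fh.write(f"{pid}\n")
    return True
```

Calling ensure_pidfile before the usual "is started" check means a deleted pid file no longer makes a live supervisor look stopped.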

Comment 7 Wenjing Zheng 2014-07-02 06:53:04 UTC
Verified on devenv_4932: there are no longer multiple nodejs processes, as shown below:

1. Create a nodejs-0.10 app
2. SSH into gear, delete the cartridge.pid file under $OPENSHIFT_NODEJS_PID_DIR and check the process:
[n10-d.dev.rhcloud.com 53b3df4040b38ce446000001]> ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
1000      7138     1  0 06:30 ?        00:00:00 node /opt/rh/nodejs010/root/usr
1000      7139     1  0 06:30 ?        00:00:00 /usr/bin/logshifter -tag nodejs
1000      7158  7138  0 06:30 ?        00:00:00 node server.js
1000      9028  9015  0 06:34 ?        00:00:00 sshd: 53b3df4040b38ce446000001@
1000      9029  9028  1 06:34 pts/2    00:00:00 /bin/bash --init-file /usr/bin/
1000      9252  9029  0 06:34 pts/2    00:00:00 ps -ef
3. Restart the gear and re-check the processes:
[n10-d.dev.rhcloud.com 53b3df4040b38ce446000001]> ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
1000     11887     1  0 06:39 ?        00:00:00 node /opt/rh/nodejs010/root/usr/bin/supervisor
1000     11888     1  0 06:39 ?        00:00:00 /usr/bin/logshifter -tag nodejs
1000     11914 11887  0 06:39 ?        00:00:00 node server.js
1000     12017 12004  0 06:39 ?        00:00:00 sshd: 53b3df4040b38ce446000001@pts/2
1000     12018 12017  3 06:39 pts/2    00:00:00 /bin/bash --init-file /usr/bin/rhcsh -i
1000     12230 12018  0 06:39 pts/2    00:00:00 ps -ef