Description of problem:
Yesterday, Tim Kramer discovered that the CPU usage of node-web-proxy on an OpenShift Online node was much higher than normal. Using strace, we found that the process was making thousands of accept() calls per second, all of which were failing because the process was out of file descriptors. I was able to work around the problem by adjusting the process's limit:

  echo -n "Max open files=4096:4096" > /proc/88047/limits

at which point it quickly handled the existing connections and settled down to only about 15 open files.

Version-Release number of selected component (if applicable):
openshift-origin-node-proxy-1.16.1-1.el6oso.noarch

How reproducible:
We have not yet attempted to reproduce this. I suspect that if you set the ulimit to an artificially low number like 32, you could reproduce it with a fairly small number of concurrent connections through the proxy.

Actual result:
It seems that the system was in a state where it was not servicing existing connections at all, yet was still trying to accept new ones.

Expected result:
I would expect that when the process runs out of file descriptors, it should still be able to service existing connections (or error out and close them) and simply reject incoming connections until enough file descriptors are closed to handle new ones.
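The reproduction idea and the expected result above can be sketched outside the proxy. The following is a minimal Python stand-in, not node-web-proxy code: the function name `exhaust_and_probe` and its parameters are hypothetical, and it assumes a Unix host with a usable loopback interface. It queues a few local connections, lowers the soft RLIMIT_NOFILE so that only a handful of descriptors remain, and then shows that accept() starts failing with EMFILE while an already accepted connection can still be serviced.

```python
import errno
import resource
import socket


def exhaust_and_probe(num_clients=6, spare_fds=3):
    """Hypothetical helper: run accept() into EMFILE under a lowered fd limit.

    Returns a dict with the number of accepted connections, the errno name
    from the first failing accept(), and whether an existing connection
    could still carry data afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    clients, accepted = [], []
    result = {"accepted": 0, "errno": None, "served_existing": False}
    try:
        listener.bind(("127.0.0.1", 0))
        listener.listen(16)
        listener.settimeout(2)
        # Queue connections in the listen backlog before tightening the limit.
        for _ in range(num_clients):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            clients.append(client)
            client.settimeout(2)
            client.connect(listener.getsockname())
        # Leave room for at most `spare_fds` new descriptors.
        highest = max(s.fileno() for s in clients + [listener])
        resource.setrlimit(resource.RLIMIT_NOFILE, (highest + 1 + spare_fds, hard))
        try:
            for _ in range(num_clients):
                conn, _addr = listener.accept()
                accepted.append(conn)
        except OSError as exc:
            # This is the failure the strace output showed over and over.
            result["errno"] = errno.errorcode.get(exc.errno, str(exc.errno))
        result["accepted"] = len(accepted)
        if accepted:
            # Servicing an existing connection needs no new descriptor.
            conn = accepted[0]
            peer = next(c for c in clients if c.getsockname() == conn.getpeername())
            conn.sendall(b"ping")
            result["served_existing"] = peer.recv(4) == b"ping"
    finally:
        # Restore the original limit, then release every socket.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
        for s in accepted + clients + [listener]:
            s.close()
    return result


if __name__ == "__main__":
    print(exhaust_and_probe())
```

This only illustrates the operating-system behaviour: the pending connections stay in the listen backlog, so a server that retries accept() immediately will keep getting EMFILE until something is closed. How node-web-proxy should back off is a separate design question that this sketch does not answer.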
Possibly related to this, I found that node-web-proxy is not closing down some connections where the client has disconnected. I have several nodes with more than 100 sockets in CLOSE_WAIT state, and they don't ever appear to go away. One node has 605 such connections.
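For tracking the leak across nodes, a small script along these lines could count CLOSE_WAIT sockets per local port. It is a sketch under assumptions: the function name `close_wait_by_port` is hypothetical, the sample rows are made up for illustration (they are not data from the affected nodes), and it relies on the Linux /proc/net/tcp layout, where the fourth column holds the socket state and 08 means CLOSE_WAIT.

```python
from collections import Counter

CLOSE_WAIT = "08"  # state code for CLOSE_WAIT in /proc/net/tcp and /proc/net/tcp6

# Made-up sample in /proc/net/tcp format, for illustration only.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F40 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001
   1: 0100007F:1F40 0100007F:C350 08 00000000:00000000 00:00000000 00000000     0        0 1002
   2: 0100007F:1F40 0100007F:C351 08 00000000:00000000 00:00000000 00000000     0        0 1003
   3: 0100007F:1F40 0100007F:C352 01 00000000:00000000 00:00000000 00000000     0        0 1004
"""


def close_wait_by_port(proc_net_tcp_text):
    """Hypothetical helper: count CLOSE_WAIT sockets per local port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "0:"; skip the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        if fields[3] != CLOSE_WAIT:
            continue
        # The local address is HEXADDR:HEXPORT; the port is after the last colon.
        port_hex = fields[1].rsplit(":", 1)[1]
        counts[int(port_hex, 16)] += 1
    return counts


if __name__ == "__main__":
    print(dict(close_wait_by_port(SAMPLE)))
```

On a real node the same function could be fed the contents of /proc/net/tcp and /proc/net/tcp6 instead of the sample text, and the per-port totals compared over time to see whether the count only ever grows.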
This is causing outages roughly every other week in Online (at least for cloud9, possibly others).
So, I found this: https://github.com/einaros/ws/issues/180 which seems related to the file descriptor leak, and this: https://github.com/joyent/node/issues/5504 which seems related to the high CPU utilization (which we now see at times independent of hitting the fd limit).
Moving this to software collections. We're actually seeing suspiciously similar behavior in both the OpenShift code that uses node.js and in our users' node.js-based apps. All are currently using nodejs010-nodejs-0.10.5-6.el6.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0620.html