Description of problem:
When determining if a gear's expected processes are running, Watchman doesn't filter logshifter or haproxy processes.

Version-Release number of selected component (if applicable):

How reproducible:
every time

logshifter issue
Steps to Reproduce:
1. create javaews application
2. kill -9 the java process
3. ps ax --format 'uid,pid=,ppid=,ucmd=' |grep <gear uid>
4. logshifter will now have ppid == 1

haproxy issue
Steps to Reproduce:
1. create scaled javaews application on one gear
2. kill -9 the java process
3. ps ax --format 'uid,pid=,ppid=,ucmd=' |grep <gear uid>
4. haproxy will now have ppid == 1

Actual results:
Watchman never restarts gear

Expected results:
Watchman restarts gear

Additional info:
WIP: https://github.com/openshift/origin-server/pull/5814
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/7a12cdc7f0b943ff39617c430810cf47b1a73c83
Watchman filters out haproxy/logshifter

Make Watchman's GearStatePlugin filter out haproxy and logshifter related processes so it can better determine when to start a gear whose processes have died.

Bug 1133629
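For readers who want the idea in runnable form, here is a rough sketch of the kind of filtering the commit message describes. It is not the GearStatePlugin code (which is Ruby and is not quoted in this report). The names `IGNORED_COMMANDS`, `parse_ps`, `application_processes`, and `needs_restart` are hypothetical, and the ignore list is an assumption based only on the two process names mentioned above.

```python
# Hypothetical sketch, NOT the actual Watchman implementation.
# Assumption: helper processes are identified by command name alone.
IGNORED_COMMANDS = {"haproxy", "logshifter"}


def parse_ps(output):
    """Parse output of `ps ax --format 'uid,pid=,ppid=,ucmd='` into dicts."""
    procs = []
    for line in output.splitlines():
        fields = line.split(None, 3)
        # Skip the header line and anything that is not uid/pid/ppid/command.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        uid, pid, ppid, cmd = fields
        procs.append(
            {"uid": int(uid), "pid": int(pid), "ppid": int(ppid), "cmd": cmd.strip()}
        )
    return procs


def application_processes(procs, gear_uid, ignored=IGNORED_COMMANDS):
    """Return the gear's processes that are not helper processes."""
    return [p for p in procs if p["uid"] == gear_uid and p["cmd"] not in ignored]


def needs_restart(procs, gear_uid):
    """True when only ignored helper processes (or nothing) remain for the gear."""
    return not application_processes(procs, gear_uid)
```

With a process list that contains only logshifter and haproxy for a gear uid, `needs_restart` returns True; once any other process owned by that uid is present, it returns False.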
Checked on devenv_5175. Watchman now restarts a gear that has only haproxy and logshifter processes running.

# ps ax --format 'uid,pid=,ppid=,args=' | grep 1002
1002 5157 1 /usr/bin/logshifter -tag haproxy
1002 5158 1 /usr/sbin/haproxy -f /var/lib/openshift/541ae22e53eb02da6a000233/haproxy//conf/haproxy.cfg
1002 5159 1 bash /var/lib/openshift/541ae22e53eb02da6a000233/haproxy/usr/bin/haproxy_ctld
1002 5160 1 /usr/bin/logshifter -tag haproxy_ctld
1002 5167 5159 ruby /var/lib/openshift/541ae22e53eb02da6a000233/haproxy/usr/bin/haproxy_ctld.rb
1002 9919 9905 sshd: 541ae22e53eb02da6a000233@pts/7
1002 9921 9919 /bin/bash --init-file /usr/bin/rhcsh -i
0 10256 23809 grep 1002

[root@ip-10-51-165-140 ~]# tailf /var/log/messages
Sep 18 10:04:42 ip-10-51-165-140 kernel: docker0: port 5(veth51e6) entering forwarding state
Sep 18 10:04:48 ip-10-51-165-140 watchman[4062]: watchman restarted user 541ae22e53eb02da6a000233: application jbews2s (retries: 0)
Sep 18 10:05:06 ip-10-51-165-140 root[12359]: user-cron-jobs :START: minutely run of all scheduled jobs
I see that this code is already pushed into the product, but this fixes only a very specific case of a much more general problem. The gear state plugin has no knowledge of whether a gear is running the "right" processes given the cartridges it contains. A real solution to this would have to be a more rigorous check, where a cartridge manifest would describe what processes should be present (e.g., at least two httpd worker processes, one Rack process, whatever), and the gear state plugin would parse and verify that specification for each cartridge in a gear. That's expensive, but it's the only way to verify that a running gear is really running. That's clearly an RFE, not a bug, but it seemed worth mentioning here.
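To make the RFE idea a little more concrete, here is a minimal sketch of such a check. Everything in it is an assumption: no cartridge manifest format for expected processes exists in this report, so the `spec` layout, the `unmet_requirements` function, and the command names "httpd" and "rack" are purely illustrative.

```python
from collections import Counter


def unmet_requirements(spec, running_commands):
    """Compare running command names against a per-cartridge specification.

    spec is a hypothetical structure, e.g.
        {"ruby": [{"command": "httpd", "min": 2}, {"command": "rack", "min": 1}]}
    Returns a list of (cartridge, command, required, found) tuples for every
    requirement that is not satisfied.
    """
    counts = Counter(running_commands)
    missing = []
    for cartridge, requirements in spec.items():
        for req in requirements:
            found = counts[req["command"]]  # Counter yields 0 for absent names
            if found < req["min"]:
                missing.append((cartridge, req["command"], req["min"], found))
    return missing
```

A gear would count as "really running" only when this list is empty for all of its cartridges. The per-gear cost mentioned above comes from having to load each cartridge's specification and match it against the process table on every check.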