When the environment is under high load and openshift-origin-frontend-apache-vhost is in use, there is a race condition: an httpd graceful restart can be called while the vhost configs are being modified, which leaves the httpd.worker process in a bad state. We followed up with the httpd team and they traced it down to effectively this:

[Information provided by the httpd dev team]

1. OpenShift edits the httpd configuration and calls "httpd.worker -k graceful" to gracefully restart httpd and load the new configuration. But for some reason, the configuration supplied to httpd by OpenShift at the time "httpd -k graceful" is executed is broken. You can see this from the httpd error_log:

> [Tue Sep 23 23:48:02 2014] [notice] SIGUSR1 received. Doing graceful restart
> httpd.worker: Syntax error on line 222 of /etc/httpd/conf/httpd.conf: Syntax
> error on line 75 of /etc/httpd/conf.d/000000_default.conf: Could not open
> configuration file /etc/httpd/conf.d/openshift/54223d5203ef640de1000981_nagiosmonitor_0_chkexsrv1.conf: No such file or directory

Note that this is also the first date/time at which the following message appears in /var/log/messages:

> Sep 23 23:48:04 ex-std-node3 root: httpd -k graceful already running, perhaps force restart httpd

2. After the syntax error during the graceful restart, httpd stops itself, so no "httpd.worker" process exists on the system. This is expected behaviour when a syntax error is caused by invalid configuration files.

3. The next execution of "httpd.worker -k graceful" finds that no httpd process exists and starts a new one. This is expected behaviour. This httpd process handles requests normally, and you can see it in ps output as "httpd.worker -k graceful".
Potential fix: https://github.com/openshift/origin-server/pull/5842

A fork AMI for the above is being built at: https://ci.dev.openshift.redhat.com/jenkins/job/fork_ami/1256/

For QE, what to test: the application creation rate should not drop drastically because of the use of a common lock file.
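To illustrate the idea of a common lock file, here is a minimal sketch, not the code from the pull request. It assumes that both the config writer and the graceful restart take the same exclusive lock, so a restart can never read a half-updated conf.d directory. The names frontend_lock, write_vhost_config, graceful_restart, the lock path, and reload_fn (a stand-in for invoking "httpd.worker -k graceful") are all hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def frontend_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def write_vhost_config(conf_dir, name, body, lock_path):
    """Write one vhost config file while holding the common lock (hypothetical helper)."""
    with frontend_lock(lock_path):
        path = os.path.join(conf_dir, name)
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            f.write(body)
        os.rename(tmp, path)  # the file appears under its final name in one step
        return path


def graceful_restart(conf_dir, lock_path, reload_fn):
    """Run the reload while holding the same lock, so no config edit can interleave.

    reload_fn is a hypothetical stand-in for running "httpd.worker -k graceful".
    """
    with frontend_lock(lock_path):
        return reload_fn(sorted(os.listdir(conf_dir)))
```

Because every writer and every restart contends for one lock, this serializes all frontend operations on the node, which is why the app creation rate is the thing to watch during testing.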
I tested the fork_ami with 5 parallel jobs creating and deleting apps on an m3.medium instance. All creations and deletions succeeded, and all the apps can be accessed. I also ran acceptance testing on the fork_ami and found no regression issues. Moving the bug to verified.
The pull request has been updated: https://github.com/openshift/origin-server/pull/5842 Both versions of the files have now been updated. Please re-test performance and regression.
Tested on devenv_5200: no regression issues found, and app creation completed without many failures. Verifying the bug.
*** Bug 1148418 has been marked as a duplicate of this bug. ***
*** Bug 1145982 has been marked as a duplicate of this bug. ***