Description of problem:
(Opening this on request of Alex Mayberry.)
Following the instructions to install CloudForms on OpenShift and applying cfme-template.yaml sometimes leaves the httpd pod unable to start.
In such cases the httpd service inside the pod fails to start:
sh-4.2# systemctl status httpd
● httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: inactive (dead)
Condition: start condition failed at Thu 2018-10-25 16:51:17 UTC; 44s ago
ConditionPathExists=/etc/container-environment was not met
Yet, when we look, the file is there:
sh-4.2# ls /etc/container-environment -l
-rw-r--r--. 1 root root 854 Oct 25 16:51 /etc/container-environment
(It's difficult to catch because the deployer deletes the pod after a few seconds.)
The faulty code from cfme-template.yaml is:
(This isn't the master copy, however it is what seems to be deployed in standard OCP installs.)
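As a sketch of the pattern at fault (not a verbatim quote of the template; the container name, image, and exact command may differ), the relevant part looks like this:

```yaml
# Sketch of the problematic pattern (illustrative, not copied from the
# deployed template):
containers:
- name: cloudforms-httpd
  image: "..."
  lifecycle:
    postStart:
      exec:
        # postStart runs asynchronously with the container's entrypoint
        # (systemd here), so there is no ordering guarantee relative to
        # the httpd unit's start-condition check.
        command:
        - /usr/bin/save-container-environment
```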
The semantics of postStart (https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/#discussion) mean there is no guarantee that save-container-environment will complete before systemd evaluates the httpd service's start conditions. systemd checks ConditionPathExists once, at the moment it attempts to start the unit; if the file only appears afterwards, the unit simply stays dead. Hence the race.
Version-Release number of selected component (if applicable):
I don't know where the master copy of this template lives; however, the issue is visible here:
It's a race: on some environments it passes unnoticed because the failure is intermittent, and the deployer deletes and recreates the errored pod a number of times before giving up. On at least one IaaS the failure has proved systematic. When deployment succeeds, it is only by luck.
Steps to Reproduce:
1. Deploy cfme-template.yaml on OpenShift.
2. Watch the httpd pod and see whether it errors on start.
3. If it does start successfully, try deleting the pod so the deployer recreates it, and see whether you can get it to error on start.
Actual results:
httpd pod sometimes ends up in error on startup.
Expected results:
httpd pod running. Always.
The fix is to move what save-container-environment does to an init-container. Init-containers are run in sequence and must complete successfully (exit code 0) before the next is launched, and before the main containers are created and started.
Here is an example fix (which we used to work around the issue internally):
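The exact manifest is not reproduced here; a minimal sketch of the approach (container and volume names are illustrative, and the subPath detail may differ from what we actually deployed) is:

```yaml
# Sketch of the workaround: capture the pod environment in an
# initContainer, which is guaranteed to run to completion (exit 0)
# before the main containers are created and started.
initContainers:
- name: write-container-environment
  image: "..."            # same image as the main container
  command:
  - /bin/sh
  - -c
  # Dump the environment (filtering shell-internal variables) into a
  # shared volume.
  - env | egrep -v '^(SHLVL|PWD|_)=' > /env/container-environment
  volumeMounts:
  - name: container-env
    mountPath: /env
containers:
- name: httpd
  image: "..."
  volumeMounts:
  # Mount only the generated file at the path the systemd unit checks,
  # so /etc itself is left untouched.
  - name: container-env
    mountPath: /etc/container-environment
    subPath: container-environment
volumes:
- name: container-env
  emptyDir: {}
```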
Now, my fix is a bit ugly, and if I were doing a definitive fix I would make a few changes.
At the very least, I would change save-container-environment so it writes to a subdirectory of /etc (say /etc/httpd-environment), rather than directly to a file in /etc. That way the initContainer spec could simply call s-c-e, instead of spelling out the low-level `env | egrep ...` command. The problem as things stand is that s-c-e writes directly to a file called /etc/config, which would force the initContainer to mount an emptyDir on /etc itself; that might work, but it is rather ugly. Changing s-c-e means changing the image, however - I don't know where its source lives - and doing so would break the template, since the two changes need to be coordinated.
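A minimal sketch of that cleaner shape, assuming (hypothetically) that both the image and the systemd unit are updated so the environment lives under /etc/httpd-environment:

```yaml
# Sketch only: assumes save-container-environment has been changed to
# write into /etc/httpd-environment, and that the httpd unit's
# ConditionPathExists is updated to match.
initContainers:
- name: save-container-environment
  image: "..."            # same image as the main container
  command:
  - /usr/bin/save-container-environment
  volumeMounts:
  - name: httpd-environment
    mountPath: /etc/httpd-environment
containers:
- name: httpd
  image: "..."
  volumeMounts:
  # The emptyDir is mounted on a dedicated directory, so nothing else
  # under /etc is shadowed.
  - name: httpd-environment
    mountPath: /etc/httpd-environment
volumes:
- name: httpd-environment
  emptyDir: {}
```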
Pushing things a little further, I wondered whether it was possible to get rid of s-c-e altogether, e.g. by supplying the file from a ConfigMap volume; however that looks a little tricky, since s-c-e also captures runtime values such as $PATH, which a static ConfigMap cannot know.
Hope this helps,