Bug 1649924 - Use of postStart hook to generate env file races with httpd startup
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: cfme-openshift-httpd
Version: 5.10.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: GA
Target Release: 5.10.z
Assignee: Gregg Tanzillo
QA Contact: Sudhir Mallamprabhakara
Docs Contact: Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-14 19:33 UTC by eric.mountain
Modified: 2021-06-10 20:31 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-10 15:21:03 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:



Description eric.mountain 2018-11-14 19:33:58 UTC
Description of problem:

(Opening this on request of Alex Mayberry.)

Following instructions to install CloudForms on OpenShift and applying cfme-template.yaml sometimes leads to a failure in httpd pod startup.


In such cases the httpd service fails to start:

sh-4.2# systemctl status httpd
● httpd.service - The Apache HTTP Server
   Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/httpd.service.d
           └─environment.conf
   Active: inactive (dead)
Condition: start condition failed at Thu 2018-10-25 16:51:17 UTC; 44s ago
           ConditionPathExists=/etc/container-environment was not met
     Docs: man:httpd(8)
           man:apachectl(8)


Yet, when we look, the file is there:

sh-4.2# ls /etc/container-environment -l
-rw-r--r--. 1 root root 854 Oct 25 16:51 /etc/container-environment

(It's difficult to catch because the deployer deletes the pod after a few seconds.)

The faulty code from cfme-template.yaml is:

          lifecycle:
            postStart:
              exec:
                command:
                - "/usr/bin/save-container-environment"


https://github.com/openshift/openshift-ansible/blob/1d44ee17f48d6619030fbceff932d726e1639084/roles/openshift_management/files/templates/cloudforms/cfme-template.yaml#L875-L879

(This isn't the master copy, however it is what seems to be deployed in standard OCP installs.)

The semantics of postStart (https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/#discussion) are that the handler runs asynchronously relative to the container's entrypoint.  So there is no guarantee that save-container-environment will complete before systemd checks the prerequisites for launching the httpd service.  Hence the race.

Version-Release number of selected component (if applicable):

I don't know where the master copy of this is, however the issue is visible here:

https://github.com/openshift/openshift-ansible/blob/1d44ee17f48d6619030fbceff932d726e1639084/roles/openshift_management/files/templates/cloudforms/cfme-template.yaml#L875-L879

How reproducible:

It's a race.  On some environments I think it passes unnoticed because it only happens occasionally, and the deployer deletes the errored pod a number of times before giving up.  On at least one IaaS it has proved systematic.  However, when it works it is just luck.

Steps to Reproduce:

1. Deploy cfme-template.yaml on OpenShift.
2. Watch the httpd pod, see if it errors on start.
3. If it starts successfully, try deleting the pod so that it is recreated, and see whether the new pod errors on start.

Actual results:

httpd pod sometimes ends up in error on startup.

Expected results:

httpd pod running.  Always.

Additional info:

The fix is to move what save-container-environment does to an init-container.  Init-containers are run in sequence and must complete successfully (exit code 0) before the next is launched, and before the main containers are created and started.

Here is an example fix (which we used to work around the issue internally):

https://github.com/EricMountain-1A/openshift-ansible/commit/2a86409d0b8c098812ca4a1d2fb61c05ed9d3b27
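
For readers who do not follow the link, the general shape of an init-container approach can be sketched as below. This is a minimal illustration under stated assumptions, not a copy of the commit above: the volume name, the init-container name, the image reference and the /work mount point are all hypothetical, and the variable filtering is left out. It assumes the init container is given the same environment variables as the httpd container, and that exposing the generated file through a subPath mount at /etc/container-environment is enough to satisfy the ConditionPathExists check shown earlier.

```yaml
# Sketch only: names marked "hypothetical" are not taken from cfme-template.yaml.
spec:
  volumes:
  - name: httpd-environment            # hypothetical volume name
    emptyDir: {}
  initContainers:
  - name: save-environment             # hypothetical container name
    image: example/httpd:latest        # hypothetical; in practice the same image as httpd
    command:
    - /bin/sh
    - -c
    # Dump the environment into the shared volume. Any filtering of the
    # variables is deliberately left out of this sketch.
    - env > /work/container-environment
    volumeMounts:
    - name: httpd-environment
      mountPath: /work                 # hypothetical mount point
  containers:
  - name: httpd
    image: example/httpd:latest        # hypothetical
    volumeMounts:
    # Expose the generated file at the path the systemd unit checks.
    - name: httpd-environment
      mountPath: /etc/container-environment
      subPath: container-environment
```

Because init containers must exit successfully before any main container is started, the file already exists when systemd evaluates the start condition, and the postStart hook is no longer needed. The cost is that any `env:` entries the httpd container relies on have to be repeated on the init container.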

Now, my fix is a bit ugly, and if I were doing a definitive fix I would make a few changes.


At the very least, I would change save-container-environment (s-c-e) so that it writes to a subdirectory of /etc (say /etc/httpd-environment), not directly to a file in /etc.  That way, I could call s-c-e from the initContainer spec, rather than having to write the low-level `env | egrep ...` command.  The problem as things stand is that s-c-e writes directly to a file called /etc/config, and that would force my initContainer to mount an emptyDir on /etc, which might work, but is rather ugly.  Changing s-c-e means changing the image, however.  I don't know where the source for that is, and doing so would break the template (both changes would need to be coordinated).


Pushing things a little further, I wonder whether it would be possible to get rid of s-c-e altogether, e.g. by using a ConfigMap volume.  However, that looks a little tricky because s-c-e grabs $PATH too.

Hope this helps,
Eric Mountain

