Bug 1587825

Summary: CFME httpd pod fail to get started after deployed on ocp-3.10
Product: OpenShift Container Platform Reporter: Gaoyun Pei <gpei>
Component: Installer Assignee: Scott Dodson <sdodson>
Status: CLOSED ERRATA QA Contact: Gaoyun Pei <gpei>
Severity: high Docs Contact:
Priority: high    
Version: 3.10.0 CC: aos-bugs, bleanhar, gpei, jokerman, mmccomas, ncarboni, pasik, xtian
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
A recent change in SELinux policy requires an additional SELinux boolean (container_manage_cgroup) to be set on the nodes when running any pod that uses systemd, which includes the CFME pods.
Story Points: ---
Clone Of:
: 1589929 Environment:
Last Closed: 2018-07-30 19:17:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1589929    

Description Gaoyun Pei 2018-06-06 07:38:35 UTC
Description of problem:
Deployed CFME-46 on OCP-3.10 using the openshift-management/config.yml playbook. The playbook finished, but the httpd pod was not running at the end.

[root@ip-172-18-7-193 ~]# oc get pod
NAME                 READY     STATUS             RESTARTS   AGE
cloudforms-0         1/1       Running            0          16m
httpd-1-6p6w2        0/1       CrashLoopBackOff   6          16m
httpd-1-deploy       1/1       Running            0          16m
memcached-1-rftxq    1/1       Running            0          16m
postgresql-1-wvvrb   1/1       Running            0          16m
[root@ip-172-18-7-193 ~]# oc describe pod httpd-1-6p6w2
Name:           httpd-1-6p6w2
Namespace:      openshift-management
Node:           ip-172-18-15-140.ec2.internal/
Start Time:     Wed, 06 Jun 2018 01:47:44 -0400
Labels:         deployment=httpd-1
Annotations:    openshift.io/deployment-config.latest-version=1
Status:         Running
Controlled By:  ReplicationController/httpd-1
    Container ID:   docker://500147788e422274d08cc6b41f473815447fb950455a9843e4bb5e8fe887b45d
    Image:          registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest
    Image ID:       docker-pullable://registry.access.redhat.com/cloudforms46/cfme-openshift-httpd@sha256:235002e7a8bfb31841a2cb4c2afb6c8b599086b5189c34434411d03848d013dc
    Ports:          80/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 06 Jun 2018 02:00:29 -0400
      Finished:     Wed, 06 Jun 2018 02:01:34 -0400
    Ready:          False
    Restart Count:  6
      memory:  8Gi
      cpu:      500m
      memory:   512Mi
    Liveness:   exec [pidof httpd] delay=15s timeout=3s period=10s #success=1 #failure=3
    Readiness:  tcp-socket :80 delay=10s timeout=3s period=10s #success=1 #failure=3
      HTTPD_AUTH_TYPE:             <set to the key 'auth-type' of config map 'httpd-auth-configs'>             Optional: false
      HTTPD_AUTH_KERBEROS_REALMS:  <set to the key 'auth-kerberos-realms' of config map 'httpd-auth-configs'>  Optional: false
      /etc/httpd/auth-conf.d from httpd-auth-config (rw)
      /etc/httpd/conf.d from httpd-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cfme-httpd-token-8h546 (ro)
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      httpd-configs
    Optional:  false
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      httpd-auth-configs
    Optional:  false
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cfme-httpd-token-8h546
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
  Type     Reason          Age                 From                                    Message
  ----     ------          ----                ----                                    -------
  Normal   Scheduled       16m                 default-scheduler                       Successfully assigned httpd-1-6p6w2 to ip-172-18-15-140.ec2.internal
  Warning  Failed          16m                 kubelet, ip-172-18-15-140.ec2.internal  Failed to pull image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest": rpc error: code = Unknown desc = Get https://registry.access.redhat.com/v2/cloudforms46/cfme-openshift-httpd/manifests/sha256:c9828511ab7c93a4a31961e2f6b0ed601ae94e8a270ce4e1788f4b971feb3199: net/http: TLS handshake timeout
  Warning  Failed          16m                 kubelet, ip-172-18-15-140.ec2.internal  Error: ErrImagePull
  Normal   BackOff         15m (x3 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Back-off pulling image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   SandboxChanged  15m (x4 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          15m (x3 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Error: ImagePullBackOff
  Normal   Pulling         15m (x2 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  pulling image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   Pulled          14m                 kubelet, ip-172-18-15-140.ec2.internal  Successfully pulled image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   Created         14m                 kubelet, ip-172-18-15-140.ec2.internal  Created container
  Normal   Started         14m                 kubelet, ip-172-18-15-140.ec2.internal  Started container
  Warning  Unhealthy       13m (x3 over 13m)   kubelet, ip-172-18-15-140.ec2.internal  Liveness probe failed:
  Warning  Unhealthy       11m (x16 over 13m)  kubelet, ip-172-18-15-140.ec2.internal  Readiness probe failed: dial tcp getsockopt: connection refused
  Warning  BackOff         1m (x28 over 8m)    kubelet, ip-172-18-15-140.ec2.internal  Back-off restarting failed container
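
Note on the exit code above: 137 is 128 + 9, so the container process was terminated by signal 9 (SIGKILL). Together with the liveness probe failures in the events, this is consistent with the kubelet killing the container after `pidof httpd` kept failing, rather than httpd exiting on its own. A minimal sketch of that arithmetic (the function name `decode_exit_code` is made up for illustration):

```shell
# Decode a container exit code into the signal that terminated it, if any.
# By shell convention, a status above 128 means "killed by signal (status - 128)".
decode_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # kill -l <number> prints the signal name for that number
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $code"
    fi
}

decode_exit_code 137
```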


Version-Release number of the following components:
[root@ip-172-18-7-193 ~]# openshift version
openshift v3.10.0-0.60.0

[root@ip-172-18-15-140 ~]# docker images |grep cfme
registry.access.redhat.com/cloudforms46/cfme-openshift-httpd           2.4.6-27            133ae4fcad6d        5 weeks ago         349 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-httpd           latest              133ae4fcad6d        5 weeks ago         349 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-postgresql      latest              697bcc27331e        5 weeks ago         339 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-memcached       latest              12991e670cc1        5 weeks ago         251 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-app-ui              latest              8b2e78ea76a8        4 weeks ago         926 MB


Comment 4 Nick Carboni 2018-06-07 14:02:11 UTC
This may be fixed by https://github.com/openshift/openshift-ansible/pull/8423

Can you try enabling the container_manage_cgroup sebool on the nodes and see if that fixes the issue?
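
A minimal sketch of that check, to be run as root on each node. The helper name `ensure_sebool_on` is hypothetical; `getsebool` and `setsebool -P` are the standard SELinux tools, and `-P` makes the setting persist across reboots.

```shell
# Enable an SELinux boolean persistently unless it is already on.
# getsebool prints "<name> --> on|off", so the state is the third field.
ensure_sebool_on() {
    name=$1
    state=$(getsebool "$name" | awk '{print $3}')
    if [ "$state" = "on" ]; then
        echo "$name already on"
    else
        # -P writes the value to the policy store so it survives a reboot
        setsebool -P "$name" on && echo "$name enabled"
    fi
}
```

Usage on a node: `ensure_sebool_on container_manage_cgroup`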

Comment 5 Gaoyun Pei 2018-06-08 02:43:55 UTC
Thanks for your reply Nick! It did fix the issue.

The container_manage_cgroup sebool was "off" on the nodes initially. After setting it to "on" and rolling out the CFME httpd deployment again, the pod runs well.

Here are my verification steps:
1. On all nodes
[root@gpei-bz1587825-node-registry-router-1 ~]# setsebool container_manage_cgroup on
[root@gpei-bz1587825-node-registry-router-1 ~]# getsebool -a |grep container_manage_cgroup
container_manage_cgroup --> on

2. Roll out the httpd deployment
[root@gpei-bz1587825-master-etcd-1 ~]# oc rollout latest httpd
deploymentconfig "httpd" rolled out
[root@gpei-bz1587825-master-etcd-1 ~]# oc get pod
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          17h
httpd-1-deploy       0/1       Error     0          17h
httpd-2-dwc9t        1/1       Running   0          38s
memcached-1-8tk4k    1/1       Running   0          17h
postgresql-1-fm8jp   1/1       Running   0          17h
[root@gpei-bz1587825-master-etcd-1 ~]# oc get route
NAME      HOST/PORT                                                 PATH      SERVICES   PORT      TERMINATION     WILDCARD
httpd     httpd-openshift-management.apps.0607-y8f.qe.rhcloud.com             httpd      http      edge/Redirect   None
[root@gpei-bz1587825-master-etcd-1 ~]# curl -Ik https://httpd-openshift-management.apps.0607-y8f.qe.rhcloud.com/
HTTP/1.1 200 OK
Date: Fri, 08 Jun 2018 02:36:01 GMT
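
The pod check in step 2 can be scripted as a filter over the `oc get pod` listing. This is only a sketch: the function name `not_running_pods` is hypothetical, and skipping pods whose name ends in `-deploy` is an assumption made here so that leftover deployer pods (like httpd-1-deploy above) are not reported.

```shell
# Print the names of pods whose STATUS column is not "Running".
# Reads `oc get pod` output on stdin; NR > 1 skips the header row.
# Deployer pods (names ending in "-deploy") are ignored by assumption.
not_running_pods() {
    awk 'NR > 1 && $1 !~ /-deploy$/ && $3 != "Running" { print $1 }'
}
```

Usage: `oc get pod -n openshift-management | not_running_pods`. Empty output means every non-deployer pod reports Running.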

Comment 6 Gaoyun Pei 2018-06-14 06:25:42 UTC
Verified this bug with openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch.
Deployed CFME-46 on an ocp-3.10 cluster using the openshift-management/config.yml playbook. The container_manage_cgroup sebool was changed during deployment.

PLAY [Enable sebool container_manage_cgroup] ********************************************************************************************************************************

TASK [Setting sebool container_manage_cgroup] *******************************************************************************************************************************
changed: [ec2-54-174-15-179.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-34-201-220-201.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-87-220-234.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-173-226-200.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-18-232-99-88.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}

[root@ip-172-18-14-218 ~]# getsebool -a |grep container_manage_cgroup
container_manage_cgroup --> on

[root@ip-172-18-14-218 ~]# oc get pod -n openshift-management
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          2h
httpd-1-xf8wg        1/1       Running   0          2h
memcached-1-fc4dv    1/1       Running   0          2h
postgresql-1-dz5qb   1/1       Running   0          2h

All pods are running well and the CFME web console is available. Moving this bug to VERIFIED.

Comment 8 Gaoyun Pei 2018-06-19 10:04:10 UTC
Moving this bug back to MODIFIED for now, since we have a new PR to address this:

Comment 10 Gaoyun Pei 2018-06-26 09:42:47 UTC
Verified this bug with openshift-ansible-3.10.8-1.git.230.830efc0.el7.noarch.

The container_manage_cgroup sebool is now set to "on" during node installation.

TASK [openshift_node : Setting sebool container_manage_cgroup] *****************
Tuesday 26 June 2018  02:51:35 -0400 (0:00:00.508)       0:02:05.109 ********** 
changed: [ec2-34-230-27-81.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-80-211-244.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-35-173-204-154.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}

When CFME is deployed on the ocp-3.10 cluster, the httpd pod runs well.

Comment 12 errata-xmlrpc 2018-07-30 19:17:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.