Description of problem:
Deploy CFME-46 on OCP-3.10 using the openshift-management/config.yml playbook. The playbook finished, but the httpd pod was not running in the end.

[root@ip-172-18-7-193 ~]# oc get pod
NAME                 READY     STATUS             RESTARTS   AGE
cloudforms-0         1/1       Running            0          16m
httpd-1-6p6w2        0/1       CrashLoopBackOff   6          16m
httpd-1-deploy       1/1       Running            0          16m
memcached-1-rftxq    1/1       Running            0          16m
postgresql-1-wvvrb   1/1       Running            0          16m

[root@ip-172-18-7-193 ~]# oc describe pod httpd-1-6p6w2
Name:           httpd-1-6p6w2
Namespace:      openshift-management
Node:           ip-172-18-15-140.ec2.internal/172.18.15.140
Start Time:     Wed, 06 Jun 2018 01:47:44 -0400
Labels:         deployment=httpd-1
                deploymentconfig=httpd
                name=httpd
Annotations:    openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=httpd
                openshift.io/deployment.name=httpd-1
                openshift.io/scc=anyuid
Status:         Running
IP:             10.130.0.17
Controlled By:  ReplicationController/httpd-1
Containers:
  httpd:
    Container ID:   docker://500147788e422274d08cc6b41f473815447fb950455a9843e4bb5e8fe887b45d
    Image:          registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest
    Image ID:       docker-pullable://registry.access.redhat.com/cloudforms46/cfme-openshift-httpd@sha256:235002e7a8bfb31841a2cb4c2afb6c8b599086b5189c34434411d03848d013dc
    Ports:          80/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 06 Jun 2018 02:00:29 -0400
      Finished:     Wed, 06 Jun 2018 02:01:34 -0400
    Ready:          False
    Restart Count:  6
    Limits:
      memory:  8Gi
    Requests:
      cpu:     500m
      memory:  512Mi
    Liveness:   exec [pidof httpd] delay=15s timeout=3s period=10s #success=1 #failure=3
    Readiness:  tcp-socket :80 delay=10s timeout=3s period=10s #success=1 #failure=3
    Environment:
      HTTPD_AUTH_TYPE:             <set to the key 'auth-type' of config map 'httpd-auth-configs'>             Optional: false
      HTTPD_AUTH_KERBEROS_REALMS:  <set to the key 'auth-kerberos-realms' of config map 'httpd-auth-configs'>  Optional: false
    Mounts:
      /etc/httpd/auth-conf.d from httpd-auth-config (rw)
      /etc/httpd/conf.d from httpd-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cfme-httpd-token-8h546 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  httpd-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      httpd-configs
    Optional:  false
  httpd-auth-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      httpd-auth-configs
    Optional:  false
  cfme-httpd-token-8h546:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cfme-httpd-token-8h546
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason          Age                 From                                    Message
  ----     ------          ----                ----                                    -------
  Normal   Scheduled       16m                 default-scheduler                       Successfully assigned httpd-1-6p6w2 to ip-172-18-15-140.ec2.internal
  Warning  Failed          16m                 kubelet, ip-172-18-15-140.ec2.internal  Failed to pull image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest": rpc error: code = Unknown desc = Get https://registry.access.redhat.com/v2/cloudforms46/cfme-openshift-httpd/manifests/sha256:c9828511ab7c93a4a31961e2f6b0ed601ae94e8a270ce4e1788f4b971feb3199: net/http: TLS handshake timeout
  Warning  Failed          16m                 kubelet, ip-172-18-15-140.ec2.internal  Error: ErrImagePull
  Normal   BackOff         15m (x3 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Back-off pulling image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   SandboxChanged  15m (x4 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          15m (x3 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Error: ImagePullBackOff
  Normal   Pulling         15m (x2 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  pulling image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   Pulled          14m                 kubelet, ip-172-18-15-140.ec2.internal  Successfully pulled image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   Created         14m                 kubelet, ip-172-18-15-140.ec2.internal  Created container
  Normal   Started         14m                 kubelet, ip-172-18-15-140.ec2.internal  Started container
  Warning  Unhealthy       13m (x3 over 13m)   kubelet, ip-172-18-15-140.ec2.internal  Liveness probe failed:
  Warning  Unhealthy       11m (x16 over 13m)  kubelet, ip-172-18-15-140.ec2.internal  Readiness probe failed: dial tcp 10.130.0.17:80: getsockopt: connection refused
  Warning  BackOff         1m (x28 over 8m)    kubelet, ip-172-18-15-140.ec2.internal  Back-off restarting failed container

Version-Release number of the following components:
[root@ip-172-18-7-193 ~]# openshift version
openshift v3.10.0-0.60.0

[root@ip-172-18-15-140 ~]# docker images |grep cfme
registry.access.redhat.com/cloudforms46/cfme-openshift-httpd        2.4.6-27   133ae4fcad6d   5 weeks ago   349 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-httpd        latest     133ae4fcad6d   5 weeks ago   349 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-postgresql   latest     697bcc27331e   5 weeks ago   339 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-memcached    latest     12991e670cc1   5 weeks ago   251 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-app-ui       latest     8b2e78ea76a8   4 weeks ago   926 MB

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
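In addition to the events above, commands like the following can help pin down why the container keeps restarting (illustrative only; the pod name comes from the oc get pod output above):

# logs of the previous, crashed container instance
oc logs httpd-1-6p6w2 -n openshift-management --previous

# on the node running the pod, check for recent SELinux AVC denials
ausearch -m avc -ts recent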
This may be fixed by https://github.com/openshift/openshift-ansible/pull/8423.

Can you try enabling the container_manage_cgroup sebool on the nodes and see if that fixes the issue?
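For example, something like this on each node (a minimal sketch; the -P flag additionally persists the change across reboots, while plain setsebool only changes the running policy):

# enable the boolean and persist it
setsebool -P container_manage_cgroup on
# confirm the current value
getsebool container_manage_cgroup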
Thanks for your reply, Nick! It did fix the issue. The container_manage_cgroup sebool was initially "off" on the nodes; after setting it to "on" and rolling out the CFME httpd deployment again, the pod runs well. Here are my verification steps:

1. On all nodes:

[root@gpei-bz1587825-node-registry-router-1 ~]# setsebool container_manage_cgroup on
[root@gpei-bz1587825-node-registry-router-1 ~]# getsebool -a |grep container_manage_cgroup
container_manage_cgroup --> on

2. Roll out the httpd deployment again:

[root@gpei-bz1587825-master-etcd-1 ~]# oc rollout latest httpd
deploymentconfig "httpd" rolled out

[root@gpei-bz1587825-master-etcd-1 ~]# oc get pod
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          17h
httpd-1-deploy       0/1       Error     0          17h
httpd-2-dwc9t        1/1       Running   0          38s
memcached-1-8tk4k    1/1       Running   0          17h
postgresql-1-fm8jp   1/1       Running   0          17h

[root@gpei-bz1587825-master-etcd-1 ~]# oc get route
NAME      HOST/PORT                                                 PATH      SERVICES   PORT      TERMINATION     WILDCARD
httpd     httpd-openshift-management.apps.0607-y8f.qe.rhcloud.com             httpd      http      edge/Redirect   None

[root@gpei-bz1587825-master-etcd-1 ~]# curl -Ik https://httpd-openshift-management.apps.0607-y8f.qe.rhcloud.com/
HTTP/1.1 200 OK
Date: Fri, 08 Jun 2018 02:36:01 GMT
...
Verified this bug with openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch. Deployed CFME-46 on an OCP-3.10 cluster using the openshift-management/config.yml playbook; the container_manage_cgroup sebool now gets changed during deployment.

PLAY [Enable sebool container_manage_cgroup] ********************************************************************************************************************************

TASK [Setting sebool container_manage_cgroup] *******************************************************************************************************************************
changed: [ec2-54-174-15-179.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-34-201-220-201.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-87-220-234.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-173-226-200.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-18-232-99-88.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}

[root@ip-172-18-14-218 ~]# getsebool -a |grep container_manage_cgroup
container_manage_cgroup --> on

[root@ip-172-18-14-218 ~]# oc get pod -n openshift-management
NAME                READY     STATUS    RESTARTS   AGE
cloudforms-0        1/1       Running   0          2h
httpd-1-xf8wg       1/1       Running   0          2h
memcached-1-fc4dv   1/1       Running   0          2h
postgresql-1-dz5qb  1/1       Running   0          2h

All pods are running and the CFME web console is available. Moving this bug to VERIFIED.
Moving this bug back to MODIFIED for now; there is a new PR to address this: https://github.com/openshift/openshift-ansible/pull/8838/
Verified this bug with openshift-ansible-3.10.8-1.git.230.830efc0.el7.noarch. The container_manage_cgroup sebool is now set to "on" during node installation.

TASK [openshift_node : Setting sebool container_manage_cgroup] *****************
Tuesday 26 June 2018  02:51:35 -0400 (0:00:00.508)       0:02:05.109 **********
changed: [ec2-34-230-27-81.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-80-211-244.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-35-173-204-154.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}

When deploying CFME on the OCP-3.10 cluster afterwards, the httpd pod runs well.
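To double-check on a node that the boolean is set both in the running policy and in the persistent configuration, something like the following can be used (a minimal sketch; the exact output format depends on the policycoreutils version):

# "(on   ,   on)" means the current and default (persistent) values are both on
semanage boolean -l | grep container_manage_cgroup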
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816