Description of problem:
Deploy CFME-46 on OCP-3.10 using the openshift-management/config.yml playbook. The playbook finished, but the httpd pod was not running in the end.

[root@ip-172-18-7-193 ~]# oc get pod
NAME                 READY     STATUS             RESTARTS   AGE
cloudforms-0         1/1       Running            0          16m
httpd-1-6p6w2        0/1       CrashLoopBackOff   6          16m
httpd-1-deploy       1/1       Running            0          16m
memcached-1-rftxq    1/1       Running            0          16m
postgresql-1-wvvrb   1/1       Running            0          16m

[root@ip-172-18-7-193 ~]# oc describe pod httpd-1-6p6w2
Name:           httpd-1-6p6w2
Namespace:      openshift-management
Node:           ip-172-18-15-140.ec2.internal/172.18.15.140
Start Time:     Wed, 06 Jun 2018 01:47:44 -0400
Labels:         deployment=httpd-1
                deploymentconfig=httpd
                name=httpd
Annotations:    openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=httpd
                openshift.io/deployment.name=httpd-1
                openshift.io/scc=anyuid
Status:         Running
IP:             10.130.0.17
Controlled By:  ReplicationController/httpd-1
Containers:
  httpd:
    Container ID:   docker://500147788e422274d08cc6b41f473815447fb950455a9843e4bb5e8fe887b45d
    Image:          registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest
    Image ID:       docker-pullable://registry.access.redhat.com/cloudforms46/cfme-openshift-httpd@sha256:235002e7a8bfb31841a2cb4c2afb6c8b599086b5189c34434411d03848d013dc
    Ports:          80/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 06 Jun 2018 02:00:29 -0400
      Finished:     Wed, 06 Jun 2018 02:01:34 -0400
    Ready:          False
    Restart Count:  6
    Limits:
      memory:  8Gi
    Requests:
      cpu:     500m
      memory:  512Mi
    Liveness:   exec [pidof httpd] delay=15s timeout=3s period=10s #success=1 #failure=3
    Readiness:  tcp-socket :80 delay=10s timeout=3s period=10s #success=1 #failure=3
    Environment:
      HTTPD_AUTH_TYPE:             <set to the key 'auth-type' of config map 'httpd-auth-configs'>             Optional: false
      HTTPD_AUTH_KERBEROS_REALMS:  <set to the key 'auth-kerberos-realms' of config map 'httpd-auth-configs'>  Optional: false
    Mounts:
      /etc/httpd/auth-conf.d from httpd-auth-config (rw)
      /etc/httpd/conf.d from httpd-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cfme-httpd-token-8h546 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  httpd-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      httpd-configs
    Optional:  false
  httpd-auth-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      httpd-auth-configs
    Optional:  false
  cfme-httpd-token-8h546:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cfme-httpd-token-8h546
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/compute=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason          Age                 From                                    Message
  ----     ------          ----                ----                                    -------
  Normal   Scheduled       16m                 default-scheduler                       Successfully assigned httpd-1-6p6w2 to ip-172-18-15-140.ec2.internal
  Warning  Failed          16m                 kubelet, ip-172-18-15-140.ec2.internal  Failed to pull image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest": rpc error: code = Unknown desc = Get https://registry.access.redhat.com/v2/cloudforms46/cfme-openshift-httpd/manifests/sha256:c9828511ab7c93a4a31961e2f6b0ed601ae94e8a270ce4e1788f4b971feb3199: net/http: TLS handshake timeout
  Warning  Failed          16m                 kubelet, ip-172-18-15-140.ec2.internal  Error: ErrImagePull
  Normal   BackOff         15m (x3 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Back-off pulling image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   SandboxChanged  15m (x4 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          15m (x3 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  Error: ImagePullBackOff
  Normal   Pulling         15m (x2 over 16m)   kubelet, ip-172-18-15-140.ec2.internal  pulling image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   Pulled          14m                 kubelet, ip-172-18-15-140.ec2.internal  Successfully pulled image "registry.access.redhat.com/cloudforms46/cfme-openshift-httpd:latest"
  Normal   Created         14m                 kubelet, ip-172-18-15-140.ec2.internal  Created container
  Normal   Started         14m                 kubelet, ip-172-18-15-140.ec2.internal  Started container
  Warning  Unhealthy       13m (x3 over 13m)   kubelet, ip-172-18-15-140.ec2.internal  Liveness probe failed:
  Warning  Unhealthy       11m (x16 over 13m)  kubelet, ip-172-18-15-140.ec2.internal  Readiness probe failed: dial tcp 10.130.0.17:80: getsockopt: connection refused
  Warning  BackOff         1m (x28 over 8m)    kubelet, ip-172-18-15-140.ec2.internal  Back-off restarting failed container

Version-Release number of the following components:
[root@ip-172-18-7-193 ~]# openshift version
openshift v3.10.0-0.60.0

[root@ip-172-18-15-140 ~]# docker images |grep cfme
registry.access.redhat.com/cloudforms46/cfme-openshift-httpd        2.4.6-27   133ae4fcad6d   5 weeks ago   349 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-httpd        latest     133ae4fcad6d   5 weeks ago   349 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-postgresql   latest     697bcc27331e   5 weeks ago   339 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-memcached    latest     12991e670cc1   5 weeks ago   251 MB
registry.access.redhat.com/cloudforms46/cfme-openshift-app-ui       latest     8b2e78ea76a8   4 weeks ago   926 MB

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
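In addition to the events above, commands like the following can help pin down why the container keeps restarting (illustrative only; the pod name comes from the oc get pod output above):

# logs of the previous, crashed container instance
oc logs httpd-1-6p6w2 -n openshift-management --previous

# on the node running the pod, check for recent SELinux AVC denials
ausearch -m avc -ts recent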
This may be fixed by https://github.com/openshift/openshift-ansible/pull/8423.

Can you try enabling the container_manage_cgroup sebool on the nodes and see if that fixes the issue?
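For example, something like this on each node (a minimal sketch; the -P flag additionally persists the change across reboots, while plain setsebool only changes the running policy):

# enable the boolean and persist it
setsebool -P container_manage_cgroup on
# confirm the current value
getsebool container_manage_cgroup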
Thanks for your reply, Nick! It did fix the issue. The container_manage_cgroup sebool was initially "off" on the nodes; after setting it to "on" and rolling out the CFME httpd deployment again, the pod runs well. Here are my verification steps:

1. On all nodes:

[root@gpei-bz1587825-node-registry-router-1 ~]# setsebool container_manage_cgroup on
[root@gpei-bz1587825-node-registry-router-1 ~]# getsebool -a |grep container_manage_cgroup
container_manage_cgroup --> on

2. Roll out the httpd deployment again:

[root@gpei-bz1587825-master-etcd-1 ~]# oc rollout latest httpd
deploymentconfig "httpd" rolled out

[root@gpei-bz1587825-master-etcd-1 ~]# oc get pod
NAME                 READY     STATUS    RESTARTS   AGE
cloudforms-0         1/1       Running   0          17h
httpd-1-deploy       0/1       Error     0          17h
httpd-2-dwc9t        1/1       Running   0          38s
memcached-1-8tk4k    1/1       Running   0          17h
postgresql-1-fm8jp   1/1       Running   0          17h

[root@gpei-bz1587825-master-etcd-1 ~]# oc get route
NAME      HOST/PORT                                                 PATH      SERVICES   PORT      TERMINATION     WILDCARD
httpd     httpd-openshift-management.apps.0607-y8f.qe.rhcloud.com             httpd      http      edge/Redirect   None

[root@gpei-bz1587825-master-etcd-1 ~]# curl -Ik https://httpd-openshift-management.apps.0607-y8f.qe.rhcloud.com/
HTTP/1.1 200 OK
Date: Fri, 08 Jun 2018 02:36:01 GMT
...
Verified this bug with openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch. Deployed CFME-46 on an OCP-3.10 cluster using the openshift-management/config.yml playbook; the container_manage_cgroup sebool now gets changed during deployment.

PLAY [Enable sebool container_manage_cgroup] ********************************************************************************************************************************

TASK [Setting sebool container_manage_cgroup] *******************************************************************************************************************************
changed: [ec2-54-174-15-179.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-34-201-220-201.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-87-220-234.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-173-226-200.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-18-232-99-88.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}

[root@ip-172-18-14-218 ~]# getsebool -a |grep container_manage_cgroup
container_manage_cgroup --> on

[root@ip-172-18-14-218 ~]# oc get pod -n openshift-management
NAME                READY     STATUS    RESTARTS   AGE
cloudforms-0        1/1       Running   0          2h
httpd-1-xf8wg       1/1       Running   0          2h
memcached-1-fc4dv   1/1       Running   0          2h
postgresql-1-dz5qb  1/1       Running   0          2h

All pods are running and the CFME web console is available. Moving this bug to VERIFIED.
Moving this bug back to MODIFIED for now; there is a new PR to address this: https://github.com/openshift/openshift-ansible/pull/8838/
Verified this bug with openshift-ansible-3.10.8-1.git.230.830efc0.el7.noarch. The container_manage_cgroup sebool is now set to "on" during node installation.

TASK [openshift_node : Setting sebool container_manage_cgroup] *****************
Tuesday 26 June 2018  02:51:35 -0400 (0:00:00.508)       0:02:05.109 **********
changed: [ec2-34-230-27-81.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-54-80-211-244.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}
changed: [ec2-35-173-204-154.compute-1.amazonaws.com] => {"changed": true, "failed": false, "name": "container_manage_cgroup"}

When deploying CFME on the OCP-3.10 cluster afterwards, the httpd pod runs well.
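To double-check on a node that the boolean is set both in the running policy and in the persistent configuration, something like the following can be used (a minimal sketch; the exact output format depends on the policycoreutils version):

# "(on   ,   on)" means the current and default (persistent) values are both on
semanage boolean -l | grep container_manage_cgroup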
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816