Description of problem:
Using openshift-ansible (latest checkout from GitHub) to deploy CNS on OCP 3.6. RHGS containers from images later than tag 3.3.0-15 fail to start, with an error message from within the container saying that the TCMU_LOGDIR and GB_GLFS_LRU_COUNT environment variables are not set.

How reproducible:
Always

Steps to Reproduce:
1. Download the latest rhgs3/rhgs-(server|volmanager)-rhel7 images
2. Deploy using openshift-ansible and the [glusterfs] inventory groups
3. The deployment playbooks time out waiting for the GlusterFS pods to start

Actual results:
Deployment fails. The OCP deployment also fails if the required pods were designated to provide storage to the registry.

Expected results:
Deployment succeeds; the containers come up with reasonable default values for the above-mentioned environment variables when they are not set.
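The expected behaviour could look roughly like the following shell sketch. It only illustrates defaulting unset variables with parameter expansion; it is not the actual RHGS container startup script, and the function name and both default values are placeholders assumed for the example.

```shell
#!/bin/sh
# Hypothetical sketch: fall back to defaults when the variables are unset or empty.
# The default values are assumptions for illustration, not the values shipped in the image.
set_gluster_block_defaults() {
    # ":=" assigns the default only if the variable is unset or empty.
    : "${TCMU_LOGDIR:=/var/log/glusterfs/gluster-block}"
    : "${GB_GLFS_LRU_COUNT:=5}"
    export TCMU_LOGDIR GB_GLFS_LRU_COUNT
}

set_gluster_block_defaults
echo "TCMU_LOGDIR=$TCMU_LOGDIR GB_GLFS_LRU_COUNT=$GB_GLFS_LRU_COUNT"
```

Values already passed in through the pod spec are left untouched, so an explicit setting in the template would still win over the fallback.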
Thanks Daniel, we are looking into this issue.
Test results:

Deployment works:

###############################
TASK [openshift_excluder : Enable openshift excluder] **************************
changed: [10.70.47.111]
changed: [10.70.47.22]
changed: [10.70.46.202]
changed: [10.70.47.145]

PLAY RECAP *********************************************************************
10.70.46.202               : ok=252  changed=41   unreachable=0    failed=0
10.70.47.111               : ok=727  changed=183  unreachable=0    failed=0
10.70.47.145               : ok=251  changed=41   unreachable=0    failed=0
10.70.47.22                : ok=252  changed=41   unreachable=0    failed=0
localhost                  : ok=14   changed=0    unreachable=0    failed=0

# oc project
Using project "glusterfs" on server "https://dhcp47-111.lab.eng.blr.redhat.com:8443".

[root@dhcp47-111 ~]# oc get all
NAME                REVISION   DESIRED   CURRENT   TRIGGERED BY
dc/heketi-storage   1          1         1         config

NAME                  DESIRED   CURRENT   READY     AGE
rc/heketi-storage-1   1         1         1         50m

NAME                    HOST/PORT                                          PATH      SERVICES         PORT      TERMINATION   WILDCARD
routes/heketi-storage   heketi-storage-glusterfs.cloudapps.mystorage.com             heketi-storage   <all>                   None

NAME                              CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
svc/heketi-db-storage-endpoints   172.30.203.70   <none>        1/TCP      50m
svc/heketi-storage                172.30.255.23   <none>        8080/TCP   50m

NAME                         READY     STATUS    RESTARTS   AGE
po/glusterfs-storage-0t6c5   1/1       Running   0          1h
po/glusterfs-storage-sn930   1/1       Running   0          1h
po/glusterfs-storage-vq22s   1/1       Running   0          1h
po/heketi-storage-1-kn81q    1/1       Running   0          50m
###############################

However, if the setup does not have access to the external public registry, it is very likely to hit this issue:
https://github.com/openshift/openshift-ansible/pull/5562

For example, if BLOCK_REGISTRY='--block-registry docker.io' is configured in Docker, the Ansible deployment fails because the heketi-storage-copy-job currently tries to pull the "heketi/heketi:dev" image. See below:

**************************************************************************
# oc get pods
NAME                            READY     STATUS              RESTARTS   AGE
deploy-heketi-storage-1-b92l9   1/1       Running             0          1m
glusterfs-storage-0t6c5         1/1       Running             0          5m
glusterfs-storage-sn930         1/1       Running             0          5m
glusterfs-storage-vq22s         1/1       Running             0          5m
heketi-storage-copy-job-sczx6   0/1       ContainerCreating   0          9s

# oc get pods
NAME                            READY     STATUS         RESTARTS   AGE
deploy-heketi-storage-1-b92l9   1/1       Running        0          3m
glusterfs-storage-0t6c5         1/1       Running        0          7m
glusterfs-storage-sn930         1/1       Running        0          7m
glusterfs-storage-vq22s         1/1       Running        0          7m
heketi-storage-copy-job-sczx6   0/1       ErrImagePull   0          1m

# oc describe pod heketi-storage-copy-job-sczx6
Name:             heketi-storage-copy-job-sczx6
Namespace:        glusterfs
Security Policy:  privileged
Node:             dhcp46-202.lab.eng.blr.redhat.com/10.70.46.202
Start Time:       Wed, 04 Oct 2017 19:15:39 +0530
Labels:           controller-uid=4edcee40-a90a-11e7-bd89-005056a53cea
                  job-name=heketi-storage-copy-job
Annotations:      kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"Job","namespace":"glusterfs","name":"heketi-storage-copy-job","uid":"4edcee40-a90a-11e7-bd89-005056a53cea"...
                  openshift.io/scc=privileged
Status:           Pending
IP:               10.131.0.6
Controllers:      Job/heketi-storage-copy-job
Containers:
  heketi:
    Container ID:
    Image:          heketi/heketi:dev
    Image ID:
    Port:
    Command:
      cp
      /db/heketi.db
      /heketi
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /db from heketi-storage-secret (rw)
      /heketi from heketi-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pxl40 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  heketi-storage:
    Type:           Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
    EndpointsName:  heketi-storage-endpoints
    Path:           heketidbstorage
    ReadOnly:       false
  heketi-storage-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  heketi-storage-secret
    Optional:    false
  default-token-pxl40:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pxl40
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  FirstSeen  LastSeen  Count  From                                        SubObjectPath            Type     Reason      Message
  ---------  --------  -----  ----                                        -------------            -------- ------      -------
  46s        46s       1      default-scheduler                                                    Normal   Scheduled   Successfully assigned heketi-storage-copy-job-sczx6 to dhcp46-202.lab.eng.blr.redhat.com
  31s        31s       1      kubelet, dhcp46-202.lab.eng.blr.redhat.com  spec.containers{heketi}  Normal   BackOff     Back-off pulling image "heketi/heketi:dev"
  41s        17s       2      kubelet, dhcp46-202.lab.eng.blr.redhat.com  spec.containers{heketi}  Normal   Pulling     pulling image "heketi/heketi:dev"
  32s        8s        2      kubelet, dhcp46-202.lab.eng.blr.redhat.com  spec.containers{heketi}  Warning  Failed      Failed to pull image "heketi/heketi:dev": rpc error: code = 2 desc = unknown: Not Found
  32s        8s        3      kubelet, dhcp46-202.lab.eng.blr.redhat.com                           Warning  FailedSync  Error syncing pod
**************************************************************************

Jose, could you please confirm whether this is the current behaviour in the OCP 3.6 Ansible installer?
Once you confirm, I'll go ahead and open a separate BZ against OCP 3.6 to track it.
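For anyone checking their own setup for the same symptom, a small filter over the `oc get pods` listing can pick out pods stuck on an image pull. This is a minimal sketch: the function name image_pull_failures is made up for illustration, and it assumes the default column layout shown above, with STATUS in the third column.

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get pods` style output on stdin and prints
# the names of pods whose STATUS column is ErrImagePull or ImagePullBackOff.
# Assumes the default layout: NAME READY STATUS RESTARTS AGE.
image_pull_failures() {
    awk '$3 == "ErrImagePull" || $3 == "ImagePullBackOff" { print $1 }'
}
```

Typical use would be `oc get pods | image_pull_failures`, followed by `oc describe pod <name>` on each reported pod to see which image could not be pulled.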
Confirmed. Let's see if OCP will allow this fix into 3.6.z. :)
(In reply to Jose A. Rivera from comment #6)
> Confirmed. Let's see if OCP will allow this fix into 3.6.z. :)

Thanks for confirming, Jose. I'll file a BZ against OCP 3.6 soon, and let's try to get the fix into 3.6.z. :)
Based on Comment 5, moving this BZ to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2877