Bug 1498595 - Cannot deploy CNS on OCP 3.6 as the heketi-storage-copy-job is trying to pull a wrong image "heketi/heketi:dev" instead of "rhgs3/rhgs-volmanager-rhel7:latest"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Jose A. Rivera
QA Contact: Wenkai Shi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-04 17:18 UTC by Prasanth
Modified: 2018-03-21 12:30 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1494270
Environment:
Last Closed: 2017-12-14 21:01:55 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:3438 normal SHIPPED_LIVE OpenShift Container Platform 3.6 and 3.5 bug fix and enhancement update 2017-12-15 01:58:11 UTC

Description Prasanth 2017-10-04 17:18:06 UTC
+++ This bug was initially created as a clone of Bug #1494270 +++

Description of problem:

Trying to use openshift-ansible (latest checkout from GitHub) to deploy CNS on OCP 3.6. Startup of RHGS containers later than image tag 3.3.0-15 fails with an error message from within the container about the TCMU_LOGDIR and GB_GLFS_LRU_COUNT environment variables not being set.

How reproducible:

always

Steps to Reproduce:
1. Download latest rhgs3/rhgs-(server|volmanager)-rhel7 images
2. Deploy using openshift-ansible and [glusterfs] inventory groups
3. Deployment playbook times out waiting for GlusterFS pods to start

Actual results:

Deployment fails. The OCP deployment as a whole also fails if the required pods were designated to provide storage to the registry.

Expected results:

Deployment succeeds; containers come up with reasonable default values for the above-mentioned environment variables if they are not set.


If the setup doesn't have access to the external public registry, there is a high chance of hitting this issue: https://github.com/openshift/openshift-ansible/pull/5562

Ex: If BLOCK_REGISTRY='--block-registry docker.io' is configured in docker, the Ansible deployment fails because the heketi-storage-copy-job currently tries to pull the "heketi/heketi:dev" image. See below:


**************************************************************************
# oc get pods
NAME                            READY     STATUS              RESTARTS   AGE
deploy-heketi-storage-1-b92l9   1/1       Running             0          1m
glusterfs-storage-0t6c5         1/1       Running             0          5m
glusterfs-storage-sn930         1/1       Running             0          5m
glusterfs-storage-vq22s         1/1       Running             0          5m
heketi-storage-copy-job-sczx6   0/1       ContainerCreating   0          9s


# oc get pods
NAME                            READY     STATUS         RESTARTS   AGE
deploy-heketi-storage-1-b92l9   1/1       Running        0          3m
glusterfs-storage-0t6c5         1/1       Running        0          7m
glusterfs-storage-sn930         1/1       Running        0          7m
glusterfs-storage-vq22s         1/1       Running        0          7m
heketi-storage-copy-job-sczx6   0/1       ErrImagePull   0          1m


# oc describe pod heketi-storage-copy-job-sczx6
Name:                   heketi-storage-copy-job-sczx6
Namespace:              glusterfs
Security Policy:        privileged
Node:                   dhcp46-202.lab.eng.blr.redhat.com/10.70.46.202
Start Time:             Wed, 04 Oct 2017 19:15:39 +0530
Labels:                 controller-uid=4edcee40-a90a-11e7-bd89-005056a53cea
                        job-name=heketi-storage-copy-job
Annotations:            kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"Job","namespace":"glusterfs","name":"heketi-storage-copy-job","uid":"4edcee40-a90a-11e7-bd89-005056a53cea"...
                        openshift.io/scc=privileged
Status:                 Pending
IP:                     10.131.0.6
Controllers:            Job/heketi-storage-copy-job
Containers:
  heketi:
    Container ID:
    Image:              heketi/heketi:dev
    Image ID:
    Port:
    Command:
      cp
      /db/heketi.db
      /heketi
    State:              Waiting
      Reason:           ImagePullBackOff
    Ready:              False
    Restart Count:      0
    Environment:        <none>
    Mounts:
      /db from heketi-storage-secret (rw)
      /heketi from heketi-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pxl40 (ro)
Conditions:
  Type          Status
  Initialized   True 
  Ready         False 
  PodScheduled  True 
Volumes:
  heketi-storage:
    Type:               Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
    EndpointsName:      heketi-storage-endpoints
    Path:               heketidbstorage
    ReadOnly:           false
  heketi-storage-secret:
    Type:       Secret (a volume populated by a Secret)
    SecretName: heketi-storage-secret
    Optional:   false
  default-token-pxl40:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-pxl40
    Optional:   false
QoS Class:      BestEffort
Node-Selectors: <none>
Tolerations:    <none>
Events:
  FirstSeen     LastSeen        Count   From                                            SubObjectPath           Type            Reason          Message
  ---------     --------        -----   ----                                            -------------           --------        ------          -------
  46s           46s             1       default-scheduler                                                       Normal          Scheduled       Successfully assigned heketi-storage-copy-job-sczx6 to dhcp46-202.lab.eng.blr.redhat.com
  31s           31s             1       kubelet, dhcp46-202.lab.eng.blr.redhat.com      spec.containers{heketi} Normal          BackOff         Back-off pulling image "heketi/heketi:dev"
  41s           17s             2       kubelet, dhcp46-202.lab.eng.blr.redhat.com      spec.containers{heketi} Normal          Pulling         pulling image "heketi/heketi:dev"
  32s           8s              2       kubelet, dhcp46-202.lab.eng.blr.redhat.com      spec.containers{heketi} Warning         Failed          Failed to pull image "heketi/heketi:dev": rpc error: code = 2 desc = unknown: Not Found
  32s           8s              3       kubelet, dhcp46-202.lab.eng.blr.redhat.com                              Warning         FailedSync      Error syncing pod
**************************************************************************
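The failing image can also be picked out of the `oc describe pod` text programmatically. The following is only a minimal sketch: the function name `waiting_images` and the `SAMPLE` string are hypothetical, and the parser assumes the field layout shown in the output above (an `Image:` line followed later by `State:` and `Reason:` lines), which may differ in other `oc` versions.

```python
def waiting_images(describe_text):
    """Return (image, reason) pairs for containers in the Waiting state.

    Assumes the plain-text layout of `oc describe pod` shown in this report.
    """
    results = []
    image = None
    waiting = False
    for line in describe_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Image:"):
            # "Image ID:" does not match because of the space before "ID".
            image = stripped.split(":", 1)[1].strip()
            waiting = False
        elif stripped.startswith("State:"):
            waiting = stripped.split(":", 1)[1].strip() == "Waiting"
        elif stripped.startswith("Reason:") and waiting and image:
            results.append((image, stripped.split(":", 1)[1].strip()))
            waiting = False
    return results


# Excerpt copied from the describe output in this report.
SAMPLE = """\
Containers:
  heketi:
    Container ID:
    Image:              heketi/heketi:dev
    Image ID:
    State:              Waiting
      Reason:           ImagePullBackOff
    Ready:              False
"""

if __name__ == "__main__":
    for image, reason in waiting_images(SAMPLE):
        print(f"{image} is waiting: {reason}")
```

Applied to the excerpt, this reports that "heketi/heketi:dev" is stuck in ImagePullBackOff, which matches the events listed above.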

Jose, could you please confirm whether this is the current behaviour in the OCP 3.6 Ansible installer? Once you confirm, I'll go ahead and open a separate BZ against OCP 3.6 to track it.

--- Additional comment from Jose A. Rivera on 2017-10-04 11:11:42 EDT ---

Confirmed. Let's see if OCP will allow this fix into 3.6.z. :)

--- Additional comment from Prasanth on 2017-10-04 11:47:10 EDT ---

(In reply to Jose A. Rivera from comment #6)
> Confirmed. Let's see if OCP will allow this fix into 3.6.z. :)

Thanks for confirming, Jose. I'll go ahead and file a BZ against OCP 3.6 soon, and let's try to get the fix into 3.6.z. :)

--- Additional comment from Prasanth on 2017-10-04 11:47:47 EDT ---

Based on Comment 5, moving this BZ to Verified.

Comment 1 Scott Dodson 2017-10-04 20:20:38 UTC
https://github.com/openshift/openshift-ansible/pull/5663 proposed fix

Comment 2 Jose A. Rivera 2017-10-09 17:19:20 UTC
PR is merged.

Comment 3 Wenkai Shi 2017-10-18 11:24:04 UTC
Verified with version openshift-ansible-3.6.173.0.56-1.git.0.eecaf3e.el7: installation succeeds when a specific heketi image is set in the inventory.

# cat hosts
...
openshift_storage_glusterfs_heketi_image=heketi/heketi
openshift_storage_glusterfs_heketi_version=dev
...

# docker ps -a
... docker.io/heketi/heketi@...
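As a rough illustration of how the two inventory variables above relate to the image a pod would pull, the sketch below reads them from an inventory fragment and joins them. Joining image and version with ":" is an assumption about how the installer composes the reference, not something stated in this report, and the function name `heketi_image_ref` is hypothetical.

```python
def heketi_image_ref(inventory_text):
    """Build an image reference from the two heketi inventory variables.

    Returns None if either variable is missing. Joining with ":" is an
    assumption made for illustration only.
    """
    values = {}
    for line in inventory_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without key=value (such as "...").
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    image = values.get("openshift_storage_glusterfs_heketi_image")
    version = values.get("openshift_storage_glusterfs_heketi_version")
    if image is None or version is None:
        return None
    return f"{image}:{version}"


# Inventory fragment copied from the verification comment.
HOSTS = """\
...
openshift_storage_glusterfs_heketi_image=heketi/heketi
openshift_storage_glusterfs_heketi_version=dev
...
"""

if __name__ == "__main__":
    print(heketi_image_ref(HOSTS))
```

A check like this can be run against an inventory before deployment to see which heketi image the settings point at, for example to confirm it is reachable when docker.io is blocked.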

Comment 6 errata-xmlrpc 2017-12-14 21:01:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3438

