Bug 1494270
Summary: Cannot deploy CNS with rhgs-server-rhel7 image > 3.3.0-15 due to missing environment variables GB_GLFS_LRU_COUNT, TCMU_LOGDIR

| Field | Value |
|---|---|
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Component | rhgs-server-container |
| Version | cns-3.6 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | Daniel Messer <dmesser> |
| Assignee | Humble Chirammal <hchiramm> |
| QA Contact | Prasanth <pprakash> |
| Docs Contact | |
| CC | asrivast, hchiramm, jarrpa, madam, pprakash, rcyriac, rhs-bugs, rreddy, rtalur, sankarshan |
| Target Milestone | --- |
| Target Release | CNS 3.6 |
| Fixed In Version | cns-deploy-5.0.0-50 |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Cloned As | 1498595 (view as bug list) |
| Last Closed | 2017-10-11 06:58:29 UTC |
| Bug Blocks | 1445448 |
Description (Daniel Messer, 2017-09-21 21:45:26 UTC)
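The bug title points at two environment variables that rhgs-server-rhel7 images newer than 3.3.0-15 expect but that the older deployment template did not set. As a sketch only, the fix amounts to adding entries like the following to the glusterfs container's `env` list in the CNS pod template; the values shown here are illustrative assumptions, and the fixed cns-deploy package (cns-deploy-5.0.0-50) ships the authoritative defaults:

```yaml
# Sketch: env entries for the glusterfs container in the CNS template.
# Values are assumptions for illustration, not taken from the shipped template.
- name: GB_GLFS_LRU_COUNT            # gluster-block: cap on cached block-hosting volume mounts
  value: "15"
- name: TCMU_LOGDIR                  # tcmu-runner log directory inside the container
  value: "/var/log/glusterfs/gluster-block"
```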
Thanks Daniel, we are looking into this issue.

Test results: deployment works:

```
TASK [openshift_excluder : Enable openshift excluder] **************************
changed: [10.70.47.111]
changed: [10.70.47.22]
changed: [10.70.46.202]
changed: [10.70.47.145]

PLAY RECAP *********************************************************************
10.70.46.202  : ok=252  changed=41   unreachable=0  failed=0
10.70.47.111  : ok=727  changed=183  unreachable=0  failed=0
10.70.47.145  : ok=251  changed=41   unreachable=0  failed=0
10.70.47.22   : ok=252  changed=41   unreachable=0  failed=0
localhost     : ok=14   changed=0    unreachable=0  failed=0
```

```
# oc project
Using project "glusterfs" on server "https://dhcp47-111.lab.eng.blr.redhat.com:8443".

[root@dhcp47-111 ~]# oc get all
NAME                 REVISION   DESIRED   CURRENT   TRIGGERED BY
dc/heketi-storage    1          1         1         config

NAME                  DESIRED   CURRENT   READY   AGE
rc/heketi-storage-1   1         1         1       50m

NAME                    HOST/PORT                                          PATH   SERVICES         PORT    TERMINATION   WILDCARD
routes/heketi-storage   heketi-storage-glusterfs.cloudapps.mystorage.com          heketi-storage   <all>                 None

NAME                              CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
svc/heketi-db-storage-endpoints   172.30.203.70   <none>        1/TCP      50m
svc/heketi-storage                172.30.255.23   <none>        8080/TCP   50m

NAME                         READY   STATUS    RESTARTS   AGE
po/glusterfs-storage-0t6c5   1/1     Running   0          1h
po/glusterfs-storage-sn930   1/1     Running   0          1h
po/glusterfs-storage-vq22s   1/1     Running   0          1h
po/heketi-storage-1-kn81q    1/1     Running   0          50m
```

However, if the setup doesn't have access to the external public registry, there is a high chance of hitting this issue: https://github.com/openshift/openshift-ansible/pull/5562

For example, if BLOCK_REGISTRY='--block-registry docker.io' is configured in docker, the ansible deployment will fail because the heketi-storage-copy-job currently tries to pull the "heketi/heketi:dev" image. See below:

```
# oc get pods
NAME                            READY   STATUS              RESTARTS   AGE
deploy-heketi-storage-1-b92l9   1/1     Running             0          1m
glusterfs-storage-0t6c5         1/1     Running             0          5m
glusterfs-storage-sn930         1/1     Running             0          5m
glusterfs-storage-vq22s         1/1     Running             0          5m
heketi-storage-copy-job-sczx6   0/1     ContainerCreating   0          9s

# oc get pods
NAME                            READY   STATUS         RESTARTS   AGE
deploy-heketi-storage-1-b92l9   1/1     Running        0          3m
glusterfs-storage-0t6c5         1/1     Running        0          7m
glusterfs-storage-sn930         1/1     Running        0          7m
glusterfs-storage-vq22s         1/1     Running        0          7m
heketi-storage-copy-job-sczx6   0/1     ErrImagePull   0          1m

# oc describe pod heketi-storage-copy-job-sczx6
Name:            heketi-storage-copy-job-sczx6
Namespace:       glusterfs
Security Policy: privileged
Node:            dhcp46-202.lab.eng.blr.redhat.com/10.70.46.202
Start Time:      Wed, 04 Oct 2017 19:15:39 +0530
Labels:          controller-uid=4edcee40-a90a-11e7-bd89-005056a53cea
                 job-name=heketi-storage-copy-job
Annotations:     kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"Job","namespace":"glusterfs","name":"heketi-storage-copy-job","uid":"4edcee40-a90a-11e7-bd89-005056a53cea"...
                 openshift.io/scc=privileged
Status:          Pending
IP:              10.131.0.6
Controllers:     Job/heketi-storage-copy-job
Containers:
  heketi:
    Container ID:
    Image:          heketi/heketi:dev
    Image ID:
    Port:
    Command:
      cp
      /db/heketi.db
      /heketi
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /db from heketi-storage-secret (rw)
      /heketi from heketi-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pxl40 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  heketi-storage:
    Type:           Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
    EndpointsName:  heketi-storage-endpoints
    Path:           heketidbstorage
    ReadOnly:       false
  heketi-storage-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  heketi-storage-secret
    Optional:    false
  default-token-pxl40:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pxl40
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  FirstSeen  LastSeen  Count  From                                        SubObjectPath            Type     Reason      Message
  ---------  --------  -----  ----                                        -------------            ----     ------      -------
  46s        46s       1      default-scheduler                                                    Normal   Scheduled   Successfully assigned heketi-storage-copy-job-sczx6 to dhcp46-202.lab.eng.blr.redhat.com
  31s        31s       1      kubelet, dhcp46-202.lab.eng.blr.redhat.com  spec.containers{heketi}  Normal   BackOff     Back-off pulling image "heketi/heketi:dev"
  41s        17s       2      kubelet, dhcp46-202.lab.eng.blr.redhat.com  spec.containers{heketi}  Normal   Pulling     pulling image "heketi/heketi:dev"
  32s        8s        2      kubelet, dhcp46-202.lab.eng.blr.redhat.com  spec.containers{heketi}  Warning  Failed      Failed to pull image "heketi/heketi:dev": rpc error: code = 2 desc = unknown: Not Found
  32s        8s        3      kubelet, dhcp46-202.lab.eng.blr.redhat.com                           Warning  FailedSync  Error syncing pod
```

Jose, could you please confirm if this is the current behaviour/issue in the OCP 3.6 ansible installer?
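Until the openshift-ansible fix lands, a conceivable workaround for disconnected setups is to recreate the copy job with an image the cluster can actually pull, instead of the hardcoded `heketi/heketi:dev`. A sketch of the relevant part of the Job spec follows; the registry host and image name/tag are hypothetical placeholders, not values taken from this bug:

```yaml
# Sketch: heketi-storage-copy-job with the upstream heketi/heketi:dev image
# swapped for one mirrored in a reachable registry. Job pod templates are
# immutable, so the job must be deleted and recreated with this spec.
# registry.example.com and the image name/tag are assumptions for illustration.
apiVersion: batch/v1
kind: Job
metadata:
  name: heketi-storage-copy-job
spec:
  template:
    spec:
      containers:
      - name: heketi
        image: registry.example.com/rhgs3/rhgs-volmanager-rhel7:latest  # was heketi/heketi:dev
        command: ["cp", "/db/heketi.db", "/heketi"]
```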
Once you confirm, I'll go ahead and open a separate BZ against OCP 3.6 to track the same.

Jose A. Rivera:
Confirmed. Let's see if OCP will allow this fix into 3.6.z. :)

(In reply to Jose A. Rivera from comment #6)
> Confirmed. Let's see if OCP will allow this fix into 3.6.z. :)

Thanks for confirming, Jose. I'll go ahead and file a BZ against OCP 3.6, and let's try to get the fix into 3.6.z. :)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2877