Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1623777

Summary: Fail to deploy CNS with both glusterfs and glusterfs_registry group
Product: OpenShift Container Platform Reporter: Wenkai Shi <weshi>
Component: Installer Assignee: Jose A. Rivera <jarrpa>
Status: CLOSED CURRENTRELEASE QA Contact: Johnny Liu <jialiu>
Severity: high Docs Contact:
Priority: high    
Version: 3.11.0 CC: aos-bugs, bleanhar, crmarquesjc, jokerman, madam, mmccomas, pprakash, sarumuga, wmeng, wsun, xxia
Target Milestone: --- Keywords: TestBlocker
Target Release: 3.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Clones: 1627454 (view as bug list) Environment:
Last Closed: 2018-12-21 15:23:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1626751, 1627454    

Description Wenkai Shi 2018-08-30 07:59:49 UTC
Description of problem:
Deploying CNS with both the glusterfs and glusterfs_registry groups fails; it consistently fails at the "Verify heketi service" task.

Version-Release number of the following components:
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7
ansible-2.6.2-1.el7ae.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy CNS with both glusterfs and glusterfs_registry group
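For context, such a deployment is driven by an inventory that defines both host groups. A minimal sketch of the reproducer inventory (hostnames and device paths are illustrative, not taken from this report):

```ini
# Hypothetical openshift-ansible 3.11 inventory fragment defining both
# GlusterFS host groups; hostnames and devices are placeholders.
[OSEv3:children]
masters
nodes
glusterfs
glusterfs_registry

[glusterfs]
app-node1.example.com glusterfs_devices='[ "/dev/sdb" ]'
app-node2.example.com glusterfs_devices='[ "/dev/sdb" ]'
app-node3.example.com glusterfs_devices='[ "/dev/sdb" ]'

[glusterfs_registry]
infra-node1.example.com glusterfs_devices='[ "/dev/sdb" ]'
infra-node2.example.com glusterfs_devices='[ "/dev/sdb" ]'
infra-node3.example.com glusterfs_devices='[ "/dev/sdb" ]'
```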

Actual results:
Installer failed in Verify heketi service task:
...
TASK [openshift_storage_glusterfs : Verify heketi service] *********************
Thursday 30 August 2018  03:09:37 -0400 (0:00:00.115)       0:26:41.155 ******* 

fatal: [qe-weshi-cnsb-master-etcd-1.0830-lhj.qe.rhcloud.com]: FAILED! => {"changed": false, "cmd": ["oc", "--config=/tmp/openshift-glusterfs-ansible-OtyRlI/admin.kubeconfig", "rsh", "--namespace=default", "heketi-storage-1-pjm59", "heketi-cli", "-s", "http://localhost:8080", "--user", "admin", "--secret", "y3rSHd1iifC/3iOAp/ETuQ4WXTNepegv091mESqaA00=", "cluster", "list"], "delta": "0:00:00.243109", "end": "2018-08-30 03:11:33.915053", "msg": "non-zero return code", "rc": 1, "start": "2018-08-30 03:11:33.671944", "stderr": "Error from server (NotFound): pods \"heketi-storage-1-pjm59\" not found", "stderr_lines": ["Error from server (NotFound): pods \"heketi-storage-1-pjm59\" not found"], "stdout": "", "stdout_lines": []}
...

Expected results:
The installer should pass this task.

Additional info:
Logging in to the master, the deployment stays stuck at the deploy-heketi pod:
# oc get po
NAME                             READY     STATUS    RESTARTS   AGE
deploy-heketi-registry-1-wlxs9   1/1       Running   0          16m
glusterfs-registry-pgdt7         1/1       Running   0          17m
glusterfs-registry-skd7g         1/1       Running   0          17m
glusterfs-registry-tbm2p         1/1       Running   0          17m

# oc delete po deploy-heketi-registry-1-wlxs9
pod "deploy-heketi-registry-1-wlxs9" deleted

# oc get po
NAME                             READY     STATUS    RESTARTS   AGE
deploy-heketi-registry-1-667r8   1/1       Running   0          37s
glusterfs-registry-pgdt7         1/1       Running   0          18m
glusterfs-registry-skd7g         1/1       Running   0          18m
glusterfs-registry-tbm2p         1/1       Running   0          18m

# oc describe po deploy-heketi-registry-1-667r8
Name:               deploy-heketi-registry-1-667r8
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               qe-weshi-cnsb-node-registry-router-1/10.240.0.29
Start Time:         Thu, 30 Aug 2018 03:27:25 -0400
Labels:             deploy-heketi=support
                    deployment=deploy-heketi-registry-1
                    deploymentconfig=deploy-heketi-registry
                    glusterfs=deploy-heketi-registry-pod
Annotations:        openshift.io/deployment-config.latest-version=1
                    openshift.io/deployment-config.name=deploy-heketi-registry
                    openshift.io/deployment.name=deploy-heketi-registry-1
                    openshift.io/scc=restricted
Status:             Running
IP:                 10.128.4.3
Controlled By:      ReplicationController/deploy-heketi-registry-1
Containers:
  heketi:
    Container ID:   docker://d737fd472c78612c4c34f68b8e00c813286d93056d78f929ba7cec61fa44471e
    Image:          registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7
    Image ID:       docker-pullable://registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7@sha256:5d93c20bce1d76e508254d589ffd8d0b324a404bbab5a20deff6916dd27a1f39
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 30 Aug 2018 03:27:42 -0400
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8080/hello delay=30s timeout=3s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8080/hello delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      HEKETI_USER_KEY:                 oTfeiMoV1X1U/XwAdqnMj9eSFZxXw3rVnWrF2IQq3TQ=
      HEKETI_ADMIN_KEY:                y3rSHd1iifC/3iOAp/ETuQ4WXTNepegv091mESqaA00=
      HEKETI_EXECUTOR:                 kubernetes
      HEKETI_FSTAB:                    /var/lib/heketi/fstab
      HEKETI_SNAPSHOT_LIMIT:           14
      HEKETI_KUBE_GLUSTER_DAEMONSET:   1
      HEKETI_IGNORE_STALE_OPERATIONS:  true
    Mounts:
      /etc/heketi from config (rw)
      /var/lib/heketi from db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from heketi-registry-service-account-token-87f9l (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  db:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  heketi-registry-config-secret
    Optional:    false
  heketi-registry-service-account-token-87f9l:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  heketi-registry-service-account-token-87f9l
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type    Reason     Age   From                                           Message
  ----    ------     ----  ----                                           -------
  Normal  Scheduled  47s   default-scheduler                              Successfully assigned default/deploy-heketi-registry-1-667r8 to qe-weshi-cnsb-node-registry-router-1
  Normal  Pulling    45s   kubelet, qe-weshi-cnsb-node-registry-router-1  pulling image "registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7"
  Normal  Pulled     30s   kubelet, qe-weshi-cnsb-node-registry-router-1  Successfully pulled image "registry.access.redhat.com/rhgs3/rhgs-volmanager-rhel7"
  Normal  Created    30s   kubelet, qe-weshi-cnsb-node-registry-router-1  Created container
  Normal  Started    30s   kubelet, qe-weshi-cnsb-node-registry-router-1  Started container

# oc logs -f deploy-heketi-registry-1-667r8
stat: cannot stat '/var/lib/heketi/heketi.db': No such file or directory
Heketi 6.0.0
[heketi] ERROR 2018/08/30 07:27:42 /src/github.com/heketi/heketi/apps/glusterfs/app.go:100: invalid log level: 
[heketi] INFO 2018/08/30 07:27:42 Loaded kubernetes executor
[heketi] INFO 2018/08/30 07:27:42 Block: Auto Create Block Hosting Volume set to true
[heketi] INFO 2018/08/30 07:27:42 Block: New Block Hosting Volume size 100 GB
[heketi] INFO 2018/08/30 07:27:42 GlusterFS Application Loaded
[heketi] INFO 2018/08/30 07:27:42 Started Node Health Cache Monitor
Authorization loaded
Listening on port 8080
[heketi] INFO 2018/08/30 07:27:52 Starting Node Health Status refresh
[heketi] INFO 2018/08/30 07:27:52 Cleaned 0 nodes from health cache
[heketi] INFO 2018/08/30 07:29:42 Starting Node Health Status refresh
[heketi] INFO 2018/08/30 07:29:42 Cleaned 0 nodes from health cache
[heketi] INFO 2018/08/30 07:31:42 Starting Node Health Status refresh
[heketi] INFO 2018/08/30 07:31:42 Cleaned 0 nodes from health cache
[heketi] INFO 2018/08/30 07:33:42 Starting Node Health Status refresh
[heketi] INFO 2018/08/30 07:33:42 Cleaned 0 nodes from health cache
[heketi] INFO 2018/08/30 07:35:42 Starting Node Health Status refresh
[heketi] INFO 2018/08/30 07:35:42 Cleaned 0 nodes from health cache
^C

Comment 4 Jose A. Rivera 2018-08-31 14:55:36 UTC
Where is the name "heketi-storage-1-pjm59" coming from if it's not in the cluster?

Comment 5 Wenkai Shi 2018-09-03 07:02:04 UTC
(In reply to Jose A. Rivera from comment #4)
> Where is the name "heketi-storage-1-pjm59" coming from if its not in the
> cluster?

I have no idea. I've reproduced this in another deployment; the pod is still not in the cluster, but the name still appears.

Comment 6 Jose A. Rivera 2018-09-04 21:08:33 UTC
There should be two heketi pods, one for glusterfs and one for glusterfs_registry. What is the output for "oc get po" on their respective namespaces?

Comment 7 crmarques 2018-09-04 22:13:32 UTC
I'm facing the same problem. 

In my case, Ansible creates two namespaces: app-storage (for the CNS storage cluster) and infra-storage (for CNS storage for the OpenShift infrastructure).

The first CNS deployment seems to complete OK, but for the second one I receive the same error as above.

One thing I noticed is that the pod the heketi-cli command is trying to execute in belongs to the app-storage namespace, although the command explicitly specifies infra-storage. I think the previous steps picked up the wrong pod.

In other words, using the example from this report: heketi-storage-1-pjm59 would be in the "app-storage" namespace, but the steps for the "infra-storage" CNS retrieve this pod instead of the equivalent one from the correct namespace.

Another thing I noticed is that my "deploy-heketi-registry-xxxx" pod shows the same error as above and does not create the "heketi-storage-1-xxxx" pod (which would be selected by the previous heketi-cli command, if the selector were not wrong, but is never created).

oc logs -f deploy-heketi-registry-xxxx
stat: cannot stat '/var/lib/heketi/heketi.db': No such file or directory
Heketi 6.0.0
[heketi] ERROR 2018/09/04 xx:yy:zz /src/github.com/heketi/heketi/apps/glusterfs/app.go:100: invalid log level:
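The fix implied by the mixup above is to resolve and address the heketi pod inside one namespace end to end. A hedged sketch (the namespace, label selector, and pod name below are illustrative, not from the installer; the commented-out oc lookup assumes a live cluster):

```shell
# Sketch: keep every oc call scoped to the same namespace so a pod from
# app-storage can never be picked up while checking infra-storage.
ns=infra-storage
# On a live cluster the pod name would be resolved within that namespace, e.g.:
#   pod=$(oc get pod -n "$ns" -l glusterfs=heketi-registry-pod -o name)
pod=heketi-registry-1-xxxxx   # placeholder pod name
# Print the fully namespace-scoped rsh invocation:
echo oc rsh --namespace="$ns" "$pod" heketi-cli -s http://localhost:8080 cluster list
```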

Comment 8 Wenkai Shi 2018-09-05 02:50:00 UTC
(In reply to Jose A. Rivera from comment #6)
> There should be two heketi pods, one for glusterfs and one for
> glusterfs_registry. What is the output for "oc get po" on their respective
> namespaces?

It's the default namespace, for the glusterfs_registry group.

Comment 9 Jose A. Rivera 2018-09-05 03:33:21 UTC
That does not answer my question. :)

Comment 10 Wenkai Shi 2018-09-05 06:42:30 UTC
(In reply to Jose A. Rivera from comment #9)
> That does not answer my question. :)

Sorry...

# oc get po -n default
NAME                             READY     STATUS    RESTARTS   AGE
deploy-heketi-registry-1-ltxnl   1/1       Running   0          12m
glusterfs-registry-4zjl5         1/1       Running   0          13m
glusterfs-registry-7tztw         1/1       Running   0          13m
glusterfs-registry-c8t54         1/1       Running   0          13m
# oc get po -n glusterfs 
NAME                                          READY     STATUS    RESTARTS   AGE
glusterblock-storage-provisioner-dc-1-9xv7n   1/1       Running   0          14m
glusterfs-storage-8ddtb                       1/1       Running   0          18m
glusterfs-storage-p5r6f                       1/1       Running   0          18m
glusterfs-storage-zz2bw                       1/1       Running   0          18m
heketi-storage-1-9s2qp                        1/1       Running   0          15m

Comment 13 Jose A. Rivera 2018-09-10 00:33:39 UTC
PR submitted for master: https://github.com/openshift/openshift-ansible/pull/9971

Comment 14 Jose A. Rivera 2018-09-10 15:39:59 UTC
PR merged.

Comment 15 Wei Sun 2018-09-13 05:10:35 UTC
The 3.11 PR 9980 has been merged into openshift-ansible-3.11.2-1; please check the bug.

Comment 16 Wenkai Shi 2018-09-13 06:56:18 UTC
Verified with version openshift-ansible-3.11.3-1.git.0.42aeb49.el7_5.noarch; the installation succeeded.

Comment 17 Wenkai Shi 2018-09-14 03:36:32 UTC
Move to VERIFIED per comment #16.

Comment 19 Luke Meyer 2018-12-21 15:23:39 UTC
Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.