Bug 1434055 - cns-deploy tool failed to setup: failed to communicate with heketi service
Summary: cns-deploy tool failed to setup: failed to communicate with heketi service
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: cns-deploy-tool
Version: cns-3.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: CNS 3.5
Assignee: Mohamed Ashiq
QA Contact: Apeksha
URL:
Whiteboard:
Depends On:
Blocks: 1415600
Reported: 2017-03-20 15:59 UTC by Apeksha
Modified: 2018-12-06 19:28 UTC (History)
8 users

Fixed In Version: cns-deploy-4.0.0-6.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-20 18:28:07 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1112 normal SHIPPED_LIVE cns-deploy-tool bug fix and enhancement update 2017-04-20 22:25:47 UTC

Description Apeksha 2017-03-20 15:59:14 UTC
Description of problem:
cns-deploy tool failed to setup: failed to communicate with heketi service

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-2.el7rhgs.x86_64
cns-deploy-4.0.0-4.el7rhgs.x86_64
glusterfs-3.7.9-12.el7.x86_64
openshift-ansible-3.6.3-1.git.0.622449e.el7.noarch
atomic-openshift-3.5.0.55-1.git.0.a552679.el7.x86_64
docker-1.12.6-14.el7.x86_64

Steps to Reproduce:
1. Setup openshift
2. Setup router 
3. Run the cns_deploy command

Ashiq,
As per the discussion and suggestion in https://bugzilla.redhat.com/show_bug.cgi?id=1427846#c19, I created a new setup, added an "exit 1" at the point where the cns_deploy script fails, and ran it.

Output of the cns_deploy command and other oc commands:
http://pastebin.test.redhat.com/466318

Comment 2 Apeksha 2017-03-20 16:09:20 UTC
Additional info :

[root@rhsauto045 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-q9d38   1/1       Running   0          1h
glusterfs-5k3vj       1/1       Running   0          37m
glusterfs-g42qk       1/1       Running   0          37m
glusterfs-h5smd       1/1       Running   0          37m
heketi-1-deploy       0/1       Error     0          32m
[root@rhsauto045 ~]# oc get dc
NAME          REVISION   DESIRED   CURRENT   TRIGGERED BY
aplo-router   1          1         1         config
heketi        1          1         0         config
[root@rhsauto045 ~]# oc describe dc heketi
Name:		heketi
Namespace:	aplo
Created:	33 minutes ago
Labels:		glusterfs=heketi-dc
		template=heketi
Description:	Defines how to deploy Heketi
Annotations:	<none>
Latest Version:	1
Selector:	glusterfs=heketi-pod
Replicas:	1
Triggers:	Config
Strategy:	Recreate
Template:
  Labels:		glusterfs=heketi-pod
  Service Account:	heketi-service-account
  Containers:
   heketi:
    Image:	rhgs3/rhgs-volmanager-rhel7:3.2.0-2
    Port:	8080/TCP
    Liveness:	http-get http://:8080/hello delay=30s timeout=3s period=10s #success=1 #failure=3
    Readiness:	http-get http://:8080/hello delay=3s timeout=3s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/lib/heketi from db (rw)
    Environment Variables:
      HEKETI_USER_KEY:			
      HEKETI_ADMIN_KEY:			
      HEKETI_EXECUTOR:			kubernetes
      HEKETI_FSTAB:			/var/lib/heketi/fstab
      HEKETI_SNAPSHOT_LIMIT:		14
      HEKETI_KUBE_GLUSTER_DAEMONSET:	y
  Volumes:
   db:
    Type:		Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime)
    EndpointsName:	heketi-storage-endpoints
    Path:		heketidbstorage
    ReadOnly:		false

Deployment #1 (latest):
	Name:		heketi-1
	Created:	33 minutes ago
	Status:		Failed
	Replicas:	0 current / 0 desired
	Selector:	deployment=heketi-1,deploymentconfig=heketi,glusterfs=heketi-pod
	Labels:		glusterfs=heketi-dc,openshift.io/deployment-config.name=heketi,template=heketi
	Pods Status:	0 Running / 0 Waiting / 0 Succeeded / 0 Failed

Events:
  FirstSeen	LastSeen	Count	From				SubObjectPath	Type		Reason				Message
  ---------	--------	-----	----				-------------	--------	------				-------
  33m		33m		1	{deploymentconfig-controller }			Normal		DeploymentCreated		Created new replication controller "heketi-1" for version 1
  22m		22m		1	{deploymentconfig-controller }			Normal		ReplicationControllerScaled	Scaled replication controller "heketi-1" from 1 to 0
  22m		22m		1	{heketi-1-deploy }				Warning		FailedCreate			Error creating: pods "heketi-1-" is forbidden: unable to validate against any security context constraint: [spec.containers[0].securityContext.volumes[0]: Invalid value: "glusterfs": glusterfs volumes are not allowed to be used]

Comment 3 Mohamed Ashiq 2017-03-21 05:49:45 UTC
I am facing the same issue in my setup.

  <invalid>	<invalid>	1	{heketi-1-deploy }				Warning		FailedCreate			Error creating: pods "heketi-1-" is forbidden: unable to validate against any security context constraint: [spec.containers[0].securityContext.volumes[0]: Invalid value: "glusterfs": glusterfs volumes are not allowed to be used]

I am trying to isolate the issue.

I mounted the heketidbstorage volume on all the nodes and bind-mounted the db volume mountpoint into the heketi container's /var/lib/heketi, but hit the same class of failure:

  <invalid>	<invalid>	1	{heketi-4-deploy }				Warning		FailedCreate			Error creating: pods "heketi-4-" is forbidden: unable to validate against any security context constraint: [spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]


What I did:

# mount -t glusterfs <ip>:/heketidbstorage /mnt

on all the nodes, then edited the heketi template to swap the volume definition:

-          glusterfs:
-            endpoints: heketi-storage-endpoints
-            path: heketidbstorage

+          hostPath:
+            path: "/mnt"


This produced the hostPath error above, which shows there is nothing wrong with the gluster volume plugin in kube.

Still debugging.

Comment 4 Mohamed Ashiq 2017-03-21 06:27:15 UTC
# oadm policy add-scc-to-user privileged -z heketi-service-account

The above worked for me. Please run it and try again.

# oc delete dc,routes,svc,ep,rc heketi

# oc process heketi | oc create -f -

For completeness, here is what else I tried:

# oadm policy add-scc-to-user hostmount-anyuid system:serviceaccount:aplo:heketi-service-account

This resolved the hostPath issue. It is a permissions problem for the heketi-service-account user; granting the account the permission allows the pod to run.

Got the fix from [1]

[1] https://lists.openshift.redhat.com/openshift-archives/users/2016-May/msg00069.html
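Before retrying the deploy, it can help to confirm the service account actually landed in the privileged SCC. A minimal sketch, assuming the "aplo" namespace and service-account name used in this bug (the helper function is illustrative, not part of cns-deploy):

```shell
# Sketch: check whether a fully qualified service account appears in an
# SCC's user list, e.g. the output of:
#   oc get scc privileged -o jsonpath='{.users}'
sa_in_scc() {
    # $1 = SCC user list, $2 = service account name to look for
    case "$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage (requires a live cluster; names taken from this bug report):
#   users=$(oc get scc privileged -o jsonpath='{.users}')
#   sa_in_scc "$users" "system:serviceaccount:aplo:heketi-service-account" \
#       && echo "already added" \
#       || oadm policy add-scc-to-user privileged -z heketi-service-account
```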

Comment 6 Mohamed Ashiq 2017-03-21 06:31:22 UTC
@Jose 

We required the privileged SCC upstream; is that now the case downstream too?

Can you confirm this?

Comment 7 Apeksha 2017-03-21 06:44:36 UTC
(In reply to Mohamed Ashiq from comment #5)
> Can you try the above?

I tried the workaround, and I am able to set up heketi now.

Steps:
1. oc delete dc,routes,svc,ep,rc heketi
2. oadm policy add-scc-to-user privileged -z heketi-service-account
3. oc process heketi | oc create -f -

[root@rhsauto045 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-q9d38   1/1       Running   0          15h
glusterfs-5k3vj       1/1       Running   0          15h
glusterfs-g42qk       1/1       Running   0          15h
glusterfs-h5smd       1/1       Running   0          15h
heketi-1-cwr4g        1/1       Running   0          1m

Comment 8 Mohamed Ashiq 2017-03-21 06:47:33 UTC
(In reply to Apeksha from comment #7)
> (In reply to Mohamed Ashiq from comment #5)
> > Can you try the above?
> 
> I tried the workaround, i am able to setup heketi now.

I think this will be the proper way to deploy in the future (not a workaround).


Restoring the needinfo flag; Jose, can you confirm?

Comment 12 Mohamed Ashiq 2017-03-21 09:10:09 UTC
I have sent a patch upstream for the same.

https://github.com/gluster/gluster-kubernetes/pull/204

Comment 14 Apeksha 2017-03-22 02:38:26 UTC
Got the same error message while running cns_deploy on this build - cns-deploy-4.0.0-6.el7rhgs.x86_64, heketi-client-4.0.0-3.el7rhgs.x86_64

Output of cns_deploy and other oc commands: http://pastebin.test.redhat.com/466976

Comment 15 Mohamed Ashiq 2017-03-22 04:33:09 UTC
(In reply to Apeksha from comment #14)
> Got the same error message while running cns_deploy on this build -
> cns-deploy-4.0.0-6.el7rhgs.x86_64, heketi-client-4.0.0-3.el7rhgs.x86_64
> 
> Output of cns_Deploy and other oc commands:
> http://pastebin.test.redhat.com/466976

Can you tell me what exactly the error is? In the pastebin everything in the setup looks fine, apart from:

"Failed to communicate with heketi service.\nPlease verify that a router has been properly configured."


The heketi pod seems to be running perfectly fine.

Can you curl the /hello endpoint and share the output?
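heketi exposes a /hello health endpoint, which is what cns-deploy probes when it reports "failed to communicate with heketi service". A minimal sketch of that check, assuming the route hostname shown later in this bug (substitute the value from `oc get route heketi` on your cluster):

```shell
# Sketch: build the heketi health-check URL from the route host.
hello_url() {
    # $1 = heketi route host, without a scheme
    echo "http://$1/hello"
}

# Usage (requires the route hostname to resolve):
#   curl -s "$(hello_url heketi-aplo.cloudapps.myaplo.com)"
# A healthy service returns a short greeting; a connection error or a
# router error page suggests the route/service wiring is broken.
```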

Comment 16 Mohamed Ashiq 2017-03-22 05:13:32 UTC
Thanks for giving access to the machine.

# oc describe route heketi
Name:			heketi
Namespace:		aplo
Created:		9 hours ago
Labels:			glusterfs=heketi-route
			template=heketi
Annotations:		openshift.io/host.generated=true
Requested Host:		heketi-aplo.cloudapps.myaplo.com
			  exposed on router aplo-router 9 hours ago
Path:			<none>
TLS Termination:	<none>
Insecure Policy:	<none>
Endpoint Port:		<all endpoint ports>

Service:	heketi
Weight:		100 (100%)
Endpoints:	10.129.0.4:8080


# oc get svc 
NAME                       CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
aplo-router                172.30.43.194    <none>        80/TCP,443/TCP,1936/TCP   9h
heketi                     172.30.66.222    <none>        8080/TCP                  9h
heketi-storage-endpoints   172.30.106.187   <none>        1/TCP                     9h


The route points to the wrong endpoint.


This is now the actual issue; the old one is fixed.

Comment 17 Mohamed Ashiq 2017-03-22 05:16:52 UTC
(In reply to Mohamed Ashiq from comment #16)
> Thanks for giving access to the machine.
> 
> # oc describe route heketi
> Name:			heketi
> Namespace:		aplo
> Created:		9 hours ago
> Labels:			glusterfs=heketi-route
> 			template=heketi
> Annotations:		openshift.io/host.generated=true
> Requested Host:		heketi-aplo.cloudapps.myaplo.com
> 			  exposed on router aplo-router 9 hours ago
> Path:			<none>
> TLS Termination:	<none>
> Insecure Policy:	<none>
> Endpoint Port:		<all endpoint ports>
> 
> Service:	heketi
> Weight:		100 (100%)
> Endpoints:	10.129.0.4:8080
> 
> 
> # oc get svc 
> NAME                       CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
> aplo-router                172.30.43.194    <none>        80/TCP,443/TCP,1936/TCP   9h
> heketi                     172.30.66.222    <none>        8080/TCP                  9h
> heketi-storage-endpoints   172.30.106.187   <none>        1/TCP                     9h
> 
> 
> route points to wrong endpoint.

It should point to the service IP:port, but instead it is pointing directly to the pod.

I did:

# oc delete svc,route,dc heketi
# oc process heketi| oc create -f -

Now it seems to be working.

Can you try this again on a new setup, so we can confirm whether this issue is actually being hit?

> 
> 
> This is the issue rather as old one is fixed.
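After recreating the objects, one way to sanity-check the wiring is to confirm the route's backend is the heketi service rather than a pod. A sketch, assuming the OpenShift 3.x route schema (`spec.to.kind`/`spec.to.name`); the helper function is illustrative:

```shell
# Sketch: verify a route's backend, using values such as those printed by:
#   oc get route heketi -o jsonpath='{.spec.to.kind} {.spec.to.name}'
route_ok() {
    # $1 = backend kind, $2 = backend name
    [ "$1" = "Service" ] && [ "$2" = "heketi" ]
}

# Usage (requires a live cluster):
#   set -- $(oc get route heketi -o jsonpath='{.spec.to.kind} {.spec.to.name}')
#   route_ok "$1" "$2" && echo "route targets the heketi service"
```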

Comment 18 Mohamed Ashiq 2017-03-22 05:27:43 UTC
(In reply to Mohamed Ashiq from comment #17)

Please retry this. I have not faced it and am not able to reproduce it in my setup.

I verified the IP update on the route: every time the pod is restarted or newly spawned, the IP is updated. This works for me, so there is nothing to fix in this issue. I think the failure is spurious; if it turns out to be consistent, it would be a bug in the OpenShift route module not updating at the right time.

Comment 19 Mohamed Ashiq 2017-03-22 05:29:11 UTC
Moving this back to ON_QA, as the issue filed is no longer seen and heketi is running perfectly fine.

Comment 20 Apeksha 2017-03-28 07:06:08 UTC
Heketi is up and running after running cns_deploy on build:
cns-deploy-4.0.0-9.el7rhgs.x86_64
heketi-client-4.0.0-4.el7rhgs.x86_64
Hence marking this bug as verified.

Comment 22 errata-xmlrpc 2017-04-20 18:28:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1112

Comment 23 vinutha 2018-12-06 19:28:33 UTC
Marking qe-test-coverage as "-" since the preferred mode of deployment is now via ansible.

