Bug 1427846

Summary: cns-deploy tool failed to setup: failed to communicate with heketi service
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Apeksha <akhakhar>
Component: cns-deploy-tool Assignee: Mohamed Ashiq <mliyazud>
Status: CLOSED ERRATA QA Contact: Apeksha <akhakhar>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: cns-3.5 CC: akhakhar, hchiramm, jarrpa, madam, mliyazud, pprakash, vinug
Target Milestone: ---   
Target Release: CNS 3.5   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: rhgs-volmanager-rhel7:3.2.0-2 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-20 18:26:53 UTC Type: Bug
Bug Blocks: 1415600    

Description Apeksha 2017-03-01 11:46:28 UTC
Description of problem:
The cns-deploy tool failed to set up: failed to communicate with heketi service

Version-Release number of selected component (if applicable):
atomic-openshift-3.5.0.35-1.git.0.b806d03.el7.x86_64
openshift-ansible-3.5.17-1.git.0.561702e.el7.noarch
cns-deploy-4.0.0-2.el7rhgs.x86_64
heketi-client-4.0.0-1.el7rhgs.x86_64
docker-1.12.6-11.el7.x86_64
container-selinux-2.9-4.el7.noarch
selinux-policy-3.13.1-102.el7_3.13.noarch

Steps:
1. Install OpenShift 3.5
2. Set up a router
[root@dhcp47-79 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-xvvkl   1/1       Running   1          1h

3. cns-deploy -n aplo -g topology.json -c oc -y

Using OpenShift CLI.
NAME      STATUS    AGE
aplo      Active    3h
Using namespace "aplo".
serviceaccount "heketi-service-account" created
template "heketi" created
template "glusterfs" created
role "edit" added: "system:serviceaccount:aplo:heketi-service-account"
node "dhcp47-122.lab.eng.blr.redhat.com" labeled
node "dhcp47-94.lab.eng.blr.redhat.com" labeled
node "dhcp47-87.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" created
Waiting for GlusterFS pods to start ... OK
secret "heketi-db-backup" created
service "heketi" created
route "heketi" created
deploymentconfig "heketi" created
Waiting for heketi pod to start ... OK
Failed to communicate with heketi service.
Please verify that a router has been properly configured.
deploymentconfig "heketi" deleted
service "heketi" deleted
route "heketi" deleted
serviceaccount "heketi-service-account" deleted
secret "heketi-db-backup" deleted
template "heketi" deleted
node "dhcp47-122.lab.eng.blr.redhat.com" labeled
node "dhcp47-94.lab.eng.blr.redhat.com" labeled
node "dhcp47-87.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" deleted
template "glusterfs" deleted


While this cns-deploy command was running, I captured the output of oc get pods and oc describe for the heketi pod, which shows CrashLoopBackOff - http://pastebin.test.redhat.com/460240

Comment 2 Humble Chirammal 2017-03-01 11:52:28 UTC
Can you run the command below and share the result?
 # oadm policy add-scc-to-user anyuid -n aplo -z heketi-service-account

Comment 3 Apeksha 2017-03-01 12:06:26 UTC
The command didn't give any output:

[root@dhcp47-79 ~]# oadm policy add-scc-to-user anyuid -n aplo -z heketi-service-account
[root@dhcp47-79 ~]#

Comment 4 Humble Chirammal 2017-03-01 15:11:25 UTC
(In reply to Apeksha from comment #3)
> The command dint give any output:
> 
> [root@dhcp47-79 ~]# oadm policy add-scc-to-user anyuid -n aplo -z
> heketi-service-account
> [root@dhcp47-79 ~]#

I didn't expect any output from the above command :). Can you please rerun your tests and report the result now?

Comment 5 Jose A. Rivera 2017-03-01 15:53:30 UTC
I found the following in the pastebin contents:

[root@dhcp46-216 ~]# oc logs heketi-1-xzf4w -c heketi
/bin/sh: /usr/sbin/heketi-start.sh: Permission denied

Ashiq, do we have a permissions issue in the container?

Comment 6 Humble Chirammal 2017-03-01 16:28:36 UTC
(In reply to Jose A. Rivera from comment #5)
> I found the following in the pastebin contents:
> 
> [root@dhcp46-216 ~]# oc logs heketi-1-xzf4w -c heketi
> /bin/sh: /usr/sbin/heketi-start.sh: Permission denied
> 
> Ashiq, do we have a permissions issue in the container?

The startup script has its permissions set to 500. However, he is able to run a plain docker container from this image without issues. If I am correct, he is checking it in an OCP environment.

Comment 7 Mohamed Ashiq 2017-03-01 17:23:11 UTC
(In reply to Humble Chirammal from comment #6)
> (In reply to Jose A. Rivera from comment #5)
> > I found the following in the pastebin contents:
> > 
> > [root@dhcp46-216 ~]# oc logs heketi-1-xzf4w -c heketi
> > /bin/sh: /usr/sbin/heketi-start.sh: Permission denied
> > 
> > Ashiq, do we have a permissions issue in the container?
> 
> There is a permission setting of 500 for the startup script. However he is
> able to run plain docker container without issues using this image. iic, he
> is checking it in OCP env.

@Jose yeah, as Humble mentioned, I set the start script's permissions to 500. It worked for me with docker run, but I am also facing the same permission denied error in my OCP setup. If you can figure out what differs between running in plain docker and running in an OCP pod, we can work out what is going wrong. JFYI, we also set 500 permissions on the scripts in the gluster images, and we never faced a problem there.

Comment 8 Mohamed Ashiq 2017-03-01 17:23:49 UTC
From #comment4

Comment 9 Apeksha 2017-03-02 03:52:57 UTC
I reran the cns-deploy tool after running this command - oadm policy add-scc-to-user anyuid -n aplo -z heketi-service-account - and it worked fine.

cns-deploy -n aplo -g topology.json -c oc -y

Cluster Id: 7918d44cd584a71f7ff52a65a1dbd7dd

    Volumes:

    Nodes:

        Node Id: c14ddbc8043959bd1160ac2ad6850e02
        State: online
        Cluster Id: 7918d44cd584a71f7ff52a65a1dbd7dd
        Zone: 1
        Management Hostname: dhcp47-122.lab.eng.blr.redhat.com
        Storage Hostname: 10.70.47.122
        Devices:
                Id:8034e12f438b14f8dd23f3b671190a28   Name:/dev/sdd            State:online    Size (GiB):199     Used (GiB):0       Free (GiB):199
                        Bricks:

        Node Id: cc8e53fd95aa9e0994aaa1428254fc8a
        State: online
        Cluster Id: 7918d44cd584a71f7ff52a65a1dbd7dd
        Zone: 1
        Management Hostname: dhcp47-94.lab.eng.blr.redhat.com
        Storage Hostname: 10.70.47.94
        Devices:
                Id:122a68fd55a8feff87193254a63d8784   Name:/dev/sdd            State:online    Size (GiB):199     Used (GiB):0       Free (GiB):199
                        Bricks:

        Node Id: da3182148f764d2a2c98ac265c904daa
        State: online
        Cluster Id: 7918d44cd584a71f7ff52a65a1dbd7dd
        Zone: 1
        Management Hostname: dhcp47-87.lab.eng.blr.redhat.com
        Storage Hostname: 10.70.47.87
        Devices:
                Id:684cb1ab8c3bf1c57584f31106b378ef   Name:/dev/sdd            State:online    Size (GiB):199     Used (GiB):0       Free (GiB):199
                        Bricks:

 oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-xvvkl   1/1       Running   1          17h
glusterfs-2bsn7       1/1       Running   0          2m
glusterfs-3q880       1/1       Running   0          2m
glusterfs-fl1lk       1/1       Running   0          2m
heketi-1-vhl8z        1/1       Running   0          1m

Comment 14 Jose A. Rivera 2017-03-15 13:07:54 UTC
Ashiq,

We probably never had issues with the GlusterFS pods because they run as privileged, whereas the heketi pod does not. Given that anyuid worked, my guess is that the script has a bad owner or group. Is there anything we can do about this?
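
To illustrate the reasoning in this comment, here is a minimal sketch of the standard Unix execute-permission check applied to a script with mode 500. It is not taken from the image or from cns-deploy: the helper name can_exec is hypothetical, the script is assumed to be owned by root (UID 0, GID 0), and 1000060000 stands in for whatever arbitrary non-root UID OpenShift assigns to a pod that is not allowed to use anyuid.

```shell
# Hypothetical helper (not part of cns-deploy or the heketi image):
# prints "yes" or "no" depending on whether a process with the given
# UID/GID may execute a file with the given octal mode, owner and group.
# Usage: can_exec MODE OWNER_UID GROUP_GID PROC_UID PROC_GID
can_exec() {
    mode=$1 owner=$2 group=$3 uid=$4 gid=$5
    # root may execute the file as long as any execute bit is set
    if [ "$uid" -eq 0 ]; then
        [ $(( 0$mode & 0111 )) -ne 0 ] && echo yes || echo no
        return
    fi
    # pick the permission triplet that applies to this process
    if [ "$uid" -eq "$owner" ]; then
        bits=$(( (0$mode >> 6) & 7 ))
    elif [ "$gid" -eq "$group" ]; then
        bits=$(( (0$mode >> 3) & 7 ))
    else
        bits=$(( 0$mode & 7 ))
    fi
    # the lowest bit of the triplet is the execute bit
    [ $(( bits & 1 )) -ne 0 ] && echo yes || echo no
}

# Assumed scenario: script owned by root:root with mode 500
can_exec 500 0 0 0 0            # process runs as root (e.g. with anyuid)
can_exec 500 0 0 1000060000 0   # arbitrary non-root UID, root group
can_exec 550 0 0 1000060000 0   # same process, group execute bit added
```

Under these assumptions the first call prints "yes", the second prints "no", and the third prints "yes", which would be consistent with the script working under plain docker run and under anyuid but failing with "Permission denied" in a regular OCP pod. Whether the image fix changed the mode or the ownership is not stated in this bug.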

Comment 16 Michael Adam 2017-03-16 14:17:18 UTC
Fix will be in the next build of the heketi docker image.

Comment 17 Humble Chirammal 2017-03-17 07:56:26 UTC
This container build should have the fix: rhgs3/rhgs-volmanager-rhel7:3.2.0-2. I am moving the BZ to "ON_QA".

Comment 18 Apeksha 2017-03-20 09:37:24 UTC
I hit this issue on build - heketi-client-4.0.0-2.el7rhgs.x86_64 and cns-deploy-4.0.0-4.el7rhgs.x86_64

Output of the cns-deploy command and other oc commands - http://pastebin.test.redhat.com/466117

Comment 19 Mohamed Ashiq 2017-03-20 11:00:25 UTC
(In reply to Apeksha from comment #18)
> I hit this issue on build - heketi-client-4.0.0-2.el7rhgs.x86_64 and
> cns-deploy-4.0.0-4.el7rhgs.x86_64
> 
> Output of cns_deploy command and other oc commands -
> http://pastebin.test.redhat.com/466117

I have gone through the pastebin, and it looks like we might be facing a new issue here. If this is not a permission denied error from the volmanager-docker container, then it is something new, and the original permission denied error is fixed.

If you are not seeing a permission denied error (https://bugzilla.redhat.com/show_bug.cgi?id=1427846#c5), can you please mark this one as verified and create a new BZ? The RCA for this issue is completely different, and the old issue is fixed.

Can you also quit (or edit the script to quit) when a failure happens? If you are really sure about this issue, you can add `exit 1` in the script after heketi is started, so that the state is preserved (the abort is skipped), which is helpful for debugging.

Please let me know if I am wrong.

Moving it back to ON_QA to verify the fix for the permission denied error.

Comment 20 Apeksha 2017-03-20 15:59:38 UTC
Ashiq,

As suggested in #c19 and in our discussion, I created a new setup, put an exit 1 where it fails in the cns-deploy script, and ran it.

Output of the cns-deploy command and other oc commands: http://pastebin.test.redhat.com/466318

Since I don't see any permission issue, I have created a new bug - https://bugzilla.redhat.com/show_bug.cgi?id=1434055

Comment 21 Apeksha 2017-03-28 07:03:25 UTC
I don't see the permission issue with builds cns-deploy-4.0.0-9.el7rhgs.x86_64 and rhgs3/rhgs-volmanager-rhel7:3.2.0-4, hence marking this as verified.

Comment 22 errata-xmlrpc 2017-04-20 18:26:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1112