Description of problem:
The cns-deploy tool continues with the setup of heketi even when the gluster pods are not ready. The gluster pods stayed in a not-ready state even after the heketi pod was deployed later. This was not the behavior with the previous release: when gluster pods were not ready, cns-deploy failed after timing out and cleaned up the pods and the daemonset.

Snippet of cns-deploy:
======================
Do you wish to proceed with deployment? [Y]es, [N]o? [Default: Y]: y
Using OpenShift CLI.
NAME              STATUS    AGE
storage-project   Active    8m
Using namespace "storage-project".
serviceaccount "heketi-service-account" created
template "heketi" created
template "glusterfs" created
role "edit" added: "system:serviceaccount:storage-project:heketi-service-account"
node "dhcp46-41.lab.eng.blr.redhat.com" labeled
node "dhcp46-70.lab.eng.blr.redhat.com" labeled
node "dhcp46-98.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" created
Waiting for GlusterFS pods to start ... OK
secret "heketi-db-backup" created
service "heketi" created
route "heketi" created
deploymentconfig "heketi" created
Waiting for heketi pod to start ... OK
Failed to communicate with heketi service.
Please verify that a router has been properly configured.
deploymentconfig "heketi" deleted
service "heketi" deleted
route "heketi" deleted
serviceaccount "heketi-service-account" deleted
secret "heketi-db-backup" deleted
template "heketi" deleted
node "dhcp46-41.lab.eng.blr.redhat.com" labeled
node "dhcp46-70.lab.eng.blr.redhat.com" labeled
node "dhcp46-98.lab.eng.blr.redhat.com" labeled

[root@dhcp46-201 ~]# oc get pods
NAME                             READY     STATUS    RESTARTS   AGE
glusterfs-1mdg7                  0/1       Running   0          5m
glusterfs-pn260                  0/1       Running   0          5m
glusterfs-q8rbw                  0/1       Running   0          5m
storage-project-router-1-4znj8   1/1       Running   0          12m

Version-Release number of selected component (if applicable):
rpm -qa | grep 'heketi'
heketi-client-4.0.0-1.el7rhgs.x86_64
[root@dhcp46-201 ~]# rpm -qa | grep 'cns-deploy'
cns-deploy-4.0.0-2.el7rhgs.x86_64

How reproducible:
I haven't tried to reproduce the issue yet.

Steps to Reproduce:
NA

Actual results:
cns-deploy continued with setting up heketi.

Expected results:
cns-deploy should fail if the pods are not ready.

Additional info:
I don't have any logs, but have captured a few CLI outputs:
http://pastebin.test.redhat.com/460795
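The expected fail-fast behavior can be sketched roughly as follows. This is a hypothetical illustration, not the actual cns-deploy code: a polling loop that counts pods whose READY column shows all containers ready, and gives up after a timeout so the caller can abort and clean up. The `wait_for_pods` name and its arguments are assumptions for the sketch; the command argument stands in for something like `oc get pods --no-headers -l glusterfs-node=pod`.

```shell
#!/bin/bash
# Hypothetical sketch of a fail-fast pod readiness wait (not the real cns-deploy code).
# wait_for_pods CMD EXPECTED TIMEOUT
#   CMD      - command printing 'NAME READY STATUS RESTARTS AGE' lines, one per pod
#   EXPECTED - number of pods that must report READY as n/n
#   TIMEOUT  - seconds to wait before giving up
wait_for_pods() {
  local cmd="$1" expected="$2" timeout="$3" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # Count pods whose READY column (e.g. "1/1") has all containers ready.
    local ready
    ready=$($cmd | awk '{ split($2, r, "/"); if (r[1] == r[2]) n++ } END { print n+0 }')
    if [ "$ready" -eq "$expected" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "Timed out waiting for pods." >&2
  return 1   # caller should abort the deployment and run the cleanup path
}
```

With a 300-second timeout (the value the reporter expected) and three pods stuck at 0/1, such a loop would return nonzero and the deployment would be cleaned up instead of proceeding to the heketi stage.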
While trying to reproduce this, please run cns-deploy with the -v flag and capture the verbose output of it waiting for the Gluster nodes to come up.
(In reply to Jose A. Rivera from comment #2)
> While trying to reproduce this, please run cns-deploy with the -v flag and
> capture the verbose output of it waiting for the Gluster nodes to come up.

I managed to reproduce the issue with the -v flag, but I don't see any additional information being captured. This is all I got:

# cns-deploy -v -n storage-project -g topology.json
Welcome to the deployment tool for GlusterFS on Kubernetes and OpenShift.

Before getting started, this script has some requirements of the execution
environment and of the container platform that you should verify.

The client machine that will run this script must have:
 * Administrative access to an existing Kubernetes or OpenShift cluster
 * Access to a python interpreter 'python'
 * Access to the heketi client 'heketi-cli'

Each of the nodes that will host GlusterFS must also have appropriate firewall
rules for the required GlusterFS ports:
 * 2222  - sshd (if running GlusterFS in a pod)
 * 24007 - GlusterFS Daemon
 * 24008 - GlusterFS Management
 * 49152 to 49251 - Each brick for every volume on the host requires its own
   port. For every new brick, one new port will be used starting at 49152. We
   recommend a default range of 49152-49251 on each host, though you can
   adjust this to fit your needs.

In addition, for an OpenShift deployment you must:
 * Have 'cluster_admin' role on the administrative account doing the deployment
 * Add the 'default' and 'router' Service Accounts to the 'privileged' SCC
 * Add the 'heketi-service-account' Service Account to the 'privileged' SCC
 * Have a router deployed that is configured to allow apps to access services
   running in the cluster

Do you wish to proceed with deployment? [Y]es, [N]o? [Default: Y]: y
Using OpenShift CLI.
NAME              STATUS    AGE
storage-project   Active    13m
Using namespace "storage-project".
serviceaccount "heketi-service-account" created
template "heketi" created
template "glusterfs" created
role "edit" added: "system:serviceaccount:storage-project:heketi-service-account"
Marking 'dhcp47-21.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-21.lab.eng.blr.redhat.com" labeled
Marking 'dhcp46-165.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp46-165.lab.eng.blr.redhat.com" labeled
Marking 'dhcp47-51.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-51.lab.eng.blr.redhat.com" labeled
Deploying GlusterFS pods.
daemonset "glusterfs" created
Waiting for GlusterFS pods to start ... OK
secret "heketi-db-backup" created
service "heketi" created
route "heketi" created
deploymentconfig "heketi" created
Waiting for heketi pod to start ... OK
Determining heketi service URL ... OK
Failed to communicate with heketi service.
Please verify that a router has been properly configured.
deploymentconfig "heketi" deleted
service "heketi" deleted
route "heketi" deleted
serviceaccount "heketi-service-account" deleted
secret "heketi-db-backup" deleted
template "heketi" deleted
Removing label from 'dhcp47-21.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-21.lab.eng.blr.redhat.com" labeled
Removing label from 'dhcp46-165.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp46-165.lab.eng.blr.redhat.com" labeled
Removing label from 'dhcp47-51.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-51.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" deleted
template "glusterfs" deleted

# oc get pods
NAME                             READY     STATUS             RESTARTS   AGE
glusterfs-75l9t                  1/1       Running            0          7m
glusterfs-nzzzv                  1/1       Running            0          7m
glusterfs-v9pbr                  0/1       Running            1          7m
heketi-1-5rtvp                   0/1       CrashLoopBackOff   2          1m
heketi-1-deploy                  1/1       Running            0          1m
storage-project-router-1-bzn6h   1/1       Running            2          20m

# oc describe pods/glusterfs-v9pbr
Name:            glusterfs-v9pbr
Namespace:       storage-project
Security Policy: privileged
Node:            dhcp47-51.lab.eng.blr.redhat.com/10.70.47.51
Start Time:      Tue, 07 Mar 2017 14:46:35 +0530
Labels:          glusterfs-node=pod
Status:          Running
IP:              10.70.47.51
Controllers:     DaemonSet/glusterfs
Containers:
  glusterfs:
    Container ID:  docker://f190164f35839e3337c0827f5d2ff1345c49c7e8d3377c78f5f9a4667f606a54
    Image:         rhgs3/rhgs-server-rhel7:3.2.0-3
    Image ID:      docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7@sha256:85804dae88fabc9d3416d16f4c005c16fd21789aeb9bb708a5bef84b5cc02bfb
    Port:
    State:         Running
      Started:     Tue, 07 Mar 2017 14:52:27 +0530
    Last State:    Terminated
      Reason:      Error
      Exit Code:   1
      Started:     Tue, 07 Mar 2017 14:50:19 +0530
      Finished:    Tue, 07 Mar 2017 14:52:11 +0530
    Ready:         False
    Restart Count: 1
    Liveness:      exec [/bin/bash -c systemctl status glusterd.service] delay=100s timeout=3s period=10s #success=1 #failure=3
    Readiness:     exec [/bin/bash -c systemctl status glusterd.service] delay=100s timeout=3s period=10s #success=1 #failure=3
    Volume Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /etc/ssl from glusterfs-ssl (ro)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (ro)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/lib/misc/glusterfsd from glusterfs-misc (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4bvgn (ro)
    Environment Variables: <none>
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  glusterfs-heketi:
    Type:  HostPath (bare host directory volume)
    Path:  /var/lib/heketi
  glusterfs-run:
    Type:  EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  glusterfs-lvm:
    Type:  HostPath (bare host directory volume)
    Path:  /run/lvm
  glusterfs-etc:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/glusterfs
  glusterfs-logs:
    Type:  HostPath (bare host directory volume)
    Path:  /var/log/glusterfs
  glusterfs-config:
    Type:  HostPath (bare host directory volume)
    Path:  /var/lib/glusterd
  glusterfs-dev:
    Type:  HostPath (bare host directory volume)
    Path:  /dev
  glusterfs-misc:
    Type:  HostPath (bare host directory volume)
    Path:  /var/lib/misc/glusterfsd
  glusterfs-cgroup:
    Type:  HostPath (bare host directory volume)
    Path:  /sys/fs/cgroup
  glusterfs-ssl:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/ssl
  default-token-4bvgn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4bvgn
QoS Class:   BestEffort
Tolerations: <none>
Events:
  FirstSeen  LastSeen  Count  From                                        SubObjectPath               Type    Reason   Message
  ---------  --------  -----  ----                                        -------------               ------  ------   -------
  8m         8m        1      {kubelet dhcp47-51.lab.eng.blr.redhat.com}  spec.containers{glusterfs}  Normal  Pulling  pulling image "rhgs3/rhgs-server-rhel7:3.2.0-3"
  5m         5m        1      {kubelet dhcp47-51.lab.eng.blr.redhat.com}  spec.containers{glusterfs}  Normal  Pulling  pulling image "rhgs3/rhgs-server-rhel7:3.2.0-3"
  4m         4m        1      {kubelet dhcp47-51.lab.eng.blr.redhat.com}  spec.containers{glusterfs}  Normal  Pulled   Successfully pulled image "rhgs3/rhgs-server-rhel7:3.2.0-3"
  4m         4m        1      {kubelet dhcp47-51.lab.eng.blr.redhat.com}  spec.containers{glusterfs}  Normal  Created  Created container with docker id ef90aba17e95; Security:[seccomp=unconfined]
  4m         4m        1      {kubelet dhcp47-51.lab.eng.blr.redhat.com}  spec.containers{glusterfs}  Normal  Started  Started container with docker id ef90aba17e95
  2m         2m        1      {kubelet dhcp47-51.lab.eng.blr.redhat.com}  spec.containers{glusterfs}  Normal  Pulled   Container image "rhgs3/rhgs-server-rhel7:3.2.0-3" already present on machine
  2m         2m        1      {kubelet dhcp47-51.lab.eng.blr.redhat.com}  spec.containers{glusterfs}  Normal  Created  Created container with docker id f190164f3583; Security:[seccomp=unconfined]
  2m         2m        1      {kubelet dhcp47-51.lab.eng.blr.redhat.com}  spec.containers{glusterfs}  Normal  Started  Started container with docker id f190164f3583
I see in both instances the following message:

Failed to communicate with heketi service.
Please verify that a router has been properly configured.

It looks like it is failing on bad communication with heketi rather than on bad GlusterFS pods. My guess would be that the pods are coming up fine and are just not being cleaned up properly.

I also see that the heketi pod is stuck in a CrashLoopBackOff state. Can you do an oc describe on that?
(In reply to Jose A. Rivera from comment #5)
> I see in both instances the following message:
>
> Failed to communicate with heketi service.
> Please verify that a router has been properly configured.
>
> It looks like it is failing on bad communication to heketi instead of bad
> GlusterFS pods. My guess would be that the pods are coming up fine and are
> just not being cleaned up properly.
>
> I see that the heketi pod is stuck in a CrashLoopBackoff state. Can you do
> an oc describe on that?

The bug in discussion is that cns-deploy continues with the deployment even though the gluster pods are not actually up. The question of heketi being in a bad state comes later, once the gluster pods are actually up. Ideally, I'd expect cns-deploy to fail as soon as the gluster pods fail to start within the stipulated time (300 seconds, I guess?) and then proceed with the cleanup. That is how it behaved in the previous release, but I don't see that happening now.
Krishna, as best I can tell from your pasted output, the bug in discussion does not exist. The GlusterFS pods seem to be coming up properly, which is why the deployment continues. That the GlusterFS pods are still extant after the script aborts is a different bug that we could look into.
Upstream has a rewritten check_pods function that should fix this issue. It will be taken into the next build.
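To illustrate the shape of a check_pods-style helper, here is a minimal sketch. This is a hypothetical reconstruction, not the actual upstream code: on each retry it prints the status of the matching pods (mirroring the "Checking status of pods matching '...'" verbose output seen in the fixed build) and returns nonzero if they never all become ready, so the caller can abort and clean up. The first argument stands in for something like `oc get pods --no-headers -l "$selector"`.

```shell
#!/bin/bash
# Hypothetical sketch of a check_pods-style helper (not the actual upstream code).
# check_pods LIST_CMD SELECTOR TRIES
#   LIST_CMD - command printing 'NAME READY STATUS RESTARTS AGE' lines
#   SELECTOR - label selector, used only for the progress messages
#   TRIES    - number of one-second retries before failing
check_pods() {
  local list_cmd="$1" selector="$2" tries="$3"
  while [ "$tries" -gt 0 ]; do
    echo "Checking status of pods matching '${selector}':"
    local out
    out=$($list_cmd)
    echo "$out"
    # Succeed only when there is at least one pod and every READY column reads n/n.
    if echo "$out" | awk '{ n++; split($2, r, "/"); if (r[1] != r[2]) exit 1 }
                          END { if (n == 0) exit 1 }'; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "Timed out waiting for pods matching '${selector}'."
  return 1   # nonzero result lets the deploy script abort and clean up
}
```

The key difference from the buggy behavior reported above is that a not-ready pod (READY 0/1) keeps the check failing, so the script never reaches the heketi stage with unhealthy GlusterFS pods.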
The bug has been resolved. Please find the results below:

[root@dhcp47-158 ~]# cns-deploy -v -n storage-project -g topology-sample.json
Welcome to the deployment tool for GlusterFS on Kubernetes and OpenShift.

Before getting started, this script has some requirements of the execution
environment and of the container platform that you should verify.

The client machine that will run this script must have:
 * Administrative access to an existing Kubernetes or OpenShift cluster
 * Access to a python interpreter 'python'
 * Access to the heketi client 'heketi-cli'

Each of the nodes that will host GlusterFS must also have appropriate firewall
rules for the required GlusterFS ports:
 * 2222  - sshd (if running GlusterFS in a pod)
 * 24007 - GlusterFS Daemon
 * 24008 - GlusterFS Management
 * 49152 to 49251 - Each brick for every volume on the host requires its own
   port. For every new brick, one new port will be used starting at 49152. We
   recommend a default range of 49152-49251 on each host, though you can
   adjust this to fit your needs.

In addition, for an OpenShift deployment you must:
 * Have 'cluster_admin' role on the administrative account doing the deployment
 * Add the 'default' and 'router' Service Accounts to the 'privileged' SCC
 * Have a router deployed that is configured to allow apps to access services
   running in the cluster

Do you wish to proceed with deployment? [Y]es, [N]o? [Default: Y]: y
Using OpenShift CLI.
NAME              STATUS    AGE
storage-project   Active    1h
Using namespace "storage-project".
Checking that heketi pod is not running ...
Checking status of pods matching 'glusterfs=heketi-pod':
No resources found.
Timed out waiting for pods matching 'glusterfs=heketi-pod'.
OK
template "deploy-heketi" created
serviceaccount "heketi-service-account" created
template "heketi" created
template "glusterfs" created
role "edit" added: "system:serviceaccount:storage-project:heketi-service-account"
Marking 'dhcp47-159.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-159.lab.eng.blr.redhat.com" labeled
Marking 'dhcp47-160.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-160.lab.eng.blr.redhat.com" labeled
Marking 'dhcp47-149.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-149.lab.eng.blr.redhat.com" labeled
Deploying GlusterFS pods.
daemonset "glusterfs" created
Waiting for GlusterFS pods to start ...
Checking status of pods matching 'glusterfs-node=pod':
glusterfs-5l7jg   1/1       Running   0          5m
glusterfs-h9js4   0/1       Running   0          5m
glusterfs-knr54   1/1       Running   0          5m
Timed out waiting for pods matching 'glusterfs-node=pod'.
pods not found.
Error from server (NotFound): services "heketi" not found
serviceaccount "heketi-service-account" deleted
No resources found
Error from server (NotFound): services "heketi-storage-endpoints" not found
Error from server (NotFound): deploymentconfig "heketi" not found
Error from server (NotFound): routes "heketi" not found
template "deploy-heketi" deleted
template "heketi" deleted
Removing label from 'dhcp47-159.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-159.lab.eng.blr.redhat.com" labeled
Removing label from 'dhcp47-160.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-160.lab.eng.blr.redhat.com" labeled
Removing label from 'dhcp47-149.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-149.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" deleted
template "glusterfs" deleted
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1112