Bug 1428381 - cns-deploy continues with the deployment even when gluster pods are not ready
Summary: cns-deploy continues with the deployment even when gluster pods are not ready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: cns-deploy-tool
Version: cns-3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: CNS 3.5
Assignee: Jose A. Rivera
QA Contact: Tejas Chaphekar
URL:
Whiteboard:
Depends On:
Blocks: 1415600
 
Reported: 2017-03-02 13:06 UTC by krishnaram Karthick
Modified: 2018-12-06 19:20 UTC
CC: 8 users

Fixed In Version: cns-deploy-4.0.0-3.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-20 18:27:03 UTC
Embargoed:




Links
Red Hat Product Errata RHEA-2017:1112 (SHIPPED_LIVE, normal): cns-deploy-tool bug fix and enhancement update, last updated 2017-04-20 22:25:47 UTC

Description krishnaram Karthick 2017-03-02 13:06:55 UTC
Description of problem:

The cns-deploy tool continues with the setup of heketi even when the gluster pods are not ready. The gluster pods stayed in the not-ready state even after the heketi pod was deployed later.

This was not the behavior in the previous release: when the gluster pods were not ready, cns-deploy failed after timing out and cleaned up the pods and the daemon set.
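
For reference, the not-ready state can be checked outside the tool with plain oc commands; a minimal sketch, assuming the 'glusterfs-node=pod' label that the daemonset applies (visible in the pod description further below) and the 'storage-project' namespace:

    # Hypothetical check, not part of cns-deploy: list the GlusterFS pods by the
    # label the daemonset applies and flag any pod whose READY column is not 1/1.
    oc get pods -n storage-project -l glusterfs-node=pod
    oc get pods -n storage-project -l glusterfs-node=pod --no-headers | \
        awk '$2 != "1/1" {print $1 " is not ready"}'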

snippet of cns-deploy
======================
Do you wish to proceed with deployment?

[Y]es, [N]o? [Default: Y]: y
Using OpenShift CLI.
NAME              STATUS    AGE
storage-project   Active    8m
Using namespace "storage-project".
serviceaccount "heketi-service-account" created
template "heketi" created
template "glusterfs" created
role "edit" added: "system:serviceaccount:storage-project:heketi-service-account"
node "dhcp46-41.lab.eng.blr.redhat.com" labeled
node "dhcp46-70.lab.eng.blr.redhat.com" labeled
node "dhcp46-98.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" created
Waiting for GlusterFS pods to start ... OK
secret "heketi-db-backup" created
service "heketi" created
route "heketi" created
deploymentconfig "heketi" created
Waiting for heketi pod to start ... 
OK
Failed to communicate with heketi service.
Please verify that a router has been properly configured.
deploymentconfig "heketi" deleted
service "heketi" deleted
route "heketi" deleted
serviceaccount "heketi-service-account" deleted
secret "heketi-db-backup" deleted
template "heketi" deleted
node "dhcp46-41.lab.eng.blr.redhat.com" labeled
node "dhcp46-70.lab.eng.blr.redhat.com" labeled
node "dhcp46-98.lab.eng.blr.redhat.com" labeled


[root@dhcp46-201 ~]# oc get pods
NAME                             READY     STATUS    RESTARTS   AGE
glusterfs-1mdg7                  0/1       Running   0          5m
glusterfs-pn260                  0/1       Running   0          5m
glusterfs-q8rbw                  0/1       Running   0          5m
storage-project-router-1-4znj8   1/1       Running   0          12m


Version-Release number of selected component (if applicable):
rpm -qa | grep 'heketi'
heketi-client-4.0.0-1.el7rhgs.x86_64
[root@dhcp46-201 ~]# rpm -qa | grep 'cns-deploy'
cns-deploy-4.0.0-2.el7rhgs.x86_64

How reproducible:
I haven't tried to reproduce the issue yet. 

Steps to Reproduce:
NA

Actual results:
cns-deploy continued with setting up heketi even though the gluster pods were not ready

Expected results:
cns-deploy should fail if pods are not ready

Additional info:
I don't have any logs, but I have captured a few CLI outputs: http://pastebin.test.redhat.com/460795

Comment 2 Jose A. Rivera 2017-03-02 13:13:37 UTC
While trying to reproduce this, please run cns-deploy with the -v flag and capture the verbose output of it waiting for the Gluster nodes to come up.

Comment 3 krishnaram Karthick 2017-03-07 10:04:43 UTC
(In reply to Jose A. Rivera from comment #2)
> While trying to reproduce this, please run cns-deploy with the -v flag and
> capture the verbose output of it waiting for the Gluster nodes to come up.

I managed to reproduce the issue with the -v flag, but I don't see any additional information being captured. This is all I got.

# cns-deploy -v -n storage-project -g topology.json
Welcome to the deployment tool for GlusterFS on Kubernetes and OpenShift.

Before getting started, this script has some requirements of the execution
environment and of the container platform that you should verify.

The client machine that will run this script must have:
 * Administrative access to an existing Kubernetes or OpenShift cluster
 * Access to a python interpreter 'python'
 * Access to the heketi client 'heketi-cli'

Each of the nodes that will host GlusterFS must also have appropriate firewall
rules for the required GlusterFS ports:
 * 2222  - sshd (if running GlusterFS in a pod)
 * 24007 - GlusterFS Daemon
 * 24008 - GlusterFS Management
 * 49152 to 49251 - Each brick for every volume on the host requires its own
   port. For every new brick, one new port will be used starting at 49152. We
   recommend a default range of 49152-49251 on each host, though you can adjust
   this to fit your needs.

In addition, for an OpenShift deployment you must:
 * Have 'cluster_admin' role on the administrative account doing the deployment
 * Add the 'default' and 'router' Service Accounts to the 'privileged' SCC
 * Add the 'heketi-service-account' Service Account to the 'privileged' SCC
 * Have a router deployed that is configured to allow apps to access services
   running in the cluster

Do you wish to proceed with deployment?

[Y]es, [N]o? [Default: Y]: y
Using OpenShift CLI.
NAME              STATUS    AGE
storage-project   Active    13m
Using namespace "storage-project".
serviceaccount "heketi-service-account" created
template "heketi" created
template "glusterfs" created
role "edit" added: "system:serviceaccount:storage-project:heketi-service-account"
Marking 'dhcp47-21.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-21.lab.eng.blr.redhat.com" labeled
Marking 'dhcp46-165.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp46-165.lab.eng.blr.redhat.com" labeled
Marking 'dhcp47-51.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-51.lab.eng.blr.redhat.com" labeled
Deploying GlusterFS pods.
daemonset "glusterfs" created
Waiting for GlusterFS pods to start ... OK
secret "heketi-db-backup" created
service "heketi" created
route "heketi" created
deploymentconfig "heketi" created
Waiting for heketi pod to start ... OK
Determining heketi service URL ... OK
Failed to communicate with heketi service.
Please verify that a router has been properly configured.
deploymentconfig "heketi" deleted
service "heketi" deleted
route "heketi" deleted
serviceaccount "heketi-service-account" deleted
secret "heketi-db-backup" deleted
template "heketi" deleted
Removing label from 'dhcp47-21.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-21.lab.eng.blr.redhat.com" labeled
Removing label from 'dhcp46-165.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp46-165.lab.eng.blr.redhat.com" labeled
Removing label from 'dhcp47-51.lab.eng.blr.redhat.com' as a GlusterFS node.
node "dhcp47-51.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" deleted
template "glusterfs" deleted

#oc get pods
NAME                             READY     STATUS             RESTARTS   AGE
glusterfs-75l9t                  1/1       Running            0          7m
glusterfs-nzzzv                  1/1       Running            0          7m
glusterfs-v9pbr                  0/1       Running            1          7m
heketi-1-5rtvp                   0/1       CrashLoopBackOff   2          1m
heketi-1-deploy                  1/1       Running            0          1m
storage-project-router-1-bzn6h   1/1       Running            2          20m


# oc describe pods/glusterfs-v9pbr
Name:			glusterfs-v9pbr
Namespace:		storage-project
Security Policy:	privileged
Node:			dhcp47-51.lab.eng.blr.redhat.com/10.70.47.51
Start Time:		Tue, 07 Mar 2017 14:46:35 +0530
Labels:			glusterfs-node=pod
Status:			Running
IP:			10.70.47.51
Controllers:		DaemonSet/glusterfs
Containers:
  glusterfs:
    Container ID:	docker://f190164f35839e3337c0827f5d2ff1345c49c7e8d3377c78f5f9a4667f606a54
    Image:		rhgs3/rhgs-server-rhel7:3.2.0-3
    Image ID:		docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-server-rhel7@sha256:85804dae88fabc9d3416d16f4c005c16fd21789aeb9bb708a5bef84b5cc02bfb
    Port:		
    State:		Running
      Started:		Tue, 07 Mar 2017 14:52:27 +0530
    Last State:		Terminated
      Reason:		Error
      Exit Code:	1
      Started:		Tue, 07 Mar 2017 14:50:19 +0530
      Finished:		Tue, 07 Mar 2017 14:52:11 +0530
    Ready:		False
    Restart Count:	1
    Liveness:		exec [/bin/bash -c systemctl status glusterd.service] delay=100s timeout=3s period=10s #success=1 #failure=3
    Readiness:		exec [/bin/bash -c systemctl status glusterd.service] delay=100s timeout=3s period=10s #success=1 #failure=3
    Volume Mounts:
      /dev from glusterfs-dev (rw)
      /etc/glusterfs from glusterfs-etc (rw)
      /etc/ssl from glusterfs-ssl (ro)
      /run from glusterfs-run (rw)
      /run/lvm from glusterfs-lvm (rw)
      /sys/fs/cgroup from glusterfs-cgroup (ro)
      /var/lib/glusterd from glusterfs-config (rw)
      /var/lib/heketi from glusterfs-heketi (rw)
      /var/lib/misc/glusterfsd from glusterfs-misc (rw)
      /var/log/glusterfs from glusterfs-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4bvgn (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  glusterfs-heketi:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/heketi
  glusterfs-run:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	
  glusterfs-lvm:
    Type:	HostPath (bare host directory volume)
    Path:	/run/lvm
  glusterfs-etc:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/glusterfs
  glusterfs-logs:
    Type:	HostPath (bare host directory volume)
    Path:	/var/log/glusterfs
  glusterfs-config:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/glusterd
  glusterfs-dev:
    Type:	HostPath (bare host directory volume)
    Path:	/dev
  glusterfs-misc:
    Type:	HostPath (bare host directory volume)
    Path:	/var/lib/misc/glusterfsd
  glusterfs-cgroup:
    Type:	HostPath (bare host directory volume)
    Path:	/sys/fs/cgroup
  glusterfs-ssl:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ssl
  default-token-4bvgn:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-4bvgn
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From						SubObjectPath			Type		Reason	Message
  ---------	--------	-----	----						-------------			--------	------	-------
  8m		8m		1	{kubelet dhcp47-51.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Pulling	pulling image "rhgs3/rhgs-server-rhel7:3.2.0-3"
  5m		5m		1	{kubelet dhcp47-51.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Pulling	pulling image "rhgs3/rhgs-server-rhel7:3.2.0-3"
  4m		4m		1	{kubelet dhcp47-51.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Pulled	Successfully pulled image "rhgs3/rhgs-server-rhel7:3.2.0-3"
  4m		4m		1	{kubelet dhcp47-51.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Created	Created container with docker id ef90aba17e95; Security:[seccomp=unconfined]
  4m		4m		1	{kubelet dhcp47-51.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Started	Started container with docker id ef90aba17e95
  2m		2m		1	{kubelet dhcp47-51.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Pulled	Container image "rhgs3/rhgs-server-rhel7:3.2.0-3" already present on machine
  2m		2m		1	{kubelet dhcp47-51.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Created	Created container with docker id f190164f3583; Security:[seccomp=unconfined]
  2m		2m		1	{kubelet dhcp47-51.lab.eng.blr.redhat.com}	spec.containers{glusterfs}	Normal		Started	Started container with docker id f190164f3583
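
The readiness probe above simply runs 'systemctl status glusterd.service' inside the container, so the same check can be repeated by hand to see why the pod stays at 0/1; a minimal sketch, using the pod name from the listing above:

    # Run the same command the readiness probe uses inside the not-ready pod;
    # a non-zero exit status here is what keeps the pod at 0/1 READY.
    oc -n storage-project exec glusterfs-v9pbr -- /bin/bash -c 'systemctl status glusterd.service'
    echo "probe exit status: $?"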

Comment 5 Jose A. Rivera 2017-03-13 17:50:00 UTC
I see in both instances the following message:

Failed to communicate with heketi service.
Please verify that a router has been properly configured.

It looks like it is failing on bad communication to heketi instead of bad GlusterFS pods. My guess would be that the pods are coming up fine and are just not being cleaned up properly.

I see that the heketi pod is stuck in a CrashLoopBackOff state. Can you do an oc describe on that?
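
For example, something like the following against the pod name shown in the listing above (the exact name will differ per deployment):

    # Describe the crashing heketi pod to capture its events and last container state.
    oc describe pods/heketi-1-5rtvp -n storage-project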

Comment 6 krishnaram Karthick 2017-03-14 03:13:16 UTC
(In reply to Jose A. Rivera from comment #5)
> I see in both instances the following message:
> 
> Failed to communicate with heketi service.
> Please verify that a router has been properly configured.
> 
> It looks like it is failing on bad communication to heketi instead of bad
> GlusterFS pods. My guess would be that the pods are coming up fine and are
> just not being cleaned up properly.
> 
> I see that the heketi pod is stuck in a CrashLoopBackOff state. Can you do
> an oc describe on that?

The bug in discussion is that cns-deploy continues with the deployment even though the gluster pods are not actually up. The question of heketi being in a bad state only comes up later, once the gluster pods are actually up.

Ideally, I'd expect cns-deploy to fail as soon as the gluster pods fail to start within the stipulated time (300 seconds, I guess?) and then proceed with the cleanup. This is how it behaved in the previous release, but now I don't see that happening.

Comment 13 Jose A. Rivera 2017-03-15 13:36:23 UTC
Krishna, as best I can tell from your pasted output, the bug in discussion does not exist. The GlusterFS pods seem to be coming up properly, which is why the deployment continues. That the GlusterFS pods are still present after the script aborts is a different bug that we could look into.

Comment 14 Michael Adam 2017-03-16 14:15:52 UTC
Upstream has a rewritten check_pods function that should fix this issue. It will be taken into the next build.
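
The rewritten check is not quoted here, but judging from the verification output in the next comment it waits on pods matching a label, prints their status, and gives up after a timeout. A minimal sketch of that kind of check, with hypothetical names and a 300-second default (not the actual upstream check_pods):

    # Hypothetical sketch of a label-based readiness wait, not the actual upstream code.
    wait_for_pods () {
        local selector="$1" timeout="${2:-300}" waited=0 total not_ready
        while [ "$waited" -lt "$timeout" ]; do
            # Count pods matching the selector whose READY column is not x/x.
            total=$(oc get pods --no-headers -l "$selector" 2>/dev/null | wc -l)
            not_ready=$(oc get pods --no-headers -l "$selector" 2>/dev/null | \
                        awk '{split($2, r, "/"); if (r[1] != r[2]) n++} END {print n+0}')
            if [ "$total" -gt 0 ] && [ "$not_ready" -eq 0 ]; then
                return 0
            fi
            sleep 10
            waited=$((waited + 10))
        done
        echo "Timed out waiting for pods matching '$selector'."
        return 1
    }

    # Example: abort the deployment (and let cleanup run) if any GlusterFS pod
    # is still not ready after the wait.
    wait_for_pods 'glusterfs-node=pod' 300 || exit 1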

Comment 15 Tejas Chaphekar 2017-04-11 06:57:29 UTC
The bug has been resolved. Please find the verification results below:


[root@dhcp47-158 ~]# cns-deploy -v -n storage-project -g topology-sample.json
    Welcome to the deployment tool for GlusterFS on Kubernetes and OpenShift.
     
    Before getting started, this script has some requirements of the execution
    environment and of the container platform that you should verify.
     
    The client machine that will run this script must have:
     * Administrative access to an existing Kubernetes or OpenShift cluster
     * Access to a python interpreter 'python'
     * Access to the heketi client 'heketi-cli'
     
    Each of the nodes that will host GlusterFS must also have appropriate firewall
    rules for the required GlusterFS ports:
     * 2222  - sshd (if running GlusterFS in a pod)
     * 24007 - GlusterFS Daemon
     * 24008 - GlusterFS Management
     * 49152 to 49251 - Each brick for every volume on the host requires its own
       port. For every new brick, one new port will be used starting at 49152. We
       recommend a default range of 49152-49251 on each host, though you can adjust
       this to fit your needs.
     
    In addition, for an OpenShift deployment you must:
     * Have 'cluster_admin' role on the administrative account doing the deployment
     * Add the 'default' and 'router' Service Accounts to the 'privileged' SCC
     * Have a router deployed that is configured to allow apps to access services
       running in the cluster
     
    Do you wish to proceed with deployment?
     
    [Y]es, [N]o? [Default: Y]: y
    Using OpenShift CLI.
    NAME              STATUS    AGE
    storage-project   Active    1h
    Using namespace "storage-project".
    Checking that heketi pod is not running ...
    Checking status of pods matching 'glusterfs=heketi-pod':
    No resources found.
    Timed out waiting for pods matching 'glusterfs=heketi-pod'.
    OK
    template "deploy-heketi" created
    serviceaccount "heketi-service-account" created
    template "heketi" created
    template "glusterfs" created
    role "edit" added: "system:serviceaccount:storage-project:heketi-service-account"
    Marking 'dhcp47-159.lab.eng.blr.redhat.com' as a GlusterFS node.
    node "dhcp47-159.lab.eng.blr.redhat.com" labeled
    Marking 'dhcp47-160.lab.eng.blr.redhat.com' as a GlusterFS node.
    node "dhcp47-160.lab.eng.blr.redhat.com" labeled
    Marking 'dhcp47-149.lab.eng.blr.redhat.com' as a GlusterFS node.
    node "dhcp47-149.lab.eng.blr.redhat.com" labeled
    Deploying GlusterFS pods.
    daemonset "glusterfs" created
    Waiting for GlusterFS pods to start ...
    Checking status of pods matching 'glusterfs-node=pod':
    glusterfs-5l7jg   1/1       Running   0         5m
    glusterfs-h9js4   0/1       Running   0         5m
    glusterfs-knr54   1/1       Running   0         5m
    Timed out waiting for pods matching 'glusterfs-node=pod'.
    pods not found.
    Error from server (NotFound): services "heketi" not found
    serviceaccount "heketi-service-account" deleted
    No resources found
    Error from server (NotFound): services "heketi-storage-endpoints" not found
    Error from server (NotFound): deploymentconfig "heketi" not found
    Error from server (NotFound): routes "heketi" not found
    template "deploy-heketi" deleted
    template "heketi" deleted
    Removing label from 'dhcp47-159.lab.eng.blr.redhat.com' as a GlusterFS node.
    node "dhcp47-159.lab.eng.blr.redhat.com" labeled
    Removing label from 'dhcp47-160.lab.eng.blr.redhat.com' as a GlusterFS node.
    node "dhcp47-160.lab.eng.blr.redhat.com" labeled
    Removing label from 'dhcp47-149.lab.eng.blr.redhat.com' as a GlusterFS node.
    node "dhcp47-149.lab.eng.blr.redhat.com" labeled
    daemonset "glusterfs" deleted
    template "glusterfs" deleted

Comment 16 errata-xmlrpc 2017-04-20 18:27:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1112

