1435165 – CNS: cns_deploy fails- gluster pods not found, nodes going to NotReady state

Summary: CNS: cns_deploy fails- gluster pods not found, nodes going to NotReady state

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	cns-deploy-tool
Sub Component:
Version:	cns-3.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Humble Chirammal
QA Contact:	Apeksha
Docs Contact:
URL:
Whiteboard:
Depends On:	1436536
Blocks:	1415600

Reported:	2017-03-23 10:10 UTC by Apeksha
Modified:	2017-04-04 13:00 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-04-04 13:00:22 UTC
Embargoed:
Dependent Products:

Description Apeksha 2017-03-23 10:10:17 UTC

Description of problem:
CNS: cns_deploy fails - gluster pods not found, nodes going to NotReady state

Version-Release number of selected component (if applicable):
cns-deploy-4.0.0-6.el7rhgs.x86_64
heketi-client-4.0.0-3.el7rhgs.x86_64
docker-1.12.6-16.el7.x86_64
atomic-openshift-3.5.2-1.git.0.d570b4d.el7.x86_64

How reproducible: Hit this twice


Steps to Reproduce:
1. Set up OpenShift.
2. Set up the router.
 At this point, oc get nodes shows all the nodes in Ready state.
3. Run the cns_deploy command. It fails saying gluster pods were not found, and oc get nodes then shows the nodes in NotReady state.

Output of cns_deploy command:
Using OpenShift CLI.
NAME      STATUS    AGE
aplo      Active    12m
Using namespace "aplo".
Checking that heketi pod is not running ... OK
template "deploy-heketi" created
serviceaccount "heketi-service-account" created
template "heketi" created
template "glusterfs" created
role "edit" added: "system:serviceaccount:aplo:heketi-service-account"
node "dhcp47-185.lab.eng.blr.redhat.com" labeled
node "dhcp46-67.lab.eng.blr.redhat.com" labeled
node "dhcp47-17.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" created
Waiting for GlusterFS pods to start ... pods not found.
serviceaccount "heketi-service-account" deleted
No resources found
template "deploy-heketi" deleted
template "heketi" deleted
node "dhcp47-185.lab.eng.blr.redhat.com" labeled
node "dhcp46-67.lab.eng.blr.redhat.com" labeled
node "dhcp47-17.lab.eng.blr.redhat.com" labeled
template "glusterfs" deleted

Error from server (NotFound): services "heketi" not found
Error from server (NotFound): services "heketi-storage-endpoints" not found
Error from server (NotFound): deploymentconfig "heketi" not found
Error from server (NotFound): routes "heketi" not found
Error from server (NotFound): secrets "heketi-db-backup" not found
error: timed out waiting for the condition

[root@dhcp46-230 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-108.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-127.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-205.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-97.lab.eng.blr.redhat.com    Ready,SchedulingDisabled   2h
[root@dhcp46-230 ~]# oc get pods
NAME                  READY     STATUS        RESTARTS   AGE
aplo-router-1-0l68h   1/1       Running       0          1h
glusterfs-5ll2q       0/1       Terminating   0          1h
glusterfs-hwn46       0/1       Terminating   0          1h
glusterfs-j6l97       0/1       Terminating   0          1h
[root@dhcp46-230 ~]# oc get ds
NAME        DESIRED   CURRENT   READY     NODE-SELECTOR                                                               AGE
glusterfs   0         0         0         492bf98e-0fd8-11e7-a311-005056b3d32d=492bf9e0-0fd8-11e7-a311-005056b3d32d   1h
[root@dhcp46-230 ~]# oc describe ds glusterfs
Name:		glusterfs
Image(s):	rhgs3/rhgs-server-rhel7:3.2.0-4
Selector:	glusterfs-node=pod
Node-Selector:	492bf98e-0fd8-11e7-a311-005056b3d32d=492bf9e0-0fd8-11e7-a311-005056b3d32d
Labels:		glusterfs=daemonset
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Misscheduled: 3
Pods Status:	3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
  FirstSeen	LastSeen	Count	From		SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----		-------------	--------	------			-------
  1h		1h		1	{daemon-set }			Normal		SuccessfulCreate	Created pod: glusterfs-j6l97
  1h		1h		1	{daemon-set }			Normal		SuccessfulCreate	Created pod: glusterfs-5ll2q
  1h		1h		1	{daemon-set }			Normal		SuccessfulCreate	Created pod: glusterfs-hwn46
  1h		2m		11	{daemon-set }			Normal		SuccessfulDelete	Deleted pod: glusterfs-j6l97
  1h		2m		10	{daemon-set }			Normal		SuccessfulDelete	Deleted pod: glusterfs-5ll2q
  1h		2m		10	{daemon-set }			Normal		SuccessfulDelete	Deleted pod: glusterfs-hwn46


Additional info: oc describe of node and docker status - http://pastebin.test.redhat.com/467628

Comment 2 Humble Chirammal 2017-03-23 12:45:17 UTC

If the nodes are in 'NotReady' status, pod deployment will fail. Can you make sure the nodes are in 'Ready' status before you deploy? If a node is in NotReady status, please restart the node service and wait for the status to change.
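
The readiness check requested above could be scripted before each deploy attempt. The following is a minimal sketch, assuming the plain-text column layout of `oc get nodes` shown in this report; the function name `not_ready_nodes` and the `SAMPLE` variable are hypothetical and not part of cns-deploy or the `oc` client.

```python
from typing import List


def not_ready_nodes(oc_get_nodes_output: str) -> List[str]:
    """Return the names of nodes whose STATUS column does not include 'Ready'."""
    bad = []
    # Skip the header row (NAME STATUS AGE), then inspect each node line.
    for line in oc_get_nodes_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        name, status = fields[0], fields[1]
        # STATUS can be a comma-separated list, e.g. "Ready,SchedulingDisabled".
        if "Ready" not in status.split(","):
            bad.append(name)
    return bad


# Sample taken from the `oc get nodes` output in the bug description.
SAMPLE = """\
NAME                                STATUS                     AGE
dhcp46-108.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-127.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-205.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-97.lab.eng.blr.redhat.com    Ready,SchedulingDisabled   2h
"""

if __name__ == "__main__":
    for node in not_ready_nodes(SAMPLE):
        print("NotReady:", node)
```

A node reported as `Ready,SchedulingDisabled` is treated as ready here, since only the `Ready` condition matters for this check.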

Comment 6 Humble Chirammal 2017-03-24 11:29:57 UTC

Apeksha, can you please remove the testblocker flag from this bugzilla? Also, this does not look like an issue caused by CNS; rather, it looks like an issue with your setup. Can you make sure your setup is good and running before trying cns-deploy?

Comment 7 Apeksha 2017-03-27 05:58:53 UTC

Humble,

I tried this on a fresh setup and, as mentioned in the steps to reproduce, the nodes were in Ready state before running cns_deploy. Anyway, I am trying it again on the latest build and will update the bug accordingly.

Comment 8 Apeksha 2017-03-28 06:47:13 UTC

Hit this issue again on the latest build, cns-deploy-4.0.0-9.el7rhgs.x86_64.

Before running cns_deploy, the nodes were in Ready state:
[root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running   0          3h
[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-17.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-185.lab.eng.blr.redhat.com   Ready                      3h

I have kept the setup in the same state for debugging.

Comment 9 Mohamed Ashiq 2017-03-28 08:56:12 UTC

I looked at the setup. All the services are running as expected. I am not sure what happened at the initial state.

I restarted the atomic-openshift-node service on all nodes and the state became Ready. Just to be sure, I cleaned up all the gluster stuff and did a gluster create on the cluster.

Now everything looks good.

[root@dhcp47-105 ~]# oc get pods -w
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-fnq00   1/1       Running   0          14m
glusterfs-b2sz0       0/1       Running   0          <invalid>
glusterfs-b6hcc       0/1       Running   0          <invalid>
glusterfs-w1c8v       0/1       Running   0          <invalid>
NAME              READY     STATUS    RESTARTS   AGE
glusterfs-b2sz0   1/1       Running   0          51s
glusterfs-w1c8v   1/1       Running   0         51s
glusterfs-b6hcc   1/1       Running   0         51s
^C
[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   5h
dhcp46-67.lab.eng.blr.redhat.com    Ready                      5h
dhcp47-17.lab.eng.blr.redhat.com    Ready                      5h
dhcp47-185.lab.eng.blr.redhat.com   Ready                      5h

Comment 10 Apeksha 2017-03-30 07:27:21 UTC

Based on Ashiq's input, I have filed an OCP bug for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1436536

Comment 12 Apeksha 2017-04-03 07:14:03 UTC

Talur,
 I do have a separate partition in the VMware setup:

[root@dhcp47-105 ~]# df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp47--183-root   50G  2.1G   48G   5% /
devtmpfs                            24G     0   24G   0% /dev
tmpfs                               24G     0   24G   0% /dev/shm
tmpfs                               24G   17M   24G   1% /run
tmpfs                               24G     0   24G   0% /sys/fs/cgroup
/dev/sdb1                           40G  343M   40G   1% /var
/dev/sda1                         1014M  211M  804M  21% /boot
/dev/mapper/rhel_dhcp47--183-home   50G   33M   50G   1% /home
tmpfs                              4.8G     0  4.8G   0% /run/user/0

Comment 13 Raghavendra Talur 2017-04-04 13:00:22 UTC

We have now confirmed that the issue is a consequence of how the VMs were hosted on the hypervisor. The hypervisor's clock was set +0530 ahead of UTC. When the nodes booted, they started with the wrong time, which chronyd then corrected. When the gluster container started, it ran systemd in privileged mode, which read the hardware clock and reset the system clock. This changed many things on the node (for example, the nodes rejected certificates from the master). Hence the NotReady state. We have now fixed both hypervisors and the setup seems to be fine.

Closing this bug as a configuration issue: the hardware clock on the hypervisor hosting the OpenShift nodes did not have the proper time set.
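
The mismatch described above can be illustrated by comparing a hardware clock reading against the system clock. This is only a sketch under stated assumptions: the function names `clock_skew` and `is_skewed` and the 5-second tolerance are hypothetical, and the timestamps are made-up values that reproduce a +05:30 offset. How the two readings are actually collected on a node is outside the scope of the sketch.

```python
from datetime import datetime, timedelta, timezone


def clock_skew(hardware_clock: datetime, system_clock: datetime) -> timedelta:
    """Return how far the hardware clock is ahead of the system clock."""
    return hardware_clock - system_clock


def is_skewed(hardware_clock: datetime, system_clock: datetime,
              tolerance: timedelta = timedelta(seconds=5)) -> bool:
    """True if the two clocks differ by more than the tolerance (an assumed value)."""
    return abs(clock_skew(hardware_clock, system_clock)) > tolerance


# Illustrative values only: a system clock corrected by time sync, and a
# hardware clock that is 5 hours 30 minutes ahead of it.
system_now = datetime(2017, 3, 28, 6, 47, 13, tzinfo=timezone.utc)
hardware_now = system_now + timedelta(hours=5, minutes=30)

if __name__ == "__main__":
    print(clock_skew(hardware_now, system_now))  # prints 5:30:00
    print(is_skewed(hardware_now, system_now))   # prints True
```

A jump of this size on a running node is enough to make freshly issued certificates appear not yet valid or already expired, which fits the certificate rejections mentioned above.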
