Description of problem:
CNS: cns_deploy fails - gluster pods not found, nodes going to NotReady state

Version-Release number of selected component (if applicable):
cns-deploy-4.0.0-6.el7rhgs.x86_64
heketi-client-4.0.0-3.el7rhgs.x86_64
docker-1.12.6-16.el7.x86_64
atomic-openshift-3.5.2-1.git.0.d570b4d.el7.x86_64

How reproducible:
Hit this twice

Steps to Reproduce:
1. Setup openshift
2. Setup router
   oc get nodes shows all the nodes in Ready state
3. Now run cns_deploy command; it fails saying gluster pods not found, and oc get nodes shows nodes in NotReady state

Output of cns_deploy command:
'Using OpenShift CLI.\nNAME STATUS AGE\naplo Active 12m\nUsing namespace "aplo".\nChecking that heketi pod is not running ... OK\ntemplate "deploy-heketi" created\nserviceaccount "heketi-service-account" created\ntemplate "heketi" created\ntemplate "glusterfs" created\nrole "edit" added: "system:serviceaccount:aplo:heketi-service-account"\nnode "dhcp47-185.lab.eng.blr.redhat.com" labeled\nnode "dhcp46-67.lab.eng.blr.redhat.com" labeled\nnode "dhcp47-17.lab.eng.blr.redhat.com" labeled\ndaemonset "glusterfs" created\nWaiting for GlusterFS pods to start ... pods not found.\nserviceaccount "heketi-service-account" deleted\nNo resources found\ntemplate "deploy-heketi" deleted\ntemplate "heketi" deleted\nnode "dhcp47-185.lab.eng.blr.redhat.com" labeled\nnode "dhcp46-67.lab.eng.blr.redhat.com" labeled\nnode "dhcp47-17.lab.eng.blr.redhat.com" labeled\ntemplate "glusterfs" deleted\n',
'Error from server (NotFound): services "heketi" not found\nError from server (NotFound): services "heketi-storage-endpoints" not found\nError from server (NotFound): deploymentconfig "heketi" not found\nError from server (NotFound): routes "heketi" not found\nError from server (NotFound): secrets "heketi-db-backup" not found\nerror: timed out waiting for the condition\n'

[root@dhcp46-230 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-108.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-127.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-205.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-97.lab.eng.blr.redhat.com    Ready,SchedulingDisabled   2h

[root@dhcp46-230 ~]# oc get pods
NAME                  READY     STATUS        RESTARTS   AGE
aplo-router-1-0l68h   1/1       Running       0          1h
glusterfs-5ll2q       0/1       Terminating   0          1h
glusterfs-hwn46       0/1       Terminating   0          1h
glusterfs-j6l97       0/1       Terminating   0          1h

[root@dhcp46-230 ~]# oc get ds
NAME        DESIRED   CURRENT   READY     NODE-SELECTOR                                                               AGE
glusterfs   0         0         0         492bf98e-0fd8-11e7-a311-005056b3d32d=492bf9e0-0fd8-11e7-a311-005056b3d32d   1h

[root@dhcp46-230 ~]# oc describe ds glusterfs
Name:           glusterfs
Image(s):       rhgs3/rhgs-server-rhel7:3.2.0-4
Selector:       glusterfs-node=pod
Node-Selector:  492bf98e-0fd8-11e7-a311-005056b3d32d=492bf9e0-0fd8-11e7-a311-005056b3d32d
Labels:         glusterfs=daemonset
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Misscheduled: 3
Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
  FirstSeen  LastSeen  Count  From           SubObjectPath  Type      Reason            Message
  ---------  --------  -----  ----           -------------  --------  ------            -------
  1h         1h        1      {daemon-set }                 Normal    SuccessfulCreate  Created pod: glusterfs-j6l97
  1h         1h        1      {daemon-set }                 Normal    SuccessfulCreate  Created pod: glusterfs-5ll2q
  1h         1h        1      {daemon-set }                 Normal    SuccessfulCreate  Created pod: glusterfs-hwn46
  1h         2m        11     {daemon-set }                 Normal    SuccessfulDelete  Deleted pod: glusterfs-j6l97
  1h         2m        10     {daemon-set }                 Normal    SuccessfulDelete  Deleted pod: glusterfs-5ll2q
  1h         2m        10     {daemon-set }                 Normal    SuccessfulDelete  Deleted pod: glusterfs-hwn46

Additional info:
oc describe of node and docker status - http://pastebin.test.redhat.com/467628
If the nodes are in 'NotReady' status, pod deployment will fail. Can you make sure the nodes are in 'Ready' status before you deploy? If they are in 'NotReady' status, please restart the node service and wait for the status to change.
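A minimal sketch of that pre-deploy check, assuming the plain three-column `oc get nodes` output shown in this report has been captured as text. The function name `not_ready_nodes` and the `SAMPLE` variable are hypothetical, introduced only for illustration; the sample rows are copied from the description above.

```python
def not_ready_nodes(oc_get_nodes_output: str) -> list[str]:
    """Return the names of nodes whose STATUS column lacks the 'Ready' condition."""
    bad = []
    for line in oc_get_nodes_output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/STATUS/AGE header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[1]
        # STATUS can be a comma-separated list, e.g. "Ready,SchedulingDisabled".
        if "Ready" not in status.split(","):
            bad.append(name)
    return bad


# Sample text copied from the `oc get nodes` output in the bug description.
SAMPLE = """\
NAME                                STATUS                     AGE
dhcp46-108.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-127.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-205.lab.eng.blr.redhat.com   NotReady                   2h
dhcp46-97.lab.eng.blr.redhat.com    Ready,SchedulingDisabled   2h
"""

if __name__ == "__main__":
    for node in not_ready_nodes(SAMPLE):
        print(f"node not ready: {node}")
```

An empty result would suggest it is safe to proceed with the deployment. Splitting the status on commas keeps `Ready,SchedulingDisabled` from being flagged, while `NotReady` is.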
Apeksha, can you please remove the testblocker flag from this bugzilla? Also, this does not look like an issue caused by CNS; rather, it looks like an issue with your setup. Can you make sure your setup is good and running before trying cns-deploy?
Humble, I had tried this on a fresh setup and, as mentioned in the steps to reproduce, the nodes were in Ready state before running cns_deploy. Anyway, I am trying it again on the latest build and will update the bug accordingly.
Hit this issue again on the latest build, cns-deploy-4.0.0-9.el7rhgs.x86_64. Before running cns_deploy the nodes were in Ready state:

[root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running   0          3h

[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-17.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-185.lab.eng.blr.redhat.com   Ready                      3h

I have kept the setup in the same state for debugging.
I looked at the setup. All the services are running as expected. I am not sure what happened in the initial state. I restarted the atomic-openshift-node service on all the nodes and the state became Ready. Just to be sure, I cleaned up all the gluster stuff and did a gluster create on the cluster. Now everything looks good.

[root@dhcp47-105 ~]# oc get pods -w
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-fnq00   1/1       Running   0          14m
glusterfs-b2sz0       0/1       Running   0          <invalid>
glusterfs-b6hcc       0/1       Running   0          <invalid>
glusterfs-w1c8v       0/1       Running   0          <invalid>
NAME                  READY     STATUS    RESTARTS   AGE
glusterfs-b2sz0       1/1       Running   0          51s
glusterfs-w1c8v       1/1       Running   0          51s
glusterfs-b6hcc       1/1       Running   0          51s
^C
[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   5h
dhcp46-67.lab.eng.blr.redhat.com    Ready                      5h
dhcp47-17.lab.eng.blr.redhat.com    Ready                      5h
dhcp47-185.lab.eng.blr.redhat.com   Ready                      5h
As per inputs given by ashiq, created an OCP bug for the same - https://bugzilla.redhat.com/show_bug.cgi?id=1436536
Talur, I do have a separate partition in the VMware setup:

[root@dhcp47-105 ~]# df -h
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp47--183-root    50G  2.1G   48G   5% /
devtmpfs                             24G     0   24G   0% /dev
tmpfs                                24G     0   24G   0% /dev/shm
tmpfs                                24G   17M   24G   1% /run
tmpfs                                24G     0   24G   0% /sys/fs/cgroup
/dev/sdb1                            40G  343M   40G   1% /var
/dev/sda1                          1014M  211M  804M  21% /boot
/dev/mapper/rhel_dhcp47--183-home    50G   33M   50G   1% /home
tmpfs                               4.8G     0  4.8G   0% /run/user/0
We have now confirmed that the issue is a consequence of how the VMs were hosted on the hypervisor. The hypervisor clock was set +0530 ahead of UTC. When the nodes booted, they started with the wrong time, but chronyd corrected it. When the gluster container started, it ran systemd in privileged mode, which read the hardware clock and reset the system clock. That changed many things on the node (for example, the nodes rejected certs from the master), hence the NotReady state. We have now fixed both hypervisors and it seems to be fine.

Closing this bug as a configuration issue: the hardware clock on the hypervisor hosting the OpenShift nodes did not have the proper time set.
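To illustrate the failure mode, here is a small sketch that compares a hardware clock reading against the system clock and flags a large offset. It assumes both readings are already available as timezone-aware datetimes; how they are obtained (for example from `hwclock` and `date`) is left out. The names `clock_skew` and `is_skewed`, the 5-second tolerance, and the example timestamp are all hypothetical and not taken from the affected setup.

```python
from datetime import datetime, timedelta, timezone


def clock_skew(hardware: datetime, system: datetime) -> timedelta:
    """Return how far the hardware clock is ahead of the system clock."""
    return hardware - system


def is_skewed(hardware: datetime, system: datetime,
              tolerance: timedelta = timedelta(seconds=5)) -> bool:
    """True when the two clocks differ by more than the tolerance (assumed value)."""
    return abs(clock_skew(hardware, system)) > tolerance


# Illustrative values only: a system clock corrected to UTC, and a hardware
# clock running 5 hours 30 minutes ahead, as described for the hypervisor.
system_time = datetime(2017, 3, 28, 10, 0, 0, tzinfo=timezone.utc)
hardware_time = system_time + timedelta(hours=5, minutes=30)

if __name__ == "__main__":
    print("skew:", clock_skew(hardware_time, system_time))
    print("skewed:", is_skewed(hardware_time, system_time))
```

A check along these lines on each node before starting privileged containers could surface a mismatch like this one before the system clock is reset from the hardware clock.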