Bug 1436536

Summary: nodes going to NotReady state while running cns_deploy
Product: OpenShift Container Platform Reporter: Apeksha <akhakhar>
Component: Node Assignee: Seth Jennings <sjenning>
Status: CLOSED ERRATA QA Contact: DeShuai Ma <dma>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.5.0 CC: akhakhar, aos-bugs, jokerman, mmccomas, smunilla, tdawson
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-10 05:20:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1435165    
Attachments:
Description Flags
journalctl.log none

Description Apeksha 2017-03-28 06:56:19 UTC
Description of problem:
nodes going to NotReady state while running cns_deploy


Version-Release number of selected component (if applicable):
atomic-openshift-3.5.5-1.git.0.3f53382.el7.x86_64
docker-1.12.6-11.el7.x86_64
cns-deploy-4.0.0-9.el7rhgs.x86_64
heketi-client-4.0.0-4.el7rhgs.x86_64
openshift-ansible-3.5.45-1.git.0.eb0859b.el7.noarch

Steps to Reproduce:
1. Set up OpenShift
2. Set up the router
 oc get nodes shows all the nodes in Ready state
 [root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running   0          3h
[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-17.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-185.lab.eng.blr.redhat.com   Ready                      3h
3. Now run the cns_deploy command. It fails, saying the gluster pods were not found, and oc get nodes shows the nodes in NotReady state

Output of cns_deploy command:
('Using OpenShift CLI.\nNAME      STATUS    AGE\naplo      Active    3h\nUsing namespace "aplo".\nChecking that heketi pod is not running ... OK\ntemplate "deploy-heketi" created\nserviceaccount "heketi-service-account" created\ntemplate "heketi" created\ntemplate "glusterfs" created\nrole "edit" added: "system:serviceaccount:aplo:heketi-service-account"\nnode "dhcp47-185.lab.eng.blr.redhat.com" labeled\nnode "dhcp46-67.lab.eng.blr.redhat.com" labeled\nnode "dhcp47-17.lab.eng.blr.redhat.com" labeled\ndaemonset "glusterfs" created\nWaiting for GlusterFS pods to start ... pods not found.\nserviceaccount "heketi-service-account" deleted\nNo resources found\ntemplate "deploy-heketi" deleted\ntemplate "heketi" deleted\nnode "dhcp47-185.lab.eng.blr.redhat.com" labeled\nnode "dhcp46-67.lab.eng.blr.redhat.com" labeled\nnode "dhcp47-17.lab.eng.blr.redhat.com" labeled\ntemplate "glusterfs" deleted\n', 'Error from server (NotFound): services "heketi" not found\nError from server (NotFound): services "heketi-storage-endpoints" not found\nError from server (NotFound): deploymentconfig "heketi" not found\nError from server (NotFound): routes "heketi" not found\nError from server (NotFound): secrets "heketi-db-backup" not found\nerror: timed out waiting for the condition\n')

('', 'curl: (3) <url> malformed\n')
('', 'Server must be provided\n')
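
The tuples above look like Python (stdout, stderr) pairs printed by a test harness, which makes the cns_deploy output hard to read. Below is a minimal sketch, assuming that format, of how such a pair could be decoded back into lines. `decode_captured` and the shortened `SAMPLE` are illustrative names and not part of cns-deploy.

```python
import ast

# Abbreviated excerpt of the captured pair from this report (assumption:
# the harness printed the repr() of a (stdout, stderr) tuple).
SAMPLE = r"""('Using OpenShift CLI.\nWaiting for GlusterFS pods to start ... pods not found.\n', 'error: timed out waiting for the condition\n')"""


def decode_captured(raw):
    """Turn the printed tuple back into two lists of lines."""
    stdout, stderr = ast.literal_eval(raw)  # safely evaluates the literal only
    return stdout.splitlines(), stderr.splitlines()


out_lines, err_lines = decode_captured(SAMPLE)
for line in out_lines:
    print("stdout:", line)
for line in err_lines:
    print("stderr:", line)
```

Read this way, the last stdout message before cleanup is the GlusterFS pod wait, and stderr ends with the timeout.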

[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-17.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-185.lab.eng.blr.redhat.com   NotReady                   3h
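
To check for this state from a script instead of by eye, the table above can be parsed. This is a rough sketch that assumes the three-column `oc get nodes` layout shown in this report; `not_ready_nodes` is a hypothetical helper name.

```python
from typing import List

# Table copied from this report.
SAMPLE = """\
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-17.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-185.lab.eng.blr.redhat.com   NotReady                   3h
"""


def not_ready_nodes(table: str) -> List[str]:
    """Return the names of nodes whose STATUS column includes NotReady."""
    names = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed rows
        name, status = fields[0], fields[1]
        # STATUS may be a comma-separated list, e.g. "Ready,SchedulingDisabled".
        if "NotReady" in status.split(","):
            names.append(name)
    return names


print(not_ready_nodes(SAMPLE))
```

For the table in this report it reports the three labeled storage nodes and leaves out the master, which stays Ready,SchedulingDisabled.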

[root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS        RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running       0          3h
glusterfs-bx4fz       0/1       Terminating   0          31m
glusterfs-mtzhf       0/1       Terminating   0          31m
glusterfs-r2q8k       0/1       Terminating   0          31m

Additional info: 
http://pastebin.test.redhat.com/469095

Comment 1 Jhon Honce 2017-03-28 15:24:38 UTC
Please attach the docker.log from this incident with the -D option turned on.  Thanks.

Comment 2 Apeksha 2017-03-30 07:25:37 UTC
Created attachment 1267431 [details]
journalctl.log

Comment 3 Apeksha 2017-03-30 07:26:37 UTC
Jhon,

None of the containers are running, so I could not get the docker log, but I have attached the journalctl logs.

I see the following messages in journalctl:
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357164   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357191   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:42 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:42.016953   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:07:48 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:48.399563   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:48 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:48.399596   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:52 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:52.016936   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:07:58 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:58.452112   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:58 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:58.452134   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:08:02 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:02.048062   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:08:06 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:06.623274   11085 operation_executor.go:917] MountVolume.SetUp succeeded for volum
Mar 30 17:08:08 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:08.481778   11085 conversion.go:134] failed to handle multiple devices for contain

I have mailed you the setup details; the setup is in the same state for you to debug.
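
The journal excerpt above repeats a few messages many times. One way to see which source locations dominate the full journalctl.log is to tally them. This sketch assumes the node's klog/glog-style prefix seen in the excerpt (severity letter, MMDD, time, PID, file:line); `tally_sources` and the pattern are illustrative, and the sample lines are kept truncated exactly as they appear in this report.

```python
import re
from collections import Counter

# Matches e.g. "I0330 17:07:38.357164   11085 conversion.go:134] message"
GLOG = re.compile(
    r"\b[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ (?P<src>[\w.]+:\d+)\] (?P<msg>.*)"
)

SAMPLE = '''\
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357164   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357191   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:42 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:42.016953   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:08:06 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:06.623274   11085 operation_executor.go:917] MountVolume.SetUp succeeded for volum
'''


def tally_sources(lines):
    """Count journal lines per Go source location (file:line)."""
    counts = Counter()
    for line in lines:
        match = GLOG.search(line)
        if match:
            counts[match.group("src")] += 1
    return counts


for src, n in tally_sources(SAMPLE.splitlines()).most_common():
    print(n, src)
```

On the attached log the same function could be fed `open("journalctl.log")` directly, since a file object iterates line by line.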

Comment 4 Jhon Honce 2017-04-03 22:45:20 UTC
Should have been resolved with https://github.com/kubernetes/kubernetes/pull/40095

Comment 5 Troy Dawson 2017-04-11 21:15:03 UTC
Do we know if that kubernetes pull request has made it into origin yet?

Comment 6 Seth Jennings 2017-04-12 20:06:11 UTC
Jhon was unclear about what exactly in 40095 should fix this, but it doesn't look like we have bumped aws-sdk-go for origin. It is still on v1.0.8, whereas the upstream PR bumps it to v1.6.10.

Comment 7 Seth Jennings 2017-04-12 20:23:25 UTC
Waiting on rebase to pull this into origin 1.6

Comment 8 Seth Jennings 2017-05-05 14:26:46 UTC
In Origin master, both cadvisor and aws-sdk-go are now at least the versions from the PR discussed in comment 6.

Comment 10 DeShuai Ma 2017-06-05 10:01:43 UTC
Could you help verify the bug?

Comment 11 DeShuai Ma 2017-07-06 06:37:42 UTC
Verified on OpenShift v3.6.135; the error no longer occurs.

Comment 13 errata-xmlrpc 2017-08-10 05:20:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716