Description of problem:
Nodes go to the NotReady state while running cns_deploy.

Version-Release number of selected component (if applicable):
atomic-openshift-3.5.5-1.git.0.3f53382.el7.x86_64
docker-1.12.6-11.el7.x86_64
cns-deploy-4.0.0-9.el7rhgs.x86_64
heketi-client-4.0.0-4.el7rhgs.x86_64
openshift-ansible-3.5.45-1.git.0.eb0859b.el7.noarch

Steps to Reproduce:
1. Set up OpenShift.
2. Set up the router. At this point oc get nodes shows all the nodes in the Ready state:

[root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running   0          3h

[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-17.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-185.lab.eng.blr.redhat.com   Ready                      3h

3. Run the cns_deploy command. It fails, reporting that the gluster pods were not found, and oc get nodes then shows the nodes in the NotReady state.

Output of cns_deploy command:
('Using OpenShift CLI.\nNAME STATUS AGE\naplo Active 3h\nUsing namespace "aplo".\nChecking that heketi pod is not running ... OK\ntemplate "deploy-heketi" created\nserviceaccount "heketi-service-account" created\ntemplate "heketi" created\ntemplate "glusterfs" created\nrole "edit" added: "system:serviceaccount:aplo:heketi-service-account"\nnode "dhcp47-185.lab.eng.blr.redhat.com" labeled\nnode "dhcp46-67.lab.eng.blr.redhat.com" labeled\nnode "dhcp47-17.lab.eng.blr.redhat.com" labeled\ndaemonset "glusterfs" created\nWaiting for GlusterFS pods to start ... pods not found.\nserviceaccount "heketi-service-account" deleted\nNo resources found\ntemplate "deploy-heketi" deleted\ntemplate "heketi" deleted\nnode "dhcp47-185.lab.eng.blr.redhat.com" labeled\nnode "dhcp46-67.lab.eng.blr.redhat.com" labeled\nnode "dhcp47-17.lab.eng.blr.redhat.com" labeled\ntemplate "glusterfs" deleted\n', 'Error from server (NotFound): services "heketi" not found\nError from server (NotFound): services "heketi-storage-endpoints" not found\nError from server (NotFound): deploymentconfig "heketi" not found\nError from server (NotFound): routes "heketi" not found\nError from server (NotFound): secrets "heketi-db-backup" not found\nerror: timed out waiting for the condition\n')
('', 'curl: (3) <url> malformed\n')
('', 'Server must be provided\n')

[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-17.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-185.lab.eng.blr.redhat.com   NotReady                   3h

[root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS        RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running       0          3h
glusterfs-bx4fz       0/1       Terminating   0          31m
glusterfs-mtzhf       0/1       Terminating   0          31m
glusterfs-r2q8k       0/1       Terminating   0          31m

Additional info:
http://pastebin.test.redhat.com/469095
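For anyone re-running the reproduction steps, the before/after node check can be scripted. The following is only a minimal sketch in Python that works on captured `oc get nodes` text like the output above; it does not call `oc` itself, and the names `not_ready_nodes` and `SAMPLE` are made up for illustration.

```python
from typing import List


def not_ready_nodes(oc_output: str) -> List[str]:
    """Return the names of nodes whose STATUS column includes NotReady."""
    nodes = []
    for line in oc_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, status = fields[0], fields[1]
        # STATUS can be comma-separated, e.g. "Ready,SchedulingDisabled".
        if "NotReady" in status.split(","):
            nodes.append(name)
    return nodes


# Captured text copied from the report (after the cns_deploy failure).
SAMPLE = """\
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-17.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-185.lab.eng.blr.redhat.com   NotReady                   3h
"""

if __name__ == "__main__":
    for node in not_ready_nodes(SAMPLE):
        print(node)
```

Splitting STATUS on commas avoids a substring match, so a node reported as Ready,SchedulingDisabled is not counted as NotReady.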
Please attach the docker.log from this incident with the -D option turned on. Thanks.
Created attachment 1267431 journalctl.log
John, none of the containers are running, so I could not get the docker log, but I have attached the journalctl logs. I see the following messages in journalctl:

Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357164 11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357191 11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:42 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:42.016953 11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:07:48 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:48.399563 11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:48 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:48.399596 11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:52 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:52.016936 11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:07:58 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:58.452112 11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:58 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:58.452134 11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:08:02 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:02.048062 11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:08:06 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:06.623274 11085 operation_executor.go:917] MountVolume.SetUp succeeded for volum
Mar 30 17:08:08 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:08.481778 11085 conversion.go:134] failed to handle multiple devices for contain

I have mailed you the setup details; the setup is in the same state for you to debug.
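To see which messages dominate a longer journal excerpt, the lines can be tallied by their source location (the `file.go:line]` field). This is a rough sketch in Python, assuming the log lines keep the format shown in this comment; `count_by_source` and `EXCERPT` are hypothetical names, and the sample lines are copied from the report with the same truncation.

```python
import re
from collections import Counter

# Matches the source-location field, e.g. " conversion.go:134]".
SOURCE_RE = re.compile(r"\s(\w+\.go:\d+)\]")


def count_by_source(journal_text):
    """Count journal lines per Go source location (file.go:line)."""
    counts = Counter()
    for line in journal_text.splitlines():
        match = SOURCE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines from the attached journal; they are truncated in the original.
EXCERPT = """\
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357164 11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357191 11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:42 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:42.016953 11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
"""

if __name__ == "__main__":
    for source, count in count_by_source(EXCERPT).most_common():
        print(source, count)
```

Running the same function over the full journalctl.log attachment would show how often the conversion.go:134 message repeats relative to the image pulls.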
Should have been resolved with https://github.com/kubernetes/kubernetes/pull/40095
Do we know if that kubernetes pull request has made it into origin yet?
John was unclear about what exactly in 40095 should fix this, but it doesn't look like we have bumped aws-sdk-go for origin. It is still on v1.0.8, whereas the upstream PR bumps it to v1.6.10.
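When checking whether a vendored dependency has reached the version the upstream PR expects, the tags have to be compared numerically, not as strings. A small sketch in Python, assuming plain vMAJOR.MINOR.PATCH tags like the two quoted above; `parse_version` and `is_at_least` are hypothetical helpers, and how the vendored version is looked up in origin is not covered here.

```python
def parse_version(tag):
    """Turn a tag such as 'v1.6.10' into the tuple (1, 6, 10)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def is_at_least(current, required):
    """Return True if `current` is the same as or newer than `required`."""
    return parse_version(current) >= parse_version(required)


if __name__ == "__main__":
    # Versions quoted in this comment: origin's copy vs. the upstream bump.
    print(is_at_least("v1.0.8", "v1.6.10"))
```

Comparing integer tuples matters here because a plain string comparison would rank "v1.6.10" below "v1.6.9".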
Waiting on rebase to pull this into origin 1.6
Both cadvisor and aws-sdk-go are at least the versions mentioned in the comment 6 PR in Origin master now.
Could you help verify the bug?
Verified on OpenShift v3.6.135; the error no longer occurs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716