Bug 1436536 - nodes going to NotReady state while running cns_deploy
Summary: nodes going to NotReady state while running cns_deploy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: 1435165
Reported: 2017-03-28 06:56 UTC by Apeksha
Modified: 2017-08-16 19:51 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-10 05:20:02 UTC
Target Upstream Version:
Embargoed:


Attachments
journalctl.log (220.29 KB, text/x-vhdl)
2017-03-30 07:25 UTC, Apeksha


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Description Apeksha 2017-03-28 06:56:19 UTC
Description of problem:
nodes going to NotReady state while running cns_deploy


Version-Release number of selected component (if applicable):
atomic-openshift-3.5.5-1.git.0.3f53382.el7.x86_64
docker-1.12.6-11.el7.x86_64
cns-deploy-4.0.0-9.el7rhgs.x86_64
heketi-client-4.0.0-4.el7rhgs.x86_64
openshift-ansible-3.5.45-1.git.0.eb0859b.el7.noarch

Steps to Reproduce:
1. Set up OpenShift
2. Set up the router
 oc get nodes shows all the nodes in the Ready state
 [root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running   0          3h
[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-17.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-185.lab.eng.blr.redhat.com   Ready                      3h
3. Now run the cns_deploy command. It fails, reporting that the gluster pods were not found, and oc get nodes shows the nodes in the NotReady state

Output of cns_deploy command:
('Using OpenShift CLI.\nNAME      STATUS    AGE\naplo      Active    3h\nUsing namespace "aplo".\nChecking that heketi pod is not running ... OK\ntemplate "deploy-heketi" created\nserviceaccount "heketi-service-account" created\ntemplate "heketi" created\ntemplate "glusterfs" created\nrole "edit" added: "system:serviceaccount:aplo:heketi-service-account"\nnode "dhcp47-185.lab.eng.blr.redhat.com" labeled\nnode "dhcp46-67.lab.eng.blr.redhat.com" labeled\nnode "dhcp47-17.lab.eng.blr.redhat.com" labeled\ndaemonset "glusterfs" created\nWaiting for GlusterFS pods to start ... pods not found.\nserviceaccount "heketi-service-account" deleted\nNo resources found\ntemplate "deploy-heketi" deleted\ntemplate "heketi" deleted\nnode "dhcp47-185.lab.eng.blr.redhat.com" labeled\nnode "dhcp46-67.lab.eng.blr.redhat.com" labeled\nnode "dhcp47-17.lab.eng.blr.redhat.com" labeled\ntemplate "glusterfs" deleted\n', 'Error from server (NotFound): services "heketi" not found\nError from server (NotFound): services "heketi-storage-endpoints" not found\nError from server (NotFound): deploymentconfig "heketi" not found\nError from server (NotFound): routes "heketi" not found\nError from server (NotFound): secrets "heketi-db-backup" not found\nerror: timed out waiting for the condition\n')

('', 'curl: (3) <url> malformed\n')
('', 'Server must be provided\n')

[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-17.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-185.lab.eng.blr.redhat.com   NotReady                   3h
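
The check above can be scripted when the reproduction is repeated. The following is only a minimal sketch, assuming the three-column NAME/STATUS/AGE layout shown in this report; the function name not_ready_nodes and the SAMPLE constant are hypothetical and are not part of oc or cns_deploy.

```python
from __future__ import annotations


def not_ready_nodes(oc_output: str) -> list[str]:
    """Return the names of nodes whose STATUS column contains NotReady.

    Assumes the plain-text NAME / STATUS / AGE table shown in this report.
    """
    nodes = []
    for line in oc_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full table row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[1]
        # STATUS can be comma-separated, e.g. "Ready,SchedulingDisabled".
        if "NotReady" in status.split(","):
            nodes.append(name)
    return nodes


# Output copied from the report above.
SAMPLE = """\
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-17.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-185.lab.eng.blr.redhat.com   NotReady                   3h
"""

if __name__ == "__main__":
    for node in not_ready_nodes(SAMPLE):
        print(node)
```

In practice the text passed to the function would be the captured output of oc get nodes rather than the embedded sample.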

[root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS        RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running       0          3h
glusterfs-bx4fz       0/1       Terminating   0          31m
glusterfs-mtzhf       0/1       Terminating   0          31m
glusterfs-r2q8k       0/1       Terminating   0          31m

Additional info: 
http://pastebin.test.redhat.com/469095

Comment 1 Jhon Honce 2017-03-28 15:24:38 UTC
Please attach the docker.log from this incident with the -D option turned on.  Thanks.

Comment 2 Apeksha 2017-03-30 07:25:37 UTC
Created attachment 1267431 [details]
journalctl.log

Comment 3 Apeksha 2017-03-30 07:26:37 UTC
Jhon,

None of the containers are running, so I could not get the docker log, but I have attached the journalctl logs.

I see the following messages in journalctl:
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357164   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357191   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:42 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:42.016953   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:07:48 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:48.399563   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:48 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:48.399596   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:52 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:52.016936   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:07:58 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:58.452112   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:58 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:58.452134   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:08:02 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:02.048062   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:08:06 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:06.623274   11085 operation_executor.go:917] MountVolume.SetUp succeeded for volum
Mar 30 17:08:08 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:08.481778   11085 conversion.go:134] failed to handle multiple devices for contain

I have mailed you the setup details; the setup is still in the same state for you to debug.
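
To see which messages dominate the attached journal, the lines can be grouped by the Go source location that appears in each entry. This is a rough sketch only: it assumes the line layout shown in the excerpt above (severity letter and date, time, PID, then file.go:line followed by a closing bracket), and the names LOCATION, count_by_location and SAMPLE are hypothetical.

```python
import re
from collections import Counter

# Matches the logging header seen in the excerpt, for example:
#   I0330 17:07:38.357164   11085 conversion.go:134]
# and captures the "file.go:line" part.
LOCATION = re.compile(
    r"\b[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ (\S+\.go:\d+)\]"
)


def count_by_location(journal_text):
    """Count journal lines per Go source location (file.go:line)."""
    counts = Counter()
    for line in journal_text.splitlines():
        match = LOCATION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the excerpt above (truncated as in the report).
SAMPLE = """\
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357164   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357191   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:42 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:42.016953   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
"""

if __name__ == "__main__":
    for location, count in count_by_location(SAMPLE).most_common():
        print(count, location)
```

Running it against the full journalctl.log attachment instead of the embedded sample would show how often each message repeats.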

Comment 4 Jhon Honce 2017-04-03 22:45:20 UTC
Should have been resolved with https://github.com/kubernetes/kubernetes/pull/40095

Comment 5 Troy Dawson 2017-04-11 21:15:03 UTC
Do we know if that kubernetes pull request has made it into origin yet?

Comment 6 Seth Jennings 2017-04-12 20:06:11 UTC
Jhon was unclear about what exactly in 40095 should fix this, but it doesn't look like we have bumped aws-sdk-go for origin. It is still on v1.0.8, whereas the upstream PR bumps it to v1.6.10.

Comment 7 Seth Jennings 2017-04-12 20:23:25 UTC
Waiting on rebase to pull this into origin 1.6

Comment 8 Seth Jennings 2017-05-05 14:26:46 UTC
Both cadvisor and aws-sdk-go in Origin master are now at least at the versions from the PR discussed in comment 6.

Comment 10 DeShuai Ma 2017-06-05 10:01:43 UTC
Could you help verify the bug?

Comment 11 DeShuai Ma 2017-07-06 06:37:42 UTC
Verified on OpenShift v3.6.135; the error no longer occurs.

Comment 13 errata-xmlrpc 2017-08-10 05:20:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

