Bug 1436536 - nodes going to NotReady state while running cns_deploy
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.5.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Seth Jennings
QA Contact: DeShuai Ma
Depends On:
Blocks: 1435165
Reported: 2017-03-28 02:56 EDT by Apeksha
Modified: 2017-08-16 15 EDT
CC: 6 users

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 01:20:02 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
journalctl.log (220.29 KB, text/x-vhdl)
2017-03-30 03:25 EDT, Apeksha


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1716 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 05:02:50 EDT

Description Apeksha 2017-03-28 02:56:19 EDT
Description of problem:
nodes going to NotReady state while running cns_deploy


Version-Release number of selected component (if applicable):
atomic-openshift-3.5.5-1.git.0.3f53382.el7.x86_64
docker-1.12.6-11.el7.x86_64
cns-deploy-4.0.0-9.el7rhgs.x86_64
heketi-client-4.0.0-4.el7rhgs.x86_64
openshift-ansible-3.5.45-1.git.0.eb0859b.el7.noarch

Steps to Reproduce:
1. Setup openshift
2. Setup router
 oc get nodes shows all the nodes in Ready state
 [root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS    RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running   0          3h
[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-17.lab.eng.blr.redhat.com    Ready                      3h
dhcp47-185.lab.eng.blr.redhat.com   Ready                      3h
3. Now run the cns_deploy command; it fails saying the gluster pods were not found, and oc get nodes shows the nodes in NotReady state

Output of cns_deploy command (stdout):
Using OpenShift CLI.
NAME      STATUS    AGE
aplo      Active    3h
Using namespace "aplo".
Checking that heketi pod is not running ... OK
template "deploy-heketi" created
serviceaccount "heketi-service-account" created
template "heketi" created
template "glusterfs" created
role "edit" added: "system:serviceaccount:aplo:heketi-service-account"
node "dhcp47-185.lab.eng.blr.redhat.com" labeled
node "dhcp46-67.lab.eng.blr.redhat.com" labeled
node "dhcp47-17.lab.eng.blr.redhat.com" labeled
daemonset "glusterfs" created
Waiting for GlusterFS pods to start ... pods not found.
serviceaccount "heketi-service-account" deleted
No resources found
template "deploy-heketi" deleted
template "heketi" deleted
node "dhcp47-185.lab.eng.blr.redhat.com" labeled
node "dhcp46-67.lab.eng.blr.redhat.com" labeled
node "dhcp47-17.lab.eng.blr.redhat.com" labeled
template "glusterfs" deleted

(stderr):
Error from server (NotFound): services "heketi" not found
Error from server (NotFound): services "heketi-storage-endpoints" not found
Error from server (NotFound): deploymentconfig "heketi" not found
Error from server (NotFound): routes "heketi" not found
Error from server (NotFound): secrets "heketi-db-backup" not found
error: timed out waiting for the condition

(stderr of follow-up commands):
curl: (3) <url> malformed
Server must be provided

[root@dhcp47-105 ~]# oc get nodes
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-17.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-185.lab.eng.blr.redhat.com   NotReady                   3h

[root@dhcp47-105 ~]# oc get pods
NAME                  READY     STATUS        RESTARTS   AGE
aplo-router-1-zh1bz   1/1       Running       0          3h
glusterfs-bx4fz       0/1       Terminating   0          31m
glusterfs-mtzhf       0/1       Terminating   0          31m
glusterfs-r2q8k       0/1       Terminating   0          31m

Additional info: 
http://pastebin.test.redhat.com/469095
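For triage, the NotReady nodes can be filtered out of the oc get nodes output mechanically. The helper below is a hypothetical sketch (not part of the report); it filters a captured sample via a here-document, and on a live cluster you would pipe oc get nodes into it instead.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column is not
# Ready (a trailing ",SchedulingDisabled" on a Ready node is still Ready).
list_not_ready() {
  awk 'NR > 1 && $2 != "Ready" && $2 !~ /^Ready,/ { print $1 }'
}
# Filter the captured sample; on a live cluster: oc get nodes | list_not_ready
list_not_ready <<'EOF'
NAME                                STATUS                     AGE
dhcp46-3.lab.eng.blr.redhat.com     Ready,SchedulingDisabled   3h
dhcp46-67.lab.eng.blr.redhat.com    NotReady                   3h
dhcp47-17.lab.eng.blr.redhat.com    NotReady                   3h
EOF
```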
Comment 1 Jhon Honce 2017-03-28 11:24:38 EDT
Please attach the docker.log from this incident with the -D option turned on.  Thanks.
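One way to satisfy this request is to enable the daemon's debug flag through daemon.json rather than editing the service unit. A minimal sketch, assuming a systemd-managed docker; the file is written to a local path first so it can be reviewed, and the privileged steps are left commented.

```shell
# Sketch for the request above (paths assumed): turn on docker daemon debug
# output (the -D option) via daemon.json, then capture the log from the journal.
cat > ./daemon.json <<'EOF'
{
  "debug": true
}
EOF
# As root, install the file and restart the daemon (commented out here):
# install -m 0644 ./daemon.json /etc/docker/daemon.json
# systemctl restart docker
# Then collect the debug log to attach to the bug:
# journalctl -u docker --since "1 hour ago" > docker.log
```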
Comment 2 Apeksha 2017-03-30 03:25 EDT
Created attachment 1267431 [details]
journalctl.log
Comment 3 Apeksha 2017-03-30 03:26:37 EDT
Jhon,

None of the containers are running, so I could not get the docker log, but I have attached the journalctl logs.

I see the following messages in journalctl:
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357164   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:38 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:38.357191   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:42 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:42.016953   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:07:48 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:48.399563   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:48 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:48.399596   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:52 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:52.016936   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:07:58 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:58.452112   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:07:58 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:07:58.452134   11085 conversion.go:134] failed to handle multiple devices for contain
Mar 30 17:08:02 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:02.048062   11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
Mar 30 17:08:06 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:06.623274   11085 operation_executor.go:917] MountVolume.SetUp succeeded for volum
Mar 30 17:08:08 dhcp46-67.lab.eng.blr.redhat.com atomic-openshift-node[11085]: I0330 17:08:08.481778   11085 conversion.go:134] failed to handle multiple devices for contain

I have mailed you the setup details, setup is in same state for you to debug.
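To gauge how often the repeating cadvisor message occurs, the attached log can simply be grepped. The pattern below is taken from the excerpt above; the here-document stands in for journalctl.log.

```shell
# Count occurrences of the repeating conversion.go message; against the real
# attachment: grep -c 'failed to handle multiple devices' journalctl.log
grep -c 'failed to handle multiple devices' <<'EOF'
I0330 17:07:38.357164 11085 conversion.go:134] failed to handle multiple devices for contain
I0330 17:07:42.016953 11085 kube_docker_client.go:328] Pulling image "rhgs3/rhgs-server-rhel
I0330 17:07:48.399563 11085 conversion.go:134] failed to handle multiple devices for contain
EOF
```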
Comment 4 Jhon Honce 2017-04-03 18:45:20 EDT
This should have been resolved by https://github.com/kubernetes/kubernetes/pull/40095
Comment 5 Troy Dawson 2017-04-11 17:15:03 EDT
Do we know if that kubernetes pull request has made it into origin yet?
Comment 6 Seth Jennings 2017-04-12 16:06:11 EDT
Jhon was unclear about what exactly in 40095 should fix this, but it doesn't look like we have bumped aws-sdk-go for origin.  It is still on v1.0.8, whereas the upstream PR bumps it to v1.6.10.
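The version gap described here can be checked mechanically. A sketch using GNU sort -V, with the two versions from this comment hard-coded as an illustration; in an origin checkout the current pin would come from the vendoring metadata instead.

```shell
# Hypothetical check: is a vendored dependency at or above a required floor?
# Uses GNU `sort -V` for version ordering; the versions are from this comment.
at_least() {  # at_least CURRENT FLOOR -> succeeds if CURRENT >= FLOOR
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
at_least v1.0.8 v1.6.10 || echo "aws-sdk-go needs a bump"
```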
Comment 7 Seth Jennings 2017-04-12 16:23:25 EDT
Waiting on rebase to pull this into origin 1.6
Comment 8 Seth Jennings 2017-05-05 10:26:46 EDT
Both cadvisor and aws-sdk-go are now at least the versions mentioned in the comment 6 PR in Origin master.
Comment 10 DeShuai Ma 2017-06-05 06:01:43 EDT
Could you help verify the bug?
Comment 11 DeShuai Ma 2017-07-06 02:37:42 EDT
Verified on openshift v3.6.135; this error no longer occurs.
Comment 13 errata-xmlrpc 2017-08-10 01:20:02 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
