Bug 1388867 - node service restart failed when a pod is running on this node
Summary: node service restart failed when a pod is running on this node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard: aos-scalability-34
Depends On:
Blocks: OSOPS_V3
 
Reported: 2016-10-26 10:39 UTC by Johnny Liu
Modified: 2017-03-08 18:43 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 12:46:12 UTC
Target Upstream Version:
Embargoed:


Attachments
node start failure log (388.29 KB, text/x-vhdl)
2016-10-26 10:39 UTC, Johnny Liu


Links
Origin (Github) 11613: None (last updated 2016-11-01 14:01:44 UTC)
Red Hat Product Errata RHBA-2017:0066: SHIPPED_LIVE, Red Hat OpenShift Container Platform 3.4 RPM Release Advisory (last updated 2017-01-18 17:23:26 UTC)

Description Johnny Liu 2016-10-26 10:39:23 UTC
Created attachment 1214248 [details]
node start failure log

Description of problem:
This bug is cloned from https://bugzilla.redhat.com/show_bug.cgi?id=1388288#c7.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.4.0.15+9c963ec
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1
# rpm -q docker
docker-1.10.3-57.el7.x86_64

How reproducible:
Always

Steps to Reproduce:

1. Install the environment successfully with the "redhat/openshift-ovs-multitenant" plugin.
# openshift version
openshift v3.4.0.15+9c963ec
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1
# rpm -q docker
docker-1.10.3-57.el7.x86_64
# oc get nodes
NAME                           STATUS                     AGE
ip-172-18-10-70.ec2.internal   Ready                      1h
ip-172-18-6-3.ec2.internal     Ready,SchedulingDisabled   1h

2. Make sure there is no pod running on the node.
# oc scale --replicas=0 dc/registry-console

3. Restart the node service; it restarts successfully.

4. Make sure there is a pod running on the node.
# oc scale --replicas=1 dc/registry-console
# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
registry-console-1-k2brf   1/1       Running   0          3m

5. Restart the node service; the restart fails.
# service atomic-openshift-node restart
Redirecting to /bin/systemctl restart  atomic-openshift-node.service
Job for atomic-openshift-node.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
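
For convenience, the steps above can also be run end to end. The following is a minimal reproduction sketch, assuming the same dc/registry-console deployment used above and a systemd-managed atomic-openshift-node unit; adjust names to your environment.

#!/bin/bash
# Reproduction sketch for the node restart failure (assumptions noted above).

# 1) With no pod on the node, the restart is expected to succeed.
oc scale --replicas=0 dc/registry-console
sleep 30                                    # give the pod time to terminate
systemctl restart atomic-openshift-node && echo "restart OK (no pods)"

# 2) With a pod on the node, the restart times out on the affected builds.
oc scale --replicas=1 dc/registry-console
oc get pods                                 # wait until the pod is Running
systemctl restart atomic-openshift-node || echo "restart FAILED (pod running)"
systemctl status atomic-openshift-node      # shows the timeout details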

Actual results:
Restarting the node service fails.

Expected results:
The node service restarts successfully.

Additional info:

Comment 1 Meng Bo 2016-10-26 11:11:18 UTC
This is not related to the plugin type; the problem exists in both the subnet and multitenant environments.

The relevant log lines, from my viewpoint, are:
Oct 25 08:26:01 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:01.979679   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:26:31 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:31.980867   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:26:36 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:36.981065   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:27:02 ip-172-18-24-156.ec2.internal atomic-openshift-node[92947]: I1025 08:27:02.257550   92947 kubelet.go:2240] skipping pod synchronization - [network state unknown container runtime is down]

It seems the node/kubelet cannot get the correct pod status, or cannot bring the existing pods back up, after restarting.
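
To confirm a node is stuck in this state, the journal and the SDN prerequisites can be checked directly. A minimal sketch, assuming the node runs as the atomic-openshift-node systemd unit, OVS runs as the openvswitch unit, and the SDN bridge is br0; the node name may differ from the hostname.

#!/bin/bash
# Check whether the kubelet is still skipping pod synchronization because
# the SDN pod network has not come up.
journalctl -u atomic-openshift-node --since "10 min ago" \
  | grep -E "skipping pod synchronization|SDN pod network is not ready" \
  && echo "node is still waiting for the SDN pod network"

# The openvswitch service and the br0 bridge should already be up.
systemctl is-active openvswitch
ovs-vsctl br-exists br0 && echo "br0 bridge exists"

# The node should go back to Ready once pod synchronization resumes.
oc get node "$(hostname)" -o wide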

Comment 2 Ben Bennett 2016-10-27 12:49:25 UTC
Is this a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1388556 ?

Comment 3 Dan Williams 2016-10-28 22:01:34 UTC
Any chance you can get more of the node's logs, better yet with --loglevel=5?

Comment 4 Johnny Liu 2016-10-29 01:27:50 UTC
(In reply to Dan Williams from comment #3)
> Any chance you can get more of the node's logs, better yet with
> --loglevel=5?

The node logs were already collected at --loglevel=5.
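
For reference, one way to raise the node log level and capture the journal around a restart; a minimal sketch, assuming the stock sysconfig layout where OPTIONS is read from /etc/sysconfig/atomic-openshift-node (the exact file may differ between installs).

#!/bin/bash
# Bump the node log level to 5, restart, and save the journal for analysis.
sed -i 's/^OPTIONS=.*/OPTIONS=--loglevel=5/' /etc/sysconfig/atomic-openshift-node
# On affected builds this restart may time out; the journal still captures it.
systemctl restart atomic-openshift-node
journalctl -u atomic-openshift-node --since "10 min ago" > node-start-failure.log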

Comment 6 Ben Bennett 2016-11-01 14:03:53 UTC
Can't be MODIFIED until the PR is merged.

Comment 7 Troy Dawson 2016-11-04 18:50:31 UTC
This has been merged into ose and is in OSE v3.4.0.22 or newer.

Comment 9 Johnny Liu 2016-11-07 08:33:19 UTC
Verified this bug with atomic-openshift-3.4.0.22-1.git.0.5c56720.el7.x86_64, and it passes.

The node service now restarts successfully.

Comment 11 errata-xmlrpc 2017-01-18 12:46:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

