Bug 1388867

Summary: node service restart failed when a pod is running on this node
Product: OpenShift Container Platform
Reporter: Johnny Liu <jialiu>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: high
Priority: high
Version: 3.4.0
CC: aos-bugs, bbennett, dakini, ekuric, eparis, haowang, jeder, jialiu, tdawson, vlaad, wmeng
Target Milestone: ---
Target Release: ---
Keywords: TestBlocker
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-34
Doc Type: No Doc Update
Type: Bug
Last Closed: 2017-01-18 12:46:12 UTC
Bug Blocks: 1303130
Attachments: node start failure log

Description Johnny Liu 2016-10-26 10:39:23 UTC
Created attachment 1214248 [details]
node start failure log

Description of problem:
Cloned from https://bugzilla.redhat.com/show_bug.cgi?id=1388288#c7.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.4.0.15+9c963ec
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1
# rpm -q docker
docker-1.10.3-57.el7.x86_64

How reproducible:
Always

Steps to Reproduce:

1. Install the environment successfully with the "redhat/openshift-ovs-multitenant" network plugin.
# oc get nodes
NAME                           STATUS                     AGE
ip-172-18-10-70.ec2.internal   Ready                      1h
ip-172-18-6-3.ec2.internal     Ready,SchedulingDisabled   1h

2. Make sure there is no pod running on the node.
# oc scale --replicas=0 dc/registry-console

3. Restart the node service; it restarts successfully.

4. Make sure there is a pod running on the node.
# oc scale --replicas=1 dc/registry-console
# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
registry-console-1-k2brf   1/1       Running   0          3m

5. Restart the node service; this fails (see the log-collection sketch below).
# service atomic-openshift-node restart
Redirecting to /bin/systemctl restart  atomic-openshift-node.service
Job for atomic-openshift-node.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
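
To collect the failure details referenced in the error above, one way (a minimal sketch; it assumes the node unit logs to the systemd journal, which is the default here) is:

# systemctl status atomic-openshift-node.service -l --no-pager
# journalctl -u atomic-openshift-node.service --since "10 minutes ago" --no-pager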

Actual results:
Restarting the node service fails with a timeout.

Expected results:
The node service restarts successfully.

Additional info:

Comment 1 Meng Bo 2016-10-26 11:11:18 UTC
It is not related to the plugin type; the problem exists in both subnet and multitenant environments.

The most relevant log lines, from my point of view, are:
Oct 25 08:26:01 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:01.979679   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:26:31 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:31.980867   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:26:36 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:36.981065   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:27:02 ip-172-18-24-156.ec2.internal atomic-openshift-node[92947]: I1025 08:27:02.257550   92947 kubelet.go:2240] skipping pod synchronization - [network state unknown container runtime is down]

It seems the node/kubelet cannot get the correct pod status, or cannot bring the existing pods back up, after restarting.
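
If it helps narrow this down, the readiness messages can be pulled out of the journal with something like (a sketch; the unit name is the one shown in the logs above):

# journalctl -u atomic-openshift-node --no-pager | grep -E "SDN pod network is not ready|network state unknown"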

Comment 2 Ben Bennett 2016-10-27 12:49:25 UTC
Is this a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1388556 ?

Comment 3 Dan Williams 2016-10-28 22:01:34 UTC
Any chance you can get more of the node's logs, and better yet with --loglevel=5?

Comment 4 Johnny Liu 2016-10-29 01:27:50 UTC
(In reply to Dan Williams from comment #3)
> Any chance you can get more of the node's logs, and better yet with
> --loglevel=5?

The node logs were collected at --loglevel=5.
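
For reference, one typical way to raise the node verbosity before restarting (a sketch; it assumes the node flags live in OPTIONS= in /etc/sysconfig/atomic-openshift-node, which may vary by install):

# sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
# systemctl restart atomic-openshift-node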

Comment 6 Ben Bennett 2016-11-01 14:03:53 UTC
This can't be moved to MODIFIED until the PR is merged.

Comment 7 Troy Dawson 2016-11-04 18:50:31 UTC
This has been merged into OSE and is in OSE v3.4.0.22 or newer.

Comment 9 Johnny Liu 2016-11-07 08:33:19 UTC
Verified this bug with atomic-openshift-3.4.0.22-1.git.0.5c56720.el7.x86_64, and PASS.

The node service now restarts successfully.
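
For anyone re-running the verification, a minimal check (sketch) is to restart the unit, confirm it is active, and confirm the existing pod comes back:

# systemctl restart atomic-openshift-node && systemctl is-active atomic-openshift-node
# oc get po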

Comment 11 errata-xmlrpc 2017-01-18 12:46:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066