Bug 1388867 - node service restart failed when a pod is running on this node
Summary: node service restart failed when a pod is running on this node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard: aos-scalability-34
Depends On:
Blocks: OSOPS_V3
 
Reported: 2016-10-26 10:39 UTC by Johnny Liu
Modified: 2017-03-08 18:43 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 12:46:12 UTC
Target Upstream Version:
Embargoed:


Attachments
node start failure log (388.29 KB, text/x-vhdl)
2016-10-26 10:39 UTC, Johnny Liu


Links
Origin (Github) 11613: None (last updated 2016-11-01 14:01:44 UTC)
Red Hat Product Errata RHBA-2017:0066: SHIPPED_LIVE, Red Hat OpenShift Container Platform 3.4 RPM Release Advisory (last updated 2017-01-18 17:23:26 UTC)

Description Johnny Liu 2016-10-26 10:39:23 UTC
Created attachment 1214248 [details]
node start failure log

Description of problem:
This bug is cloned from https://bugzilla.redhat.com/show_bug.cgi?id=1388288#c7.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.4.0.15+9c963ec
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1
# rpm -q docker
docker-1.10.3-57.el7.x86_64

How reproducible:
Always

Steps to Reproduce:

1. Install the environment successfully with the "redhat/openshift-ovs-multitenant" plugin.
# openshift version
openshift v3.4.0.15+9c963ec
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1
# rpm -q docker
docker-1.10.3-57.el7.x86_64
# oc get nodes
NAME                           STATUS                     AGE
ip-172-18-10-70.ec2.internal   Ready                      1h
ip-172-18-6-3.ec2.internal     Ready,SchedulingDisabled   1h

2. Make sure there is no pod running on the node.
# oc scale --replicas=0 dc/registry-console

3. Restart the node service; it restarts successfully.

4. Make sure there is a pod running on the node.
# oc scale --replicas=1 dc/registry-console
# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
registry-console-1-k2brf   1/1       Running   0          3m

5. Restart the node service; the restart fails.
# service atomic-openshift-node restart
Redirecting to /bin/systemctl restart  atomic-openshift-node.service
Job for atomic-openshift-node.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
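
For convenience, the steps above can also be run end to end. The following is a minimal reproduction sketch, assuming the same dc/registry-console deployment used above and a systemd-managed atomic-openshift-node unit; adjust names to your environment.

#!/bin/bash
# Reproduction sketch for the node restart failure (assumptions noted above).

# 1) With no pod on the node, the restart is expected to succeed.
oc scale --replicas=0 dc/registry-console
sleep 30                                    # give the pod time to terminate
systemctl restart atomic-openshift-node && echo "restart OK (no pods)"

# 2) With a pod on the node, the restart times out on the affected builds.
oc scale --replicas=1 dc/registry-console
oc get pods                                 # wait until the pod is Running
systemctl restart atomic-openshift-node || echo "restart FAILED (pod running)"
systemctl status atomic-openshift-node      # shows the timeout details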

Actual results:
Restarting the node service fails.

Expected results:
The node service restarts successfully.

Additional info:

Comment 1 Meng Bo 2016-10-26 11:11:18 UTC
This is not related to the plugin type; the problem exists in both the subnet and multitenant environments.

The relevant log lines, from my viewpoint, are:
Oct 25 08:26:01 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:01.979679   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:26:31 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:31.980867   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:26:36 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:36.981065   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:27:02 ip-172-18-24-156.ec2.internal atomic-openshift-node[92947]: I1025 08:27:02.257550   92947 kubelet.go:2240] skipping pod synchronization - [network state unknown container runtime is down]

It seems the node/kubelet cannot get the correct pod status, or cannot bring the existing pods back up, after restarting.
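
To confirm a node is stuck in this state, the journal and the SDN prerequisites can be checked directly. A minimal sketch, assuming the node runs as the atomic-openshift-node systemd unit, OVS runs as the openvswitch unit, and the SDN bridge is br0; the node name may differ from the hostname.

#!/bin/bash
# Check whether the kubelet is still skipping pod synchronization because
# the SDN pod network has not come up.
journalctl -u atomic-openshift-node --since "10 min ago" \
  | grep -E "skipping pod synchronization|SDN pod network is not ready" \
  && echo "node is still waiting for the SDN pod network"

# The openvswitch service and the br0 bridge should already be up.
systemctl is-active openvswitch
ovs-vsctl br-exists br0 && echo "br0 bridge exists"

# The node should go back to Ready once pod synchronization resumes.
oc get node "$(hostname)" -o wide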

Comment 2 Ben Bennett 2016-10-27 12:49:25 UTC
Is this a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1388556 ?

Comment 3 Dan Williams 2016-10-28 22:01:34 UTC
Any chance you can get more of the node's logs, better yet with --loglevel=5?

Comment 4 Johnny Liu 2016-10-29 01:27:50 UTC
(In reply to Dan Williams from comment #3)
> Any chance you can get more of the node's logs, better yet with
> --loglevel=5?

The node logs were already collected at --loglevel=5.
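
For reference, one way to raise the node log level and capture the journal around a restart; a minimal sketch, assuming the stock sysconfig layout where OPTIONS is read from /etc/sysconfig/atomic-openshift-node (the exact file may differ between installs).

#!/bin/bash
# Bump the node log level to 5, restart, and save the journal for analysis.
sed -i 's/^OPTIONS=.*/OPTIONS=--loglevel=5/' /etc/sysconfig/atomic-openshift-node
# On affected builds this restart may time out; the journal still captures it.
systemctl restart atomic-openshift-node
journalctl -u atomic-openshift-node --since "10 min ago" > node-start-failure.log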

Comment 6 Ben Bennett 2016-11-01 14:03:53 UTC
Can't be MODIFIED until the PR is merged.

Comment 7 Troy Dawson 2016-11-04 18:50:31 UTC
This has been merged into ose and is in OSE v3.4.0.22 or newer.

Comment 9 Johnny Liu 2016-11-07 08:33:19 UTC
Verified this bug with atomic-openshift-3.4.0.22-1.git.0.5c56720.el7.x86_64, and it passes.

The node service now restarts successfully.

Comment 11 errata-xmlrpc 2017-01-18 12:46:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

