Bug 1388867

Summary: node service restart failed when a pod is running on this node
Product: OpenShift Container Platform
Reporter: Johnny Liu <jialiu>
Component: Networking
Assignee: Dan Williams <dcbw>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: high
Priority: high
Version: 3.4.0
CC: aos-bugs, bbennett, dakini, ekuric, eparis, haowang, jeder, jialiu, tdawson, vlaad, wmeng
Target Milestone: ---
Target Release: ---
Keywords: TestBlocker
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-34
Doc Type: No Doc Update
Type: Bug
Last Closed: 2017-01-18 12:46:12 UTC
Bug Blocks: 1303130
Attachments: node start failure log

Description Johnny Liu 2016-10-26 10:39:23 UTC
Created attachment 1214248 [details]
node start failure log

Description of problem:
Cloned from https://bugzilla.redhat.com/show_bug.cgi?id=1388288#c7.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.4.0.15+9c963ec
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1
# rpm -q docker
docker-1.10.3-57.el7.x86_64

How reproducible:
Always

Steps to Reproduce:

1. Install the environment successfully with the "redhat/openshift-ovs-multitenant" network plugin.
# oc get nodes
NAME                           STATUS                     AGE
ip-172-18-10-70.ec2.internal   Ready                      1h
ip-172-18-6-3.ec2.internal     Ready,SchedulingDisabled   1h

2. Make sure there is no pod running on the node.
# oc scale --replicas=0 dc/registry-console

3. Restart the node service; it restarts successfully.

4. Make sure there is a pod running on the node.
# oc scale --replicas=1 dc/registry-console
# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
registry-console-1-k2brf   1/1       Running   0          3m

5. Restart the node service; this fails (see the log-collection sketch below).
# service atomic-openshift-node restart
Redirecting to /bin/systemctl restart  atomic-openshift-node.service
Job for atomic-openshift-node.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-node.service" and "journalctl -xe" for details.
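
To collect the failure details referenced in the error above, one way (a minimal sketch; it assumes the node unit logs to the systemd journal, which is the default here) is:

# systemctl status atomic-openshift-node.service -l --no-pager
# journalctl -u atomic-openshift-node.service --since "10 minutes ago" --no-pager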

Actual results:
Restarting the node service fails with a timeout.

Expected results:
The node service restarts successfully.

Additional info:

Comment 1 Meng Bo 2016-10-26 11:11:18 UTC
It is not related to the plugin type; the problem exists in both subnet and multitenant environments.

The most relevant log lines, from my point of view, are:
Oct 25 08:26:01 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:01.979679   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:26:31 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:31.980867   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:26:36 ip-172-18-24-156.ec2.internal atomic-openshift-node[92648]: I1025 08:26:36.981065   92648 kubelet.go:2240] skipping pod synchronization - [SDN pod network is not ready]
Oct 25 08:27:02 ip-172-18-24-156.ec2.internal atomic-openshift-node[92947]: I1025 08:27:02.257550   92947 kubelet.go:2240] skipping pod synchronization - [network state unknown container runtime is down]

It seems the node/kubelet cannot get the correct pod status, or cannot bring the existing pods back up, after restarting.
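
If it helps narrow this down, the readiness messages can be pulled out of the journal with something like (a sketch; the unit name is the one shown in the logs above):

# journalctl -u atomic-openshift-node --no-pager | grep -E "SDN pod network is not ready|network state unknown"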

Comment 2 Ben Bennett 2016-10-27 12:49:25 UTC
Is this a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1388556 ?

Comment 3 Dan Williams 2016-10-28 22:01:34 UTC
Any chance you can get more of the node's logs, and better yet with --loglevel=5?

Comment 4 Johnny Liu 2016-10-29 01:27:50 UTC
(In reply to Dan Williams from comment #3)
> Any chance you can get more of the node's logs, and better yet with
> --loglevel=5?

The node logs were collected at --loglevel=5.
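
For reference, one typical way to raise the node verbosity before restarting (a sketch; it assumes the node flags live in OPTIONS= in /etc/sysconfig/atomic-openshift-node, which may vary by install):

# sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
# systemctl restart atomic-openshift-node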

Comment 6 Ben Bennett 2016-11-01 14:03:53 UTC
This can't be moved to MODIFIED until the PR is merged.

Comment 7 Troy Dawson 2016-11-04 18:50:31 UTC
This has been merged into OSE and is in OSE v3.4.0.22 or newer.

Comment 9 Johnny Liu 2016-11-07 08:33:19 UTC
Verified this bug with atomic-openshift-3.4.0.22-1.git.0.5c56720.el7.x86_64, and PASS.

The node service now restarts successfully.
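
For anyone re-running the verification, a minimal check (sketch) is to restart the unit, confirm it is active, and confirm the existing pod comes back:

# systemctl restart atomic-openshift-node && systemctl is-active atomic-openshift-node
# oc get po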

Comment 11 errata-xmlrpc 2017-01-18 12:46:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066