Bug 1300582 - Pod network connection is not available after restarting the node service.
Summary: Pod network connection is not available after restarting the node service.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Eric Paris
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-21 08:32 UTC by Johnny Liu
Modified: 2016-01-29 20:58 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-29 20:58:26 UTC
Target Upstream Version:
Embargoed:



Description Johnny Liu 2016-01-21 08:32:44 UTC
Description of problem:
Deploy an app and access its endpoint successfully, then restart the node service. After the restart, the app's endpoint is no longer reachable.

Version-Release number of selected component (if applicable):
openshift v3.1.1.5
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
Always

Steps to Reproduce:
1. Deploy the docker-registry in the OSE environment as admin.
2. Log in as a normal user and deploy an app:
$ oc new-app https://github.com/openshift/simple-openshift-sinatra-sti.git -i openshift/ruby
3. Access the endpoints of the docker-registry and the created app.
4. Restart the node service (see the command sketch after this list).
5. Try to access the endpoints of the docker-registry and the created app again.
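
A minimal command sketch of steps 4-5, assuming the node service unit is named atomic-openshift-node (as in the node logs below) and using the pod IPs recorded in step 3:

# Restart the node service (assumed unit name)
systemctl restart atomic-openshift-node
# Re-check the pod endpoints recorded in step 3
curl 10.2.0.5:5000; echo $?
curl 10.2.0.8:8080; echo $?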

Actual results:
Step 3:
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   0          7m
[root@openshift-146 ~]# oc describe po docker-registry-1-6ax1i|grep IP
IP:				10.2.0.5
[root@openshift-146 ~]# curl 10.2.0.5:5000; echo $?
0
[jialiu@jialiu-laptop ~]$ oc get po
NAME                                   READY     STATUS      RESTARTS   AGE
simple-openshift-sinatra-sti-1-build   0/1       Completed   0          7m
simple-openshift-sinatra-sti-1-k0vtx   1/1       Running     0          4m
[jialiu@jialiu-laptop ~]$ oc describe po simple-openshift-sinatra-sti-1-k0vtx|grep IP
IP:				10.2.0.8
# curl 10.2.0.8:8080; echo $?
Hello, Sinatra!0

Step 4:
The docker-registry pod goes into a restart loop.
# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   5          13m
[root@openshift-146 ~]# curl 10.2.0.5:5000
curl: (7) Failed connect to 10.2.0.5:5000; No route to host


Here is node logs:
<--snip-->
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.468901   46410 manager.go:1776] pod "docker-registry-1-6ax1i_default" container "registry" is unhealthy, it will be killed and re-created.
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.468944   46410 manager.go:1419] Killing container "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28 registry default/docker-registry-1-6ax1i" with 30 second grace period
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.754880   46410 manager.go:1451] Container "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28 registry default/docker-registry-1-6ax1i" exited after 285.908772ms
Jan 21 14:47:46 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:46.018982   46410 helpers.go:96] Unable to get network stats from pid 47166: couldn't read network stats: failure opening /proc/47166/net/dev: open /proc/47166/net/dev: no such file or directory
Jan 21 14:47:54 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: W0121 14:47:54.467809   46410 prober.go:96] No ref for pod {"docker" "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28"} - 'registry'
Jan 21 14:47:54 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:54.467907   46410 prober.go:105] Liveness probe for "docker-registry-1-6ax1i_default:registry" failed (failure): Get http://10.2.0.5:5000/healthz: dial tcp 10.2.0.5:5000: no route to ho
<--snip-->

Trying to access the created app also fails:
# curl 10.2.0.8:8080; echo $?
curl: (7) Failed connect to 10.2.0.8:8080; No route to host
7


Expected results:
After the node service is restarted, running pods should still be reachable.

Additional info:
Workaround:
Run "oc delete po --all" to recreate new pod by rc, then pod's endpoint will be available.
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   27         20m
[root@openshift-146 ~]# oc delete po --all
pod "docker-registry-1-6ax1i" deleted
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-9cp84   1/1       Running   0          25s
[root@openshift-146 ~]# oc describe pod docker-registry-1-9cp84|grep IP
IP:				10.2.0.9
[root@openshift-146 ~]# curl 10.2.0.9:5000; echo $?
0

Comment 1 Dan Winship 2016-01-25 16:14:09 UTC
I'm guessing that after the restart, "ovs-ofctl -O OpenFlow13 show br0" no longer shows the pod's veth device being connected to OVS. Can you confirm that?
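
A quick way to run that check (a sketch, assuming the SDN bridge is named br0 as in the command above):

# List the ports attached to the OVS bridge and look for the pod's veth
ovs-ofctl -O OpenFlow13 show br0 | grep veth
# Compare with the veth devices that exist on the host
ip link show type veth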

Comment 3 Dan Winship 2016-01-25 19:28:55 UTC
(In reply to Eric Paris from comment #2)
> believed this might be a dup of 
> https://bugzilla.redhat.com/show_bug.cgi?id=1275904
> https://bugzilla.redhat.com/show_bug.cgi?id=1299756

It's a dup of the latter, but neither bug is a dup of the former. I was confused earlier.

This is fixed by https://github.com/openshift/origin/pull/6684, but that didn't manage to land before the freeze due to test flakes.

Submitted a PR against OSE enterprise-3.1 with just the minimal relevant bits here: https://github.com/openshift/ose/pull/129

Comment 4 Johnny Liu 2016-01-26 05:36:23 UTC
(In reply to Dan Winship from comment #1)
> I'm guessing that after the restart, "ovs-ofctl -O OpenFlow13 show br0" no
> longer shows the pod's veth device being connected to OVS. Can you confirm
> that?

Yeah, you are right.

Before:
# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000be41c082cf4e
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:7a:77:fb:2b:93:cb
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:be:82:38:c8:c8:4d
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:56:40:6c:10:1d:84
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 9(veth6955879): addr:fe:1c:c3:de:01:12
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:be:41:c0:82:cf:4e
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0


After:
# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000be41c082cf4e
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:56:bc:d9:9e:3e:17
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:ce:8c:ad:3f:bf:19
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:86:d5:1c:62:d2:02
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:be:41:c0:82:cf:4e
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0

Comment 5 Meng Bo 2016-01-26 05:58:08 UTC
The issue has been fixed with the latest OSE puddle 2016-01-25.1.
Versions:
# openshift version
openshift v3.1.1.6
kubernetes v1.1.0-origin-1107-g4c8e6f4


The pods still have network connectivity after the node service is restarted,
and the pods' OpenFlow port info still exists.
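
For reference, a sketch of the verification steps (the pod IP and port are illustrative, and the node unit name is assumed to be atomic-openshift-node):

# Restart the node service and confirm the pod endpoint is still reachable
systemctl restart atomic-openshift-node
curl 10.2.0.5:5000; echo $?
# Confirm the pod's veth port is still attached to the OVS bridge
ovs-ofctl -O OpenFlow13 show br0 | grep veth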

Comment 6 Meng Bo 2016-01-27 02:08:03 UTC
Bug fixed.

