Bug 1300582 - Pod network connection is not available after restarting the node service.
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Eric Paris
QA Contact: Meng Bo
Keywords: Regression
Depends On:
Blocks:
 
Reported: 2016-01-21 03:32 EST by Johnny Liu
Modified: 2016-01-29 15:58 EST (History)
7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-29 15:58:26 EST
Type: Bug


Attachments: None
Description Johnny Liu 2016-01-21 03:32:44 EST
Description of problem:
Deploy an app and access its endpoint successfully, then restart the node service; afterwards the app's endpoint is no longer reachable.

Version-Release number of selected component (if applicable):
openshift v3.1.1.5
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
Always

Steps to Reproduce:
1. Deploy docker-registry in the OSE environment as the admin user.
2. Log in as a normal user and deploy an app.
$ oc new-app https://github.com/openshift/simple-openshift-sinatra-sti.git -i openshift/ruby
3. Access the endpoint of docker-registry and the created app.
4. Restart the node service (a minimal sketch of steps 3-5 follows this list).
5. Try to access the endpoints of docker-registry and the created app again.
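
A minimal shell sketch of steps 3-5 as run on the node (the pod IP is taken from the transcript below and the node service name from the journal excerpt; both will differ per environment):

curl 10.2.0.5:5000; echo $?               # step 3: endpoint reachable before the restart
systemctl restart atomic-openshift-node   # step 4: restart the node service
curl 10.2.0.5:5000; echo $?               # step 5: the same endpoint now fails with "no route to host"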

Actual results:
Step 3:
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   0          7m
[root@openshift-146 ~]# oc describe po docker-registry-1-6ax1i|grep IP
IP:				10.2.0.5
[root@openshift-146 ~]# curl 10.2.0.5:5000; echo $?
0
[jialiu@jialiu-laptop ~]$ oc get po
NAME                                   READY     STATUS      RESTARTS   AGE
simple-openshift-sinatra-sti-1-build   0/1       Completed   0          7m
simple-openshift-sinatra-sti-1-k0vtx   1/1       Running     0          4m
[jialiu@jialiu-laptop ~]$ oc describe po simple-openshift-sinatra-sti-1-k0vtx|grep IP
IP:				10.2.0.8
# curl 10.2.0.8:8080; echo $?
Hello, Sinatra!0

Step 4:
The docker-registry pod goes into a restart loop.
# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   5          13m
[root@openshift-146 ~]# curl 10.2.0.5:5000
curl: (7) Failed connect to 10.2.0.5:5000; No route to host


Here is node logs:
<--snip-->
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.468901   46410 manager.go:1776] pod "docker-registry-1-6ax1i_default" container "registry" is unhealthy, it will be killed and re-created.
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.468944   46410 manager.go:1419] Killing container "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28 registry default/docker-registry-1-6ax1i" with 30 second grace period
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.754880   46410 manager.go:1451] Container "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28 registry default/docker-registry-1-6ax1i" exited after 285.908772ms
Jan 21 14:47:46 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:46.018982   46410 helpers.go:96] Unable to get network stats from pid 47166: couldn't read network stats: failure opening /proc/47166/net/dev: open /proc/47166/net/dev: no such file or directory
Jan 21 14:47:54 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: W0121 14:47:54.467809   46410 prober.go:96] No ref for pod {"docker" "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28"} - 'registry'
Jan 21 14:47:54 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:54.467907   46410 prober.go:105] Liveness probe for "docker-registry-1-6ax1i_default:registry" failed (failure): Get http://10.2.0.5:5000/healthz: dial tcp 10.2.0.5:5000: no route to ho
<--snip-->

Trying to access the created app also fails:
# curl 10.2.0.8:8080; echo $?
curl: (7) Failed connect to 10.2.0.8:8080; No route to host
7


Expected results:
After the node service restarts, running pods should still be reachable over the pod network.

Additional info:
Workaround:
Run "oc delete po --all" so that the replication controller recreates the pods; the new pods' endpoints are then reachable.
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   27         20m
[root@openshift-146 ~]# oc delete po --all
pod "docker-registry-1-6ax1i" deleted
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-9cp84   1/1       Running   0          25s
[root@openshift-146 ~]# oc describe pod docker-registry-1-9cp84|grep IP
IP:				10.2.0.9
[root@openshift-146 ~]# curl 10.2.0.9:5000; echo $?
0
Comment 1 Dan Winship 2016-01-25 11:14:09 EST
I'm guessing that after the restart, "ovs-ofctl -O OpenFlow13 show br0" no longer shows the pod's veth device being connected to OVS. Can you confirm that?
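A quick way to check this on the node (a sketch, assuming the SDN bridge is br0 as in the output below; veth names differ per pod):

ovs-ofctl -O OpenFlow13 show br0 | grep veth   # veth ports currently attached to br0
ip link | grep veth                            # veth devices that exist on the host

If a running pod's veth appears in the second list but not the first, it has been detached from OVS.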
Comment 3 Dan Winship 2016-01-25 14:28:55 EST
(In reply to Eric Paris from comment #2)
> believed this might be a dup of 
> https://bugzilla.redhat.com/show_bug.cgi?id=1275904
> https://bugzilla.redhat.com/show_bug.cgi?id=1299756

It's a dup of the latter, but neither bug is a dup of the former. I was confused earlier.

This is fixed by https://github.com/openshift/origin/pull/6684, but that didn't manage to land before the freeze due to test flakes.

Submitted a PR against OSE enterprise-3.1 with just the minimal relevant bits here: https://github.com/openshift/ose/pull/129
Comment 4 Johnny Liu 2016-01-26 00:36:23 EST
(In reply to Dan Winship from comment #1)
> I'm guessing that after the restart, "ovs-ofctl -O OpenFlow13 show br0" no
> longer shows the pod's veth device being connected to OVS. Can you confirm
> that?

Yeah, you are right.

Before:
# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000be41c082cf4e
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:7a:77:fb:2b:93:cb
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:be:82:38:c8:c8:4d
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:56:40:6c:10:1d:84
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 9(veth6955879): addr:fe:1c:c3:de:01:12
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:be:41:c0:82:cf:4e
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0


After:
# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000be41c082cf4e
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:56:bc:d9:9e:3e:17
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:ce:8c:ad:3f:bf:19
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:86:d5:1c:62:d2:02
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:be:41:c0:82:cf:4e
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0
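
The pod's veth port (9(veth6955879) in the "Before" output) is missing after the restart. A minimal way to capture the same comparison (a sketch; the file names are arbitrary):

ovs-ofctl -O OpenFlow13 show br0 > /tmp/br0-before.txt
systemctl restart atomic-openshift-node
ovs-ofctl -O OpenFlow13 show br0 > /tmp/br0-after.txt
diff /tmp/br0-before.txt /tmp/br0-after.txt    # the pod's veth port drops out of the port list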
Comment 5 Meng Bo 2016-01-26 00:58:08 EST
The issue has been fixed in the latest OSE puddle, 2016-01-25.1.
Versions:
# openshift version
openshift v3.1.1.6
kubernetes v1.1.0-origin-1107-g4c8e6f4


The pods still have network connectivity after the node service is restarted, and the pods' OpenFlow port entries still exist.
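
A re-verification sketch of the same checks (the pod IP, port, and bridge name are taken from the earlier transcripts and will differ per environment):

systemctl restart atomic-openshift-node
ovs-ofctl -O OpenFlow13 show br0 | grep veth   # the pod's veth port should still be attached
curl 10.2.0.5:5000; echo $?                    # the registry endpoint should still return exit status 0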
Comment 6 Meng Bo 2016-01-26 21:08:03 EST
Bug fixed.
