Description of problem:
Deploy an app and access the app's endpoint successfully, then restart the node service; afterwards the app's endpoint is no longer reachable.

Version-Release number of selected component (if applicable):
openshift v3.1.1.5
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
Always

Steps to Reproduce:
1. Deploy docker-registry in the OSE env as admin.
2. Log in as a normal user and deploy an app:
   $ oc new-app https://github.com/openshift/simple-openshift-sinatra-sti.git -i openshift/ruby
3. Access the endpoints of docker-registry and the created app.
4. Restart the node service.
5. Try to access the endpoints of docker-registry and the created app again.

Actual results:
Step 3:
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   0          7m
[root@openshift-146 ~]# oc describe po docker-registry-1-6ax1i|grep IP
IP:             10.2.0.5
[root@openshift-146 ~]# curl 10.2.0.5:5000; echo $?
0

[jialiu@jialiu-laptop ~]$ oc get po
NAME                                    READY     STATUS      RESTARTS   AGE
simple-openshift-sinatra-sti-1-build    0/1       Completed   0          7m
simple-openshift-sinatra-sti-1-k0vtx    1/1       Running     0          4m
[jialiu@jialiu-laptop ~]$ oc describe po simple-openshift-sinatra-sti-1-k0vtx|grep IP
IP:             10.2.0.8
# curl 10.2.0.8:8080; echo $?
Hello, Sinatra!0

Step 4: the docker-registry pod goes into a restart loop.
# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   5          13m
[root@openshift-146 ~]# curl 10.2.0.5:5000
curl: (7) Failed connect to 10.2.0.5:5000; No route to host

Node logs:
<--snip-->
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.468901 46410 manager.go:1776] pod "docker-registry-1-6ax1i_default" container "registry" is unhealthy, it will be killed and re-created.
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.468944 46410 manager.go:1419] Killing container "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28 registry default/docker-registry-1-6ax1i" with 30 second grace period
Jan 21 14:47:45 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:45.754880 46410 manager.go:1451] Container "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28 registry default/docker-registry-1-6ax1i" exited after 285.908772ms
Jan 21 14:47:46 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:46.018982 46410 helpers.go:96] Unable to get network stats from pid 47166: couldn't read network stats: failure opening /proc/47166/net/dev: open /proc/47166/net/dev: no such file or directory
Jan 21 14:47:54 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: W0121 14:47:54.467809 46410 prober.go:96] No ref for pod {"docker" "15cdc522bab9aa924f934e31c19c67156028b6f1abe6df43e40d0e266b181a28"} - 'registry'
Jan 21 14:47:54 openshift-111.lab.eng.nay.redhat.com atomic-openshift-node[46410]: I0121 14:47:54.467907 46410 prober.go:105] Liveness probe for "docker-registry-1-6ax1i_default:registry" failed (failure): Get http://10.2.0.5:5000/healthz: dial tcp 10.2.0.5:5000: no route to ho
<--snip-->

Accessing the created app also fails:
# curl 10.2.0.8:8080; echo $?
curl: (7) Failed connect to 10.2.0.8:8080; No route to host
7

Expected results:
After a node restart, running pods should still be reachable.

Additional info:
Workaround: run "oc delete po --all" so the rc recreates the pods; the new pods' endpoints are then available.
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-6ax1i   1/1       Running   27         20m
[root@openshift-146 ~]# oc delete po --all
pod "docker-registry-1-6ax1i" deleted
[root@openshift-146 ~]# oc get po
NAME                      READY     STATUS    RESTARTS   AGE
docker-registry-1-9cp84   1/1       Running   0          25s
[root@openshift-146 ~]# oc describe pod docker-registry-1-9cp84|grep IP
IP:             10.2.0.9
[root@openshift-146 ~]# curl 10.2.0.9:5000; echo $?
0
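For reference, the reproduction and workaround condensed into commands (a sketch only; the unit name atomic-openshift-node.service is inferred from the journal lines above, and <new-pod-ip> is a placeholder for the recreated pod's IP):

# step 4: restart the node service on the node hosting the pods
systemctl restart atomic-openshift-node.service
# step 5: the old pod IP is now unreachable ("No route to host")
curl 10.2.0.5:5000; echo $?
# workaround: delete the pods so the replication controller recreates them
oc delete po --all
oc get po                                        # wait for the new pod to be Running
oc describe pod docker-registry-1-9cp84|grep IP  # note the new pod IP
curl <new-pod-ip>:5000; echo $?                  # reachable again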
I'm guessing that after the restart, "ovs-ofctl -O OpenFlow13 show br0" no longer shows the pod's veth device being connected to OVS. Can you confirm that?
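For reference, one way to map the pod's container to its host-side veth and check whether it is still attached to br0 (a sketch; <registry-container-id> and the interface index 9 are placeholders/examples, not values taken from this report):

PID=$(docker inspect --format '{{.State.Pid}}' <registry-container-id>)
nsenter -t "$PID" -n ip -o link show eth0   # prints e.g. "eth0@if9"; 9 is the host-side ifindex
ip -o link | grep '^9:'                     # the matching host veth device name
ovs-vsctl list-ports br0                    # is that veth still listed as a port?
ovs-ofctl -O OpenFlow13 show br0            # or check the OFPST_PORT_DESC port list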
believed this might be a dup of
https://bugzilla.redhat.com/show_bug.cgi?id=1275904
https://bugzilla.redhat.com/show_bug.cgi?id=1299756
(In reply to Eric Paris from comment #2)
> believed this might be a dup of
> https://bugzilla.redhat.com/show_bug.cgi?id=1275904
> https://bugzilla.redhat.com/show_bug.cgi?id=1299756

It's a dup of the latter, but neither bug is a dup of the former. I was confused earlier.

This is fixed by https://github.com/openshift/origin/pull/6684, but that didn't manage to land before the freeze due to test flakes. Submitted a PR against OSE enterprise-3.1 with just the minimal relevant bits here: https://github.com/openshift/ose/pull/129
(In reply to Dan Winship from comment #1)
> I'm guessing that after the restart, "ovs-ofctl -O OpenFlow13 show br0" no
> longer shows the pod's veth device being connected to OVS. Can you confirm
> that?

Yeah, you are right.

Before:
# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000be41c082cf4e
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:7a:77:fb:2b:93:cb
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:be:82:38:c8:c8:4d
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:56:40:6c:10:1d:84
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 9(veth6955879): addr:fe:1c:c3:de:01:12
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:be:41:c0:82:cf:4e
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0

After:
# ovs-ofctl -O OpenFlow13 show br0
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000be41c082cf4e
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:56:bc:d9:9e:3e:17
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:ce:8c:ad:3f:bf:19
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(vovsbr): addr:86:d5:1c:62:d2:02
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:be:41:c0:82:cf:4e
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=normal miss_send_len=0

After the restart, the pod's veth port (9(veth6955879) above) is no longer attached to br0.
The issue has been fixed with the latest OSE puddle 2016-01-25.1.

Versions:
# openshift version
openshift v3.1.1.6
kubernetes v1.1.0-origin-1107-g4c8e6f4

The pods still have network connectivity after the node service is restarted, and the pods' OpenFlow info still exists.
Bug fixed.