Description of problem:

During Online Hackday, I added the jenkins-persistent instant app to my project. The deployer pod failed with no good messages available on the console. Dan M found the following error in the Sentry logs:

New Issue on OpenShift 3 Beta3 INT

failed to "TeardownNetwork" for "jenkins-1-deploy_mffiedler" with TeardownNetworkError: "Failed to teardown network for pod \"3439c0bc-1230-11e6-9f75-0a1d348c34bb\" using network plugins \"redhat/openshift-ovs-multitenant\": Error running network teardown script: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=\novs-ofctl: field nw_dst missing value\n"

ID: 0278663ec6b24f9c96c7d8efe23e1ea9
May 4, 2016, 7:43 p.m. UTC

Exception

errors.aggregate: failed to "TeardownNetwork" for "jenkins-1-deploy_mffiedler" with TeardownNetworkError: "Failed to teardown network for pod \"3439c0bc-1230-11e6-9f75-0a1d348c34bb\" using network plugins \"redhat/openshift-ovs-multitenant\": Error running network teardown script: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=\novs-ofctl: field nw_dst missing value\n"

  File "/builddir/build/BUILD/atomic-openshift-git-0.f44746c/_build/src/github.com/openshift/origin/pkg/cmd/util/serviceability/sentry.go", line 61, in CaptureError
  File "/builddir/build/BUILD/atomic-openshift-git-0.f44746c/_build/src/github.com/openshift/origin/pkg/cmd/util/serviceability/panic.go", line 28, in CaptureError.fm
  File "/builddir/build/BUILD/atomic-openshift-git-0.f44746c/_thirdpartyhacks/src/k8s.io/kubernetes/pkg/util/runtime/runtime.go", line 71, in HandleError
  File "/builddir/build/BUILD/atomic-openshift-git-0.f44746c/_thirdpartyhacks/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go", line 1756, in syncPod
  File "/builddir/build/BUILD/atomic-openshift-git-0.f44746c/_thirdpartyhacks/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go", line 475, in syncPod).fm
  ... (4 additional frame(s) were not displayed)

Version-Release number of selected component (if applicable):
3.2

How reproducible:
Unknown - subsequent deployment succeeded

Steps to Reproduce:
1. Deploy jenkins-persistent in Online preview
2. If the error occurs, the deployment will fail

Actual results:
See error messages above - deployment failed

Expected results:
Successful deployment
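For reference, the error string itself is just ovs-ofctl rejecting a flow match whose nw_dst field has no value. A rough sketch of how a teardown script could end up producing it, assuming it substitutes an IP looked up via docker inspect (the variable name here is illustrative, not taken from the actual script):

    # If the container has already exited or been removed, docker inspect
    # reports no IP, so the flow match ends in "nw_dst=" with no value.
    ipaddr=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' "$net_container")
    ovs-ofctl -O OpenFlow13 del-flows br0 "ip,nw_dst=${ipaddr}"
    # With an empty $ipaddr this fails with:
    #   ovs-ofctl: field nw_dst missing value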
(In reply to Mike Fiedler from comment #0)
> During Online Hackday, I added the jenkins-persistent instant app to my
> project. The deployer pod failed with no good messages available on the
> console. Dan M found the following error in the Sentry logs:
>
> New Issue on OpenShift 3 Beta3 INT
>
> failed to "TeardownNetwork" for "jenkins-1-deploy_mffiedler" with

This error message comes from cleaning up the pod, so it's not actually the cause of the failure, just another symptom. I assume you don't still have the logs from this machine?
No luck getting logs.

This is kind of a "can't happen" error; kubernetes will only try to tear down the infrastructure pod if it successfully set it up (including setting up networking). I guess this must mean that the infrastructure pod was set up successfully, but then died/was killed unexpectedly, such that it no longer existed when TearDownPod() was run...

Anyway, this means that we will end up leaving cruft in the OVS rules, which is bad... (when the IP address gets reused, we might direct traffic to the wrong port).

dcbw mentioned having code in some branch to cache the IP addresses. That would mostly fix this, though to be completely safe we'd want to cache the veth interface names too.
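A rough sketch of what such caching could look like in a script-based plugin (this is not dcbw's branch; the cache path, file layout, and variable names are assumptions):

    # Setup side: once the pod's IP and veth are known, record them where
    # teardown can find them later (the cache path is hypothetical).
    cache="/var/run/openshift-sdn/pod-cache/${net_container}"
    mkdir -p "$(dirname "$cache")"
    echo "${ipaddr} ${veth}" > "$cache"

    # Teardown side: fall back to the cached values when docker inspect can
    # no longer report them, so the OVS rules still get cleaned up.
    if [ -z "$ipaddr" ] && [ -f "$cache" ]; then
        read ipaddr veth < "$cache"
    fi
    if [ -n "$ipaddr" ]; then
        ovs-ofctl -O OpenFlow13 del-flows br0 "ip,nw_dst=${ipaddr}"
    fi
    rm -f "$cache"

The veth name would be cached the same way, so the port could still be removed from the bridge even after the container's side of the pair has disappeared.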
So this bug pointed out a problem in the sdn teardown code, but that's not actually what the reporter was reporting (which was "The deployer pod failed with no good messages"). The logs are no longer available, so we're never going to be able to debug what actually happened there. So while we're continuing to work on sdn-teardown-related fixes, that isn't really this bug.
root@ip-172-31-31-215: /var/log # openshift version
openshift v3.2.1.1-1-g33fa4ea
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

root@ip-172-31-31-215: /var/log # docker version
Client:
 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-42.el7.x86_64
 Go version:      go1.4.2
 Git commit:      02f13c2-unsupported
 Built:           Mon Jun 13 15:22:15 2016
 OS/Arch:         linux/amd64

Server:
 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-42.el7.x86_64
 Go version:      go1.4.2
 Git commit:      02f13c2-unsupported
 Built:           Mon Jun 13 15:22:15 2016
 OS/Arch:         linux/amd64
OK, so the logs show a build pod crashing:

atomic-openshift-node: I0614 17:38:35.523370 7736 kubelet.go:2430] SyncLoop (PLEG): "cakephp-mysql-example-1-build_test(eda34196-3277-11e6-926c-028d6556ee79)", event: &pleg.PodLifecycleEvent{ID:"eda34196-3277-11e6-926c-028d6556ee79", Type:"ContainerDied", Data:"044c5e961ef5ea49c554e165fbbb540727f4cac579b63c7b12b599b45fd2e64a"}

but I'm not sure there are any good hints as to why. (There's a ton of stuff in the logs, but none of it jumps out as being relevant.)

But if the crashing bug is actually happening regularly, then that's a problem because of the openshift-sdn cleanup bug; the references to the failed pod's IP address will linger in OVS after the pod exits, and so then the next pod to get assigned that IP address won't have working networking, because the OVS rules will still try to route its traffic to the old pod.

A workaround if this happens is:

    systemctl stop atomic-openshift-node
    ovs-vsctl del-br br0
    systemctl start atomic-openshift-node

which will regenerate the OVS rules from scratch, correctly. (Or if you know that certain pods have the broken IPs, you can just kill those pods; when the cleanup code runs for them, it will delete both the "good" OVS rules and the stale ones for those IP addresses, so then the next pod to get that IP will work again.)
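Before doing the full bridge rebuild, it's also possible to check whether a given pod IP actually has stale rules; a sketch (the IP is a placeholder, and the pattern assumes the rules match on nw_src/nw_dst as above):

    # Look for any flows on br0 that still reference a suspect pod IP.
    pod_ip=10.1.2.3                     # placeholder for the suspect IP
    ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -E "nw_(src|dst)=${pod_ip}[, ]"
    # Matches remaining after the pod is gone point to stale rules for that IP.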
Actually, Mike, if you can reproduce this reliably, can you try to get logs with --loglevel=5?

There's something weird going on here; the logs show us successfully removing the veth from the OVS bridge *before* failing to remove the flow rules:

> Jun 14 17:38:36 xxxxxx ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth439c632
> ...
> Jun 14 17:38:36 xxxx atomic-openshift-node: E0614 17:38:36.810673 7736 manager.go:1297] Failed to teardown network for pod "eda34196-3277-11e6-926c-028d6556ee79" using network plugins "redhat/openshift-ovs-multitenant": Error running network teardown script: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=

which makes no sense because (a) the teardown script deletes the flow rules before deleting the port, and (b) if it exited with a flow-rule-deletion error, then it wouldn't even get to the port deletion part of the script...

So now I'm wondering if maybe TearDownPod() is getting called twice for some reason, succeeding the first time and then failing the second. (Possibly supporting this hypothesis: the openshift-sdn-debug output that you linked to doesn't show any evidence of stale OVS rules being left around, although that's not conclusive since the bad rules may have just been cleaned up by other pods later.)
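Once higher-verbosity logs are available, one quick way to test the double-teardown hypothesis is to count how many teardown traces the node logged for a single infra container. A heuristic sketch, assuming the node logs are in the journal under atomic-openshift-node and that the trace ordering matches what's quoted in this bug (the container ID is a placeholder):

    # Pull the plugin traces for one infra container and count the teardowns.
    container=<infra-container-id>      # placeholder
    journalctl -u atomic-openshift-node --no-pager \
        | grep -B2 "net_container=${container}" \
        | grep -c 'action=teardown'
    # A count greater than 1 means TearDownPod ran more than once for that pod.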
OK, yes, those logs show that the pod *is* getting torn down twice:

> Jun 19 04:17:50 xxxxxx atomic-openshift-node: I0619 04:17:50.391887 88550 plugin.go:243] TearDownPod network plugin output:
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + action=teardown
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + net_container=7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> ...
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=172.20.2.3
> ...

and then later:

> Jun 19 04:17:51 xxxxxx atomic-openshift-node: I0619 04:17:51.426410 88550 plugin.go:243] TearDownPod network plugin output: + lock_file=/var/lock/openshift-sdn.lock
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + action=teardown
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + net_container=7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> ...
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: ++ docker inspect --format '{{.NetworkSettings.IPAddress}}' 7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + ipaddr=
> ...
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: ovs-ofctl: field nw_dst missing value
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: , exit status 1

So that's a bug, but it means that we *aren't* leaving crufty rules behind in OVS and breaking the network, so this is less urgent again.
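One way the script could tolerate the redundant second teardown (a sketch only, not the fix that eventually landed; variable names follow the trace above) is to treat an empty IP lookup as "nothing left to clean up" rather than handing ovs-ofctl an empty value:

    # In the teardown path, after looking up the container's IP:
    ipaddr=$(docker inspect --format '{{.NetworkSettings.IPAddress}}' "$net_container")
    if [ -z "$ipaddr" ]; then
        # The container is gone or was already torn down, so there is no
        # address to match on; exit cleanly instead of emitting
        # "ovs-ofctl: field nw_dst missing value".
        exit 0
    fi
    ovs-ofctl -O OpenFlow13 del-flows br0 "ip,nw_dst=${ipaddr}"

That would only quiet the redundant second call, though; it wouldn't help the case from comment 3 where teardown never runs at all and stale rules are left behind.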
This bug no longer occurs in OpenShift 3.4. In my opinion this bz should be marked as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1359240. @danw - agree?
agreed *** This bug has been marked as a duplicate of bug 1359240 ***