Bug 1333393
| Summary: | TeardownNetworkError for Online preview: field nw_dst missing value | | |
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Networking | Assignee: | Dan Winship <danw> |
| Status: | CLOSED DUPLICATE | QA Contact: | Meng Bo <bmeng> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.x | CC: | abhgupta, aloughla, aos-bugs, bingli, bmchugh, danw, dcbw, haliu, jeder, mifiedle, mleitner, rkhan, sukulkar, vlaad |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-34 | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1359240 (view as bug list) | Environment: | |
| Last Closed: | 2017-01-03 14:45:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1359240 | | |
|
Description
Mike Fiedler
2016-05-05 12:13:42 UTC
(In reply to Mike Fiedler from comment #0)

> During Online Hackday, I added the jenkins-persistent instant app to my
> project. The deployer pod failed with no good messages available on the
> console. Dan M found the following error in the Sentry logs:
>
> New Issue on OpenShift 3 Beta3 INT
>
> failed to "TeardownNetwork" for "jenkins-1-deploy_mffiedler" with

This error message is from cleaning up the pod, so it's not actually the cause of the failure, just another symptom. I assume you don't still have the logs from this machine?

No luck getting logs.

This is kind of a "can't happen" error; kubernetes will only try to tear down the infrastructure pod if it successfully set it up (including setting up networking). I guess this must mean that the infrastructure pod was set up successfully, but then died/was killed unexpectedly, such that it no longer existed when TearDownPod() was run...

Anyway, this means that we will end up leaving cruft in the OVS rules, which is bad... (when the IP address gets reused, we might direct traffic to the wrong port). dcbw mentioned having code in some branch to cache the IP addresses. That would mostly fix this, though to be completely safe we'd want to cache the veth interface names too.

So this bug pointed out a problem in the sdn teardown code, but that's not actually the bug that the reporter was reporting (which is "The deployer pod failed with no good messages"). Logs are no longer available, so we're never going to debug what actually happened there. So while we're continuing to work on sdn-teardown-related fixes, that isn't really this bug.
root@ip-172-31-31-215: /var/log # openshift version
openshift v3.2.1.1-1-g33fa4ea
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

root@ip-172-31-31-215: /var/log # docker version
Client:
 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-42.el7.x86_64
 Go version:      go1.4.2
 Git commit:      02f13c2-unsupported
 Built:           Mon Jun 13 15:22:15 2016
 OS/Arch:         linux/amd64
Server:
 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-42.el7.x86_64
 Go version:      go1.4.2
 Git commit:      02f13c2-unsupported
 Built:           Mon Jun 13 15:22:15 2016
 OS/Arch:         linux/amd64

OK, so the logs show a build pod crashing:
atomic-openshift-node: I0614 17:38:35.523370 7736 kubelet.go:2430] SyncLoop (PLEG): "cakephp-mysql-example-1-build_test(eda34196-3277-11e6-926c-028d6556ee79)", event: &pleg.PodLifecycleEvent{ID:"eda34196-3277-11e6-926c-028d6556ee79", Type:"ContainerDied", Data:"044c5e961ef5ea49c554e165fbbb540727f4cac579b63c7b12b599b45fd2e64a"}
but I'm not sure there are any good hints as to why. (There's a ton of stuff in the logs, but none of it jumps out as being relevant.)
But if the crashing bug is actually happening regularly, that's a problem because of the openshift-sdn cleanup bug: references to the failed pod's IP address will linger in OVS after the pod exits, so the next pod to get assigned that IP address won't have working networking, because the stale OVS rules will still try to route its traffic to the old pod.
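As a toy illustration of that stale-rule problem (a sketch only; real OVS matching is priority-based, and the IP and port numbers here are made up, reusing the 172.20.2.3 address from the logs as an example), a leftover `nw_dst` rule for a reused pod IP leaves two rules matching the same destination:

```shell
# Hypothetical flow entries: one left behind by the failed pod, one added
# for the new pod that was assigned the same IP. Both match the same nw_dst.
stale="ip,nw_dst=172.20.2.3,actions=output:2"   # dead pod's port
fresh="ip,nw_dst=172.20.2.3,actions=output:7"   # new pod's port
flows="$stale
$fresh"
# Count how many rules claim the reused address -- the ambiguity is the bug:
printf '%s\n' "$flows" | grep -c 'nw_dst=172.20.2.3'   # prints 2
```

With real OVS rules, which of the two the switch honors depends on rule priority, which is why traffic can end up at the dead pod's port.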
A workaround if this happens is:
systemctl stop atomic-openshift-node
ovs-vsctl del-br br0
systemctl start atomic-openshift-node
which will regenerate the OVS rules from scratch, correctly. (Or if you know that certain pods have the broken IPs, you can just kill those pods; when the cleanup code runs for them, it will delete both the "good" OVS rules and the stale ones for those IP addresses, so then the next pod to get that IP will work again.)
Actually, Mike, if you can reproduce this reliably, can you try to get logs with --loglevel=5? There's something weird going on here; the logs show us successfully removing the veth from the OVS bridge *before* failing to remove the flow rules:
> Jun 14 17:38:36 xxxxxx ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth439c632
> ...
> Jun 14 17:38:36 xxxx atomic-openshift-node: E0614 17:38:36.810673 7736 manager.go:1297] Failed to teardown network for pod "eda34196-3277-11e6-926c-028d6556ee79" using network plugins "redhat/openshift-ovs-multitenant": Error running network teardown script: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=
which makes no sense because (a) the teardown script deletes the flow rules before deleting the port, and (b) if it exited with a flow-rule-deletion error, then it wouldn't even get to the port deletion part of the script...
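The ordering the script is expected to follow can be sketched like this (a stub-based sketch, not the actual openshift-sdn teardown script; the OVS tools are replaced by shell functions that just record the call sequence):

```shell
# Stubs standing in for the real ovs-ofctl and ovs-vsctl binaries; they
# append a tag to $calls so the call order can be inspected afterwards.
calls=""
ovs_ofctl() { calls="${calls}${calls:+ }ofctl:$1"; }
ovs_vsctl() { calls="${calls}${calls:+ }vsctl:$2"; }

# Hypothetical teardown: flow rules are deleted first, then the veth port.
teardown_pod() {
    ipaddr="$1"; veth="$2"
    ovs_ofctl del-flows "ip,nw_dst=${ipaddr}"   # step 1: flow rules
    ovs_vsctl --if-exists del-port "$veth"      # step 2: the veth port
}

teardown_pod 172.20.2.3 veth439c632
echo "$calls"   # prints: ofctl:del-flows vsctl:del-port
```

That ordering is what makes the logs above so strange: the `del-port` should never be reached if `del-flows` failed.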
So now I'm wondering if maybe TearDownPod() is getting called twice for some reason, succeeding the first time and then failing the second. (Possibly supporting this hypothesis: the openshift-sdn-debug output that you linked to doesn't show any evidence of stale OVS rules being left around, although that's not conclusive since the bad rules may have just been cleaned up by other pods later.)
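A minimal sketch of why a second teardown would fail exactly this way (stubbed, not the real script): once the container is already gone, `docker inspect` yields an empty IP address, so the flow match degenerates to `ip,nw_dst=` and ovs-ofctl rejects it.

```shell
# Stub standing in for the real ovs-ofctl binary: it rejects a match with
# an empty nw_dst value, mimicking the error seen in the node logs.
ovs_ofctl() {
    case "$2" in
        ip,nw_dst=) echo "ovs-ofctl: field nw_dst missing value" >&2; return 1 ;;
        *) return 0 ;;
    esac
}

ipaddr=""   # what 'docker inspect' yields once the container no longer exists
# A defensive version of the script would guard the call like this:
if [ -n "$ipaddr" ]; then
    ovs_ofctl del-flows "ip,nw_dst=${ipaddr}"
else
    echo "skipping del-flows: container already gone, no IP to match"
fi
```

The guard is an illustrative fix sketch only; whether openshift-sdn should skip, or instead make TearDownPod() idempotent so it is never invoked twice, is the question the comments below settle.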
OK, yes, those logs show that the pod *is* getting torn down twice:

> Jun 19 04:17:50 xxxxxx atomic-openshift-node: I0619 04:17:50.391887 88550 plugin.go:243] TearDownPod network plugin output:
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + action=teardown
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + net_container=7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> ...
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=172.20.2.3
> ...

and then later:

> Jun 19 04:17:51 xxxxxx atomic-openshift-node: I0619 04:17:51.426410 88550 plugin.go:243] TearDownPod network plugin output: + lock_file=/var/lock/openshift-sdn.lock
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + action=teardown
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + net_container=7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> ...
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: ++ docker inspect --format '{{.NetworkSettings.IPAddress}}' 7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + ipaddr=
> ...
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: ovs-ofctl: field nw_dst missing value
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: , exit status 1

So that's a bug, but it means that we *aren't* leaving crufty rules behind in OVS and breaking the network, so this is less urgent again.

This bug no longer occurs in OpenShift 3.4. In my opinion this bz should be marked as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1359240. @danw - agree?

agreed

*** This bug has been marked as a duplicate of bug 1359240 ***