Bug 1333393
| Summary: | TeardownNetworkError for Online preview: field nw_dst missing value | | |
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Networking | Assignee: | Dan Winship <danw> |
| Status: | CLOSED DUPLICATE | QA Contact: | Meng Bo <bmeng> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.x | CC: | abhgupta, aloughla, aos-bugs, bingli, bmchugh, danw, dcbw, haliu, jeder, mifiedle, mleitner, rkhan, sukulkar, vlaad |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-34 | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1359240 (view as bug list) | Environment: | |
| Last Closed: | 2017-01-03 14:45:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1359240 | | |
|
Description
Mike Fiedler
2016-05-05 12:13:42 UTC
(In reply to Mike Fiedler from comment #0)

> During Online Hackday, I added the jenkins-persistent instant app to my
> project. The deployer pod failed with no good messages available on the
> console. Dan M found the following error in the Sentry logs:
>
> New Issue on OpenShift 3 Beta3 INT
>
> failed to "TeardownNetwork" for "jenkins-1-deploy_mffiedler" with

This error message is from cleaning up the pod, so it's not actually the cause of the failure, just another symptom. I assume you don't still have the logs from this machine?

No luck getting logs.

This is kind of a "can't happen" error; kubernetes will only try to tear down the infrastructure pod if it successfully set it up (including setting up networking). I guess this must mean that the infrastructure pod was set up successfully, but then died/was killed unexpectedly, such that it no longer existed when TearDownPod() was run...

Anyway, this means that we will end up leaving cruft in the OVS rules, which is bad... (when the IP address gets reused, we might direct traffic to the wrong port). dcbw mentioned having code in some branch to cache the IP addresses. That would mostly fix this, though to be completely safe we'd want to cache the veth interface names too.

So this bug pointed out a problem in the sdn teardown code, but that's not actually the bug that the reporter was reporting (which is "The deployer pod failed with no good messages"). Logs are no longer available, so we're never going to debug what actually happened there. So while we're continuing to work on sdn-teardown-related fixes, that isn't really this bug.
root@ip-172-31-31-215: /var/log # openshift version
openshift v3.2.1.1-1-g33fa4ea
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

root@ip-172-31-31-215: /var/log # docker version
Client:
 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-42.el7.x86_64
 Go version:      go1.4.2
 Git commit:      02f13c2-unsupported
 Built:           Mon Jun 13 15:22:15 2016
 OS/Arch:         linux/amd64
Server:
 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-42.el7.x86_64
 Go version:      go1.4.2
 Git commit:      02f13c2-unsupported
 Built:           Mon Jun 13 15:22:15 2016
 OS/Arch:         linux/amd64

OK, so the logs show a build pod crashing:
atomic-openshift-node: I0614 17:38:35.523370 7736 kubelet.go:2430] SyncLoop (PLEG): "cakephp-mysql-example-1-build_test(eda34196-3277-11e6-926c-028d6556ee79)", event: &pleg.PodLifecycleEvent{ID:"eda34196-3277-11e6-926c-028d6556ee79", Type:"ContainerDied", Data:"044c5e961ef5ea49c554e165fbbb540727f4cac579b63c7b12b599b45fd2e64a"}
but I'm not sure there are any good hints as to why. (There's a ton of stuff in the logs, but none of it jumps out as being relevant.)
But if the crashing bug is actually happening regularly, that's a problem because of the openshift-sdn cleanup bug: references to the failed pod's IP address will linger in OVS after the pod exits, so the next pod to get assigned that IP address won't have working networking, because the stale OVS rules will still try to route its traffic to the old pod.
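As a toy illustration of that stale-rule problem (a sketch only; real OVS matching is priority-based, and the IP and port numbers here are made up, reusing the 172.20.2.3 address from the logs as an example), a leftover `nw_dst` rule for a reused pod IP leaves two rules matching the same destination:

```shell
# Hypothetical flow entries: one left behind by the failed pod, one added
# for the new pod that was assigned the same IP. Both match the same nw_dst.
stale="ip,nw_dst=172.20.2.3,actions=output:2"   # dead pod's port
fresh="ip,nw_dst=172.20.2.3,actions=output:7"   # new pod's port
flows="$stale
$fresh"
# Count how many rules claim the reused address -- the ambiguity is the bug:
printf '%s\n' "$flows" | grep -c 'nw_dst=172.20.2.3'   # prints 2
```

With real OVS rules, which of the two the switch honors depends on rule priority, which is why traffic can end up at the dead pod's port.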
A workaround if this happens is:
systemctl stop atomic-openshift-node
ovs-vsctl del-br br0
systemctl start atomic-openshift-node
which will regenerate the OVS rules from scratch, correctly. (Or if you know that certain pods have the broken IPs, you can just kill those pods; when the cleanup code runs for them, it will delete both the "good" OVS rules and the stale ones for those IP addresses, so then the next pod to get that IP will work again.)
Actually, Mike, if you can reproduce this reliably, can you try to get logs with --loglevel=5? There's something weird going on here; the logs show us successfully removing the veth from the OVS bridge *before* failing to remove the flow rules:
> Jun 14 17:38:36 xxxxxx ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port veth439c632
> ...
> Jun 14 17:38:36 xxxx atomic-openshift-node: E0614 17:38:36.810673 7736 manager.go:1297] Failed to teardown network for pod "eda34196-3277-11e6-926c-028d6556ee79" using network plugins "redhat/openshift-ovs-multitenant": Error running network teardown script: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=
which makes no sense because (a) the teardown script deletes the flow rules before deleting the port, and (b) if it exited with a flow-rule-deletion error, then it wouldn't even get to the port deletion part of the script...
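The ordering the script is expected to follow can be sketched like this (a stub-based sketch, not the actual openshift-sdn teardown script; the OVS tools are replaced by shell functions that just record the call sequence):

```shell
# Stubs standing in for the real ovs-ofctl and ovs-vsctl binaries; they
# append a tag to $calls so the call order can be inspected afterwards.
calls=""
ovs_ofctl() { calls="${calls}${calls:+ }ofctl:$1"; }
ovs_vsctl() { calls="${calls}${calls:+ }vsctl:$2"; }

# Hypothetical teardown: flow rules are deleted first, then the veth port.
teardown_pod() {
    ipaddr="$1"; veth="$2"
    ovs_ofctl del-flows "ip,nw_dst=${ipaddr}"   # step 1: flow rules
    ovs_vsctl --if-exists del-port "$veth"      # step 2: the veth port
}

teardown_pod 172.20.2.3 veth439c632
echo "$calls"   # prints: ofctl:del-flows vsctl:del-port
```

That ordering is what makes the logs above so strange: the `del-port` should never be reached if `del-flows` failed.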
So now I'm wondering if maybe TearDownPod() is getting called twice for some reason, succeeding the first time and then failing the second. (Possibly supporting this hypothesis: the openshift-sdn-debug output that you linked to doesn't show any evidence of stale OVS rules being left around, although that's not conclusive since the bad rules may have just been cleaned up by other pods later.)
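A minimal sketch of why a second teardown would fail exactly this way (stubbed, not the real script): once the container is already gone, `docker inspect` yields an empty IP address, so the flow match degenerates to `ip,nw_dst=` and ovs-ofctl rejects it.

```shell
# Stub standing in for the real ovs-ofctl binary: it rejects a match with
# an empty nw_dst value, mimicking the error seen in the node logs.
ovs_ofctl() {
    case "$2" in
        ip,nw_dst=) echo "ovs-ofctl: field nw_dst missing value" >&2; return 1 ;;
        *) return 0 ;;
    esac
}

ipaddr=""   # what 'docker inspect' yields once the container no longer exists
# A defensive version of the script would guard the call like this:
if [ -n "$ipaddr" ]; then
    ovs_ofctl del-flows "ip,nw_dst=${ipaddr}"
else
    echo "skipping del-flows: container already gone, no IP to match"
fi
```

The guard is an illustrative fix sketch only; whether openshift-sdn should skip, or instead make TearDownPod() idempotent so it is never invoked twice, is the question the comments below settle.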
OK, yes, those logs show that the pod *is* getting torn down twice:

> Jun 19 04:17:50 xxxxxx atomic-openshift-node: I0619 04:17:50.391887 88550 plugin.go:243] TearDownPod network plugin output:
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + action=teardown
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + net_container=7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> ...
> Jun 19 04:17:50 xxxxxx atomic-openshift-node: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=172.20.2.3
> ...

and then later:

> Jun 19 04:17:51 xxxxxx atomic-openshift-node: I0619 04:17:51.426410 88550 plugin.go:243] TearDownPod network plugin output: + lock_file=/var/lock/openshift-sdn.lock
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + action=teardown
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + net_container=7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> ...
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: ++ docker inspect --format '{{.NetworkSettings.IPAddress}}' 7b3cd786da0c12320e986b8363575f7b219ac1c33b8474638df8a2cee9b9cc26
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + ipaddr=
> ...
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: + ovs-ofctl -O OpenFlow13 del-flows br0 ip,nw_dst=
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: ovs-ofctl: field nw_dst missing value
> Jun 19 04:17:51 xxxxxx atomic-openshift-node: , exit status 1

So that's a bug, but it means that we *aren't* leaving crufty rules behind in OVS and breaking the network, so this is less urgent again.

This bug no longer occurs in OpenShift 3.4. In my opinion this bz should be marked as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1359240. @danw - agree?

agreed

*** This bug has been marked as a duplicate of bug 1359240 ***