Bug 1359240 - Pod deployment error (resulting in "TeardownNetworkError for OSE 3.2: field nw_dst missing value")
Summary: Pod deployment error (resulting in "TeardownNetworkError for OSE 3.2: field nw_dst missing value")
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.2.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Dan Winship
QA Contact: Meng Bo
URL:
Whiteboard:
Duplicates: 1333393 1392255
Depends On: 1333393
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-07-22 15:06 UTC by Brendan Mchugh
Modified: 2017-03-08 18:43 UTC
CC List: 21 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Sometimes when a pod exits, OpenShift performs cleanup steps that aren't actually needed, but the default networking plugin assumed it would only be called to do cleanup when there was actual cleanup to be done.
Consequence: Nodes would sometimes log error messages about "Failed to teardown network for pod" when no error had actually occurred. Most of the time, this error would only be noticed in the logs by users who were trying to find the cause of a pod failure, and they were often confused into thinking that the "teardown" error was the cause of their problems, rather than an unrelated side effect.
Fix: The default networking plugin now recognizes when it has been called after the pod networking state has already been cleaned up successfully.
Result: No spurious error message is logged.
Clone Of: 1333393
Environment:
Last Closed: 2017-01-18 12:51:34 UTC
Target Upstream Version:
Embargoed:


Attachments
Node logs at loglevel=5 (22.61 KB, application/x-gzip)
2016-08-18 00:25 UTC, Mike Fiedler


Links
System ID Private Priority Status Summary Last Updated
Origin (Github) 11808 0 None None None 2016-11-07 14:04:28 UTC
Red Hat Product Errata RHBA-2017:0066 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4 RPM Release Advisory 2017-01-18 17:23:26 UTC

Comment 1 Ben Bennett 2016-08-08 19:50:13 UTC
Dan, is this fixed by our recent changes?

Comment 2 Dan Winship 2016-08-08 21:10:29 UTC
No, this isn't fixed. More precisely, there is a chain of errors here, none of which are fixed:

  1. Something (non-networking-related) causes the build/deploy to fail.

  2. Something in kubelet gets confused when this happens and calls the
     network teardown function twice while cleaning up the failed pod.

  3. The second time the network teardown function gets called, it prints
     a confusing error message because the pod is in an unexpected state
     (since it has already had its networking torn down).

(Well, the last part is a little bit fixed; it will now print "Could not find IP address for container" rather than "field nw_dst missing value". That may be the "recent changes" you were thinking of.)

I'm assuming we'd close this bug after figuring out and fixing #2.

Comment 3 Ben Bennett 2016-08-12 13:33:42 UTC
Note that this bug is just a symptom of an earlier failure; it's typically not networking-related. As Dan said, something causes the pod deploy to fail, and that needs to be checked on each system. In the cases we've troubleshot this, the causes have varied and were often due to misconfiguration.

At the moment, #2 in Dan's list is annoying, but not really problematic.

We have a card open to work on that:
  https://trello.com/c/9L7VBvSv/242-3-fix-teardownpod-to-work-better-after-errors-techdebt

In light of all that, I'm dropping the severity of this bug. For the associated cases, people should troubleshoot for the errors that are causing pod setup to fail. OVS just happened to be the most obvious symptom of the earlier failure.

Comment 4 Maru Newby 2016-08-17 04:29:17 UTC
Am I intended to figure out what is causing the kubelet to call teardown twice, or solve the problem on the card (ensure cleanup when the pod is dead)?

Comment 5 Ben Bennett 2016-08-17 13:08:48 UTC
Maru: Neither.  Sorry.  I want to work out what the root cause of the reported problem is.  At that point we may clone this bug and send the root cause to the right place, then just leave the cleanup problem as a lower priority bug on the networking team.

Comment 6 Ben Bennett 2016-08-17 13:20:03 UTC
Passing off to the Kubernetes component. This is not a networking problem; it's just that the loudest errors happened to come from the networking code, which was either called twice or called after a partial setup.

We have made some changes to the errors we spit out and have a card open to go further:
  https://trello.com/c/9L7VBvSv/242-3-fix-teardownpod-to-work-better-after-errors-techdebt

I looked at the sosreport from a linked case and nothing was obviously wrong.  So if someone with a better eye for container startup errors can look, that would be great.

Comment 7 Andy Goldstein 2016-08-17 13:24:40 UTC
Paul, according to Ben, something causes the pod startup to fail, resulting in networking error messages down the line. Last time I looked, I don't believe any of the logs attached to the support cases were from atomic-openshift-node. Please take a look, and if we're missing the node logs, set needinfo. Thanks.

Comment 8 Mike Fiedler 2016-08-17 13:59:26 UTC
In 3.3 this happens every time a pod terminates in an HA cluster. The message has changed (see comment #2) to "Could not find IP address for container".

Comment 9 Dan Williams 2016-08-17 20:07:05 UTC
See also https://github.com/kubernetes/kubernetes/pull/29217 for some more detail; pod network operations are not exclusive in the docker and rkt k8s runtimes. Specifically in the teardown case, there are two places that pods can be killed on the node. The first is kubelet.go::killPod(), which is called from the pod sync goroutine. killPod() is also called from podKiller() on a separate goroutine. These can race with each other and could lead to pod teardown being called twice in close succession.
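
Roughly, the shape of that race, as a toy self-contained Go sketch (not the actual kubelet/plugin code; the names teardownPod and podIPs are invented for illustration):

package main

import (
	"fmt"
	"sync"
)

// podIPs stands in for the per-pod network state the plugin keeps
// (the container's IP, its OVS flows, etc.).
var (
	mu     sync.Mutex
	podIPs = map[string]string{"nodejs-example-5-deploy": "10.1.2.3"}
)

// teardownPod mimics a teardown that treats missing state as an error.
func teardownPod(pod string) error {
	mu.Lock()
	defer mu.Unlock()
	if _, ok := podIPs[pod]; !ok {
		return fmt.Errorf("could not find IP address for container %q", pod)
	}
	delete(podIPs, pod) // the first caller removes the state...
	return nil          // ...and would also remove the OVS flows here
}

func main() {
	var wg sync.WaitGroup
	// Two independent paths (the pod sync worker and the podKiller) both
	// decide the pod is dead; whichever calls teardown second gets the error.
	for _, who := range []string{"syncLoop/killPod", "podKiller/killPod"} {
		wg.Add(1)
		go func(who string) {
			defer wg.Done()
			if err := teardownPod("nodejs-example-5-deploy"); err != nil {
				fmt.Printf("%s: TeardownNetworkError: %v\n", who, err)
			}
		}(who)
	}
	wg.Wait()
}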

Mike, assuming that's the case, could you grab a new set of node logs that show the problem, ideally with loglevel 5?  I'd like to see if this race is indeed the case.

Comment 10 Mike Fiedler 2016-08-18 00:25:43 UTC
Created attachment 1191756 [details]
Node logs at loglevel=5

Comment 11 Mike Fiedler 2016-08-18 00:26:09 UTC
Node log attached. The error occurred at 20:16:00; lots of messages pop up in the log shortly after.

Corresponding event:

2016-08-17T20:16:00-04:00   2016-08-17T20:16:00-04:00   1         nodejs-example-5-deploy   Pod                   Warning   FailedSync   {kubelet ip-172-31-57-158.us-west-2.compute.internal}   Error syncing pod, skipping: failed to "TeardownNetwork" for "nodejs-example-5-deploy_svt" with TeardownNetworkError: "Failed to teardown network for pod \"eddfa32a-64d8-11e6-963c-02a86e5559df\" using network plugins \"redhat/openshift-ovs-multitenant\": Error running network teardown script: Could not find IP address for container a3c971a3e3082a661c5a3b9bd816f20fca241c3b62d9e4383f1686d847ab670c"

Comment 12 Andy Goldstein 2016-08-18 21:02:58 UTC
I looked at Mike's log from comment 11. The pod nodejs-example-5-deploy_svt completed successfully (phase=succeeded). So it looks like this is just TeardownNetwork getting called twice when it should only be called once.

Comment 13 Andy Goldstein 2016-08-18 21:14:47 UTC
One of the teardowns is coming from the podKiller, which exists to remove pods whose associated pod workers aren't working properly. This comes first. The second teardown appears to come from the pod worker itself. It looks like there's an issue where the podKiller is being too aggressive. I'll continue to dig into this.

Comment 14 Andy Goldstein 2016-08-18 22:35:22 UTC
Actually both teardowns are coming from the podKiller. The first comes from 'SyncLoop (REMOVE, "api"): "nodejs-example-5-deploy_svt(eddfa32a-64d8-11e6-963c-02a86e5559df)"', and the second (the one that fails) comes from 'kubelet.go:2609] SyncLoop (housekeeping)'.

The housekeeping invocation is a periodic function that gets its list of running pods from the kubelet's runtimeCache. So this is probably just a race (the runtimeCache is a bit out of date). I haven't confirmed this yet, but that's my current theory.

Comment 15 Andy Goldstein 2016-08-18 22:45:21 UTC
Actually I was incorrect. The first teardown comes from SyncLoop REMOVE. I'm not yet sure where the 2nd teardown comes from. There is a 3rd attempt to delete the pod (SyncLoop housekeeping), but that short-circuits because it can't find the pod (it's gone by this time).

Comment 16 Andy Goldstein 2016-08-19 02:07:59 UTC
The SDN teardown code needs to not return an error if it's called multiple times for the same pod. There's no guarantee that the kubelet will only call the network teardown function a single time.
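
Something along these lines would do it (an illustrative, self-contained sketch with invented names like getContainerIP and releaseIPAM, not the actual plugin code): if the pod's network state is already gone, log it, finish any remaining bookkeeping such as the IPAM release, and return nil.

package main

import (
	"fmt"
	"log"
)

// getContainerIP stands in for however the plugin looks up the pod's IP;
// ok is false when the pod's networking has already been torn down.
func getContainerIP(containerID string) (ip string, ok bool) {
	return "", false // pretend the state is already gone
}

// releaseIPAM is a placeholder for releasing the pod's IP allocation.
func releaseIPAM(containerID string) {}

// teardownPod is idempotent: a second call for the same pod is not an error.
func teardownPod(namespace, name, containerID string) error {
	ip, ok := getContainerIP(containerID)
	if !ok {
		log.Printf("network for pod %s/%s already torn down; only cleaning up IPAM", namespace, name)
		releaseIPAM(containerID)
		return nil
	}
	// Normal path: remove the OVS flows for ip, delete the veth, release IPAM.
	_ = ip
	releaseIPAM(containerID)
	return nil
}

func main() {
	if err := teardownPod("default", "router2-1-deploy", "a3c971a3e308"); err != nil {
		fmt.Println("TeardownNetworkError:", err)
		return
	}
	fmt.Println("teardown completed without error")
}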

Comment 20 Ben Bennett 2016-10-26 20:16:47 UTC
Is this fixed post-CNI?

Comment 21 Dan Williams 2016-10-27 15:27:06 UTC
(In reply to Ben Bennett from comment #20)
> Is this fixed post-CNI?

Not quite, but its symptoms will be different.

Comment 23 Ben Bennett 2016-11-07 14:05:20 UTC
PR https://github.com/openshift/origin/pull/11808 is open to resolve this.

Comment 24 Dan Winship 2016-11-07 14:06:31 UTC
to resolve it from Networking's POV. There will still be the problem that the kubelet is trying to tear down the same pod multiple times for some reason.

Comment 25 Andy Goldstein 2016-11-07 14:09:08 UTC
(In reply to Dan Winship from comment #24)
> to resolve it from Networking's POV. There will still be the problem that
> kubelet is trying to tear down the same pod multiple times for some reason

And that is acceptable based on how the kubelet is designed.

Comment 26 Ben Bennett 2016-11-07 16:04:35 UTC
*** Bug 1392255 has been marked as a duplicate of this bug. ***

Comment 27 Vladislav Walek 2016-11-10 15:21:50 UTC
Do you have a workaround for the issue described in comment #2 and comment #8?
The pod is failing with the same error message as in those comments.
Thanks

Comment 28 Dan Winship 2016-11-10 15:31:56 UTC
This bug is not about any specific cause of pod failure; it's just about the internal errors that happen when cleaning up after a pod has failed, for whatever reason. If you're seeing a reproducible pod deployment failure, file a new bug about that.

Comment 29 Troy Dawson 2016-12-12 20:00:42 UTC
This has been merged into OCP and is in OCP v3.4.0.35 or newer.

Comment 31 Meng Bo 2016-12-19 09:43:05 UTC
Tested on OCP v3.4.0.37

When deleting the deployer pod while the deployment is in progress, teardown is called only once, and no teardown script error appears in the node log.


# journalctl -lf -u atomic-openshift-node.service  | grep "teardown"
Dec 19 17:41:19 ose-node1.bmeng.local atomic-openshift-node[91652]: + action=teardown
Dec 19 17:41:19 ose-node1.bmeng.local atomic-openshift-node[91652]: I1219 17:41:19.901143   91652 pod_linux.go:505] teardown called on already-destroyed pod default/router2-1-deploy; only cleaning up IPAM

Comment 32 Dan Winship 2017-01-03 14:45:29 UTC
*** Bug 1333393 has been marked as a duplicate of this bug. ***

Comment 34 errata-xmlrpc 2017-01-18 12:51:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

