1322077 – TeardownNetworkError for deploy pod on all deployments in AWS scale cluster.

Summary: TeardownNetworkError for deploy pod on all deployments in AWS scale cluster.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.2.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Ravi Sankar
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1322697
Depends On:
Blocks:

Reported:	2016-03-29 18:12 UTC by Mike Fiedler
Modified:	2016-05-31 06:10 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-05-12 16:34:42 UTC
Target Upstream Version:
Embargoed:

Attachments
network debug script output (978 bytes, text/plain) 2016-03-30 17:22 UTC, Mike Fiedler	no flags
debug tarball output (553.38 KB, application/x-gzip) 2016-03-30 18:09 UTC, Mike Fiedler	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2016:1064	0	normal	SHIPPED_LIVE	Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update	2016-05-12 20:19:17 UTC

Description Mike Fiedler 2016-03-29 18:12:32 UTC

Description of problem:

Every deployment in the horizontal scale environment shows a TeardownNetworkError in kubectl get events for the deployment pod.

The environment is on AWS: 209 nodes, including 3 master/etcd nodes, 5 HAProxy routers, and 2 docker-registry pods, using the multitenant plugin.

The test case is a script that loads the cluster with 1000 projects, each with 4 running pods, 3 running services, and some other artifacts.

Version-Release number of selected component (if applicable): 3.2.0.8

How reproducible: Always (in this environment, at least)

Steps to Reproduce:
1. Use the environment described above (not sure about others).
2. Create a new project.
3. Create a new app using the django-example template.
4. Watch kubectl get events --all-namespaces -w as it deploys. The event below will be seen.

Actual results:

mff 2016-03-29 14:00:25 -0400 EDT 2016-03-29 14:00:25 -0400 EDT 1 django-example-2-deploy Pod Warning FailedSync {kubelet ip-172-31-3-96.us-west-2.compute.internal} Error syncing pod, skipping: failed to "TeardownNetwork" for "django-example-2-deploy_mff" with TeardownNetworkError: "Failed to teardown network for pod \"041b30b7-f5d8-11e5-ad39-02243e13a1d3\" using network plugins \"redhat/openshift-ovs-multitenant\": exit status 1"
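
To tally these events per pod or namespace across a large cluster, the message can be parsed mechanically. The following is a minimal sketch, assuming the event text always follows the wording of the single line quoted above (other kubelet versions may phrase it differently). The names parse_teardown_event and SAMPLE are illustrative and not part of any OpenShift tooling.

```python
import re

# Pattern modelled only on the one event line quoted in this report.
# The inner quotes in the kubelet message are backslash-escaped (\").
TEARDOWN_RE = re.compile(
    r'failed to "TeardownNetwork" for "(?P<pod>[^"_]+)_(?P<namespace>[^"]+)"'
    r'.*?network plugins \\"(?P<plugin>[^"\\]+)\\"'
    r': exit status (?P<status>\d+)'
)


def parse_teardown_event(line):
    """Return pod, namespace, plugin and exit status, or None if no match."""
    m = TEARDOWN_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["status"] = int(info["status"])
    return info


# The message portion of the event from "Actual results", copied verbatim.
SAMPLE = (
    r'Error syncing pod, skipping: failed to "TeardownNetwork" for '
    r'"django-example-2-deploy_mff" with TeardownNetworkError: '
    r'"Failed to teardown network for pod \"041b30b7-f5d8-11e5-ad39-02243e13a1d3\" '
    r'using network plugins \"redhat/openshift-ovs-multitenant\": exit status 1"'
)

print(parse_teardown_event(SAMPLE))
```

Splitting the quoted identifier on the underscore relies on pod names never containing underscores, so the first one separates pod name from namespace.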

Expected results:

No errors on normal deployments.

Additional info:

openshift_master_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14
osm_host_subnet_length=8
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
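
As a sanity check that these settings can accommodate a 209-node cluster, the address arithmetic can be worked out with Python's ipaddress module. This sketch assumes that osm_host_subnet_length is the number of host bits in each node's pod subnet and that openshift_master_portal_net is the service network; host_subnet_plan is a hypothetical helper name.

```python
import ipaddress

# Values copied from the inventory settings in this report.
cluster_net = ipaddress.ip_network("172.20.0.0/14")   # osm_cluster_network_cidr
portal_net = ipaddress.ip_network("172.24.0.0/14")    # openshift_master_portal_net
host_subnet_length = 8  # assumed meaning: host bits per node subnet


def host_subnet_plan(cluster, host_bits):
    """Return (prefix length of each node subnet, number of node subnets)."""
    node_prefix = cluster.max_prefixlen - host_bits
    if node_prefix < cluster.prefixlen:
        raise ValueError("host subnet length exceeds the cluster network size")
    return node_prefix, 2 ** (node_prefix - cluster.prefixlen)


node_prefix, subnet_count = host_subnet_plan(cluster_net, host_subnet_length)
print(f"each node gets a /{node_prefix}; {subnet_count} node subnets available")
print("pod and service ranges overlap:", cluster_net.overlaps(portal_net))
```

Under those assumptions each node receives a /24 out of 1024 available subnets, and the two /14 ranges do not overlap, so address exhaustion or a range collision is an unlikely explanation for the error.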

Comment 1 Mike Fiedler 2016-03-30 14:55:05 UTC

This is happening in the much smaller SVT reliability cluster (3.2.0.8) as well

Comment 2 Ben Bennett 2016-03-30 15:04:05 UTC

Can you please run the script to gather debugging information:
  https://docs.openshift.com/enterprise/3.1/admin_guide/sdn_troubleshooting.html#further-help

Comment 3 Mike Fiedler 2016-03-30 17:22:30 UTC

Created attachment 1141879
network debug script output

Comment 4 Mike Fiedler 2016-03-30 17:23:20 UTC

Ran the script on master of a smaller (3 node) cluster experiencing the problem.  Not much in the output - let me know what else would help.

Comment 5 Mike Fiedler 2016-03-30 18:07:22 UTC

Oops, I did not notice the tgz.  Attaching it now.  This was a fresh 3 node install of 3.2.0.8 on AWS.

Comment 6 Mike Fiedler 2016-03-30 18:09:32 UTC

Created attachment 1141894
debug tarball output

Comment 7 Ben Bennett 2016-04-01 14:49:18 UTC

*** Bug 1322697 has been marked as a duplicate of this bug. ***

Comment 8 Dan Williams 2016-04-01 16:02:48 UTC

Analysis... it looks like atomic-openshift-node was restarted at some point, while containers were still running.  Then on restart:

- the previous OVS setup either needed to be updated, was not complete, or there is a bug in openshift-sdn startup that fails to detect when the OVS config is OK

- the OVS bridge is removed and recreated, which obviously detaches all ports that were previously attached to the bridge

Mar 30 13:44:12 xxxxx.compute.internal ovs-vsctl[7405]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13

- then when atomic-openshift-node sees that it should kill/clean up pods that were previously running before the restart, it attempts to tear them down

Mar 30 13:58:48 xxxxx.compute.internal atomic-openshift-node[7277]: I0330 13:58:48.152764    7277 kubelet.go:2245] Killing unwanted pod "router-5-deploy"

- for some reason, ovs-vsctl returns an error even though --if-exists is given, which it shouldn't.  This doesn't normally happen when executing those commands manually.

Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9553]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port vethd6032cc
Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9559]: ovs|00001|vsctl|ERR|no row "vethd6032cc" in table Port


Was atomic-openshift-node restarted at some point here?  Perhaps it was previously configured for single-tenant plugin and then was switched to multi-tenant while pods were still active on the node?
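
The sequence described in this comment (bridge deleted and re-added, then a later del-port failing on a port that no longer exists) can be looked for in a node's journal with a small script. This is a diagnostic sketch only, not the upstream fix; the pattern is modelled solely on the three ovs-vsctl lines quoted above, and summarize_ovs_log is a hypothetical helper name.

```python
import re

# Matches journal lines of the form quoted above:
#   "... ovs-vsctl[PID]: ovs|SEQ|vsctl|LEVEL|MESSAGE"
LINE_RE = re.compile(
    r"ovs-vsctl\[\d+\]: ovs\|\d+\|vsctl\|(?P<level>\w+)\|(?P<msg>.*)$"
)


def summarize_ovs_log(lines):
    """Return (bridges deleted and re-added, ports whose removal failed)."""
    recreated, missing_ports = [], []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        level, msg = m.group("level"), m.group("msg")
        if level == "INFO":
            # A single command that deletes and re-adds the same bridge.
            br = re.search(r"del-br (\S+) -- add-br \1\b", msg)
            if br:
                recreated.append(br.group(1))
        elif level == "ERR":
            # The error reported when the port row is already gone.
            port = re.search(r'no row "([^"]+)" in table Port', msg)
            if port:
                missing_ports.append(port.group(1))
    return recreated, missing_ports


# The three log lines from this comment, copied verbatim.
LOG = [
    'Mar 30 13:44:12 xxxxx.compute.internal ovs-vsctl[7405]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13',
    'Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9553]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port vethd6032cc',
    'Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9559]: ovs|00001|vsctl|ERR|no row "vethd6032cc" in table Port',
]

print(summarize_ovs_log(LOG))  # (['br0'], ['vethd6032cc'])
```

A node whose journal shows a recreated bridge followed by missing-port errors matches the restart scenario; a node with missing-port errors but no bridge recreation would point to a different cause.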

Comment 9 Ravi Sankar 2016-04-01 18:19:49 UTC

Fixed in https://github.com/openshift/openshift-sdn/pull/277

Comment 10 Timothy St. Clair 2016-04-01 21:14:11 UTC

Hit this exact issue on a fresh install on our scale cluster.
No previous state.
0 deployments passed.

Comment 11 Mike Fiedler 2016-04-01 22:16:03 UTC

Re: comment 8 - no node restarts that I am aware of.  Definitely no re-configuration of the network plugin; it has always been multitenant.  I hit it again today on a fresh install with no node restarts.  Let me know if you want additional debug.sh output from that cluster.

Comment 12 Mike Fiedler 2016-04-01 22:17:43 UTC

Disregard comment 11, I see there's an upstream fix.

Comment 18 Ravi Sankar 2016-04-11 20:12:21 UTC

Marking back to Assigned, fix not merged in origin repo yet.

Comment 19 Dan Williams 2016-04-13 17:22:24 UTC

Merged to origin now; https://github.com/openshift/origin/pull/8468

Comment 20 Meng Bo 2016-04-14 03:09:55 UTC

Moving it back to MODIFIED until it is built into the latest OSE.

Comment 21 Troy Dawson 2016-04-15 16:33:22 UTC

This should be in atomic-openshift-3.2.0.16-1.git.0.738b760.el7 which has been built and readied for qe.

Comment 22 Meng Bo 2016-04-18 11:09:16 UTC

Verified on OpenShift v3.2.0.16 using the steps in bz#1322697.

The error no longer occurs.

Moving to VERIFIED.

Comment 24 errata-xmlrpc 2016-05-12 16:34:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064
