Bug 1322077 - TeardownNetworkError for deploy pod on all deployments in AWS scale cluster.
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.2.0
Hardware: x86_64 Linux
Priority: high
Severity: high
Assigned To: Ravi Sankar
QA Contact: Mike Fiedler
Duplicates: 1322697
Depends On:
Blocks:
Reported: 2016-03-29 14:12 EDT by Mike Fiedler
Modified: 2016-05-31 02:10 EDT
CC List: 10 users

Doc Type: Bug Fix
Last Closed: 2016-05-12 12:34:42 EDT
Type: Bug


Attachments
network debug script output (978 bytes, text/plain) - 2016-03-30 13:22 EDT, Mike Fiedler
debug tarball output (553.38 KB, application/x-gzip) - 2016-03-30 14:09 EDT, Mike Fiedler


External Trackers
Tracker ID: Red Hat Product Errata RHSA-2016:1064
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update
Last Updated: 2016-05-12 16:19:17 EDT

Description Mike Fiedler 2016-03-29 14:12:32 EDT
Description of problem:

Every deployment in the horizontal scale environment shows a TeardownNetworkError in kubectl get events for the deployment pod.

Environment is on AWS: 209 nodes, including 3 master/etcd nodes, 5 HAProxy routers, and 2 docker-registry pods. Using the multitenant plugin.

The test case is a script that loads the cluster with 1000 projects, each with 4 running pods, 3 running services, and some other artifacts.
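
(For illustration only, a minimal sketch of what such a load loop might look like, assuming the oc CLI and the django-example template; this is not the actual SVT script, and the project names are made up.)

  for i in $(seq 1 1000); do
      oc new-project "svt-load-$i"                    # hypothetical project name
      oc new-app django-example -n "svt-load-$i"      # assumes the template is installed
      # additional pods/services per project would be created here
  done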


Version-Release number of selected component (if applicable):  3.2.0.8


How reproducible: Always (in this environment, at least)


Steps to Reproduce:
1. In the environment described above (not verified in others).
2. Create a new project.
3. Create a new app from the django-example template.
4. Watch kubectl get events --all-namespaces -w as it deploys; the event below will be seen. (A command sketch follows this list.)
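
(A hypothetical command sequence for steps 2-4; the project name is an assumption, and it assumes the django-example template is available in the cluster.)

  oc new-project repro-mff                     # step 2: create a new project (name made up)
  oc new-app django-example -n repro-mff       # step 3: instantiate the django-example template
  kubectl get events --all-namespaces -w       # step 4: watch for the FailedSync/TeardownNetworkError event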

Actual results:

mff       2016-03-29 14:00:25 -0400 EDT   2016-03-29 14:00:25 -0400 EDT   1         django-example-2-deploy   Pod                 Warning   FailedSync   {kubelet ip-172-31-3-96.us-west-2.compute.internal}   Error syncing pod, skipping: failed to "TeardownNetwork" for "django-example-2-deploy_mff" with TeardownNetworkError: "Failed to teardown network for pod \"041b30b7-f5d8-11e5-ad39-02243e13a1d3\" using network plugins \"redhat/openshift-ovs-multitenant\": exit status 1"

Expected results:

No errors on normal deployments.


Additional info:

openshift_master_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14
osm_host_subnet_length=8
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
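
(For context, a lightly annotated reading of these settings; they appear to be openshift-ansible inventory variables, and the comments are my interpretation rather than text from this report.)

  openshift_master_portal_net=172.24.0.0/14                         # services (portal) network CIDR
  osm_cluster_network_cidr=172.20.0.0/14                            # pod overlay network CIDR
  osm_host_subnet_length=8                                          # host-subnet size allocated per node
  os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'     # SDN plugin selection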
Comment 1 Mike Fiedler 2016-03-30 10:55:05 EDT
This is happening in the much smaller SVT reliability cluster (3.2.0.8) as well.
Comment 2 Ben Bennett 2016-03-30 11:04:05 EDT
Can you please run the script to gather debugging information:
  https://docs.openshift.com/enterprise/3.1/admin_guide/sdn_troubleshooting.html#further-help
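
(For reference, a hypothetical invocation, assuming the script from that page is saved locally as debug.sh, the name comment 11 uses; the exact usage is an assumption.)

  chmod +x debug.sh
  ./debug.sh      # run on a master node; collects SDN state into a .tgz for attachment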
Comment 3 Mike Fiedler 2016-03-30 13:22 EDT
Created attachment 1141879 [details]
network debug script output
Comment 4 Mike Fiedler 2016-03-30 13:23:20 EDT
Ran the script on the master of a smaller (3-node) cluster experiencing the problem. Not much in the output; let me know what else would help.
Comment 5 Mike Fiedler 2016-03-30 14:07:22 EDT
Oops, did not notice the .tgz. Attaching. This was a fresh 3-node install of 3.2.0.8 on AWS.
Comment 6 Mike Fiedler 2016-03-30 14:09 EDT
Created attachment 1141894 [details]
debug tarball output
Comment 7 Ben Bennett 2016-04-01 10:49:18 EDT
*** Bug 1322697 has been marked as a duplicate of this bug. ***
Comment 8 Dan Williams 2016-04-01 12:02:48 EDT
Analysis... it looks like atomic-openshift-node was restarted at some point, while containers were still running.  Then on restart:

- the previous OVS setup either needed to be updated, was not complete, or there is a bug in openshift-sdn startup that fails to detect when the OVS config is OK

- the OVS bridge is removed and recreated, which obviously detaches all ports that were previously attached to the bridge

Mar 30 13:44:12 xxxxx.compute.internal ovs-vsctl[7405]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13

- then, when atomic-openshift-node sees that it should kill/clean up pods that were previously running before the restart, it attempts to tear them down

Mar 30 13:58:48 xxxxx.compute.internal atomic-openshift-node[7277]: I0330 13:58:48.152764    7277 kubelet.go:2245] Killing unwanted pod "router-5-deploy"

- for some reason, ovs-vsctl returns an error even when --if-exists is given, which it should not. This does not happen when executing those commands manually (see the check sketched after the log lines below).

Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9553]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port vethd6032cc
Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9559]: ovs|00001|vsctl|ERR|no row "vethd6032cc" in table Port
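
(A quick manual check of the expected --if-exists behavior; the port name below is made up.)

  ovs-vsctl --if-exists del-port veth-does-not-exist
  echo $?     # expected 0: with --if-exists, a missing port is a no-op
  ovs-vsctl del-port veth-does-not-exist
  echo $?     # non-zero: without --if-exists, deleting a missing port is an error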


Was atomic-openshift-node restarted at some point here? Perhaps it was previously configured for the single-tenant plugin and then switched to multi-tenant while pods were still active on the node?
Comment 9 Ravi Sankar 2016-04-01 14:19:49 EDT
Fixed in https://github.com/openshift/openshift-sdn/pull/277
Comment 10 Timothy St. Clair 2016-04-01 17:14:11 EDT
Hit this exact issue on a fresh install on our scale cluster.
No previous state.
0 deployments passed.
Comment 11 Mike Fiedler 2016-04-01 18:16:03 EDT
Re: comment 8 - no node restarts that I am aware of. Definitely no reconfiguration of the network plugin; always multitenant. I hit it again today on a fresh install with no node restarts. Let me know if you want additional debug.sh output from that cluster.
Comment 12 Mike Fiedler 2016-04-01 18:17:43 EDT
Disregard comment 11, I see there's an upstream fix.
Comment 18 Ravi Sankar 2016-04-11 16:12:21 EDT
Moving back to ASSIGNED; the fix is not merged in the origin repo yet.
Comment 19 Dan Williams 2016-04-13 13:22:24 EDT
Merged to origin now; https://github.com/openshift/origin/pull/8468
Comment 20 Meng Bo 2016-04-13 23:09:55 EDT
Moving it back to MODIFIED until it is built into the latest OSE.
Comment 21 Troy Dawson 2016-04-15 12:33:22 EDT
This should be in atomic-openshift-3.2.0.16-1.git.0.738b760.el7, which has been built and readied for QE.
Comment 22 Meng Bo 2016-04-18 07:09:16 EDT
Verified on openshift v3.2.0.16 with the steps in bz#1322697.

The error no longer occurs.

Moving to VERIFIED.
Comment 24 errata-xmlrpc 2016-05-12 12:34:42 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064
