Bug 1322077 - TeardownNetworkError for deploy pod on all deployments in AWS scale cluster.
Summary: TeardownNetworkError for deploy pod on all deployments in AWS scale cluster.
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking   
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ravi Sankar
QA Contact: Mike Fiedler
URL:
Whiteboard:
Keywords:
Duplicates: 1322697
Depends On:
Blocks:
 
Reported: 2016-03-29 18:12 UTC by Mike Fiedler
Modified: 2016-05-31 06:10 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-12 16:34:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
network debug script output (978 bytes, text/plain)
2016-03-30 17:22 UTC, Mike Fiedler
debug tarball output (553.38 KB, application/x-gzip)
2016-03-30 18:09 UTC, Mike Fiedler


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:1064 normal SHIPPED_LIVE Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update 2016-05-12 20:19:17 UTC

Description Mike Fiedler 2016-03-29 18:12:32 UTC
Description of problem:

Every deployment in the horizontal scale environment shows a TeardownNetworkError in kubectl get events for the deployment pod.

Environment is on AWS: 209 nodes, including 3 master/etcd, 5 HAProxy routers, and 2 docker-registry pods.  Using the multitenant plugin.

The test case is a script to load the cluster with 1000 projects with 4 running pods, 3 running services and some other artifacts.


Version-Release number of selected component (if applicable):  3.2.0.8


How reproducible: Always (in this environment, at least)


Steps to Reproduce:
1. Use the environment described above (not verified in others)
2. Create a new project
3. Create a new app using the django-example template
4. Watch kubectl get events --all-namespaces -w as it deploys.  The event below will be seen.

Actual results:

mff       2016-03-29 14:00:25 -0400 EDT   2016-03-29 14:00:25 -0400 EDT   1         django-example-2-deploy   Pod                 Warning   FailedSync   {kubelet ip-172-31-3-96.us-west-2.compute.internal}   Error syncing pod, skipping: failed to "TeardownNetwork" for "django-example-2-deploy_mff" with TeardownNetworkError: "Failed to teardown network for pod \"041b30b7-f5d8-11e5-ad39-02243e13a1d3\" using network plugins \"redhat/openshift-ovs-multitenant\": exit status 1"
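
For anyone trying to count how many deployments are affected, the event text above can be matched mechanically. The following is a minimal sketch, assuming the events have been saved as plain text lines; the function name and regular expression are illustrative only and are not part of kubectl or OpenShift.

```python
import re

# Pattern for the kubelet message text shown in the event above.
# The group names (pod, namespace) are illustrative, not a Kubernetes API.
TEARDOWN_RE = re.compile(
    r'failed to "TeardownNetwork" for "(?P<pod>[^"_]+)_(?P<namespace>[^"]+)" '
    r'with TeardownNetworkError'
)


def find_teardown_failures(event_lines):
    """Return (namespace, pod) pairs for lines reporting a TeardownNetworkError."""
    failures = []
    for line in event_lines:
        match = TEARDOWN_RE.search(line)
        if match:
            failures.append((match.group("namespace"), match.group("pod")))
    return failures


# Shortened copy of the message from the event above, used as sample input.
sample = (
    'Error syncing pod, skipping: failed to "TeardownNetwork" for '
    '"django-example-2-deploy_mff" with TeardownNetworkError'
)
print(find_teardown_failures([sample]))
```

The sketch relies on the pod_namespace form that the kubelet uses in this message, so it would need adjusting if that format differs in other releases.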

Expected results:

No errors on normal deployments.


Additional info:

openshift_master_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14
osm_host_subnet_length=8
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
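
As a sanity check on the addressing above, the subnet arithmetic can be worked through with Python's standard ipaddress module. This is a sketch under the assumption that osm_host_subnet_length is the number of bits given to each node's subnet, so 8 corresponds to a /24 per node; the variable names are illustrative.

```python
import ipaddress

cluster_network = ipaddress.ip_network("172.20.0.0/14")  # osm_cluster_network_cidr
portal_network = ipaddress.ip_network("172.24.0.0/14")   # openshift_master_portal_net
host_subnet_length = 8                                    # osm_host_subnet_length

# Assumption: each node gets a subnet with host_subnet_length host bits.
host_prefix = cluster_network.max_prefixlen - host_subnet_length
host_subnet_count = 2 ** (host_prefix - cluster_network.prefixlen)
addresses_per_host_subnet = 2 ** host_subnet_length

# First per-node subnet that could be carved out of the cluster network.
first_host_subnet = next(cluster_network.subnets(new_prefix=host_prefix))

print("per-node prefix length:", host_prefix)
print("available node subnets:", host_subnet_count)
print("addresses per node subnet:", addresses_per_host_subnet)
print("first node subnet:", first_host_subnet)
print("cluster and portal networks overlap:", cluster_network.overlaps(portal_network))
```

Under that assumption the cluster network has room for more node subnets than the 209 nodes in this environment, and it does not overlap the portal network, so address exhaustion or overlap is unlikely to be the cause here.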

Comment 1 Mike Fiedler 2016-03-30 14:55:05 UTC
This is happening in the much smaller SVT reliability cluster (3.2.0.8) as well.

Comment 2 Ben Bennett 2016-03-30 15:04:05 UTC
Can you please run the script to gather debugging information:
  https://docs.openshift.com/enterprise/3.1/admin_guide/sdn_troubleshooting.html#further-help

Comment 3 Mike Fiedler 2016-03-30 17:22 UTC
Created attachment 1141879
network debug script output

Comment 4 Mike Fiedler 2016-03-30 17:23:20 UTC
Ran the script on the master of a smaller (3 node) cluster experiencing the problem.  Not much in the output; let me know what else would help.

Comment 5 Mike Fiedler 2016-03-30 18:07:22 UTC
Oops, did not notice the tgz.  Attaching.  This was a fresh 3 node install of 3.2.0.8 on AWS.

Comment 6 Mike Fiedler 2016-03-30 18:09 UTC
Created attachment 1141894
debug tarball output

Comment 7 Ben Bennett 2016-04-01 14:49:18 UTC
*** Bug 1322697 has been marked as a duplicate of this bug. ***

Comment 8 Dan Williams 2016-04-01 16:02:48 UTC
Analysis... it looks like atomic-openshift-node was restarted at some point, while containers were still running.  Then on restart:

- the previous OVS setup either needed to be updated, was not complete, or there is a bug in openshift-sdn startup that fails to detect when the OVS config is OK

- the OVS bridge is removed and recreated, which obviously detaches all ports that were previously attached to the bridge

Mar 30 13:44:12 xxxxx.compute.internal ovs-vsctl[7405]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13

- then when atomic-openshift-node sees that it should kill/clean up pods that were previously running before the restart, it attempts to tear them down

Mar 30 13:58:48 xxxxx.compute.internal atomic-openshift-node[7277]: I0330 13:58:48.152764    7277 kubelet.go:2245] Killing unwanted pod "router-5-deploy"

- for some reason, ovs-vsctl returns an error even if --if-exists is given, which it shouldn't.  This doesn't normally happen when executing those commands manually.

Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9553]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port vethd6032cc
Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9559]: ovs|00001|vsctl|ERR|no row "vethd6032cc" in table Port


Was atomic-openshift-node restarted at some point here?  Perhaps it was previously configured for single-tenant plugin and then was switched to multi-tenant while pods were still active on the node?
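
To illustrate the failure mode in the log above: a teardown path that treats any non-zero exit status from the del-port call as fatal will report an error even though the port is already gone. The following is a minimal sketch of a more tolerant check. The helper name is hypothetical, the error pattern is taken only from the log line above, and this is not the change made in the actual fix.

```python
import re

# Matches the error text from the ovs-vsctl log line above and captures the port name.
MISSING_PORT_RE = re.compile(r'no row "(?P<port>[^"]+)" in table Port')


def del_port_succeeded(exit_status, stderr_text, port):
    """Hypothetical helper: decide whether a del-port attempt can be treated as done.

    A zero exit status is success. A non-zero status is also accepted when the
    error text says the named port has no row in the Port table, because the
    goal of teardown (the port no longer being attached) is already met.
    """
    if exit_status == 0:
        return True
    match = MISSING_PORT_RE.search(stderr_text)
    return bool(match) and match.group("port") == port


# Values taken from the log excerpt above: exit status 1 plus the missing-row error.
print(del_port_succeeded(1, 'no row "vethd6032cc" in table Port', "vethd6032cc"))
```

Any other error text still counts as a failure, so genuine OVS problems would not be hidden by this kind of check.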

Comment 9 Ravi Sankar 2016-04-01 18:19:49 UTC
Fixed in https://github.com/openshift/openshift-sdn/pull/277

Comment 10 Timothy St. Clair 2016-04-01 21:14:11 UTC
Hit this exact issue on a fresh install on our scale cluster.
No previous state.
0-deployments passed.

Comment 11 Mike Fiedler 2016-04-01 22:16:03 UTC
Re: comment 8 - no node restarts that I am aware of.  Definitely no re-configuration of the network plugin.  Always multitenant.  I hit it again today on a fresh install with no node restarts.  Let me know if you want additional debug.sh output from that cluster.

Comment 12 Mike Fiedler 2016-04-01 22:17:43 UTC
Disregard comment 11, I see there's an upstream fix.

Comment 18 Ravi Sankar 2016-04-11 20:12:21 UTC
Moving back to ASSIGNED; the fix is not merged in the origin repo yet.

Comment 19 Dan Williams 2016-04-13 17:22:24 UTC
Merged to origin now: https://github.com/openshift/origin/pull/8468

Comment 20 Meng Bo 2016-04-14 03:09:55 UTC
Moving it back to MODIFIED until it is built in the latest OSE.

Comment 21 Troy Dawson 2016-04-15 16:33:22 UTC
This should be in atomic-openshift-3.2.0.16-1.git.0.738b760.el7, which has been built and readied for QE.

Comment 22 Meng Bo 2016-04-18 11:09:16 UTC
Verified on OpenShift v3.2.0.16 with the steps in bz#1322697.

The error no longer occurs.

Moving to VERIFIED.

Comment 24 errata-xmlrpc 2016-05-12 16:34:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064

