Bug 1322077
Summary: TeardownNetworkError for deploy pod on all deployments in AWS scale cluster.

Product: OpenShift Container Platform
Component: Networking
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 3.2.0
Hardware: x86_64
OS: Linux
Type: Bug
Doc Type: Bug Fix
Reporter: Mike Fiedler <mifiedle>
Assignee: Ravi Sankar <rpenta>
QA Contact: Mike Fiedler <mifiedle>
CC: aos-bugs, bmeng, dcbw, haowang, jeder, mifiedle, rpenta, tdawson, tstclair, xtian
Last Closed: 2016-05-12 16:34:42 UTC
Description
Mike Fiedler, 2016-03-29 18:12:32 UTC
This is happening in the much smaller SVT reliability cluster (3.2.0.8) as well.

Can you please run the script to gather debugging information: https://docs.openshift.com/enterprise/3.1/admin_guide/sdn_troubleshooting.html#further-help

Created attachment 1141879 [details]
network debug script output

Ran the script on the master of a smaller (3 node) cluster experiencing the problem. Not much in the output - let me know what else would help.

Oops, did not notice the tgz. Attaching. This was a fresh 3 node install of 3.2.0.8 on AWS.

Created attachment 1141894 [details]
debug tarball output
*** Bug 1322697 has been marked as a duplicate of this bug. ***

Analysis: it looks like atomic-openshift-node was restarted at some point while containers were still running. Then on restart:

- The previous OVS setup either needed to be updated, was not complete, or there is a bug in openshift-sdn startup that fails to detect when the OVS config is OK.

- The OVS bridge is removed and recreated, which obviously detaches all ports that were previously attached to the bridge:

  Mar 30 13:44:12 xxxxx.compute.internal ovs-vsctl[7405]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13

- Then, when atomic-openshift-node sees that it should kill/clean up pods that were running before the restart, it attempts to tear them down:

  Mar 30 13:58:48 xxxxx.compute.internal atomic-openshift-node[7277]: I0330 13:58:48.152764 7277 kubelet.go:2245] Killing unwanted pod "router-5-deploy"

- For some reason, ovs-vsctl returns an error even when --if-exists is given, which it should not. This does not happen when executing those commands manually:

  Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9553]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port vethd6032cc
  Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9559]: ovs|00001|vsctl|ERR|no row "vethd6032cc" in table Port

Was atomic-openshift-node restarted at some point here? Perhaps it was previously configured for the single-tenant plugin and then switched to multi-tenant while pods were still active on the node?

Hit this exact issue on a fresh install on our scale cluster. No previous state. 0 deployments passed.

Re: comment 8 - no node restarts I am aware of. Definitely no re-configuration of the network plugin. Always multitenant.

I hit it again today on a fresh install with no node restarts. Let me know if you want additional debug.sh output from that cluster.

Disregard comment 11, I see there's an upstream fix.
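The failure mode above boils down to ovs-vsctl reporting a "no row ... in table Port" error during pod teardown even though `--if-exists` was passed, which the caller then treats as a fatal TeardownNetworkError. A minimal sketch of a teardown wrapper that tolerates that specific error follows. This is not the actual openshift-sdn code: the `del_ovs_port` function, the `OVS_VSCTL` override variable, and the error-string matching are assumptions for illustration only; `br0` is the bridge name shown in the logs above.

```shell
#!/bin/sh
# Hedged sketch: idempotent OVS port teardown. "--if-exists" should make
# del-port a no-op when the port row is already gone, but per this bug
# ovs-vsctl can still emit ERR 'no row ... in table Port', so we treat
# that particular failure as success (the port is already deleted).
# OVS_VSCTL is overridable so the logic can be exercised without OVS.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"

del_ovs_port() {
    port="$1"
    err="$(mktemp)"
    if "$OVS_VSCTL" --if-exists del-port br0 "$port" 2>"$err"; then
        rm -f "$err"
        return 0
    fi
    # Port row already gone: not a real teardown failure.
    if grep -q 'no row' "$err"; then
        rm -f "$err"
        return 0
    fi
    # Any other ovs-vsctl failure is still surfaced to the caller.
    cat "$err" >&2
    rm -f "$err"
    return 1
}

# Example (requires OVS installed):
# command -v ovs-vsctl >/dev/null && del_ovs_port vethd6032cc
```

With a wrapper like this, a race between OVS bridge recreation and pod cleanup degrades to a harmless no-op instead of failing every deployment's teardown.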
Marking back to ASSIGNED; fix not merged in the origin repo yet.

Merged to origin now: https://github.com/openshift/origin/pull/8468. Moving it back to MODIFIED until it is built into the latest OSE.

This should be in atomic-openshift-3.2.0.16-1.git.0.738b760.el7, which has been built and readied for QE.

Verified on openshift v3.2.0.16 with the steps in bz#1322697. Did not hit the error. Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064