Bug 1322130

Summary: SetupNetworkError: Error fetching VNID for namespace causes a set of pods to hang permanently
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Networking
Assignee: Ravi Sankar <rpenta>
Status: CLOSED DUPLICATE
QA Contact: Mike Fiedler <mifiedle>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: aos-bugs, bbennett, mifiedle, xtian
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-04-08 15:29:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Mike Fiedler 2016-03-29 21:10:15 UTC
Description of problem:

Environment is on AWS: 209 nodes including 3 master/etcd, 5 HAProxy routers, 2 docker-registry pods. Using the multitenant plugin (redhat/openshift-ovs-multitenant).

The test case is a script that loads the cluster with 1000 projects, each with 4 running pods, 3 running services and some other artifacts.

Project and pod creation up to project 373 was successful. Projects 373-379 had deployer pod creation errors with the reason "Error fetching VNID for namespace". The pods are stuck in ContainerCreating indefinitely. After 379, the rest (to 1000) were successful. Complete errors below.

There's no way to re-deploy the DC; the project has to be deleted.

Version-Release number of selected component (if applicable):


How reproducible: Rare. I will run again and see if I can get some pods into this state. Felt I should report it.


Steps to Reproduce:
1.  In env described above, run test to create 1000 projects x 4 deployments
2.  When the run completed, 41 out of 4000 pods were stuck
3.  Re-deploying hits the problem again.  The project must be deleted.


Actual results:
 1h		1h		1	{default-scheduler }							Normal		Scheduled	Successfully assigned deploymentconfig1-1-deploy to ip-172-31-41-184.us-west-2.compute.internal
  1h		1h		2	{kubelet ip-172-31-41-184.us-west-2.compute.internal}			Warning		FailedSync	Error syncing pod, skipping: API error (500): Unknown device 608171fb3e8a5b7e38b14d7fee83151b7484ad2c999c623092013097f9310f2a

  1h	1h	1	{kubelet ip-172-31-41-184.us-west-2.compute.internal}		Warning	FailedSync	Error syncing pod, skipping: API error (500): Unknown device 0da2b739109b277b832c8801638a240ade69d6b6dee0d26e7e086dcca8785f35

  39m	39m	2	{kubelet ip-172-31-41-184.us-west-2.compute.internal}		Warning	FailedSync	Error syncing pod, skipping: API error (500): Unknown device 5fa10a89c9176b7866ffc9e6511c5526f4b2cf5f8eb2be1419691bd9403de869

  18m	18m	1	{kubelet ip-172-31-41-184.us-west-2.compute.internal}		Warning	FailedSync	Error syncing pod, skipping: API error (500): Unknown device bc157f39813fe82a13062d9365b75b59da02d703a9865137597e71185133ec3d

  2m	2m	1	{kubelet ip-172-31-41-184.us-west-2.compute.internal}		Warning	FailedSync	Error syncing pod, skipping: API error (500): Unknown device 36d4efbdc0d9ac355207eceaf8641b6140573a369bbb0d5f5ee859abaf6db22b

  1h	0s	925	{kubelet ip-172-31-41-184.us-west-2.compute.internal}		Warning	FailedSync	Error syncing pod, skipping: failed to "SetupNetwork" for "deploymentconfig1-1-deploy_run-a-377" with SetupNetworkError: "Failed to setup network for pod \"deploymentconfig1-1-deploy_run-a-377(ecc8ce02-f5e1-11e5-ad39-02243e13a1d3)\" using network plugins \"redhat/openshift-ovs-multitenant\": Error fetching VNID for namespace: run-a-377; Skipping pod"


Expected results:

Deployer pod runs successfully.


Additional info:

Comment 1 Ben Bennett 2016-03-30 13:33:15 UTC
Can you please run the script at:
  https://docs.openshift.com/enterprise/3.1/admin_guide/sdn_troubleshooting.html#further-help

It will gather all the debug info we need.

Comment 3 Ravi Sankar 2016-04-07 19:34:03 UTC
The root cause of this issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1323279

The node had not yet received the VNID updates when pod setup tried to fetch the VNID, so the fetch failed.
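
For illustration only: the race described here (pod setup asks for a VNID before the node has received it) can be sketched as a lookup that waits briefly for the update instead of failing on the first miss. This is a minimal Python sketch, not the OpenShift code and not necessarily how bug 1323279 was fixed. The names `fetch_vnid`, `VNIDNotFound` and `cache` are hypothetical; `cache` stands in for whatever namespace-to-VNID map the node keeps up to date in the background.

```python
import time


class VNIDNotFound(Exception):
    """Raised when no VNID is known for a namespace after all attempts."""


def fetch_vnid(cache, namespace, attempts=5, delay=0.1, sleep=time.sleep):
    """Look up the VNID for `namespace`, retrying a bounded number of times.

    `cache` is a hypothetical dict-like map of namespace -> VNID that some
    background watcher is assumed to fill in as updates reach the node.
    """
    for attempt in range(attempts):
        vnid = cache.get(namespace)
        # Compare against None explicitly: 0 is a legitimate VNID value.
        if vnid is not None:
            return vnid
        if attempt < attempts - 1:
            sleep(delay)  # give the pending update time to arrive
    raise VNIDNotFound(f"Error fetching VNID for namespace: {namespace}")
```

With a single lookup and no wait, any pod scheduled before the update arrives fails immediately, which matches the symptom seen for projects 373-379. A bounded retry trades a short setup delay for tolerance of late updates.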

Comment 4 Ben Bennett 2016-04-08 15:29:29 UTC

*** This bug has been marked as a duplicate of bug 1323279 ***