Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1323279

Summary:	New project networking is not setup properly
Product:	OpenShift Container Platform	Reporter:	Vikas Laad <vlaad>
Component:	Networking	Assignee:	Ravi Sankar <rpenta>
Status:	CLOSED ERRATA	QA Contact:	Meng Bo <bmeng>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.2.0	CC:	aos-bugs, bbennett, mifiedle, tdawson, vlaad
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-09-27 09:37:22 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Vikas Laad 2016-04-01 17:07:55 UTC

Description of problem:
Running OSE reliability tests for long time running into this problem. After project creation networking was not setup properly.

Version-Release number of selected component (if applicable):


How reproducible:
Randomly

Steps to Reproduce:
Running OSE reliability tests for long time. Contact vlaad for running them. 

Actual results:
These logs are taken immediately after project creation

Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.785395   77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/builder is not up to date, skipping dockercfg creation
Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.853244   77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/deployer is not up to date, skipping dockercfg creation
Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.905873   77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/default is not up to date, skipping dockercfg creation
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.287836   77411 replication_controller.go:434] Too few "dancer-mysql-example-usr-2877-1-3027-4"/"database-1" replicas, need 1, creating 1
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.298034   77411 factory.go:412] Attempting to bind database-1-j99la to ip-172-31-58-187.us-west-2.compute.internal
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.299005   77411 event.go:211] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"dancer-mysql-example-usr-2877-1-3027-4", Name:"database-1", UID:"f7b3bd60-f775-11e5-9bf3-02cbd31a1cb3", APIVersion:"v1", ResourceVersion:"337555", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: database-1-j99la
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.453145   77411 trace.go:57] Trace "List /oapi/v1/namespaces/openshift/templates" (started 2016-03-31 15:22:58.950150371 -0400 EDT):
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [4.532µs] [4.532µs] About to List from storage
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [25.92368ms] [25.919148ms] Listing from storage done
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [26.364835ms] [441.155µs] Self-linking done
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [502.950673ms] [476.585838ms] Writing http response done (75 items)
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [502.951951ms] [1.278µs] END
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-node: E0331 15:22:59.912468  121292 vnids.go:292] Error fetching Net ID for namespace: dancer-mysql-example-usr-2877-2-3027-5, skipped serviceEvent: &{ADDED {dancer-mysql-example dancer-mysql-example-usr-2877-2-3027-5 fb014a9f-f775-11e5-9bf3-02cbd31a1cb3 172.26.55.230 [{TCP 8080}]}}

Also see the attached logs.

Expected results:
Project should be setup properly.

Additional info:

Comment 1 Ravi Sankar 2016-04-07 18:59:55 UTC

I was able to simulate the error.

We expect this behavior:
(1) On Master: Project gets created
(2) On Master: Asynchronously, SDN watches for new projects and it creates NetNamespace with VNID info
(3) On Master: Service gets created in the project

(4) On Node: Asynchronously, SDN watches for new netnamespaces and it maintains VNIDMap (key: namespace, value: VNID)
(5) On Node: Asynchronously, SDN watches for new services and it adds OpenFlow  service rules. It will fetch VNID by doing a lookup on VNIDMAP [populated in (4)]

Since (4) and (5) are async ops. If (5) occurs before (4) then VNID will not present in the VNIDMap and it fails to add service rules on the node.
This is the issue. For some reason, either master is overloaded or too many projects are created in a short time, node didn't get the vnid updates before service/pod looks up the value.

Comment 2 Ben Bennett 2016-04-08 15:29:29 UTC

*** Bug 1322130 has been marked as a duplicate of this bug. ***

Comment 3 Ben Bennett 2016-04-14 17:58:41 UTC

https://github.com/openshift/openshift-sdn/pull/293

Comment 4 Ben Bennett 2016-04-21 14:02:10 UTC

Vikas has been unable to reproduce this lately, and I'm reluctant to churn the code at this point.  So we'll push this fix into the next release, or if it rears it's ugly head again we'll do a hot fix.

Comment 5 Ravi Sankar 2016-05-05 17:56:11 UTC

Merged in https://github.com/openshift/origin/pull/8629

Comment 6 Meng Bo 2016-05-06 10:14:10 UTC

@vlaad, I cannot re-create this issue with old version of OSE.

I was using two `yes > /dev/null &` processes to eat up the two cores of the CPU.
And create 100 project with 1 service concurrently. 

But I cannot see the vnids.go error in the logs.

Can you help check this bug in your env?

Comment 7 Vikas Laad 2016-05-06 13:53:51 UTC

I was also not able to re-create this issue in next are runs, its really random. I have  started another will let you know how it goes.

Comment 8 Meng Bo 2016-05-25 10:14:36 UTC

@vlaad

Any progress on this bug? Can we close it if both of us cannot reproduce it anymore?

Comment 9 Vikas Laad 2016-05-25 13:15:46 UTC

Yes, in the followup runs also I was not able to reproduce it. Lets close this bug.

Comment 10 Meng Bo 2016-05-26 02:27:58 UTC

Move the bug to verified.

Comment 12 errata-xmlrpc 2016-09-27 09:37:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933