Bug 1323279 - New project networking is not set up properly
Summary: New project networking is not set up properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Ravi Sankar
QA Contact: Meng Bo
URL:
Whiteboard:
Duplicates: 1322130 (view as bug list)
Depends On:
Blocks:
 
Reported: 2016-04-01 17:07 UTC by Vikas Laad
Modified: 2016-09-27 09:37 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-27 09:37:22 UTC
Target Upstream Version:


Attachments


Links
System ID: Red Hat Product Errata RHBA-2016:1933
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Container Platform 3.3 Release Advisory
Last Updated: 2016-09-27 13:24:36 UTC

Description Vikas Laad 2016-04-01 17:07:55 UTC
Description of problem:
Running OSE reliability tests for a long time, we ran into this problem: after project creation, networking was not set up properly.

Version-Release number of selected component (if applicable):


How reproducible:
Randomly

Steps to Reproduce:
Run the OSE reliability tests for a long time. Contact vlaad@redhat.com for help with running them.

Actual results:
These logs were captured immediately after project creation:

Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.785395   77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/builder is not up to date, skipping dockercfg creation
Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.853244   77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/deployer is not up to date, skipping dockercfg creation
Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.905873   77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/default is not up to date, skipping dockercfg creation
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.287836   77411 replication_controller.go:434] Too few "dancer-mysql-example-usr-2877-1-3027-4"/"database-1" replicas, need 1, creating 1
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.298034   77411 factory.go:412] Attempting to bind database-1-j99la to ip-172-31-58-187.us-west-2.compute.internal
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.299005   77411 event.go:211] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"dancer-mysql-example-usr-2877-1-3027-4", Name:"database-1", UID:"f7b3bd60-f775-11e5-9bf3-02cbd31a1cb3", APIVersion:"v1", ResourceVersion:"337555", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: database-1-j99la
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.453145   77411 trace.go:57] Trace "List /oapi/v1/namespaces/openshift/templates" (started 2016-03-31 15:22:58.950150371 -0400 EDT):
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [4.532µs] [4.532µs] About to List from storage
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [25.92368ms] [25.919148ms] Listing from storage done
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [26.364835ms] [441.155µs] Self-linking done
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [502.950673ms] [476.585838ms] Writing http response done (75 items)
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [502.951951ms] [1.278µs] END
Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-node: E0331 15:22:59.912468  121292 vnids.go:292] Error fetching Net ID for namespace: dancer-mysql-example-usr-2877-2-3027-5, skipped serviceEvent: &{ADDED {dancer-mysql-example dancer-mysql-example-usr-2877-2-3027-5 fb014a9f-f775-11e5-9bf3-02cbd31a1cb3 172.26.55.230 [{TCP 8080}]}}

Also see the attached logs.

Expected results:
The project should be set up properly.

Additional info:

Comment 1 Ravi Sankar 2016-04-07 18:59:55 UTC
I was able to simulate the error.

We expect this behavior:
(1) On Master: Project gets created
(2) On Master: Asynchronously, SDN watches for new projects and it creates NetNamespace with VNID info
(3) On Master: Service gets created in the project

(4) On Node: Asynchronously, SDN watches for new netnamespaces and it maintains VNIDMap (key: namespace, value: VNID)
(5) On Node: Asynchronously, SDN watches for new services and it adds OpenFlow service rules. It fetches the VNID by doing a lookup on the VNIDMap [populated in (4)]

Since (4) and (5) are async ops, if (5) occurs before (4) the VNID will not be present in the VNIDMap and the node fails to add the service rules.
This is the issue. For some reason (either the master is overloaded or too many projects are created in a short time), the node didn't get the VNID update before the service/pod looked up the value.
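
One way to close that window on the node side is to make the VNID lookup block and retry until the watch in (4) has delivered the NetNamespace, instead of dropping the service event. A minimal sketch of that idea; the names, timeout and structure here are illustrative only, not the actual openshift-sdn code:

// Illustrative only: a retrying VNID lookup so a service event that arrives
// before the NetNamespace watch update is not dropped.
package vnids

import (
	"fmt"
	"sync"
	"time"
)

type vnidMap struct {
	mu  sync.Mutex
	ids map[string]uint32 // key: namespace, value: VNID (filled by the NetNamespace watch)
}

func (m *vnidMap) getVNID(namespace string) (uint32, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if id, ok := m.ids[namespace]; ok {
		return id, nil
	}
	return 0, fmt.Errorf("no VNID for namespace %q", namespace)
}

// waitAndGetVNID polls until the NetNamespace watch has populated the map,
// or gives up after a timeout; handlers for service/pod events would call
// this instead of a bare getVNID.
func (m *vnidMap) waitAndGetVNID(namespace string) (uint32, error) {
	deadline := time.Now().Add(5 * time.Second) // arbitrary bound for the sketch
	for {
		id, err := m.getVNID(namespace)
		if err == nil {
			return id, nil
		}
		if time.Now().After(deadline) {
			return 0, fmt.Errorf("timed out waiting for VNID of namespace %q", namespace)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

With something like this, step (5) would only fail if the NetNamespace never shows up within the timeout, instead of failing whenever the service event races ahead of the VNID update.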

Comment 2 Ben Bennett 2016-04-08 15:29:29 UTC
*** Bug 1322130 has been marked as a duplicate of this bug. ***

Comment 4 Ben Bennett 2016-04-21 14:02:10 UTC
Vikas has been unable to reproduce this lately, and I'm reluctant to churn the code at this point. So we'll push this fix into the next release, or if it rears its ugly head again we'll do a hotfix.

Comment 5 Ravi Sankar 2016-05-05 17:56:11 UTC
Merged in https://github.com/openshift/origin/pull/8629

Comment 6 Meng Bo 2016-05-06 10:14:10 UTC
@vlaad, I cannot re-create this issue with the old version of OSE.

I was using two `yes > /dev/null &` processes to eat up the two cores of the CPU, and creating 100 projects with 1 service each, concurrently.
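
Roughly, that concurrent project/service creation can be driven with something like the following sketch (a hypothetical driver, not the exact commands I ran; it assumes `oc` is already logged in to the test cluster, and the project names and the dancer-mysql-example template from the logs above are only examples):

// Hypothetical load driver: create many projects concurrently, each with a
// templated app that defines a service, so the node has to program service
// rules for brand-new VNIDs while the master is busy.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"sync"
)

func main() {
	const projects = 100
	var wg sync.WaitGroup
	for i := 0; i < projects; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("vnid-race-test-%d", i) // example project name
			if out, err := exec.Command("oc", "new-project", name).CombinedOutput(); err != nil {
				log.Printf("new-project %s: %v: %s", name, err, out)
				return
			}
			// Instantiate a template that includes a Service, e.g. the
			// dancer-mysql-example template seen in the master logs.
			if out, err := exec.Command("oc", "new-app", "dancer-mysql-example", "-n", name).CombinedOutput(); err != nil {
				log.Printf("new-app in %s: %v: %s", name, err, out)
			}
		}(i)
	}
	wg.Wait()
}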

But I cannot see the vnids.go error in the logs.

Can you help check this bug in your env?

Comment 7 Vikas Laad 2016-05-06 13:53:51 UTC
I was also not able to re-create this issue in the next few runs; it's really random. I have started another run and will let you know how it goes.

Comment 8 Meng Bo 2016-05-25 10:14:36 UTC
@vlaad

Any progress on this bug? Can we close it if both of us cannot reproduce it anymore?

Comment 9 Vikas Laad 2016-05-25 13:15:46 UTC
Yes, I was not able to reproduce it in the follow-up runs either. Let's close this bug.

Comment 10 Meng Bo 2016-05-26 02:27:58 UTC
Moving the bug to VERIFIED.

Comment 12 errata-xmlrpc 2016-09-27 09:37:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1933

