Description of problem: Running OSE reliability tests for long time running into this problem. After project creation networking was not setup properly. Version-Release number of selected component (if applicable): How reproducible: Randomly Steps to Reproduce: Running OSE reliability tests for long time. Contact vlaad for running them. Actual results: These logs are taken immediately after project creation Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.785395 77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/builder is not up to date, skipping dockercfg creation Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.853244 77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/deployer is not up to date, skipping dockercfg creation Mar 31 15:22:56 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:56.905873 77411 create_dockercfg_secrets.go:161] View of ServiceAccount dancer-mysql-example-usr-2877-2-3027-5/default is not up to date, skipping dockercfg creation Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.287836 77411 replication_controller.go:434] Too few "dancer-mysql-example-usr-2877-1-3027-4"/"database-1" replicas, need 1, creating 1 Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.298034 77411 factory.go:412] Attempting to bind database-1-j99la to ip-172-31-58-187.us-west-2.compute.internal Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.299005 77411 event.go:211] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"dancer-mysql-example-usr-2877-1-3027-4", Name:"database-1", UID:"f7b3bd60-f775-11e5-9bf3-02cbd31a1cb3", APIVersion:"v1", ResourceVersion:"337555", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: database-1-j99la Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: I0331 15:22:59.453145 77411 trace.go:57] Trace "List /oapi/v1/namespaces/openshift/templates" (started 2016-03-31 15:22:58.950150371 -0400 EDT): Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [4.532µs] [4.532µs] About to List from storage Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [25.92368ms] [25.919148ms] Listing from storage done Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [26.364835ms] [441.155µs] Self-linking done Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [502.950673ms] [476.585838ms] Writing http response done (75 items) Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-master: [502.951951ms] [1.278µs] END Mar 31 15:22:59 ip-172-31-58-185 atomic-openshift-node: E0331 15:22:59.912468 121292 vnids.go:292] Error fetching Net ID for namespace: dancer-mysql-example-usr-2877-2-3027-5, skipped serviceEvent: &{ADDED {dancer-mysql-example dancer-mysql-example-usr-2877-2-3027-5 fb014a9f-f775-11e5-9bf3-02cbd31a1cb3 172.26.55.230 [{TCP 8080}]}} Also see the attached logs. Expected results: Project should be setup properly. Additional info:
I was able to simulate the error. We expect this behavior: (1) On Master: Project gets created (2) On Master: Asynchronously, SDN watches for new projects and it creates NetNamespace with VNID info (3) On Master: Service gets created in the project (4) On Node: Asynchronously, SDN watches for new netnamespaces and it maintains VNIDMap (key: namespace, value: VNID) (5) On Node: Asynchronously, SDN watches for new services and it adds OpenFlow service rules. It will fetch VNID by doing a lookup on VNIDMAP [populated in (4)] Since (4) and (5) are async ops. If (5) occurs before (4) then VNID will not present in the VNIDMap and it fails to add service rules on the node. This is the issue. For some reason, either master is overloaded or too many projects are created in a short time, node didn't get the vnid updates before service/pod looks up the value.
*** Bug 1322130 has been marked as a duplicate of this bug. ***
https://github.com/openshift/openshift-sdn/pull/293
Vikas has been unable to reproduce this lately, and I'm reluctant to churn the code at this point. So we'll push this fix into the next release, or if it rears it's ugly head again we'll do a hot fix.
Merged in https://github.com/openshift/origin/pull/8629
@vlaad, I cannot re-create this issue with old version of OSE. I was using two `yes > /dev/null &` processes to eat up the two cores of the CPU. And create 100 project with 1 service concurrently. But I cannot see the vnids.go error in the logs. Can you help check this bug in your env?
I was also not able to re-create this issue in next are runs, its really random. I have started another will let you know how it goes.
@vlaad Any progress on this bug? Can we close it if both of us cannot reproduce it anymore?
Yes, in the followup runs also I was not able to reproduce it. Lets close this bug.
Move the bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1933