Description of problem:
When the pod and svc are running, the endpoints are always delayed in getting their value; it seems to take about 2 minutes.

Version-Release number of selected component (if applicable):
starter-us-east-2 3.6.173.0.5

How reproducible:
Always

Steps to Reproduce:
1. Create the pod/svc and make sure the pods are running:
   oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
2. oc get endpoints
3. Wait about 2 minutes
4. oc get endpoints

Actual results:
Step 2 does not return any endpoint addresses.
Step 4:
oc get endpoints
NAME           ENDPOINTS                             AGE
test-service   10.128.35.4:8080,10.130.28.46:8080    13m

Expected results:
The endpoints should get the correct value promptly.

Additional info:
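A minimal timing sketch along the lines of the steps above, in case it helps anyone re-check the delay. The test-service name comes from the actual results above; the polling interval and 600s timeout are arbitrary assumptions:

#!/bin/bash
# Rough sketch: measure how long the service's endpoints take to be populated
# after creating the template above. Service name follows the output above;
# the 600s timeout and 2s poll interval are assumptions.
set -e
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
start=$(date +%s)
while true; do
    addrs=$(oc get endpoints test-service \
        -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null || true)
    if [ -n "$addrs" ]; then
        echo "endpoints populated after $(( $(date +%s) - start ))s: $addrs"
        break
    fi
    if [ $(( $(date +%s) - start )) -ge 600 ]; then
        echo "gave up after 600s with empty endpoints" >&2
        exit 1
    fi
    sleep 2
done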
Hi folks, do we have an ETA on this?
Any update on this?
Adding email conversation for visibility.

Clayton Coleman (21:04 UTC):
Quick update - endpoints controller was hopelessly backlogged, so we bumped the qps to see if it clears the queue. Real fix will be coming in 3.7 but not sure on backport. Size of cluster related.

Clayton Coleman (23:40 UTC):
Looks like the config loop reset this an hour ago so we're back to the lower rate (but we hadn't burned through the backlog yet, so it's not the end of the world). We may need to reapply the settings Eric had in place.
I set qps=800 burst=1200 in the config loop for starter-us-east-1 ONLY. It is rolling out, and we will see what comes of this change.
After the change we are doing about 24 PUTs to endpoints per second, versus 9 previously.
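For reference, a rough way to sample that PUT rate from the API server's /metrics endpoint. The apiserver_request_count metric name and label set are assumptions for a Kubernetes 1.6-era apiserver, not taken from this bug:

#!/bin/bash
# Hedged sketch: estimate PUTs/sec against the endpoints resource by sampling
# the (assumed) apiserver_request_count counter twice, 60s apart. Metric and
# label names may differ on other versions.
sample() {
    oc get --raw /metrics \
        | awk '/^apiserver_request_count/ && /resource="endpoints"/ && /verb="PUT"/ {sum += $NF} END {print sum+0}'
}
first=$(sample)
sleep 60
second=$(sample)
echo "endpoints PUT rate: $(( (second - first) / 60 )) req/sec"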
On starter-us-east-2 (where openshift.io lives), it appears this workaround brings endpoint updates down to ~30-45 seconds.
We have made a code change that we hope will alleviate this issue by reducing the number of endpoint updates occurring on the cluster.

The upstream Kubernetes PR is merged: https://github.com/kubernetes/kubernetes/pull/50934

We're still working on an Origin/master PR: https://github.com/openshift/origin/pull/15888
and an Origin PR for 3.6: https://github.com/openshift/origin/pull/15889

We'll re-assess once the new code has been deployed to the starter clusters.
When are we expecting this new code to be deployed to the starter clusters?
If nothing changes, we'll get it when the starter clusters go to 3.7. In the meantime, we're building our confidence in the fix through all the testing of the 3.7 branch and keeping an eye on how well the queue-depth workaround is holding up on the starter clusters. We haven't made a firm decision yet on whether to try to push the code change into the 3.6 branch and roll an update out prior to 3.7. It will probably come down to how well the workaround is holding and our confidence in the fix. In other words, we're balancing the severity of the problem against the risk of pushing out the fix.

Our recent testing showed that endpoints were generally updated within 30 seconds. A test that we just ran got updates in about 12 seconds. If this isn't representative of your experience, please let us know.
Which OCP version will contain the fix? Could you help set the Target Release? Thanks.
v3.7.0-0.106.0 and v3.6.173.0.35-1
In starter-us-east-2:

# oc version
oc v3.6.173.0.37

# rpm -q atomic-openshift-master
atomic-openshift-master-3.6.173.0.37-1.git.0.fd828e7.el7.x86_64

# oc get endpoints --all-namespaces | wc -l
2401

# watch -n1 'oc get --raw /metrics --server https://172.31.73.226:8444 | grep endpoint_depth'

The watch above shows the queue depth jump to about 2400, then decrease over time to about 300-400, then jump back to 2400 every 60 seconds. This is the same behavior we saw before this fix.
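If it helps with tracking that sawtooth over a longer window, a small variation on the watch above that logs timestamped queue-depth samples to a file. The controller address and endpoint_depth metric name are copied from the comment above; the 5s interval and output path are arbitrary:

#!/bin/bash
# Sketch: log the endpoint controller queue depth every 5s so the pattern
# described above can be graphed later.
out=/tmp/endpoint_depth.log
while true; do
    depth=$(oc get --raw /metrics --server https://172.31.73.226:8444 \
        | grep endpoint_depth | grep -v '^#' | awk '{print $NF}' | head -n1)
    echo "$(date -u +%FT%TZ) ${depth:-NA}" >> "$out"
    sleep 5
done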
It looks like this was already fixed upstream a few months ago in https://github.com/kubernetes/kubernetes/pull/47731 and https://github.com/kubernetes/kubernetes/pull/47788, both of which were backported to 3.7 in https://github.com/openshift/origin/pull/15067.

I've cherry-picked the fixes back to 3.6, and we'll work on deciding whether a merge is appropriate. So the fix is available today in 3.7 candidates.

Here's the 3.6 backport PR: https://github.com/openshift/origin/pull/16575
The problem still exists in the starter-us-east-2 and starter-us-west-2 envs:

starter-us-east-2: creating endpoints takes about 1 minute.
starter-us-west-2: creating endpoints takes about 2 minutes.

starter-us-east-2: openshift v3.6.173.0.37, kubernetes v1.6.1+5115d708d7
starter-us-west-2: openshift v3.6.173.0.5, kubernetes v1.6.1+5115d708d7
The versions of OCP currently installed on starter-us-{east,west}-2 don't appear to contain our latest fix code. We'll need at least version v3.6.173.0.48-1 to get the latest code related to this bug.
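A quick, hedged way to check whether a master has at least that build, reusing the rpm query from comment 13. The sort -V comparison below is a rough approximation, not rpm's EVR logic, and the threshold is just the version mentioned above:

#!/bin/bash
# Sketch: confirm the installed atomic-openshift-master is at least the
# 3.6.173.0.48-1 build referenced above. Version comparison via sort -V is
# approximate.
required="3.6.173.0.48-1"
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' atomic-openshift-master)
lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
    echo "OK: $installed >= $required"
else
    echo "Too old: $installed < $required"
fi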
starter-us-east-2: openshift v3.6.173.0.37, kubernetes v1.6.1+5115d708d7
starter-us-west-2: openshift v3.6.173.0.5, kubernetes v1.6.1+5115d708d7

Tried in the starter-us-west-2 env, but this env is not usable: the pod never becomes ready. The symptoms are the same as described in bug https://bugzilla.redhat.com/show_bug.cgi?id=1472624. Verification of this problem is blocked by that bug.
Hi @mifiedle, according to https://github.com/openshift/origin/issues/14710, this bug needs about 10K endpoints to reproduce. So I'm wondering if your team has this kind of env and could help test and verify this bug on the OCP (3.6.173.0.49-1) side; that would be helpful. We will also test this issue on the online side once it is deployed to the fixed version in comment 18. Thanks.
Tested this bug on v3.6.173.0.56.

Steps:
1. Create a pod and 10k services that refer to the pod.
2. Create another app and check the endpoints.

The endpoints return the pod IP value almost right away, so this bug should be fixed per my testing. Verifying this bug for now; since this bug was reported in the online env (see the title) at the beginning, we will also take a look once it is updated to the fixed version (see comment 18).

@joelsmith Could you also help do more testing for this and add cases to future OCP stress testing? If you can still reproduce the issue, please feel free to file a new bug. Thanks.
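For anyone repeating that scale test, a rough sketch of step 1. The image, service names, selector label and count below are illustrative assumptions, not the exact objects used in the verification above:

#!/bin/bash
# Hedged sketch: create one pod plus N services selecting it, then time how
# long the last service takes to get endpoints. All names/labels/counts are
# illustrative assumptions.
N=${N:-10000}
oc run scale-test --image=openshift/hello-openshift --labels=app=scale-test --port=8080
for i in $(seq 1 "$N"); do
    oc create -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: scale-svc-$i
spec:
  selector:
    app: scale-test
  ports:
  - port: 8080
    targetPort: 8080
EOF
done
# Time until the last service's endpoints are populated.
start=$(date +%s)
until [ -n "$(oc get endpoints scale-svc-$N -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)" ]; do
    sleep 2
done
echo "scale-svc-$N endpoints populated after $(( $(date +%s) - start ))s"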
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3049