Description of problem:
When the pod and svc are running, the endpoints are always delayed in getting their value; it seems to take about 2 minutes.

Version-Release number of selected component (if applicable):
starter-us-east-2 3.6.173.0.5

How reproducible:
Always

Steps to Reproduce:
1. Create the pod/svc and make sure the pods are running:
   oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
2. oc get endpoints
3. Wait about 2 minutes
4. oc get endpoints

Actual results:
Step 2 does not return any endpoint addresses.
Step 4:
oc get endpoints
NAME           ENDPOINTS                             AGE
test-service   10.128.35.4:8080,10.130.28.46:8080    13m

Expected results:
The endpoints should get the correct value promptly.

Additional info:
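A minimal timing sketch along the lines of the steps above, in case it helps anyone re-check the delay. The test-service name comes from the actual results above; the polling interval and 600s timeout are arbitrary assumptions:

#!/bin/bash
# Rough sketch: measure how long the service's endpoints take to be populated
# after creating the template above. Service name follows the output above;
# the 600s timeout and 2s poll interval are assumptions.
set -e
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
start=$(date +%s)
while true; do
    addrs=$(oc get endpoints test-service \
        -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null || true)
    if [ -n "$addrs" ]; then
        echo "endpoints populated after $(( $(date +%s) - start ))s: $addrs"
        break
    fi
    if [ $(( $(date +%s) - start )) -ge 600 ]; then
        echo "gave up after 600s with empty endpoints" >&2
        exit 1
    fi
    sleep 2
done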
Hi folks, do we have an ETA on this?
Any update on this?
Adding email conversation for visibility.

Clayton Coleman (21:04 UTC):
Quick update - endpoints controller was hopelessly backlogged, so we bumped the qps to see if it clears the queue. Real fix will be coming in 3.7 but not sure on backport. Size of cluster related.

Clayton Coleman (23:40 UTC):
Looks like the config loop reset this an hour ago so we're back to the lower rate (but we hadn't burned through the backlog yet, so it's not the end of the world). We may need to reapply the settings Eric had in place.
I set qps=800 burst=1200 in the config loop for starter-us-east-1 ONLY. It is rolling out, and we will see what comes of this change.
After the change we are doing about 24 PUTs to endpoints per second, versus 9 previously.
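For reference, a rough way to sample that PUT rate from the API server's /metrics endpoint. The apiserver_request_count metric name and label set are assumptions for a Kubernetes 1.6-era apiserver, not taken from this bug:

#!/bin/bash
# Hedged sketch: estimate PUTs/sec against the endpoints resource by sampling
# the (assumed) apiserver_request_count counter twice, 60s apart. Metric and
# label names may differ on other versions.
sample() {
    oc get --raw /metrics \
        | awk '/^apiserver_request_count/ && /resource="endpoints"/ && /verb="PUT"/ {sum += $NF} END {print sum+0}'
}
first=$(sample)
sleep 60
second=$(sample)
echo "endpoints PUT rate: $(( (second - first) / 60 )) req/sec"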
On starter-us-east-2 (where openshift.io lives), it appears this workaround brings endpoint updates down to ~30-45 seconds.
We have made a code change that we hope will alleviate this issue by reducing the number of endpoint updates occurring on the cluster.

The upstream Kubernetes PR is merged: https://github.com/kubernetes/kubernetes/pull/50934

We're still working on an Origin/master PR: https://github.com/openshift/origin/pull/15888
and an Origin PR for 3.6: https://github.com/openshift/origin/pull/15889

We'll re-assess once the new code has been deployed to the starter clusters.
When are we expecting this new code to be deployed to the starter clusters?
If nothing changes, we'll get it when the starter clusters go to 3.7. In the meantime, we're building our confidence in the fix through all the testing of the 3.7 branch and keeping an eye on how well the queue-depth workaround is holding up on the starter clusters. We haven't made a firm decision yet on whether to try to push the code change into the 3.6 branch and roll an update out prior to 3.7. It will probably come down to how well the workaround is holding and our confidence in the fix. In other words, we're balancing the severity of the problem against the risk of pushing out the fix.

Our recent testing showed that endpoints were generally updated within 30 seconds. A test that we just ran got updates in about 12 seconds. If this isn't representative of your experience, please let us know.
Which OCP version will contain the fix? Could you help set the Target Release? Thanks.
v3.7.0-0.106.0 and v3.6.173.0.35-1
In starter-us-east-2:

# oc version
oc v3.6.173.0.37

# rpm -q atomic-openshift-master
atomic-openshift-master-3.6.173.0.37-1.git.0.fd828e7.el7.x86_64

# oc get endpoints --all-namespaces | wc -l
2401

# watch -n1 'oc get --raw /metrics --server https://172.31.73.226:8444 | grep endpoint_depth'

The watch above shows the queue depth jump to about 2400, then decrease over time to about 300-400, then jump back to 2400 every 60 seconds. This is the same behavior we saw before this fix.
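If it helps with tracking that sawtooth over a longer window, a small variation on the watch above that logs timestamped queue-depth samples to a file. The controller address and endpoint_depth metric name are copied from the comment above; the 5s interval and output path are arbitrary:

#!/bin/bash
# Sketch: log the endpoint controller queue depth every 5s so the pattern
# described above can be graphed later.
out=/tmp/endpoint_depth.log
while true; do
    depth=$(oc get --raw /metrics --server https://172.31.73.226:8444 \
        | grep endpoint_depth | grep -v '^#' | awk '{print $NF}' | head -n1)
    echo "$(date -u +%FT%TZ) ${depth:-NA}" >> "$out"
    sleep 5
done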
It looks like this was already fixed upstream a few months ago in https://github.com/kubernetes/kubernetes/pull/47731 and https://github.com/kubernetes/kubernetes/pull/47788, both of which were backported to 3.7 in https://github.com/openshift/origin/pull/15067.

I've cherry-picked the fixes back to 3.6, and we'll work on deciding whether a merge is appropriate. So the fix is available today in 3.7 candidates.

Here's the 3.6 backport PR: https://github.com/openshift/origin/pull/16575
The problem still exists in the starter-us-east-2 and starter-us-west-2 envs:

starter-us-east-2: creating endpoints takes about 1 minute.
starter-us-west-2: creating endpoints takes about 2 minutes.

starter-us-east-2: openshift v3.6.173.0.37, kubernetes v1.6.1+5115d708d7
starter-us-west-2: openshift v3.6.173.0.5, kubernetes v1.6.1+5115d708d7
The versions of OCP currently installed on starter-us-{east,west}-2 don't appear to contain our latest fix code. We'll need at least version v3.6.173.0.48-1 to get the latest code related to this bug.
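A quick, hedged way to check whether a master has at least that build, reusing the rpm query from comment 13. The sort -V comparison below is a rough approximation, not rpm's EVR logic, and the threshold is just the version mentioned above:

#!/bin/bash
# Sketch: confirm the installed atomic-openshift-master is at least the
# 3.6.173.0.48-1 build referenced above. Version comparison via sort -V is
# approximate.
required="3.6.173.0.48-1"
installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' atomic-openshift-master)
lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
    echo "OK: $installed >= $required"
else
    echo "Too old: $installed < $required"
fi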
starter-us-east-2: openshift v3.6.173.0.37, kubernetes v1.6.1+5115d708d7
starter-us-west-2: openshift v3.6.173.0.5, kubernetes v1.6.1+5115d708d7

Tried in the starter-us-west-2 env, but this env is not usable: the pod never becomes ready. The symptoms are the same as described in bug https://bugzilla.redhat.com/show_bug.cgi?id=1472624. Verification of this problem is blocked by that bug.
Hi @mifiedle, according to https://github.com/openshift/origin/issues/14710, this bug needs about 10K endpoints to reproduce. So I'm wondering if your team has this kind of env and could help test and verify this bug on the OCP (3.6.173.0.49-1) side; that would be helpful. We will also test this issue on the online side once it is deployed to the fixed version in comment 18. Thanks.
Tested this bug on v3.6.173.0.56.

Steps:
1. Create a pod and 10k services that refer to the pod.
2. Create another app and check the endpoints.

The endpoints return the pod IP value almost right away, so this bug should be fixed per my testing. Verifying this bug for now; since this bug was reported in the online env (see the title) at the beginning, we will also take a look once it is updated to the fixed version (see comment 18).

@joelsmith Could you also help do more testing for this and add cases to future OCP stress testing? If you can still reproduce the issue, please feel free to file a new bug. Thanks.
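For anyone repeating that scale test, a rough sketch of step 1. The image, service names, selector label and count below are illustrative assumptions, not the exact objects used in the verification above:

#!/bin/bash
# Hedged sketch: create one pod plus N services selecting it, then time how
# long the last service takes to get endpoints. All names/labels/counts are
# illustrative assumptions.
N=${N:-10000}
oc run scale-test --image=openshift/hello-openshift --labels=app=scale-test --port=8080
for i in $(seq 1 "$N"); do
    oc create -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: scale-svc-$i
spec:
  selector:
    app: scale-test
  ports:
  - port: 8080
    targetPort: 8080
EOF
done
# Time until the last service's endpoints are populated.
start=$(date +%s)
until [ -n "$(oc get endpoints scale-svc-$N -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)" ]; do
    sleep 2
done
echo "scale-svc-$N endpoints populated after $(( $(date +%s) - start ))s"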
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3049