Bug 1595513 - Router keeps restarting due to large number of routes. [NEEDINFO]
Summary: Router keeps restarting due to large number of routes.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Ivan Chavero
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-27 03:47 UTC by sfu@redhat.com
Modified: 2018-08-01 17:43 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-01 17:43:26 UTC
Target Upstream Version:
ichavero: needinfo? (sfu)


Description sfu@redhat.com 2018-06-27 03:47:32 UTC
Description of problem:
When the router (haproxy) performs a reload while it holds a large number of routes (about 10000), it cannot pass the health check due to low performance, so the router pod is killed continuously.


Version-Release number of selected component (if applicable):
ocp 3.9.14
router 3.9.14

How reproducible:
always

Steps to Reproduce:
1. Create a router.
2. Scale up to 3 router pods.
3. Create 10000+ routes.
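
For step 3, one possible way to produce that many routes is to generate a single List manifest and load it with `oc create -f`. The following is only a sketch: the service name `hello`, the domain `apps.example.com`, and the `route-NNNNN` naming scheme are hypothetical placeholders, not values from the affected cluster.

```python
import json


def build_routes(count, service="hello", domain="apps.example.com"):
    """Return a List manifest holding `count` Route objects.

    `service` and `domain` are hypothetical placeholders; replace them
    with a service and wildcard domain that exist in the test cluster.
    """
    items = []
    for i in range(count):
        name = f"route-{i:05d}"
        items.append({
            "apiVersion": "route.openshift.io/v1",
            "kind": "Route",
            "metadata": {"name": name},
            "spec": {
                # Each route gets a unique host under the placeholder domain.
                "host": f"{name}.{domain}",
                "to": {"kind": "Service", "name": service},
            },
        })
    return {"apiVersion": "v1", "kind": "List", "items": items}


def write_routes(path, count):
    """Write the generated manifest to `path` as JSON."""
    with open(path, "w") as fh:
        json.dump(build_routes(count), fh)
```

Calling `write_routes("routes.json", 10000)` and then running `oc create -f routes.json` in the target project should create the routes in one pass.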

Actual results:
The router pods keep restarting.

Expected results:
The router pods keep running normally.

Additional info:
When the haproxy backend check interval is increased to 300s, the problem can be avoided.

Comment 1 Ben Bennett 2018-06-27 13:29:59 UTC
Can we get details on the VM/machine they are running the router on?

However, it sounds like they have identified a work-around for the time being.  Changes that are going into 3.11 may help with this situation.

Comment 2 sfu@redhat.com 2018-06-28 01:43:48 UTC
The hardware of the infra node that runs the router is: 8 cores, 32 GB RAM.

Please feel free to let me know what additional info you need. Thanks.

Comment 3 sfu@redhat.com 2018-07-05 08:31:08 UTC
(In reply to Ben Bennett from comment #1)
> Can we get details on the vm/machine they are running the router on.
> 
> However, it sounds like they have identified a work-around for the time
> being.  Changes that are going into 3.11 may help with this situation.

Thanks Bennett for your reply.

The root cause of this issue is probably that haproxy is very slow at reloading a large number of routes, so a health check cannot complete within the default readiness and liveness detection cycle. This requires an improvement to fix. At present, temporarily increasing the number of health checks and the detection interval only works around the problem.
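
To make the reasoning above concrete, here is a rough sketch of how the probe settings bound the time a router can stay unresponsive during a reload. The numbers are assumptions: 10s period, 3 failures, and 1s timeout are the Kubernetes probe defaults, not values read from this router deployment, and the 30s reload time is a made-up figure for illustration. The formula is an approximation of the probe behaviour, not an exact model.

```python
def probe_window_seconds(period_seconds, failure_threshold, timeout_seconds=1):
    """Approximate time from the first failing probe until the last
    allowed probe has timed out and the container gets restarted."""
    return (failure_threshold - 1) * period_seconds + timeout_seconds


def survives_reload(reload_seconds, period_seconds, failure_threshold,
                    timeout_seconds=1):
    """Rough check: an unresponsive reload shorter than the probe window
    leaves room for at least one probe to succeed."""
    window = probe_window_seconds(period_seconds, failure_threshold,
                                  timeout_seconds)
    return reload_seconds < window


# Kubernetes probe defaults (assumed, not taken from this cluster).
default_window = probe_window_seconds(10, 3, 1)

# Hypothetical 30s reload: fails with 3 allowed failures, passes with 5.
killed_with_defaults = not survives_reload(30, 10, 3, 1)
ok_with_more_checks = survives_reload(30, 10, 5, 1)
```

Under these assumptions, raising either the failure threshold or the period widens the window, which matches the observation that both changes only hide the slow reload rather than fix it.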

Comment 18 Ivan Chavero 2018-08-01 17:43:26 UTC
I'm closing this bug, feel free to reopen it if the problem persists.

