Bug 1624078 - Intermittent error on OpenShift HAProxy Router reload
Summary: Intermittent error on OpenShift HAProxy Router reload
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Ram Ranganathan
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1647176
Reported: 2018-08-30 20:46 UTC by Mauricio Magnani
Modified: 2022-08-04 22:20 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When wildcard routes are enabled and namespace ownership checks are disabled, non-wildcard routes are removed and immediately re-added at each resync interval boundary. This causes a brief route outage.
Consequence: Intermittent errors on non-wildcard routes.
Fix: The router no longer removes and re-adds routes on the resync interval when wildcard routes are enabled and namespace ownership checks are disabled.
Result: Non-wildcard routes continue to serve without intermittent errors.
Clone Of:
: 1647176
Environment:
Last Closed: 2018-12-13 19:27:05 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ose pull 1422 | 0 | None | None | None | 2020-10-07 08:05:24 UTC
Red Hat Knowledge Base (Solution) | 4001431 | 0 | Troubleshoot | None | Intermitent error on OpenShift HAProxy Router reload | 2019-03-21 13:05:06 UTC
Red Hat Product Errata | RHBA-2018:3748 | 0 | None | None | None | 2018-12-13 19:27:21 UTC

Description Mauricio Magnani 2018-08-30 20:46:57 UTC
### Description of problem ###

Customer is facing a lot of 503 errors while running a pipeline and also during a load test.

This is documented at https://docs.openshift.com/container-platform/3.9/install_config/router/default_haproxy_router.html#preventing-connection-failures-during-restarts

We tried to apply the workaround [1] and even then the issue is still present.

Looks like the "router" is hitting this bug - https://bugzilla.redhat.com/show_bug.cgi?id=1464657 which has been fixed in errata "RHBA-2018:0489".

The current version of the router in the client environment is 1.8.8-1.el7.

[1] - https://access.redhat.com/solutions/2775611

### Version-Release number ###

OCP 3.9.41

Comment 31 guilherme.camposo 2018-09-13 15:50:58 UTC
Hi @Ram, 

Based on our conversation yesterday, we tried to monitor what was causing the monitored routes to be deleted and readmitted. We noticed that every route was readmitted, but the effect is not perceptible on all routes.
We also noticed that this happens at 10-minute intervals.

We believe this is related to: https://bugzilla.redhat.com/show_bug.cgi?id=1320233

As a test, we changed the resync interval to 3 minutes, and now the errors appear every 3 minutes.

Comment 38 guilherme.camposo 2018-09-17 20:26:06 UTC
Just to summarize. 

Customer is receiving 503 responses from HAProxy at 10-minute intervals.
We noticed this might be related to the forced synchronization of the routes.

There are two forms of forced synchronization. One can be changed by setting --resync-interval in the container command; the other is hardcoded here: https://github.com/openshift/origin/blob/71543b2d15e53f4ae56272988a6604bf2f790dfd/pkg/cmd/infra/router/template.go#L418

By changing the first to 3 minutes, we changed the behavior to 503s every 3 minutes.

Another curious thing is that routes are frequently being deleted and readmitted in HAProxy, even though the routes supposedly weren't updated.
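
The delete-and-readmit pattern on unchanged routes can be illustrated with a minimal Go sketch. Everything in it is an assumption made for illustration: `Route`, `Admitter`, `HandleNaive`, and `HandleIdempotent` are hypothetical names, and this is neither the router's actual code nor the change made in the linked PRs. It only contrasts a handler that removes and re-adds a route on every resync event with one that skips the event when nothing changed.

```go
package main

import "fmt"

// Route is a hypothetical, simplified stand-in for an OpenShift route.
type Route struct {
	Namespace, Name, Host string
}

func (r Route) key() string { return r.Namespace + "/" + r.Name }

// Admitter is a hypothetical table of admitted routes. Ops records the
// backend operations it would issue, so the two strategies can be compared.
type Admitter struct {
	claims map[string]Route
	Ops    []string
}

func NewAdmitter() *Admitter { return &Admitter{claims: map[string]Route{}} }

// HandleNaive removes and re-adds a known route on every event,
// including resync events that carry no change.
func (a *Admitter) HandleNaive(r Route) {
	if _, ok := a.claims[r.key()]; ok {
		a.Ops = append(a.Ops, "delete "+r.key())
	}
	a.claims[r.key()] = r
	a.Ops = append(a.Ops, "add "+r.key())
}

// HandleIdempotent ignores events for routes that are already admitted
// with identical contents, so a resync causes no backend churn.
func (a *Admitter) HandleIdempotent(r Route) {
	old, ok := a.claims[r.key()]
	if ok && old == r {
		return // unchanged: nothing to delete or add
	}
	if ok {
		a.Ops = append(a.Ops, "delete "+r.key())
	}
	a.claims[r.key()] = r
	a.Ops = append(a.Ops, "add "+r.key())
}

func main() {
	r := Route{Namespace: "demo", Name: "web", Host: "web-demo.apps.example.com"}
	naive, idem := NewAdmitter(), NewAdmitter()
	for i := 0; i < 3; i++ { // initial add followed by two resync events
		naive.HandleNaive(r)
		idem.HandleIdempotent(r)
	}
	fmt.Println("naive ops:", naive.Ops)
	fmt.Println("idempotent ops:", idem.Ops)
}
```

In the naive variant each resync produces a window in which the route is absent from the backend, which is consistent with the brief outages reported here.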


@Ram any updates from Ravi? We are in a very difficult situation with the customer, and we need at least a workaround to avoid bigger problems.

Comment 44 Ram Ranganathan 2018-09-19 23:57:04 UTC
Created a PR against master: https://github.com/openshift/origin/pull/21053

Comment 51 Andrea Cavallari 2018-09-23 23:55:11 UTC
Setting priority to high since customer is expecting to receive the patch asap.

Comment 52 Ram Ranganathan 2018-10-25 19:37:01 UTC
Associated 3.9 backport PR: https://github.com/openshift/ose/pull/1422
Associated 3.10 backport PR: https://github.com/openshift/ose/pull/1423

Waiting on merges. Apologies for the delay - the PRs were ready a while back and slipped through the cracks. 

@Brenton, can you please help? Ben doesn't have permissions to merge this into the OSE repo. Thanks a ton.

Comment 53 Dan Mace 2018-11-05 16:59:11 UTC
Hey Ram,

Just following up here since the PRs merged. Anything else left to do?

Comment 54 Ram Ranganathan 2018-11-05 20:10:09 UTC
@Dan, The work's done on this one. Am not sure on what the next step on the OSE front is ... basically what's the procedure to convert those PRs to actual router images for those 2 backported releases? And for QE to verify? 

Since Brenton's out am not sure who would be the best person to ask/see what else is needed here.

@zhaozhanqi do you know what we need to do here?  Thx

Comment 55 Ram Ranganathan 2018-11-06 19:25:15 UTC
Setting this to POST and will create a separate bugz for OSE 3.10

Comment 56 Ram Ranganathan 2018-11-06 19:46:27 UTC
Cloned bugz for backporting to OSE 3.10 is: https://bugzilla.redhat.com/show_bug.cgi?id=1647176

Comment 63 Hongan Li 2018-11-30 09:31:18 UTC
Thank you Ram.
After increasing the log level to 4 on the router, I can see the logs below on OCP v3.9.41 while the router is reloading, but they do not appear on OCP v3.9.55. So the fix has been verified.

I1130 08:08:06.486294       1 unique_host.go:211] Deleting routes for hongli/service-unsecure
I1130 08:08:06.486298       1 plugin.go:187] Deleting route hongli/service-unsecure
I1130 08:08:06.486312       1 unique_host.go:195] Route hongli/service-unsecure claims service-unsecure-hongli.apps.1130-g2b.qe.rhcloud.com
I1130 08:08:06.486322       1 status.go:245] admit: route already admitted
I1130 08:08:06.486331       1 router.go:682] Adding route hongli/service-unsecure
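
A check like this can be automated by scanning router logs for a route that is deleted and then added again. The Go sketch below is an illustration only: `Churned` is a hypothetical helper, and the two regular expressions assume the "Deleting route" and "Adding route" message formats shown in the log excerpt above, which may differ in other router versions.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var (
	// Match only the per-route messages, not "Deleting routes for ...".
	delRe = regexp.MustCompile(`\] Deleting route (\S+)$`)
	addRe = regexp.MustCompile(`\] Adding route (\S+)$`)
)

// Churned returns the route keys that were deleted and later re-added
// within the given log text, in the order the re-adds were seen.
func Churned(log string) []string {
	deleted := map[string]bool{}
	var out []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if m := delRe.FindStringSubmatch(line); m != nil {
			deleted[m[1]] = true
		} else if m := addRe.FindStringSubmatch(line); m != nil && deleted[m[1]] {
			out = append(out, m[1])
			delete(deleted, m[1])
		}
	}
	return out
}

func main() {
	// Sample input copied from the log excerpt in this comment.
	sample := `I1130 08:08:06.486294       1 unique_host.go:211] Deleting routes for hongli/service-unsecure
I1130 08:08:06.486298       1 plugin.go:187] Deleting route hongli/service-unsecure
I1130 08:08:06.486312       1 unique_host.go:195] Route hongli/service-unsecure claims service-unsecure-hongli.apps.1130-g2b.qe.rhcloud.com
I1130 08:08:06.486322       1 status.go:245] admit: route already admitted
I1130 08:08:06.486331       1 router.go:682] Adding route hongli/service-unsecure`
	fmt.Println("routes deleted and re-added:", Churned(sample))
}
```

An empty result on a router at log level 4 over at least one resync interval would match the behavior reported here for v3.9.55.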

BTW, I tried using curl/ab but unfortunately did not get the 503 error during router reload.

Comment 65 errata-xmlrpc 2018-12-13 19:27:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748

