1624078 – Intermittent error on OpenShift HAProxy Router reload

Summary: Intermittent error on OpenShift HAProxy Router reload

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.9.z
Assignee:	Ram Ranganathan
QA Contact:	Hongan Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1647176

Reported:	2018-08-30 20:46 UTC by Mauricio Magnani
Modified:	2022-08-04 22:20 UTC
CC List:	17 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When wildcard routes are enabled and namespace ownership checks are disabled, non-wildcard routes are removed and immediately re-added at each resync interval boundary, which causes a brief route outage. Consequence: Intermittent errors on non-wildcard routes. Fix: The router no longer removes and re-adds routes on the resync interval in the specific case where wildcard routes are enabled and namespace ownership checks are disabled. Result: Non-wildcard routes continue to serve traffic without intermittent errors.
Clone Of:
Clones:	1647176
Environment:
Last Closed:	2018-12-13 19:27:05 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift ose pull 1422	None	None	None	2020-10-07 08:05:24 UTC
Red Hat Knowledge Base (Solution)	4001431	Troubleshoot	None	Intermitent error on OpenShift HAProxy Router reload	2019-03-21 13:05:06 UTC
Red Hat Product Errata	RHBA-2018:3748	None	None	None	2018-12-13 19:27:21 UTC

Description Mauricio Magnani 2018-08-30 20:46:57 UTC

### Description of problem ###

The customer is seeing many 503 errors while running a pipeline and also during a load test.

This behavior is documented at https://docs.openshift.com/container-platform/3.9/install_config/router/default_haproxy_router.html#preventing-connection-failures-during-restarts

We applied the workaround [1], but the issue is still present.

It looks like the router is hitting this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1464657, which was fixed in erratum RHBA-2018:0489.

The current version of the router in the customer environment is 1.8.8-1.el7.

[1] - https://access.redhat.com/solutions/2775611

### Version-Release number ###

OCP 3.9.41

Comment 31 guilherme.camposo 2018-09-13 15:50:58 UTC

Hi @Ram, 

Based on our conversation yesterday, we tried to find out what was causing the monitored routes to be deleted and readmitted. We noticed that every route was readmitted, but the effect was not noticeable on all routes.
We also noticed that this happens at an interval of 10 minutes.

We believe this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1320233

As a test, we changed the resync interval to 3 minutes, and now the errors appear every 3 minutes.
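
The periodicity described above can be checked from recorded request results. The following is a minimal sketch, assuming the samples are already available as (timestamp in seconds, HTTP status) pairs sorted by time; the function names and the 30-second `quiet_gap` threshold are illustrative assumptions, not part of any OpenShift tooling.

```python
from statistics import median


def burst_starts(samples, quiet_gap=30.0):
    """Return the start time of each burst of 503 responses.

    samples: iterable of (timestamp_seconds, http_status), sorted by time.
    A new burst starts when the previous 503 is more than quiet_gap
    seconds older than the current one (quiet_gap is an assumed threshold).
    """
    starts = []
    last_503 = None
    for ts, status in samples:
        if status != 503:
            continue
        if last_503 is None or ts - last_503 > quiet_gap:
            starts.append(ts)
        last_503 = ts
    return starts


def estimate_period(samples, quiet_gap=30.0):
    """Return the median gap between burst starts, or None if fewer than two bursts."""
    starts = burst_starts(samples, quiet_gap)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return median(gaps) if gaps else None
```

If the estimated period follows the configured resync interval (about 600 seconds before the change and about 180 seconds after it), that would support the link between the 503s and the forced resync.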

Comment 38 guilherme.camposo 2018-09-17 20:26:06 UTC

Just to summarize. 

The customer is receiving 503 responses from HAProxy at an interval of 10 minutes.
We noticed this might be related to the forced synchronization of the routes.

There are two forms of forced synchronization. One can be changed by setting --resync-interval in the container command; the other is hardcoded here: https://github.com/openshift/origin/blob/71543b2d15e53f4ae56272988a6604bf2f790dfd/pkg/cmd/infra/router/template.go#L418

By changing the first form to 3 minutes, we changed the behavior so that the 503s appear every 3 minutes.

Another curious thing is that routes are frequently deleted and readmitted in HAProxy even though the routes were apparently not updated.


@Ram any updates from Ravi? We are in a very difficult situation with the customer and need at least a workaround to avoid bigger problems.

Comment 44 Ram Ranganathan 2018-09-19 23:57:04 UTC

Created a PR against master: https://github.com/openshift/origin/pull/21053

Comment 51 Andrea Cavallari 2018-09-23 23:55:11 UTC

Setting priority to high since customer is expecting to receive the patch asap.

Comment 52 Ram Ranganathan 2018-10-25 19:37:01 UTC

Associated 3.9 backport PR: https://github.com/openshift/ose/pull/1422
Associated 3.10 backport PR: https://github.com/openshift/ose/pull/1423

Waiting on merges. Apologies for the delay - the PRs were ready a while back and slipped through the cracks. 

@Brenton, can you please help. Ben doesn't have permissions to merge this into the OSE repo. Thanks a ton.

Comment 53 Dan Mace 2018-11-05 16:59:11 UTC

Hey Ram,

Just following up here since the PRs merged. Anything else left to do?

Comment 54 Ram Ranganathan 2018-11-05 20:10:09 UTC

@Dan, The work's done on this one. Am not sure on what the next step on the OSE front is ... basically what's the procedure to convert those PRs to actual router images for those 2 backported releases? And for QE to verify? 

Since Brenton's out am not sure who would be the best person to ask/see what else is needed here.

@zhaozhanqi do you know what we need to do here?  Thx

Comment 55 Ram Ranganathan 2018-11-06 19:25:15 UTC

Setting this to POST and will create a separate bugz for OSE 3.10

Comment 56 Ram Ranganathan 2018-11-06 19:46:27 UTC

Cloned bugz for backporting to OSE 3.10 is: https://bugzilla.redhat.com/show_bug.cgi?id=1647176

Comment 63 Hongan Li 2018-11-30 09:31:18 UTC

Thank you Ram.
After increasing the log level to 4 on the router, I can see the logs below on OCP v3.9.41 while the router is reloading, but they do not appear on OCP v3.9.55. So the fix has been verified.

I1130 08:08:06.486294       1 unique_host.go:211] Deleting routes for hongli/service-unsecure
I1130 08:08:06.486298       1 plugin.go:187] Deleting route hongli/service-unsecure
I1130 08:08:06.486312       1 unique_host.go:195] Route hongli/service-unsecure claims service-unsecure-hongli.apps.1130-g2b.qe.rhcloud.com
I1130 08:08:06.486322       1 status.go:245] admit: route already admitted
I1130 08:08:06.486331       1 router.go:682] Adding route hongli/service-unsecure
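
A log excerpt like the one above can be scanned for the delete/re-add pattern automatically. The following is a minimal sketch, assuming log lines in exactly the format shown here; the regular expression, the function names, and the 1-second `window` are illustrative assumptions rather than anything provided by the router.

```python
import re

# Matches lines such as:
# I1130 08:08:06.486298       1 plugin.go:187] Deleting route hongli/service-unsecure
# "Deleting routes for ..." lines are intentionally not matched.
LINE_RE = re.compile(
    r"^[IWEF]\d{4} (?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ \S+\] "
    r"(?P<action>Deleting route|Adding route) (?P<route>\S+)$"
)


def to_seconds(ts):
    """Convert HH:MM:SS.ffffff to seconds since midnight."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def find_churn(lines, window=1.0):
    """Return (route, deleted_at, added_at) for routes re-added within window seconds of deletion."""
    last_delete = {}
    churn = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        route, ts = match["route"], match["ts"]
        if match["action"] == "Deleting route":
            last_delete[route] = ts
        elif route in last_delete:
            deleted_at = last_delete.pop(route)
            if 0 <= to_seconds(ts) - to_seconds(deleted_at) <= window:
                churn.append((route, deleted_at, ts))
    return churn


sample = [
    "I1130 08:08:06.486294       1 unique_host.go:211] Deleting routes for hongli/service-unsecure",
    "I1130 08:08:06.486298       1 plugin.go:187] Deleting route hongli/service-unsecure",
    "I1130 08:08:06.486312       1 unique_host.go:195] Route hongli/service-unsecure claims service-unsecure-hongli.apps.1130-g2b.qe.rhcloud.com",
    "I1130 08:08:06.486322       1 status.go:245] admit: route already admitted",
    "I1130 08:08:06.486331       1 router.go:682] Adding route hongli/service-unsecure",
]
print(find_churn(sample))
```

An empty result on a router log captured at log level 4 across a resync boundary would be consistent with the fixed behavior seen on v3.9.55.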

BTW, I tried using curl and ab but unfortunately could not reproduce the 503 error during router reload.
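
Because the outage reported earlier in this bug is brief and tied to the resync interval, a short curl or ab run may simply miss it. The following is a minimal sketch of a continuous probe that records the status of every request over a longer period; the function names, the default interval, and the URL in the usage comment are hypothetical, and the fetch, clock, and sleep functions are injectable only so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on url, or None on a connection-level failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, including 503.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def probe(url, duration, interval=0.2, fetch=fetch_status,
          clock=time.time, sleep=time.sleep):
    """Poll url every interval seconds for duration seconds.

    Returns a list of (timestamp, status) pairs in request order.
    """
    samples = []
    deadline = clock() + duration
    while clock() < deadline:
        samples.append((clock(), fetch(url)))
        sleep(interval)
    return samples


# Hypothetical usage: probe for longer than one resync interval, then list the 503s.
# samples = probe("http://route.example.com/", duration=900)
# errors = [(ts, status) for ts, status in samples if status == 503]
```

Running the probe for longer than one full resync interval gives each resync boundary a chance to fall inside the sampling window.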

Comment 65 errata-xmlrpc 2018-12-13 19:27:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748