### Description of problem ###
Customer is seeing many 503 errors while running a pipeline and also during a load test. This behavior is documented at https://docs.openshift.com/container-platform/3.9/install_config/router/default_haproxy_router.html#preventing-connection-failures-during-restarts

We applied the workaround [1], but the issue is still present. It looks like the router is hitting this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1464657, which was fixed in errata "RHBA-2018:0489". The current version of the router in the customer environment is 1.8.8-1.el7.

[1] - https://access.redhat.com/solutions/2775611

### Version-Release number ###
OCP 3.9.41
Hi @Ram, based on our conversation yesterday, we tried to find out what was causing the monitored routes to be deleted and readmitted. What we noticed was that every route was readmitted, but the effect is not noticeable on all routes. We also noticed that this happens at 10-minute intervals. We believe it is related to this: https://bugzilla.redhat.com/show_bug.cgi?id=1320233 As a test, we changed the resync interval to 3 minutes, and now the errors appear every 3 minutes.
Just to summarize: the customer is receiving 503 responses from HAProxy at 10-minute intervals. We noticed this might be related to the forced synchronization of the routes. There are two forms of forced synchronization. One can be changed by setting --resync-interval in the container command; the other is hardcoded here: https://github.com/openshift/origin/blob/71543b2d15e53f4ae56272988a6604bf2f790dfd/pkg/cmd/infra/router/template.go#L418

By changing the first form to 3 minutes, we changed the behavior so that the 503s appear every 3 minutes. Another curious thing is that routes are frequently being deleted and readmitted in HAProxy, even though the routes were apparently not updated.

@Ram any updates from Ravi? We are in a very difficult situation with the customer; we need at least a workaround to avoid bigger problems.
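The periodicity described above (503s recurring at whatever the resync interval is set to) can be checked from a list of error timestamps. The following is only a rough sketch, not part of the router or of any Red Hat tooling: the function names, the 30-second burst gap, and the 10-second rounding bucket are all assumptions chosen for illustration, and the timestamps are made-up sample data, not measurements from the customer environment.

```python
from collections import Counter


def burst_starts(timestamps, gap=30.0):
    """Collapse 503 timestamps (in seconds) into burst start times.

    A new burst begins when more than `gap` seconds have passed since
    the previous error. The default gap is an arbitrary assumption.
    """
    starts = []
    prev = None
    for t in sorted(timestamps):
        if prev is None or t - prev > gap:
            starts.append(t)
        prev = t
    return starts


def dominant_interval(timestamps, gap=30.0, bucket=10.0):
    """Return the most common spacing between bursts, rounded to
    the nearest `bucket` seconds, or None if there are fewer than
    two bursts."""
    starts = burst_starts(timestamps, gap)
    deltas = [b - a for a, b in zip(starts, starts[1:])]
    if not deltas:
        return None
    rounded = [round(d / bucket) * bucket for d in deltas]
    return Counter(rounded).most_common(1)[0][0]


# Hypothetical sample: bursts of 503s roughly every 600 seconds.
sample = [0, 1, 2, 600, 601, 1201, 1203, 1800]
print(dominant_interval(sample))  # prints 600.0
```

If the dominant interval follows the value passed to --resync-interval (for example, moving from about 600 seconds to about 180 seconds after the change to 3 minutes), that supports the link between the forced resync and the 503s.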
Created a PR against master: https://github.com/openshift/origin/pull/21053
Setting priority to high since customer is expecting to receive the patch asap.
Associated 3.9 backport PR: https://github.com/openshift/ose/pull/1422
Associated 3.10 backport PR: https://github.com/openshift/ose/pull/1423

Waiting on merges. Apologies for the delay - the PRs were ready a while back and slipped through the cracks.

@Brenton, can you please help. Ben doesn't have permissions to merge this into the OSE repo. Thanks a ton.
Hey Ram, Just following up here since the PRs merged. Anything else left to do?
@Dan, the work's done on this one. I'm not sure what the next step on the OSE front is ... basically, what's the procedure to turn those PRs into actual router images for the 2 backported releases, and for QE to verify them? Since Brenton's out, I'm not sure who would be the best person to ask to see what else is needed here. @zhaozhanqi do you know what we need to do here? Thx
Setting this to POST and will create a separate bugz for OSE 3.10
Cloned bugz for backporting to OSE 3.10 is: https://bugzilla.redhat.com/show_bug.cgi?id=1647176
Thank you Ram. After increasing the log level to 4 on the router, I can see the logs below on OCP v3.9.41 while the router is reloading, but they do not appear on OCP v3.9.55. So the issue has been verified.

I1130 08:08:06.486294 1 unique_host.go:211] Deleting routes for hongli/service-unsecure
I1130 08:08:06.486298 1 plugin.go:187] Deleting route hongli/service-unsecure
I1130 08:08:06.486312 1 unique_host.go:195] Route hongli/service-unsecure claims service-unsecure-hongli.apps.1130-g2b.qe.rhcloud.com
I1130 08:08:06.486322 1 status.go:245] admit: route already admitted
I1130 08:08:06.486331 1 router.go:682] Adding route hongli/service-unsecure

BTW, I tried using curl/ab but unfortunately didn't get the 503 error during router reload.
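For anyone repeating this check on a larger log, the delete-then-re-add pattern in the excerpt above can be spotted with a small script. This is a minimal sketch under stated assumptions: the two regular expressions are inferred only from the five lines quoted in this comment, not from any documented router log format, and the function name is hypothetical.

```python
import re

# Patterns inferred from the quoted log lines only (an assumption).
# "Deleting routes for ..." is intentionally not matched, because the
# pattern requires a space directly after "route".
DELETE_RE = re.compile(r"\] Deleting route (\S+)$")
ADD_RE = re.compile(r"\] Adding route (\S+)$")


def readmitted_routes(lines):
    """Return routes that were deleted and later added again,
    in the order in which they were first re-added."""
    deleted = set()
    readmitted = []
    for line in lines:
        line = line.strip()
        m = DELETE_RE.search(line)
        if m:
            deleted.add(m.group(1))
            continue
        m = ADD_RE.search(line)
        if m and m.group(1) in deleted and m.group(1) not in readmitted:
            readmitted.append(m.group(1))
    return readmitted


# The log lines quoted in this comment.
SAMPLE = """\
I1130 08:08:06.486294 1 unique_host.go:211] Deleting routes for hongli/service-unsecure
I1130 08:08:06.486298 1 plugin.go:187] Deleting route hongli/service-unsecure
I1130 08:08:06.486312 1 unique_host.go:195] Route hongli/service-unsecure claims service-unsecure-hongli.apps.1130-g2b.qe.rhcloud.com
I1130 08:08:06.486322 1 status.go:245] admit: route already admitted
I1130 08:08:06.486331 1 router.go:682] Adding route hongli/service-unsecure
""".splitlines()

print(readmitted_routes(SAMPLE))  # prints ['hongli/service-unsecure']
```

On a fixed router (v3.9.55 in the verification above), the same script would be expected to return an empty list for a reload window, since the delete and re-add lines no longer appear.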
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3748