Description of problem:

We have a number of ingresscontrollers set up to handle different routes:

NAME        AGE
crcshard-0  159d
crcshard-1  159d
crcshard-2  159d
crcshard-3  159d
crcshard-4  159d
crcshard-5  159d
default     286d
public      138d

Each of these has a different routeSelector, but no namespaceSelector. Each of the routes in our clusters matches one, and only one, of these routeSelectors (handled by a custom webhook/operator).

What we are seeing is a constant reload every 5 seconds (which appears to be the minimum interval):

I1208 12:10:16.145539       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1208 12:10:21.144016       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1208 12:10:26.131853       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1208 12:10:31.186205       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1208 12:10:36.145065       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1208 12:10:41.167243       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1208 12:10:46.134906       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1208 12:10:51.165807       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"

What's more, all of the router controllers do this in unison, even though most of their endpoints do not change in that interval (though likely one or two do).
It appears that `registerInformerEventHandlers()` and `HandleEndpoints()`, in cases where namespaceSelector is nil, cause commitAndReload() to be hit on all routers even when no changes were made in that cycle. We have confirmed that the router configs are identical before and after reloads in most cases, yet the reloads (and config rewrites) keep coming.

In addition, because of the very frequent reloading, the `balance` configuration of each route becomes biased. Regardless of whether `leastconn` or `roundrobin` is used, the `chash` tree within haproxy appears to get reset on reload, which in turn makes the router severely favor the first pod in the configuration.

Version-Release number of selected component (if applicable):
OpenShift 4.5.16

How reproducible:
Always

Steps to Reproduce:
1. Create several router shards
2. Observe all shards being reloaded every 5s even when endpoints do not change

Actual results:
All shards are reloaded

Expected results:
Shards are only reloaded when an endpoint changes

Additional info:
There is an existing BZ covering broadly the same issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1839989

As this is really an enhancement to the current design, it is now captured in the following RFE:
https://issues.redhat.com/browse/NE-391
Re-opening this BZ. That RFE refers to an old v3.11 BZ and does not address the issue the customer is experiencing. They have 6 router shards. When a single endpoint changes (create/delete/migrate), *ALL* of the routers reload, not just the router serving that endpoint. This isn't an haproxy issue; it's a k8s issue. The customer has dug into the code and this is what they have to say:

"This [issue] is a result of the "Kind: endpoints/endpointslice" changing in k8s, not haproxy noticing dead backends."
(In reply to Dan Yocum from comment #0)
> Description of problem:
> We have a number of ingresscontrollers setup to handle different routes:
>
> NAME AGE
> crcshard-0 159d
> crcshard-1 159d
> crcshard-2 159d
> crcshard-3 159d
> crcshard-4 159d
> crcshard-5 159d
> default 286d
> public 138d
>
> Each of these has different routeSelector, but no namespaceSelector. Each
> of the routes in our clusters match one, and only one of these
> routeSelectors (handled by custom webhook/operator).

Could you please attach the YAML output for all of these ingresscontrollers:

$ oc get ingresscontrollers --all-namespaces -o yaml
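For reference, a sharded ingresscontroller with only a routeSelector (as described above) would look roughly like this; the domain and label values are illustrative, not taken from the customer's cluster:

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: crcshard-0
  namespace: openshift-ingress-operator
spec:
  domain: crcshard-0.apps.example.com   # illustrative domain
  routeSelector:
    matchLabels:
      shard: crcshard-0                 # illustrative label
  # note: no namespaceSelector is set, matching the reported setup
```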
Tested with 4.7.0-0.nightly-2021-01-22-134922 and passed.

1. Create route shards with two more custom ingresscontrollers, one using a namespace label and one using a route label:

spec:
  namespaceSelector:
    matchLabels:
      namespace: router-test

spec:
  routeSelector:
    matchLabels:
      route: router-test

2. Create three projects with pods, services and routes; ns1 is labelled "namespace=router-test", and route2 in ns2 is labelled "route=router-test".
3. Scale the pods in ns3 up/down to make endpoints change: no reload in either labelled router pod.
4. Scale the pods in ns2 up/down: no reload in the router pod with the namespace label.
5. Scale the pods in ns1 up/down: no reload in the router pod with the route label.

logs:

$ oc -n openshift-ingress logs router-nslabel-6b9c5d77b-l25mj | tail -n2
I0125 09:15:24.735305       1 router.go:578] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0125 09:25:34.356529       1 router.go:578] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"

$ oc -n openshift-ingress logs router-routelabel-ff4dfdd4-bfbtg | tail -n2
I0125 09:17:10.457487       1 router.go:578] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0125 09:23:26.001079       1 router.go:578] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633