Description of problem:
Getting 503 responses from routes. This occurs on pods that have been running for some time and begins seemingly at random. We set the ELB to test against only one router pod at a time; the issue consistently affects 5 of 7 router pods, while 2 of 7 router pods work fine 100% of the time.

Version-Release number of selected component (if applicable):
v3.4.1.44

How reproducible:
Unconfirmed

Additional info:
Router logs do not indicate that this is the known deadlock/router panic bugs. In case it was the known ARP caching issue, we flushed with:

# ip neigh flush dev tun0

but the issue remained.
Steps taken:

- Verified that the route exists in all router pods:

sh-4.2$ cat haproxy.config | grep dcs-dt
backend be_tcp_elis-dt_dcs-dt

- rsh'd into the router pods and curl'd the service IP for that route's service:

$ oc rsh router-4-7syw0
sh-4.2$ curl -kv dcs-dt.elis-dt.svc.cluster.local:8401
* About to connect() to dcs-dt.elis-dt.svc.cluster.local port 8401 (#0)
*   Trying 172.30.82.247...
^C
sh-4.2$ curl -kv 172.30.82.247:8401
* About to connect() to 172.30.82.247 port 8401 (#0)
*   Trying 172.30.82.247...
^C

Both attempts hang. When running from a "good" router pod, the curls work as expected.
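Since the curl to the service IP hangs from the bad router pods but works from the good ones, it may also be worth comparing what the bad node sees for that service. A minimal sketch of extra checks (service/namespace names taken from above; the node-level step assumes root access on the node hosting the bad router pod):

$ oc get endpoints dcs-dt -n elis-dt -o yaml       # confirm the service has endpoints
$ oc exec router-4-7syw0 -- cat /etc/resolv.conf   # rule out DNS differences between pods
# On the node hosting the bad router pod:
# iptables-save | grep 172.30.82.247               # confirm the service rules are present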
We have sosreports from a working node and a non-working node for any comparisons or data needed.
How many routes are they serving?

What do they get from:

time oc exec -t $routepod ../reload-haproxy

Do they have the RELOAD_INTERVAL env var set on the deployment/pod? We have found that making sure the RELOAD_INTERVAL env var on the router pod is set larger than the time it takes to run "reload-haproxy" is a potential workaround in a similar situation.
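If you want to try that workaround, a minimal sketch (assuming the default router deploymentconfig name "router"; the 10s value is only an example and should exceed the measured reload-haproxy time; note that changing env on the dc will typically trigger a new router deployment):

$ oc env dc/router RELOAD_INTERVAL=10s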
Approximately 117 routes:

$ cat router-4-mrhxe.config | grep backend | grep -v "#" | wc
    117     260    5556

[ec2-user@ip-10-193-188-222 ~]$ time oc exec -t router-4-7syw0 ../reload-haproxy
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).

real    0m0.511s
user    0m0.190s
sys     0m0.036s

It does not appear that RELOAD_INTERVAL is set -- IIRC the default is on the order of seconds, not milliseconds.
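To confirm that from the command line rather than from memory, a quick check (assuming the deploymentconfig is named "router"):

$ oc env dc/router --list | grep RELOAD_INTERVAL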
Can you try v3.4.1.44-2? It has the Pop() fix (which you do not appear to be hitting) and it has an additional change that limits the amount of work needed on reload. (v3.4.1.44 does not have that change.)
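For reference, one way to move the routers onto that tag is to edit the image in the router dc (the exact image name/registry depends on your deployment, so treat this as a sketch):

$ oc edit dc/router   # change the router container's image tag to v3.4.1.44-2 and let the dc redeploy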
stwalter Can you try v3.4.1.44-2?
I asked the customer to test it yesterday afternoon; I'll let you know the results.
The customer confirmed hitting the same issue with the specified tag, v3.4.1.44-2.
stwalter Let's dig into the router a bit.

- Are these existing routes or newly added routes that are failing?
- Can you run "oc get dc <router> -o yaml" for a failing router?
- Are these router shards?
- Are the routes in different namespaces?

Pick a router that fails. When the curl fails:

- oc get route <name> -o yaml -- to see if it is admitted on the router.
- oc rsh <router-pod> cat haproxy.config -- to see if there is a backend for the route.

With the "oc get route" it's OK to delete the certs from what you send back; we don't need them for this. I am just trying to match up the various items.
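For convenience, roughly the same checks as a single pass (pod and route names are placeholders):

$ oc get dc router -o yaml > router-dc.yaml
$ oc get route <name> -n <namespace> -o yaml    # status.ingress shows whether this router admitted the route
$ oc rsh <failing-router-pod>
sh-4.2$ grep -A3 <route-name> haproxy.config    # confirm backend/server lines exist for the route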
These are existing routes. They work for some time and then stop. I'll upload the dc yaml and describe output momentarily.

No router shards. There is one router dc; some replicas in the DC are returning 503s while other replicas in the same DC are not.

I'll check with the customer to verify whether it's affecting all routes or just some, and whether they're in the same namespaces.

We verified that the route exists in haproxy.config in *every* router pod *during* the failure, see c#1. Uploading info in a private comment.
Router pods have not been restarting:

router-10-0z75f   1/1   Running   0   22h
router-10-86ncu   1/1   Running   0   22h
router-10-98spj   1/1   Running   0   22h
router-10-dce2y   1/1   Running   0   22h
router-10-fkp4e   1/1   Running   0   22h
router-10-m1l36   1/1   Running   0   22h
router-10-p0en3   1/1   Running   0   22h
router-10-vl7p9   1/1   Running   0   22h
Based on the config tar above, all three configs are the same, which means that the routers themselves must be marking the endpoints bad. My hunch would have been https://bugzilla.redhat.com/show_bug.cgi?id=1462955, but that was fixed in 3.4.1.44-1, and they have a later version than that.

So... can we get the stats from a router that is bad, so that we can see what its view of the world is? Can you run the command here to gather the stats:

https://docs.openshift.org/latest/admin_guide/router.html#disabling-statistics-view

Not the first code block (that disables external stats) but the second. Then get us the output, please.
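If it helps, one hedged way to pull the raw stats CSV from inside a bad router pod (this assumes the default stats setup: listener on port 1936, stats URI "/", and STATS_USERNAME/STATS_PASSWORD set in the pod environment; your deployment may differ):

$ oc rsh <bad-router-pod>
sh-4.2$ curl -s -u "$STATS_USERNAME:$STATS_PASSWORD" 'http://localhost:1936/;csv' > /tmp/haproxy-stats.csv

The status column in that CSV shows whether the router has marked each backend server UP or DOWN, which is what we want to compare between a good and a bad router.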