Description of problem:
Getting 503 responses from routes. This occurs on pods that have been running for some time and begins seemingly at random. We set the ELB to test against only one router pod at a time; the issue consistently affects 5 of 7 router pods, while 2 of 7 router pods work fine 100% of the time.

Version-Release number of selected component (if applicable):
v3.4.1.44

How reproducible:
Unconfirmed

Additional info:
Router logs do not indicate that this is the known deadlock/router panic bugs. In case it was the known ARP caching issue, we flushed with:

# ip neigh flush dev tun0

but the issue remained.
Steps taken:

- Verified that the route exists in all router pods:

sh-4.2$ cat haproxy.config | grep dcs-dt
backend be_tcp_elis-dt_dcs-dt

- rsh'd into the router pods and curl'd the service IP for that route's service:

$ oc rsh router-4-7syw0
sh-4.2$ curl -kv dcs-dt.elis-dt.svc.cluster.local:8401
* About to connect() to dcs-dt.elis-dt.svc.cluster.local port 8401 (#0)
*   Trying 172.30.82.247...
^C
sh-4.2$ curl -kv 172.30.82.247:8401
* About to connect() to 172.30.82.247 port 8401 (#0)
*   Trying 172.30.82.247...
^C

Both attempts hang. When running from a "good" router pod, the curls work as expected.
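Since the curl to the service IP hangs from the bad router pods but works from the good ones, it may also be worth comparing what the bad node sees for that service. A minimal sketch of extra checks (service/namespace names taken from above; the node-level step assumes root access on the node hosting the bad router pod):

$ oc get endpoints dcs-dt -n elis-dt -o yaml       # confirm the service has endpoints
$ oc exec router-4-7syw0 -- cat /etc/resolv.conf   # rule out DNS differences between pods
# On the node hosting the bad router pod:
# iptables-save | grep 172.30.82.247               # confirm the service rules are present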
We have sosreports from a working node and a non-working node for any comparisons or data needed.
How many routes are they serving?

What do they get from:

time oc exec -t $routepod ../reload-haproxy

Do they have the RELOAD_INTERVAL env var set on the deployment/pod? We have found that making sure the RELOAD_INTERVAL env var on the router pod is set larger than the time it takes to run "reload-haproxy" is a potential workaround in a similar situation.
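If you want to try that workaround, a minimal sketch (assuming the default router deploymentconfig name "router"; the 10s value is only an example and should exceed the measured reload-haproxy time; note that changing env on the dc will typically trigger a new router deployment):

$ oc env dc/router RELOAD_INTERVAL=10s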
Approximately 117 routes:

$ cat router-4-mrhxe.config | grep backend | grep -v "#" | wc
    117     260    5556

[ec2-user@ip-10-193-188-222 ~]$ time oc exec -t router-4-7syw0 ../reload-haproxy
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).

real    0m0.511s
user    0m0.190s
sys     0m0.036s

It does not appear that RELOAD_INTERVAL is set -- IIRC the default is on the order of seconds, not milliseconds.
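To confirm that from the command line rather than from memory, a quick check (assuming the deploymentconfig is named "router"):

$ oc env dc/router --list | grep RELOAD_INTERVAL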
Can you try v3.4.1.44-2? It has the Pop() fix (which you do not appear to be hitting) and it has an additional change that limits the amount of work needed on reload. (v3.4.1.44 does not have that change.)
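For reference, one way to move the routers onto that tag is to edit the image in the router dc (the exact image name/registry depends on your deployment, so treat this as a sketch):

$ oc edit dc/router   # change the router container's image tag to v3.4.1.44-2 and let the dc redeploy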
stwalter Can you try v3.4.1.44-2?
I asked the customer to test it yesterday afternoon; I'll let you know the results.
The customer confirmed hitting the same issue with the specified tag, v3.4.1.44-2.
stwalter Let's dig into the router a bit.

- Are these existing routes or newly added routes that are failing?
- Can you run "oc get dc <router> -o yaml" for a failing router?
- Are these router shards?
- Are the routes in different namespaces?

Pick a router that fails. When the curl fails:

- oc get route <name> -o yaml -- to see if it is admitted on the router.
- oc rsh <router-pod> cat haproxy.config -- to see if there is a backend for the route.

With the "oc get route" it's OK to delete the certs from what you send back; we don't need them for this. I am just trying to match up the various items.
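For convenience, roughly the same checks as a single pass (pod and route names are placeholders):

$ oc get dc router -o yaml > router-dc.yaml
$ oc get route <name> -n <namespace> -o yaml    # status.ingress shows whether this router admitted the route
$ oc rsh <failing-router-pod>
sh-4.2$ grep -A3 <route-name> haproxy.config    # confirm backend/server lines exist for the route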
These are existing routes. They work for some time and then stop. I'll upload the dc yaml and describe output momentarily.

No router shards. There is one router dc; some replicas in the DC are returning 503s while other replicas in the same DC are not.

I'll check with the customer to verify whether it's affecting all routes or just some, and whether they're in the same namespaces.

We verified that the route exists in haproxy.config in *every* router pod *during* the failure, see c#1. Uploading info in a private comment.
Router pods have not been restarting:

router-10-0z75f   1/1   Running   0   22h
router-10-86ncu   1/1   Running   0   22h
router-10-98spj   1/1   Running   0   22h
router-10-dce2y   1/1   Running   0   22h
router-10-fkp4e   1/1   Running   0   22h
router-10-m1l36   1/1   Running   0   22h
router-10-p0en3   1/1   Running   0   22h
router-10-vl7p9   1/1   Running   0   22h
Based on the config tar above, all three configs are the same, which means that the routers themselves must be marking the endpoints bad. My hunch would have been https://bugzilla.redhat.com/show_bug.cgi?id=1462955, but that was fixed in 3.4.1.44-1, and they have a later version than that.

So... can we get the stats from a router that is bad, so that we can see what its view of the world is? Can you run the command here to gather the stats:

https://docs.openshift.org/latest/admin_guide/router.html#disabling-statistics-view

Not the first code block (that disables external stats) but the second. Then get us the output, please.
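If it helps, one hedged way to pull the raw stats CSV from inside a bad router pod (this assumes the default stats setup: listener on port 1936, stats URI "/", and STATS_USERNAME/STATS_PASSWORD set in the pod environment; your deployment may differ):

$ oc rsh <bad-router-pod>
sh-4.2$ curl -s -u "$STATS_USERNAME:$STATS_PASSWORD" 'http://localhost:1936/;csv' > /tmp/haproxy-stats.csv

The status column in that CSV shows whether the router has marked each backend server UP or DOWN, which is what we want to compare between a good and a bad router.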