Bug 1476842 - 503 Messages produced from only some router pods
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
: ---
: 3.9.0
Assigned To: Rajat Chopra
Depends On:
Reported: 2017-07-31 11:03 EDT by Steven Walter
Modified: 2018-01-08 14:42 EST
6 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2018-01-08 14:42:50 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Steven Walter 2017-07-31 11:03:59 EDT
Description of problem:
Getting 503 responses from routes. This occurs on pods that have already been running for some time and begins seemingly at random. We set the ELB to test against only one router pod at a time, and the issue consistently affects 5 of the 7 router pods; the other 2 router pods work fine 100% of the time.

Version-Release number of selected component (if applicable):

How reproducible:

Additional info:
Router logs do not indicate that this is one of the known deadlock/router panic bugs. In case it was a known ARP caching issue, we flushed the ARP cache with:
# ip neigh flush dev tun0
but the issue remained.
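For reference, a quick way to look for stale neighbor entries before flushing is to filter the tun0 neighbor cache by state (a sketch; the state is the last field of each iproute2 `ip neigh` line, and this must run as root on the node):

```shell
# List neighbor entries on tun0 whose state is FAILED or STALE; the
# state keyword is the last field of each `ip neigh` output line.
check_stale_neigh() {
  ip neigh show dev tun0 | awk '$NF=="FAILED" || $NF=="STALE"'
}
# Usage (as root on the node): check_stale_neigh
```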
Comment 1 Steven Walter 2017-07-31 11:05:59 EDT
Steps taken:
- Verified that the route exists in all router pods:
sh-4.2$ cat haproxy.config | grep  dcs-dt
backend be_tcp_elis-dt_dcs-dt

- rsh'd into the router pods and curl'd the service ip for that route's service:
$ oc rsh router-4-7syw0
sh-4.2$ curl -kv dcs-dt.elis-dt.svc.cluster.local:8401  
* About to connect() to dcs-dt.elis-dt.svc.cluster.local port 8401 (#0)
*   Trying
sh-4.2$ curl -kv  
* About to connect() to port 8401 (#0)
*   Trying

Both attempts hang. When running from a "good" router pod, the curls work as expected.
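The two checks above can be scripted across all router pods at once (a sketch: the `router=router` label selector is an assumption about this deployment, and the backend/service names are the ones from this report):

```shell
# For each router pod: confirm the route's backend is in haproxy.config,
# then probe the service with a timeout so a hung connection is reported
# instead of blocking the loop.
check_pods() {
  local backend="$1" url="$2"
  for pod in $(oc get pods -l router=router -o name); do
    if oc rsh "$pod" grep -q "backend $backend" haproxy.config; then
      echo "$pod: backend present"
    else
      echo "$pod: backend MISSING"
    fi
    if oc rsh "$pod" curl -sk --max-time 5 -o /dev/null "$url"; then
      echo "$pod: service reachable"
    else
      echo "$pod: service unreachable (timeout or connection error)"
    fi
  done
}
# Usage: check_pods be_tcp_elis-dt_dcs-dt dcs-dt.elis-dt.svc.cluster.local:8401
```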
Comment 3 Steven Walter 2017-07-31 11:15:35 EDT
We have sosreports from a working node and a not-working node for any comparisons or data needed.
Comment 4 Eric Paris 2017-07-31 12:13:56 EDT
How many routes are they serving?
What do they get from:
  time oc exec -t $routepod ../reload-haproxy
Do they have the RELOAD_INTERVAL env var set on the deployment/pod?

We have found that setting the RELOAD_INTERVAL env var on the router pod to a value larger than the time it takes to run "reload-haproxy" is a potential workaround in similar situations.
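A sketch of applying that workaround, assuming the router is deployed as dc/router and that RELOAD_INTERVAL accepts duration values like "10s":

```shell
# Raise the router's reload interval above the measured reload-haproxy
# wall time (the default value below is illustrative, not a recommendation).
set_reload_interval() {
  oc set env dc/router "RELOAD_INTERVAL=${1:-10s}"
}
# Usage: set_reload_interval 10s
```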
Comment 5 Steven Walter 2017-08-01 16:19:51 EDT
Approximately 117 routes:

$ cat router-4-mrhxe.config | grep backend | grep -v "#" | wc
    117     260    5556

[ec2-user@ip-10-193-188-222 ~]$ time oc exec -t router-4-7syw0 ../reload-haproxy
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).

real    0m0.511s
user    0m0.190s
sys     0m0.036s

It does not appear that RELOAD_INTERVAL is set; IIRC the default is on the order of seconds, not milliseconds.
Comment 6 Phil Cameron 2017-08-07 15:53:30 EDT
Can you try v3.4.1.44-2?
It has the Pop() fix (which is not what you are seeing) and an additional change that limits the amount of work needed on reload. (v3.4.1-44 does not have that change.)
Comment 7 Phil Cameron 2017-08-08 10:37:15 EDT
stwalter@redhat.com Can you try v3.4.1.44-2?
Comment 8 Steven Walter 2017-08-08 10:46:42 EDT
I asked the customer to test it yesterday afternoon; I'll let you know the results.
Comment 9 Steven Walter 2017-08-08 11:31:00 EDT
Customer confirmed hitting the same issue with the specified tag, v3.4.1.44-2
Comment 10 Phil Cameron 2017-08-08 13:13:39 EDT
stwalter@redhat.com Let's dig into the router a bit.

Are these existing routes or newly added routes that are failing? Can you oc get dc <router> -o yaml for a failing router? Are these router shards? Are the routes in different namespaces?

Pick a router that fails. When the curl fails,
oc get route <name> -o yaml              # to see if it is admitted on the router
oc rsh <router-pod> cat haproxy.config   # to see if there is a backend for the route

With the "oc get route" output it's OK to delete the certs from what you send back; we don't need them for this. I am just trying to match up the various items.
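The two checks can be run together (a sketch: route, namespace, and pod names are placeholders, and the be_tcp_<namespace>_<route> backend naming follows the config shown in comment 1):

```shell
# For one failing router pod and one failing route: is the route admitted,
# and does a backend for it exist in that pod's haproxy.config?
diagnose() {
  local route="$1" ns="$2" pod="$3"
  oc get route "$route" -n "$ns" -o yaml | grep -A2 'status:'
  oc rsh "$pod" grep "backend be_tcp_${ns}_${route}" haproxy.config
}
# Usage: diagnose dcs-dt elis-dt router-4-7syw0
```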
Comment 11 Steven Walter 2017-08-08 15:00:55 EDT
These are existing routes. These are routes that work for some time and then stop. I'll upload the dc yaml and describe output momentarily. No router shards. There is one router dc, and some replicas in the DC are causing 503s while other replicas in the same DC are not. I'll check with the customer to verify whether it's affecting all routes or just some, and whether they're in the same namespaces.

We verified that the route exists in haproxy.config in *every* router pod *during* failure, see c#1.

Uploading info in private comment
Comment 21 Steven Walter 2017-08-09 15:46:52 EDT
Router pods have not been restarting

router-10-0z75f            1/1       Running   0          22h
router-10-86ncu            1/1       Running   0          22h
router-10-98spj            1/1       Running   0          22h
router-10-dce2y            1/1       Running   0          22h
router-10-fkp4e            1/1       Running   0          22h
router-10-m1l36            1/1       Running   0          22h
router-10-p0en3            1/1       Running   0          22h
router-10-vl7p9            1/1       Running   0          22h
Comment 35 Ben Bennett 2017-08-31 13:50:29 EDT
Based on the config tar above, all three configs are the same, which means that the routers themselves must be marking the endpoints bad.

My hunch would have been https://bugzilla.redhat.com/show_bug.cgi?id=1462955 , but that was fixed in, and they have a later version than that.

So... can we get the stats from a router that is bad so that we can see what its view of the world is?

Can you run the command here to gather the stats:

Not the first code block (that disables external stats) but the second.

Then get us the output please.
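One way to pull a bad router's view of the world is the HAProxy stats CSV on the port seen in the reload output above (1936). This is a sketch; it assumes stats auth is enabled and exposed via the router's STATS_USERNAME/STATS_PASSWORD env vars:

```shell
# Fetch the HAProxy CSV stats from inside one router pod; the ";csv"
# suffix asks the stats endpoint for machine-readable output.
get_stats() {
  local pod="$1"
  oc rsh "$pod" sh -c \
    'curl -s -u "$STATS_USERNAME:$STATS_PASSWORD" "http://localhost:1936/;csv"'
}
# Usage: get_stats router-10-0z75f > router-10-0z75f.stats.csv
```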
