Description of problem:
Intermittent 503 errors with multiple applications. Multiple router pods are deployed, and it seems that haproxy.config is sometimes not updated on some routers when changes to routes are made. The specific routers affected vary.

Version-Release number of selected component (if applicable):
3.9

How reproducible:
We have tried to reproduce by making changes to application deployments, but have not been able to trigger the problem intentionally.

Additional info:
Redeploying the router pods resolves the issue, but we have not been able to identify a root cause.
Related upstream issue: https://github.com/kubernetes/kubernetes/issues/55860
Can you please oc rsh into one of the routers that is experiencing the problem (while the problem is ongoing) and run:

curl http://$STATS_USERNAME:$STATS_PASSWORD@127.0.0.1:$STATS_PORT/debug/pprof/goroutine?debug=1

It should generate a stack trace from each goroutine. Can you capture the output to a file, then repeat the same request about ten times into different files, with a second or so between requests? That will let us see whether a thread is locked up.

If for some reason that does not work, you may need to set the OPENSHIFT_PROFILE env var on the dc to web:

oc env dc router OPENSHIFT_PROFILE=web

There is no deleterious impact to doing that: it only enables a debugging endpoint that does nothing until the above request hits it. But I understand if you can't do that in your production environment.
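The capture steps above can be sketched as a small shell loop. This is only a sketch: it assumes the STATS_USERNAME, STATS_PASSWORD and STATS_PORT variables are already set in the router pod's environment (as in the default router deployment), and the "goroutine-dump-$i.txt" filenames are illustrative.

```shell
# Capture N goroutine dumps, roughly one second apart, from inside the router pod.
# Assumes STATS_USERNAME/STATS_PASSWORD/STATS_PORT are set in the pod environment;
# the output filenames are illustrative, not mandated by anything.
N=${N:-10}
i=1
while [ "$i" -le "$N" ]; do
  out="goroutine-dump-$i.txt"
  if [ -n "$STATS_PORT" ]; then
    # Only attempt the request when the stats env vars are actually present.
    curl -s "http://$STATS_USERNAME:$STATS_PASSWORD@127.0.0.1:$STATS_PORT/debug/pprof/goroutine?debug=1" > "$out"
    sleep 1
  fi
  echo "$out"
  i=$((i + 1))
done
```

Comparing the resulting files shows whether any goroutine's stack stays identical across dumps, which would indicate a locked-up thread.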
The analysis is that the only difference between the broken and the fixed networking state was the MTU change. So we should backport https://github.com/openshift/origin/pull/19372 (the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1564346 in 3.10 and later) to 3.9 to resolve this.
Hello. I'm having the same issue on OpenShift Online v3.11. Below I copied what "oc get routes -o json" shows. The route points to a single pod. I tried scaling to 0 and then back to 1 as a restart, but the problem still exists.

From a terminal pointing to localhost:4444/wd/hub, all requests are answered properly. However, when testing:

curl http://selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com/wd/hub

only about 1 request out of every 5 or 10 gets a proper response; the rest return a 503 error, "Application is not available". Is there a way to solve this?

Thanks and regards,

{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "route.openshift.io/v1",
            "kind": "Route",
            "metadata": {
                "annotations": {
                    "openshift.io/host.generated": "true"
                },
                "creationTimestamp": "2018-11-03T14:22:30Z",
                "labels": {
                    "app": "selenium-openshift"
                },
                "name": "selenium",
                "namespace": "facturado-selenium",
                "resourceVersion": "2893863589",
                "selfLink": "/apis/route.openshift.io/v1/namespaces/facturado-selenium/routes/selenium",
                "uid": "e5d891b0-df73-11e8-b5c5-0a2a2b777307"
            },
            "spec": {
                "host": "selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com",
                "path": "/wd/hub",
                "port": {
                    "targetPort": "4444-tcp"
                },
                "to": {
                    "kind": "Service",
                    "name": "selenium-openshift",
                    "weight": 100
                },
                "wildcardPolicy": "None"
            },
            "status": {
                "ingress": [
                    {
                        "conditions": [
                            {
                                "lastTransitionTime": "2018-11-03T14:22:30Z",
                                "status": "True",
                                "type": "Admitted"
                            }
                        ],
                        "host": "selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com",
                        "routerCanonicalHostname": "elb.7e14.starter-us-west-2.openshiftapps.com",
                        "routerName": "router",
                        "wildcardPolicy": "None"
                    }
                ]
            }
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}
@Rodrigo, I see this is on OpenShift Starter, not an OpenShift environment you are running yourself, so you may want to file a bug against OpenShift Hosted. That said, I just gave this a whirl, hitting it 50 times:

$ for i in `seq 50`; do echo `date`: $(curl -L -s -o /dev/null -w "%{http_code}" http://selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com/wd/hub) ; done | grep -e '200$' | wc -l
50

It returned 200 OK all 50 times, so it doesn't look to be a problem now. Is that what you see as well? Thanks.
You are right, Ram. It seems it got solved by itself; I just needed to wait a day or two. Thanks and regards,
Backported to OSE 3.9 - associated PR is https://github.com/openshift/ose/pull/1455
@Rodrigo Thanks for the update. Cool - glad it works for you.
Verified with atomic-openshift-3.9.57-1.git.0.67e0f0f.el7, and the issue has been fixed. The tun0 MTU stays the same whether or not the node has pods.

# ovs-vsctl show
dbe9ef54-0f2c-4f96-87ed-b14bb4d6a6b8
    Bridge "br0"
        fail_mode: secure
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "br0"
            Interface "br0"
                type: internal
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
    ovs_version: "2.9.0"

# ip a show tun0
12: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 86:89:3c:1d:d9:40 brd ff:ff:ff:ff:ff:ff
    inet 10.129.0.1/23 brd 10.129.1.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::8489:3cff:fe1d:d940/64 scope link
       valid_lft forever preferred_lft forever

# grep -i mtu /etc/origin/node/node-config.yaml
mtu: 1450
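The verification above (comparing tun0's live MTU against node-config.yaml) can be scripted. A minimal sketch, assuming the node-config path and the `ip a` output format shown in this comment; the helper names parse_config_mtu and parse_link_mtu are my own, not part of any OpenShift tooling.

```shell
# parse_config_mtu: pull the numeric MTU out of an "mtu: NNNN" line, as found
# in /etc/origin/node/node-config.yaml (path taken from the comment above).
parse_config_mtu() {
  printf '%s\n' "$1" | sed -n 's/^mtu:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# parse_link_mtu: pull the MTU out of "ip a show tun0"-style output.
parse_link_mtu() {
  printf '%s\n' "$1" | sed -n 's/.* mtu \([0-9][0-9]*\) .*/\1/p' | head -n 1
}

# Example comparison with the literal values from this comment. On a real node
# you would instead feed in:
#   cfg=$(parse_config_mtu "$(grep -i mtu /etc/origin/node/node-config.yaml)")
#   live=$(parse_link_mtu "$(ip a show tun0)")
cfg=$(parse_config_mtu 'mtu: 1450')
live=$(parse_link_mtu '12: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN')
if [ "$cfg" = "$live" ]; then
  echo "OK: tun0 MTU $live matches node-config"
else
  echo "MISMATCH: config=$cfg live=$live"
fi
```

Running this on each node after the fix should report a match everywhere, mirroring the manual check above.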
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3748