Bug 1628252 - Intermittent 503 errors (application is not available)
Summary: Intermittent 503 errors (application is not available)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Ram Ranganathan
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-12 14:51 UTC by David Kaylor
Modified: 2022-08-04 22:20 UTC
CC List: 15 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-13 19:27:05 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2018:3748 (Last Updated: 2018-12-13 19:27:21 UTC)

Description David Kaylor 2018-09-12 14:51:11 UTC
Description of problem:
Intermittent 503 errors affect multiple applications. Multiple router pods are deployed, and it appears that haproxy.config is sometimes not updated on some routers when changes are made to routes. Which routers are affected varies.
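One way to confirm the symptom is to compare the rendered config across all router pods right after a route change. A sketch only, assuming the routers run as dc/router in the default namespace and that the rendered config lives at /var/lib/haproxy/conf/haproxy.config (the usual locations for a 3.9 router; adjust as needed):

  for pod in $(oc get pods -n default -l deploymentconfig=router -o jsonpath='{.items[*].metadata.name}'); do
    # A checksum that differs on one router suggests that pod missed the route update
    echo -n "$pod: "; oc exec -n default "$pod" -- md5sum /var/lib/haproxy/conf/haproxy.config
  done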

Version-Release number of selected component (if applicable):
3.9

How reproducible:
We have tried to reproduce the issue by making changes to application deployments, but have not been able to trigger the problem intentionally.

Additional info:
Redeploying the router pods resolves the issue, but we have not been able to identify a root cause.
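For reference, the workaround is simply to trigger a new router deployment, which regenerates haproxy.config from scratch in each new pod. A sketch, assuming the default dc/router in the default namespace:

  oc rollout latest dc/router -n default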

Comment 4 Miheer Salunke 2018-09-13 00:53:17 UTC
Related upstream issue: https://github.com/kubernetes/kubernetes/issues/55860

Comment 13 Ben Bennett 2018-09-14 19:25:38 UTC
Can you please oc rsh into one of the routers that is experiencing the problem (while the problem is ongoing) and run:
  curl http://$STATS_USERNAME:$STATS_PASSWORD@127.0.0.1:$STATS_PORT/debug/pprof/goroutine?debug=1

It should generate a stack trace for each goroutine. Please capture the output to a file, then repeat the same request about ten times to different files, with a second or so between them. That will let us see if a thread is locked up.
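Something like the following loop, run from the rsh session inside the router pod, would capture the ten dumps. A sketch, assuming the STATS_USERNAME, STATS_PASSWORD, and STATS_PORT variables are present in the pod's environment (they are set by default on the router dc):

  for i in $(seq 10); do
    # one goroutine dump per file, roughly a second apart
    curl -s "http://$STATS_USERNAME:$STATS_PASSWORD@127.0.0.1:$STATS_PORT/debug/pprof/goroutine?debug=1" > /tmp/goroutine-$i.txt
    sleep 1
  done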


If for some reason that does not work, then you may need to set the OPENSHIFT_PROFILE environment variable on the dc to web:
  oc env dc router OPENSHIFT_PROFILE=web

There is no deleterious impact to doing that; it only enables a debugging endpoint that does nothing until the above request hits it. But I understand if you can't do that in your production environment.

Comment 18 Ben Bennett 2018-09-28 18:07:36 UTC
Our analysis is that the only difference between the broken and the fixed networking state was the MTU change.

So, we should backport https://github.com/openshift/origin/pull/19372 (which was a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1564346 in 3.10 and beyond) to 3.9 to resolve this.
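As a quick check for whether a given node is in the bad state, comparing the live tun0 MTU against the value configured for the node may show a mismatch on affected nodes. A sketch, run directly on a node, assuming the standard /etc/origin/node/node-config.yaml location:

  # the two values should agree; a tun0 MTU that differs from node-config.yaml
  # suggests the node hit the MTU problem addressed by the backport
  ip -o link show tun0 | grep -o 'mtu [0-9]*'
  grep -i mtu /etc/origin/node/node-config.yaml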

Comment 22 Rodrigo Espinosa 2018-11-04 14:42:24 UTC
Hello.
I'm having the same issue.
I'm using OpenShift Online v3.11.
Below I have copied the output of "oc get routes -o json".
The route points to a single pod. I tried scaling it to 0 and then back to 1 as a restart, but the problem still exists.
From a terminal, requests to localhost:4444/wd/hub are all answered properly.
However, when testing:
curl http://selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com/wd/hub
only about 1 out of every 5 to 10 requests gets a proper response; the rest return a 503 error, "Application is not available".
Is there a way to solve this?
Thanks and regards,


{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "route.openshift.io/v1",
            "kind": "Route",
            "metadata": {
                "annotations": {
                    "openshift.io/host.generated": "true"
                },
                "creationTimestamp": "2018-11-03T14:22:30Z",
                "labels": {
                    "app": "selenium-openshift"
                },
                "name": "selenium",
                "namespace": "facturado-selenium",
                "resourceVersion": "2893863589",
                "selfLink": "/apis/route.openshift.io/v1/namespaces/facturado-selenium/routes/selenium",
                "uid": "e5d891b0-df73-11e8-b5c5-0a2a2b777307"
            },
            "spec": {
                "host": "selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com",
                "path": "/wd/hub",
                "port": {
                    "targetPort": "4444-tcp"
                },
                "to": {
                    "kind": "Service",
                    "name": "selenium-openshift",
                    "weight": 100
                },
                "wildcardPolicy": "None"
            },
            "status": {
                "ingress": [
                    {
                        "conditions": [
                            {
                                "lastTransitionTime": "2018-11-03T14:22:30Z",
                                "status": "True",
                                "type": "Admitted"
                            }
                        ],
                        "host": "selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com",
                        "routerCanonicalHostname": "elb.7e14.starter-us-west-2.openshiftapps.com",
                        "routerName": "router",
                        "wildcardPolicy": "None"
                    }
                ]
            }
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

Comment 24 Ram Ranganathan 2018-11-06 21:26:22 UTC
@Rodrigo, I see this is on OpenShift Starter and not an OpenShift environment you are running yourself.

You may want to file a bug against OpenShift Hosted. 

That said, I just gave this a whirl, hitting it 50 times:
$ for i in `seq 50`; do echo `date`: $(curl -L -s -o /dev/null -w "%{http_code}"  http://selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com/wd/hub) ; done | grep -e '200$' | wc -l
50

It returns 200 OK all 50 times, so it doesn't look to be a problem now. Is that what you see as well? Thanks.
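For what it's worth, a variant of the same loop that tallies every status code (rather than counting only the 200s) makes an intermittent 503 easier to spot:

  for i in $(seq 50); do
    curl -L -s -o /dev/null -w "%{http_code}\n" http://selenium-facturado-selenium.7e14.starter-us-west-2.openshiftapps.com/wd/hub
  done | sort | uniq -c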

Comment 25 Rodrigo Espinosa 2018-11-07 01:05:56 UTC
You are right, Ram. It seems it got resolved by itself; I just needed to wait a day or two.
Thanks and regards,

Comment 26 Ram Ranganathan 2018-11-07 22:51:49 UTC
Backported to OSE 3.9; the associated PR is:
   https://github.com/openshift/ose/pull/1455

Comment 27 Ram Ranganathan 2018-11-07 22:53:08 UTC
@Rodrigo Thanks for the update. Cool - glad it works for you.

Comment 31 Hongan Li 2018-12-03 07:38:57 UTC
Verified with atomic-openshift-3.9.57-1.git.0.67e0f0f.el7; the issue has been fixed.

The tun0 MTU stays the same whether or not the node has any pods.

# ovs-vsctl show
dbe9ef54-0f2c-4f96-87ed-b14bb4d6a6b8
    Bridge "br0"
        fail_mode: secure
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "br0"
            Interface "br0"
                type: internal
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
    ovs_version: "2.9.0"
# ip a show tun0
12: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 86:89:3c:1d:d9:40 brd ff:ff:ff:ff:ff:ff
    inet 10.129.0.1/23 brd 10.129.1.255 scope global tun0
       valid_lft forever preferred_lft forever
    inet6 fe80::8489:3cff:fe1d:d940/64 scope link 
       valid_lft forever preferred_lft forever

# grep -i mtu /etc/origin/node/node-config.yaml 
   mtu: 1450

Comment 33 errata-xmlrpc 2018-12-13 19:27:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748

