Bug 1760948

Summary: .well-known/oauth-authorization-server Client.Timeout exceeded while awaiting headers on upgrade
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Version: 4.2.0
Target Release: 4.2.z
Target Milestone: ---
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Ben Bennett <bbennett>
Assignee: Juan Luis de Sousa-Valadas <jdesousa>
QA Contact: huirwang
CC: acomabon, aos-bugs, bbennett, cdc, huirwang, wking, zzhao
Hardware: Unspecified
OS: Unspecified
Whiteboard: SDN-CI-IMPACT SDN-BP
Doc Type: If docs needed, set a value
Clone Of: 1760103
Bug Depends On: 1760103
Last Closed: 2020-06-03 09:26:03 UTC

Description Ben Bennett 2019-10-11 19:21:58 UTC
+++ This bug was initially created as a clone of Bug #1760103 +++

Seen in a 4.2.0-rc.0->4.2.0-rc.3 upgrade test on AWS [1,2]:

    lastState:
      terminated:
        containerID: cri-o://2a94ca623772ba2a7362e1b6e4044b921177be9e1bb46223baf480be997b5669
        exitCode: 2
        finishedAt: 2019-10-09T15:41:12Z
        message: |
          eaders)
          2019/10/9 15:39:17 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:39:32 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:39:47 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:40:02 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:40:17 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:40:32 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:40:47 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:41:02 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        reason: Error
        startedAt: 2019-10-09T15:38:12Z

The symptoms are a lot like bug 1745807, but there was no proxy involved in this test.  Assigning to Routing for further triage based on Michal's initial guess.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2130377ba7ab9dbed8350c52b098dae1575a7dbafe279f8c013e6455d2da6a93/namespaces/openshift-console/pods/console-7d756bff-5s8q5/console-7d756bff-5s8q5.yaml
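
For reference, the failing check is an ordinary HTTPS GET against the well-known endpoint with a client-side timeout. The following is a minimal, purely illustrative Go probe (not taken from the console code; the endpoint and the ~10s cadence come from the log above, everything else is assumed) that surfaces the same net/http error when the service VIP is unreachable:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// 10s roughly matches the retry cadence visible in the console log above.
	client := &http.Client{
		Timeout: 10 * time.Second,
		// Skipping certificate verification only because this is an ad-hoc probe.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	resp, err := client.Get("https://kubernetes.default.svc/.well-known/oauth-authorization-server")
	if err != nil {
		// On a broken node this prints the same
		// "Client.Timeout exceeded while awaiting headers" error.
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("probe status:", resp.Status)
}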

--- Additional comment from Hongan Li on 2019-10-10 05:10:14 EDT ---

We didn't see those error messages from the console pod during our upgrade testing.

--- Additional comment from Dan Mace on 2019-10-10 11:38:17 EDT ---

The reported errors are apiserver connectivity issues. Looking at the test logs, this is the reason the test reports for the failure:

    fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:139]: during upgrade
    Unexpected error:
        <*errors.errorString | 0xc0026bbfe0>: {
            s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator console has not yet reported success",
        }
        Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator console has not yet reported success
    occurred

    failed: (1h21m57s) 2019-10-09T15:41:21 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"

The test then shows that the console operator is failing:

Oct  9 15:38:21.538: INFO: Cluster version:
{
  "metadata": {
    "name": "version",
    "selfLink": "/apis/config.openshift.io/v1/clusterversions/version",
    "uid": "f69fe9ad-ea9d-11e9-b4a1-12b8e39af2c4",
    "resourceVersion": "30034",
    "generation": 2,
    "creationTimestamp": "2019-10-09T14:06:17Z"
  },
  "spec": {
    "clusterID": "fa257450-2899-4dd6-9469-5613dd68c65a",
    "desiredUpdate": {
      "version": "",
      "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
      "force": true
    },
    "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph",
    "channel": "stable-4.2"
  },
  "status": {
    "desired": {
      "version": "4.2.0-rc.3",
      "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
      "force": true
    },
    "history": [
      {
        "state": "Partial",
        "startedTime": "2019-10-09T14:23:16Z",
        "completionTime": null,
        "version": "4.2.0-rc.3",
        "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
        "verified": true
      },
      {
        "state": "Completed",
        "startedTime": "2019-10-09T14:06:33Z",
        "completionTime": "2019-10-09T14:23:16Z",
        "version": "4.2.0-rc.0",
        "image": "registry.svc.ci.openshift.org/ocp/release@sha256:7e063c6b4a61377755284bf1a37bef8a25e5c1663a63fe2ba2e005f745f4e8c3",
        "verified": false
      }
    ],
    "observedGeneration": 2,
    "versionHash": "pbTC0u6-z1Q=",
    "conditions": [
      {
        "type": "Available",
        "status": "True",
        "lastTransitionTime": "2019-10-09T14:19:16Z",
        "message": "Done applying 4.2.0-rc.0"
      },
      {
        "type": "Failing",
        "status": "True",
        "lastTransitionTime": "2019-10-09T14:42:07Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Cluster operator console has not yet reported success"
      },
      {
        "type": "Progressing",
        "status": "True",
        "lastTransitionTime": "2019-10-09T14:23:16Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Unable to apply 4.2.0-rc.3: the cluster operator console has not yet successfully rolled out"
      },
      {
        "type": "RetrievedUpdates",
        "status": "False",
        "lastTransitionTime": "2019-10-09T14:06:33Z",
        "reason": "RemoteFailed",
        "message": "Unable to retrieve available updates: currently installed version 4.2.0-rc.3 not found in the \"stable-4.2\" channel"
      }
    ],
    "availableUpdates": null
  }
}

The test then reports the status of all operators, showing that dns and authentication are also failing:

Oct  9 15:38:21.641: INFO: Cluster operators:
NAME                                     A F P VERSION MESSAGE
authentication                               T <new>   Progressing: not all deployment replicas are ready
cloud-credential                               <new>   4 of 4 credentials requests provisioned and reconciled.
cluster-autoscaler                             <new>   
console                                  F     <new>   
dns                                        T T <new>   Not all DNS DaemonSets available.
image-registry                                 <new>   The registry is ready
ingress                                        <new>   desired and current number of IngressControllers are equal
insights                                       <new>   Monitoring the cluster
kube-apiserver                                 <new>   Progressing: 3 nodes are at revision 7
kube-controller-manager                        <new>   Progressing: 3 nodes are at revision 6
kube-scheduler                                 <new>   Progressing: 3 nodes are at revision 6
machine-api                                    <new>   
machine-config                                 <old>   Cluster version is 4.2.0-rc.0
marketplace                                    <new>   Successfully progressed to release version: 4.2.0-rc.3
monitoring                                     <new>   
network                                        <new>   
node-tuning                                    <new>   Cluster version is "4.2.0-rc.3"
openshift-apiserver                            <new>   
openshift-controller-manager                   <new>   
openshift-samples                              <new>   Samples installation successful at 4.2.0-rc.3
operator-lifecycle-manager                     <new>   Deployed 0.11.0
operator-lifecycle-manager-catalog             <new>   Deployed 0.11.0
operator-lifecycle-manager-packageserver       <new>   Deployed version 0.11.0
service-ca                                     <new>   Progressing: All service-ca-operator deployments updated
service-catalog-apiserver                      <new>   
service-catalog-controller-manager             <new>   
storage                                        <new>   

If you look at the cluster state, CoreDNS is fully rolled out, but the dns-operator can't report status back to the apiserver because of a nearly hour-long apiserver connectivity outage:

E1009 14:50:24.555770       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.130.0.74:43434->172.30.0.1:443: read: no route to host
E1009 15:41:45.789223       1 reflector.go:126] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.DaemonSet: Get https://172.30.0.1:443/apis/apps/v1/namespaces/openshift-dns/daemonsets?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host

So, I'm not seeing any evidence of an issue with DNS or ingress here. I do see evidence of an apiserver connectivity issue, which could be networking or the apiserver itself.
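
One way to tell the two apart (an illustrative sketch, not something from the must-gather; the VIP is the 172.30.0.1:443 address from the errors above) is a bare TCP dial from an affected pod. An immediate "connect: no route to host" here fails before TLS or HTTP are ever involved, which points at the node's datapath rather than at the apiserver:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Dial the kubernetes service VIP directly; no TLS, no HTTP.
	conn, err := net.DialTimeout("tcp", "172.30.0.1:443", 5*time.Second)
	if err != nil {
		// "connect: no route to host" at this layer is a datapath problem,
		// not an apiserver problem.
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("TCP connect OK:", conn.RemoteAddr())
}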

I do notice that the dns-operator is hot-looping its reconciliation during apiserver connectivity outages; I'll open a new non-blocker bug for that.

If you agree with my assessment, can you help me choose a new component for this bug?

--- Additional comment from Dan Mace on 2019-10-10 11:43:22 EDT ---

I opened https://bugzilla.redhat.com/show_bug.cgi?id=1760473 to track the dns-operator hot loop bug.

--- Additional comment from W. Trevor King on 2019-10-10 12:35:41 EDT ---

> If you agree with my assessment, can you help me choose a new component for this bug?

Flipping a coin and going with SDN/Networking.

--- Additional comment from Casey Callendrello on 2019-10-11 10:10:27 EDT ---

Quick notes for those debugging:

pod with connection issues: dns-operator-5db7cc9bd6-6ksfp

node: ip-10-0-159-27.ec2.internal

sdn pod on node: sdn-sw5d9

--- Additional comment from Casey Callendrello on 2019-10-11 10:17:27 EDT ---

Checking the sdn logs, I see that the endpoints are correct. So, this is not any sort of informer issue.

I wonder if there's a chance we somehow tore down the wrong container's eth0.

--- Additional comment from Casey Callendrello on 2019-10-11 10:29:40 EDT ---

Looking at the logs, the dns-operator seems to have trouble reaching the apiserver starting at 14:50:24. There's nothing immediately interesting in the node journal at that time. So there goes that theory.

(Also, I meant to say that the iptables rules are correct).

--- Additional comment from Casey Callendrello on 2019-10-11 12:12:53 EDT ---

A few pods, but not all, are failing with "no route to host" when trying to connect to the apiserver. They are all on this node, yet other pods running on the same node have no problems.

non-exhaustive list:

Bad pods:
oauth-openshift-c8b557487-x7rfg
dns-operator-5db7cc9bd6-6ksfp
openshift-service-catalog-apiserver-operator-568fdc7d75-ddjbv

Good pods:
openshift-apiserver-operator-dc775fbf4-mwqlr
apiserver_apiserver-nt2nn
cluster-storage-operator-f4584dcb9-nz9f9

--- Additional comment from Casey Callendrello on 2019-10-11 13:01:20 EDT ---

Interesting information: the first pod to report "no route to host" does so at about the time the SDN restarts after OVS restarts.

OVS starts after the upgrade at 14:34:22. openshift-sdn then detects this and restarts (but far too late?).

I1009 14:35:18.174477   15632 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
F1009 14:35:18.174515   15632 healthcheck.go:82] SDN healthcheck detected OVS server change, restarting: OVS reinitialization required: plugin is not setup

The first container gets a "no route to host" when talking to the apiserver at 14:34:54.

Is it possible that containers that talk at the wrong time are caching the no-route-to-host result?

--- Additional comment from Casey Callendrello on 2019-10-11 13:18:28 EDT ---

Correction: ignore the sdn logs. They're from the wrong node, and we don't have any old sdn logs. The SDN process started at 14:34:37.

--- Additional comment from Ben Bennett on 2019-10-11 15:21:14 EDT ---

This seems to be the first instance of this bug that we can find, and none of the other CI runs with the same error message exhibited the same underlying problem.

However, if this hits a cluster, the workaround is believed to be restarting the ovs pod on the broken node (or, worst case, rebooting the node). I think we can fix this in 4.2.z.

(I'm setting this bug to 4.3.0 and cloning to track for 4.2.z)
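
For anyone automating that workaround, here is a hedged client-go sketch that deletes the ovs pod on the broken node so the DaemonSet recreates it. Assumptions: a recent client-go (context-taking methods), that the OVS pods in openshift-sdn carry an app=ovs label, and the node name from this report; the usual oc delete of the same pod is equivalent.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	node := "ip-10-0-159-27.ec2.internal" // broken node from this report
	ctx := context.Background()

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// "app=ovs" is an assumption about the 4.2 openshift-sdn DaemonSet labels.
	pods, err := cs.CoreV1().Pods("openshift-sdn").List(ctx, metav1.ListOptions{
		LabelSelector: "app=ovs",
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// The DaemonSet recreates the pod, which re-initializes OVS on the node.
		if err := cs.CoreV1().Pods("openshift-sdn").Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			panic(err)
		}
		fmt.Println("deleted", p.Name)
	}
}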

Comment 1 Alejandro Coma 2019-10-24 12:25:58 UTC
I have just hit this in a 4.2.0 cluster.

I can confirm that restarting the ovs pod on the affected node does the trick. No need to restart the node.

Comment 3 Alejandro Coma 2019-10-24 12:32:15 UTC
And it wasn't an upgrade; it was a fresh 4.2.0 install...

Comment 8 errata-xmlrpc 2020-06-03 09:26:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2307