Bug 1760103 - .well-known/oauth-authorization-server Client.Timeout exceeded while awaiting headers on upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1760948
 
Reported: 2019-10-09 20:53 UTC by W. Trevor King
Modified: 2020-05-13 21:27 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1760948
Environment:
Last Closed: 2020-05-13 21:27:30 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift sdn pull 52 'None' closed Bug 1760103: Adjust SDN setup so AlreadySetUp check will fail on a half-set-up node 2020-07-12 03:37:02 UTC
Red Hat Product Errata RHBA-2020:0062 None None None 2020-01-23 11:07:03 UTC

Description W. Trevor King 2019-10-09 20:53:14 UTC
Seen in a 4.2.0-rc.0->4.2.0-rc.3 upgrade test on AWS [1,2]:

    lastState:
      terminated:
        containerID: cri-o://2a94ca623772ba2a7362e1b6e4044b921177be9e1bb46223baf480be997b5669
        exitCode: 2
        finishedAt: 2019-10-09T15:41:12Z
        message: |
          eaders)
          2019/10/9 15:39:17 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:39:32 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:39:47 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:40:02 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:40:17 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:40:32 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:40:47 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
          2019/10/9 15:41:02 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        reason: Error
        startedAt: 2019-10-09T15:38:12Z

The symptoms are a lot like bug 1745807, but there was no proxy involved in this test.  Assigning to Routing for further triage based on Michal's initial guess.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2130377ba7ab9dbed8350c52b098dae1575a7dbafe279f8c013e6455d2da6a93/namespaces/openshift-console/pods/console-7d756bff-5s8q5/console-7d756bff-5s8q5.yaml

Comment 1 Hongan Li 2019-10-10 09:10:14 UTC
We didn't see those error messages from the console pod during upgrade testing.

Comment 2 Dan Mace 2019-10-10 15:38:17 UTC
The reported errors are apiserver connectivity issues. Looking at the test logs, this is the reason the test reports for the failure:

    fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:139]: during upgrade
    Unexpected error:
        <*errors.errorString | 0xc0026bbfe0>: {
            s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator console has not yet reported success",
        }
        Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator console has not yet reported success
    occurred

    failed: (1h21m57s) 2019-10-09T15:41:21 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"

The test then shows that the console operator is failing:

Oct  9 15:38:21.538: INFO: Cluster version:
{
  "metadata": {
    "name": "version",
    "selfLink": "/apis/config.openshift.io/v1/clusterversions/version",
    "uid": "f69fe9ad-ea9d-11e9-b4a1-12b8e39af2c4",
    "resourceVersion": "30034",
    "generation": 2,
    "creationTimestamp": "2019-10-09T14:06:17Z"
  },
  "spec": {
    "clusterID": "fa257450-2899-4dd6-9469-5613dd68c65a",
    "desiredUpdate": {
      "version": "",
      "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
      "force": true
    },
    "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph",
    "channel": "stable-4.2"
  },
  "status": {
    "desired": {
      "version": "4.2.0-rc.3",
      "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
      "force": true
    },
    "history": [
      {
        "state": "Partial",
        "startedTime": "2019-10-09T14:23:16Z",
        "completionTime": null,
        "version": "4.2.0-rc.3",
        "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
        "verified": true
      },
      {
        "state": "Completed",
        "startedTime": "2019-10-09T14:06:33Z",
        "completionTime": "2019-10-09T14:23:16Z",
        "version": "4.2.0-rc.0",
        "image": "registry.svc.ci.openshift.org/ocp/release@sha256:7e063c6b4a61377755284bf1a37bef8a25e5c1663a63fe2ba2e005f745f4e8c3",
        "verified": false
      }
    ],
    "observedGeneration": 2,
    "versionHash": "pbTC0u6-z1Q=",
    "conditions": [
      {
        "type": "Available",
        "status": "True",
        "lastTransitionTime": "2019-10-09T14:19:16Z",
        "message": "Done applying 4.2.0-rc.0"
      },
      {
        "type": "Failing",
        "status": "True",
        "lastTransitionTime": "2019-10-09T14:42:07Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Cluster operator console has not yet reported success"
      },
      {
        "type": "Progressing",
        "status": "True",
        "lastTransitionTime": "2019-10-09T14:23:16Z",
        "reason": "ClusterOperatorNotAvailable",
        "message": "Unable to apply 4.2.0-rc.3: the cluster operator console has not yet successfully rolled out"
      },
      {
        "type": "RetrievedUpdates",
        "status": "False",
        "lastTransitionTime": "2019-10-09T14:06:33Z",
        "reason": "RemoteFailed",
        "message": "Unable to retrieve available updates: currently installed version 4.2.0-rc.3 not found in the \"stable-4.2\" channel"
      }
    ],
    "availableUpdates": null
  }
}

The test then reports the status of all operators, showing that dns and auth are also failing:

Oct  9 15:38:21.641: INFO: Cluster operators:
NAME                                     A F P VERSION MESSAGE
authentication                               T <new>   Progressing: not all deployment replicas are ready
cloud-credential                               <new>   4 of 4 credentials requests provisioned and reconciled.
cluster-autoscaler                             <new>   
console                                  F     <new>   
dns                                        T T <new>   Not all DNS DaemonSets available.
image-registry                                 <new>   The registry is ready
ingress                                        <new>   desired and current number of IngressControllers are equal
insights                                       <new>   Monitoring the cluster
kube-apiserver                                 <new>   Progressing: 3 nodes are at revision 7
kube-controller-manager                        <new>   Progressing: 3 nodes are at revision 6
kube-scheduler                                 <new>   Progressing: 3 nodes are at revision 6
machine-api                                    <new>   
machine-config                                 <old>   Cluster version is 4.2.0-rc.0
marketplace                                    <new>   Successfully progressed to release version: 4.2.0-rc.3
monitoring                                     <new>   
network                                        <new>   
node-tuning                                    <new>   Cluster version is "4.2.0-rc.3"
openshift-apiserver                            <new>   
openshift-controller-manager                   <new>   
openshift-samples                              <new>   Samples installation successful at 4.2.0-rc.3
operator-lifecycle-manager                     <new>   Deployed 0.11.0
operator-lifecycle-manager-catalog             <new>   Deployed 0.11.0
operator-lifecycle-manager-packageserver       <new>   Deployed version 0.11.0
service-ca                                     <new>   Progressing: All service-ca-operator deployments updated
service-catalog-apiserver                      <new>   
service-catalog-controller-manager             <new>   
storage                                        <new>   

If you look at the cluster state, CoreDNS is fully rolled out, but the dns-operator can't report status back to the apiserver because of a nearly hour-long apiserver connectivity outage:

E1009 14:50:24.555770       1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.130.0.74:43434->172.30.0.1:443: read: no route to host
E1009 15:41:45.789223       1 reflector.go:126] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.DaemonSet: Get https://172.30.0.1:443/apis/apps/v1/namespaces/openshift-dns/daemonsets?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host

So, I'm not seeing any evidence of an issue with DNS or ingress here. I do see evidence of an apiserver connectivity issue, which could be networking or the apiserver itself.

I do notice that dns-operator is hotlooping reconciliation during apiserver connectivity outages, which I'll open a new non-blocker bug for.

If you agree with my assessment, can you help me choose a new component for this bug?

Comment 3 Dan Mace 2019-10-10 15:43:22 UTC
I opened https://bugzilla.redhat.com/show_bug.cgi?id=1760473 to track the dns-operator hot loop bug.

Comment 4 W. Trevor King 2019-10-10 16:35:41 UTC
> If you agree with my assessment, can you help me choose a new component for this bug?

Flipping a coin and going with SDN/Networking.

Comment 5 Casey Callendrello 2019-10-11 14:10:27 UTC
Quick notes for those debugging:

pod with connection issues: dns-operator-5db7cc9bd6-6ksfp

node: ip-10-0-159-27.ec2.internal

sdn pod on node: sdn-sw5d9

Comment 6 Casey Callendrello 2019-10-11 14:17:27 UTC
Checking the sdn logs, I see that the endpoints are correct. So, this is not any sort of informer issue.

I wonder if there's a chance we somehow tore down the wrong container's eth0.

Comment 7 Casey Callendrello 2019-10-11 14:29:40 UTC
Looking at the logs, the dns-operator seems to have trouble reaching the apiserver starting at 14:50:24. There's nothing immediately interesting in the node journal at that time. So there goes that theory.

(Also, I meant to say that the iptables rules are correct).

Comment 8 Casey Callendrello 2019-10-11 16:12:53 UTC
A few pods are failing with "no route to host" when trying to connect to the apiserver, but not all. They are all on this node, yet other pods running on the same node have no problems.

non-exhaustive list:

Bad pods:
oauth-openshift-c8b557487-x7rfg
dns-operator-5db7cc9bd6-6ksfp
openshift-service-catalog-apiserver-operator-568fdc7d75-ddjbv

Good pods:
openshift-apiserver-operator-dc775fbf4-mwqlr
apiserver_apiserver-nt2nn
cluster-storage-operator-f4584dcb9-nz9f9

Comment 9 Casey Callendrello 2019-10-11 17:01:20 UTC
Interesting information: the first pod to report "no route to host" does so at about the time the SDN restarts following the OVS restart.

OVS starts after upgrade at 14:34:22. openshift-sdn then detects this and restarts (but way too late?)

I1009 14:35:18.174477   15632 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
F1009 14:35:18.174515   15632 healthcheck.go:82] SDN healthcheck detected OVS server change, restarting: OVS reinitialization required: plugin is not setup

The first container gets a "no route to host" when talking to the apiserver at 14:34:54.

Is it possible that containers that talk at the wrong time are caching the no-route-to-host?

Comment 10 Casey Callendrello 2019-10-11 17:18:28 UTC
Correction: ignore the sdn logs. They're from the wrong node. We don't have any old sdn logs. The SDN process started at 14:34:37.

Comment 11 Ben Bennett 2019-10-11 19:21:14 UTC
This seems to be the first instance of this bug that we can find, and none of the other CI runs with the same error message exhibited the same underlying problem.

However, if this hits a cluster, the workaround is believed to be to restart the ovs pod on the broken node (or, worst case, reboot the node).  I think we can fix this in 4.2.z.

(I'm setting this bug to 4.3.0 and cloning to track for 4.2.z)

Comment 12 Dan Winship 2019-10-12 13:30:11 UTC
OK. tl;dr: we're killing an SDN pod halfway through SDN startup, and the next SDN pod doesn't recover correctly from the half-initialized state.

I'm sure we've at least discussed this possibility before but we don't seem to be doing anything to prevent it? In particular, I think the problem here is that all of the "create the bridge and set up routes and default OVS flows and stuff" things happened, but then sdn was killed before it finished the "reattach old pods to the bridge" step.

We can probably fix this by tweaking the startup initialization stuff vs the alreadySetUp() check. (eg, if we waited to create the table 253 version note until the very end of startup, then the new SDN pod wouldn't have decided that the bridge was fully set up. Of course, it would have had to kill the not-yet-attached pods rather than reattaching them since the state in the ovsdb would have been lost by this point. But OCP is supposed to be able to recover from that anyway...)
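
The proposed ordering fix can be sketched in miniature. This is a hypothetical Python simulation of the logic only — the actual openshift-sdn code is Go, and the real change landed via the linked openshift/sdn pull 52 — but it captures the idea: write the completion marker (the table 253 version note) only as the very last setup step, so a pod killed mid-setup leaves no marker and the next pod knows a full setup is required.

```python
# Hypothetical simulation of the fix idea; the real openshift-sdn
# implementation is Go and differs in detail.

def setup_node(killed_before_reattach=False):
    """Run node setup; return the resulting node state."""
    state = {"bridge": False, "pods_reattached": False, "version_note": False}
    state["bridge"] = True           # create br0, routes, default OVS flows
    if killed_before_reattach:
        return state                 # simulate the SDN pod dying mid-startup
    state["pods_reattached"] = True  # reattach old pods to the bridge
    state["version_note"] = True     # marker written LAST, only on success
    return state

def already_set_up(state):
    # Only true if the final marker exists, i.e. setup fully completed.
    return state["version_note"]

half_done = setup_node(killed_before_reattach=True)
print(already_set_up(half_done))   # False -> next pod does a full SDN setup

complete = setup_node()
print(already_set_up(complete))    # True -> safe to skip re-setup
```

With the marker written first (or mid-way), a half-set-up node would pass the check and the new pod would skip setup, which matches the failure seen in this bug.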

Detailed timeline, mostly from kubelet logs in masters-journal:

  - ip-10-0-159-27 comes back up after RHCOS update at 14:04:00

  - sdn-rzhwc and ovs-brgbt (pre-upgrade versions) are started at 14:07:05

  - OVS pod upgrade starts at 14:34:00

        I1009 14:34:00.708770    1179 kubelet.go:1910] SyncLoop (DELETE, "api"): "ovs-brgbt_openshift-sdn(12b4bea1-ea9e-11e9-b4a1-12b8e39af2c4)"

  - sdn-rzhwc exits 10 seconds later after timing out waiting for OVS to come back.

        F1009 14:34:10.910955    3778 healthcheck.go:82] SDN healthcheck detected OVS server change, restarting: timed out waiting for the condition

    (We don't have sdn-rzhwc's logs but that ends up in kubelet's logs due to
    "terminationMessagePolicy: FallbackToLogsOnError".)

  - ovs-fv5j7 (post-upgrade version) is deployed at 14:34:11 (after ovs-brgbt cleanup is
    complete; it is not clear to me why it takes so long for ovs-brgbt to be cleaned up).

        I1009 14:34:11.726383    1179 kubelet.go:1904] SyncLoop (REMOVE, "api"): "ovs-brgbt_openshift-sdn(12b4bea1-ea9e-11e9-b4a1-12b8e39af2c4)"
        ...
        I1009 14:34:11.746904    1179 kubelet.go:1894] SyncLoop (ADD, "api"): "ovs-fv5j7_openshift-sdn(dc3d5456-eaa1-11e9-9590-0a755baa03e8)"

  - meanwhile, kubelet restarts sdn-rzhwc:

        I1009 14:34:11.834980    1179 kuberuntime_manager.go:554] Container {Name:sdn, ...} is dead, but RestartPolicy says that we should restart it.
        ...
        I1009 14:34:12.079259    1179 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-rzhwc", UID:"132fa888-ea9e-11e9-b4a1-12b8e39af2c4", APIVersion:"v1", ResourceVersion:"1587", FieldPath:"spec.containers{sdn}"}): type: 'Normal' reason: 'Started' Started container sdn

  - and almost simultaneously, the DaemonSet controller starts the SDN pod upgrade:

        I1009 14:34:12.718421    1179 kubelet.go:1910] SyncLoop (DELETE, "api"): "sdn-rzhwc_openshift-sdn(132fa888-ea9e-11e9-b4a1-12b8e39af2c4)"

  - ovs-fv5j7 starts up and its logs show that "someone" (presumably the newly-restarted sdn-rzhwc)
    is modifying flows at 14:34:22:

        2019-10-09T14:34:22.677Z|00094|connmgr|INFO|br0<->unix#3: 47 flow_mods in the last 0 s (47 adds)
        ...
        2019-10-09T14:34:23.756Z|00121|connmgr|INFO|br0<->unix#52: 5 flow_mods in the last 0 s (5 adds)

  - sdn-rzhwc is fully killed by 14:34:24:

        I1009 14:34:24.846071    1179 kubelet.go:1939] SyncLoop (PLEG): "sdn-rzhwc_openshift-sdn(132fa888-ea9e-11e9-b4a1-12b8e39af2c4)", event: &pleg.PodLifecycleEvent{ID:"132fa888-ea9e-11e9-b4a1-12b8e39af2c4", Type:"ContainerDied", Data:"01b5d9002780fcab142e1cd7f8a2a42caf3b56c653f3697d416aaf3e1cf37869"}

    (We don't have any of its logs, because it was killed rather than exiting with an error, so
    kubelet didn't grab them.)

  - sdn-sw5d9 (post-upgrade version) is deployed at 14:34:31:

        I1009 14:34:31.740125    1179 kubelet.go:1894] SyncLoop (ADD, "api"): "sdn-sw5d9_openshift-sdn(e8295c08-eaa1-11e9-9590-0a755baa03e8)"

  - sdn-sw5d9 logs begin at 14:34:37 and it decides that br0 is fine and it doesn't
    need to recreate the network:

        I1009 14:34:37.206227   15260 node.go:378] Starting openshift-sdn network plugin
        I1009 14:34:37.502158   15260 vnids.go:148] Associate netid 0 to namespace "default" with mcEnabled false

    (If it decided to recreate the network it would have logged "full SDN setup required"
    in between those messages.)

Comment 13 Dan Winship 2019-10-12 13:31:13 UTC
(hadn't meant to change the Target Release, that was stale bugzilla data)

Comment 15 zhaozhanqi 2019-10-16 11:04:27 UTC
Hmm, I checked this issue with the old build 4.3.0-0.nightly-2019-10-15-180816.

I killed the sdn container on one worker with this command:

   while true; do crictl rm $(crictl ps | grep sdn | awk '{print $1}'); sleep 1; done

Then I found that the sdn pod crashes:

oc get pod -n openshift-sdn -o wide
NAME                   READY   STATUS             RESTARTS   AGE     IP             NODE                                              NOMINATED NODE   READINESS GATES
ovs-86qvx              1/1     Running            0          7h56m   10.0.144.9     ip-10-0-144-9.ap-northeast-1.compute.internal     <none>           <none>
ovs-8d6hf              1/1     Running            0          7h17m   10.0.159.102   ip-10-0-159-102.ap-northeast-1.compute.internal   <none>           <none>
ovs-dxvdn              1/1     Running            0          7h56m   10.0.166.240   ip-10-0-166-240.ap-northeast-1.compute.internal   <none>           <none>
ovs-wwclz              1/1     Running            0          7h48m   10.0.141.185   ip-10-0-141-185.ap-northeast-1.compute.internal   <none>           <none>
ovs-zfdtm              1/1     Running            0          7h56m   10.0.129.242   ip-10-0-129-242.ap-northeast-1.compute.internal   <none>           <none>
sdn-76cjx              1/1     Running            1          7h56m   10.0.129.242   ip-10-0-129-242.ap-northeast-1.compute.internal   <none>           <none>
sdn-controller-h4wxn   1/1     Running            0          7h56m   10.0.166.240   ip-10-0-166-240.ap-northeast-1.compute.internal   <none>           <none>
sdn-controller-ls9pf   1/1     Running            0          7h56m   10.0.144.9     ip-10-0-144-9.ap-northeast-1.compute.internal     <none>           <none>
sdn-controller-m57rv   1/1     Running            0          7h56m   10.0.129.242   ip-10-0-129-242.ap-northeast-1.compute.internal   <none>           <none>
sdn-dcrlr              1/1     Running            1          7h56m   10.0.144.9     ip-10-0-144-9.ap-northeast-1.compute.internal     <none>           <none>
sdn-f2l99              1/1     Running            1          7h56m   10.0.166.240   ip-10-0-166-240.ap-northeast-1.compute.internal   <none>           <none>
sdn-lqr4l              1/1     Running            0          7h48m   10.0.141.185   ip-10-0-141-185.ap-northeast-1.compute.internal   <none>           <none>
sdn-xmh69              0/1     CrashLoopBackOff   2          7h48m   10.0.159.102   ip-10-0-159-102.ap-northeast-1.compute.internal   <none>           <none>


Then I checked this issue on the fixed build 4.3.0-0.nightly-2019-10-16-010826 with the same steps as above, and the sdn pod did not hit CrashLoopBackOff.

Verified this bug.

Comment 17 errata-xmlrpc 2020-05-13 21:27:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

