Seen in a 4.2.0-rc.0 -> 4.2.0-rc.3 upgrade test on AWS [1,2]:

  lastState:
    terminated:
      containerID: cri-o://2a94ca623772ba2a7362e1b6e4044b921177be9e1bb46223baf480be997b5669
      exitCode: 2
      finishedAt: 2019-10-09T15:41:12Z
      message: |
        ...eaders)
        2019/10/9 15:39:17 auth: error contacting auth provider (retrying in 10s): Get https://kubernetes.default.svc/.well-known/oauth-authorization-server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        [the same retry message repeats every 15 seconds through 15:41:02]
      reason: Error
      startedAt: 2019-10-09T15:38:12Z

The symptoms are a lot like bug 1745807, but there was no proxy involved in this test. Assigning to Routing for further triage based on Michal's initial guess.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2/11/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2130377ba7ab9dbed8350c52b098dae1575a7dbafe279f8c013e6455d2da6a93/namespaces/openshift-console/pods/console-7d756bff-5s8q5/console-7d756bff-5s8q5.yaml
We didn't see those error messages from the console pod during our upgrade testing.
The reported errors are apiserver connectivity issues. Looking at the test logs, this is the reason the test reports for the failure:

  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:139]: during upgrade
  Unexpected error:
      <*errors.errorString | 0xc0026bbfe0>: {
          s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator console has not yet reported success",
      }
      Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator console has not yet reported success
  occurred

  failed: (1h21m57s) 2019-10-09T15:41:21 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"

The test then shows that the console operator is failing:

  Oct 9 15:38:21.538: INFO: Cluster version:
  {
    "metadata": {
      "name": "version",
      "selfLink": "/apis/config.openshift.io/v1/clusterversions/version",
      "uid": "f69fe9ad-ea9d-11e9-b4a1-12b8e39af2c4",
      "resourceVersion": "30034",
      "generation": 2,
      "creationTimestamp": "2019-10-09T14:06:17Z"
    },
    "spec": {
      "clusterID": "fa257450-2899-4dd6-9469-5613dd68c65a",
      "desiredUpdate": {
        "version": "",
        "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
        "force": true
      },
      "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph",
      "channel": "stable-4.2"
    },
    "status": {
      "desired": {
        "version": "4.2.0-rc.3",
        "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
        "force": true
      },
      "history": [
        {
          "state": "Partial",
          "startedTime": "2019-10-09T14:23:16Z",
          "completionTime": null,
          "version": "4.2.0-rc.3",
          "image": "registry.svc.ci.openshift.org/ocp/release@sha256:ab2241a765cb7715b7f5b5853b7b0bf8124e71459c5a80b74faf43ff51dcec49",
          "verified": true
        },
        {
          "state": "Completed",
          "startedTime": "2019-10-09T14:06:33Z",
          "completionTime": "2019-10-09T14:23:16Z",
          "version": "4.2.0-rc.0",
          "image": "registry.svc.ci.openshift.org/ocp/release@sha256:7e063c6b4a61377755284bf1a37bef8a25e5c1663a63fe2ba2e005f745f4e8c3",
          "verified": false
        }
      ],
      "observedGeneration": 2,
      "versionHash": "pbTC0u6-z1Q=",
      "conditions": [
        { "type": "Available", "status": "True", "lastTransitionTime": "2019-10-09T14:19:16Z", "message": "Done applying 4.2.0-rc.0" },
        { "type": "Failing", "status": "True", "lastTransitionTime": "2019-10-09T14:42:07Z", "reason": "ClusterOperatorNotAvailable", "message": "Cluster operator console has not yet reported success" },
        { "type": "Progressing", "status": "True", "lastTransitionTime": "2019-10-09T14:23:16Z", "reason": "ClusterOperatorNotAvailable", "message": "Unable to apply 4.2.0-rc.3: the cluster operator console has not yet successfully rolled out" },
        { "type": "RetrievedUpdates", "status": "False", "lastTransitionTime": "2019-10-09T14:06:33Z", "reason": "RemoteFailed", "message": "Unable to retrieve available updates: currently installed version 4.2.0-rc.3 not found in the \"stable-4.2\" channel" }
      ],
      "availableUpdates": null
    }
  }

The test then reports the status of all operators, showing that dns and auth are also reporting failures:

  Oct 9 15:38:21.641: INFO: Cluster operators:
  NAME  A F P  VERSION  MESSAGE
  authentication  T  <new>  Progressing: not all deployment replicas are ready
  cloud-credential  <new>  4 of 4 credentials requests provisioned and reconciled.
  cluster-autoscaler  <new>
  console  F  <new>
  dns  T T  <new>  Not all DNS DaemonSets available.
  image-registry  <new>  The registry is ready
  ingress  <new>  desired and current number of IngressControllers are equal
  insights  <new>  Monitoring the cluster
  kube-apiserver  <new>  Progressing: 3 nodes are at revision 7
  kube-controller-manager  <new>  Progressing: 3 nodes are at revision 6
  kube-scheduler  <new>  Progressing: 3 nodes are at revision 6
  machine-api  <new>
  machine-config  <old>  Cluster version is 4.2.0-rc.0
  marketplace  <new>  Successfully progressed to release version: 4.2.0-rc.3
  monitoring  <new>
  network  <new>
  node-tuning  <new>  Cluster version is "4.2.0-rc.3"
  openshift-apiserver  <new>
  openshift-controller-manager  <new>
  openshift-samples  <new>  Samples installation successful at 4.2.0-rc.3
  operator-lifecycle-manager  <new>  Deployed 0.11.0
  operator-lifecycle-manager-catalog  <new>  Deployed 0.11.0
  operator-lifecycle-manager-packageserver  <new>  Deployed version 0.11.0
  service-ca  <new>  Progressing: All service-ca-operator deployments updated
  service-catalog-apiserver  <new>
  service-catalog-controller-manager  <new>
  storage  <new>

If you look at the cluster state, CoreDNS is fully rolled out, but dns-operator can't report status back to the apiserver because of a nearly hour-long apiserver connectivity outage:

  E1009 14:50:24.555770  1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.130.0.74:43434->172.30.0.1:443: read: no route to host
  E1009 15:41:45.789223  1 reflector.go:126] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.DaemonSet: Get https://172.30.0.1:443/apis/apps/v1/namespaces/openshift-dns/daemonsets?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host

So, I'm not seeing any evidence of an issue with DNS or ingress here. I do see evidence of an apiserver connectivity issue, which could be networking or the apiserver itself. I do notice that dns-operator is hot-looping reconciliation during apiserver connectivity outages, which I'll open a new non-blocker bug for.
If you agree with my assessment, can you help me choose a new component for this bug?
I opened https://bugzilla.redhat.com/show_bug.cgi?id=1760473 to track the dns-operator hot loop bug.
> If you agree with my assessment, can you help me choose a new component for this bug?

Flipping a coin and going with SDN/Networking.
Quick notes for those debugging:

- pod with connection issues: dns-operator-5db7cc9bd6-6ksfp
- node: ip-10-0-159-27.ec2.internal
- sdn pod on node: sdn-sw5d9
Checking the sdn logs, I see that the endpoints are correct. So, this is not any sort of informer issue. I wonder if there's a chance we somehow tore down the wrong container's eth0.
Looking at the logs, the dns-operator seems to have trouble reaching the apiserver starting at 14:50:24. There's nothing immediately interesting in the node journal at that time. So there goes that theory. (Also, I meant to say that the iptables rules are correct).
There are a few pods failing with "no route to host" when trying to connect to the apiserver, but not all. All of the failing pods are on this node, yet there are also pods running on the same host with no problems. Non-exhaustive list:

Bad pods:
- oauth-openshift-c8b557487-x7rfg
- dns-operator-5db7cc9bd6-6ksfp
- openshift-service-catalog-apiserver-operator-568fdc7d75-ddjbv

Good pods:
- openshift-apiserver-operator-dc775fbf4-mwqlr
- apiserver_apiserver-nt2nn
- cluster-storage-operator-f4584dcb9-nz9f9
Interesting information: the first pod to report "no route to host" does so at about the time the SDN restarts after OVS restarts.

OVS starts after the upgrade at 14:34:22. openshift-sdn then detects this and restarts (but way too late?):

  I1009 14:35:18.174477  15632 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0 is not a bridge or a socket
  F1009 14:35:18.174515  15632 healthcheck.go:82] SDN healthcheck detected OVS server change, restarting: OVS reinitialization required: plugin is not setup

The first container gets a "no route to host" when talking to the apiserver at 14:34:54. Is it possible that containers that talk at the wrong time are caching the no-route-to-host?
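For readers unfamiliar with the healthcheck in that fatal log line, here is a minimal, hypothetical sketch of the pattern (my own illustration, not the actual openshift-sdn code): probe OVS periodically, and bail out with a fatal error as soon as the probe reports the plugin's setup is gone, so that kubelet restarts the SDN pod and it reinitializes.

```go
package main

import (
	"fmt"
	"time"
)

// runHealthcheck polls probe() every interval, up to maxChecks times.
// probe stands in for querying OVS (e.g. checking br0 with ovs-ofctl);
// a false return means the OVS server changed out from under us.
func runHealthcheck(probe func() bool, interval time.Duration, maxChecks int) error {
	for i := 0; i < maxChecks; i++ {
		if !probe() {
			// The real daemon logs this fatally (the F1009 line above)
			// so kubelet restarts the pod.
			return fmt.Errorf("SDN healthcheck detected OVS server change, restarting: plugin is not setup")
		}
		time.Sleep(interval)
	}
	return nil
}

func main() {
	checks := 0
	// Simulated probe: OVS looks healthy for two checks, then "restarts".
	probe := func() bool { checks++; return checks <= 2 }
	if err := runHealthcheck(probe, time.Millisecond, 10); err != nil {
		fmt.Println(err)
	}
}
```

The point of the fatal exit (rather than retrying in-process) is that a full pod restart re-runs all of the SDN setup, which is the only safe response once the OVS daemon's state has been lost.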
Correction: ignore the sdn logs; they're from the wrong node. We don't have any old sdn logs. The SDN process started at 14:34:37.
This seems to be the first instance of this bug that we can find, and none of the other CI runs with the same error message exhibited the same underlying problem. However, if this hits a cluster, the workaround is believed to be to restart the ovs pod on the broken node (or, worst case, to reboot the node). I think we can fix this in 4.2.z. (I'm setting this bug to 4.3.0 and cloning it to track for 4.2.z.)
OK. tl;dr: we're killing an SDN pod halfway through SDN startup, and the next SDN pod doesn't recover correctly from the half-initialized state. I'm sure we've at least discussed this possibility before, but we don't seem to be doing anything to prevent it?

In particular, I think the problem here is that all of the "create the bridge and set up routes and default OVS flows" steps happened, but sdn was killed before it finished the "reattach old pods to the bridge" step. We can probably fix this by tweaking the startup initialization vs. the alreadySetUp() check. (E.g., if we waited to create the table 253 version note until the very end of startup, then the new SDN pod wouldn't have decided that the bridge was fully set up. Of course, it would then have had to kill the not-yet-attached pods rather than reattaching them, since the state in the ovsdb would have been lost by this point. But OCP is supposed to be able to recover from that anyway...)

Detailed timeline, mostly from kubelet logs in masters-journal:

- ip-10-0-159-27 comes back up after RHCOS update at 14:04:00

- sdn-rzhwc and ovs-brgbt (pre-upgrade versions) are started at 14:07:05

- OVS pod upgrade starts at 14:34:00:

    I1009 14:34:00.708770  1179 kubelet.go:1910] SyncLoop (DELETE, "api"): "ovs-brgbt_openshift-sdn(12b4bea1-ea9e-11e9-b4a1-12b8e39af2c4)"

- sdn-rzhwc exits 10 seconds later after timing out waiting for OVS to come back:

    F1009 14:34:10.910955  3778 healthcheck.go:82] SDN healthcheck detected OVS server change, restarting: timed out waiting for the condition

  (We don't have sdn-rzhwc's logs, but that ends up in kubelet's logs due to "terminationMessagePolicy: FallbackToLogsOnError".)

- ovs-fv5j7 (post-upgrade version) is deployed at 14:34:11, after ovs-brgbt cleanup is complete (it is not clear to me why it takes so long for ovs-brgbt to be cleaned up):

    I1009 14:34:11.726383  1179 kubelet.go:1904] SyncLoop (REMOVE, "api"): "ovs-brgbt_openshift-sdn(12b4bea1-ea9e-11e9-b4a1-12b8e39af2c4)"
    ...
    I1009 14:34:11.746904  1179 kubelet.go:1894] SyncLoop (ADD, "api"): "ovs-fv5j7_openshift-sdn(dc3d5456-eaa1-11e9-9590-0a755baa03e8)"

- meanwhile, kubelet restarts sdn-rzhwc:

    I1009 14:34:11.834980  1179 kuberuntime_manager.go:554] Container {Name:sdn, ...} is dead, but RestartPolicy says that we should restart it.
    ...
    I1009 14:34:12.079259  1179 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-rzhwc", UID:"132fa888-ea9e-11e9-b4a1-12b8e39af2c4", APIVersion:"v1", ResourceVersion:"1587", FieldPath:"spec.containers{sdn}"}): type: 'Normal' reason: 'Started' Started container sdn

- and almost simultaneously, the DaemonSet controller starts the SDN pod upgrade:

    I1009 14:34:12.718421  1179 kubelet.go:1910] SyncLoop (DELETE, "api"): "sdn-rzhwc_openshift-sdn(132fa888-ea9e-11e9-b4a1-12b8e39af2c4)"

- ovs-fv5j7 starts up and its logs show that "someone" (presumably the newly-restarted sdn-rzhwc) is modifying flows at 14:34:22:

    2019-10-09T14:34:22.677Z|00094|connmgr|INFO|br0<->unix#3: 47 flow_mods in the last 0 s (47 adds)
    ...
    2019-10-09T14:34:23.756Z|00121|connmgr|INFO|br0<->unix#52: 5 flow_mods in the last 0 s (5 adds)

- sdn-rzhwc is fully killed by 14:34:24:

    I1009 14:34:24.846071  1179 kubelet.go:1939] SyncLoop (PLEG): "sdn-rzhwc_openshift-sdn(132fa888-ea9e-11e9-b4a1-12b8e39af2c4)", event: &pleg.PodLifecycleEvent{ID:"132fa888-ea9e-11e9-b4a1-12b8e39af2c4", Type:"ContainerDied", Data:"01b5d9002780fcab142e1cd7f8a2a42caf3b56c653f3697d416aaf3e1cf37869"}

  (We don't have any of its logs, because it was killed rather than exiting with an error, so kubelet didn't grab them.)

- sdn-sw5d9 (post-upgrade version) is deployed at 14:34:31:

    I1009 14:34:31.740125  1179 kubelet.go:1894] SyncLoop (ADD, "api"): "sdn-sw5d9_openshift-sdn(e8295c08-eaa1-11e9-9590-0a755baa03e8)"

- sdn-sw5d9 logs begin at 14:34:37, and it decides that br0 is fine and it doesn't need to recreate the network:

    I1009 14:34:37.206227  15260 node.go:378] Starting openshift-sdn network plugin
    I1009 14:34:37.502158  15260 vnids.go:148] Associate netid 0 to namespace "default" with mcEnabled false

  (If it had decided to recreate the network, it would have logged "full SDN setup required" in between those messages.)
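The proposed "write the version note last" fix can be sketched as a tiny model (hypothetical, not the actual openshift-sdn code; bridgeState, setup, and alreadySetUp are stand-ins for the real bridge, startup sequence, and table 253 check):

```go
package main

import "fmt"

// bridgeState models the observable state of br0 across SDN pod restarts.
type bridgeState struct {
	defaultFlows   bool // bridge created, routes and default OVS flows installed
	podsReattached bool // "reattach old pods to the bridge" step
	versionNote    bool // stand-in for the table 253 version note, written last
}

// setup runs startup; killAfterFlows simulates the pod being killed between
// installing the default flows and reattaching pods (the failure in this bug).
func setup(b *bridgeState, killAfterFlows bool) {
	b.defaultFlows = true
	if killAfterFlows {
		return // pod killed mid-startup: no version note is ever written
	}
	b.podsReattached = true
	b.versionNote = true // only written once startup fully completed
}

// alreadySetUp is what the next SDN pod checks on startup: with the note
// written last, a half-initialized bridge never looks fully set up.
func alreadySetUp(b *bridgeState) bool { return b.versionNote }

func main() {
	b := &bridgeState{}
	setup(b, true) // first pod killed halfway through startup
	if alreadySetUp(b) {
		fmt.Println("successor skips setup (the bug)")
	} else {
		fmt.Println("full SDN setup required") // successor redoes setup
	}
}
```

Under this ordering the successor pod would still have to kill the not-yet-attached containers rather than reattach them, as noted above, but it would never trust a half-initialized bridge.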
(hadn't meant to change the Target Release, that was stale bugzilla data)
Hmm, I checked this issue with the old build 4.3.0-0.nightly-2019-10-15-180816, killing the sdn container on one worker with this command:

  while true; do crictl rm $(crictl ps | grep sdn | awk '{print $1}'); sleep 1; done

Then I found the sdn pod crashing:

  oc get pod -n openshift-sdn -o wide
  NAME                   READY   STATUS             RESTARTS   AGE     IP             NODE                                              NOMINATED NODE   READINESS GATES
  ovs-86qvx              1/1     Running            0          7h56m   10.0.144.9     ip-10-0-144-9.ap-northeast-1.compute.internal     <none>           <none>
  ovs-8d6hf              1/1     Running            0          7h17m   10.0.159.102   ip-10-0-159-102.ap-northeast-1.compute.internal   <none>           <none>
  ovs-dxvdn              1/1     Running            0          7h56m   10.0.166.240   ip-10-0-166-240.ap-northeast-1.compute.internal   <none>           <none>
  ovs-wwclz              1/1     Running            0          7h48m   10.0.141.185   ip-10-0-141-185.ap-northeast-1.compute.internal   <none>           <none>
  ovs-zfdtm              1/1     Running            0          7h56m   10.0.129.242   ip-10-0-129-242.ap-northeast-1.compute.internal   <none>           <none>
  sdn-76cjx              1/1     Running            1          7h56m   10.0.129.242   ip-10-0-129-242.ap-northeast-1.compute.internal   <none>           <none>
  sdn-controller-h4wxn   1/1     Running            0          7h56m   10.0.166.240   ip-10-0-166-240.ap-northeast-1.compute.internal   <none>           <none>
  sdn-controller-ls9pf   1/1     Running            0          7h56m   10.0.144.9     ip-10-0-144-9.ap-northeast-1.compute.internal     <none>           <none>
  sdn-controller-m57rv   1/1     Running            0          7h56m   10.0.129.242   ip-10-0-129-242.ap-northeast-1.compute.internal   <none>           <none>
  sdn-dcrlr              1/1     Running            1          7h56m   10.0.144.9     ip-10-0-144-9.ap-northeast-1.compute.internal     <none>           <none>
  sdn-f2l99              1/1     Running            1          7h56m   10.0.166.240   ip-10-0-166-240.ap-northeast-1.compute.internal   <none>           <none>
  sdn-lqr4l              1/1     Running            0          7h48m   10.0.141.185   ip-10-0-141-185.ap-northeast-1.compute.internal   <none>           <none>
  sdn-xmh69              0/1     CrashLoopBackOff   2          7h48m   10.0.159.102   ip-10-0-159-102.ap-northeast-1.compute.internal   <none>           <none>

Then I checked this issue on the fixed build 4.3.0-0.nightly-2019-10-16-010826 with the same steps, and the sdn pod did not hit CrashLoopBackOff.

Verified this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062