Description of problem:

Most tests that require the oauth server failed because the server does not seem to be running:

Jan 24 18:41:44.926: INFO: OAuth server pod is not ready: Container statuses:
([]v1.ContainerStatus) (len=1 cap=1) {
 (v1.ContainerStatus) &ContainerStatus{Name:oauth-server,State:ContainerState{Waiting:nil,Running:&ContainerStateRunning{StartedAt:2020-01-24 18:41:33 +0000 UTC,},Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:registry.svc.ci.openshift.org/ocp/4.3-2020-01-21-121240@sha256:98ebde80813be5465888692fb5d699eef4438df29dd7f376a60668204755f243,ImageID:registry.svc.ci.openshift.org/ocp/4.3-2020-01-21-121240@sha256:98ebde80813be5465888692fb5d699eef4438df29dd7f376a60668204755f243,ContainerID:cri-o://f13b967b807d28a6ee7d77e044237aee7709deb2e7b2f24535159c8ce5a0e8b9,Started:*true,}
}

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.3/94
From the logs I can see that the oauth-server pod actually came to life and was responding to health checks, but the linked tests fail on i/o timeouts while trying to reach the server via its route. Moving to routing.
Took a quick look through the logs, nothing stands out yet. Highly unlikely this is a release blocker.
Had a chance today to reproduce this. Looks like the LBs created by K8s for LoadBalancer services on Azure have empty backend pools, which means ingress is totally broken in this topology and probably always has been. There were fixes upstream in the service controller and the GCP cloud provider to support this topology, but apparently no work was done for Azure. Need to investigate the Azure cloud provider implementation.
Looks like the Azure cloud provider uses an `excludeMasterFromStandardLB` cloud provider configuration key to decide whether masters should be excluded, and the default is `true`.
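For reference, a minimal sketch of how that default could play out. Only the JSON key and its default come from the comment above; the struct, helper, and config contents are illustrative, not the upstream implementation:

```go
// Hypothetical sketch of parsing the excludeMasterFromStandardLB key and
// defaulting it to true when the key is absent.
package main

import (
	"encoding/json"
	"fmt"
)

type cloudConfig struct {
	// Pointer so an absent key can be told apart from an explicit false.
	ExcludeMasterFromStandardLB *bool `json:"excludeMasterFromStandardLB,omitempty"`
}

// excludeMasters applies the default described above: masters are excluded
// from the standard LB backend pool unless the key is explicitly set to false.
func (c cloudConfig) excludeMasters() bool {
	if c.ExcludeMasterFromStandardLB == nil {
		return true
	}
	return *c.ExcludeMasterFromStandardLB
}

func main() {
	var cfg cloudConfig
	_ = json.Unmarshal([]byte(`{}`), &cfg) // no key present in the cloud config
	fmt.Println(cfg.excludeMasters())      // true: masters excluded by default
}
```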
I think the Azure cloud provider code upstream needs some refactoring to honor LegacyNodeRoleBehavior (specifically, by only honoring node role labels when LegacyNodeRoleBehavior=True). I'm not sure we're going to have time to take that on for the release, but I'll leave the issue in 4.4 for now until I've had a chance to talk over my findings with some others on the team. Since Azure is still tech preview, we won't block the release on a fix for this topology.
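As a rough illustration of that refactor, here is a hypothetical sketch; the gate name comes from the comment above, while the function, its signature, and the sample labels are invented for illustration:

```go
// Hypothetical sketch of honoring node role labels only while the
// LegacyNodeRoleBehavior feature gate is enabled.
package main

import "fmt"

const masterRoleLabel = "node-role.kubernetes.io/master"

// shouldExcludeFromLB returns true only when legacy node-role behavior is on
// and the node carries the master role label; otherwise the node stays in the pool.
func shouldExcludeFromLB(labels map[string]string, legacyNodeRoleBehavior bool) bool {
	if !legacyNodeRoleBehavior {
		return false
	}
	_, isMaster := labels[masterRoleLabel]
	return isMaster
}

func main() {
	master := map[string]string{masterRoleLabel: ""}
	fmt.Println(shouldExcludeFromLB(master, true))  // true: excluded under legacy behavior
	fmt.Println(shouldExcludeFromLB(master, false)) // false: kept, which a compact (master==worker) cluster needs
}
```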
I think I found both the root cause and the solution.

The Azure cloud provider expects to create a load balancer with a backend pool of all eligible nodes and then attach traffic to it. The upstream code thinks it is filtering out master nodes, but it uses the old label (kubernetes.io/role), which we don't set and which no one is supposed to use anyway (all the upstream code for filtering on the correct label was removed; I'll track removing it). The controller reconciles and adds things, so it is non-disruptive to any other rules on the LB. This matters because in Azure a single NIC can only be attached to one load balancer at a time.

Fortunately, Azure load balancers are designed to support multiple inputs and backends, and health checks (and thus pool membership) are determined per input, so you can have both public API server traffic and service load balancer traffic on the same backend pool and the same Azure LB, just with different frontend IPs. Since in the long run we are looking to always use SLB even for the kube-apiserver, we can leverage that behavior on Azure by renaming the LB we create to just "cluster_name", and the cloud controller will automatically keep all nodes in the pool. The current health check on 6443 for the apiserver will filter the list of nodes down to the masters (although someone listening on port 6443 on a node could potentially DoS the frontend by injecting endpoints, those endpoints could not impersonate a master without the TLS cert; this is already a problem with the router on any cloud).

I think this needs a bit of discussion but is reasonable on the surface. A PR will be opened and I'll start the discussion with ARO.
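To make the label mismatch concrete, here is a minimal hypothetical Go sketch (not the actual cloud provider code); only the two label keys come from the analysis above, and the helper and sample node labels are illustrative:

```go
// Hypothetical sketch contrasting the legacy role label checked upstream
// with the node-role label OpenShift actually sets on masters.
package main

import "fmt"

const (
	legacyRoleLabel = "kubernetes.io/role"             // old label checked upstream; not set by OpenShift
	masterRoleLabel = "node-role.kubernetes.io/master" // label actually present on OpenShift control-plane nodes
)

// legacyCheckSeesMaster mimics a filter keyed on the legacy label only.
func legacyCheckSeesMaster(labels map[string]string) bool {
	return labels[legacyRoleLabel] == "master"
}

func main() {
	// A compact-cluster node as OpenShift labels it (empty values by convention).
	node := map[string]string{masterRoleLabel: "", "node-role.kubernetes.io/worker": ""}

	// The legacy check never fires on OpenShift nodes, so the intended
	// master filtering is effectively a no-op here.
	fmt.Println(legacyCheckSeesMaster(node)) // false
}
```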
*** Bug 1820800 has been marked as a duplicate of this bug. ***
Checked the latest CI and still found many failed OAuth server tests:

fail [github.com/openshift/origin/test/extended/oauth/expiration.go:30]: Unexpected error:
    <*errors.errorString | 0xc0001d4970>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

See also: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.5/77

Checked the CI of compact clusters on AWS, Azure and GCP below, and only saw this kind of error on Azure:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-azure-compact-4.5
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-aws-compact-4.5
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-gcp-compact-4.5
*** Bug 1818023 has been marked as a duplicate of this bug. ***
Checked the latest CI and the oauth server tests passed:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-compact-4.5/80

Also confirmed that ingress works in the three-node cluster:

$ oc get node
NAME                          STATUS   ROLES           AGE    VERSION
hongli-pl601-hvmwm-master-0   Ready    master,worker   122m   v1.18.0-rc.1
hongli-pl601-hvmwm-master-1   Ready    master,worker   122m   v1.18.0-rc.1
hongli-pl601-hvmwm-master-2   Ready    master,worker   122m   v1.18.0-rc.1

$ oc get co ingress authentication console
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress          4.5.0-0.nightly-2020-05-08-222601   True        False         False      67m
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.5.0-0.nightly-2020-05-08-222601   True        False         False      97m
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
console          4.5.0-0.nightly-2020-05-08-222601   True        False         False      83m
*** Bug 1812662 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
*** Bug 1830293 has been marked as a duplicate of this bug. ***