Description of problem:
When configuring a bare-metal cluster, any HTTP request to the oauth-openshift route fails with EOF.

Version-Release number of selected component (if applicable):
4.3.9

How reproducible:
2/2

Steps to Reproduce:
1. Follow the bare-metal installation instructions.

Actual results:
The installation does not finish; the authentication operator is degraded with 'RouteHealthDegraded: failed to GET route: EOF'.

Expected results:
Successful installation.

Additional info:
The pods are running, the route exists and has the correct canonical host, endpoints exist in the openshift-authentication namespace, and the oauth-server pods respond properly on the service-network plane. Additionally, there are no coredumps on the cluster, so it does not look like a problem with iptables segfaulting.

```
oc exec -ti sdn-qpkhc -n openshift-sdn -- bash
[root@master-0 /]# curl -k https://oauth-openshift.apps.test.myocp4.com
curl: (35) Encountered end of file
[root@master-0 /]# curl -k https://172.30.19.162
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": { },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": { },
  "code": 403
}
```

```
[kni@e16-h12-b01-fc640 ~]$ oc get route oauth-openshift
NAME              HOST/PORT                               PATH   SERVICES          PORT   TERMINATION            WILDCARD
oauth-openshift   oauth-openshift.apps.test.myocp4.com           oauth-openshift   6443   passthrough/Redirect   None

[kni@e16-h12-b01-fc640 ~]$ oc get svc oauth-openshift
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
oauth-openshift   ClusterIP   172.30.19.162   <none>        443/TCP   59m

[kni@e16-h12-b01-fc640 ~]$ oc get ep oauth-openshift
NAME              ENDPOINTS                           AGE
oauth-openshift   10.129.0.35:6443,10.130.0.49:6443   59m
```

Opened on behalf of Marko Karg.
The must-gather is too large to be attached; it can be found at https://drive.google.com/open?id=19fWJtBvL4eNAVbIhmxBrW2p39azVb7TY
One more thing - I have a testbed up and running in case you need access to the cluster. Ping me for the credentials please.
https://docs.openshift.com/container-platform/4.3/installing/installing_bare_metal/installing-bare-metal.html

Looking at the must-gather, there's no evidence of any ingress bug. The ingress operator reports available, and the router pods are ready.

This topology is using host-networked ingress. This means that to make ingress fully functional, the external load balancer and DNS which complete the ingress implementation are a user-managed black box outside the cluster. That's all in addition to the requirements for the VPC itself, another potential source of user mistakes which the system has no ability to analyze.

I would go back over your external load balancer, DNS implementation, and VPC setup. I don't see any details about whether the VPC setup is aligned with the docs, or any details about how the external load balancer or DNS is implemented, so I can't speculate about how they might be misconfigured. If you can provide those details here, maybe something will stand out, but so far I don't have enough info to accept this as a bug.
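For what it's worth, the kind of external check being suggested here might look roughly like the following, assuming the wildcard *.apps record is supposed to resolve to the external load balancer / ingress VIP (the hostname is taken from the outputs above; the address is a placeholder, not a value from this cluster):

```
# Does the wildcard apps record resolve, and to the address you expect?
dig +short oauth-openshift.apps.test.myocp4.com

# Is whatever answers that address actually reachable on 443?
# (<lb-or-vip-address> is a placeholder for the resolved address.)
nc -vz <lb-or-vip-address> 443
```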
This is an IPI bare-metal installation; my understanding is that it doesn't need an external load balancer, correct? When you say VPC, what do you mean? We run this on CoreOS hosts in a lab in RDU, so if you mean VPC as in Amazon's VPC, I'm afraid I don't follow. Thanks for checking!
The description didn't mention IPI... it says:

> Steps to Reproduce:
> 1. follow bare-metal installation instructions

which I assumed meant the supported docs for 4.3 bare-metal installations: https://docs.openshift.com/container-platform/4.3/installing/installing_bare_metal/installing-bare-metal.html

I wasn't aware bare-metal IPI is even supported yet, and I'm not sure what manages VPC, DNS, or load balancing in that topology (certainly not the ingress operator)... Can you please clarify what version and installation method we're talking about? How exactly was this cluster created?
I've heard of bare-metal installer PoCs but know nothing about them; bare-metal IPI being supported is news to me too (hence the BZ description). Perhaps Marko can shed some more light on what this is.
We are using Roger Lopez's ansible work in combination with some plays specific to our setup (https://github.com/dustinblack/baremetal-deploy).

DNS is managed through ansible: it takes the masters and workers from the inventory and sets up dnsmasq on the deploy host accordingly. My understanding is that load balancing is done by the haproxy pods in the cluster, using a VIP which we set in dnsmasq.

The cluster is created with the ansible scripts; they start a VM to bootstrap the nodes, using badfish and IPMI to power them on/off etc. The detailed description is here: https://github.com/dustinblack/baremetal-deploy/tree/master/ansible-ipi-install

Hopefully that clarifies things a bit. Let me know if you need more information.
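For readers unfamiliar with that repo, the dnsmasq records such a playbook generates might look roughly like this. This is an illustrative sketch only: the domain matches this report, but the addresses and exact directives are assumptions, not taken from the actual templates or inventory.

```
# /etc/dnsmasq.d/ocp4.conf (illustrative sketch, not the real generated file)
address=/apps.test.myocp4.com/192.168.222.4        # wildcard *.apps -> ingress VIP (example address)
host-record=api.test.myocp4.com,192.168.222.3      # API VIP (example address)
host-record=master-0.test.myocp4.com,192.168.222.10
host-record=master-1.test.myocp4.com,192.168.222.11
host-record=master-2.test.myocp4.com,192.168.222.12
```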
Is it possible this could be something like https://bugzilla.redhat.com/show_bug.cgi?id=1711127, based on some hardware driver variation between 4.3.5 and 4.3.9? (As I was told, 4.3.5 did work on this same setup.)
Using the ansible IPI bare-metal playbooks on my setup of VMs, I seem to get a similar deployment failure if I include workers, but a successful install if there are no workers. In both cases I have caching enabled. This is with 4.3.9. To understand your deployment options, can you scrub anything private and share your inventory file?
Created attachment 1678381 [details] hosts file for the ansible playbook
Steve - hosts file with private bits redacted has just been uploaded.
Just confirmed - the problem also shows on 4.3.10
Also checked with a no-worker deployment; the issue is the same:

```
[kni@e16-h12-b01-fc640 ~]$ oc get nodes
NAME       STATUS   ROLES           AGE   VERSION
master-0   Ready    master,worker   36m   v1.16.2
master-1   Ready    master,worker   36m   v1.16.2
master-2   Ready    master,worker   36m   v1.16.2

[kni@e16-h12-b01-fc640 ~]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       Unknown     Unknown       True       13m
cloud-credential                           4.3.10    True        False         False      36m
cluster-autoscaler                         4.3.10    True        False         False      12m
console                                    4.3.10    False       True          False      12m
dns                                        4.3.10    True        False         False      32m
image-registry                             4.3.10    True        False         False      13m
ingress                                    4.3.10    True        False         False      12m
insights                                   4.3.10    True        False         False      13m
kube-apiserver                             4.3.10    True        False         False      32m
kube-controller-manager                    4.3.10    True        False         False      15m
kube-scheduler                             4.3.10    True        False         False      15m
machine-api                                4.3.10    True        False         False      32m
machine-config                             4.3.10    True        False         False      32m
marketplace                                4.3.10    True        False         False      12m
monitoring                                 4.3.10    True        False         False      6m25s
network                                    4.3.10    True        False         False      32m
node-tuning                                4.3.10    True        False         False      13m
openshift-apiserver                        4.3.10    True        False         False      13m
openshift-controller-manager               4.3.10    True        False         False      29m
openshift-samples                          4.3.10    True        False         False      12m
operator-lifecycle-manager                 4.3.10    True        False         False      13m
operator-lifecycle-manager-catalog         4.3.10    True        False         False      13m
operator-lifecycle-manager-packageserver   4.3.10    True        False         False      13m
service-ca                                 4.3.10    True        False         False      33m
service-catalog-apiserver                  4.3.10    True        False         False      13m
service-catalog-controller-manager         4.3.10    True        False         False      13m
storage                                    4.3.10    True        False         False      13m
```
I deployed 4.3.12 with a disconnected registry and that worked. It's a workaround for now, but I still think that we need to have it working with a remote registry as well.
Similar problem experienced today with a 4.4 nightly, except this time instead of the EOF error I just get 'connection refused'; the symptom is otherwise the same.

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          173m    Unable to apply 4.4.0-0.nightly-2020-04-23-014745: some cluster operators have not yet rolled out

$ oc get co | egrep 'NAME|auth|cons'
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       Unknown     Unknown       True       151m
console          4.4.0-0.nightly-2020-04-23-014745   False       True          False      144m

$ oc describe co authentication | grep -n3 RouteHealth
14-Status:
15-  Conditions:
16-    Last Transition Time:  2020-04-24T21:30:58Z
17:    Message:               RouteHealthDegraded: failed to GET route: dial tcp 192.168.222.4:443: connect: connection refused
18:    Reason:                RouteHealth_FailedGet
19-    Status:                True
20-    Type:                  Degraded
21-    Last Transition Time:  2020-04-24T21:22:47Z

$ oc logs -n openshift-authentication oauth-openshift-567c864d66-fckfn | tail -5
I0424 21:29:36.788091  1 tlsconfig.go:179] loaded serving cert ["serving-cert::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.crt::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.key"]: "oauth-openshift.openshift-authentication.svc" [serving] validServingFor=[oauth-openshift.openshift-authentication.svc,oauth-openshift.openshift-authentication.svc.cluster.local] issuer="openshift-service-serving-signer@1587763371" (2020-04-24 21:28:51 +0000 UTC to 2022-04-24 21:28:52 +0000 UTC (now=2020-04-24 21:29:36.788083926 +0000 UTC))
I0424 21:29:36.788264  1 named_certificates.go:52] loaded SNI cert [1/"sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.test.myocp4.com::/var/config/system/secrets/v4-0-config-system-router-certs/apps.test.myocp4.com"]: "*.apps.test.myocp4.com" [serving] validServingFor=[*.apps.test.myocp4.com] issuer="ingress-operator@1587763725" (2020-04-24 21:28:45 +0000 UTC to 2022-04-24 21:28:46 +0000 UTC (now=2020-04-24 21:29:36.788256366 +0000 UTC))
I0424 21:29:36.788423  1 named_certificates.go:52] loaded SNI cert [0/"self-signed loopback"]: "apiserver-loopback-client@1587763776" [serving] validServingFor=[apiserver-loopback-client] issuer="apiserver-loopback-client-ca@1587763776" (2020-04-24 20:29:36 +0000 UTC to 2021-04-24 20:29:36 +0000 UTC (now=2020-04-24 21:29:36.788416404 +0000 UTC))
I0424 21:35:05.927489  1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0424 21:35:05.927567  1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF

$ oc logs -n openshift-authentication-operator authentication-operator-5679bf68ff-7sx9v | tail -3
E0424 23:54:49.416222  1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp 192.168.222.4:443: connect: connection refused
E0424 23:55:19.416255  1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp 192.168.222.4:443: connect: connection refused
E0424 23:55:49.416178  1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp 192.168.222.4:443: connect: connection refused

$ oc get pods -n openshift-console
NAME                         READY   STATUS             RESTARTS   AGE
console-5458db6d6d-pv9z2     0/1     CrashLoopBackOff   27         148m
console-568f78cc5b-djtl9     0/1     CrashLoopBackOff   27         148m
console-6df8f6d9c6-65fpq     0/1     Running            27         142m
downloads-7cf67c7b5d-9fqmj   1/1     Running            0          149m
downloads-7cf67c7b5d-t2rqs   1/1     Running            0          149m

$ oc logs -n openshift-console console-5458db6d6d-pv9z2 | tail -2
2020-04-24T23:53:40Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.test.myocp4.com/oauth/token failed: Head https://oauth-openshift.apps.test.myocp4.com: dial tcp 192.168.222.4:443: connect: connection refused
2020-04-24T23:53:50Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.test.myocp4.com/oauth/token failed: Head https://oauth-openshift.apps.test.myocp4.com: dial tcp 192.168.222.4:443: connect: connection refused
```
It looks like we understand the authentication operator problem now. It stems from either an IP address conflict for the ingressVIP or a conflicting VRID for VRRP (our particular case in the scale lab).

The VRID is auto-generated from the cluster name. Therefore, when we re-deploy in the lab with the same cluster name while there are still nodes hanging around that are not part of the current deployment but were part of a previous one, there will be a VRID conflict and you will see VRRP errors in the keepalived logs. In my test lab, as soon as I shut down the one node that was not part of the cluster, the auth operator came up.

It sounds like it is not viable to make an installer change to do something like randomize the VRID and ensure it is effective, partially because the VRID is limited to 0-255. We are currently adjusting our automation to include some randomization in our own cluster naming scheme. This might not solve the problem 100%, but it should reduce the likelihood of hitting it.
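To make the collision risk concrete, here is a minimal sketch assuming the ID is some hash of the cluster name reduced into the 0-255 VRRP range; the actual hash used by runtimecfg may differ, this only illustrates why reusing a cluster name reuses the ID:

```
# Illustrative only: two deployments that reuse the same cluster name
# land on the same 0-255 ID, so a stale keepalived from the old cluster conflicts.
derive_vrid() { printf '%s' "$1" | cksum | awk '{ print $1 % 256 }'; }

derive_vrid "test"           # old, half-torn-down cluster
derive_vrid "test"           # new deployment -> same ID -> VRRP conflict
derive_vrid "test-$RANDOM"   # randomized name -> (probably) a different ID
```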
Given that the cause appears to be a conflicting VRID, I'm re-assigning this Bugzilla report to the Installer component. Installer team, would it be feasible to detect VRID conflicts at installation time?
> Installer team, would it be feasible to detect VRID conflicts at installation time?

I doubt it - don't you have to be on the same L2 network to see the conflict? We don't guarantee that the installer sits on the same L2 network as the cluster, only that it can reach the API. Toni would know.
The most auto-detection that comes to mind is something we could do at bootstrap:

1. Snoop VRRP traffic for a minute and rule out all the VRRP IDs that we see (a manual sketch of this step follows below).
2. Pick three that were not seen.
3. Configure the local keepalived with them.
4. Find a way to propagate the decision (maybe via the API) so the other nodes can find it and use it.
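Step 1 can already be done by hand today; a hedged example, assuming you can run tcpdump on the node interface that carries the VIPs (the interface name is just an example):

```
# VRRP is IP protocol 112; each advertisement tcpdump prints shows the
# virtual router id in use, e.g. "VRRPv2, Advertisement, vrid 147, prio 70, ..."
tcpdump -nni ens3 'ip proto 112'
```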
We addressed it with documentation about how to see which virtual router IDs your cluster will end up with: https://github.com/openshift/installer/pull/3463
Verified on Client Version: 4.6.0-0.nightly-2020-07-07-233934

```
[root@titan54 ~]# podman run quay.io/openshift/origin-baremetal-runtimecfg:4.6 vr-ids cnf10
APIVirtualRouterID: 147
DNSVirtualRouterID: 158
IngressVirtualRouterID: 2

[root@titan54 ~]# podman run quay.io/openshift/origin-baremetal-runtimecfg:4.6 vr-ids cnf11
APIVirtualRouterID: 228
DNSVirtualRouterID: 239
IngressVirtualRouterID: 147
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196