Bug 1821283 - [Webscale] Routes don't reach endpoints on BM installations
Summary: [Webscale] Routes don't reach endpoints on BM installations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Nemec
QA Contact: Aleksandra Malykhin
URL:
Whiteboard:
Depends On:
Blocks: 1823797 1823798
 
Reported: 2020-04-06 13:45 UTC by Standa Laznicka
Modified: 2020-10-27 15:58 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Conflicts between VRRP virtual router IDs (VRIDs). Consequence: Nodes from outside the cluster participate in keepalived negotiations, which can result in a VIP being hosted on a node that is not part of the cluster. Fix: Documented how to manually check for VRID collisions before installation. Result: Collisions can be found and fixed before installation.
Clone Of:
: 1823797 (view as bug list)
Environment:
Last Closed: 2020-10-27 15:57:43 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
hosts file for the ansible playbook (12.79 KB, text/plain)
2020-04-13 07:18 UTC, Marko Karg


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3463 0 None closed bug 1821667: baremetal IPI: Document Virtual Router IDs 2021-01-25 23:41:19 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:58:12 UTC

Description Standa Laznicka 2020-04-06 13:45:11 UTC
Description of problem:
When configuring a BM cluster, any HTTP request to the oauth-openshift route fails with EOF.

Version-Release number of selected component (if applicable):
4.3.9

How reproducible:
2/2

Steps to Reproduce:
1. follow bare-metal installation instructions

Actual results:
installation does not finish; the authentication operator is degraded with 'RouteHealthDegraded: failed to GET route: EOF'

Expected results:
Successful installation

Additional info:
The pods are running, the route exists and has the correct canonical host, endpoints exist in the openshift-authentication namespace, and the oauth-server pods respond properly on the service network.

Additionally, there are no core dumps on the cluster, so it does not seem to be a problem with iptables segfaulting.



```
oc exec -ti sdn-qpkhc -n openshift-sdn -- bash
[root@master-0 /]# curl -k https://oauth-openshift.apps.test.myocp4.com
curl: (35) Encountered end of file
[root@master-0 /]# curl -k https://172.30.19.162
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
    
  },
  "code": 403
}
```

```
[kni@e16-h12-b01-fc640 ~]$ oc get route oauth-openshift 
NAME              HOST/PORT                              PATH   SERVICES          PORT   TERMINATION            WILDCARD
oauth-openshift   oauth-openshift.apps.test.myocp4.com          oauth-openshift   6443   passthrough/Redirect   None
[kni@e16-h12-b01-fc640 ~]$ oc get svc oauth-openshift 
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
oauth-openshift   ClusterIP   172.30.19.162   <none>        443/TCP   59m
[kni@e16-h12-b01-fc640 ~]$ oc get ep oauth-openshift 
NAME              ENDPOINTS                           AGE
oauth-openshift   10.129.0.35:6443,10.130.0.49:6443   59m
```

Opened on behalf of Marko Karg.

Comment 2 Marko Karg 2020-04-06 13:50:56 UTC
The must-gather is too large to attach; it can be found at
https://drive.google.com/open?id=19fWJtBvL4eNAVbIhmxBrW2p39azVb7TY

Comment 3 Marko Karg 2020-04-06 15:05:00 UTC
One more thing - I have a testbed up and running in case you need access to the cluster. Ping me for the credentials please.

Comment 4 Dan Mace 2020-04-07 13:15:12 UTC
https://docs.openshift.com/container-platform/4.3/installing/installing_bare_metal/installing-bare-metal.html

Looking at the must-gather, there's no evidence of any ingress bug. The ingress operator reports available, and the router pods are ready. This topology is using host-networked ingress, which means that to make ingress fully functional, the external load balancer and DNS that complete the ingress implementation are a user-managed black box outside the cluster. That's all in addition to the requirements for the VPC itself, another potential source of user mistakes which the system has no ability to analyze.

I would go back over your external load balancer, DNS implementation, and VPC setup. I don't see any details about whether the VPC setup is aligned with the docs, or any details about how the external load balancer or DNS is implemented, so I can't speculate about how they might be misconfigured. If you can provide those details here, maybe something will stand out, but so far I don't have enough info to accept this as a bug.
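For reference, a minimal set of checks along the lines described above. This is only a sketch: the hostname is taken from the reporter's output and the availability of dig on the bastion is an assumption, so adjust for your environment.

```
# In-cluster pieces: ingress operator status and router pods
oc get clusteroperator ingress
oc -n openshift-ingress get pods -o wide

# External pieces: does the wildcard apps record resolve, and does a
# request to the resolved address actually reach a router?
dig +short oauth-openshift.apps.test.myocp4.com
curl -kv https://oauth-openshift.apps.test.myocp4.com
```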

Comment 5 Marko Karg 2020-04-07 13:21:50 UTC
This is an IPI bare-metal installation; my understanding is that it doesn't need an external load balancer, correct?

When you say VPC, what do you mean?
We run this on CoreOS hosts in a lab in RDU, so if you are referring to VPC as in Amazon's VPC, I'm afraid I don't follow.

Thanks for checking!

Comment 6 Dan Mace 2020-04-07 13:28:24 UTC
The description didn't mention IPI... it says:

>Steps to Reproduce:
>1. follow bare-metal installation instructions

Which I assumed was the supported docs for 4.3 bare metal installations:

https://docs.openshift.com/container-platform/4.3/installing/installing_bare_metal/installing-bare-metal.html

I wasn't aware bare metal IPI is even supported yet, and I'm not sure what manages VPC, DNS, or load balancing in that topology (certainly not the ingress operator)... Can you please clarify what version and installation method we're talking about? How exactly was this cluster created?

Comment 7 Standa Laznicka 2020-04-07 13:56:49 UTC
I've heard of bare-metal installer PoCs, but know nothing about them; hearing that bare-metal IPI is supported is new to me, too (hence the BZ description). Perhaps Marko can shed some more light on what this is.

Comment 8 Marko Karg 2020-04-07 14:02:16 UTC
We are using Roger Lopez's Ansible work in combination with some plays specific to our setup (https://github.com/dustinblack/baremetal-deploy).

DNS is managed through Ansible: it takes the masters and workers from the inventory and sets up dnsmasq on the deploy host accordingly. My understanding is that load balancing is done by the haproxy pods in the cluster, using a VIP which we set in dnsmasq.

The cluster is created with the Ansible scripts; they start a VM to bootstrap the nodes, using badfish and IPMI to power them on/off, etc. The detailed description is here: https://github.com/dustinblack/baremetal-deploy/tree/master/ansible-ipi-install

Hopefully that clarifies things a bit. Let me know if you need more information.
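For illustration, a minimal sketch of the kind of dnsmasq entries such a playbook would generate on the deploy host. The file path is arbitrary and the 192.168.222.x VIPs are assumptions based on addresses that appear in later comments; this is not the playbook's actual template.

```
# Illustrative only: API and wildcard apps records pointing at the cluster VIPs
# (API VIP value is assumed; the ingress VIP appears in later comments)
cat <<'EOF' > /etc/dnsmasq.d/ocp4-test.conf
address=/api.test.myocp4.com/192.168.222.3
address=/apps.test.myocp4.com/192.168.222.4
EOF
systemctl restart dnsmasq
```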

Comment 10 Karim Boumedhel 2020-04-07 14:26:21 UTC
Is it possible this could be something like https://bugzilla.redhat.com/show_bug.cgi?id=1711127,
based on some hardware driver variation between 4.3.5 and 4.3.9? (I was told 4.3.5 did work on this same setup.)

Comment 14 Steve Reichard 2020-04-10 12:48:51 UTC
Using the Ansible IPI baremetal deploy on my setup of VMs, I seem to get a similar deployment failure if I include workers, but a successful install if there are no workers. In both cases I have caching enabled. This is with 4.3.9.

To understand your deployment options, can you scrub anything private and share your inventory file?

Comment 15 Marko Karg 2020-04-13 07:18:39 UTC
Created attachment 1678381 [details]
hosts file for the ansible playbook

Comment 16 Marko Karg 2020-04-13 07:19:14 UTC
Steve - hosts file with private bits redacted has just been uploaded.

Comment 18 Marko Karg 2020-04-14 12:17:21 UTC
Just confirmed - the problem also shows on 4.3.10

Comment 19 Marko Karg 2020-04-14 14:48:45 UTC
Also checked with a no-worker deployment; the issue is the same:

[kni@e16-h12-b01-fc640 ~]$ oc get nodes
NAME       STATUS   ROLES           AGE   VERSION
master-0   Ready    master,worker   36m   v1.16.2
master-1   Ready    master,worker   36m   v1.16.2
master-2   Ready    master,worker   36m   v1.16.2
[kni@e16-h12-b01-fc640 ~]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                       Unknown     Unknown       True       13m
cloud-credential                           4.3.10    True        False         False      36m
cluster-autoscaler                         4.3.10    True        False         False      12m
console                                    4.3.10    False       True          False      12m
dns                                        4.3.10    True        False         False      32m
image-registry                             4.3.10    True        False         False      13m
ingress                                    4.3.10    True        False         False      12m
insights                                   4.3.10    True        False         False      13m
kube-apiserver                             4.3.10    True        False         False      32m
kube-controller-manager                    4.3.10    True        False         False      15m
kube-scheduler                             4.3.10    True        False         False      15m
machine-api                                4.3.10    True        False         False      32m
machine-config                             4.3.10    True        False         False      32m
marketplace                                4.3.10    True        False         False      12m
monitoring                                 4.3.10    True        False         False      6m25s
network                                    4.3.10    True        False         False      32m
node-tuning                                4.3.10    True        False         False      13m
openshift-apiserver                        4.3.10    True        False         False      13m
openshift-controller-manager               4.3.10    True        False         False      29m
openshift-samples                          4.3.10    True        False         False      12m
operator-lifecycle-manager                 4.3.10    True        False         False      13m
operator-lifecycle-manager-catalog         4.3.10    True        False         False      13m
operator-lifecycle-manager-packageserver   4.3.10    True        False         False      13m
service-ca                                 4.3.10    True        False         False      33m
service-catalog-apiserver                  4.3.10    True        False         False      13m
service-catalog-controller-manager         4.3.10    True        False         False      13m
storage                                    4.3.10    True        False         False      13m

Comment 20 Marko Karg 2020-04-15 16:02:40 UTC
I deployed 4.3.12 with a disconnected registry and that worked. It's a workaround for now, but I still think that we need to have it working with a remote registry as well.

Comment 26 Dustin Black 2020-04-24 23:59:46 UTC
Similar problem experienced today with a 4.4 nightly, except this time instead of the EOF error I just get 'connection refused', but the symptom is otherwise the same.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          173m    Unable to apply 4.4.0-0.nightly-2020-04-23-014745: some cluster operators have not yet rolled out

$ oc get co | egrep 'NAME|auth|cons'
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 Unknown     Unknown       True       151m
console                                    4.4.0-0.nightly-2020-04-23-014745   False       True          False      144m

$ oc describe co authentication | grep -n3 RouteHealth
14-Status:
15-  Conditions:
16-    Last Transition Time:  2020-04-24T21:30:58Z
17:    Message:               RouteHealthDegraded: failed to GET route: dial tcp 192.168.222.4:443: connect: connection refused
18:    Reason:                RouteHealth_FailedGet
19-    Status:                True
20-    Type:                  Degraded
21-    Last Transition Time:  2020-04-24T21:22:47Z

$ oc logs -n openshift-authentication oauth-openshift-567c864d66-fckfn | tail -5
I0424 21:29:36.788091       1 tlsconfig.go:179] loaded serving cert ["serving-cert::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.crt::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.key"]: "oauth-openshift.openshift-authentication.svc" [serving] validServingFor=[oauth-openshift.openshift-authentication.svc,oauth-openshift.openshift-authentication.svc.cluster.local] issuer="openshift-service-serving-signer@1587763371" (2020-04-24 21:28:51 +0000 UTC to 2022-04-24 21:28:52 +0000 UTC (now=2020-04-24 21:29:36.788083926 +0000 UTC))
I0424 21:29:36.788264       1 named_certificates.go:52] loaded SNI cert [1/"sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.test.myocp4.com::/var/config/system/secrets/v4-0-config-system-router-certs/apps.test.myocp4.com"]: "*.apps.test.myocp4.com" [serving] validServingFor=[*.apps.test.myocp4.com] issuer="ingress-operator@1587763725" (2020-04-24 21:28:45 +0000 UTC to 2022-04-24 21:28:46 +0000 UTC (now=2020-04-24 21:29:36.788256366 +0000 UTC))
I0424 21:29:36.788423       1 named_certificates.go:52] loaded SNI cert [0/"self-signed loopback"]: "apiserver-loopback-client@1587763776" [serving] validServingFor=[apiserver-loopback-client] issuer="apiserver-loopback-client-ca@1587763776" (2020-04-24 20:29:36 +0000 UTC to 2021-04-24 20:29:36 +0000 UTC (now=2020-04-24 21:29:36.788416404 +0000 UTC))
I0424 21:35:05.927489       1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0424 21:35:05.927567       1 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF

$ oc logs -n openshift-authentication-operator authentication-operator-5679bf68ff-7sx9v | tail -3
E0424 23:54:49.416222       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp 192.168.222.4:443: connect: connection refused
E0424 23:55:19.416255       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp 192.168.222.4:443: connect: connection refused
E0424 23:55:49.416178       1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: error checking current version: unable to check route health: failed to GET route: dial tcp 192.168.222.4:443: connect: connection refused

$ oc get pods -n openshift-console
NAME                         READY   STATUS             RESTARTS   AGE
console-5458db6d6d-pv9z2     0/1     CrashLoopBackOff   27         148m
console-568f78cc5b-djtl9     0/1     CrashLoopBackOff   27         148m
console-6df8f6d9c6-65fpq     0/1     Running            27         142m
downloads-7cf67c7b5d-9fqmj   1/1     Running            0          149m
downloads-7cf67c7b5d-t2rqs   1/1     Running            0          149m

$ oc logs -n openshift-console console-5458db6d6d-pv9z2 | tail -2
2020-04-24T23:53:40Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.test.myocp4.com/oauth/token failed: Head https://oauth-openshift.apps.test.myocp4.com: dial tcp 192.168.222.4:443: connect: connection refused
2020-04-24T23:53:50Z auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.test.myocp4.com/oauth/token failed: Head https://oauth-openshift.apps.test.myocp4.com: dial tcp 192.168.222.4:443: connect: connection refused

Comment 30 Dustin Black 2020-04-28 19:36:38 UTC
It looks like we understand the authentication operator problem now. This stems from either an IP address conflict for the ingressVIP or a conflicting virtual router ID (VRID) for VRRP (the latter being our particular case in the scale lab).

The VRID is auto-generated from the cluster name. Therefore, when we re-deploy in the lab with the same cluster name while nodes from a previous deployment are still hanging around, there will be a VRID conflict and you will see VRRP errors in the keepalived logs. In my test lab, as soon as I shut down the one node that was not part of the cluster, the auth operator came up.

It sounds like it is not viable to make an installer change to do something like randomize the VRID and guarantee it is conflict-free, partly because the VRID range is limited to 0-255.

We are currently adjusting our automation to include some randomization in our own cluster naming scheme. This might not solve the problem 100%, but it should reduce the likelihood of hitting it.
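For anyone hitting this, a hedged sketch of how one might confirm such a collision. The interface name, node name, and the openshift-kni-infra namespace for the keepalived static pods are assumptions; adjust to your environment.

```
# Watch VRRP advertisements on the machine network; a VRID that is also
# advertised by a host outside this cluster indicates a collision
# (interface name assumed).
tcpdump -nni ens3 'ip proto 112'

# Check the keepalived static pod on a master for VRRP complaints
# (namespace and pod name assumed for a bare-metal IPI cluster).
oc -n openshift-kni-infra logs keepalived-master-0 -c keepalived | grep -i vrrp
```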

Comment 32 Miciah Dashiel Butler Masters 2020-04-30 19:18:50 UTC
Given that the cause appears to be a conflicting VRID, I'm re-assigning this Bugzilla report to the Installer component.  Installer team, would it be feasible to detect VRID conflicts at installation time?

Comment 33 Stephen Benjamin 2020-05-08 22:09:06 UTC
> Installer team, would it be feasible to detect VRID conflicts at installation time?


I doubt it. Wouldn't you have to be on the same L2 network? We don't guarantee that the installer sits on the same L2 network as the cluster, only that it can reach the API.


Toni would know.

Comment 34 Antoni Segura Puimedon 2020-05-12 10:40:19 UTC
The maximum auto-detection that comes to mind that we could do is the following (a rough sketch of the first two steps is shown after the list):

 At bootstrap:
    1. Snoop VRRP traffic for a minute and rule out all the VRRP IDs that we see.
    2. Pick three that were not seen.
    3. Configure the local keepalived with them.
    4. Find a way to propagate the decision (maybe via the API) so the other nodes can find it and set it.
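A rough, illustrative sketch of the first two steps (snoop, then pick), assuming tcpdump is available and the relevant interface is eth0; this is not what the installer does today.

```
# Collect the VRIDs seen in VRRP advertisements for one minute...
seen=$(timeout 60 tcpdump -nni eth0 -l 'ip proto 112' 2>/dev/null \
         | grep -o 'vrid [0-9]*' | awk '{print $2}' | sort -un)

# ...then print three IDs from the 1-255 range that were not observed.
picked=0
for id in $(seq 1 255); do
  if ! printf '%s\n' "$seen" | grep -qx "$id"; then
    echo "unused VRID: $id"
    picked=$((picked + 1))
    [ "$picked" -eq 3 ] && break
  fi
done
```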

Comment 39 Antoni Segura Puimedon 2020-07-02 08:23:55 UTC
We addressed it with documentation on how to see which virtual router IDs your cluster will end up with: https://github.com/openshift/installer/pull/3463

Comment 40 Aleksandra Malykhin 2020-07-12 05:05:52 UTC
Verified on
Client Version: 4.6.0-0.nightly-2020-07-07-233934

[root@titan54 ~]# podman run quay.io/openshift/origin-baremetal-runtimecfg:4.6 vr-ids cnf10
APIVirtualRouterID: 147
DNSVirtualRouterID: 158
IngressVirtualRouterID: 2

[root@titan54 ~]# podman run quay.io/openshift/origin-baremetal-runtimecfg:4.6 vr-ids cnf11
APIVirtualRouterID: 228
DNSVirtualRouterID: 239
IngressVirtualRouterID: 147

Comment 43 errata-xmlrpc 2020-10-27 15:57:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

