Bug 1821667

Summary: keepalived virtual routerids can easily clash when running several clusters
Product: OpenShift Container Platform Reporter: Karim Boumedhel <kboumedh>
Component: Installer Assignee: Antoni Segura Puimedon <asegurap>
Installer sub component: OpenShift on Bare Metal IPI QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: asegurap, augol, bperkins, kgarriso, smilner, vvoronko, yboaron
Version: 4.4 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Using VRRP to manage the Virtual IPs for OCP IPI clusters means that only 8 bits are available for a virtual router ID on a given broadcast domain. Virtual router IDs may already be in use in the broadcast domain we deploy to.
Consequence: Collisions end up preventing nodes from taking on their Virtual IPs.
Fix: Add a tool (and document its usage) that allows the user to check which virtual router IDs will be used for the chosen cluster name.
Result: Users now have a way to know about Virtual Router IDs before deploying.
Story Points: ---
Clone Of:
: 1823465 Environment:
Last Closed: 2020-07-13 17:25:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1823465    

Description Karim Boumedhel 2020-04-07 11:31:45 UTC
Description of problem:
keepalived virtual router IDs clash when running several clusters, which prevents some VIPs from coming up.

Version-Release number of selected component (if applicable):
>=4.4

How reproducible:
Always, when specific cluster names are used together.

Steps to Reproduce:
1. Deploy a cluster named cnf10 and a second one named cnf11.
2. The API virtual router ID of the first cluster conflicts with the ingress virtual router ID of the second one, because both are computed by this function: https://github.com/openshift/baremetal-runtimecfg/pull/54/files#diff-3b5c896aef01987443b23dc503e418eaR147

Actual results:
Conflicts, which prevent the ingress VIP from coming up on the workers.

Expected results:
No conflicts.

Additional info:
A tool should at least predict the generated IDs and warn the user not to use those two cluster names together.
Something like this, for instance:

```
package main

import "fmt"

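// FletcherChecksum8 computes an 8-bit Fletcher-style checksum of inp.
// Both running sums are reduced modulo 0xf (15), as in the function linked above.
func FletcherChecksum8(inp string) uint8 {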
	var ckA, ckB uint8
	for i := 0; i < len(inp); i++ {
		ckA = (ckA + inp[i]) % 0xf
		ckB = (ckB + ckA) % 0xf
	}
	return (ckB << 4) | ckA
}

func main() {
	cluster1 := "cnf10"
	cluster2 := "cnf11"
	api_id1 := FletcherChecksum8(cluster1+"-api") + 1
	dns_id1 := FletcherChecksum8(cluster1+"-dns") + 1
	ingress_id1 := FletcherChecksum8(cluster1+"-ingress") + 1
	api_id2 := FletcherChecksum8(cluster2+"-api") + 1
	dns_id2 := FletcherChecksum8(cluster2+"-dns") + 1
	ingress_id2 := FletcherChecksum8(cluster2+"-ingress") + 1
	fmt.Printf("cluster: %s api: %d dns: %d ingress: %d\n", cluster1, api_id1, dns_id1, ingress_id1)
	fmt.Printf("cluster: %s api: %d dns: %d ingress: %d\n", cluster2, api_id2, dns_id2, ingress_id2)
}
```

Comment 1 Yossi Boaron 2020-04-13 08:56:53 UTC
Just to clarify, keepalived virtual router IDs clash only if the clusters are deployed on the same L2 domain.

Comment 7 Victor Voronkov 2020-04-20 08:39:55 UTC
Verified on 4.5.0-0.nightly-2020-04-14-031010

Checked from a master node:

[master-0-0 ~]$ sudo crictl exec $(sudo crictl ps --name keepalived-monitor | awk 'FNR==2{ print $1}') runtimecfg vr-ids cnf10
APIVirtualRouterID: 147
DNSVirtualRouterID: 158
IngressVirtualRouterID: 2
[core@master-0-0 ~]$ sudo crictl exec $(sudo crictl ps --name keepalived-monitor | awk 'FNR==2{ print $1}') runtimecfg vr-ids cnf11
APIVirtualRouterID: 228
DNSVirtualRouterID: 239
IngressVirtualRouterID: 147

Checked from an external host, following the documentation provided here: https://github.com/openshift/installer/blob/master/docs/user/metal/install_ipi.md
[~]# podman run quay.io/openshift/origin-baremetal-runtimecfg:4.5 vr-ids cnf11
APIVirtualRouterID: 228
DNSVirtualRouterID: 239
IngressVirtualRouterID: 147

Comment 8 errata-xmlrpc 2020-07-13 17:25:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409