Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2053309

Summary:	Unicast mode change upgrade check not working
Product:	OpenShift Container Platform
Component:	Networking
Networking sub component:	runtime-cfg
Version:	4.10
Status:	CLOSED WONTFIX
Severity:	medium
Priority:	medium
Type:	Bug
Reporter:	Ben Nemec <bnemec>
Assignee:	Douglas Schilling Landgraf <dougsland>
QA Contact:	Victor Voronkov <vvoronko>
CC:	adpawar, dougsland, jima, vvoronko
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2024-04-30 18:04:53 UTC

Description Ben Nemec 2022-02-10 22:25:45 UTC

Description of problem: When we switched the default keepalived mode to unicast, we added some logic to ensure the mode change happens at approximately the same time on all nodes so we don't end up with some nodes using unicast and some not, which would result in duplicate VIPs. However, part of that logic was to ensure the cluster was fully upgraded before we did the switch. This check appears to be flawed and isn't actually allowing the mode switch to happen.

The problem is that we make two comparisons[0]: First, we verify that each node's desiredConfig matches its currentConfig. Then we check that its desiredConfig matches the desiredConfig of the first node. The problem is master and worker nodes have different configs, which means worker nodes will never match the first master.
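
A minimal sketch of the two comparisons described above, assuming a simplified node struct rather than the real types in baremetal-runtimecfg. The struct, its Role field, both function names, and the config names are hypothetical and only for illustration. upgradeDonePerRole shows one possible way to avoid comparing workers against a master; it is not necessarily the fix that was adopted.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the per-node data the check reads.
type node struct {
	Name          string
	Role          string // "master" or "worker" (assumed field)
	CurrentConfig string
	DesiredConfig string
}

// upgradeDoneFlawed mirrors the two comparisons from the description:
// each node must be settled (desired == current), and each node's
// desiredConfig must equal the desiredConfig of the first node.
func upgradeDoneFlawed(nodes []node) bool {
	for _, n := range nodes {
		if n.CurrentConfig != n.DesiredConfig {
			return false
		}
		if n.DesiredConfig != nodes[0].DesiredConfig {
			return false // workers always land here when nodes[0] is a master
		}
	}
	return true
}

// upgradeDonePerRole compares each node only against the first node
// seen with the same role, so masters and workers may differ.
func upgradeDonePerRole(nodes []node) bool {
	firstDesired := map[string]string{}
	for _, n := range nodes {
		if n.CurrentConfig != n.DesiredConfig {
			return false
		}
		want, seen := firstDesired[n.Role]
		if !seen {
			firstDesired[n.Role] = n.DesiredConfig
			continue
		}
		if n.DesiredConfig != want {
			return false
		}
	}
	return true
}

func main() {
	// A fully settled cluster: every node runs its desired config.
	nodes := []node{
		{Name: "master-0", Role: "master", CurrentConfig: "rendered-master-a", DesiredConfig: "rendered-master-a"},
		{Name: "worker-0", Role: "worker", CurrentConfig: "rendered-worker-b", DesiredConfig: "rendered-worker-b"},
	}
	fmt.Println("flawed check:", upgradeDoneFlawed(nodes))    // flawed check: false
	fmt.Println("per-role check:", upgradeDonePerRole(nodes)) // per-role check: true
}
```

With the first check, any cluster that has both masters and workers reports "not upgraded" forever, which matches the log message in the actual results below.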

I'm not sure how serious this problem is since keepalived still functions fine in multicast mode and unicast is more important for new deployments where multicast is not allowed, but it's still something we should fix.

0: https://github.com/openshift/baremetal-runtimecfg/blob/edc9617a13839571f596109529573937fe199c2d/pkg/config/node.go#L194

Version-Release number of selected component (if applicable): I believe we switched to unicast in 4.6, and this affects every version back to that.

How reproducible: Always.

Steps to Reproduce:
1. Deploy a cluster in multicast mode (this requires modifying machine-config-operator manifests on baremetal since the default is now unicast).
2. Override the keepalived manifest with one that sets ENABLE_UNICAST to "yes" and create the /etc/keepalived/monitor.conf file with the contents "mode: unicast"
3. Watch the keepalived-monitor logs.

Actual results: After ~10 minutes you will see a log message that says "Failed to retrieve upgrade status or Upgrade still running"

Expected results: Keepalived mode switched to unicast

Additional info:

Comment 1 Douglas Schilling Landgraf 2022-04-22 15:25:08 UTC

For testing, make sure the test machine runs RHEL 8.5 or higher and has Go installed. Here are the steps to create the environment.

1) Check system
$ cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.5 (Ootpa)

2) Check Go Version
$ go version
go version go1.17.1 linux/amd64


3) Download the code for creating the scenario for testing
$ mkdir -p ~/go/src/github.com/openshift
$ cd ~/go/src/github.com/openshift
$ git clone https://github.com/openshift/baremetal-runtimecfg.git

Create a symbolic link in the home directory
$ cd ~
$ ln -s ~/go/src/github.com/openshift/baremetal-runtimecfg baremetal-runtimecfg


4) Download machine-config-operator and make small changes to it
$ cd ~/go/src/github.com/openshift
$ git clone https://github.com/openshift/machine-config-operator
.
$ pushd .
$ cd machine-config-operator

Make the following changes to disable unicast by default:
$ git diff
diff --git a/manifests/on-prem/keepalived.yaml b/manifests/on-prem/keepalived.yaml
index 4d9dab8a..6c3871bc 100644
--- a/manifests/on-prem/keepalived.yaml
+++ b/manifests/on-prem/keepalived.yaml
@@ -107,7 +107,7 @@ spec:
     image: {{ .Images.BaremetalRuntimeCfgBootstrap }}
     env:
       - name: ENABLE_UNICAST
-        value: "yes"
+        value: "no"
       - name: IS_BOOTSTRAP
         value: "yes"
     command:
diff --git a/templates/common/on-prem/files/keepalived.yaml b/templates/common/on-prem/files/keepalived.yaml
index c7166388..61ff0b75 100644
--- a/templates/common/on-prem/files/keepalived.yaml
+++ b/templates/common/on-prem/files/keepalived.yaml
@@ -153,7 +153,7 @@ contents:
         image: {{ .Images.baremetalRuntimeCfgImage }}
         env:
           - name: ENABLE_UNICAST
-            value: "yes"
+            value: "no"
           - name: IS_BOOTSTRAP
             value: "no"

$ popd

Create a symbolic link in the home directory
$ cd ~
$ ln -s ~/go/src/github.com/openshift/machine-config-operator machine-config-operator


5) Download dev-scripts to create the OpenShift bare-metal environment

NOTE: On the machine I am working on, I had to move dev-scripts to /home/ because I needed more space:

$ cd /home
$ ls -ld git
drwxr-xr-x.  5 douglas douglas   75 Apr 14 15:07 git

$ cd git/
$ git clone https://github.com/openshift-metal3/dev-scripts.git

Create a symbolic link in the home directory
$ cd ~
$ ln -s /home/git/dev-scripts/ devscript
$ cd devscript

Set environment:
=======================
export WORKING_DIR=/home/git/wrk-dir-devscripts/
export IP_STACK=v4
export KUBECONFIG=/home/git/dev-scripts/ocp/ostest/auth/kubeconfig

export EXTRA_NETWORK_NAMES="nmstate1 nmstate2"
export NMSTATE1_NETWORK_SUBNET_V4='192.168.221.0/24'
export NMSTATE1_NETWORK_SUBNET_V6='fd2e:6f44:5dd8:ca56::/120'
export NMSTATE2_NETWORK_SUBNET_V4='192.168.222.0/24'
export NMSTATE2_NETWORK_SUBNET_V6='fd2e:6f44:5dd8:cc56::/120'

# Use the symbolic links created in the HOME directory
export MACHINE_CONFIG_OPERATOR_LOCAL_IMAGE=://machine-config-operator
export BAREMETAL_RUNTIMECFG_LOCAL_IMAGE=://baremetal-runtimecfg


Build (takes approx 1h or more)
==================================
devscript> make all

Comment 2 Douglas Schilling Landgraf 2022-04-22 15:58:58 UTC

After the installation is done, create a manifest.yaml for applying the change:


manifest.yaml
===================================================
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 10-keepalived-override
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,[base64 version of keepalived.yaml modified to set ENABLE_UNICAST to "yes"]
        mode: 0644
        overwrite: true
        path: /etc/kubernetes/manifests/keepalived.yaml
      - contents:
          source: data:text/plain;charset=utf-8;base64,bW9kZTogdW5pY2FzdA==
        mode: 0644
        overwrite: true
        path: /etc/keepalived/monitor.conf
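
The "source" values in the manifest are base64 data URLs. A small sketch of how they can be produced, assuming a hypothetical helper named dataURL (not part of any OpenShift tooling); it uses only the Go standard library:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// dataURL wraps raw file contents in the data URL form used by the
// "source" fields of the MachineConfig above. Hypothetical helper.
func dataURL(contents []byte) string {
	return "data:text/plain;charset=utf-8;base64," +
		base64.StdEncoding.EncodeToString(contents)
}

func main() {
	// Contents of /etc/keepalived/monitor.conf, without a trailing newline.
	fmt.Println(dataURL([]byte("mode: unicast")))
	// prints: data:text/plain;charset=utf-8;base64,bW9kZTogdW5pY2FzdA==
}
```

For the keepalived.yaml entry, the same helper can be given the bytes of the modified file, for example as read with os.ReadFile.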

1) Apply the manifest to enable Unicast.

$ oc apply -f manifest.yaml


2) Watch the keepalived-monitor logs on both a worker and a master
$ oc logs -f -n openshift-kni-infra keepalived-worker-1 keepalived-monitor
$ oc logs -f -n openshift-kni-infra keepalived-master-1 keepalived-monitor


Look for the "Update Mode" message, for example:

time="2022-03-30T18:54:37Z" level=info msg="Update Mode from newConfig.EnableUnicast to desiredModeInfo.Mode" desiredModeInfo.Mode=unicast desiredModeInfo.Time="2022-03-30 18:55:00 +0000 UTC" newConfig.EnableUnicast=false
time="2022-03-30T18:54:37Z" level=info msg="Mode Update config change" curConfig="{{ostest test.metalkube.org 192.168.111.5 14 A AAAA 192.168.111.4 93 A AAAA 32 0 []} {123 123 123 [{master-0 192.168.111.20 123} {master-1 192.168.111.21 123} {master-2 192.168.111.22 123}] } 192.168.111.20 master-0 enp2s0 [192.168.111.1 fe80::5054:ff:fe70:45c9%enp3s0] {[192.168.111.20 192.168.111.21 192.168.111.22]} true}"
time="2022-03-30T18:54:37Z" level=info msg="global_defs {"


If there is no "Failing" or "Error" message, we are all set and the bug is verified.
Feel free to reach out if you have any questions.
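
The log check in step 2 can also be scripted. This is only a sketch, assuming the log text has been saved from the oc logs output; the function name checkMonitorLog is hypothetical, and the matching is a plain substring search for "Update Mode" plus a case-insensitive search for "fail" or "error", which is an assumption about what counts as a failure line.

```go
package main

import (
	"fmt"
	"strings"
)

// checkMonitorLog scans keepalived-monitor log text. It reports whether an
// "Update Mode" message was seen and returns every line that contains
// "fail" or "error" (case-insensitive). Hypothetical helper for illustration.
func checkMonitorLog(log string) (sawUpdate bool, failures []string) {
	for _, line := range strings.Split(log, "\n") {
		lower := strings.ToLower(line)
		switch {
		case strings.Contains(lower, "fail") || strings.Contains(lower, "error"):
			failures = append(failures, line)
		case strings.Contains(line, "Update Mode"):
			sawUpdate = true
		}
	}
	return sawUpdate, failures
}

func main() {
	// Message text taken from the bug description; the log prefix is shortened.
	log := `level=info msg="Failed to retrieve upgrade status or Upgrade still running"`
	sawUpdate, failures := checkMonitorLog(log)
	fmt.Println(sawUpdate, len(failures)) // false 1
}
```

A run counts as verified only when sawUpdate is true and failures is empty.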

Comment 3 Douglas Schilling Landgraf 2022-04-22 16:00:40 UTC

Hi Victor,

To verify the bug, please use the steps above as soon as https://github.com/openshift/baremetal-runtimecfg/pull/173 merges upstream.
Feel free to reach out to me.

Thanks
Douglas

Comment 6 adpawar 2023-02-17 05:19:08 UTC

Hello Team,
I have a customer who faced an issue during a 4.10 to 4.11 upgrade. The ingress VIP was active on more than one node at the same time, causing the upgrade to fail. After some digging we found that some nodes were configured to use unicast for keepalived and some were not, resulting in a split-brain situation where there were 2 keepalived masters for the ingress VIP.
keepalived shouldn't switch to unicast until after the cluster upgrade is complete, but we found that there was a period of around 2 hours during the upgrade when keepalived was in a split-brain scenario. Not all nodes were upgraded to 4.11.25 before the switch to unicast was made.
I was wondering if this is related to an existing bug or if I need to file a new bug for this issue.

Aditya Pawar

Comment 9 Rory Thrasher 2024-04-30 18:04:53 UTC

OCP is no longer using Bugzilla and this bug appears to have been left in an orphaned state. If the bug is still relevant, please open a new issue in the OCPBUGS Jira project: https://issues.redhat.com/projects/OCPBUGS/summary