Bug 1924171

Summary: ovn-kube must handle single-stack to dual-stack migration
Product: OpenShift Container Platform Reporter: Dan Winship <danw>
Component: Networking Assignee: Dan Winship <danw>
Networking sub component: ovn-kubernetes QA Contact: Anurag Saxena <anusaxen>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: high CC: aojeagar, bbennett, dcbw, trozet
Version: 4.7   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1937829 Environment:
Last Closed: 2021-07-27 22:37:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1937829, 1956352    

Description Dan Winship 2021-02-02 18:04:39 UTC
When migrating a cluster from single-stack to dual-stack, CNO will eventually update the ovnkube config and restart the ovnkube-masters. This will currently fail with:

panic: failed to set gateway chassis 35d1489e-e7f2-494a-99fb-0c4ae0419690 for distributed gateway port rtos-node_local_switch: stdout: "", stderr: "ovn-nbctl: rtos-node_local_switch: port already exists with different network\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --may-exist lrp-add ovn_cluster_router rtos-node_local_switch 0a:58:a9:fe:00:02 169.254.0.2/20 fd99::2/64 -- --id=@gw create gateway_chassis chassis_name=35d1489e-e7f2-494a-99fb-0c4ae0419690 external_ids:dgp_name=rtos-node_local_switch name=rtos-node_local_switch_35d1489e-e7f2-494a-99fb-0c4ae0419690 priority=100 -- set logical_router_port rtos-node_local_switch gateway_chassis=@gw' failed: exit status 1

The code currently assumes that the switch will either not exist, or it will have the correct config; it doesn't deal with it having a "half-correct" config.
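
As a rough illustration of the missing case (a minimal sketch, not the fix that was actually merged): the setup code could compare the networks already configured on the port with the desired ones and distinguish three outcomes instead of two. The names portAction, sameNetworks and decidePortAction are hypothetical and do not come from ovn-kubernetes, and the comparison is purely textual.

```go
package main

import "sort"

// portAction is a hypothetical enum describing what to do with the router port.
type portAction int

const (
	actionCreate portAction = iota // port does not exist yet
	actionKeep                     // port exists with exactly the desired networks
	actionUpdate                   // port exists with a different ("half-correct") network set
)

// sameNetworks reports whether two lists hold the same CIDR strings, ignoring order.
// Assumption: addresses are compared as text, without normalizing them.
func sameNetworks(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

// decidePortAction picks an action from the port's current state and the desired networks.
func decidePortAction(exists bool, existing, desired []string) portAction {
	switch {
	case !exists:
		return actionCreate
	case sameNetworks(existing, desired):
		return actionKeep
	default:
		return actionUpdate
	}
}
```

In the failing case above, the port would already exist with only 169.254.0.2/20 while the desired networks are 169.254.0.2/20 and fd99::2/64, so this sketch would report actionUpdate rather than retrying the lrp-add.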


We probably need to set up a CI job upstream to test single-to-dual migration there, and then once that's working, pull the changes downstream. (This will be needed in 4.7.z.)


We only support going from a single-stack config to a dual-stack config that is a superset of it, e.g., from

    --cluster-subnets="10.128.0.0/14/23"
    --service-cidrs="172.30.0.0/16"

to

    --cluster-subnets="10.128.0.0/14/23,fd01::/48/64"
    --service-cidrs="172.30.0.0/16,fd02::/112"
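
The superset rule could be checked along these lines. This is only a sketch under stated assumptions: entryIsIPv6 and isSupersetMigration are hypothetical helpers, not ovn-kubernetes or CNO functions, and "superset" is taken to mean that every old entry appears unchanged in the new list and that the new list covers both IP families.

```go
package main

import (
	"net"
	"strings"
)

// entryIsIPv6 reports whether a subnet entry such as "10.128.0.0/14/23" or
// "fd02::/112" is IPv6. A trailing host prefix ("/23") is dropped before parsing.
func entryIsIPv6(entry string) (bool, error) {
	parts := strings.Split(entry, "/")
	if len(parts) > 2 {
		parts = parts[:2]
	}
	ip, _, err := net.ParseCIDR(strings.Join(parts, "/"))
	if err != nil {
		return false, err
	}
	return ip.To4() == nil, nil
}

// isSupersetMigration reports whether newList keeps every entry of oldList
// unchanged and contains both an IPv4 and an IPv6 entry.
func isSupersetMigration(oldList, newList string) (bool, error) {
	newEntries := map[string]bool{}
	haveV4, haveV6 := false, false
	for _, e := range strings.Split(newList, ",") {
		e = strings.TrimSpace(e)
		v6, err := entryIsIPv6(e)
		if err != nil {
			return false, err
		}
		if v6 {
			haveV6 = true
		} else {
			haveV4 = true
		}
		newEntries[e] = true
	}
	for _, e := range strings.Split(oldList, ",") {
		if !newEntries[strings.TrimSpace(e)] {
			return false, nil // an existing subnet was changed or dropped
		}
	}
	return haveV4 && haveV6, nil
}
```

With the values above, both the cluster-subnets pair and the service-cidrs pair would pass this check, while changing the IPv4 subnet during the migration would not.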

Comment 1 Dan Williams 2021-02-02 20:05:48 UTC
-> danw since he's actually working on it right now

Comment 2 Antonio Ojea 2021-02-03 09:52:47 UTC
> We probably need to set up a CI job upstream to test single-to-dual migration there, and then once that's working, pull the changes downstream. (This will be needed in 4.7.z.)

I'd appreciate it if you could describe the CNO steps in more detail so we can "mimic" the same scenario in the test:

1. Create single stack cluster
2. enable dualstack feature gate and restart apiservers?
3. modify ovn-kube with dual stack parameters?
...

Comment 3 Dan Winship 2021-02-03 13:30:24 UTC
(In reply to Antonio Ojea from comment #2)
> 1. Create single stack cluster

More specifically: create docker hosts that already have both IPv4 and IPv6 addresses, but then install a cluster onto them that only uses IPv4 in its config.

> 2. enable dualstack feature gate and restart apiservers?

(restarting the apiservers both to enable the feature gate and to set the dual-stack service cidr)

And you need to restart the kubelets to enable the feature gate there too.

> 3. modify ovn-kube with dual stack parameters?

Yup. And I think both masters and nodes will need to be restarted.

Comment 5 Antonio Ojea 2021-03-11 16:01:08 UTC
Merged in openshift https://github.com/openshift/ovn-kubernetes/pull/440
Needs backports to 4.7

Comment 6 Dan Winship 2021-03-15 16:17:22 UTC
This is only partially implemented, but it is all that we're backporting to 4.7 for now (the customer will need to use a very manual process), so I'm marking it VERIFIED so we can get the 4.7 backport in.

Comment 9 errata-xmlrpc 2021-07-27 22:37:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438