1866602 – [OVN] Fail to setup the cluster with UPI on vsphere

Summary: [OVN] Fail to setup the cluster with UPI on vsphere

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	All
OS:	All
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Tim Rozet
QA Contact:	Anurag saxena
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-08-06 03:12 UTC by zhaozhanqi
Modified:	2020-10-27 16:25 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 16:25:22 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2018	0	None	closed	Bug 1866602: OVS Configuration: handle static IP ifaces	2021-01-18 02:45:13 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:25:45 UTC

Description zhaozhanqi 2020-08-06 03:12:25 UTC

Description of problem:
Installing a UPI cluster on vSphere with OVN:
br-ex gets a new IP address that is not the same as the one on ens192 (the default interface).

[core@control-plane-1 ~]$ cat /etc/sysconfig/network-scripts/ifcfg-ens192
TYPE=Ethernet
BOOTPROTO=none
NAME=ens192
DEVICE=ens192
ONBOOT=yes
IPADDR=139.178.76.12
PREFIX=26
GATEWAY=139.178.76.1

[core@control-plane-1 ~]$ nmcli d show br-ex
GENERAL.DEVICE:                         br-ex
GENERAL.TYPE:                           ovs-interface
GENERAL.HWADDR:                         (unknown)
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     ovs-if-br-ex
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/5
IP4.ADDRESS[1]:                         139.178.76.43/26
IP4.GATEWAY:                            139.178.76.1
IP4.ROUTE[1]:                           dst = 0.0.0.0/0, nh = 139.178.76.1, mt = 800
IP4.ROUTE[2]:                           dst = 139.178.76.0/26, nh = 0.0.0.0, mt = 800
IP4.DNS[1]:                             139.178.76.62
IP4.DOMAIN[1]:                          internal.example.com
IP6.ADDRESS[1]:                         fe80::f964:d219:d681:d154/64
IP6.GATEWAY:                            --
IP6.ROUTE[1]:                           dst = fe80::/64, nh = ::, mt = 800
IP6.ROUTE[2]:                           dst = ff00::/8, nh = ::, mt = 256, table=255


Version-Release number of selected component (if applicable):
4.6

How reproducible:
always

Steps to Reproduce:
1. Set up the cluster with UPI on vSphere with OVN.

Actual results:
Cluster setup failed.
The br-ex IP is not the same as the original IP on ens192.


Expected results:

br-ex if need to consider as unmanaged by NM

Additional info:

Comment 1 Dan Winship 2020-08-06 12:38:43 UTC

> br-ex if need to consider as unmanaged by NM

It's actually created by NM.

When "eth0" is using DHCP, we need br-ex to run DHCP as well, to keep the lease active.

I guess the fix is that if "eth0" has ipv4.method=manual / ipv6.method=manual, we need to copy that to br-ex too, so it _doesn't_ run DHCP.
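
The rule proposed here can be sketched roughly as follows. This is only an illustration of the idea, not the logic that ended up in the setup script: `uplink` is a hypothetical dict keyed by NetworkManager property names, and treating a missing method as "auto" is an assumption.

```python
def bridge_ip_settings(uplink):
    """Derive the IP settings br-ex should use from the uplink's settings.

    A "manual" method is copied together with its addresses, gateway and DNS,
    so br-ex does not run DHCP. Any other method is copied as-is, so a
    DHCP-configured uplink still leads to DHCP on br-ex.
    """
    bridge = {}
    for family in ("ipv4", "ipv6"):
        # Assumption: a missing method means the default, "auto".
        method = uplink.get(f"{family}.method", "auto")
        bridge[f"{family}.method"] = method
        if method == "manual":
            for prop in ("addresses", "gateway", "dns"):
                key = f"{family}.{prop}"
                if key in uplink:
                    bridge[key] = uplink[key]
    return bridge
```

For the node in this report, an uplink with `ipv4.method` set to `manual` and address 139.178.76.12/26 would yield the same static settings for br-ex, while the IPv6 side would stay on `auto`.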

Comment 2 Tim Rozet 2020-08-06 13:28:14 UTC

(In reply to Dan Winship from comment #1)
> > br-ex if need to consider as unmanaged by NM
> 
> It's actually created by NM.
> 
> When "eth0" is using DHCP, we need br-ex to run DHCP as well, to keep the
> lease active.
> 
> I guess the fix is that if "eth0" has ipv4.method=manual /
> ipv6.method=manual, we need to copy that to br-ex too, so it _doesn't_ run
> DHCP.

I thought our agreement was that if you want to do manual interface config on the box, then you need to provide ignition NetworkManager key files to specify your manual config. Here is an example you can use:
https://gist.github.com/trozet/8cd5da0ce872b2f8d3952bf83279af71

Comment 3 Dan Winship 2020-08-06 14:01:53 UTC

OK, apparently there was some miscommunication.

I thought we had agreed that the user was responsible for ensuring the *underlying host network* was set up the way they wanted (eg, "default behavior" / "use a static IP" / "bond eth0 and eth1 and then make a vlan interface off the bond and use that as the default interface" / etc), and then the ovn-kubernetes setup script would take whatever interface had the default route, and move it to br-ex.

Tim thought we had agreed that if the user wanted anything other than 100% default networking (DHCP on primary ethernet interface) then they would have to provide NM config files for the entire network, *up to and including br-ex*.


It turns out that my interpretation is gibberish, because if you want an OVS bridge with a bonded interface inside it, you can't configure that by taking a kernel-level bonded interface and moving it into the bridge, you have to configure the bonding at the OVS level. Which Tim assumed I knew, and hence he assumed I was saying his version, since that's the only thing that makes sense in that case.


I don't like the current situation because:

  - While I think it's reasonable to make life difficult for people who
    want "really complicated" network configurations, I don't think "I
    need static IP instead of DHCP" should count as "really complicated".

  - This would mean that the sdn-to-ovn migration tool would not work for
    users with static IPs. (Given that we don't currently have a declarative
    spec for host networking config, it's not reasonable to expect the
    migration tool to be able to fix up their cluster configuration
    automatically.)

  - Requiring the user to specify a configuration for br-ex exposes too
    much of the low-level details of ovn-kubernetes shared gateway mode.
    What if we realize later that we need to configure the shared gateway
    bridge somewhat differently? What do we do with people's explicit
    br-ex configurations then?

So I feel like, if you aren't using bonds and vlans (as in the case here in this bug) then the setup script needs to just cope. (Or maybe the case in this bug is the _only_ other case that gets the "simple" treatment; you get automatic handling for single-ethernet-with-DHCP, and automatic handling for single-ethernet-with-static-IP, but everything else needs the more complicated config?)

For the users who are doing "complicated" stuff, I'm not sure. The nmstate-like approach is definitely better than writing out NetworkManager configs (other than the parsing issues), but I still don't like the fact that it requires specifying the _exact_ topology including the ovn-kubernetes-specific parts, rather than only specifying the "raw" host network config. It may be that there's nothing we can do about that. I guess at least with the nmstate-like file, if we do need to change the topology in the future, it would be much easier to tweak the provided config to match what we need than it would be in the NetworkManager config case.

Comment 4 zhaozhanqi 2020-08-11 06:21:38 UTC

@Dan
Do you mean we can use the br-ex ignition file to give br-ex a static IP? If so, how do we set this? Could you provide the detailed steps? That would be very helpful. Thanks.

Comment 5 Dan Winship 2020-08-11 11:50:44 UTC

I didn't mean that, although I guess you could, at the moment. The eventual fix is that you should not have to do anything different than you were doing before.

I think if you just edit the existing br-ex definition on the node using nmcli or nmtui to set a static IP, then it should work.
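
As a rough sketch of that workaround, the snippet below builds an `nmcli connection modify` command from the values in this report. The connection name `ovs-if-br-ex` comes from the `nmcli d show br-ex` output in the description, the address and gateway come from ifcfg-ens192, and the helper name is hypothetical. It has not been verified on an affected cluster.

```python
import shlex


def static_ip_command(connection, address, prefix, gateway):
    """Build an `nmcli connection modify` command that sets a static IPv4 address."""
    return [
        "nmcli", "connection", "modify", connection,
        "ipv4.method", "manual",
        "ipv4.addresses", f"{address}/{prefix}",
        "ipv4.gateway", gateway,
    ]


cmd = static_ip_command("ovs-if-br-ex", "139.178.76.12", 26, "139.178.76.1")
# Print the command for review instead of running it on the node directly.
print(shlex.join(cmd))
```

The modified profile would normally still need to be re-activated (for example with `nmcli connection up ovs-if-br-ex`) before the new address takes effect.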

Comment 6 Tim Rozet 2020-08-18 13:43:35 UTC

After some discussion, we have agreed to allow a static IP to be detected and moved over to the OVS interface. However, all other manual configuration, including bonds or VLANs, will require configuring the host manually.

Comment 7 Tim Rozet 2020-08-20 18:40:58 UTC

Is there any way you can test out https://github.com/openshift/machine-config-operator/pull/2018 and see if it fixes the problem? You can try changing the script on the host and rebooting the node if you don't want to build a custom MCO image. Or if you can give me access to a node that fails, I can modify it and test it out.

Comment 8 Dan Winship 2020-08-20 20:31:42 UTC

You can tell cluster-bot "build openshift/machine-config-operator#2018" and it will build a release image containing that PR for you, and then follow the link it gives you and look in the build-log.txt and near the end you'll see:

    2020/07/18 14:50:09 Create release image registry.svc.ci.openshift.org/ci-ln-zn97t2k/release:latest

and then run "oc adm release extract --command=openshift-install THAT-IMAGE-NAME" to get an appropriate openshift-install binary that will use that image.
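
The two steps above (find the release image in build-log.txt, then extract the installer) could be scripted along these lines. This is a sketch that assumes the log line has exactly the "Create release image ..." shape quoted above; the function names are hypothetical.

```python
import re

# Matches the log line shape quoted above and captures the image reference.
RELEASE_LINE = re.compile(r"Create release image (\S+)")


def find_release_image(build_log):
    """Return the last release image named in the build log, or None."""
    image = None
    for line in build_log.splitlines():
        match = RELEASE_LINE.search(line)
        if match:
            image = match.group(1)
    return image


def extract_installer_command(image):
    """Build the `oc adm release extract` command for the given image."""
    return ["oc", "adm", "release", "extract",
            "--command=openshift-install", image]
```

The last match is used because the line is expected near the end of the log.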

Comment 9 Anurag saxena 2020-08-20 20:36:33 UTC

(In reply to Dan Winship from comment #8)
> You can tell cluster-bot "build openshift/machine-config-operator#2018" and
> it will build a release image containing that PR for you, and then follow
> the link it gives you and look in the build-log.txt and near the end you'll
> see:
> 
>     2020/07/18 14:50:09 Create release image
> registry.svc.ci.openshift.org/ci-ln-zn97t2k/release:latest
> 
> and then run "oc adm release extract --command=openshift-install
> THAT-IMAGE-NAME" to get an appropriate openshift-install binary that will
> use that image.

@Dan, I guess cluster-bot only supports IPI? This is UPI, where the nodes will have static IPs.

Comment 10 Anurag saxena 2020-08-20 20:38:18 UTC

Ah, sorry, I mixed up launch and build... heh

Comment 11 Anurag saxena 2020-08-21 20:15:18 UTC

Update: Shared a cluster with Tim today, with web console access.

Comment 12 Tim Rozet 2020-08-21 20:48:22 UTC

Thanks Anurag! Verified that, with some tweaks, the fix will work for vSphere. Under review now.

Comment 17 errata-xmlrpc 2020-10-27 16:25:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
