Bug 2036677 - Day-1 Networking - Static ip is Overwritten Upon Node Restart
Summary: Day-1 Networking - Static ip is Overwritten Upon Node Restart
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: ---
Assignee: Ben Nemec
QA Contact: Anurag saxena
URL:
Whiteboard:
Duplicates: 2082962
Depends On: 2100181
Blocks:
 
Reported: 2022-01-03 14:58 UTC by Adina Wolff
Modified: 2022-12-14 19:32 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-14 19:32:34 UTC
Target Upstream Version:
Embargoed:
awolff: needinfo-
vpickard: needinfo-
vpickard: needinfo-


Attachments
config and logs from reboot (10.46 KB, application/gzip)
2022-01-21 18:00 UTC, Ben Nemec
no flags Details

Comment 1 Steven Hardy 2022-01-04 15:59:48 UTC
> After deployment, the machine configured with nmstate in install-config didn’t get a proper name (presumably because the naming is done by the dhcp server)

Note that by default NetworkManager will try to derive the hostname from DHCP, or a reverse DNS lookup.

So in the case where it's not provided via DHCP, you'll need to ensure there's a PTR record to map the static IP to the DNS name (which should then be used by NM to set the hostname AFAIK)
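
A minimal way to check that on the affected host (a sketch; <static-ip> is a placeholder for the address assigned via nmstate):

# What hostname did NetworkManager end up setting?
nmcli general hostname
# Does a PTR record exist for the static address? It should return the node's FQDN.
dig -x <static-ip> +short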

Comment 4 Adina Wolff 2022-01-10 06:40:24 UTC
I was just reminded that last week we were able to log in to the host even after reboot. It seems like the newer version has different behavior: the host doesn't receive an IP address on enp0s4 after reboot:

[root@sealusa3 ~]# virsh console master-0-0
Connected to domain master-0-0
Escape character is ^] (Ctrl + ])
[59351.953782] overlayfs: unrecognized mount option "volatile" or missing value

Password: 
Login incorrect

localhost login: [59407.951065] overlayfs: unrecognized mount option "volatile" or missing value
login: timed out after 60 seconds
Red Hat Enterprise Linux CoreOS 410.84.202201081937-0 (Ootpa) 4.10
Ignition: ran on 2022/01/09 10:34:58 UTC (at least 2 boots ago)
Ignition: user-provided config was applied
SSH host key: SHA256:KxzDDMYTb4c6e24AK0Jh7zwhQcV8Szs1w0JS8PQbfsQ (ECDSA)
SSH host key: SHA256:f/aphkZM8I5wva45/fLXeYAzbWoPfAtj1DAdadlnAAk (ED25519)
SSH host key: SHA256:HEPcJSxfynjZkP+aEPS8WUo0PFNgk/OpPVHMeRwNZJI (RSA)
enp0s3: 172.22.0.59 fe80::5054:ff:fe26:d659
enp0s4:  
localhost login: [59463.984038] overlayfs: unrecognized mount option "volatile" or missing value

Comment 7 Adina Wolff 2022-01-12 18:49:18 UTC
Thanks Zane.
Back to the hostname. 
Following the suggestion by @shardy, we ran a deployment with DNS configuration for the host that gets the static IP. This doesn't seem to have changed the behavior. Suggestions or comments on incorrect configuration are welcome.

[root@sealusa3 ~]# virsh net-edit baremetal-0
.....
  <domain name='ocp-edge-cluster-0.qe.lab.redhat.com' localOnly='yes'/>
  <dns enable='yes'>
    <forwarder domain='apps.ocp-edge-cluster-0.qe.lab.redhat.com' addr='127.0.0.1'/>
    <forwarder domain='api.ocp-edge-cluster-0.qe.lab.redhat.com' addr='127.0.0.1'/>
    <host ip='192.168.123.1'>
      <hostname>registry</hostname>
      <hostname>hypervisor</hostname>
    </host>
    <host ip='192.168.123.11'>
      <hostname>master-0-0</hostname>
      <hostname>master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com</hostname>
    </host>
....

install-config.yaml:
.....
        networkConfig: |
          routes:
            config:
            - destination: 0.0.0.0/0
              next-hop-address: 192.168.123.1
              next-hop-interface: enp0s4
          dns-resolver:
            config:
              server:
              - 192.168.123.1
          interfaces:
          - name: enp0s4
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: "192.168.123.11"
                prefix-length: 24
              enabled: true
 .....


[kni@provisionhost-0-0 ~]$ oc get nodes
NAME                    STATUS   ROLES    AGE   VERSION
localhost.localdomain   Ready    master   54m   v1.22.1+6859754
master-0-1              Ready    master   61m   v1.22.1+6859754
master-0-2              Ready    master   61m   v1.22.1+6859754
worker-0-0              Ready    worker   32m   v1.22.1+6859754
worker-0-1              Ready    worker   30m   v1.22.1+6859754
[kni@provisionhost-0-0 ~]$ 


sh-4.4# nmcli dev show br-ex
GENERAL.DEVICE:                         br-ex
GENERAL.TYPE:                           ovs-interface
GENERAL.HWADDR:                         52:54:00:EE:99:D8
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     ovs-if-br-ex
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/8
IP4.ADDRESS[1]:                         192.168.123.11/24
IP4.GATEWAY:                            192.168.123.1
IP4.ROUTE[1]:                           dst = 192.168.123.0/24, nh = 0.0.0.0, mt = 800
IP4.ROUTE[2]:                           dst = 169.254.169.0/30, nh = 192.168.123.1, mt = 0
IP4.ROUTE[3]:                           dst = 172.30.0.0/16, nh = 192.168.123.1, mt = 0
IP4.ROUTE[4]:                           dst = 0.0.0.0/0, nh = 192.168.123.1, mt = 800
IP4.DNS[1]:                             192.168.123.1
IP6.GATEWAY:                            --
sh-4.4#

Comment 8 Steven Hardy 2022-01-13 10:35:10 UTC
On the host that's coming up as localhost, please can you check the reverse DNS lookup, e.g. dig -x 192.168.123.11 in the example above.

We need to confirm there is a PTR record for NM to derive the hostname from the IP.

Also please can you save the journal (either for all services, or at least for NetworkManager) somewhere?  Thanks!
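
Roughly (a sketch of the commands, adjust as needed):

dig -x 192.168.123.11
# capture the NetworkManager journal (or the whole journal for this boot) to attach here
journalctl -u NetworkManager --no-pager > nm-journal.log
journalctl -b --no-pager > full-journal.log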

Comment 13 Ben Nemec 2022-01-18 22:55:26 UTC
Hmm, I tried reproducing this in my dev env and was not able to. The node I assigned a static IP to came back up with the same IP and hostname as before. I'll have to take a closer look at your setup to see if I can find any differences.

Comment 14 Ben Nemec 2022-01-19 21:06:47 UTC
I was able to reproduce this today (I'm not positive what changed from yesterday, but looking into that too) and it looks like it's something to do with configure-ovs. In the logs after the reboot, I first see:

Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]: + nmcli -g all c show
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]: Wired Connection:fb66f30a-840d-4bc4-aad8-8e3642322600:802-3-ethernet:1642614036:Wed Jan 19 17\:40\:36 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/1:ye>
[snip]
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]: enp2s0:60ed6e8e-7990-4691-b563-dad469da1faf:802-3-ethernet:1642546018:Tue Jan 18 22\:46\:58 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/2:no:::::/etc/>
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]: + ip -d address show
[snip]
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]: 3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]:     link/ether 00:63:9d:e0:32:a8 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]:     openvswitch_slave numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
[snip]
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]: 8: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]:     link/ether 00:63:9d:e0:32:a8 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]:     openvswitch numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]:     inet 192.168.111.30/24 brd 192.168.111.255 scope global noprefixroute br-ex
Jan 19 17:40:43 master-0.ostest.test.metalkube.org configure-ovs.sh[2516]:        valid_lft forever preferred_lft forever

This is pretty much what I would expect to see. Note the .30 address on br-ex, which is the static IP I configured on the node. However, after configure-ovs tears down the existing bridge so it can re-create it, I see:

Jan 19 17:40:43 master-0 configure-ovs.sh[2516]: + nmcli -g all c show
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]: Wired Connection:fb66f30a-840d-4bc4-aad8-8e3642322600:802-3-ethernet:1642614043:Wed Jan 19 17\:40\:43 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/1:yes:enp1s0:activated:/org/fr>
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]: Wired Connection:fb66f30a-840d-4bc4-aad8-8e3642322600:802-3-ethernet:1642614043:Wed Jan 19 17\:40\:43 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/1:yes:enp2s0:activated:/org/fr>
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]: enp2s0:60ed6e8e-7990-4691-b563-dad469da1faf:802-3-ethernet:1642546018:Tue Jan 18 22\:46\:58 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/2:no:::::/etc/NetworkManager/system-conn>
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]: + ip -d address show
[snip]
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]: 3: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]:     link/ether 00:63:9d:e0:32:a8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]:     inet 192.168.111.20/24 brd 192.168.111.255 scope global dynamic noprefixroute enp2s0
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]:        valid_lft 3600sec preferred_lft 3600sec
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]:     inet6 fe80::263:9dff:fee0:32a8/64 scope link tentative noprefixroute
Jan 19 17:40:43 master-0 configure-ovs.sh[2516]:        valid_lft forever preferred_lft forever

Now enp2s0 has reverted to the DHCP address. Also note that in the nmcli output "Wired Connection" has changed and there is now a separate one for enp1s0 and enp2s0. I think that must be overriding the enp2s0 connection that we created on day 1. Note that this did _not_ happen the first time I rebooted this node, which is why it came up correctly. It's not clear to me what is triggering this behavior yet though.

Comment 16 Ben Nemec 2022-01-21 18:00:51 UTC
Created attachment 1852594 [details]
config and logs from reboot

Okay, I'm still not sure how to fix this, but it's definitely a problem with the process where configure-ovs tears down the old bridge and replaces it with a new one. The node initially comes up with the correct address, but after NetworkManager is restarted it seems to revert the interface to DHCP. I'm going to have to ask for help from the NetworkManager and/or SDN teams to figure this out.

Comment 17 Adina Wolff 2022-01-23 13:41:13 UTC
This has been reproduced today on real baremetal

Comment 18 Bob Fournier 2022-01-25 17:16:01 UTC
Changing component to SDN.

Comment 19 Jaime Caamaño Ruiz 2022-01-27 16:01:38 UTC
@bnemec

There are two interfaces on the node: enp1s0 and enp2s0.

There is an NM connection profile 60ed6e8e-7990-4691-b563-dad469da1faf for enp2s0 that configures the static IP.

The node is booted with karg ip=dhcp, which will cause generation of profile fb66f30a-840d-4bc4-aad8-8e3642322600 to configure *ANY* interface with DHCP.

This means that enp2s0 can be activated indiscriminately with either profile 60ed6e8e-7990-4691-b563-dad469da1faf or profile fb66f30a-840d-4bc4-aad8-8e3642322600.

When you look at this configuration statically, it makes no sense. You should either configure karg ip=enp1s0:dhcp or configure profile 60ed6e8e-7990-4691-b563-dad469da1faf with a higher priority than the default.

Looking at things dynamically, something makes the node always boot with enp2s0 activated with the expected profile. Then, when configure-ovs reloads NM, it triggers some kind of round robin that makes the profile for enp2s0 switch from 60ed6e8e-7990-4691-b563-dad469da1faf to fb66f30a-840d-4bc4-aad8-8e3642322600.
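
For illustration, the two static alternatives would look roughly like this (the priority value 10 is an arbitrary example of something >0):

# Alternative 1: restrict the DHCP kernel argument to the interface that should use DHCP
#   ip=enp1s0:dhcp   (instead of ip=dhcp)
# Alternative 2: give the static profile a higher autoconnect priority than the default
nmcli connection modify 60ed6e8e-7990-4691-b563-dad469da1faf connection.autoconnect-priority 10
nmcli -g connection.autoconnect-priority connection show 60ed6e8e-7990-4691-b563-dad469da1faf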

Comment 20 Ben Nemec 2022-01-27 16:54:13 UTC
I was able to make this work with this machine-config:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 10-static-workaround-master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Static IP Workaround
          Wants=NetworkManager-wait-online.service
          After=NetworkManager-wait-online.service
          Before=ovs-configuration.service
          [Service]
          Type=oneshot
          ExecStart=/bin/bash -c "for i in $(nmcli --fields NAME,UUID -t con show | grep 'Wired Connection' | awk -F : '{print $2}'); do nmcli con modify $i match.interface-name '!enp2s0'; done"
          [Install]
          WantedBy=multi-user.target
        enabled: true
        name: static-workaround.service

I _think_ a more proper fix would be to set connection.autoconnect-priority to something >0 on the static connection profile, but I'm not sure nmstate exposes that option. I'm checking with that team to see if there's some way to do it.

Note that we can't change boot parameters because the provisioning layer doesn't understand the network config, it just passes it through to the host. There's no automated way for us to tell which interfaces should be excluded from DHCP.

Comment 21 Zane Bitter 2022-01-27 17:04:14 UTC
One thing we could do is not pass any ip= arguments if the user has specified some network config.
A corollary of this is that if the user specifies *any* network config, they'd be responsible for providing *all* of the network config necessary for the Node to come up. (So e.g. if they have an IPv6 cluster they'll need to explicitly specify that the interface must wait for IPv6, something that we do for them if they don't provide any network config.) Maybe that is OK?

If the alternative is that the user has to manually set the priority (if that is even possible) on each interface then that is not much better.

I guess we could manually adjust the generated NetworkManager keyfiles to set the priorities for all interfaces higher than for the default, but that's a last resort.
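
As a sketch of that last resort (the value 10 is arbitrary), the generated keyfiles could be adjusted and reloaded along these lines:

# in each generated /etc/NetworkManager/system-connections/<name>.nmconnection,
# add under the [connection] section:
#   autoconnect-priority=10
# then have NetworkManager re-read the profiles
nmcli connection reload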

Comment 22 Jaime Caamaño Ruiz 2022-01-27 17:38:28 UTC
- How upgrade-friendly would it be to force users to create profiles in a given manner when they might have already created them? Are all profiles re-generated during an upgrade?
- Maybe we can handle increasing the priority automatically in configure-ovs?
- Should we handle this for SDN as well? While configure-ovs does not reload NM for SDN, some other workflow could do it and trigger the problem there as well.

Comment 23 Jaime Caamaño Ruiz 2022-01-27 20:13:13 UTC
AFAIK, NM already generates default DHCP profiles for wired connections without the need for an ip=dhcp kernel argument. It also generates them with a -999 priority and generates a separate one for each device. Would this be sufficient?
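
A quick way to compare the priorities of the generated default profiles against the day-1 profile would be something like:

nmcli -f NAME,UUID,AUTOCONNECT-PRIORITY connection show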

Comment 25 Yoav Porag 2022-02-08 16:45:44 UTC
(In reply to Ben Nemec from comment #20)

I have successfully reproduced this on my environment.

Comment 28 Yoav Porag 2022-02-08 21:41:38 UTC
Clarification:

[core@master-0-0 ~]$ ip route |grep static
default via 192.168.123.1 dev br-ex proto static metric 800 

After applying the machine config, the route is no longer "auto" but static, as it should be.
After rebooting the node with the machineconfig applied, the node preserves its IP.

@vpickard 
there is still a discussion to be had about whether this can be integrated into the installation process, but if not it needs to be documented for customers.

Comment 30 Adina Wolff 2022-02-16 17:17:58 UTC
The issue is reproduced on a node with day1 networking that is added to a cluster:

networking secret:

before node reboot:

[core@openshift-worker-3 ~]$ nmcli con show ovs-if-br-ex
..............
ipv4.method:                            manual
ipv4.dns:                               10.46.0.31
ipv4.dns-search:                        --
ipv4.dns-options:                       --
ipv4.dns-priority:                      40
ipv4.addresses:                         10.46.29.136/25
ipv4.gateway:                           --
ipv4.routes:                            { ip = 0.0.0.0/0, nh = 10.46.29.254 table=254 }
ipv4.route-metric:                      -1
ipv4.route-table:                       0 (unspec)
ipv4.routing-rules:                     --


after node reboot:

[core@openshift-worker-3 ~]$ nmcli con show ovs-if-br-ex
..........
ipv4.method:                            auto
ipv4.dns:                               --
ipv4.dns-search:                        --
ipv4.dns-options:                       --
ipv4.dns-priority:                      0
ipv4.addresses:                         --
ipv4.gateway:                           --
ipv4.routes:                            --
ipv4.route-metric:                      49
ipv4.route-table:                       0 (unspec)
ipv4.routing-rules:                     --

Comment 31 Adina Wolff 2022-02-16 19:15:19 UTC
It looks like the issue does not reproduce in an environment with no DHCP server:

[core@master-0-0 ~]$ nmcli con show ovs-if-br-ex
connection.id:                          ovs-if-br-ex
connection.uuid:                        34872a8f-3b68-4e30-b34b-66dee095395d
connection.stable-id:                   --
connection.type:                        ovs-interface
connection.interface-name:              br-ex
connection.autoconnect:                 yes
connection.autoconnect-priority:        0
........
ipv4.method:                            manual
ipv4.dns:                               192.168.123.1
ipv4.dns-search:                        --
ipv4.dns-options:                       --
ipv4.dns-priority:                      40
ipv4.addresses:                         192.168.123.11/24
ipv4.gateway:                           --
ipv4.routes:                            { ip = 0.0.0.0/0, nh = 192.168.123.1 table=254 }
ipv4.route-metric:                      -1
ipv4.route-table:                       0 (unspec)
ipv4.routing-rules:                     --
ipv4.ignore-auto-routes:                no
ipv4.ignore-auto-dns:                   no
ipv4.dhcp-client-id:                    mac
ipv4.dhcp-iaid:                         --
[core@master-0-0 ~]$ sudo reboot
Connection to master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com closed by remote host.
Connection to master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com closed.
[kni@provisionhost-0-0 ~]$ ssh core.qe.lab.redhat.com
Red Hat Enterprise Linux CoreOS 410.84.202202142040-0
  Part of OpenShift 4.10, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.10/architecture/architecture-rhcos.html

---
Last login: Wed Feb 16 17:04:49 2022 from 192.168.123.56
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service

[core@master-0-0 ~]$ nmcli con show ovs-if-br-ex
connection.id:                          ovs-if-br-ex
connection.uuid:                        be29452f-5b54-40d4-b75f-20fe043e0166
connection.stable-id:                   --
connection.type:                        ovs-interface
connection.interface-name:              br-ex
connection.autoconnect:                 yes
connection.autoconnect-priority:        0
.....................
ipv4.method:                            manual
ipv4.dns:                               192.168.123.1
ipv4.dns-search:                        --
ipv4.dns-options:                       --
ipv4.dns-priority:                      40
ipv4.addresses:                         192.168.123.11/24
ipv4.gateway:                           --
ipv4.routes:                            { ip = 0.0.0.0/0, nh = 192.168.123.1 table=254 }
ipv4.route-metric:                      -1
ipv4.route-table:                       0 (unspec)
ipv4.routing-rules:                     --
[core@master-0-0 ~]$

Comment 34 Adina Wolff 2022-03-06 18:19:05 UTC
@bnemec I just tried to apply the networkConfig workaround to a cluster that was deployed with 4.8, then upgraded to 4.10, and scaled up with the new node having a static IP.
It doesn't seem to have worked.

must-gather:
 http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather.local.2689457161302658140.tar.gz


ClusterID: da2388a3-ca07-43f5-a002-4a47081c6276
ClusterVersion: Stable at "4.10.0-0.nightly-2022-03-01-224543"
ClusterOperators:
	clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
	clusteroperator/machine-config is not available (Cluster not available for [{operator 4.10.0-0.nightly-2022-03-01-224543}]) because Failed to resync 4.10.0-0.nightly-2022-03-01-224543 because: failed to apply machine config daemon manifests: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)
	clusteroperator/monitoring is not available (Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.) because Failed to rollout the stack. Error: updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 1 updated replicas
updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas
	clusteroperator/network is degraded because DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2022-03-06T08:05:38Z
DaemonSet "openshift-multus/multus-additional-cni-plugins" rollout is not making progress - last change 2022-03-06T08:05:38Z
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-03-06T08:05:38Z

Comment 35 Yoav Porag 2022-03-07 07:06:13 UTC
Regarding Adina's previous comment, note that testing the workaround on a normally provisioned cluster (one not upgraded from 4.8) seems to have succeeded.
Tests were done on IPI on virtual baremetal.

Comment 36 Ben Nemec 2022-03-10 16:58:40 UTC
Do you mean rebooting the scaled up node fails, or that the scaleup itself fails?

Comment 37 Adina Wolff 2022-03-14 18:17:17 UTC
(In reply to Ben Nemec from comment #36)
> Do you mean rebooting the scaled up node fails, or that the scaleup itself
> fails?

I mean that applying the workaround failed. The above error is seen after trying to create the machine-config for workers. 
Again, please note that this is a cluster that was deployed with 4.8, then upgraded to 4.10 and then scaled up with day1 network configuration.

Comment 38 Hao Zhou 2022-03-16 02:25:09 UTC
I used real baremetal to reproduce this issue. I noticed two points:

1. After the machine is restarted, the static IP address is overwritten by DHCP only on the worker, not on the master.

2. After the worker's static IP address is overwritten by DHCP, its corresponding Machine CR still displays the manually configured IP in the Status.Addresses.Address field.

In addition, the network config set by day-1 networking will eventually be saved to [BMH.Spec.Provisioningnetworkdataname](https://github.com/metal3-io/baremetal-operator/blob/main/apis/metal3.io/v1alpha1/baremetalhost_types.go#L376-L380), and I'm not sure whether it should continue to take effect after the BMH is provisioned, as this configuration is no longer in BMH.Spec.

Comment 39 Zane Bitter 2022-03-16 02:40:06 UTC
(In reply to Adina Wolff from comment #34)
> @bnemec I just tried to apply the networkConfig workaround to a
> cluster that was deployed with 4.8, then upgraded to 4.10 and scaled up with
> the new node having static ip. 
> It doesn't seem to have worked.

It's not possible to use network config when scaling up clusters installed pre-4.10. These clusters will still be installing from the QCOW image (not from the live ISO), so they won't get any of the network config applied to the nodes. Manual intervention is needed to change the image type in the MachineSet before scaling up will be able to use network config.

Given that it's not possible to have installed 4.8 (or 4.9) in an environment where static IPs are required, I'm not sure that this is a test case we need to worry about.
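
For reference, the manual intervention would look roughly like this (a sketch only; the providerSpec field layout and image URL must be taken from the actual cluster, and <machineset-name>/<n> are placeholders):

# inspect the image currently referenced by the worker MachineSet
oc -n openshift-machine-api get machineset <machineset-name> -o yaml | grep -A3 ' image:'
# edit spec.template.spec.providerSpec.value.image to point at the 4.10-style image, then scale up
oc -n openshift-machine-api edit machineset <machineset-name>
oc -n openshift-machine-api scale machineset <machineset-name> --replicas=<n>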

Comment 40 Adina Wolff 2022-03-16 07:33:44 UTC
@zbitter Indeed, I updated the image in the machineSet before I was able to properly scale up. (I got instructions from @shardy)

If you are saying that this scenario shouldn't be supported at all, we can stop testing it.

Comment 42 Zane Bitter 2022-03-16 13:42:00 UTC
(In reply to Adina Wolff from comment #40)
> @zbitter Indeed, I updated the image in the machineSet before I
> was able to properly scale-up. (I got instructions from @shardy)

Ah, OK. If the MachineSet is updated then it should work the same as a fresh 4.10 cluster, so if not then it is indeed a bug.

Comment 43 Adina Wolff 2022-03-16 14:03:22 UTC
Thanks @zbitter, so would you suggest opening a separate bug to track this, or is the comment here sufficient?

Comment 44 Zane Bitter 2022-03-21 14:16:54 UTC
Here is probably sufficient if there's no evidence of a separate cause.

Comment 54 Adina Wolff 2022-06-14 14:07:01 UTC
*** Bug 2082962 has been marked as a duplicate of this bug. ***

Comment 57 Ben Nemec 2022-11-22 16:15:16 UTC
A better workaround for more recent NM versions can be found in https://bugzilla.redhat.com/show_bug.cgi?id=1934122#c24

Comment 58 Dave Gordon 2022-12-14 19:32:34 UTC
Closing -- there is a workaround documented.

