Bug 1913248 - Creating vlan interface on top of a bond device via NodeNetworkConfigurationPolicy fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 2.6.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 2.6.0
Assignee: Quique Llorente
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks: 1916413
 
Reported: 2021-01-06 11:19 UTC by Marius Cornea
Modified: 2021-03-10 11:23 UTC
CC: 5 users

Fixed In Version: kubernetes-nmstate-handler-container-v2.6.0-17
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1916413 (view as bug list)
Environment:
Last Closed: 2021-03-10 11:22:46 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
nmstate_pods.log (16.57 KB, text/plain)
2021-01-06 11:19 UTC, Marius Cornea


Links:
Red Hat Product Errata RHSA-2021:0799 (last updated 2021-03-10 11:23:17 UTC)

Internal Links: 1917813

Description Marius Cornea 2021-01-06 11:19:21 UTC
Created attachment 1744864 [details]
nmstate_pods.log

Description of problem:

I am trying to create the following NodeNetworkConfigurationPolicy:

---
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bigip-bridges
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: bond0.375
      type: vlan
      state: up
      vlan:
        base-iface: bond0
        id: 375

    - name: bigip-mgmt
      description: Linux bridge with bond0 vlan375 as a port!
      type: linux-bridge
      state: up
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: bond0.375

    - name: bond0.376
      type: vlan
      state: up
      vlan:
        base-iface: bond0
        id: 376

    - name: bigip-ha
      description: Linux bridge with bond0 vlan376 as a port!
      type: linux-bridge
      state: up
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: bond0.376


which results in the following error message:

{"level":"error","ts":1609931516.2395382,"logger":"controllers.NodeNetworkConfigurationPolicy","msg":"Rolling back network configuration, manual intervention needed: ","nodenetworkconfigurationpolicy":"/bigip-bridges","error":"error reconciling NodeNetworkConfigurationPolicy at desired state apply: , failed to execute nmstatectl set --no-commit --timeout 480: 'exit status 1' '' 'Traceback (most recent call last):\n  File \"/usr/bin/nmstatectl\", line 11, in <module>\n    load_entry_point('nmstate==0.3.4', 'console_scripts', 'nmstatectl')()\n  File \"/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py\", line 67, in main\n    return args.func(args)\n  File \"/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py\", line 267, in apply\n    args.save_to_disk,\n  File \"/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py\", line 289, in apply_state\n    save_to_disk=save_to_disk,\n  File \"/usr/lib/python3.6/site-packages/libnmstate/netapplier.py\", line 69, in apply\n    net_state = NetState(desired_state, current_state, save_to_disk)\n  File \"/usr/lib/python3.6/site-packages/libnmstate/net_state.py\", line 40, in __init__\n    save_to_disk,\n  File \"/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py\", line 106, in __init__\n    self._pre_edit_validation_and_cleanup()\n  File \"/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py\", line 128, in _pre_edit_validation_and_cleanup\n    self._validate_over_booked_slaves()\n  File \"/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py\", line 423, in _validate_over_booked_slaves\n    f\"Interface {iface.name} slave {slave_name} is \"\nlibnmstate.error.NmstateValueError: Interface br-ex slave enp5s0 is already enslaved by interface bond0\n'","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/nmstate/kubernetes-nmstate/controllers.(*NodeNetworkConfigurationPolicyReconciler).Reconcile\n\t/remote-source/app/controllers/nodenetworkconfigurationpolicy_controller.go:277\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:244\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:197\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90"}

This is the networking layout on the nodes:

enp4s0
bond0: enp5s0,enp6s0
br-ex bridge created during deployment includes bond0 interface

nmcli con
NAME                UUID                                  TYPE           DEVICE 
ovs-if-br-ex        e0b6fe95-b4c6-4bbb-8f15-50ad1bd6b718  ovs-interface  br-ex  
Wired connection 1  9ca44b7c-265d-3fe3-bc51-7e52d84ab74c  ethernet       enp4s0 
br-ex               49a80196-d3df-42d9-ac1b-33282d94ae8d  ovs-bridge     br-ex  
ovs-if-phys0        22fb5643-4768-4fff-839a-122a0868a6c5  bond           bond0  
ovs-port-br-ex      3c390181-35c4-4f6b-9fe6-464f10210121  ovs-port       br-ex  
ovs-port-phys0      69aaa7d4-450d-40f6-80f9-696ebbc6bc72  ovs-port       bond0  
System enp5s0       9310e179-14b6-430a-6843-6491c047d532  ethernet       enp5s0 
System enp6s0       b43fa2aa-5a85-7b0a-9a20-469067dba6d6  ethernet       enp6s0 
bond0               ad33d8b0-1f7b-cab9-9447-ba07f855b143  bond           --     


[core@worker-0-0 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:35:d4:fc brd ff:ff:ff:ff:ff:ff
    inet6 fd00:1101::552a:b19:e27a:4e9/128 scope global dynamic noprefixroute 
       valid_lft 2045sec preferred_lft 2045sec
    inet6 fe80::83db:b124:3a5d:20fd/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: enp5s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
    link/ether 52:54:00:a5:49:43 brd ff:ff:ff:ff:ff:ff
4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
    link/ether 52:54:00:a5:49:43 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether 52:54:00:a5:49:43 brd ff:ff:ff:ff:ff:ff
7: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 42:0e:c2:af:fe:9d brd ff:ff:ff:ff:ff:ff
8: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:a5:49:43 brd ff:ff:ff:ff:ff:ff
    inet 192.168.123.123/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
       valid_lft 2229sec preferred_lft 2229sec
    inet 192.168.123.10/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::acb7:e581:b640:6c69/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[...]


Version-Release number of selected component (if applicable):
registry-proxy.engineering.redhat.com/rh-osbs/iib:35589
python3-libnmstate-0.3.4-17.el8_3.noarch
nmstate-0.3.4-17.el8_3.noarch

How reproducible:


Steps to Reproduce:

1. Deploy OCP 4.7 via the baremetal IPI flow. Nodes have the following network layout: one NIC used for the provisioning network and two NICs grouped in a bond used for the control plane network

2. Deploy CNV 2.6

3. Create the following NNCP

---
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bigip-bridges
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: bond0.375
      type: vlan
      state: up
      vlan:
        base-iface: bond0
        id: 375

    - name: bigip-mgmt
      description: Linux bridge with bond0 vlan375 as a port!
      type: linux-bridge
      state: up
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: bond0.375

    - name: bond0.376
      type: vlan
      state: up
      vlan:
        base-iface: bond0
        id: 376

    - name: bigip-ha
      description: Linux bridge with bond0 vlan376 as a port!
      type: linux-bridge
      state: up
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: bond0.376
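
Saved to a file such as bigip-bridges.yaml (the filename is only illustrative), the policy can be applied with a standard command like:

oc apply -f bigip-bridges.yaml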


Actual results:

The NNCP fails to apply; the nmstate handler rolls back the network configuration with the NmstateValueError shown above.

Expected results:

NNCP is configured correctly

Additional info:

Attaching the nmstate pod logs.
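
For reference (not part of the original report), the attached handler logs can be gathered per pod with a plain oc logs call, where the pod name is a placeholder:

oc -n openshift-cnv logs <nmstate-handler-pod>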

Comment 1 Petr Horáček 2021-01-06 11:35:20 UTC
The traceback, split into multiple lines:

  File \"/usr/bin/nmstatectl\", line 11, in <module>
    load_entry_point('nmstate==0.3.4', 'console_scripts', 'nmstatectl')()
  File \"/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py\", line 67, in main
    return args.func(args)
  File \"/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py\", line 267, in apply
    args.save_to_disk,
  File \"/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py\", line 289, in apply_state
    save_to_disk=save_to_disk,
  File \"/usr/lib/python3.6/site-packages/libnmstate/netapplier.py\", line 69, in apply
    net_state = NetState(desired_state, current_state, save_to_disk)
  File \"/usr/lib/python3.6/site-packages/libnmstate/net_state.py\", line 40, in __init__
    save_to_disk,
  File \"/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py\", line 106, in __init__
    self._pre_edit_validation_and_cleanup()
  File \"/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py\", line 128, in _pre_edit_validation_and_cleanup
    self._validate_over_booked_slaves()
  File \"/usr/lib/python3.6/site-packages/libnmstate/ifaces/ifaces.py\", line 423, in _validate_over_booked_slaves
    f\"Interface {iface.name} slave {slave_name} is \"
libnmstate.error.NmstateValueError: Interface br-ex slave enp5s0 is already enslaved by interface bond0

@fge I don't see anything wrong with the configuration, do you? Could it be a bug in nmstate? Note that br-ex (to which bond0 is connected) is an OVS bridge.

Comment 2 Gris Ge 2021-01-08 05:21:29 UTC
nmstate-0.3 does not support applying a state where an OVS interface holds the same name as its OVS bridge.

In my test, when br-ex does not hold DNS entries or routes, the state applies without errors.

Please confirm my guess on the network setup:

## Before nmstatectl run:

 * enp4s0: role unclear.
 * enp5s0 and enp6s0 are bond0 slaves/ports.
 * bond0 is a system port of the OVS bridge br-ex, which also has an OVS internal interface named br-ex.
 * The br-ex OVS internal interface holds the DNS entries and the default gateway.

## Nmstatectl run:
 * Create two VLANs, 375 and 376, on top of bond0, and assign each VLAN to a new Linux bridge.
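
As a quick way to confirm whether br-ex is indeed the interface holding the DNS and default gateway (a diagnostic suggestion, not taken from the original report), one can check on the node:

ip route show default
cat /etc/resolv.conf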

Comment 3 Gris Ge 2021-01-08 07:40:44 UTC
Found the root cause.

Based on the `nmcli con` output above, the names `br-ex` and `bond0` are reused across the OVS bridge, OVS port and OVS interface.
This causes nmstate-0.3 to treat `bond0`, `enp5s0` and `enp6s0` as ports/slaves of the OVS bridge br-ex, which triggers the exception above.

I will continue my investigation on this.

Comment 4 Gris Ge 2021-01-08 08:26:56 UTC
Even if nmstate somehow supported the OVS bridge/port/interface sharing the same name, you would still be blocked by this NetworkManager bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1914139

So my suggestion is to change this `baremetal IPI flow` to use different names for the OVS bridge, port and interface.
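
Purely as an illustration of what distinct naming could look like in nmstate terms (a sketch, not the configuration the IPI flow actually generates; the internal interface name ovs0 is made up), the desired state might take roughly this shape:

interfaces:
- name: br-ex
  type: ovs-bridge
  state: up
  bridge:
    port:
    - name: bond0
    - name: ovs0
- name: ovs0
  type: ovs-interface
  state: up
  ipv4:
    enabled: true
    dhcp: true

Here the OVS bridge keeps the name br-ex while its internal interface gets a separate name, so the bridge, port and interface no longer collide.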

Comment 5 Petr Horáček 2021-01-15 08:19:48 UTC
Thanks Gris.

To recapitulate, this is caused by two underlying bugs.

The first is in NetworkManager, preventing us from creating a VLAN on top of a bond attached to an OVS bridge [1]. This only happens when the OVS port and the bond share the same name, and we hope to work around it in the OVN Kubernetes setup script [2].

The second issue is in nmstate, due to special handling of the interface holding the DNS entry. We are working with the nmstate team to get a hotfix for that.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1914139
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1916413

Comment 6 Gris Ge 2021-01-18 12:59:44 UTC
Hi Petr,

Could you try `dnf copr enable packit/nmstate-nmstate-1487`?

It should fix both of the above problems.
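
For completeness, a minimal sketch of the full steps (assuming the nmstate and python3-libnmstate packages from the version list above are the ones to refresh):

dnf copr enable packit/nmstate-nmstate-1487
dnf update nmstate python3-libnmstate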

Comment 7 Petr Horáček 2021-01-19 08:12:38 UTC
Thanks Gris, Quique is testing it.

Comment 8 Quique Llorente 2021-01-19 12:17:09 UTC
(In reply to Gris Ge from comment #6)
> Hi Petr,
> 
> Could you try `dnf copr enable packit/nmstate-nmstate-1487`?
> 
> It should fix both of the above problems.

After manually applying the fixes from https://github.com/nmstate/nmstate/commit/a85b3dddf82f9e71774229740fbae6ea843d86d6 to nmstate and applying the NNCP with a nodeSelector pointing at the node running that container, it looks like it is working fine, so we can try to deliver a bugfix for it.
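
For reference, a nodeSelector limited to a single node could look like the following sketch, using the standard kubernetes.io/hostname label (the node name here is just an example):

spec:
  nodeSelector:
    kubernetes.io/hostname: worker-0-0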

Comment 9 Gris Ge 2021-01-22 13:42:30 UTC
Nmstate has fixed this issue by:

 * RHEL 8.4 with nmstate-1.0.1-1.el8
 * RHEL 8.3.0.z with nmstate-0.3.4-22.el8_3

Both are pending release.

Comment 10 Petr Horáček 2021-01-22 14:27:14 UTC
Thanks Gris, we appreciate it a lot!

Comment 12 Petr Horáček 2021-02-01 09:38:58 UTC
Marius, could you please verify whether the problem got solved using a CNV nightly build?

Comment 13 Marius Cornea 2021-02-03 10:38:56 UTC
Verified on registry-proxy.engineering.redhat.com/rh-osbs/iib:42945

oc -n openshift-cnv exec -it nmstate-handler-mp4np -- rpm -q nmstate
nmstate-0.3.4-22.el8_3.noarch


oc get NodeNetworkConfigurationPolicy/bigip-bridges -o yaml
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationPolicy
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"nmstate.io/v1alpha1","kind":"NodeNetworkConfigurationPolicy","metadata":{"annotations":{},"name":"bigip-bridges"},"spec":{"desiredState":{"interfaces":[{"name":"bond0.375","state":"up","type":"vlan","vlan":{"base-iface":"bond0","id":375}},{"bridge":{"options":{"stp":{"enabled":false}},"port":[{"name":"bond0.375"}]},"description":"Linux bridge with bond0 vlan375 as a port!","name":"bigip-mgmt","state":"up","type":"linux-bridge"},{"name":"bond0.376","state":"up","type":"vlan","vlan":{"base-iface":"bond0","id":376}},{"bridge":{"options":{"stp":{"enabled":false}},"port":[{"name":"bond0.376"}]},"description":"Linux bridge with bond0 vlan376 as a port!","name":"bigip-ha","state":"up","type":"linux-bridge"}]},"nodeSelector":{"node-role.kubernetes.io/worker":""}}}
    nmstate.io/webhook-mutating-timestamp: "1612348466896468356"
  creationTimestamp: "2021-02-03T10:34:26Z"
  generation: 1
  managedFields:
  - apiVersion: nmstate.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:desiredState:
          .: {}
          f:interfaces: {}
        f:nodeSelector:
          .: {}
          f:node-role.kubernetes.io/worker: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-02-03T10:34:26Z"
  - apiVersion: nmstate.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:conditions: {}
    manager: manager
    operation: Update
    time: "2021-02-03T10:34:40Z"
  name: bigip-bridges
  resourceVersion: "63366"
  selfLink: /apis/nmstate.io/v1beta1/nodenetworkconfigurationpolicies/bigip-bridges
  uid: 591503a1-295f-46d2-a223-c2f2994da03d
spec:
  desiredState:
    interfaces:
    - name: bond0.375
      state: up
      type: vlan
      vlan:
        base-iface: bond0
        id: 375
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: bond0.375
      description: Linux bridge with bond0 vlan375 as a port!
      name: bigip-mgmt
      state: up
      type: linux-bridge
    - name: bond0.376
      state: up
      type: vlan
      vlan:
        base-iface: bond0
        id: 376
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: bond0.376
      description: Linux bridge with bond0 vlan376 as a port!
      name: bigip-ha
      state: up
      type: linux-bridge
  nodeSelector:
    node-role.kubernetes.io/worker: ""
status:
  conditions:
  - lastHearbeatTime: "2021-02-03T10:34:53Z"
    lastTransitionTime: "2021-02-03T10:34:53Z"
    message: 2/2 nodes successfully configured
    reason: SuccessfullyConfigured
    status: "True"
    type: Available
  - lastHearbeatTime: "2021-02-03T10:34:53Z"
    lastTransitionTime: "2021-02-03T10:34:53Z"
    reason: SuccessfullyConfigured
    status: "False"
    type: Degraded
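
In addition to the policy conditions above, the per-node outcome can be cross-checked through the NodeNetworkConfigurationEnactment objects, for example (an extra check, not part of the original verification):

oc get nnce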

Comment 16 errata-xmlrpc 2021-03-10 11:22:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 2.6.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0799

