Bug 2057613 - nmpolicy capture - race condition when applying teardown nncp; nnce fails
Summary: nmpolicy capture - race condition when applying teardown nncp; nnce fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.1
Assignee: Quique Llorente
QA Contact: Adi Zavalkovsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-23 17:46 UTC by Ruth Netser
Modified: 2022-05-18 20:27 UTC (History)
4 users

Fixed In Version: kubernetes-nmstate-handler v4.10.1-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-18 20:27:03 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2022:4668 (last updated 2022-05-18 20:27:23 UTC)

Description Ruth Netser 2022-02-23 17:46:10 UTC
Description of problem:
Applying a teardown nncp that uses nmpolicy captures may hit a race condition in which the first node's nnce fails to apply.


Version-Release number of selected component (if applicable):
CNV 4.10.0 
nmstate-handler v4.10.0-48

How reproducible:
Intermittent; depends on how fast the nnce is applied.

Steps to Reproduce:
1. Create a capture nncp (with a node selector; tested with 2 nodes)
2. Wait for it to become Available
3. Delete the nncp (and wait for it to be deleted)
4. Create the teardown nncp immediately (see the shell sketch below)
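
A minimal shell sketch of these steps, assuming the capture and teardown manifests below are saved as capture-br1-deployment.yaml and capture-br1-teardown.yaml (hypothetical file names):

$ oc apply -f capture-br1-deployment.yaml
$ oc wait nncp capture-br1-deployment --for=condition=Available --timeout=120s
$ oc delete nncp capture-br1-deployment --wait=true
$ oc apply -f capture-br1-teardown.yaml   # no delay; applying immediately is what exposes the race
$ oc get nnce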

Actual results:
Sometimes the first node's nnce fails on:
    message: |
      failure generating desiredState and capturedStates
        failed to generate state, err
          failed to resolve capture expression, err
            resolve error
              resolve error
                step 'interfaces' from path '[interfaces]' not found at map state 'map[]'
      | capture.capture-br1-routes | routes.running.next-hop-interface := capture.capture-br1.interfaces.0.bridge.port.0.name


If we wait before creating the teardown nncp, everything works as expected.
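
As a workaround sketch (the delay length is an assumption, not a verified threshold):

$ oc delete nncp capture-br1-deployment --wait=true
$ sleep 30   # hypothetical delay, long enough for the node's reported state to refresh
$ oc apply -f capture-br1-teardown.yaml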

Expected results:
The teardown nncp should be applied successfully on all selected nodes.

Additional info:
$ oc get nnce
NAME                                                      STATUS
c01-rn-410-21-sktq8-worker-0-dkddz.capture-br1-teardown   Failing
c01-rn-410-21-sktq8-worker-0-llxx8.capture-br1-teardown   Available

$ oc get nncp
NAME                   STATUS
capture-br1-teardown   Degraded


================ Capture nncp ====================
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: capture-br1-deployment
spec:
  capture:
    default-gw: routes.running.destination=="0.0.0.0/0"
    default-gw-routes-takeover: capture.primary-nic-routes | routes.running.next-hop-interface
      := "capture-br1"
    primary-nic: interfaces.name==capture.default-gw.routes.running.0.next-hop-interface
    primary-nic-routes: routes.running.next-hop-interface==capture.primary-nic.interfaces.0.name
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: '{{ capture.primary-nic.interfaces.0.name }}'
      ipv4: '{{ capture.primary-nic.interfaces.0.ipv4 }}'
      ipv6: '{{ capture.primary-nic.interfaces.0.ipv6 }}'
      name: capture-br1
      state: up
      type: linux-bridge
    routes:
      config: '{{ capture.default-gw-routes-takeover.routes.running }}'
  nodeSelector:
    capture: allow


=============== Teardown nncp =====================
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: capture-br1-teardown
spec:
  capture:
    capture-br1: interfaces.name == "capture-br1"
    capture-br1-routes: routes.running.next-hop-interface == "capture-br1"
    capture-br1-routes-takeover: capture.capture-br1-routes | routes.running.next-hop-interface
      := capture.capture-br1.interfaces.0.bridge.port.0.name
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port: []
      ipv4:
        auto-dns: true
        dhcp: false
        enabled: false
      ipv6:
        auto-dns: true
        autoconf: false
        dhcp: false
        enabled: false
      name: capture-br1
      state: absent
      type: linux-bridge
    - ipv4: '{{ capture.capture-br1.interfaces.0.ipv4 }}'
      ipv6: '{{ capture.capture-br1.interfaces.0.ipv6 }}'
      name: '{{ capture.capture-br1.interfaces.0.bridge.port.0.name }}'
      state: up
      type: ethernet
    routes:
      config: '{{ capture.capture-br1-routes-takeover.routes.running }}'
  nodeSelector:
    capture: allow


=============== failed nnce =====================
$ oc get nnce c01-rn-410-21-sktq8-worker-0-dkddz.capture-br1-teardown -oyaml
apiVersion: nmstate.io/v1beta1
kind: NodeNetworkConfigurationEnactment
metadata:
  creationTimestamp: "2022-02-23T17:39:06Z"
  generation: 1
  labels:
    app.kubernetes.io/component: network
    app.kubernetes.io/managed-by: cnao-operator
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.10.0
    nmstate.io/node: c01-rn-410-21-sktq8-worker-0-dkddz
    nmstate.io/policy: capture-br1-teardown
  name: c01-rn-410-21-sktq8-worker-0-dkddz.capture-br1-teardown
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: c01-rn-410-21-sktq8-worker-0-dkddz
    uid: f5a9ead1-7732-491a-a552-f96f1851127b
  resourceVersion: "6124938"
  uid: 7f7ce686-03b1-4b8c-8a66-d2d430571821
status:
  conditions:
  - lastHearbeatTime: "2022-02-23T17:39:06Z"
    lastTransitionTime: "2022-02-23T17:39:06Z"
    message: |
      failure generating desiredState and capturedStates
        failed to generate state, err
          failed to resolve capture expression, err
            resolve error
              resolve error
                step 'interfaces' from path '[interfaces]' not found at map state 'map[]'
      | capture.capture-br1-routes | routes.running.next-hop-interface := capture.capture-br1.interfaces.0.bridge.port.0.name
      | ...............................................................^
    messageEncoded: H4sIAAAAAAAA/6ROu24CMRDs+Yrp3OQs0iLlK1IiIi147rDEra31XkRxHx8hcIiUdHHj3XntjJIvixETlSaedUJiy8b07uKEaMJJqi8daTvcPEzw0l1Eu1EvoNlP2tjK5ZM9ALxWY2u56EPaBTQrv9fmrAhZnTbKiS1gtDKjip8R9k/8EKDFMZZFE8QxS70XQpil7g9hs/YK8fEPR3sdrCzOhhX3IdqimnWKyqsP51KH7xPYvf2VEJ8d4jYeLaeJsRbzuI0qMzcr4v/ex1cAAAD//4X6e16gAQAA
    reason: FailedToConfigure
    status: "True"
    type: Failing
  - lastHearbeatTime: "2022-02-23T17:39:06Z"
    lastTransitionTime: "2022-02-23T17:39:06Z"
    reason: FailedToConfigure
    status: "False"
    type: Available
  - lastHearbeatTime: "2022-02-23T17:39:06Z"
    lastTransitionTime: "2022-02-23T17:39:06Z"
    reason: FailedToConfigure
    status: "False"
    type: Progressing
  - lastHearbeatTime: "2022-02-23T17:39:06Z"
    lastTransitionTime: "2022-02-23T17:39:06Z"
    reason: FailedToConfigure
    status: "False"
    type: Pending
  - lastHearbeatTime: "2022-02-23T17:39:06Z"
    lastTransitionTime: "2022-02-23T17:39:06Z"
    reason: SuccessfullyConfigured
    status: "False"
    type: Aborted
  desiredStateMetaInfo: {}
  policyGeneration: 1


=============== nncp =====================
$ oc get nncp -oyaml
apiVersion: v1
items:
- apiVersion: nmstate.io/v1
  kind: NodeNetworkConfigurationPolicy
  metadata:
    annotations:
      nmstate.io/webhook-mutating-timestamp: "1645637946522944281"
    creationTimestamp: "2022-02-23T17:39:06Z"
    generation: 1
    name: capture-br1-teardown
    resourceVersion: "6125381"
    uid: 11628dd4-68fe-4df3-aebd-da833df76650
  spec:
    capture:
      capture-br1: interfaces.name == "capture-br1"
      capture-br1-routes: routes.running.next-hop-interface == "capture-br1"
      capture-br1-routes-takeover: capture.capture-br1-routes | routes.running.next-hop-interface
        := capture.capture-br1.interfaces.0.bridge.port.0.name
    desiredState:
      interfaces:
      - bridge:
          options:
            stp:
              enabled: false
          port: []
        ipv4:
          auto-dns: true
          dhcp: false
          enabled: false
        ipv6:
          auto-dns: true
          autoconf: false
          dhcp: false
          enabled: false
        name: capture-br1
        state: absent
        type: linux-bridge
      - ipv4: '{{ capture.capture-br1.interfaces.0.ipv4 }}'
        ipv6: '{{ capture.capture-br1.interfaces.0.ipv6 }}'
        name: '{{ capture.capture-br1.interfaces.0.bridge.port.0.name }}'
        state: up
        type: ethernet
      routes:
        config: '{{ capture.capture-br1-routes-takeover.routes.running }}'
    nodeSelector:
      capture: allow
  status:
    conditions:
    - lastHearbeatTime: "2022-02-23T17:39:18Z"
      lastTransitionTime: "2022-02-23T17:39:06Z"
      reason: FailedToConfigure
      status: "False"
      type: Available
    - lastHearbeatTime: "2022-02-23T17:39:18Z"
      lastTransitionTime: "2022-02-23T17:39:06Z"
      message: 1/2 nodes failed to configure
      reason: FailedToConfigure
      status: "True"
      type: Degraded
    lastUnavailableNodeCountUpdate: "2022-02-23T17:39:18Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 1 Quique Llorente 2022-02-24 12:43:37 UTC
We have to calculate the capture by bypassing NNS and calling nmstatectl show directly, so that we get accurate, up-to-date state:

https://github.com/nmstate/kubernetes-nmstate/blob/09067b48a7814f384b2c20fa7a62ada3f5cd3ccf/controllers/handler/nodenetworkconfigurationpolicy_controller.go#L170

I will prepare a fix ASAP
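
To illustrate the diagnosis (a hedged sketch; the node name is taken from the report above): the cached NodeNetworkState can lag behind what nmstatectl reports on the node at the moment the capture is resolved, so the capture filter matches nothing and later steps fail on an empty map.

$ oc get nns c01-rn-410-21-sktq8-worker-0-dkddz -o jsonpath='{.status.currentState.interfaces[?(@.name=="capture-br1")].name}'
$ nmstatectl show capture-br1   # run on the node itself; the live view may already include the bridge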

Comment 2 Quique Llorente 2022-02-24 13:07:12 UTC
The upstream fix: https://github.com/nmstate/kubernetes-nmstate/pull/998

Comment 3 Adi Zavalkovsky 2022-03-28 13:02:58 UTC
Verified.

kubernetes-nmstate-handler v4.10.1-2
OCP Version 4.10.6.

Deployed the capture and teardown nncps described above, and the condition did not occur.
Nmpolicy now gets the config using nmstatectl.
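
A quick way to check that both enactments succeed, using the nmstate.io/policy label shown on the enactment above:

$ oc get nnce -l nmstate.io/policy=capture-br1-teardown   # both enactments should report Available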

Comment 9 errata-xmlrpc 2022-05-18 20:27:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.1 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4668

