Bug 1833358

Summary: NodeNetworkConfigurationPolicy failed to retrieve default gw
Product: Container Native Virtualization (CNV)
Reporter: Robert Bohne <rbohne>
Component: Networking
Assignee: Quique Llorente <ellorent>
Status: CLOSED DUPLICATE
QA Contact: Meni Yakove <myakove>
Severity: unspecified
Priority: unspecified
Version: 2.3.0
CC: cnv-qe-bugs, dholler, mhooper, nschuetz, phoracek
Target Milestone: ---
Keywords: Reopened
Target Release: 2.4.0
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-01-06 11:05:56 UTC
Type: Bug
Bug Depends On: 1748389    

Comment 1 Robert Bohne 2020-05-13 05:19:19 UTC
I reinstalled my lab and switched from CNV 2.3 Beta to 2.3 GA, and I got the same problem. The exact same configuration worked before.

Comment 2 Robert Bohne 2020-05-13 06:00:03 UTC
The error message:

  - lastHearbeatTime: "2020-05-13T05:12:24Z"
    lastTransitionTime: "2020-05-13T05:12:24Z"
    message: 'error reconciling NodeNetworkConfigurationPolicy at desired state apply:
      , rolling back desired state configuration: failed runnig probes after network
      changes: failed to retrieve default gw at runProbes: timed out waiting for the
      condition'
    reason: FailedToConfigure
    status: "True"
    type: Failing

It looks like the following probe runs into a timeout: https://github.com/nmstate/kubernetes-nmstate/blob/master/pkg/probe/probes.go#L98

Comment 3 Robert Bohne 2020-05-13 06:07:55 UTC
If I run "nmstatectl show" everything looks fine.

oc get pods -l app=kubernetes-nmstate -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
nmstate-handler-f4zqq          1/1     Running   0          27m   192.168.52.12   master-2    <none>           <none>
nmstate-handler-q2hqf          1/1     Running   0          27m   192.168.52.10   master-0    <none>           <none>
nmstate-handler-vdb7d          1/1     Running   0          27m   192.168.52.11   master-1    <none>           <none>
nmstate-handler-worker-77nvt   1/1     Running   0          15m   192.168.52.14   compute-1   <none>           <none>
nmstate-handler-worker-kr6kg   1/1     Running   0          15m   192.168.52.13   compute-0   <none>           <none>

 oc describe pod nmstate-handler-worker-kr6kg | grep 'Image ID'
    Image ID:      registry.redhat.io/container-native-virtualization/kubernetes-nmstate-handler-rhel8@sha256:a7946b4d171184c1c0f6cee1f4e63fb18a66121a7024da0132723387d945d459

oc rsh nmstate-handler-worker-77nvt nmstatectl show --json > nmstate-handler-worker-77nvt.nmstatectl.show.json

cat nmstate-handler-worker-77nvt.nmstatectl.show.json | jq '.routes.running'
[
  {
    "table-id": 254,
    "destination": "0.0.0.0/0",
    "next-hop-interface": "ens3",
    "next-hop-address": "192.168.52.1",
    "metric": 100
  },
  {
    "table-id": 254,
    "destination": "192.168.52.0/24",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 100
  },
  {
    "table-id": 254,
    "destination": "fe80::/64",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 100
  },
  {
    "table-id": 255,
    "destination": "ff00::/8",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 256
  }
]

Comment 4 Petr Horáček 2020-05-13 09:17:12 UTC
Thanks Robert for the detailed info.

Quique, would you please look into it? I think we have seen a similar issue before.

Comment 5 Quique Llorente 2020-05-13 09:33:00 UTC
Hi Robert,

Can you attach the NodeNetworkState so we can see whether the default gateway is there too?

Also, I see that IPv4 is disabled on the bridge. Since the primary NIC becomes a port of the bridge, IPv4 ends up disabled on it as well, so the node has no IP address and communication with the Kubernetes API server is lost. You need to enable DHCP on the bridge so that it takes over the primary NIC's address. If there is no DHCP and everything is static, you will have to set the IP on the bridge yourself.

Let me know if that helps.
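
For the static case, a minimal sketch of the bridge interface (under the policy's desiredState) with the address set manually; the address is illustrative, based on the node IPs shown in comment 3, so adjust it to your network:

    interfaces:
    - name: br1
      type: linux-bridge
      state: up
      ipv4:
        enabled: true
        dhcp: false
        address:
        - ip: 192.168.52.13      # illustrative; reuse the address that was on the NIC
          prefix-length: 24
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens3

In a fully static setup the default route may also need to be declared explicitly under 'routes', as discussed in the later comments.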

Comment 6 Robert Bohne 2020-05-13 11:01:47 UTC
(In reply to Quique Llorente from comment #5)
> Can you attach the NodeNetworkState so we can see whether the default
> gateway is there too?

https://gist.github.com/rbo/a6bc4628ea52b05c2babb194e95cb084 - I tried a new OCP 4.3 installation with CNV 2.3 from OperatorHub. Same problem...

In case you want access to my cluster, let me know; my lab is publicly available.

I can collect data from the customer cluster later today.

> Also, I see that IPv4 is disabled on the bridge. Since the primary NIC
> becomes a port of the bridge, IPv4 ends up disabled on it as well, so the
> node has no IP address and communication with the Kubernetes API server is
> lost. You need to enable DHCP on the bridge so that it takes over the
> primary NIC's address. If there is no DHCP and everything is static, you
> will have to set the IP on the bridge yourself.
> 
> Let me know if that helps.

Hmm, not really. I tried to configure it via nmcli on the node:

nmcli con add type bridge ifname br1 con-name br1
nmcli con add type bridge-slave ifname ens3 master br1
nmcli con modify br1 bridge.stp no
nmcli con down 'Wired connection 1'
nmcli con up br1
nmcli con mod br1 connection.autoconnect yes
nmcli con mod 'Wired connection 1' connection.autoconnect no

Unfortunately I'm not a Linux network expert at all. Could the manual nmcli configuration be a workaround for my PoC at the customer?

Comment 7 Quique Llorente 2020-05-13 12:01:46 UTC
Can you try enabling DHCP on the bridge? The bridge takes over the primary NIC's MAC address, so the DHCP server will assign the NIC's address to the bridge.

desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens10f0
      description: Linux bridge with ens10f0 as a port
      ipv4:
        enabled: true
        dhcp: true
      name: br1
      state: up
      type: linux-bridge
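
For reference, a minimal sketch of how this desiredState fits into a full policy; the policy name and node selector are illustrative, and the API version matches the one used elsewhere in this report:

apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br1-ens10f0            # illustrative name
spec:
  nodeSelector:                # illustrative; target the nodes that should get the bridge
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: br1
      description: Linux bridge with ens10f0 as a port
      type: linux-bridge
      state: up
      ipv4:
        enabled: true
        dhcp: true
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens10f0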

Comment 8 Robert Bohne 2020-05-13 12:25:44 UTC
      ipv4:
        enabled: true
        dhcp: true

That solves the problem in the customer environment. Awesome, thank you very much!

Comment 9 Petr Horáček 2020-05-13 13:00:26 UTC
Thank you both :) Closing this.

Comment 10 Mark Hooper 2020-07-10 23:36:29 UTC
I'm getting this same error when creating a VLAN sub-interface and a bridge, both without DHCP enabled. My bare-metal cluster is configured with static IPs. This is on a 4.5.0 cluster with CNV 2.3 GA.


status:
  conditions:
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    message: 'error reconciling NodeNetworkConfigurationPolicy at desired state apply:
      , rolling back desired state configuration: failed runnig probes after network
      changes: failed to retrieve default gw at runProbes: timed out waiting for the
      condition'
    reason: FailedToConfigure
    status: "True"
    type: Failing
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    reason: FailedToConfigure
    status: "False"
    type: Available
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    reason: FailedToConfigure
    status: "False"
    type: Progressing
  - lastHearbeatTime: "2020-07-10T23:29:49Z"
    lastTransitionTime: "2020-07-10T23:29:49Z"
    message: All policy selectors are matching the node
    reason: AllSelectorsMatching
    status: "True"
    type: Matching
  desiredState:
    interfaces:
    - description: VLAN 24 using eno1
      ipv4:
        dhcp: false
        enabled: false
      name: eno1.24
      state: up
      type: vlan
      vlan:
        base-iface: eno1
        id: 24
    - description: Linux bridge with eno1 as a port
      ipv4:
        bridge:
          options:
            stp:
              enabled: false
          port:
          - name: eno1.24
        dhcp: false
        enabled: false
      name: br-v24
      state: up
      type: linux-bridge
  policyGeneration: 1

Comment 11 Petr Horáček 2020-07-13 07:48:46 UTC
Hello Mark. Would you please share your routes from the host? `oc get nns <name_of_the_affected_node> -o yaml`

For 2.5, we will be moving from a default-route-based connectivity check to a DNS-based one.

Comment 12 Mark Hooper 2020-07-13 15:23:12 UTC
Below is the requested output. I have noticed that the default route (which should point to 172.30.22.1) gets removed from the host when the NNCP fails.

apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkState
metadata:
  creationTimestamp: "2020-07-10T22:22:57Z"
  generation: 1
  name: fury.h00pz.co
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: fury.h00pz.co
    uid: 10e03c93-8e23-419d-a9eb-52790a9d0f1c
  resourceVersion: "4458015"
  selfLink: /apis/nmstate.io/v1alpha1/nodenetworkstates/fury.h00pz.co
  uid: 00f15314-73fc-42dd-bdf4-c492286a3498
status:
  currentState:
    dns-resolver:
      config:
        search: []
        server:
        - 172.30.23.100
      running:
        search: []
        server:
        - 172.30.23.100
    interfaces:
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 1450
      name: br0
      state: down
      type: ovs-interface
    - ethernet:
        auto-negotiation: true
        duplex: full
        speed: 1000
      ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: F0:1F:AF:DC:78:C4
      mtu: 1500
      name: eno1
      state: down
      type: ethernet
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: F0:1F:AF:DC:78:C5
      mtu: 1500
      name: eno2
      state: down
      type: ethernet
    - ethernet:
        auto-negotiation: false
        duplex: full
        speed: 10000
      ipv4:
        address:
        - ip: 172.30.22.100
          prefix-length: 24
        dhcp: false
        enabled: true
      ipv6:
        address:
        - ip: fe80::92e2:baff:fe52:7630
          prefix-length: 64
        autoconf: false
        dhcp: false
        enabled: true
      mac-address: 90:E2:BA:52:76:30
      mtu: 1500
      name: enp8s0
      state: up
      type: ethernet
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 65536
      name: lo
      state: down
      type: unknown
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 1450
      name: tun0
      state: down
      type: ovs-interface
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: A2:6F:AD:44:5B:3B
      mtu: 65000
      name: vxlan_sys_4789
      state: down
      type: vxlan
      vxlan:
        base-iface: ""
        destination-port: 0
        id: 0
        remote: ""
    route-rules:
      config: []
    routes:
      config: []
      running:
      - destination: 172.30.22.0/24
        metric: 100
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 254
      - destination: fe80::/64
        metric: 100
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 254
      - destination: ff00::/8
        metric: 256
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 255
  lastSuccessfulUpdateTime: "2020-07-13T15:19:37Z"

Comment 13 Dominik Holler 2021-01-05 16:01:17 UTC
I also ran into this or a similar issue on OpenShift Virtualization 2.5.2. I was able to work around it by including all the 'routes:' in the 'desiredState'.

Comment 14 Mark Hooper 2021-01-05 18:12:30 UTC
(In reply to Dominik Holler from comment #13)
> I also ran into this or a similar issue on OpenShift Virtualization 2.5.2.
> I was able to work around it by including all the 'routes:' in the
> 'desiredState'.

Dominik,
I'm surprised this is still a problem, since I worked with the dev team back in July on the workaround I have in my CNV YAML here: https://github.com/h00pz/ocp-build/blob/master/cnv/4_nncp-bridge.yaml. You need to include the routes section to ensure your default gateway doesn't go missing. A sketch of what that section looks like follows below.
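
For illustration, a hedged sketch of such a routes section, which sits alongside 'interfaces:' under the policy's desiredState; the gateway and interface are taken from the values shown in comment 12 and are illustrative:

    routes:
      config:
      - destination: 0.0.0.0/0           # pin the default route so it is not dropped
        next-hop-address: 172.30.22.1    # gateway from comment 12; adjust to your network
        next-hop-interface: enp8s0       # the interface that carries the node IP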

Comment 15 Dominik Holler 2021-01-06 07:16:34 UTC
It looks like OpenShift Virtualization is using nmstate 0.2, which contains bug 1748389.

[dominik@t460p yml]$ oc exec --namespace openshift-cnv --stdin --tty nmstate-handler-kk5wb -- rpm -qa nmstate
nmstate-0.2.6-14.el8_2.noarch

Comment 16 Petr Horáček 2021-01-06 11:05:56 UTC
Thanks for re-opening, Dominik. I indeed closed this with a wrong resolution. It should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1748389.

OpenShift Virtualization 2.6 will be based on nmstate 0.3 and hopefully won't have this issue. Alas, we were unable to reproduce the problem to verify the fix.

*** This bug has been marked as a duplicate of bug 1879458 ***