Bug 1833358
| Summary: | NodeNetworkConfigurationPolicy failed to retrieve default gw | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Robert Bohne <rbohne> |
| Component: | Networking | Assignee: | Quique Llorente <ellorent> |
| Status: | CLOSED DUPLICATE | QA Contact: | Meni Yakove <myakove> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.3.0 | CC: | cnv-qe-bugs, dholler, mhooper, nschuetz, phoracek |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 2.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-06 11:05:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1748389 | | |
| Bug Blocks: | | | |
Comment 1
Robert Bohne
2020-05-13 05:19:19 UTC
The error message:

```yaml
- lastHearbeatTime: "2020-05-13T05:12:24Z"
  lastTransitionTime: "2020-05-13T05:12:24Z"
  message: 'error reconciling NodeNetworkConfigurationPolicy at desired state apply: ,
    rolling back desired state configuration: failed runnig probes after network
    changes: failed to retrieve default gw at runProbes: timed out waiting for the
    condition'
  reason: FailedToConfigure
  status: "True"
  type: Failing
```

It looks like the following probe times out: https://github.com/nmstate/kubernetes-nmstate/blob/master/pkg/probe/probes.go#L98

If I run `nmstatectl show`, everything looks fine.

```console
$ oc get pods -l app=kubernetes-nmstate -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
nmstate-handler-f4zqq          1/1     Running   0          27m   192.168.52.12   master-2    <none>           <none>
nmstate-handler-q2hqf          1/1     Running   0          27m   192.168.52.10   master-0    <none>           <none>
nmstate-handler-vdb7d          1/1     Running   0          27m   192.168.52.11   master-1    <none>           <none>
nmstate-handler-worker-77nvt   1/1     Running   0          15m   192.168.52.14   compute-1   <none>           <none>
nmstate-handler-worker-kr6kg   1/1     Running   0          15m   192.168.52.13   compute-0   <none>           <none>

$ oc describe pod nmstate-handler-worker-kr6kg | grep 'Image ID'
    Image ID: registry.redhat.io/container-native-virtualization/kubernetes-nmstate-handler-rhel8@sha256:a7946b4d171184c1c0f6cee1f4e63fb18a66121a7024da0132723387d945d459

$ oc rsh nmstate-handler-worker-77nvt nmstatectl show --json > nmstate-handler-worker-77nvt.nmstatectl.show.json
$ cat nmstate-handler-worker-77nvt.nmstatectl.show.json | jq '.routes.running'
[
  {
    "table-id": 254,
    "destination": "0.0.0.0/0",
    "next-hop-interface": "ens3",
    "next-hop-address": "192.168.52.1",
    "metric": 100
  },
  {
    "table-id": 254,
    "destination": "192.168.52.0/24",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 100
  },
  {
    "table-id": 254,
    "destination": "fe80::/64",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 100
  },
  {
    "table-id": 255,
    "destination": "ff00::/8",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 256
  }
]
```

---

Thanks Robert for the detailed info. Quique, would you please look into it? I think we have seen a similar issue before.

---

Comment 5
Quique Llorente

Hi Robert,

Can you attach the NodeNetworkState to see if the default gw is there too?

Also, I see that ipv4 is deactivated at the bridge, and since the primary NIC is going to be part of the bridge, ipv4 is deactivated there too, so the node has no IP address and communication with the kube API is lost. You need to activate DHCP at the bridge so it takes over the primary NIC's address. If there is no DHCP and everything is static, you will have to put the IP there yourself (see the static-IP sketch after this exchange).

Let me know if it helps.

---

(In reply to Quique Llorente from comment #5)
> Hi Robert,
>
> Can you attach the NodeNetworkState to see if the default gw is there too?

https://gist.github.com/rbo/a6bc4628ea52b05c2babb194e95cb084

I tried a new OCP 4.3 installation with CNV 2.3 from OperatorHub. Same problem... In case you want access to my cluster, let me know; my lab is publicly available. Data from the customer cluster I can collect later today.

> Also, I see that ipv4 is deactivated at the bridge, and since the primary NIC
> is going to be part of the bridge, ipv4 is deactivated there too, so the node
> has no IP address and communication with the kube API is lost. You need to
> activate DHCP at the bridge so it takes over the primary NIC's address. If
> there is no DHCP and everything is static, you will have to put the IP there
> yourself.
>
> Let me know if it helps.
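For reference, a minimal sketch of the static variant Quique mentions, for environments without DHCP where the bridge must be given the primary NIC's address by hand. The address and interface names below (192.168.52.13/24, ens3, br1) are illustrative, borrowed from the outputs above; this is not a tested configuration:

```yaml
# Sketch only: static-IP bridge policy, assuming the address previously
# held by the primary NIC (here ens3) is assigned manually to the bridge.
desiredState:
  interfaces:
  - name: br1
    description: Linux bridge with ens3 as a port, static IPv4
    type: linux-bridge
    state: up
    bridge:
      options:
        stp:
          enabled: false
      port:
      - name: ens3
    ipv4:
      enabled: true
      dhcp: false
      address:
      - ip: 192.168.52.13      # illustrative: the address ens3 held before
        prefix-length: 24
```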
---

Hm, not really. I tried to configure it via nmcli on the node:

```console
$ nmcli con add type bridge ifname br1 con-name br1
$ nmcli con add type bridge-slave ifname ens3 master br1
$ nmcli con modify br1 bridge.stp no
$ nmcli con down 'Wired connection 1'
$ nmcli con up br1
$ nmcli con mod br1 connection.autoconnect yes
$ nmcli con mod 'Wired connection 1' connection.autoconnect no
```

Unfortunately, I'm not a Linux network expert at all. Could the manual nmcli steps be a workaround for my PoC at the customer?

---

Can you try activating DHCP at the bridge? The bridge is going to take over the primary NIC's MAC, so the DHCP server will assign the NIC's address to the bridge.

```yaml
desiredState:
  interfaces:
  - bridge:
      options:
        stp:
          enabled: false
      port:
      - name: ens10f0
    description: Linux bridge with ens10f0 as a port
    ipv4:
      enabled: true
      dhcp: true
    name: br1
    state: up
    type: linux-bridge
```

---

That solves the problem in the customer env. Awesome, thank you very much!

---

Thank you both :) Closing this.

---

I'm getting this same error from creating a VLAN sub-interface and a bridge, both without DHCP enabled. My bare-metal cluster is configured with static IPs. This is on a 4.5.0 cluster with CNV 2.3 GA.

```yaml
status:
  conditions:
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    message: 'error reconciling NodeNetworkConfigurationPolicy at desired state apply: ,
      rolling back desired state configuration: failed runnig probes after network
      changes: failed to retrieve default gw at runProbes: timed out waiting for the
      condition'
    reason: FailedToConfigure
    status: "True"
    type: Failing
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    reason: FailedToConfigure
    status: "False"
    type: Available
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    reason: FailedToConfigure
    status: "False"
    type: Progressing
  - lastHearbeatTime: "2020-07-10T23:29:49Z"
    lastTransitionTime: "2020-07-10T23:29:49Z"
    message: All policy selectors are matching the node
    reason: AllSelectorsMatching
    status: "True"
    type: Matching
  desiredState:
    interfaces:
    - description: VLAN 24 using eno1
      ipv4:
        dhcp: false
        enabled: false
      name: eno1.24
      state: up
      type: vlan
      vlan:
        base-iface: eno1
        id: 24
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: eno1.24
      description: Linux bridge with eno1 as a port
      ipv4:
        dhcp: false
        enabled: false
      name: br-v24
      state: up
      type: linux-bridge
  policyGeneration: 1
```

---

Hello Mark. Would you please share your routes from the host? `oc get nns <name_of_the_affected_node> -o yaml`

For 2.5, we will be moving from a default-route-based connectivity check to a DNS-based one.

---

Below is the output requested. I have noticed that the default route (which should be 172.30.22.1) gets removed from the host upon failure of the NNCP.
```yaml
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkState
metadata:
  creationTimestamp: "2020-07-10T22:22:57Z"
  generation: 1
  name: fury.h00pz.co
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: fury.h00pz.co
    uid: 10e03c93-8e23-419d-a9eb-52790a9d0f1c
  resourceVersion: "4458015"
  selfLink: /apis/nmstate.io/v1alpha1/nodenetworkstates/fury.h00pz.co
  uid: 00f15314-73fc-42dd-bdf4-c492286a3498
status:
  currentState:
    dns-resolver:
      config:
        search: []
        server:
        - 172.30.23.100
      running:
        search: []
        server:
        - 172.30.23.100
    interfaces:
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 1450
      name: br0
      state: down
      type: ovs-interface
    - ethernet:
        auto-negotiation: true
        duplex: full
        speed: 1000
      ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: F0:1F:AF:DC:78:C4
      mtu: 1500
      name: eno1
      state: down
      type: ethernet
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: F0:1F:AF:DC:78:C5
      mtu: 1500
      name: eno2
      state: down
      type: ethernet
    - ethernet:
        auto-negotiation: false
        duplex: full
        speed: 10000
      ipv4:
        address:
        - ip: 172.30.22.100
          prefix-length: 24
        dhcp: false
        enabled: true
      ipv6:
        address:
        - ip: fe80::92e2:baff:fe52:7630
          prefix-length: 64
        autoconf: false
        dhcp: false
        enabled: true
      mac-address: 90:E2:BA:52:76:30
      mtu: 1500
      name: enp8s0
      state: up
      type: ethernet
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 65536
      name: lo
      state: down
      type: unknown
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 1450
      name: tun0
      state: down
      type: ovs-interface
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: A2:6F:AD:44:5B:3B
      mtu: 65000
      name: vxlan_sys_4789
      state: down
      type: vxlan
      vxlan:
        base-iface: ""
        destination-port: 0
        id: 0
        remote: ""
    route-rules:
      config: []
    routes:
      config: []
      running:
      - destination: 172.30.22.0/24
        metric: 100
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 254
      - destination: fe80::/64
        metric: 100
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 254
      - destination: ff00::/8
        metric: 256
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 255
  lastSuccessfulUpdateTime: "2020-07-13T15:19:37Z"
```

---

Comment 13
Dominik Holler

I also ran into this, or a similar issue, on OpenShift Virtualization 2.5.2. I was able to work around it by including all the 'routes:' in the 'desiredState'.

---

(In reply to Dominik Holler from comment #13)
> I also ran into this, or a similar issue, on OpenShift Virtualization 2.5.2.
> I was able to work around it by including all the 'routes:' in the
> 'desiredState'.

Dominik, I'm surprised this is still a problem, since I worked with the dev team back in July on the workaround I have in my CNV YAML here: https://github.com/h00pz/ocp-build/blob/master/cnv/4_nncp-bridge.yaml. You need to include the routes section to ensure your default gateway doesn't go missing (a sketch of this workaround follows at the end of this report).

---

Looks like OpenShift Virtualization is using nmstate 0.2, which contains bug 1748389.

```console
[dominik@t460p yml]$ oc exec --namespace openshift-cnv --stdin --tty nmstate-handler-kk5wb -- rpm -qa nmstate
nmstate-0.2.6-14.el8_2.noarch
```

---

Thanks for re-opening, Dominik. I indeed closed this with a wrong resolution. It should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1748389. OpenShift Virtualization 2.6 will be based on nmstate 0.3 and hopefully won't have this issue.

---

Alas, we were unable to reproduce the problem to verify the fix.

*** This bug has been marked as a duplicate of bug 1879458 ***
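---

A minimal sketch of the routes workaround Dominik and Mark describe, assuming the addressing from Mark's outputs above (default gateway 172.30.22.1 reachable via enp8s0). The exact fields in Mark's linked YAML may differ, so treat this as an untested illustration rather than the canonical fix:

```yaml
# Sketch only: pin the host's default route in the policy's desiredState
# so nmstate re-creates it instead of dropping it during reconfiguration.
# Gateway, interface, and VLAN names follow Mark's reported setup.
desiredState:
  interfaces:
  - description: VLAN 24 using eno1
    name: eno1.24
    state: up
    type: vlan
    vlan:
      base-iface: eno1
      id: 24
  - description: Linux bridge with eno1.24 as a port
    name: br-v24
    state: up
    type: linux-bridge
    bridge:
      options:
        stp:
          enabled: false
      port:
      - name: eno1.24
  routes:
    config:
    - destination: 0.0.0.0/0
      next-hop-address: 172.30.22.1   # the default gateway that kept disappearing
      next-hop-interface: enp8s0
      table-id: 254
```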