Bug 1833358
| Summary: | NodeNetworkConfigurationPolicy failed to retrieve default gw | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Robert Bohne <rbohne> |
| Component: | Networking | Assignee: | Quique Llorente <ellorent> |
| Status: | CLOSED DUPLICATE | QA Contact: | Meni Yakove <myakove> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.3.0 | CC: | cnv-qe-bugs, dholler, mhooper, nschuetz, phoracek |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 2.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-06 11:05:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1748389 | | |
| Bug Blocks: | | | |
Comment 1
Robert Bohne
2020-05-13 05:19:19 UTC
The error message:
- lastHearbeatTime: "2020-05-13T05:12:24Z"
  lastTransitionTime: "2020-05-13T05:12:24Z"
  message: 'error reconciling NodeNetworkConfigurationPolicy at desired state apply:
    , rolling back desired state configuration: failed runnig probes after network
    changes: failed to retrieve default gw at runProbes: timed out waiting for the
    condition'
  reason: FailedToConfigure
  status: "True"
  type: Failing
It looks like the following probe runs into a timeout: https://github.com/nmstate/kubernetes-nmstate/blob/master/pkg/probe/probes.go#L98
If I run "nmstatectl show", everything looks fine.
oc get pods -l app=kubernetes-nmstate -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nmstate-handler-f4zqq 1/1 Running 0 27m 192.168.52.12 master-2 <none> <none>
nmstate-handler-q2hqf 1/1 Running 0 27m 192.168.52.10 master-0 <none> <none>
nmstate-handler-vdb7d 1/1 Running 0 27m 192.168.52.11 master-1 <none> <none>
nmstate-handler-worker-77nvt 1/1 Running 0 15m 192.168.52.14 compute-1 <none> <none>
nmstate-handler-worker-kr6kg 1/1 Running 0 15m 192.168.52.13 compute-0 <none> <none>
oc describe pod nmstate-handler-worker-kr6kg | grep 'Image ID'
Image ID: registry.redhat.io/container-native-virtualization/kubernetes-nmstate-handler-rhel8@sha256:a7946b4d171184c1c0f6cee1f4e63fb18a66121a7024da0132723387d945d459
oc rsh nmstate-handler-worker-77nvt nmstatectl show --json > nmstate-handler-worker-77nvt.nmstatectl.show.json
cat nmstate-handler-worker-77nvt.nmstatectl.show.json | jq '.routes.running'
[
  {
    "table-id": 254,
    "destination": "0.0.0.0/0",
    "next-hop-interface": "ens3",
    "next-hop-address": "192.168.52.1",
    "metric": 100
  },
  {
    "table-id": 254,
    "destination": "192.168.52.0/24",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 100
  },
  {
    "table-id": 254,
    "destination": "fe80::/64",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 100
  },
  {
    "table-id": 255,
    "destination": "ff00::/8",
    "next-hop-interface": "ens3",
    "next-hop-address": "",
    "metric": 256
  }
]
Thanks, Robert, for the detailed info. Quique, would you please look into it? I think we have seen a similar issue before.

Hi Robert,

Can you attach the NodeNetworkState to see if the default gw is there too? Also, I see that ipv4 is deactivated at the bridge, and since the primary NIC is going to be part of the bridge, ipv4 is deactivated there too, so the node has no IP address and communication with the kube API is lost. You need to activate DHCP at the bridge so it takes over the primary NIC address. In case there is no DHCP and everything is static, then you will have to put the IP there yourself.

Let me know if it helps.

(In reply to Quique Llorente from comment #5)
> Hi Robert,
>
> Can you attach the NodeNetworkState to see if the default gw is there too?

https://gist.github.com/rbo/a6bc4628ea52b05c2babb194e95cb084 - I tried a new OCP 4.3 installation with CNV 2.3 from OperatorHub. Same problem... In case you want access to my cluster: let me know, my lab is publicly available. Data from the customer cluster I can collect later today.

> Also I see that ipv4 is deactivated at the bridge, and since the primary NIC is
> going to be part of the bridge, ipv4 is deactivated there too, so
> the node has no IP address and communication with the kube API is lost. You need
> to activate DHCP at the bridge so it takes over the primary NIC address. In case
> there is no DHCP and everything is static, then you will have to put the IP there yourself.
>
> Let me know if it helps.

Mh, not really. I tried to configure it via nmcli on the node:

nmcli con add type bridge ifname br1 con-name br1
nmcli con add type bridge-slave ifname ens3 master br1
nmcli con modify br1 bridge.stp no
nmcli con down 'Wired connection 1'
nmcli con up br1
nmcli con mod br1 connection.autoconnect yes
nmcli con mod 'Wired connection 1' connection.autoconnect no

Unfortunately, I'm not a Linux network expert at all. Could the manual nmcli configuration be a workaround for my PoC at the customer?

Can you try activating DHCP at the bridge? The bridge is going to take over the primary NIC MAC, so the DHCP server will assign the address from the NIC to the bridge.
desiredState:
  interfaces:
  - bridge:
      options:
        stp:
          enabled: false
      port:
      - name: ens10f0
    description: Linux bridge with ens10f0 as a port
    ipv4:
      enabled: true
      dhcp: true
    name: br1
    state: up
    type: linux-bridge
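A quick way to confirm the bridge actually took over the address and the default gateway after applying such a policy (a sketch only; the pod and bridge names are the ones used earlier in this report):

# Check from a handler pod that br1 now owns the node IP and that a
# 0.0.0.0/0 route via the bridge exists.
oc rsh nmstate-handler-worker-77nvt nmstatectl show --json \
  | jq '.interfaces[] | select(.name == "br1") | .ipv4'
oc rsh nmstate-handler-worker-77nvt nmstatectl show --json \
  | jq '.routes.running[] | select(.destination == "0.0.0.0/0")'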
That solves the problem in the customer environment. Awesome, thank you very much!

Thank you both :) Closing this.

I'm getting this same error when creating a VLAN sub-interface and a bridge, both without DHCP enabled. My bare-metal cluster is configured with static IPs. This is on a 4.5.0 cluster with CNV 2.3 GA.
status:
  conditions:
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    message: 'error reconciling NodeNetworkConfigurationPolicy at desired state apply:
      , rolling back desired state configuration: failed runnig probes after network
      changes: failed to retrieve default gw at runProbes: timed out waiting for the
      condition'
    reason: FailedToConfigure
    status: "True"
    type: Failing
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    reason: FailedToConfigure
    status: "False"
    type: Available
  - lastHearbeatTime: "2020-07-10T23:32:03Z"
    lastTransitionTime: "2020-07-10T23:32:03Z"
    reason: FailedToConfigure
    status: "False"
    type: Progressing
  - lastHearbeatTime: "2020-07-10T23:29:49Z"
    lastTransitionTime: "2020-07-10T23:29:49Z"
    message: All policy selectors are matching the node
    reason: AllSelectorsMatching
    status: "True"
    type: Matching
desiredState:
  interfaces:
  - description: VLAN 24 using eno1
    ipv4:
      dhcp: false
      enabled: false
    name: eno1.24
    state: up
    type: vlan
    vlan:
      base-iface: eno1
      id: 24
  - description: Linux bridge with eno1 as a port
    bridge:
      options:
        stp:
          enabled: false
      port:
      - name: eno1.24
    ipv4:
      dhcp: false
      enabled: false
    name: br-v24
    state: up
    type: linux-bridge
policyGeneration: 1
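One way to see whether the default route survived the policy application is to pull the running routes straight out of the NodeNetworkState (a sketch; substitute the affected node name):

# List the running routes reported for the node; on a healthy host this
# should include a 0.0.0.0/0 entry (here expected via 172.30.22.1).
oc get nns <name_of_the_affected_node> -o json \
  | jq '.status.currentState.routes.running'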
Hello Mark. Would you please share your routes from the host? `oc get nns <name_of_the_affected_node> -o yaml`

For 2.5, we will be moving from a default-route-based connectivity check to a DNS-based one.

Below is the output requested. I have noticed that the default route (which should be 172.30.22.1) gets removed from the host upon failure of the NNCP.
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkState
metadata:
  creationTimestamp: "2020-07-10T22:22:57Z"
  generation: 1
  name: fury.h00pz.co
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: fury.h00pz.co
    uid: 10e03c93-8e23-419d-a9eb-52790a9d0f1c
  resourceVersion: "4458015"
  selfLink: /apis/nmstate.io/v1alpha1/nodenetworkstates/fury.h00pz.co
  uid: 00f15314-73fc-42dd-bdf4-c492286a3498
status:
  currentState:
    dns-resolver:
      config:
        search: []
        server:
        - 172.30.23.100
      running:
        search: []
        server:
        - 172.30.23.100
    interfaces:
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 1450
      name: br0
      state: down
      type: ovs-interface
    - ethernet:
        auto-negotiation: true
        duplex: full
        speed: 1000
      ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: F0:1F:AF:DC:78:C4
      mtu: 1500
      name: eno1
      state: down
      type: ethernet
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: F0:1F:AF:DC:78:C5
      mtu: 1500
      name: eno2
      state: down
      type: ethernet
    - ethernet:
        auto-negotiation: false
        duplex: full
        speed: 10000
      ipv4:
        address:
        - ip: 172.30.22.100
          prefix-length: 24
        dhcp: false
        enabled: true
      ipv6:
        address:
        - ip: fe80::92e2:baff:fe52:7630
          prefix-length: 64
        autoconf: false
        dhcp: false
        enabled: true
      mac-address: 90:E2:BA:52:76:30
      mtu: 1500
      name: enp8s0
      state: up
      type: ethernet
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 65536
      name: lo
      state: down
      type: unknown
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mtu: 1450
      name: tun0
      state: down
      type: ovs-interface
    - ipv4:
        enabled: false
      ipv6:
        enabled: false
      mac-address: A2:6F:AD:44:5B:3B
      mtu: 65000
      name: vxlan_sys_4789
      state: down
      type: vxlan
      vxlan:
        base-iface: ""
        destination-port: 0
        id: 0
        remote: ""
    route-rules:
      config: []
    routes:
      config: []
      running:
      - destination: 172.30.22.0/24
        metric: 100
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 254
      - destination: fe80::/64
        metric: 100
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 254
      - destination: ff00::/8
        metric: 256
        next-hop-address: ""
        next-hop-interface: enp8s0
        table-id: 255
  lastSuccessfulUpdateTime: "2020-07-13T15:19:37Z"
I also ran into this or a similar issue on OpenShift Virtualization 2.5.2. I was able to work around it by including all the 'routes:' in the 'desiredState'.

(In reply to Dominik Holler from comment #13)
> I also run into this or a similar issue on OpenShift Virtualization 2.5.2. I
> was able to work around by including all the 'routes:' into the
> 'desiredState'.

Dominik, I'm surprised this is still a problem, since I worked with the dev team back in July on the workaround I have in my CNV YAML here: https://github.com/h00pz/ocp-build/blob/master/cnv/4_nncp-bridge.yaml. You need to include the routes section to ensure your default gateway doesn't go missing.

Looks like OpenShift Virtualization is using nmstate 0.2, which contains bug 1748389.

[dominik@t460p yml]$ oc exec --namespace openshift-cnv --stdin --tty nmstate-handler-kk5wb -- rpm -qa nmstate
nmstate-0.2.6-14.el8_2.noarch

Thanks for re-opening, Dominik. I indeed closed this with a wrong resolution. It should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1748389. OpenShift Virtualization 2.6 will be based on nmstate 0.3 and hopefully won't have this issue.

Alas, we were unable to reproduce the problem to verify the fix.

*** This bug has been marked as a duplicate of bug 1879458 ***
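For reference, the "include the routes in desiredState" workaround discussed above amounts to pinning the default route in the policy so it is re-created together with the bridge. The following is a hedged sketch only: the policy name is hypothetical, the bridge, NIC, address and gateway are illustrative values taken from the examples earlier in this report, and the apiVersion matches the v1alpha1 API shown above.

# Sketch of a NodeNetworkConfigurationPolicy that carries the static address
# and the default route explicitly, so the gateway is not lost on apply.
oc apply -f - <<'EOF'
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br1-with-default-route
spec:
  desiredState:
    interfaces:
    - name: br1
      type: linux-bridge
      state: up
      ipv4:
        enabled: true
        dhcp: false
        address:
        - ip: 192.168.52.13
          prefix-length: 24
      bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens3
    routes:
      config:
      - destination: 0.0.0.0/0
        next-hop-address: 192.168.52.1
        next-hop-interface: br1
EOF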