Created attachment 1669605 [details]
journalctl output

Description of problem:
On a destructive configuration policy, one that involves all physical NICs of a node and is supposed to disable the node's connectivity, the NNCE reports one state while the NNCP reports another.

Version-Release number of selected component (if applicable):
kubernetes-nmstate-handler-rhel8@sha256:4a1379bf1223cf064e54419721045ca1275ae57a04433db78d4a54e1269acee1
CNAO: sha256_379cfaaba59bae6089af24bb25c104e399e867b6732e5c8a33caf235

How reproducible:
Most of the time (the bug doesn't always occur).

Steps to Reproduce:
1. Apply a valid NNCP that affects all physical NICs of a node. In the example given here I set all the NICs, which originally had dynamic IPs, to have static IPs. For each NIC I used the same dynamic IP that the DHCP server provided to it (to make sure I avoid IP conflicts).

apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: static-nics
spec:
  desiredState:
    interfaces:
    - name: ens3
      type: ethernet
      state: up
      ipv4:
        address:
        - ip: 172.16.0.33
          prefix-length: 24
        dhcp: false
        enabled: true
    - name: ens6
      type: ethernet
      state: up
      ipv4:
        address:
        - ip: 172.16.0.19
          prefix-length: 24
        dhcp: false
        enabled: true
    - name: ens7
      type: ethernet
      state: up
      ipv4:
        address:
        - ip: 172.16.0.49
          prefix-length: 24
        dhcp: false
        enabled: true
    - name: ens8
      type: ethernet
      state: up
      ipv4:
        address:
        - ip: 172.16.0.14
          prefix-length: 24
        dhcp: false
        enabled: true
  nodeSelector:
    kubernetes.io/hostname: "host-172-16-0-33"

2.
After a long-enough timeout (~5 minutes), check the IP addresses of all the NICs that were set in this NNCP:

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core.0.33 ip addr show dev ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:dc:3f:f6 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.33/24 brd 172.16.0.255 scope global dynamic noprefixroute ens3
       valid_lft 86195sec preferred_lft 86195sec
    inet6 fe80::f816:3eff:fedc:3ff6/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core.0.33 ip addr show dev ens6
3: ens6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:2e:76:aa brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.19/24 brd 172.16.0.255 scope global dynamic noprefixroute ens6
       valid_lft 86192sec preferred_lft 86192sec
    inet6 fe80::f816:3eff:fe2e:76aa/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core.0.33 ip addr show dev ens7
4: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:9d:e8:a3 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.49/24 brd 172.16.0.255 scope global dynamic noprefixroute ens7
       valid_lft 86189sec preferred_lft 86189sec

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core.0.33 ip addr show dev ens8
5: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:2d:ef:00 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.14/24 brd 172.16.0.255 scope global dynamic noprefixroute ens8
       valid_lft 86186sec preferred_lft 86186sec

In all cases, the address line contains the word "dynamic", which implies that the intended policy configuration was considered destructive and was therefore rolled back.

3.
Check the status of both the NNCP and the NNCE:

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ oc get nncp static-nics
NAME          STATUS
static-nics   SuccessfullyConfigured
[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$
[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ oc get nnce host-172-16-0-33.static-nics
NAME                           STATUS
host-172-16-0-33.static-nics   ConfigurationProgressing

Actual results:
<BUG> Each resource shows a different status ("SuccessfullyConfigured" vs. "ConfigurationProgressing"), and both are wrong. In addition, the NNCE description ("oc get nnce host-172-16-0-33.static-nics -o yaml") doesn't include a rollback message.

Expected results:
1. Both the NNCP and the NNCE should report the status "ConfigurationFailed".
2. The current status condition in the NNCE should include a rollback message (search for the string "rollback" to verify).

Additional info:
This bug also happened in other scenarios, e.g. when the static IPs in the policy were different from those already assigned dynamically by the DHCP server. However, in that scenario the bug did not occur consistently, and in some cases the behavior was as expected (i.e. both the NNCE and the NNCP showed status "ConfigurationFailed", and the NNCE description included a rollback message).

The node's journalctl output is attached, with nmstate at TRACE log level. It covers the timeline from just before the policy was applied.
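The rollback check from the expected results can be scripted. A minimal sketch follows; since this isn't run against a live cluster, it greps a saved NNCE dump whose condition fields (type/reason/message) are assumptions modeled on what a failed-and-rolled-back enactment should look like, not output captured from this environment:

```shell
# Hypothetical NNCE status dump; on a live cluster you would instead save:
#   oc get nnce host-172-16-0-33.static-nics -o yaml > /tmp/nnce.yaml
cat > /tmp/nnce.yaml <<'EOF'
status:
  conditions:
  - type: Failing
    status: "True"
    reason: FailedToConfigure
    message: "error reconciling NodeNetworkConfigurationPolicy, rollback executed"
EOF

# The actual verification step: look for the rollback message.
if grep -qi rollback /tmp/nnce.yaml; then
  echo "rollback reported"
else
  echo "no rollback message"
fi
```

With the dump above this prints "rollback reported"; on the buggy builds described here, the same grep against the real NNCE comes up empty.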
Yossi, could you please share the logs of all the nmstate pods too? Also check whether any of them were ever restarted; if so, share the previous log (-p) as well.
Attaching all relevant logs. Please note that for one of the masters (host-172-16-0-38) I couldn't get the nmstate-handler log, due to this error:

Error from server: Get https://172.16.0.38:10250/containerLogs/openshift-cnv/nmstate-handler-fzccc/nmstate-handler?previous=true: remote error: tls: internal error

However, this is one of the pods that wasn't restarted. Also note that the node on which I reproduced this bug is host-172-16-0-33 (worker).
Created attachment 1670336 [details] nmstate-handler-host-172-16-0-22-master
Created attachment 1670337 [details] nmstate-handler-host-172-16-0-22-master-previous
Created attachment 1670338 [details] nmstate-handler-host-172-16-0-24-master
Created attachment 1670339 [details] nmstate-handler-host-172-16-0-33-worker
Created attachment 1670340 [details] nmstate-handler-host-172-16-0-33-worker-previous
Created attachment 1670341 [details] nmstate-handler-host-172-16-0-47-worker
Created attachment 1670342 [details] nmstate-handler-host-172-16-0-47-worker-previous
I think I hit this issue in the past and put a fix in place [1], but we didn't merge it. The issue is (or this is my theory): after nmstatectl rollback there is a time window in which connectivity to the k8s API server is lost, so we cannot properly update the NNCE/NNCP conditions. The fix is simply to exercise the same probes we run to check that everything is OK after applying desiredState, but after the rollback too, so that we wait a little for the system to settle once the good config is back in place. Yossi, let me put a PR in place; then we can create a d/s build with it and test.

[1] https://github.com/nmstate/kubernetes-nmstate/pull/429/commits/9059be918b8d5b2b562e2f13aa9e51a1fbbf6058
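The idea behind the fix can be sketched as a simple settle loop: after the rollback, keep probing connectivity and only update the NNCE/NNCP conditions once the probe succeeds, instead of updating immediately inside the outage window. A stand-alone sketch, where the probe is a stand-in reading a flag file rather than the handler's real API-server checks:

```shell
# Stand-in probe: reads a flag file instead of actually checking the API server.
probe() {
  [ "$(cat /tmp/probe-state 2>/dev/null)" = "up" ]
}

echo down > /tmp/probe-state
( sleep 0.2; echo up > /tmp/probe-state ) &   # simulate connectivity returning post-rollback

# Poll until the probe succeeds (or give up after 50 tries),
# and only then proceed to report conditions.
settled=no
for i in $(seq 1 50); do
  if probe; then settled=yes; break; fi
  sleep 0.1
done
echo "settled=$settled"
```

This prints "settled=yes" once the simulated connectivity returns; without the loop, a status update fired right after the rollback would race the outage, which matches the theory above.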
The PR https://github.com/nmstate/kubernetes-nmstate/pull/454
So the risk is only when somebody reconfigures the default interface (kills connectivity to the API server)?
As discussed with Quique, this bug affects only the reporting of the state, and only when the default network is reconfigured. I'm therefore moving this to 2.4. If somebody faces it, the easiest way to get the proper state is to make a non-actionable change to the NNCP, e.g. setting a "description" on one of the interfaces.
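Such a non-actionable edit could look like the fragment below. This is a hypothetical sketch, not from the report: the policy and interface names are taken from the reproduction steps above, and the description string is arbitrary; the point is that it changes nothing about the actual network configuration while still triggering a fresh reconciliation:

```yaml
# Hypothetical touch-up of the policy from the reproduction steps:
# adding only a description forces re-reconciliation without any network change.
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: static-nics
spec:
  desiredState:
    interfaces:
    - name: ens3
      type: ethernet
      description: "touch to refresh reported state"   # arbitrary, non-actionable
      state: up
```

Re-applying it should cause the handler to recompute and report the correct NNCP/NNCE status.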
I gave a proper look at Quique's PR that is supposed to fix this, and it seems harmless. It may help QE with annoying issues during automation. The cherry-pick to 2.3 applies without any conflict. I'm moving this back to 2.3 and asking for a blocker ack. No harm, just avoiding noise in test automation. Sorry for the noise I caused before.
We have re-created this scenario u/s using a multi-worker cluster, running the test only there (not at master), and several issues related to NNCE/NNCP conditions were found. We have opened several issues u/s that we are working on at master; they will be backported to the release-0.15 branch and later propagated to CNV-2.3:

- Race condition between rollback checks and Node state: https://github.com/nmstate/kubernetes-nmstate/issues/474 (already merged at release-0.15)
- Calculate policy conditions counting nodes running the nmstate pod: https://github.com/nmstate/kubernetes-nmstate/issues/473
- Post-rollback probes are not run: https://github.com/nmstate/kubernetes-nmstate/issues/482

Also some minor bugs related to this:

- app label missing at daemonset: https://github.com/nmstate/kubernetes-nmstate/issues/476
- Bad log at policy condition success: https://github.com/nmstate/kubernetes-nmstate/issues/475
I'm sorry for the noise, but I think we should bring this back to ASSIGNED. Even though we fixed the original issue, after running tier1 against D/S builds we found a couple more. Since those issues would be found by QE sooner or later anyway, I'm moving it back to devs so we can deliver a *flawless* build.
Verified by repeating the scenario from the original bug description 5 times. Results:
1. The NNCE and the NNCP both report the same status condition ("FailedToConfigure").
2. The rollback is reported in the NNCE.

Verified on version:
k8s-nmstate: kubernetes-nmstate-handler-rhel8@sha256:a7946b4d171184c1c0f6cee1f4e63fb18a66121a7024da0132723387d945d459
CNAO: cluster-network-addons-operator@sha256:2ec373edc6a1307248e9d4e4276f26052aafcd6058e5ddc182de847f9e794ff8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:2011
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days