Created attachment 1669605 [details]
journalctl output

Description of problem:
On a destructive configuration policy, one that involves all physical NICs of a node and is supposed to disable the node's connectivity, the NNCE reports one state while the NNCP reports another.

Version-Release number of selected component (if applicable):
kubernetes-nmstate-handler-rhel8@sha256:4a1379bf1223cf064e54419721045ca1275ae57a04433db78d4a54e1269acee1
CNAO: sha256_379cfaaba59bae6089af24bb25c104e399e867b6732e5c8a33caf235

How reproducible:
Most of the time (the bug doesn't always occur).

Steps to Reproduce:
1. Apply a valid NNCP that affects all physical NICs of a node. In the example given here I set all the NICs, which originally had dynamic IPs, to have static IPs. For each NIC I used the same dynamic IP that the DHCP server provided to it (to make sure I avoid IP conflicts).

apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: static-nics
spec:
  desiredState:
    interfaces:
    - name: ens3
      type: ethernet
      state: up
      ipv4:
        address:
        - ip: 172.16.0.33
          prefix-length: 24
        dhcp: false
        enabled: true
    - name: ens6
      type: ethernet
      state: up
      ipv4:
        address:
        - ip: 172.16.0.19
          prefix-length: 24
        dhcp: false
        enabled: true
    - name: ens7
      type: ethernet
      state: up
      ipv4:
        address:
        - ip: 172.16.0.49
          prefix-length: 24
        dhcp: false
        enabled: true
    - name: ens8
      type: ethernet
      state: up
      ipv4:
        address:
        - ip: 172.16.0.14
          prefix-length: 24
        dhcp: false
        enabled: true
  nodeSelector:
    kubernetes.io/hostname: "host-172-16-0-33"

2.
After a long-enough timeout (~5 minutes), check the IP addresses of all the NICs that were set in this NNCP:

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core.0.33 ip addr show dev ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:dc:3f:f6 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.33/24 brd 172.16.0.255 scope global dynamic noprefixroute ens3
       valid_lft 86195sec preferred_lft 86195sec
    inet6 fe80::f816:3eff:fedc:3ff6/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core.0.33 ip addr show dev ens6
3: ens6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:2e:76:aa brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.19/24 brd 172.16.0.255 scope global dynamic noprefixroute ens6
       valid_lft 86192sec preferred_lft 86192sec
    inet6 fe80::f816:3eff:fe2e:76aa/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core.0.33 ip addr show dev ens7
4: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:9d:e8:a3 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.49/24 brd 172.16.0.255 scope global dynamic noprefixroute ens7
       valid_lft 86189sec preferred_lft 86189sec

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core.0.33 ip addr show dev ens8
5: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:2d:ef:00 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.14/24 brd 172.16.0.255 scope global dynamic noprefixroute ens8
       valid_lft 86186sec preferred_lft 86186sec

In all cases, the address line contains the word "dynamic", which implies that the intended policy configuration was considered destructive and was therefore rolled back.

3.
Check the status of both the NNCP and the NNCE:

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ oc get nncp static-nics
NAME          STATUS
static-nics   SuccessfullyConfigured
[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$
[cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ oc get nnce host-172-16-0-33.static-nics
NAME                           STATUS
host-172-16-0-33.static-nics   ConfigurationProgressing

Actual results:
<BUG> Each resource shows a different status ("SuccessfullyConfigured" vs. "ConfigurationProgressing"), and both are wrong. In addition, the NNCE description ("oc get nnce host-172-16-0-33.static-nics -o yaml") doesn't include a rollback message.

Expected results:
1. Both the NNCP and the NNCE should report the status "ConfigurationFailed".
2. The current status condition in the NNCE should include a rollback message (search for the string "rollback" to verify).

Additional info:
This bug also happened in other scenarios, e.g. when the static IPs in the policy were different from those already assigned dynamically by the DHCP server. However, in that scenario the bug did not occur consistently, and in some cases the behavior was as expected (i.e. both the NNCE and the NNCP showed status "ConfigurationFailed", and the NNCE description included a rollback message).

The node's journalctl output is attached, with nmstate at TRACE log level. It covers the timeline from just before the policy was applied.
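The rollback check from the expected results can be scripted. A minimal sketch follows; since this isn't run against a live cluster, it greps a saved NNCE dump whose condition fields (type/reason/message) are assumptions modeled on what a failed-and-rolled-back enactment should look like, not output captured from this environment:

```shell
# Hypothetical NNCE status dump; on a live cluster you would instead save:
#   oc get nnce host-172-16-0-33.static-nics -o yaml > /tmp/nnce.yaml
cat > /tmp/nnce.yaml <<'EOF'
status:
  conditions:
  - type: Failing
    status: "True"
    reason: FailedToConfigure
    message: "error reconciling NodeNetworkConfigurationPolicy, rollback executed"
EOF

# The actual verification step: look for the rollback message.
if grep -qi rollback /tmp/nnce.yaml; then
  echo "rollback reported"
else
  echo "no rollback message"
fi
```

With the dump above this prints "rollback reported"; on the buggy builds described here, the same grep against the real NNCE comes up empty.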
Yossi, could you please share the logs of all the nmstate pods too? Also check whether any of them were ever restarted; if so, share the previous log (-p) as well.
Attaching all relevant logs. Please note that for one of the masters (host-172-16-0-38) I couldn't get the nmstate-handler log, due to this error:

Error from server: Get https://172.16.0.38:10250/containerLogs/openshift-cnv/nmstate-handler-fzccc/nmstate-handler?previous=true: remote error: tls: internal error

However, this is one of the pods that wasn't restarted. Also note that the node on which I reproduced this bug is host-172-16-0-33 (worker).
Created attachment 1670336 [details] nmstate-handler-host-172-16-0-22-master
Created attachment 1670337 [details] nmstate-handler-host-172-16-0-22-master-previous
Created attachment 1670338 [details] nmstate-handler-host-172-16-0-24-master
Created attachment 1670339 [details] nmstate-handler-host-172-16-0-33-worker
Created attachment 1670340 [details] nmstate-handler-host-172-16-0-33-worker-previous
Created attachment 1670341 [details] nmstate-handler-host-172-16-0-47-worker
Created attachment 1670342 [details] nmstate-handler-host-172-16-0-47-worker-previous
I think I hit this issue in the past and put a fix in place [1], but we didn't merge it. The issue is (or this is my theory): after nmstatectl rollback there is a time window in which connectivity to the k8s API server is lost, so we cannot properly update the NNCE/NNCP conditions. The fix is simply to exercise the same probes we run to check that everything is OK after applying desiredState, but after the rollback too, so that we wait a little for the system to settle once the good config is back in place. Yossi, let me put a PR in place; then we can create a d/s build with it and test.

[1] https://github.com/nmstate/kubernetes-nmstate/pull/429/commits/9059be918b8d5b2b562e2f13aa9e51a1fbbf6058
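The idea behind the fix can be sketched as a simple settle loop: after the rollback, keep probing connectivity and only update the NNCE/NNCP conditions once the probe succeeds, instead of updating immediately inside the outage window. A stand-alone sketch, where the probe is a stand-in reading a flag file rather than the handler's real API-server checks:

```shell
# Stand-in probe: reads a flag file instead of actually checking the API server.
probe() {
  [ "$(cat /tmp/probe-state 2>/dev/null)" = "up" ]
}

echo down > /tmp/probe-state
( sleep 0.2; echo up > /tmp/probe-state ) &   # simulate connectivity returning post-rollback

# Poll until the probe succeeds (or give up after 50 tries),
# and only then proceed to report conditions.
settled=no
for i in $(seq 1 50); do
  if probe; then settled=yes; break; fi
  sleep 0.1
done
echo "settled=$settled"
```

This prints "settled=yes" once the simulated connectivity returns; without the loop, a status update fired right after the rollback would race the outage, which matches the theory above.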
The PR https://github.com/nmstate/kubernetes-nmstate/pull/454
So the risk is only when somebody reconfigures the default interface (kills connectivity to the API server)?
As discussed with Quique, this bug affects only the reporting of the state, and only when the default network is reconfigured. I'm therefore moving this to 2.4. If somebody faces it, the easiest way to get the proper state is to make a non-actionable change to the NNCP, e.g. setting a "description" on one of the interfaces.
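Such a non-actionable edit could look like the fragment below. This is a hypothetical sketch, not from the report: the policy and interface names are taken from the reproduction steps above, and the description string is arbitrary; the point is that it changes nothing about the actual network configuration while still triggering a fresh reconciliation:

```yaml
# Hypothetical touch-up of the policy from the reproduction steps:
# adding only a description forces re-reconciliation without any network change.
apiVersion: nmstate.io/v1alpha1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: static-nics
spec:
  desiredState:
    interfaces:
    - name: ens3
      type: ethernet
      description: "touch to refresh reported state"   # arbitrary, non-actionable
      state: up
```

Re-applying it should cause the handler to recompute and report the correct NNCP/NNCE status.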
I gave a proper look at Quique's PR that is supposed to fix this, and it seems harmless. It may help QE with annoying issues during automation. The cherry-pick to 2.3 applies without any conflict. I'm moving this back to 2.3 and asking for a blocker ack. No harm, just avoiding noise in test automation. Sorry for the noise I caused before.
We have re-created this scenario u/s using a multi-worker cluster, running the test only there (not at master), and several issues related to NNCE/NNCP conditions were found. We have opened several issues u/s that we are working on at master; they will be backported to the release-0.15 branch and later propagated to CNV-2.3:

- Race condition between rollback checks and Node state: https://github.com/nmstate/kubernetes-nmstate/issues/474 (already merged at release-0.15)
- Calculate policy conditions counting nodes running the nmstate pod: https://github.com/nmstate/kubernetes-nmstate/issues/473
- Post-rollback probes are not run: https://github.com/nmstate/kubernetes-nmstate/issues/482

Also some minor bugs related to this:

- app label missing at daemonset: https://github.com/nmstate/kubernetes-nmstate/issues/476
- Bad log at policy condition success: https://github.com/nmstate/kubernetes-nmstate/issues/475
I'm sorry for the noise, but I think we should bring this back to ASSIGNED. Even though we fixed the original issue, after running tier1 against D/S builds we found a couple more. Since those issues would be found by QE sooner or later anyway, I'm moving it back to devs so we can deliver a *flawless* build.
Verified by repeating the scenario from the original bug description 5 times. Results:
1. The NNCE and the NNCP both report the same status condition ("FailedToConfigure").
2. The rollback is reported in the NNCE.

Verified on version:
k8s-nmstate: kubernetes-nmstate-handler-rhel8@sha256:a7946b4d171184c1c0f6cee1f4e63fb18a66121a7024da0132723387d945d459
CNAO: cluster-network-addons-operator@sha256:2ec373edc6a1307248e9d4e4276f26052aafcd6058e5ddc182de847f9e794ff8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:2011
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days