Bug 1896533 - network operator degraded due to additionalNetwork in non-existent namespace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Nikhil Simha
QA Contact: Weibin Liang
URL:
Whiteboard:
Duplicates: 1945566 (view as bug list)
Depends On:
Blocks:
Reported: 2020-11-10 19:57 UTC by Seth Jennings
Modified: 2023-01-17 19:46 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:46:21 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 1128 0 None open Bug 1896533: Nonexistent Namespaces Degradation logging message 2021-06-15 19:00:43 UTC
Github openshift cluster-network-operator pull 1600 0 None open Bug 1896533: moved SetDegraded call out of object loop to process all items first 2022-10-27 18:11:48 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:46:40 UTC

Description Seth Jennings 2020-11-10 19:57:42 UTC
Description of problem:

I ran an e2e that SIGTERMed while this test was running
https://github.com/openshift/origin/pull/24998/files

That test creates an additionalNetworks entry in the cluster network.operator.openshift.io resource:

spec:
  additionalNetworks:
  - name: secondary
    namespace: e2e-test-prometheus-qtdwj
    simpleMacvlanConfig:
      ipamConfig:
        staticIPAMConfig:
          addresses:
          - address: 10.1.1.0/24
        type: static
    type: SimpleMacvlan
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    type: OpenShiftSDN
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16

However, when the e2e namespace is deleted, the network operator goes Degraded with the following error:

Error while updating operator configuration: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) e2e-test-prometheus-qtdwj/secondary: could not create (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) e2e-test-prometheus-qtdwj/secondary: namespaces "e2e-test-prometheus-qtdwj" not found

The cluster network.operator.openshift.io resource must have the additionalNetworks manually removed and then the network operator will go Degraded=False.
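
Before editing the resource by hand, it can help to list which additionalNetworks entries point at a namespace that no longer exists. The following is only a minimal sketch: `stale_additional_networks` is a hypothetical helper, not part of the CNO or of `oc`, the spec is a trimmed copy of the one above, and the set of existing namespaces is assumed to have been collected separately. It also assumes that an entry without a namespace lands in "default".

```python
# Trimmed copy of the spec from this report; only the fields the check needs.
SPEC = {
    "additionalNetworks": [
        {
            "name": "secondary",
            "namespace": "e2e-test-prometheus-qtdwj",
            "type": "SimpleMacvlan",
        }
    ]
}


def stale_additional_networks(spec, existing_namespaces):
    """Return (name, namespace) pairs whose namespace is not in the cluster.

    Hypothetical helper: `existing_namespaces` is whatever set of names the
    caller collected beforehand.
    """
    stale = []
    for net in spec.get("additionalNetworks", []):
        # Assumption: an entry without a namespace lands in "default".
        namespace = net.get("namespace") or "default"
        if namespace not in existing_namespaces:
            stale.append((net["name"], namespace))
    return stale


if __name__ == "__main__":
    # The e2e namespace is gone, so the "secondary" entry is reported.
    print(stale_additional_networks(SPEC, {"default", "openshift-multus"}))
```

Any entry this reports is a candidate for manual removal from the cluster network.operator.openshift.io resource, or for having its namespace recreated.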


Version-Release number of selected component (if applicable):
4.6.3

How reproducible:
Always

Steps to Reproduce:
1. Create an additionalNetwork for a particular namespace.
2. Delete that namespace.

Actual results:
network operator goes Degraded

Expected results:
network operator should not go Degraded and.. maybe additionalNetworks should be removed for nonexistent namespaces.

Additional info:

Comment 1 Dan Winship 2020-11-10 20:48:31 UTC
> I ran an e2e that SIGTERMed while this test was running
> https://github.com/openshift/origin/pull/24998/files

er... why did it SIGTERM? And why didn't that end the e2e run right then? (Or did it end the e2e run, and CNO didn't go Degraded until after the e2e run was over, but the CI logs failed to make that clear?)

And why did the namespace get deleted but the NetworkAttachmentDefinition didn't? Did the test's "defer" clause not run but its AfterEach() did?

> Expected results:
> network operator should not go Degraded and.. maybe additionalNetworks should be removed for nonexistent namespaces.

In general, if there is bad data in the network config, the expected behavior is that CNO goes Degraded with a message that clearly indicates that there is bad data in the network config which the admin needs to fix. It does not normally try to fix things itself. (What if the admin modified the config first and was planning to create the namespace second?)

But anyway, if we're going to go Degraded here, the message should be a lot clearer about what happened.


We need to revisit that e2e test to see if what it's doing is really the right approach for testing this feature, and if it's not, then rewrite it.

Comment 2 Seth Jennings 2020-11-10 21:00:35 UTC
> er... why did it SIGTERM?
I was manually running openshift-tests against a long-lived cluster.  Unfortunately, the KNI networks stuff fell over and I ctrl-C'ed it because it takes forever to finish when things are going wrong.

I generally don't expect tests that are not marked [Disruptive] to make changes to cluster level configs.

Comment 3 Douglas Smith 2021-01-15 17:15:45 UTC
Changing the subcomponent; this bug wasn't coming up on our radar under the old one.

Comment 4 Douglas Smith 2021-04-05 14:37:10 UTC
*** Bug 1945566 has been marked as a duplicate of this bug. ***

Comment 10 Lucas López Montero 2021-06-09 07:19:54 UTC
Is there news in relation to this issue?

Comment 11 Nikhil Simha 2021-06-09 19:39:30 UTC
Hello, I was able to reproduce the degraded state, but not the SIGTERM. I'm looking into what's causing it now.

Comment 12 Nikhil Simha 2021-06-10 19:03:16 UTC
Here are my current steps:

1. ran ./hack/run-locally.sh (observe "degraded: false" for network clusteroperator)

2. in new terminal tab, create additionalNetwork using docs (https://docs.openshift.com/container-platform/4.7/networking/multiple_networks/configuring-macvlan.html) with namespace named "test-pebos"

3. delete "test-pebos" and observe "degraded: true" for network clusteroperator

4. recreate "test-pebos" and observe "degraded: false" for network clusteroperator

This seems to me 

Suggested by @

Comment 13 Nikhil Simha 2021-06-10 19:10:29 UTC
Accidentally posted before I finished typing - but following my fourth step, seems to be desired behavior on CNO side actually (re: Dan Winship's comment). 

The Degraded state clears once either the expected namespace exists or the net-attach-def is deleted.

Agree with Dan, maybe logging can be clearer or provide a suggested action ("consider either changing namespace name, creating namespace, or deleting net-attach-def").

Comment 14 David Juran 2021-06-11 10:51:17 UTC
Since https://bugzilla.redhat.com/show_bug.cgi?id=1945566 has been closed as a duplicate of this bug, please also consider the original use-case[1] there.

I.e. additionalNetworks must be added to the CNO config by a cluster administrator, but Namespaces can be created through self-service. Therefore, the CNO should be able to handle the case where the additionalNetwork is added before the NS exists.

Is this use-case in the scope of this Bz or should https://bugzilla.redhat.com/show_bug.cgi?id=1945566 be re-opened?

[1]
https://bugzilla.redhat.com/show_bug.cgi?id=1945566#c0

Comment 15 Nikhil Simha 2021-06-11 18:58:04 UTC
I want to make sure I'm on the same page here...

David, I tried my earlier example and this time made a namespace manually after the additionalNetwork was "created" (didn't actually show up and network operator becomes degraded). 

The degraded state went away quickly and the additionalNetwork showed up when I ran `oc get network-attachment-definitions -n test-pebos`.

Which part of this is the undesired behavior? Or is the issue a design choice that should become more straightforward? Thanks.

Comment 16 David Juran 2021-06-24 10:09:23 UTC
Apologies for late reply.
The issue, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1945566#c0 is if you have additionalNetworks both for namespaces which exist and for such that don't (yet) exist.
Depending on the order of the additionalNetworks, the net-attach-defs for those namespaces that _do_ exist will not be created until all namespaces have been created.

Please see the example in https://bugzilla.redhat.com/attachment.cgi?id=1768232
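
The ordering effect described in this comment can be shown with a toy model. This is a sketch only, not the CNO code: `apply_fail_fast` and the namespace and network names are hypothetical, and the loop simply assumes that processing stops at the first entry whose namespace is missing.

```python
def apply_fail_fast(networks, existing_namespaces):
    """Toy model: stop at the first entry whose namespace is missing."""
    created = []
    for namespace, name in networks:
        if namespace not in existing_namespaces:
            # Mirrors the shape of the error text quoted in this report.
            return created, f'namespaces "{namespace}" not found'
        created.append(f"{namespace}/{name}")
    return created, None


existing = {"team-a"}  # hypothetical: only team-a exists so far

# Same two entries, different order in additionalNetworks.
missing_first = [("team-b", "net-b"), ("team-a", "net-a")]
missing_last = [("team-a", "net-a"), ("team-b", "net-b")]

created_first, error_first = apply_fail_fast(missing_first, existing)
created_last, error_last = apply_fail_fast(missing_last, existing)

if __name__ == "__main__":
    print(created_first, error_first)
    print(created_last, error_last)
```

With the missing namespace listed first, nothing is created for team-a even though that namespace exists; with it listed last, team-a/net-a is created before the error is hit.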

Comment 17 Douglas Smith 2021-06-24 20:46:18 UTC
Thanks for the input David, and the example, very helpful.

I'm inclined to think that in general this is acceptable behavior from the CNO: if one item in the set can't be reconciled, then the entirety of the changes from the set isn't reconciled. At a glance that seems OK. And I'm not sure about having the CNO manage the creation of namespaces. And the degradation lets a user know that something is wrong, so they can look to see what's going on.

However, I believe the messaging can be updated. I think we need a two pronged approach:

1. Update the messaging from the CNO to make it actionable by a user, to let them know what went wrong, and what they need to fix, to convey "This namespace doesn't exist, please create $foo namespace before adding net-attach-defs"

2. I think the documentation might need an update, as an example: https://docs.openshift.com/container-platform/4.7/networking/multiple_networks/configuring-bridge.html#nw-multus-create-network_configuring-bridge -- procedure, step 2. Add an informational note for the "namespace:" line to denote that it's expected that users create the namespaces first.

Happy to get any other input here, too, thanks.

Comment 18 David Juran 2021-07-01 11:16:03 UTC
I think the main issue here is that Namespaces (projects) can be self-service but additionalNetworks for the CNO aren't.

If the above assumption is correct, then we should not treat a non-existent namespace as something wrong but rather as a normal condition and CNO should basically defer creating the net-attach-defs related to that namespace until it at some point (maybe) turns up. This should of course not affect handling of other net-attach-defs.
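
One possible shape for the deferral proposed here, as a sketch only: the `DeferredNetworks` class and its methods are hypothetical and do not describe how the CNO is implemented. Entries for missing namespaces are parked instead of being treated as errors, and are created when the namespace turns up.

```python
from collections import defaultdict


class DeferredNetworks:
    """Hypothetical model: park net-attach-defs until their namespace exists."""

    def __init__(self, existing_namespaces):
        self.namespaces = set(existing_namespaces)
        self.created = []                 # (namespace, name) pairs applied
        self.pending = defaultdict(list)  # namespace -> names waiting for it

    def add(self, namespace, name):
        # A missing namespace is a normal condition here, not an error.
        if namespace in self.namespaces:
            self.created.append((namespace, name))
        else:
            self.pending[namespace].append(name)

    def namespace_created(self, namespace):
        # Called when a namespace appears, e.g. through self-service.
        self.namespaces.add(namespace)
        for name in self.pending.pop(namespace, []):
            self.created.append((namespace, name))


if __name__ == "__main__":
    nets = DeferredNetworks({"team-a"})
    nets.add("team-b", "net-b")   # parked: team-b does not exist yet
    nets.add("team-a", "net-a")   # applied right away
    nets.namespace_created("team-b")
    print(nets.created)
```

In this model the entry for team-a is never held up by the entry for team-b, which is the property the comment asks for.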

If the above assumption is _not_ correct, then we must revisit how we communicate self-service of namespaces and projects.

Comment 21 Weibin Liang 2022-01-21 17:09:49 UTC
Tested and verified in 4.10.0-0.nightly-2022-01-21-074618

Comment 24 Weibin Liang 2022-02-23 21:02:51 UTC
After running the automation script several times, I see this problem again in 4.10.0-rc.3

02-23 15:56:52.126  STEP: Delete the namespace
02-23 15:56:52.126  warning: deleting cluster-scoped resources, not scoped to the provided namespace
02-23 15:56:52.126  project.project.openshift.io "ocp-46387" deleted
02-23 15:56:52.126  STEP: Check NetworkOperatorStatus after deleting namespace
02-23 15:56:52.126  Feb 23 20:56:41.229: INFO: Network operator state is:NAME      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
02-23 15:56:52.126  network   4.10.0-rc.3   True        False         False      6h30m
02-23 15:56:52.126  Feb 23 20:56:51.232: INFO: Network operator state is:NAME      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
02-23 15:56:52.126  network   4.10.0-rc.3   True        False         True       6h30m   Error while updating operator configuration: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary; Namespace error for networkattachment definition, consider possible solutions: (1) Edit config files to include existing namespace (2) Create non-existent namespace (3) Delete erroneous network-attachment-definition: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary: ApplyObject of (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary was unsuccessful: namespaces "ocp-46387" not found
02-23 15:56:52.126  

Whole testing log: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/ginkgo-test/36853/console

Comment 29 David Juran 2022-05-05 10:49:09 UTC
Reopening as it's not clear to me that the problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1896533#c16 has been addressed.

I.e. if there are multiple additionalNetworks defined and the NS for one of these is missing, are the subsequent additionalNetworks then processed by the CNO?

Comment 33 Weibin Liang 2022-07-05 16:26:25 UTC
Still failing in 4.11.0-rc.0

network   4.11.0-rc.0   True        False         False      26m
Jul  5 12:24:20.049: INFO: Network operator state is:NAME      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.11.0-rc.0   True        False         True       26m     Error while updating operator configuration: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary; Namespace error for networkattachment definition, consider possible solutions: (1) Edit config files to include existing namespace (2) Create non-existent namespace (3) Delete erroneous network-attachment-definition: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary: failed to apply / update (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary: namespaces "ocp-46387" not found

Comment 36 Weibin Liang 2022-09-07 01:21:09 UTC
The feature keeps failing in QE CI testing: https://issues.redhat.com/browse/OCPQE-11902

Comment 39 Nikhil Simha 2022-10-24 17:56:57 UTC
Hi folks, please take a look at this proposed fix:

https://github.com/openshift/cluster-network-operator/pull/1600

I moved the "SetDegraded" function call to after all objects are processed. Feedback is welcome and encouraged.
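
The idea of reporting Degraded only after every object has been processed can be sketched as follows. This is a toy model under stated assumptions, not the code from the pull request: `apply_all_then_report` and the sample names are hypothetical, and a returned message stands in for the SetDegraded call.

```python
def apply_all_then_report(networks, existing_namespaces):
    """Toy model: process every entry, then report failures once at the end."""
    created, errors = [], []
    for namespace, name in networks:
        if namespace not in existing_namespaces:
            # Record the failure and keep going instead of returning early.
            errors.append(f'{namespace}/{name}: namespaces "{namespace}" not found')
            continue
        created.append(f"{namespace}/{name}")
    # Stand-in for a single degraded report issued after the loop.
    degraded_message = "; ".join(errors) if errors else None
    return created, degraded_message


if __name__ == "__main__":
    # Hypothetical input: the first entry's namespace is missing.
    networks = [("team-b", "net-b"), ("team-a", "net-a")]
    print(apply_all_then_report(networks, {"team-a"}))
```

In this model the entry for the existing namespace is applied regardless of where the failing entry sits in the list, while the failure is still surfaced.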

Comment 41 Weibin Liang 2022-11-07 19:11:32 UTC
Tested and verified in 4.12.0-0.nightly-2022-11-07-104414

Comment 44 errata-xmlrpc 2023-01-17 19:46:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

