Description of problem:
I ran an e2e that SIGTERMed while this test was running:
https://github.com/openshift/origin/pull/24998/files

That test creates an additionalNetworks entry in the cluster network.operator.openshift.io resource:

    spec:
      additionalNetworks:
      - name: secondary
        namespace: e2e-test-prometheus-qtdwj
        simpleMacvlanConfig:
          ipamConfig:
            staticIPAMConfig:
              addresses:
              - address: 10.1.1.0/24
            type: static
        type: SimpleMacvlan
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      defaultNetwork:
        type: OpenShiftSDN
      logLevel: ""
      serviceNetwork:
      - 172.30.0.0/16

However, when the e2e namespace is deleted, the network operator goes Degraded with the following error:

Error while updating operator configuration: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) e2e-test-prometheus-qtdwj/secondary: could not create (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) e2e-test-prometheus-qtdwj/secondary: namespaces "e2e-test-prometheus-qtdwj" not found

The additionalNetworks entry must be manually removed from the cluster network.operator.openshift.io resource; only then does the network operator go back to Degraded=False.

Version-Release number of selected component (if applicable):
4.6.3

How reproducible:
Always

Steps to Reproduce:
1. Create an additionalNetwork for a particular namespace and then delete that namespace

Actual results:
network operator goes Degraded

Expected results:
network operator should not go Degraded and.. maybe additionalNetworks should be removed for nonexistent namespaces.

Additional info:
> I ran an e2e that SIGTERMed while this test was running
> https://github.com/openshift/origin/pull/24998/files

er... why did it SIGTERM? And why didn't that end the e2e run right then? (Or did it end the e2e run, and CNO didn't go Degraded until after the e2e run was over, but the CI logs failed to make that clear?)

And why did the namespace get deleted but the NetworkAttachmentDefinition didn't? Did the test's "defer" clause not run but its AfterEach() did?

> Expected results:
> network operator should not go Degraded and.. maybe additionalNetworks should be removed for nonexistent namespaces.

In general, if there is bad data in the network config, the expected behavior is that CNO goes Degraded with a message that clearly indicates that there is bad data in the network config which the admin needs to fix. It does not normally try to fix things itself. (What if the admin modified the config first and was planning to create the namespace second?)

But anyway, if we're going to go Degraded here, the message should be a lot clearer about what happened.

We need to revisit that e2e test to see if what it's doing is really the right approach for testing this feature, and if it's not, then rewrite it.
> er... why did it SIGTERM?

I was manually running openshift-tests against a long-lived cluster. Unfortunately, the KNI networks stuff fell over and I ctrl-C'ed it because it takes forever to finish when things are going wrong.

I generally don't expect tests that are not marked [Disruptive] to make changes to cluster level configs.
Changing the subcomponent; this bug wasn't coming up on our radar under the previous one.
*** Bug 1945566 has been marked as a duplicate of this bug. ***
Is there news in relation to this issue?
Hello, I was able to reproduce the degraded state, but not the SIGTERM. I'm looking into what's causing it now.
Here are my current steps:
1. ran ./hack/run-locally.sh (observe "degraded: false" for network clusteroperator)
2. in new terminal tab, create additionalNetwork using docs (https://docs.openshift.com/container-platform/4.7/networking/multiple_networks/configuring-macvlan.html) with namespace named "test-pebos"
3. delete "test-pebos" and observe "degraded: true" for network clusteroperator
4. recreate "test-pebos" and observe "degraded: false" for network clusteroperator

This seems to me
Accidentally posted before I finished typing - but following my fourth step, this actually seems to be the desired behavior on the CNO side (re: Dan Winship's comment). The Degraded state is cleared once either the expected namespace exists or the net-attach-def is deleted. Agree with Dan, maybe the logging can be clearer or provide a suggested action ("consider either changing namespace name, creating namespace, or deleting net-attach-def").
Since https://bugzilla.redhat.com/show_bug.cgi?id=1945566 has been closed as a duplicate of this bug, please also consider the original use-case[1] there. I.e. additionalNetworks must be added to the CNO by a cluster administrator, but Namespaces can be created through self-service. Therefore, the CNO should be able to handle the case when the additionalNetwork is added before the NS exists.

Is this use-case in the scope of this Bz or should https://bugzilla.redhat.com/show_bug.cgi?id=1945566 be re-opened?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1945566#c0
I want to make sure I'm on the same page here... David, I tried my earlier example and this time made a namespace manually after the additionalNetwork was "created" (didn't actually show up and network operator becomes degraded). The degraded state went away quickly and the additionalNetwork showed up when I ran `oc get network-attachment-definitions -n test-pebos`. Which part of this is the undesired behavior? Or is the issue a design choice that should become more straightforward? Thanks.
Apologies for the late reply. The issue, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1945566#c0, arises when you have additionalNetworks both for namespaces which exist and for namespaces which don't (yet) exist. Depending on the order of the additionalNetworks, the net-attach-defs for the namespaces that _do_ exist will not be created until all of the missing namespaces have been created. Please see the example in https://bugzilla.redhat.com/attachment.cgi?id=1768232
Thanks for the input David, and for the example, very helpful.

I'm inclined to think that in general this is acceptable behavior from the CNO: if one item in the set can't be reconciled, then none of the changes in the set are reconciled. At a glance that seems OK. And I'm not sure about having the CNO manage the creation of namespaces. Also, the degradation lets a user know that something is wrong, so they can look to see what's going on.

However, I believe the messaging can be updated. I think we need a two-pronged approach:

1. Update the messaging from the CNO to make it actionable by a user: let them know what went wrong and what they need to fix, conveying "This namespace doesn't exist, please create $foo namespace before adding net-attach-defs".
2. I think the documentation might need an update. As an example: https://docs.openshift.com/container-platform/4.7/networking/multiple_networks/configuring-bridge.html#nw-multus-create-network_configuring-bridge -- procedure, step 2. Add an informational note for the "namespace:" line to denote that users are expected to create the namespaces first.

Happy to get any other input here, too, thanks.
I think the main issue here is that Namespaces (projects) can be self-service but additionalNetworks for the CNO aren't.

If the above assumption is correct, then we should not treat a non-existent namespace as something wrong but rather as a normal condition, and the CNO should basically defer creating the net-attach-defs related to that namespace until it at some point (maybe) turns up. This should of course not affect the handling of other net-attach-defs.

If the above assumption is _not_ correct, then we must revisit how we communicate self-service of namespaces and projects.
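The deferral idea in the comment above can be sketched in Go as follows. This is only an illustration under stated assumptions: `additionalNetwork` is a pared-down, hypothetical stand-in for an entry in spec.additionalNetworks, and the set of existing namespaces is passed in as a plain map instead of being read from the cluster.

```go
package main

import "fmt"

// additionalNetwork is a hypothetical, pared-down stand-in for one entry
// in spec.additionalNetworks.
type additionalNetwork struct {
	Name      string
	Namespace string
}

// partitionByNamespace splits the networks into those whose namespace
// already exists (safe to apply now) and those whose namespace is missing
// (to be retried once the namespace turns up). Input order is preserved.
func partitionByNamespace(nets []additionalNetwork, existing map[string]bool) (ready, deferred []additionalNetwork) {
	for _, n := range nets {
		if existing[n.Namespace] {
			ready = append(ready, n)
		} else {
			deferred = append(deferred, n)
		}
	}
	return ready, deferred
}

func main() {
	nets := []additionalNetwork{
		{Name: "secondary", Namespace: "not-yet-created"},
		{Name: "secondary", Namespace: "team-a"},
	}
	ready, deferred := partitionByNamespace(nets, map[string]bool{"team-a": true})
	fmt.Println("apply now:", ready)
	fmt.Println("retry when the namespace appears:", deferred)
}
```

With this split, an entry for a missing namespace listed first no longer blocks the entry for an existing namespace listed after it.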
Tested and verified in 4.10.0-0.nightly-2022-01-21-074618
After running the automation script several times, this problem shows up again in 4.10.0-rc.3:

02-23 15:56:52.126 STEP: Delete the namespace
02-23 15:56:52.126 warning: deleting cluster-scoped resources, not scoped to the provided namespace
02-23 15:56:52.126 project.project.openshift.io "ocp-46387" deleted
02-23 15:56:52.126 STEP: Check NetworkOperatorStatus after deleting namespace
02-23 15:56:52.126 Feb 23 20:56:41.229: INFO: Network operator state is:NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
02-23 15:56:52.126 network 4.10.0-rc.3 True False False 6h30m
02-23 15:56:52.126 Feb 23 20:56:51.232: INFO: Network operator state is:NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
02-23 15:56:52.126 network 4.10.0-rc.3 True False True 6h30m Error while updating operator configuration: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary; Namespace error for networkattachment definition, consider possible solutions: (1) Edit config files to include existing namespace (2) Create non-existent namespace (3) Delete erroneous network-attachment-definition: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary: ApplyObject of (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary was unsuccessful: namespaces "ocp-46387" not found
02-23 15:56:52.126

Whole testing log: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/ginkgo-test/36853/console
Reopening as it's not clear to me that the problem described in https://bugzilla.redhat.com/show_bug.cgi?id=1896533#c16 has been addressed. I.e. if there are multiple additionalNetworks defined and the NS for one of these is missing, are the subsequent additionalNetworks then processed by the CNO?
Still failing in 4.11.0-rc.0:

network 4.11.0-rc.0 True False False 26m
Jul 5 12:24:20.049: INFO: Network operator state is:NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
network 4.11.0-rc.0 True False True 26m Error while updating operator configuration: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary; Namespace error for networkattachment definition, consider possible solutions: (1) Edit config files to include existing namespace (2) Create non-existent namespace (3) Delete erroneous network-attachment-definition: could not apply (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary: failed to apply / update (k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition) ocp-46387/secondary: namespaces "ocp-46387" not found
The feature keeps failing in QE CI testing: https://issues.redhat.com/browse/OCPQE-11902
Hi folks, please take a look at this proposed fix: https://github.com/openshift/cluster-network-operator/pull/1600

I moved the "SetDegraded" function call to after all objects are processed. Feedback is welcome and encouraged.
Tested and verified in 4.12.0-0.nightly-2022-11-07-104414
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399