Hide Forgot
In bug 1905196 the customer ran into problems because they had manually created NetNamespace objects, resulting in conflicting NetIDs (in particular, multiple NetID 0 namespaces, because they hadn't bothered to manually assign a NetID), resulting in undefined behavior since conflicting NetIDs aren't allowed in NetworkPolicy mode. (They were manually creating NetNamespaces because they wanted to set an egress IP for a newly-created project. The fix here is to create the project, then wait for openshift-sdn to create the NetNamespace, then modify the NetNamespace, rather than creating the NetNamespace yourself. But there is nothing in the documentation that suggests that this should be necessary.) Note that if you manually create a NetNamespace and manually assign it an unused NetID, then things will appear to work, but sdn-controller won't actually observe the new NetID assignment, and so could in theory later assign that same NetID to another namespace. Also, you could manually assign a NetID in the "reserved" range (0-10) which could cause unexpected conflicts if we use another one of those IDs for internal purposes later. So... 1. If we're keeping the current semantics, then the Egress IP docs need to make it clear that you can only _modify_ NetNamespaces, not _create_ them. 2. sdn-node in NetworkPolicy mode needs to handle NetID conflicts in a well-defined way (ie, treating them as "the user tried to do something that won't work" rather than "a paradox occurred, assuming false==true") 2a. at a minimum, recognize that `NetID == 0 && Name != "default" && EgressIPs != nil` represents a user error 3. maybe we should fix sdn-controller to recognize when a user has created a NetNamespace with no NetID, and assign a NetID to it in that case. (This would also require sdn-node to ignore the NetNamespace until sdn-controller modified it.)
Have a test on 4.8.0-0.nightly-2021-05-11-025533 seems user still can modify the netID, 1. oc new-project z4 # oc get netnamespace z4 NAME NETID EGRESS IPS z4 15085403 2. using `oc edit netnamespace z4` to update the netid to 0 # oc edit netnamespace z4 error: netnamespaces.network.openshift.io "z4" could not be patched: the body of the request was in an unknown format - accepted media types include: application/json-patch+json, application/merge-patch+json, application/apply-patch+yaml You can run `oc replace -f /tmp/oc-edit-o92w3.yaml` to try this update again. # cat /tmp/oc-edit-o92w3.yaml # Please edit the object below. Lines beginning with a '#' will be ignored, # and an empty file will abort the edit. If an error occurs while saving this file will be # reopened with the relevant failures. # apiVersion: network.openshift.io/v1 kind: NetNamespace metadata: creationTimestamp: "2021-05-11T10:59:03Z" generation: 1 managedFields: - apiVersion: network.openshift.io/v1 fieldsType: FieldsV1 fieldsV1: f:netid: {} f:netname: {} manager: openshift-sdn-controller operation: Update time: "2021-05-11T10:59:03Z" name: z4 resourceVersion: "189337" uid: 664c9db1-36c2-446a-8017-63e0142d3c7d netid: 0 netname: z4 3. # oc replace -f /tmp/oc-edit-o92w3.yaml netnamespace.network.openshift.io/z4 replaced 4. # oc get netnamespace z4 NAME NETID EGRESS IPS z4 0 ### Find the netid already updated to 0 same with default # oc get netnamespace default NAME NETID EGRESS IPS default 0
@zzhao 1- how you sure the above build contains my fix ? 2- do u see any log like this "Netid 0 for namespace z4 already exists under different namespace default" 3- I confirmed the fix by adding UT testcases for duplicate and none duplicate cases
(In reply to Mohamed Mahmoud from comment #2) > @zzhao > 1- how you sure the above build contains my fix ? > 2- do u see any log like this "Netid 0 for namespace z4 already exists under > different namespace default" > 3- I confirmed the fix by adding UT testcases for duplicate and none > duplicate cases just we are clear we aren't preventing customers from changing netns manually what we are doing in this PR is to protect SDN when this happens so we won't attempt to add/delete policy for this modified netns. in your test pls look for the warning message I shared above to confirm the new behavior we also need to doc in this BZ that customers are discouraged from manually modifying namespaces so we are covered.
(In reply to Mohamed Mahmoud from comment #2) > @zzhao > 1- how you sure the above build contains my fix ? we can check this build already including the fix in https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.8.0-0.nightly/release/4.8.0-0.nightly-2021-05-11-025533 > 2- do u see any log like this "Netid 0 for namespace z4 already exists under > different namespace default" yes, I can see `W0512 01:15:37.790443 1730 vnids.go:219] Netid 0 for namespace z4 already exists under different namespace default` > 3- I confirmed the fix by adding UT testcases for duplicate and none > duplicate cases
(In reply to Mohamed Mahmoud from comment #3) > (In reply to Mohamed Mahmoud from comment #2) > > @zzhao > > 1- how you sure the above build contains my fix ? > > 2- do u see any log like this "Netid 0 for namespace z4 already exists under > > different namespace default" > > 3- I confirmed the fix by adding UT testcases for duplicate and none > > duplicate cases > > just we are clear we aren't preventing customers from changing netns > manually what we are doing in this PR is to protect SDN when this happens so > we won't attempt to add/delete policy for this modified netns. in your test > pls look for the warning message I shared above to confirm the new behavior > > we also need to doc in this BZ that customers are discouraged from manually > modifying namespaces so we are covered. ok, if only prevent add/delete modified netid, this issue already be fixed when I attempted add deny policy in z4 namespace. it did not affect the default namespace pod communication # cat deny.yaml apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-deny-all spec: podSelector: null # oc get networkpolicy -n z4 NAME POD-SELECTOR AGE default-deny-all <none> 7s # oc exec -n z4 test-rc-7gswb -- curl --connect-timeout 4 10.128.2.24:8080 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0 curl: (28) Connection timed out after 4001 milliseconds command terminated with exit code 28 ###check the default pod### # oc get pod -n default -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES test-rc-jp2kq 1/1 Running 0 10m 10.131.0.24 ip-10-0-215-250.ap-northeast-1.compute.internal <none> <none> test-rc-sdtq6 1/1 Running 0 10m 10.129.3.73 ip-10-0-186-183.ap-northeast-1.compute.internal <none> <none> ##check the pods can communicate each other in default namespace # oc exec -n default test-rc-jp2kq -- curl 10.129.3.73:8080 % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 17 100 17 0 0 5666 0 --:--:-- --:--:-- --:--:-- 5666 Hello OpenShift! Move this bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438