Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1928851

Summary: manually creating NetNamespaces will break things and this is not obvious
Product: OpenShift Container Platform Reporter: Dan Winship <danw>
Component: NetworkingAssignee: Mohamed Mahmoud <mmahmoud>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aconstan, astoycos, zzhao
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Users manually modify network namespaces in use by SDN. Consequence: If this namespace is already in use by network policy, the network policy behavior will be changed causing undesired behaviors. Fix: While customers are discouraged from manually modifying namespaces in general, the code has been updated to prevent such behavior from corrupting deployed network policies and log a warning with this happens. Result: test shows the warning has been logged and network policy behavior hasn't been changed.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:44:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Winship 2021-02-15 16:32:44 UTC
In bug 1905196 the customer ran into problems because they had manually created NetNamespace objects, resulting in conflicting NetIDs (in particular, multiple NetID 0 namespaces, because they hadn't bothered to manually assign a NetID), resulting in undefined behavior since conflicting NetIDs aren't allowed in NetworkPolicy mode.

(They were manually creating NetNamespaces because they wanted to set an egress IP for a newly-created project. The fix here is to create the project, then wait for openshift-sdn to create the NetNamespace, then modify the NetNamespace, rather than creating the NetNamespace yourself. But there is nothing in the documentation that suggests that this should be necessary.)

Note that if you manually create a NetNamespace and manually assign it an unused NetID, then things will appear to work, but sdn-controller won't actually observe the new NetID assignment, and so could in theory later assign that same NetID to another namespace. Also, you could manually assign a NetID in the "reserved" range (0-10) which could cause unexpected conflicts if we use another one of those IDs for internal purposes later.

So...

1. If we're keeping the current semantics, then the Egress IP docs need to make it clear that you can only _modify_ NetNamespaces, not _create_ them.

2. sdn-node in NetworkPolicy mode needs to handle NetID conflicts in a well-defined way (ie, treating them as "the user tried to do something that won't work" rather than "a paradox occurred, assuming false==true")

  2a. at a minimum, recognize that `NetID == 0 && Name != "default" &&
      EgressIPs != nil` represents a user error

3. maybe we should fix sdn-controller to recognize when a user has created a NetNamespace with no NetID, and assign a NetID to it in that case. (This would also require sdn-node to ignore the NetNamespace until sdn-controller modified it.)

Comment 1 zhaozhanqi 2021-05-11 11:02:58 UTC
Have a test on 4.8.0-0.nightly-2021-05-11-025533

seems user still can modify the netID, 

1. oc new-project z4

# oc get netnamespace z4
NAME   NETID      EGRESS IPS
z4     15085403   

2. using `oc edit netnamespace z4` to update the netid to 0

# oc edit netnamespace z4
error: netnamespaces.network.openshift.io "z4" could not be patched: the body of the request was in an unknown format - accepted media types include: application/json-patch+json, application/merge-patch+json, application/apply-patch+yaml
You can run `oc replace -f /tmp/oc-edit-o92w3.yaml` to try this update again.

# cat /tmp/oc-edit-o92w3.yaml
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: network.openshift.io/v1
kind: NetNamespace
metadata:
  creationTimestamp: "2021-05-11T10:59:03Z"
  generation: 1
  managedFields:
  - apiVersion: network.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:netid: {}
      f:netname: {}
    manager: openshift-sdn-controller
    operation: Update
    time: "2021-05-11T10:59:03Z"
  name: z4
  resourceVersion: "189337"
  uid: 664c9db1-36c2-446a-8017-63e0142d3c7d
netid: 0
netname: z4

3. # oc replace -f /tmp/oc-edit-o92w3.yaml
netnamespace.network.openshift.io/z4 replaced

4. # oc get netnamespace z4
NAME   NETID   EGRESS IPS
z4     0                             ### Find the netid already updated to 0 same with default


# oc get netnamespace default
NAME      NETID   EGRESS IPS
default   0

Comment 2 Mohamed Mahmoud 2021-05-11 13:18:52 UTC
@zzhao 
1- how you sure the above build contains my fix ?
2- do u see any log like this "Netid 0 for namespace z4 already exists under different namespace default" 
3- I confirmed the fix by adding UT testcases for duplicate and none duplicate cases

Comment 3 Mohamed Mahmoud 2021-05-11 13:29:46 UTC
(In reply to Mohamed Mahmoud from comment #2)
> @zzhao 
> 1- how you sure the above build contains my fix ?
> 2- do u see any log like this "Netid 0 for namespace z4 already exists under
> different namespace default" 
> 3- I confirmed the fix by adding UT testcases for duplicate and none
> duplicate cases

just we are clear we aren't preventing customers from changing netns manually what we are doing in this PR is to protect SDN when this happens so we won't attempt to add/delete policy for this modified netns. in your test pls look for the warning message I shared above to confirm the new behavior

we also need to doc in this BZ that customers are discouraged from manually modifying namespaces so we are covered.

Comment 4 zhaozhanqi 2021-05-12 02:10:26 UTC
(In reply to Mohamed Mahmoud from comment #2)
> @zzhao 
> 1- how you sure the above build contains my fix ?

we can check this build already including the fix in https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.8.0-0.nightly/release/4.8.0-0.nightly-2021-05-11-025533

> 2- do u see any log like this "Netid 0 for namespace z4 already exists under
> different namespace default" 

yes, I can see `W0512 01:15:37.790443    1730 vnids.go:219] Netid 0 for namespace z4 already exists under different namespace default`


> 3- I confirmed the fix by adding UT testcases for duplicate and none
> duplicate cases

Comment 5 zhaozhanqi 2021-05-12 02:19:20 UTC
(In reply to Mohamed Mahmoud from comment #3)
> (In reply to Mohamed Mahmoud from comment #2)
> > @zzhao 
> > 1- how you sure the above build contains my fix ?
> > 2- do u see any log like this "Netid 0 for namespace z4 already exists under
> > different namespace default" 
> > 3- I confirmed the fix by adding UT testcases for duplicate and none
> > duplicate cases
> 
> just we are clear we aren't preventing customers from changing netns
> manually what we are doing in this PR is to protect SDN when this happens so
> we won't attempt to add/delete policy for this modified netns. in your test
> pls look for the warning message I shared above to confirm the new behavior
> 
> we also need to doc in this BZ that customers are discouraged from manually
> modifying namespaces so we are covered.

ok, if only prevent add/delete modified netid, this issue already be fixed

when I attempted add deny policy in z4 namespace. it did not affect the default namespace pod communication

# cat deny.yaml 
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: null


# oc get networkpolicy -n z4
NAME               POD-SELECTOR   AGE
default-deny-all   <none>         7s

# oc exec -n z4 test-rc-7gswb -- curl --connect-timeout 4 10.128.2.24:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
curl: (28) Connection timed out after 4001 milliseconds
command terminated with exit code 28

###check the default pod###

# oc get pod -n default -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                                              NOMINATED NODE   READINESS GATES
test-rc-jp2kq   1/1     Running   0          10m   10.131.0.24   ip-10-0-215-250.ap-northeast-1.compute.internal   <none>           <none>
test-rc-sdtq6   1/1     Running   0          10m   10.129.3.73   ip-10-0-186-183.ap-northeast-1.compute.internal   <none>           <none>

##check the pods can communicate each other in default namespace 

# oc exec -n default test-rc-jp2kq -- curl 10.129.3.73:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    17  100    17    0     0   5666      0 --:--:-- --:--:-- --:--:--  5666
Hello OpenShift!


Move this bug to verified.

Comment 8 errata-xmlrpc 2021-07-27 22:44:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438