Bug 1928851 - manually creating NetNamespaces will break things and this is not obvious
Summary: manually creating NetNamespaces will break things and this is not obvious
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.8.0
Assignee: Mohamed Mahmoud
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-02-15 16:32 UTC by Dan Winship
Modified: 2021-07-27 22:45 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Users manually modify network namespaces in use by SDN. Consequence: If this namespace is already in use by network policy, the network policy behavior will be changed causing undesired behaviors. Fix: While customers are discouraged from manually modifying namespaces in general, the code has been updated to prevent such behavior from corrupting deployed network policies and log a warning with this happens. Result: test shows the warning has been logged and network policy behavior hasn't been changed.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:44:45 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift sdn pull 294 0 None open Bug 1928851: prevent manually creating netns with existing NetID 2021-04-28 15:44:58 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:45:04 UTC

Description Dan Winship 2021-02-15 16:32:44 UTC
In bug 1905196 the customer ran into problems because they had manually created NetNamespace objects, resulting in conflicting NetIDs (in particular, multiple NetID 0 namespaces, because they hadn't bothered to manually assign a NetID), resulting in undefined behavior since conflicting NetIDs aren't allowed in NetworkPolicy mode.

(They were manually creating NetNamespaces because they wanted to set an egress IP for a newly-created project. The fix here is to create the project, then wait for openshift-sdn to create the NetNamespace, then modify the NetNamespace, rather than creating the NetNamespace yourself. But there is nothing in the documentation that suggests that this should be necessary.)

Note that if you manually create a NetNamespace and manually assign it an unused NetID, then things will appear to work, but sdn-controller won't actually observe the new NetID assignment, and so could in theory later assign that same NetID to another namespace. Also, you could manually assign a NetID in the "reserved" range (0-10) which could cause unexpected conflicts if we use another one of those IDs for internal purposes later.

So...

1. If we're keeping the current semantics, then the Egress IP docs need to make it clear that you can only _modify_ NetNamespaces, not _create_ them.

2. sdn-node in NetworkPolicy mode needs to handle NetID conflicts in a well-defined way (ie, treating them as "the user tried to do something that won't work" rather than "a paradox occurred, assuming false==true")

  2a. at a minimum, recognize that `NetID == 0 && Name != "default" &&
      EgressIPs != nil` represents a user error

3. maybe we should fix sdn-controller to recognize when a user has created a NetNamespace with no NetID, and assign a NetID to it in that case. (This would also require sdn-node to ignore the NetNamespace until sdn-controller modified it.)

Comment 1 zhaozhanqi 2021-05-11 11:02:58 UTC
Have a test on 4.8.0-0.nightly-2021-05-11-025533

seems user still can modify the netID, 

1. oc new-project z4

# oc get netnamespace z4
NAME   NETID      EGRESS IPS
z4     15085403   

2. using `oc edit netnamespace z4` to update the netid to 0

# oc edit netnamespace z4
error: netnamespaces.network.openshift.io "z4" could not be patched: the body of the request was in an unknown format - accepted media types include: application/json-patch+json, application/merge-patch+json, application/apply-patch+yaml
You can run `oc replace -f /tmp/oc-edit-o92w3.yaml` to try this update again.

# cat /tmp/oc-edit-o92w3.yaml
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: network.openshift.io/v1
kind: NetNamespace
metadata:
  creationTimestamp: "2021-05-11T10:59:03Z"
  generation: 1
  managedFields:
  - apiVersion: network.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:netid: {}
      f:netname: {}
    manager: openshift-sdn-controller
    operation: Update
    time: "2021-05-11T10:59:03Z"
  name: z4
  resourceVersion: "189337"
  uid: 664c9db1-36c2-446a-8017-63e0142d3c7d
netid: 0
netname: z4

3. # oc replace -f /tmp/oc-edit-o92w3.yaml
netnamespace.network.openshift.io/z4 replaced

4. # oc get netnamespace z4
NAME   NETID   EGRESS IPS
z4     0                             ### Find the netid already updated to 0 same with default


# oc get netnamespace default
NAME      NETID   EGRESS IPS
default   0

Comment 2 Mohamed Mahmoud 2021-05-11 13:18:52 UTC
@zzhao 
1- how you sure the above build contains my fix ?
2- do u see any log like this "Netid 0 for namespace z4 already exists under different namespace default" 
3- I confirmed the fix by adding UT testcases for duplicate and none duplicate cases

Comment 3 Mohamed Mahmoud 2021-05-11 13:29:46 UTC
(In reply to Mohamed Mahmoud from comment #2)
> @zzhao 
> 1- how you sure the above build contains my fix ?
> 2- do u see any log like this "Netid 0 for namespace z4 already exists under
> different namespace default" 
> 3- I confirmed the fix by adding UT testcases for duplicate and none
> duplicate cases

just we are clear we aren't preventing customers from changing netns manually what we are doing in this PR is to protect SDN when this happens so we won't attempt to add/delete policy for this modified netns. in your test pls look for the warning message I shared above to confirm the new behavior

we also need to doc in this BZ that customers are discouraged from manually modifying namespaces so we are covered.

Comment 4 zhaozhanqi 2021-05-12 02:10:26 UTC
(In reply to Mohamed Mahmoud from comment #2)
> @zzhao 
> 1- how you sure the above build contains my fix ?

we can check this build already including the fix in https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.8.0-0.nightly/release/4.8.0-0.nightly-2021-05-11-025533

> 2- do u see any log like this "Netid 0 for namespace z4 already exists under
> different namespace default" 

yes, I can see `W0512 01:15:37.790443    1730 vnids.go:219] Netid 0 for namespace z4 already exists under different namespace default`


> 3- I confirmed the fix by adding UT testcases for duplicate and none
> duplicate cases

Comment 5 zhaozhanqi 2021-05-12 02:19:20 UTC
(In reply to Mohamed Mahmoud from comment #3)
> (In reply to Mohamed Mahmoud from comment #2)
> > @zzhao 
> > 1- how you sure the above build contains my fix ?
> > 2- do u see any log like this "Netid 0 for namespace z4 already exists under
> > different namespace default" 
> > 3- I confirmed the fix by adding UT testcases for duplicate and none
> > duplicate cases
> 
> just we are clear we aren't preventing customers from changing netns
> manually what we are doing in this PR is to protect SDN when this happens so
> we won't attempt to add/delete policy for this modified netns. in your test
> pls look for the warning message I shared above to confirm the new behavior
> 
> we also need to doc in this BZ that customers are discouraged from manually
> modifying namespaces so we are covered.

ok, if only prevent add/delete modified netid, this issue already be fixed

when I attempted add deny policy in z4 namespace. it did not affect the default namespace pod communication

# cat deny.yaml 
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: null


# oc get networkpolicy -n z4
NAME               POD-SELECTOR   AGE
default-deny-all   <none>         7s

# oc exec -n z4 test-rc-7gswb -- curl --connect-timeout 4 10.128.2.24:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
curl: (28) Connection timed out after 4001 milliseconds
command terminated with exit code 28

###check the default pod###

# oc get pod -n default -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                                              NOMINATED NODE   READINESS GATES
test-rc-jp2kq   1/1     Running   0          10m   10.131.0.24   ip-10-0-215-250.ap-northeast-1.compute.internal   <none>           <none>
test-rc-sdtq6   1/1     Running   0          10m   10.129.3.73   ip-10-0-186-183.ap-northeast-1.compute.internal   <none>           <none>

##check the pods can communicate each other in default namespace 

# oc exec -n default test-rc-jp2kq -- curl 10.129.3.73:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    17  100    17    0     0   5666      0 --:--:-- --:--:-- --:--:--  5666
Hello OpenShift!


Move this bug to verified.

Comment 8 errata-xmlrpc 2021-07-27 22:44:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438


Note You need to log in before you can comment on or make changes to this bug.