Description of problem:

The current documentation around migrating from ovs-subnet to ovs-networkpolicy does not lead to a successful migration. I took my existing v3.7.23 cluster for this migration. Here is what it looked like before:

[cloud-user@osemaster1 ~]$ oc get clusternetwork
NAME      CLUSTER NETWORKS   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14:9    172.30.0.0/16     redhat/openshift-ovs-subnet

[cloud-user@osemaster1 ~]$ oc get hostsubnets
NAME                           HOST                           HOST IP        SUBNET          EGRESS IPS
mrmaster1.usersys.redhat.com   mrmaster1.usersys.redhat.com   10.10.85.213   10.129.0.0/23   []
mrmaster2.usersys.redhat.com   mrmaster2.usersys.redhat.com   10.10.85.48    10.128.0.0/23   []
mrmaster3.usersys.redhat.com   mrmaster3.usersys.redhat.com   10.10.85.49    10.130.0.0/23   []
mrnode1.usersys.redhat.com     mrnode1.usersys.redhat.com     10.10.85.242   10.131.0.0/23   []
mrnode2.usersys.redhat.com     mrnode2.usersys.redhat.com     10.10.85.221   10.129.2.0/23   []
mrnode3.usersys.redhat.com     mrnode3.usersys.redhat.com     10.10.85.243   10.128.2.0/23   []
mrnode4.usersys.redhat.com     mrnode4.usersys.redhat.com     10.10.85.244   10.130.2.0/23   []

[cloud-user@osemaster1 ~]$ oc get netnamespaces
No resources found.
The steps I followed to migrate:

1) Updated master-config.yaml on all 3 masters: networkPluginName: redhat/openshift-ovs-networkpolicy
2) Updated node-config.yaml on all 3 masters: networkPluginName: redhat/openshift-ovs-networkpolicy (in both places it appears)
3) Updated node-config.yaml on all nodes: networkPluginName: redhat/openshift-ovs-networkpolicy (in both places it appears)
4) Restarted the master APIs one at a time
5) Restarted the master controllers one at a time
6) Restarted the node service on the masters one at a time
7) Restarted the node service on the nodes one at a time

After the migration:

[cloud-user@osemaster1 ~]$ oc get clusternetwork
NAME      CLUSTER NETWORKS   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14:9    172.30.0.0/16     redhat/openshift-ovs-networkpolicy

[cloud-user@osemaster1 ~]$ oc get hostsubnet
NAME                           HOST                           HOST IP        SUBNET          EGRESS IPS
mrmaster1.usersys.redhat.com   mrmaster1.usersys.redhat.com   10.10.85.213   10.129.0.0/23   []
mrmaster2.usersys.redhat.com   mrmaster2.usersys.redhat.com   10.10.85.48    10.128.0.0/23   []
mrmaster3.usersys.redhat.com   mrmaster3.usersys.redhat.com   10.10.85.49    10.130.0.0/23   []
mrnode1.usersys.redhat.com     mrnode1.usersys.redhat.com     10.10.85.242   10.131.0.0/23   []
mrnode2.usersys.redhat.com     mrnode2.usersys.redhat.com     10.10.85.221   10.129.2.0/23   []
mrnode3.usersys.redhat.com     mrnode3.usersys.redhat.com     10.10.85.243   10.128.2.0/23   []
mrnode4.usersys.redhat.com     mrnode4.usersys.redhat.com     10.10.85.244   10.130.2.0/23   []

[cloud-user@osemaster1 ~]$ oc get netnamespace
NAME                                NETID      EGRESS IPS
cloudforms                          7715750    []
default                             0          []
gluster                             14215287   []
kube-public                         12594662   []
kube-service-catalog                4330082    []
kube-system                         11747802   []
logging                             6522750    []
management-infra                    10776907   []
nptest2                             1624012    []
openshift                           12062157   []
openshift-ansible-service-broker    13812239   []
openshift-infra                     8090575    []
openshift-node                      3158383    []
openshift-template-service-broker   9536235    []
test1                               14946879   []
test2                               2247975    []

At this point, the entire cluster is in a bad state.
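The config edits in steps 1-3 can be scripted. A minimal sketch, assuming the networkPluginName lines follow the default 3.7 config layout (demonstrated here on a temporary sample file so the edit itself is visible, rather than on a live /etc/origin config):

```shell
#!/bin/sh
# Sketch: switch networkPluginName from ovs-subnet to ovs-networkpolicy.
# On a real host you would point sed at master-config.yaml / node-config.yaml;
# the sample file below is illustrative only.
CONFIG=$(mktemp)
cat > "$CONFIG" <<'EOF'
networkConfig:
  networkPluginName: redhat/openshift-ovs-subnet
networkPluginName: redhat/openshift-ovs-subnet
EOF

# Replace every occurrence (node-config.yaml lists the plugin in two places).
sed -i 's|redhat/openshift-ovs-subnet|redhat/openshift-ovs-networkpolicy|g' "$CONFIG"

# Show the resulting plugin lines to confirm both were updated.
grep networkPluginName "$CONFIG"
rm -f "$CONFIG"
```

Running it should print two networkPluginName lines, both set to redhat/openshift-ovs-networkpolicy.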
oc commands are VERY slow and services are not working properly at all. Some debugging shows that after the switch, on all nodes there are no veth interfaces for any of the containers:

[cloud-user@osenode2 ~]$ sudo ovs-ofctl -O OpenFlow13 dump-ports-desc br0
OFPST_PORT_DESC reply (OF1.3) (xid=0x2):
 1(vxlan0): addr:66:af:8e:8a:16:7a
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:ee:1b:b0:dd:38:68
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br0): addr:fa:8e:e4:9b:d7:43
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max

On all servers, I restarted network -> iptables -> docker -> atomic-openshift-node, and also restarted the haproxy service on the load balancer. This got oc commands and services working again. The veths were also back for all of the containers:

[cloud-user@osenode2 ~]$ sudo ovs-ofctl -O OpenFlow13 dump-ports-desc br0
OFPST_PORT_DESC reply (OF1.3) (xid=0x2):
 1(vxlan0): addr:1a:9d:5c:e3:87:09
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:6e:a0:ae:7d:aa:1f
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 3(veth3278d5e7): addr:96:80:d9:4e:cb:b6
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 4(vethf2a92e2f): addr:4a:e2:ad:5d:51:ba
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 5(veth3b50c259): addr:5a:59:5e:0e:8a:29
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:f6:cb:9c:6b:36:4b
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max

Now I tried to add a network policy to deny all in my test1 project:

kind: NetworkPolicy
apiVersion: extensions/v1beta1
metadata:
  name: deny-by-default
spec:
  podSelector:
  ingress: []

[cloud-user@osemaster1 ~]$ oc get networkpolicy
NAME              POD-SELECTOR   AGE
deny-by-default   <none>         3h

This had no impact. I could hit my test service from inside and outside the cluster.
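A quick way to spot the broken state described above is to count the veth ports on br0: zero veths while containers are running means the bridge was not re-wired. A sketch of the check follows; the parsing is demonstrated against a captured sample, but on a live node you would pipe in `sudo ovs-ofctl -O OpenFlow13 dump-ports-desc br0` instead:

```shell
#!/bin/sh
# Count container veth ports in ovs-ofctl dump-ports-desc output.
# A count of 0 on a node with running pods indicates the bad state.
count_veths() {
  grep -c '(veth' "$1"
}

# Captured sample standing in for live ovs-ofctl output (illustrative only):
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
 1(vxlan0): addr:1a:9d:5c:e3:87:09
 2(tun0): addr:6e:a0:ae:7d:aa:1f
 3(veth3278d5e7): addr:96:80:d9:4e:cb:b6
 4(vethf2a92e2f): addr:4a:e2:ad:5d:51:ba
 LOCAL(br0): addr:f6:cb:9c:6b:36:4b
EOF

count_veths "$SAMPLE"   # prints 2 for this sample
rm -f "$SAMPLE"
```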
[cloud-user@osemaster1 ~]$ oc get svc
NAME      CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
ruby-ex   172.30.117.206   <none>        8080/TCP   7h

[cloud-user@osemaster1 ~]$ curl http://172.30.117.206:8080/health
1
[cloud-user@osemaster1 ~]$ oc rsh rhel-openshift
sh-4.2$ curl http://172.30.117.206:8080/health
1

At this point, I did a rolling restart of all the servers in the cluster. When the cluster came back, the network policy seemed to be functioning per the deny-all:

[cloud-user@osemaster1 ~]$ curl http://172.30.117.206:8080/health
curl: (7) Failed connect to 172.30.117.206:8080; Connection timed out

Version-Release number of selected component (if applicable):
[cloud-user@osemaster1 ~]$ oc version
oc v3.7.23
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://mrlb.usersys.redhat.com:8443
openshift v3.7.23
kubernetes v1.7.6+a08f5eeb62

How reproducible:
Always

Steps to Reproduce:
1. See above

Actual results:
Network policy does not work without a full restart of the servers.

Expected results:
We need to determine which services must be restarted along with the api, controllers, and node service to ensure OVS is properly synced to enable networkpolicy.

Additional info:
Weibin: Can you try this and see what needs to be restarted to make it work? And then we can update the docs.
I have reproduced both issues mentioned in the bug description in my setup, which has v3.7.44 installed:

1. oc commands are VERY slow
2. networkpolicy does not function after migration until services are restarted

For the second issue, I checked the OVS rules when this happened and found that they do not come up after applying the networkpolicy. An OVS rules log covering both the working and non-working setups is attached.
Created attachment 1431537 [details] ovs rules log
After updating networkPluginName to redhat/openshift-ovs-networkpolicy in the master-config and node-config files, please restart services in the order below:

On masters:
systemctl restart iptables
systemctl restart openvswitch
systemctl restart docker
systemctl restart atomic-openshift-master-api.service
systemctl restart atomic-openshift-master-controllers.service
systemctl restart atomic-openshift-node.service

On nodes:
systemctl restart iptables
systemctl restart openvswitch
systemctl restart docker
systemctl restart atomic-openshift-node.service

Following the steps above, without rebooting any systems, all oc commands, container veth interfaces, and the deny-by-default networkpolicy work fine in both v3.7 and v3.9.
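The per-host sequences above can be sketched as a small script. This is a dry run: the `run=echo` indirection only prints the commands so the intended order is visible; on a real host you would set `run=""` and execute as root:

```shell
#!/bin/sh
# Dry-run sketch of the per-host restart order after the plugin switch.
# run=echo prints commands instead of executing them; set run="" for real use.
run=echo

restart_master() {
  for svc in iptables openvswitch docker \
      atomic-openshift-master-api atomic-openshift-master-controllers \
      atomic-openshift-node; do
    $run systemctl restart "$svc"
  done
}

restart_node() {
  for svc in iptables openvswitch docker atomic-openshift-node; do
    $run systemctl restart "$svc"
  done
}

echo "# master sequence:"
restart_master
echo "# node sequence:"
restart_node
```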
Weibin, I've been testing these added steps in the CEE QuickLabs with mixed results. At times it works as expected; other times, while the svc is restricted via curl, the endpoint is not. Looking at pod placement, it appears that when the target pod is on the same node as the router pod, the svc is restricted but the endpoint is not, as seen when running curl against both the svc and the endpoint. When the target pod is on a different node from the router pod, both the svc and the endpoint appear to have the proper restrictions. I too am using OCP 3.7.44/RHEL 7.4 in the QuickLabs. I've attached two additional files recording the steps for each lab. If you want to inspect the labs, you are more than welcome.

Working Lab Details:
--------------------
https://operations.cee.redhat.com/quicklab/cluster/tmanornetwork
attached file: working.pods.diff.nodes.txt

Not Working Lab Details:
------------------------
https://operations.cee.redhat.com/quicklab/cluster/tmanor1networkpolicy
attached file: notworking.pods.same.node.txt

Instructions for creating the key for QuickLabs can be found here:
https://gitlab.cee.redhat.com/cee_ops/quicklab/wikis/access
Created attachment 1433504 [details] recording of working policy
Created attachment 1433505 [details] recording of nonworking policy
Weibin, something else worth noting: in my testing, I created my target project AFTER migrating the environment from ovs-subnet to ovs-networkpolicy, reaching the 2 different results noted above. I think we also need to consider the test case where EXISTING projects that were running under ovs-subnet need to be able to have NetworkPolicy objects defined once the migration to ovs-networkpolicy has occurred. In some preliminary testing, I have not been able to get this test case to work either, regardless of whether the target pod is hosted on the same node as the router pod or on a different one. Would it make sense to cover these use cases/test cases as a separate BZ? I'm going to continue testing this additional use case as well. Thanks!
@Tom, I reproduced the issue you mentioned in Comment 5. After I tried the same testing steps in a new OpenShift cluster that was installed with openshift-ovs-networkpolicy from the beginning, I saw the same issue. So the problem you saw is a networkpolicy issue, not a migration issue. Please open another bug to report the networkpolicy issue you found here. For this bug, we need a documentation PR describing the correct order of restarting services after migrating from openshift-ovs-subnet to openshift-ovs-networkpolicy.
@Weibin, New BZ created for networkpolicy issue - BZ 1576857.
Weibin, do we need to do all of those steps?

For the master, is it sufficient to do:
systemctl restart openvswitch
systemctl restart atomic-openshift-master-api.service
systemctl restart atomic-openshift-master-controllers.service
systemctl restart atomic-openshift-node.service

And for the node:
systemctl restart openvswitch
systemctl restart atomic-openshift-node.service
(In reply to Ben Bennett from comment #11)
> Weibin, do we need to do all of those steps?
>
> For the master is it sufficient to do:
>   systemctl restart openvswitch
>   systemctl restart atomic-openshift-master-api.service
>   systemctl restart atomic-openshift-master-controllers.service
>   systemctl restart atomic-openshift-node.service
>
> And for the node:
>   systemctl restart openvswitch
>   systemctl restart atomic-openshift-node.service

Ben, I retested, and I think the whole procedure should be:

####Migrating from ovs-subnet to ovs-networkpolicy

1) Update networkPluginName in master-config.yaml on all masters
2) Update networkPluginName in node-config.yaml on all masters and nodes
3) Restart atomic-openshift-master-api and atomic-openshift-master-controllers on all masters one by one
4) Restart the atomic-openshift-node service on all masters and nodes one by one
5) Restart openvswitch on all masters and nodes one by one
6) Restart atomic-openshift-master-api and atomic-openshift-master-controllers on all masters one by one
7) Restart the atomic-openshift-node service on all masters one by one
(In reply to Weibin Liang from comment #12)
> 7) Restart the atomic-openshift-node service on all masters one by one

Step 7 should be:

7) Restart the atomic-openshift-node service on all masters and nodes one by one
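The corrected order from comment 12, with the step 7 fix, can be sketched as a dry-run plan. The hostnames are illustrative assumptions; each echoed command would in practice be run one host at a time (e.g. over ssh), waiting for the service to come back before moving on:

```shell
#!/bin/sh
# Dry-run sketch of the corrected migration restart order (steps 3-7).
# Hostnames below are illustrative; substitute your masters and nodes.
MASTERS="master1 master2 master3"
NODES="node1 node2"

plan() {  # plan <hosts> <services...> - print one restart per host/service
  hosts=$1; shift
  for h in $hosts; do
    for svc in "$@"; do
      echo "ssh $h systemctl restart $svc"
    done
  done
}

plan "$MASTERS" atomic-openshift-master-api atomic-openshift-master-controllers
plan "$MASTERS $NODES" atomic-openshift-node
plan "$MASTERS $NODES" openvswitch
plan "$MASTERS" atomic-openshift-master-api atomic-openshift-master-controllers
plan "$MASTERS $NODES" atomic-openshift-node
```

The key point the plan encodes is that openvswitch is restarted cluster-wide (clearing the old rules) before the second round of master and node restarts repopulates them.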
Fixed by docs PR https://github.com/openshift/openshift-docs/pull/9638
Commit pushed to master at https://github.com/openshift/openshift-docs

https://github.com/openshift/openshift-docs/commit/8eeb262175dca99706acd54f529ec8985c036913

Fix the OpenShift SDN migration steps

We need to restart openvswitch to clean out the rules before we restart the node processes.

Fixes bug 1569244 (https://bugzilla.redhat.com/show_bug.cgi?id=1569244)
The docs changes LGTM, verified this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816
The *Migrating Between SDN Plug-ins* section in the 3.9[1] and 3.10[2] docs was changed last May per https://bugzilla.redhat.com/show_bug.cgi?id=1569244#c13. The 3.7[3] docs were not changed, though the issue was originally raised for 3.7. @misalunk - What version was your customer using? Please confirm whether the 3.7 docs need this change. [1] https://docs.openshift.com/container-platform/3.9/install_config/configuring_sdn.html#migrating-between-sdn-plugins [2] https://docs.openshift.com/container-platform/3.10/install_config/configuring_sdn.html#migrating-between-sdn-plugins [3] https://docs.openshift.com/container-platform/3.7/install_config/configuring_sdn.html#migrating-between-sdn-plugins
OCP 3.7-3.10 has reached the end of full support [1]. Closing this BZ as WONTFIX. If there is a customer case to be attached with a valid support exception and we still need a fix here, please post those details and reopen. [1] - https://access.redhat.com/support/policy/updates/openshift
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days