Bug 1569244

Summary: ovs-subnet to ovs-networkpolicy migration does not work as documented [docs]

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Matthew Robson <mrobson> |
| Component: | Networking | Assignee: | Brandi Munilla <bmcelvee> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, bbennett, misalunk, scuppett, tmanor, zzhao |
| Version: | 3.7.0 | Keywords: | Reopened |
| Target Milestone: | --- | Target Release: | 3.10.z |
| Hardware: | All | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-20 15:46:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Doc Text:

Cause: The documentation for ovs-subnet to ovs-networkpolicy migration was not complete.
Consequence: Migration would not succeed without a reboot.
Fix: The documentation was corrected.
Result: Migration can be done without a reboot.
Description
Matthew Robson
2018-04-18 21:16:04 UTC
Weibin: Can you try this and see what needs to be restarted to make it work? Then we can update the docs.

I have reproduced both issues mentioned in the bug description in my setup, which runs v3.7.44:

1. oc commands are VERY slow.
2. NetworkPolicy does not function after migration until services are restarted.

For the second issue, I checked the OVS rules when it happened and found that the rules never come up after the network policy is applied; see the inspection sketch below. An OVS rules log covering both a working and a non-working setup is attached.

Created attachment 1431537 [details]
ovs rules log
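For anyone reproducing this, the OVS rules mentioned above can be inspected on a node with ovs-ofctl; a minimal sketch, assuming the default openshift-sdn bridge br0 (the policy file name is hypothetical):

```bash
# Dump the OpenFlow rules on a node's SDN bridge. openshift-sdn
# programs br0 with OpenFlow 1.3, so the version must be passed.
ovs-ofctl -O OpenFlow13 dump-flows br0

# Snapshot the flows before and after applying a policy, then diff
# to see whether the NetworkPolicy rules were actually installed.
ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/flows-before.txt
oc apply -f default-deny.yaml   # hypothetical policy manifest
ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/flows-after.txt
diff /tmp/flows-before.txt /tmp/flows-after.txt
```

In the non-working case described above, the diff would come back empty because no new rules appear after the policy is applied.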
After updating networkPluginName: redhat/openshift-ovs-networkpolicy in the master-config and node-config files, please restart services in the following order.

On each master:

    systemctl restart iptables
    systemctl restart openvswitch
    systemctl restart docker
    systemctl restart atomic-openshift-master-api.service
    systemctl restart atomic-openshift-master-controllers.service
    systemctl restart atomic-openshift-node.service

On each node:

    systemctl restart iptables
    systemctl restart openvswitch
    systemctl restart docker
    systemctl restart atomic-openshift-node.service

Following the steps above, without rebooting the systems, all oc commands, container veth interfaces, and the deny-by-default NetworkPolicy work fine in both v3.7 and v3.9.

Weibin, I've been testing these added steps in the CEE QuickLabs with mixed results. At times it works as expected; other times, while the svc is restricted (verified via curl), the endpoint is not. Looking at pod placement, it appears that when the target pod is on the same node as the router pod, the svc is restricted but the endpoint is not, as seen when running curl against both the svc and the endpoint (see the sketch below). When the target pod is on a different node from the router pod, both the svc and the endpoint appear to be properly restricted. I too am using OCP 3.7.44/RHEL 7.4 in the QuickLabs. I've attached two additional files recording the steps for each lab. If you want to inspect the labs, you are more than welcome.

Working lab details:
https://operations.cee.redhat.com/quicklab/cluster/tmanornetwork
Attached file: working.pods.diff.nodes.txt

Not working lab details:
https://operations.cee.redhat.com/quicklab/cluster/tmanor1networkpolicy
Attached file: notworking.pods.same.node.txt

Instructions for creating the key for QuickLabs can be found here: https://gitlab.cee.redhat.com/cee_ops/quicklab/wikis/access
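The service-versus-endpoint check described above might look roughly like this sketch; the service name (web), label selector, and port are placeholders, and the curls would be run from a pod in a different project:

```bash
# Resolve the service (svc) IP and the backing pod (endpoint) IP.
# "web" and port 8080 are placeholders for the actual target.
SVC_IP=$(oc get svc web -o jsonpath='{.spec.clusterIP}')
POD_IP=$(oc get pod -l app=web -o jsonpath='{.items[0].status.podIP}')

# With a deny-by-default policy in the target project, BOTH curls
# should time out. In the failing case, only the service path was
# blocked while the endpoint path still answered.
curl -s -m 5 "http://${SVC_IP}:8080/" || echo "service blocked (expected)"
curl -s -m 5 "http://${POD_IP}:8080/" || echo "endpoint blocked (expected)"
```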
Created attachment 1433504 [details]
recording of working policy

Created attachment 1433505 [details]
recording of nonworking policy
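For reference, a deny-by-default policy of the kind exercised in these recordings can be created as in the sketch below. This assumes the upstream v1 NetworkPolicy API; on 3.7, where the feature was newer, the API group may differ:

```bash
# Create a deny-by-default policy in the current project. An empty
# podSelector matches every pod in the namespace, and specifying no
# ingress rules blocks all inbound traffic to those pods.
oc create -f - <<EOF
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: default-deny
spec:
  podSelector: {}
EOF
```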
Weibin, something else worth noting: in my testing, I created my target project AFTER migrating the environment from ovs-subnet to ovs-networkpolicy, and reached the two different results noted above. I think we also need to consider the test case where EXISTING projects running under ovs-subnet need to be able to have NetworkPolicy objects defined once the migration to ovs-networkpolicy has occurred. In some preliminary testing, I have not been able to get this case to work either, regardless of whether the target pod is hosted on the same node as the router pod or on a different one. Would it make sense to cover these use cases/test cases in a separate BZ? I'm going to continue testing this additional use case as well. Thanks!

@Tom, I reproduced the issue you mentioned in Comment 5. After trying the same testing steps on a new OpenShift cluster that had openshift-ovs-networkpolicy installed from the beginning, I saw the same issue. So the problem you are seeing is a NetworkPolicy issue, not a migration issue. Please open another bug to report it. For this bug, we need a documentation PR describing the correct order for restarting services after migrating from openshift-ovs-subnet to openshift-ovs-networkpolicy.

@Weibin, new BZ created for the NetworkPolicy issue: BZ 1576857.

Weibin, do we need to do all of those steps? For the master, is it sufficient to do:

    systemctl restart openvswitch
    systemctl restart atomic-openshift-master-api.service
    systemctl restart atomic-openshift-master-controllers.service
    systemctl restart atomic-openshift-node.service

And for the node:

    systemctl restart openvswitch
    systemctl restart atomic-openshift-node.service

(In reply to Ben Bennett from comment #11)

Ben, after retesting, I think the whole procedure should be as follows.

Migrating from ovs-subnet to ovs-networkpolicy:

1) Update networkPluginName in master-config.yaml on all masters.
2) Update networkPluginName in node-config.yaml on all masters and nodes.
3) Restart the atomic-openshift-master-api and atomic-openshift-master-controllers services on all masters, one by one.
4) Restart the atomic-openshift-node service on all masters and nodes, one by one.
5) Restart openvswitch on all masters and nodes, one by one.
6) Restart the atomic-openshift-master-api and atomic-openshift-master-controllers services on all masters, one by one.
7) Restart the atomic-openshift-node service on all masters, one by one.

(In reply to Weibin Liang from comment #12) Correction — step 7 should be: Restart the atomic-openshift-node service on all masters and nodes, one by one. A scripted sketch of the full sequence follows.
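Putting comment #12 and the correction together, the per-host sequence might be scripted roughly as follows. This is only a sketch of the order described above, not an official script: the host lists, SSH access, and config paths are assumptions.

```bash
#!/bin/bash
# Hedged sketch of the restart order from comments #12/#13.
# MASTERS and NODES are placeholder host lists.
MASTERS="master1 master2 master3"
NODES="node1 node2"

# 1) + 2) Before running the restarts below, set
#    networkPluginName: redhat/openshift-ovs-networkpolicy
#    in /etc/origin/master/master-config.yaml on all masters and in
#    /etc/origin/node/node-config.yaml on all masters and nodes.

# 3) Restart the master API and controllers, one master at a time.
for h in $MASTERS; do
  ssh "$h" 'systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers'
done

# 4) Restart the node service on all masters and nodes, one by one.
for h in $MASTERS $NODES; do
  ssh "$h" 'systemctl restart atomic-openshift-node'
done

# 5) Restart openvswitch everywhere to flush the old ovs-subnet rules.
for h in $MASTERS $NODES; do
  ssh "$h" 'systemctl restart openvswitch'
done

# 6) + 7) Repeat the API/controller and node restarts so the node
#    processes repopulate OVS with the networkpolicy rules.
for h in $MASTERS; do
  ssh "$h" 'systemctl restart atomic-openshift-master-api atomic-openshift-master-controllers'
done
for h in $MASTERS $NODES; do
  ssh "$h" 'systemctl restart atomic-openshift-node'
done
```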
Fixed by docs PR https://github.com/openshift/openshift-docs/pull/9638

Commit pushed to master at https://github.com/openshift/openshift-docs
https://github.com/openshift/openshift-docs/commit/8eeb262175dca99706acd54f529ec8985c036913

Fix the OpenShift SDN migration steps

We need to restart openvswitch to clean out the rules before we restart the node processes.

Fixes bug 1569244 (https://bugzilla.redhat.com/show_bug.cgi?id=1569244)

The docs changes LGTM; verified this bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

The *Migrating Between SDN Plug-ins* section in the 3.9 [1] and 3.10 [2] docs was changed last May per https://bugzilla.redhat.com/show_bug.cgi?id=1569244#c13. The 3.7 [3] docs were not changed, though the issue was originally raised against 3.7. @misalunk - What version was your customer using? Please confirm whether the 3.7 docs need this change.

[1] https://docs.openshift.com/container-platform/3.9/install_config/configuring_sdn.html#migrating-between-sdn-plugins
[2] https://docs.openshift.com/container-platform/3.10/install_config/configuring_sdn.html#migrating-between-sdn-plugins
[3] https://docs.openshift.com/container-platform/3.7/install_config/configuring_sdn.html#migrating-between-sdn-plugins

OCP 3.7-3.10 has reached the end of full support [1]. Closing this BZ as WONTFIX. If there is a customer case to be attached with a valid support exception and we still need a fix here, please post those details and reopen.

[1] https://access.redhat.com/support/policy/updates/openshift

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.