Bug 1677788

Summary:	Deleting clusternetwork is possible / does not automatically recreate
Product:	OpenShift Container Platform	Reporter:	Steven Walter <stwalter>
Component:	Networking	Assignee:	Ricardo Carrillo Cruz <ricarril>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	medium
Priority:	medium	CC:	aos-bugs, bbennett, cdc, erich, nagrawal, piqin, ricarril
Version:	4.1.0	Keywords:	Reopened
Target Milestone:	---
Target Release:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-11-22 18:16:55 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1664187

Description Steven Walter 2019-02-15 21:46:37 UTC

Description of problem:
It's possible to delete the clusternetwork object, and it is not automatically recreated.
Once it is gone, new pods fail to start, so if important pods are then deleted, the cluster becomes unresponsive.

I discovered this when poking around, trying to change the network plugin. Editing the default networkconfigs.networkoperator.openshift.io alone didn't seem to do it. As per the pastebins in [1], I ran:
oc delete configmap applied-default -n openshift-cluster-network-operator

This also did not result in the network plugin changing. (Not sure why deleting a configmap would do that, but hey, I'm willing to try anything once.)

So then I deleted the clusternetwork to see whether OCP would rebuild it. It doesn't. This comes with the side effect that pod networking breaks!

[1] https://mojo.redhat.com/docs/DOC-1185646



Version-Release number of selected component (if applicable):
4.0 HTB

How reproducible:
Easily

Steps to Reproduce:

$ oc edit networkconfigs.networkoperator.openshift.io
networkconfig "default" edited
$ oc get clusternetwork
NAME      NETWORK         HOST SUBNET LENGTH   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14   9                    172.30.0.0/16     redhat/openshift-ovs-networkpolicy
$ oc project openshift-cluster-network-operator
Now using project "openshift-cluster-network-operator" on server "https://stwalter-g5corp-api.rhcee.support:6443".
$ oc get cm
NAME                       DATA      AGE
applied-default            1         1h
cluster-network-operator   0         1h
$ oc delete cm applied-default
configmap "applied-default" deleted
$ oc get clusternetwork
NAME      NETWORK         HOST SUBNET LENGTH   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14   9                    172.30.0.0/16     redhat/openshift-ovs-networkpolicy

$ oc delete clusternetwork default
clusternetwork "default" deleted
$ oc get clusternetwork
No resources found.
$ oc get networkconfigs.networkoperator.openshift.io
NAME      KIND
default   NetworkConfig.v1.networkoperator.openshift.io

$ oc get netnamespace
NAME                                         NETID
default                                      0
kube-public                                  15255683
kube-system                                  3130926
. . .

$ oc get pod -n openshift-sdn
NAME                   READY     STATUS    RESTARTS   AGE
ovs-69vj8              1/1       Running   0          1h
. . .

$ oc delete pod --all -n openshift-sdn
pod "ovs-69vj8" deleted
pod "ovs-bzzqr" deleted
pod "ovs-c8q69" deleted
. . .
$ oc get pod -n openshift-sdn
NAME                   READY     STATUS              RESTARTS   AGE
ovs-69vj8              1/1       Terminating         0          1h
ovs-bzzqr              1/1       Terminating         0          1h
ovs-c8q69              0/1       Terminating         0          56m
ovs-gl8jh              0/1       Pending             0          0s
ovs-nj7px              0/1       ContainerCreating   0          0s
ovs-p9qdf              1/1       Running             0          3s
sdn-bk5qx              0/1       CrashLoopBackOff    1          4s
sdn-controller-cq4hm   1/1       Terminating         0          1h
sdn-controller-ngvkf   1/1       Terminating         1          1h
sdn-controller-q6smm   1/1       Terminating         0          1h
sdn-dl562              0/1       Terminating         0          56m
sdn-ft2fh              0/1       CrashLoopBackOff    1          4s
sdn-jr57r              0/1       CrashLoopBackOff    1          8s
sdn-l7gn7              0/1       CrashLoopBackOff    1          5s
sdn-tqcgt              0/1       Error               1          6s
$ oc get pod -n openshift-sdn
oc pro^C
$ oc project openshift-sdn
^C
$ oc get node
Error from server (ServerTimeout): the server cannot complete the requested operation at this time, try again later (get nodes)
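
In the steps above, the cluster only became unresponsive after the SDN pods were restarted while the clusternetwork was missing. A minimal sketch of a guard against that sequence follows. The function names and the OC variable are hypothetical and not part of any OpenShift tooling; the sketch assumes only that `oc get clusternetwork <name>` exits non-zero when the named object does not exist.

```shell
#!/bin/sh
# Sketch only. OC may be pointed at a stub for dry runs; it defaults to the real client.
OC="${OC:-oc}"

# Succeeds only if the named ClusterNetwork (default: "default") exists.
clusternetwork_exists() {
    "$OC" get clusternetwork "${1:-default}" >/dev/null 2>&1
}

# Hypothetical guard: refuse to restart the SDN pods when the object is missing.
safe_restart_sdn() {
    if clusternetwork_exists default; then
        "$OC" delete pod --all -n openshift-sdn
    else
        echo "clusternetwork/default is missing; not restarting SDN pods" >&2
        return 1
    fi
}
```

Calling `safe_restart_sdn` in place of the bare `oc delete pod --all -n openshift-sdn` would have stopped before the pods were deleted in this reproduction. It does not address the missing object itself.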

Comment 1 Casey Callendrello 2019-02-18 13:08:02 UTC

Yup, we don't support changing the network mode on a running cluster, as you've found out. We should definitely look into reconciling cluster networks, but I suspect it has been this way since 3.0.

Comment 2 Steven Walter 2019-02-18 17:30:24 UTC

Hi,
Is that true? I've found: https://docs.openshift.com/container-platform/3.9/install_config/configuring_sdn.html#migrating-between-sdn-plugins

Comment 3 Casey Callendrello 2019-02-18 17:46:39 UTC

Migrating between SDN providers or openshift-sdn modes won't work in 4.0. We intend to add it to the operator, but it's just a matter of time / prioritization.

Comment 4 Eric Rich 2019-02-18 19:11:09 UTC

Then we need to block people from deleting the cluster network until such an operation is possible.

Comment 5 Casey Callendrello 2019-04-05 16:12:42 UTC

Only the admin user can delete this object... they can delete lots of objects that will break the cluster. I'm not sure that this is a bug.

Comment 6 Steven Walter 2019-04-05 17:58:52 UTC

Sure, but do we *want* people to be able to delete this -- even administrators? Is there a legitimate use case for doing so? (If there is, that's fine; we just need the steps to recreate it documented.)
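
As a starting point for such steps, here is a sketch of saving the object ahead of time and recreating it from the saved copy. It is an assumption, not something confirmed in this report, that recreating the clusternetwork from a backup is enough to restore pod networking. The function names, the OC variable, and the list of server-populated metadata fields that get stripped are also illustrative assumptions.

```shell
#!/bin/sh
# Sketch only. OC may be pointed at a stub for dry runs; it defaults to the real client.
OC="${OC:-oc}"

# Save the ClusterNetwork named $1 to file $2, dropping server-populated
# metadata fields so the copy can later be passed to `oc create -f`.
backup_clusternetwork() {
    name="$1"
    out="$2"
    "$OC" get clusternetwork "$name" -o yaml \
        | sed -E '/^  (resourceVersion|uid|selfLink|creationTimestamp|generation):/d' \
        > "$out"
}

# Recreate the object from a saved copy (assumes the copy is still valid
# for the running cluster).
restore_clusternetwork() {
    "$OC" create -f "$1"
}
```

For example, `backup_clusternetwork default clusternetwork-default.yaml` would be run before any experiment, and `restore_clusternetwork clusternetwork-default.yaml` after an accidental delete, provided the API server is still reachable.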