Bug 1677788 - Deleting clusternetwork is possible / does not automatically recreate
Summary: Deleting clusternetwork is possible / does not automatically recreate
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Ricardo Carrillo Cruz
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1664187
Reported: 2019-02-15 21:46 UTC by Steven Walter
Modified: 2020-01-09 18:39 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-22 18:16:55 UTC
Target Upstream Version:
Embargoed:



Description Steven Walter 2019-02-15 21:46:37 UTC
Description of problem:
It's possible to delete the clusternetwork object. It does not automatically recreate itself.
Since this can cause new pods to fail to start up, if important pods are then deleted, the cluster becomes unresponsive.

I discovered this when poking around, trying to change the network plugin. Editing the default networkconfigs.networkoperator.openshift.io alone didn't seem to do it. As per the pastebins in [1], I ran:
oc delete configmap applied-defaults -n openshift-network-operator

This also did not result in the network plugin changing. (Not sure why deleting a configmap would do that, but hey, I'm willing to try anything once.)

So then I deleted the clusternetwork to see whether OCP would rebuild it. It doesn't. This comes with the side effect that pod networking breaks!

[1] https://mojo.redhat.com/docs/DOC-1185646



Version-Release number of selected component (if applicable):
4.0 HTB

How reproducible:
Easily

Steps to Reproduce:

$ oc edit networkconfigs.networkoperator.openshift.io
networkconfig "default" edited
$ oc get clusternetwork
NAME      NETWORK         HOST SUBNET LENGTH   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14   9                    172.30.0.0/16     redhat/openshift-ovs-networkpolicy
$ oc project openshift-cluster-network-operator
Now using project "openshift-cluster-network-operator" on server "https://stwalter-g5corp-api.rhcee.support:6443".
$ oc get cm
NAME                       DATA      AGE
applied-default            1         1h
cluster-network-operator   0         1h
$ oc delete cm applied-default
configmap "applied-default" deleted
$ oc get clusternetwork
NAME      NETWORK         HOST SUBNET LENGTH   SERVICE NETWORK   PLUGIN NAME
default   10.128.0.0/14   9                    172.30.0.0/16     redhat/openshift-ovs-networkpolicy

$ oc delete clusternetwork default
clusternetwork "default" deleted
$ oc get clusternetwork
No resources found.
$ oc get networkconfigs.networkoperator.openshift.io
NAME      KIND
default   NetworkConfig.v1.networkoperator.openshift.io
$ oc get networkconfigs.networkoperator.openshift.io

NAME      KIND
default   NetworkConfig.v1.networkoperator.openshift.io

$ oc get netnamespace
NAME                                         NETID
default                                      0
kube-public                                  15255683
kube-system                                  3130926
. . .

$ oc get pod -n openshift-sdn
NAME                   READY     STATUS    RESTARTS   AGE
ovs-69vj8              1/1       Running   0          1h
. . .

$ oc delete pod --all -n openshift-sdn
pod "ovs-69vj8" deleted
pod "ovs-bzzqr" deleted
pod "ovs-c8q69" deleted
. . .
$ oc get pod -n openshift-sdn
NAME                   READY     STATUS              RESTARTS   AGE
ovs-69vj8              1/1       Terminating         0          1h
ovs-bzzqr              1/1       Terminating         0          1h
ovs-c8q69              0/1       Terminating         0          56m
ovs-gl8jh              0/1       Pending             0          0s
ovs-nj7px              0/1       ContainerCreating   0          0s
ovs-p9qdf              1/1       Running             0          3s
sdn-bk5qx              0/1       CrashLoopBackOff    1          4s
sdn-controller-cq4hm   1/1       Terminating         0          1h
sdn-controller-ngvkf   1/1       Terminating         1          1h
sdn-controller-q6smm   1/1       Terminating         0          1h
sdn-dl562              0/1       Terminating         0          56m
sdn-ft2fh              0/1       CrashLoopBackOff    1          4s
sdn-jr57r              0/1       CrashLoopBackOff    1          8s
sdn-l7gn7              0/1       CrashLoopBackOff    1          5s
sdn-tqcgt              0/1       Error               1          6s
$ oc get pod -n openshift-sdn
oc pro^C
$ oc project openshift-sdn
^C
$ oc get node
Error from server (ServerTimeout): the server cannot complete the requested operation at this time, try again later (get nodes)
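
The transcript above shows the cluster becoming unresponsive only after the openshift-sdn pods were deleted while the clusternetwork object was already gone. A minimal sketch of a guard against that sequence follows. The function names are hypothetical and not part of any tool; the oc invocations are the ones used in this report, plus `-o name` to keep the existence check quiet.

```shell
# Sketch only: refuse to restart the SDN pods when the clusternetwork object is missing.
# Function names are hypothetical; they wrap the oc commands shown in this report.

clusternetwork_exists() {
    # "oc get" exits non-zero when the named object cannot be fetched.
    oc get clusternetwork default -o name >/dev/null 2>&1
}

restart_sdn_pods_if_safe() {
    if clusternetwork_exists; then
        oc delete pod --all -n openshift-sdn
    else
        echo "clusternetwork 'default' not found; not deleting openshift-sdn pods" >&2
        return 1
    fi
}
```

Calling `restart_sdn_pods_if_safe` in place of the bare `oc delete pod --all -n openshift-sdn` would have stopped at the check in the scenario reported here. It does not repair anything; it only avoids making the situation worse.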

Comment 1 Casey Callendrello 2019-02-18 13:08:02 UTC
Yup, we don't support changing the network mode on a running cluster, as you've found out. We should definitely look into reconciling cluster networks, but I suspect this has been this way since 3.0.

Comment 3 Casey Callendrello 2019-02-18 17:46:39 UTC
Migrating between SDN providers or openshift-sdn modes won't work in 4.0. We intend to add it to the operator, but it's just a matter of time / prioritization.

Comment 4 Eric Rich 2019-02-18 19:11:09 UTC
We need to block people from being able to delete the network until such an operation is possible then.

Comment 5 Casey Callendrello 2019-04-05 16:12:42 UTC
Only the admin user can delete this object... they can delete lots of objects that will break the cluster. I'm not sure that this is a bug.
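
One way to check who is actually allowed to delete the object is `oc auth can-i`, which prints yes or no for the given verb and resource. The wrapper below is a sketch: the function name is hypothetical, and the resource is referred to by the same short name used elsewhere in this report.

```shell
# Sketch only: ask the API server whether a user may delete clusternetwork objects.
# The function name is hypothetical; "oc auth can-i" and the global --as flag are existing oc features.

can_delete_clusternetwork() {
    # Optional first argument: a user name to impersonate.
    if [ -n "${1:-}" ]; then
        oc auth can-i delete clusternetwork --as="$1"
    else
        oc auth can-i delete clusternetwork
    fi
}
```

Run without arguments it checks the current user; with an argument such as a regular developer account it checks that user instead, which requires impersonation rights.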

Comment 6 Steven Walter 2019-04-05 17:58:52 UTC
Sure, but do we *want* people to be able to delete this -- even administrators? Is there a legitimate use case for doing so? (If there is, that's fine; we just need the steps to recreate it documented.)
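
Until recreation steps are documented, one possible precaution is to save a copy of the object before experimenting and recreate it from that copy. The sketch below assumes this approach; the function names and the default file name are hypothetical. It strips server-populated metadata fields with sed on the assumption that they sit at the usual two-space indentation under metadata, since a saved resourceVersion is rejected on create. Whether recreating the object alone is enough to restore pod networking is not established in this report.

```shell
# Sketch only: save the clusternetwork object to a file and recreate it later.
# Function names and the default file name are hypothetical.

backup_clusternetwork() {
    local backup_file="${1:-clusternetwork-default.yaml}"
    local raw
    # Stop early if the object cannot be fetched, so no empty backup is written.
    raw="$(oc get clusternetwork default -o yaml)" || return 1
    # Drop server-populated metadata so the file can be passed to "oc create" later.
    printf '%s\n' "$raw" \
        | sed -E '/^  (resourceVersion|uid|selfLink|creationTimestamp):/d' \
        > "$backup_file"
}

restore_clusternetwork() {
    local backup_file="${1:-clusternetwork-default.yaml}"
    oc create -f "$backup_file"
}
```

Usage would be `backup_clusternetwork` before any change, and `restore_clusternetwork` if `oc get clusternetwork` later reports no resources.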

