Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-01-015139

How reproducible:
Always

Steps to Reproduce:
1. Set up a cluster with 3 masters and 2 workers
2. Allow migration:
   oc annotate Network.operator.openshift.io cluster "networkoperator.openshift.io/network-migration"=""
3. Update networkType in network.config.openshift.io cluster:
   oc patch Network.config.openshift.io cluster --type='merge' --patch '{"spec":{"networkType":"OVNKubernetes"}}'
4. Wait more than 10 minutes
5. Check the openshift-sdn namespace

Actual results:

huiran-mac:script hrwang$ oc get ns openshift-sdn
NAME            STATUS        AGE
openshift-sdn   Terminating   31m

huiran-mac:script hrwang$ oc get pods -n openshift-sdn
No resources found in openshift-sdn namespace.

huiran-mac:script hrwang$ oc get ns openshift-sdn -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/description: OpenShift SDN components
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c5,c0
    openshift.io/sa.scc.supplemental-groups: 1000020000/10000
    openshift.io/sa.scc.uid-range: 1000020000/10000
  creationTimestamp: "2020-04-02T07:18:37Z"
  deletionTimestamp: "2020-04-02T07:28:11Z"
  labels:
    name: openshift-sdn
    olm.operatorgroup/openshift-monitoring.openshift-cluster-monitoring: ""
    openshift.io/cluster-monitoring: "true"
    openshift.io/run-level: "0"
  name: openshift-sdn
  ownerReferences:
  - apiVersion: operator.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Network
    name: cluster
    uid: ebd928ef-6878-41b3-a2c0-950044d93b2b
  resourceVersion: "51616"
  selfLink: /api/v1/namespaces/openshift-sdn
  uid: cd1d36b0-f7d9-496d-9c04-652d91e552d7
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2020-04-02T07:28:22Z"
    message: 'Discovery failed for some groups, 13 failing: unable to retrieve the complete list of server APIs: apps.openshift.io/v1: the server is currently unable to handle the request, authorization.openshift.io/v1: the server is currently unable to handle the request, build.openshift.io/v1: the server is currently unable to handle the request, image.openshift.io/v1: the server is currently unable to handle the request, metrics.k8s.io/v1beta1: the server is currently unable to handle the request, oauth.openshift.io/v1: the server is currently unable to handle the request, packages.operators.coreos.com/v1: the server is currently unable to handle the request, project.openshift.io/v1: the server is currently unable to handle the request, quota.openshift.io/v1: the server is currently unable to handle the request, route.openshift.io/v1: the server is currently unable to handle the request, security.openshift.io/v1: the server is currently unable to handle the request, template.openshift.io/v1: the server is currently unable to handle the request, user.openshift.io/v1: the server is currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2020-04-02T07:28:26Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2020-04-02T07:31:13Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2020-04-02T07:31:12Z"
    message: All content successfully removed
    reason: ContentRemoved
    status: "False"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2020-04-02T07:28:26Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

Expected results:
The openshift-sdn namespace should be terminated successfully within 5-10 minutes.

Additional info:
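For reference, the reproduction can be scripted by polling for the namespace to disappear rather than waiting a fixed 10 minutes. This is only a sketch assuming a logged-in `oc` session; the helper name, retry count, and sleep interval are my own choices, not from any product documentation:

```shell
# wait_for_ns_gone NAMESPACE MAX_TRIES SLEEP_SECONDS
# Returns 0 once `oc get ns NAMESPACE` stops finding the namespace,
# 1 if it is still present after MAX_TRIES polls.
wait_for_ns_gone() {
  ns=$1; tries=$2; pause=$3
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Namespace no longer listed -> deletion finished
    oc get ns "$ns" >/dev/null 2>&1 || return 0
    sleep "$pause"
    i=$((i + 1))
  done
  return 1
}

# Usage, after steps 2-3 above have triggered the migration:
#   wait_for_ns_gone openshift-sdn 60 30   # poll for up to ~30 minutes
```

In the run above this would never return 0: the namespace stays in Terminating indefinitely.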
The namespace status contains the message 'Discovery failed for some groups'. The APIs that should be served by the openshift-apiserver were not accessible because the OpenShift SDN pods had already been deleted, which left the namespace stuck in the 'Terminating' state. However, the namespace can be deleted successfully after the cluster reboots, once the cluster network is back to normal, so normally this is not an issue. It does cause trouble, though, when the OVN-Kubernetes network does not work after the reboot and users want to roll back to OpenShift SDN: none of the openshift-sdn resources can be created, because the namespace is still 'Terminating'. To solve this issue, I think we have 2 options:
1. In CNO, do not delete the namespace when doing the migration, so that we don't need to recreate it during rollback.
2. In the migration procedure document, ask users to check for and force delete the namespace, if needed, before executing the rollback.
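For option 2, the usual way to force a stuck namespace through deletion is to clear its `spec.finalizers` and push the result through the `/finalize` subresource. A minimal sketch: the `strip_ns_finalizers` helper below is a hypothetical name of mine, and the `oc replace --raw` usage assumes a logged-in session with cluster-admin rights:

```shell
# strip_ns_finalizers: read a Namespace object as JSON on stdin,
# emit the same object with spec.finalizers emptied.
strip_ns_finalizers() {
  python3 -c 'import json, sys
ns = json.load(sys.stdin)
ns.setdefault("spec", {})["finalizers"] = []
json.dump(ns, sys.stdout)'
}

# Usage against a live cluster (assumption: logged-in `oc` session,
# cluster-admin privileges):
#   oc get ns openshift-sdn -o json | strip_ns_finalizers > /tmp/sdn-ns.json
#   oc replace --raw /api/v1/namespaces/openshift-sdn/finalize -f /tmp/sdn-ns.json
```

Note that forcing the finalizer off skips whatever cleanup it was guarding, which is acceptable here only because the content conditions already report everything removed.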
I'm inclined toward option 2.
We definitely want to delete the openshift-sdn namespace *eventually*. Maybe it makes sense to figure out how to tweak things so that it doesn't get deleted until after the cluster is back up and running with ovn-kubernetes. If we can't do that then I think "the user has to manually delete the namespace if they have to roll back" is better than "the user has to manually delete the namespace even on success (if they don't want a stray unused namespace lying around forever)". (So, 2.)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409