Bug 1820472 - [Migration] After migrating from SDN to OVN, namespace "openshift-sdn" gets stuck in Terminating status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Peng Liu
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-03 07:30 UTC by huirwang
Modified: 2020-08-04 18:07 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 18:07:38 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 633 0 None closed Bug 1820472: Finalize namespace when it is stuck in 'Terminating' after migration 2020-11-03 18:49:45 UTC
Github openshift cluster-network-operator pull 641 0 None closed Bug 1820472: Not delete namespace object when cleanup not rended objects 2020-11-03 18:49:45 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-08-04 18:07:39 UTC

Description huirwang 2020-04-03 07:30:59 UTC
Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-01-015139


How reproducible:
Always

Steps to Reproduce:
1. Set up a cluster with 3 masters and 2 workers

2. Allow migration:
oc annotate Network.operator.openshift.io cluster "networkoperator.openshift.io/network-migration"=""

3. Update networkType in network.config.openshift.io cluster
oc patch Network.config.openshift.io cluster --type='merge' --patch '{"spec":{"networkType":"OVNKubernetes"}}'

4. Wait more than 10 minutes

5. Check the openshift-sdn namespace
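
As a convenience (not part of the original steps), the namespace status can be watched during the migration with:

oc get namespace openshift-sdn --watch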

Actual results:

huiran-mac:script hrwang$ oc get ns openshift-sdn
NAME      STATUS    AGE
openshift-sdn  Terminating  31m
huiran-mac:script hrwang$ oc get pods -n openshift-sdn
No resources found in openshift-sdn namespace.

huiran-mac:script hrwang$ oc get ns openshift-sdn -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/description: OpenShift SDN components
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c5,c0
    openshift.io/sa.scc.supplemental-groups: 1000020000/10000
    openshift.io/sa.scc.uid-range: 1000020000/10000
  creationTimestamp: "2020-04-02T07:18:37Z"
  deletionTimestamp: "2020-04-02T07:28:11Z"
  labels:
    name: openshift-sdn
    olm.operatorgroup/openshift-monitoring.openshift-cluster-monitoring: ""
    openshift.io/cluster-monitoring: "true"
    openshift.io/run-level: "0"
  name: openshift-sdn
  ownerReferences:
  - apiVersion: operator.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Network
    name: cluster
    uid: ebd928ef-6878-41b3-a2c0-950044d93b2b
  resourceVersion: "51616"
  selfLink: /api/v1/namespaces/openshift-sdn
  uid: cd1d36b0-f7d9-496d-9c04-652d91e552d7
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2020-04-02T07:28:22Z"
    message: 'Discovery failed for some groups, 13 failing: unable to retrieve the
      complete list of server APIs: apps.openshift.io/v1: the server is currently
      unable to handle the request, authorization.openshift.io/v1: the server is currently
      unable to handle the request, build.openshift.io/v1: the server is currently
      unable to handle the request, image.openshift.io/v1: the server is currently
      unable to handle the request, metrics.k8s.io/v1beta1: the server is currently
      unable to handle the request, oauth.openshift.io/v1: the server is currently
      unable to handle the request, packages.operators.coreos.com/v1: the server is
      currently unable to handle the request, project.openshift.io/v1: the server
      is currently unable to handle the request, quota.openshift.io/v1: the server
      is currently unable to handle the request, route.openshift.io/v1: the server
      is currently unable to handle the request, security.openshift.io/v1: the server
      is currently unable to handle the request, template.openshift.io/v1: the server
      is currently unable to handle the request, user.openshift.io/v1: the server
      is currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2020-04-02T07:28:26Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2020-04-02T07:31:13Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2020-04-02T07:31:12Z"
    message: All content successfully removed
    reason: ContentRemoved
    status: "False"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2020-04-02T07:28:26Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating
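
The DiscoveryFailed condition above lists API groups served by aggregated API servers, chiefly the openshift-apiserver. As a diagnostic aside (not output captured in this report), the unavailable aggregated APIs can be listed with:

oc get apiservices | grep False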

Expected results:

The openshift-sdn namespace should be deleted successfully within 5 to 10 minutes

Additional info:

Comment 1 Peng Liu 2020-04-03 11:39:07 UTC
The namespace status contains the message 'Discovery failed for some groups'. The listed API groups are served by the openshift-apiserver, which was not reachable because the OpenShift SDN pods had already been deleted. That is what left the namespace stuck in the 'Terminating' state. However, the namespace should be deleted successfully after the cluster reboots, once the cluster network is back to normal.

So normally it should not be an issue. But it does cause trouble when the OVN-Kubernetes network does not work after the cluster reboot and users want to roll back to OpenShift SDN: none of the openshift-sdn resources can be created, because the namespace is still 'Terminating'. To solve this issue, I think we have 2 options:
1. In CNO, do not delete the namespace when doing the migration, so that we don't need to recreate it during rollback.
2. In the migration procedure document, ask users to check the namespace and force delete it if needed before executing the rollback (as sketched below).
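
For reference, the usual way to force-finalize a namespace stuck like this is to clear its spec.finalizers through the finalize subresource. This is a generic Kubernetes sketch, not a command taken from this bug, and it assumes jq is available on the client:

# Clear the 'kubernetes' finalizer shown in the YAML above via the finalize subresource
oc get namespace openshift-sdn -o json \
  | jq '.spec.finalizers = []' \
  | oc replace --raw /api/v1/namespaces/openshift-sdn/finalize -f -

Note that force-finalizing skips the normal cleanup guarantees, so it should only be used once it is confirmed that no resources remain in the namespace.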

Comment 2 Ricardo Carrillo Cruz 2020-04-03 11:48:27 UTC
I'm inclined toward option 2.

Comment 3 Dan Winship 2020-04-03 12:34:59 UTC
We definitely want to delete the openshift-sdn namespace *eventually*. Maybe it makes sense to figure out how to tweak things so that it doesn't get deleted until after the cluster is back up and running with ovn-kubernetes.

If we can't do that then I think "the user has to manually delete the namespace if they have to roll back" is better than "the user has to manually delete the namespace even on success (if they don't want a stray unused namespace lying around forever)". (So, 2.)

Comment 8 errata-xmlrpc 2020-08-04 18:07:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

