Bug 2108088

Summary: in v2.0.3 - Provider stuck in addon deleting state even after all consumers are deleted
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: odf-managed-service
Version: 4.10
Status: VERIFIED
Severity: high
Priority: unspecified
Reporter: suchita <sgatfane>
Assignee: Kaustav Majumder <kmajumde>
QA Contact: suchita <sgatfane>
CC: aeyal, dbindra, ebondare, fbalak, kmajumde, odf-bz-bot, sgatfane
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Regression: ---
Doc Type: If docs needed, set a value

Description suchita 2022-07-18 12:29:34 UTC
Description of problem:
Deployer version v2.0.3 implements https://issues.redhat.com/browse/RHSTOR-3353 (prevent uninstallation if storage consumers are present in the provider cluster).
During testing, it was observed that the provider cluster uninstallation gets stuck in "deleting service" status even after all consumers have been deleted.

ocs-osd-controller-manager shows the log:
"INFO	controllers.ManagedOCS	Found OCS storage consumers, cannot proceed with uninstallation"
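The guard that produces this message can be sketched as follows. This is an assumed illustration of the behavior, not the deployer's actual code; the hard-coded count stands in for a live query.

```shell
# Hypothetical sketch of the deployer's uninstall guard: while any
# StorageConsumer CR still exists, refuse to start uninstallation.
# On a live cluster the count would come from something like:
#   oc get storageconsumer -n openshift-storage --no-headers | wc -l
consumer_count=2   # stand-in value for this sketch

if [ "$consumer_count" -gt 0 ]; then
  guard_msg="Found OCS storage consumers, cannot proceed with uninstallation"
else
  guard_msg="no consumers, uninstallation can proceed"
fi
echo "$guard_msg"
```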



Version-Release number of selected component (if applicable):
 oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.4                      NooBaa Operator               4.10.4            mcg-operator.v4.10.3                      Succeeded
ocs-operator.v4.10.4                      OpenShift Container Storage   4.10.4            ocs-operator.v4.10.3                      Succeeded
ocs-osd-deployer.v2.0.3                   OCS OSD Deployer              2.0.3             ocs-osd-deployer.v2.0.2                   Succeeded
odf-csi-addons-operator.v4.10.4           CSI Addons                    4.10.4            odf-csi-addons-operator.v4.10.3           Succeeded
odf-operator.v4.10.4                      OpenShift Data Foundation     4.10.4            odf-operator.v4.10.3                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.422-151be96   Route Monitor Operator        0.1.422-151be96   route-monitor-operator.v0.1.420-b65f47e   Succeeded

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.21   True        False         3h1m    Cluster version is 4.10.21

Deployer images:
    quay.io/openshift/origin-kube-rbac-proxy:4.10.0
    quay.io/osd-addons/ocs-osd-deployer:2.0.3-2


How reproducible:
3/4

Steps to Reproduce:
1. Deploy an appliance-mode provider cluster with 2 consumers.
2. Uninstall both consumers.
3. rosa delete service --id=<cluster_service_id>

Actual results:
ocs-osd-controller-manager logs:
INFO	controllers.ManagedOCS	Found OCS storage consumers, cannot proceed with the uninstallation
The provider addon is stuck in the 'deleting' state, and the cluster uninstall is stuck in 'deleting service' until the manual workaround for uninstallation is applied.
 


Expected results:
The provider cluster should uninstall successfully.


Additional info:
Command output after initiating uninstallation:
$rosa list service
SERVICE_ID                   SERVICE          SERVICE_STATE     CLUSTER_NAME
2C3n2uNkBWPzfrBZCVfqcdazVja  ocs-provider-qe  deleting service  alayani-p17j
$ rosa list addon -c alayani-p17j | grep ocs-provider-qe
ocs-provider-qe             Red Hat OpenShift Data Foundation Managed Service Provider (QE)       deleting
$ rosa list cluster | grep alayani
1tgm0trfact2ed113uoq2c9o96rdek67  alayani-p17j    ready

$ oc get storageconsumer -n openshift-storage
NAME                                                   AGE
storageconsumer-73f24e41-4040-4394-a345-e93a7422a11e   31h
storageconsumer-f4e18f4d-2bb2-4794-b561-8efc6583f09f   31h

Workaround:
Delete storageconsumers

=====After applying the workaround, cluster uninstallation resumed===========
$oc get storageconsumer -n openshift-storage | awk 'NR>1{print $1}' | xargs -t oc delete storageconsumer -n openshift-storage
oc delete storageconsumer -n openshift-storage storageconsumer-f4e18f4d-2bb2-4794-b561-8efc6583f09f
storageconsumer.ocs.openshift.io "storageconsumer-f4e18f4d-2bb2-4794-b561-8efc6583f09f" deleted
storageconsumer.ocs.openshift.io "storageconsumer-73f24e41-4040-4394-a345-e93a7422a11e" deleted


oc logs -f -n openshift-storage ocs-osd-controller-manager-6f67967567-fthw4 -c manager
2022-07-18T16:05:42.815Z	INFO	controllers.ManagedOCS	Found OCS storage consumers, cannot proceed with uninstallation
2022-07-18T16:05:52.821Z	INFO	controllers.ManagedOCS	Starting reconcile for ManagedOCS	{"req.Namespace": "openshift-storage", "req.Name": "managedocs"}
2022-07-18T16:05:52.821Z	INFO	controllers.ManagedOCS	Reconciling onboardingValidationKeySecret
2022-07-18T16:05:52.821Z	INFO	controllers.ManagedOCS	Reconciling StorageCluster
2022-07-18T16:05:52.821Z	INFO	controllers.ManagedOCS	Requested add-on settings	{"size": "20", "enable-mcg": "false"}
2022-07-18T16:05:52.821Z	INFO	controllers.ManagedOCS	Setting storage device set count	{"Current": 5, "New": 5}
2022-07-18T16:05:52.822Z	INFO	controllers.ManagedOCS	Reconciling CSVs
2022-07-18T16:05:52.822Z	INFO	controllers.ManagedOCS	Reconciling alertRelabelConfigSecret
2022-07-18T16:05:52.822Z	INFO	controllers.ManagedOCS	Reconciling kubeRBACConfigMap
2022-07-18T16:05:52.822Z	INFO	controllers.ManagedOCS	Reconciling PrometheusService
2022-07-18T16:05:52.822Z	INFO	controllers.ManagedOCS	Reconciling Prometheus
2022-07-18T16:05:52.832Z	INFO	controllers.ManagedOCS	Reconciling Alertmanager
2022-07-18T16:05:52.832Z	INFO	controllers.ManagedOCS	Reconciling AlertmanagerConfig secret
2022-07-18T16:05:52.832Z	WARN	controllers.ManagedOCS	Customer Email for alert notification is not provided
2022-07-18T16:05:52.839Z	INFO	controllers.ManagedOCS	Reconciling k8sMetricsServiceMonitorAuthSecret
2022-07-18T16:05:52.842Z	INFO	controllers.ManagedOCS	Unable to find v1 grafana-datasources secret
2022-07-18T16:05:52.844Z	INFO	controllers.ManagedOCS	Reconciling k8sMetricsServiceMonitor
2022-07-18T16:05:52.845Z	INFO	controllers.ManagedOCS	reconciling monitoring resources
2022-07-18T16:05:52.908Z	INFO	controllers.ManagedOCS	Reconciling DMS Prometheus Rule
2022-07-18T16:05:52.908Z	INFO	controllers.ManagedOCS	Reconciling OCSInitialization
2022-07-18T16:05:52.908Z	INFO	controllers.ManagedOCS	reconciling PrometheusProxyNetworkPolicy resources
2022-07-18T16:05:52.908Z	INFO	controllers.ManagedOCS	Non converged deployment, skipping reconcile for egress network policy
2022-07-18T16:05:52.908Z	INFO	controllers.ManagedOCS	starting OCS uninstallation - deleting managedocs
2022-07-18T16:05:52.919Z	ERROR	controller-runtime.manager.controller.managedocs	Reconciler error	{"reconciler group": "ocs.openshift.io", "reconciler kind": "ManagedOCS", "name": "managedocs", "namespace": "openshift-storage", "error": "Operation cannot be fulfilled on managedocs.ocs.openshift.io \"managedocs\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/tmp/go/ocs-osd-deployer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/tmp/go/ocs-osd-deployer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
2022-07-18T16:05:52.919Z	INFO	controllers.ManagedOCS	Starting reconcile for ManagedOCS	{"req.Namespace": "openshift-storage", "req.Name": "managedocs"}
2022-07-18T16:05:52.919Z	INFO	controllers.ManagedOCS	deleting storagecluster
2022-07-18T16:05:52.929Z	INFO	controllers.ManagedOCS	deleting storageSystems
2022-07-18T16:05:53.054Z	INFO	controllers.ManagedOCS	Starting reconcile for ManagedOCS	{"req.Namespace": "openshift-storage", "req.Name": "managedocs"}
2022-07-18T16:05:53.055Z	INFO	controllers.ManagedOCS	deleting storagecluster
2022-07-18T16:05:53.143Z	INFO	controllers.ManagedOCS	deleting storageSystems
2022-07-18T16:05:53.336Z	INFO	controllers.ManagedOCS	Starting reconcile for ManagedOCS	{"req.Namespace": "openshift-storage", "req.Name": "managedocs"}
2022-07-18T16:05:53.336Z	INFO	controllers.ManagedOCS	deleting storagecluster
2022-07-18T16:05:53.345Z	INFO	controllers.ManagedOCS	deleting storageSystems
2022-07-18T16:05:57.502Z	INFO	controllers.ManagedOCS	Starting reconcile for ManagedOCS	{"req.Namespace": "openshift-storage", "req.Name": "managedocs"}
2022-07-18T16:05:57.502Z	INFO	controllers.ManagedOCS	deleting OCS CSV
2022-07-18T16:05:57.549Z	INFO	controllers.ManagedOCS	removing finalizer from the ManagedOCS resource
2022-07-18T16:05:57.577Z	INFO	controllers.ManagedOCS	finallizer removed successfully
2022-07-18T16:05:57.596Z	ERROR	controller-runtime.manager.controller.managedocs	Reconciler error	{"reconciler group": "ocs.openshift.io", "reconciler kind": "ManagedOCS", "name": "managedocs", "namespace": "openshift-storage", "error": "Operation cannot be fulfilled on managedocs.ocs.openshift.io \"managedocs\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/ocs.openshift.io/managedocs/openshift-storage/managedocs, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 170869a8-17dc-40ea-8402-6d3101c73372, UID in object meta: "}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/tmp/go/ocs-osd-deployer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/tmp/go/ocs-osd-deployer/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
2022-07-18T16:05:57.596Z	INFO	controllers.ManagedOCS	Starting reconcile for ManagedOCS	{"req.Namespace": "openshift-storage", "req.Name": "managedocs"}
2022-07-18T16:05:57.596Z	WARN	controllers.ManagedOCS	ManagedOCS resource not found
2022-07-18T16:05:57.596Z	INFO	controllers.ManagedOCS	deleting deployer csv
2022-07-18T16:05:57.621Z	INFO	controllers.ManagedOCS	Deployer csv removed successfully
2022-07-18T16:05:57.621Z	INFO	controllers.ManagedOCS	Starting reconcile for ManagedOCS	{"req.Namespace": "openshift-storage", "req.Name": "managedocs"}
2022-07-18T16:05:57.621Z	WARN	controllers.ManagedOCS	ManagedOCS resource not found
2022-07-18T16:05:57.621Z	INFO	controllers.ManagedOCS	deleting deployer csv
2022-07-18T16:05:57.645Z	INFO	controllers.ManagedOCS	Deployer csv removed successfully
2022-07-18T16:05:57.724Z	INFO	controllers.ManagedOCS	Starting reconcile for ManagedOCS	{"req.Namespace": "openshift-storage", "req.Name": "managedocs"}
2022-07-18T16:05:57.724Z	WARN	controllers.ManagedOCS	ManagedOCS resource not found
2022-07-18T16:05:57.724Z	INFO	controllers.ManagedOCS	deleting deployer csv
2022-07-18T16:05:57.725Z	INFO	controllers.ManagedOCS	Deployer csv removed successfully
[... the same "Starting reconcile for ManagedOCS" / "ManagedOCS resource not found" / "deleting deployer csv" / "Deployer csv removed successfully" cycle repeats through 2022-07-18T16:05:58.000Z ...]
2022-07-18T16:05:58.140Z	INFO	controller-runtime.manager.controller.managedocs	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "ocs.openshift.io", "reconciler kind": "ManagedOCS"}
2022-07-18T16:05:58.140Z	INFO	controller-runtime.manager.controller.managedocs	All workers finished	{"reconciler group": "ocs.openshift.io", "reconciler kind": "ManagedOCS"}
=====================================================

The rosa cluster and service get deleted after some time.

Comment 1 suchita 2022-07-18 16:39:59 UTC
The root cause looks like leftover storage consumers in the storage consumer list even after the consumers were offboarded.
This was already reported in https://bugzilla.redhat.com/show_bug.cgi?id=2069389

Comment 2 Kaustav Majumder 2022-07-19 11:36:51 UTC
Looking at this, the deployer is working as expected. I think a bug needs to be raised on the product side to remove storage consumers after offboarding.
Ohad, WDYT?

Comment 3 Kaustav Majumder 2022-07-19 13:51:42 UTC
@sgatfane Can you provide the StorageConsumer CR YAML after the consumer has offboarded?

It seems that after offboarding, the StorageConsumer CR is marked for deletion but does not get deleted immediately.
We might have to update our uninstallation logic to take deletionTimestamp into account.
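The idea of treating consumers that already carry a deletionTimestamp as non-blocking can be illustrated with sample data. The rows below are illustrative, not from a live cluster; on a cluster they would come from something like `oc get storageconsumer -n openshift-storage -o custom-columns=NAME:.metadata.name,DELETION:.metadata.deletionTimestamp --no-headers`.

```shell
# Sample "NAME DELETION" rows; "<none>" means no deletionTimestamp is set.
consumers='storageconsumer-aaa <none>
storageconsumer-bbb 2022-07-18T16:00:00Z'

# Under the proposed logic, only consumers with no deletionTimestamp would
# block uninstallation; consumers already marked for deletion are ignored.
blocking=$(printf '%s\n' "$consumers" | awk '$2 == "<none>"' | wc -l)
echo "blocking consumers: $blocking"
```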

Comment 4 Ohad 2022-07-26 05:54:43 UTC
@kmajumde

A StorageConsumer will not get deleted until all of the rook resources owned by it are deleted.
The underlying rook resources might not get deleted because of an unclean removal of PVs/PVCs on the consumer cluster side.
Let's try to identify which rook resource is stuck deleting and address that.
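A filter like the following could help spot which rook resources are stuck. The resource kinds and rows are illustrative; on a cluster the rows would come from an `oc get <kind> -n openshift-storage -o custom-columns=KIND:.kind,NAME:.metadata.name,DELETION:.metadata.deletionTimestamp --no-headers` query over the rook CR kinds.

```shell
# Sample "KIND NAME DELETION" rows standing in for live cluster output.
rows='CephClient client-aaa 2022-07-18T16:00:00Z
CephBlockPool pool-bbb <none>'

# A resource that has a deletionTimestamp but is still listed is marked for
# deletion yet blocked (typically by a finalizer) - a candidate culprit.
stuck=$(printf '%s\n' "$rows" | awk '$3 != "<none>" {print $1 "/" $2}')
echo "stuck: $stuck"
```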

Comment 13 Dhruv Bindra 2023-01-20 09:51:56 UTC
Please try it on the latest build.

Comment 16 Elena Bondarenko 2023-03-27 10:34:38 UTC
As the issue is not always reproducible, I asked the team members who created clusters lately if anyone faced any issues with uninstallation. I'll gather the responses and update the BZ.

Comment 17 suchita 2023-05-08 09:15:48 UTC
Verified on deployer version v2.0.12 (on the QE addon / Dev addon).
The issue did not reproduce in any of 5 uninstallation attempts, so it is considered fixed in the latest version.
Marking the BZ as verified.