Description of the problem: After upgrade from 2.4.0 to 2.4.1 RC1, all imported clusters were in good status for a while. After one day, many clusters were in 'Degraded' status because obs-controller was not working properly.

Release version: 2.4.1
Operator snapshot version: 2.4.1 RC1
OCP version: 4.9.7
Browser Info: Chrome, Firefox

Steps to reproduce:
1. Launch ACM
2. Go to clusters
3. Check cluster status

Actual results: Clusters are in 'Degraded' status and the corresponding obs-controller is also in 'Degraded' status

Expected results:

Additional info:
Checked the environment: the metrics-collector in some clusters cannot push metrics to the hub server because the client CA cert used to sign the client certs cannot be verified on the server side. On the server side, the client CA cert secret was removed recently. Ideally, the newly generated client CA cert secret should also include the old CA cert, so that the server side can verify requests using client certs signed by either the old or the new CA cert. But in the MCO operator there is an error message saying it failed to update that secret to include the old cert. The error message is as below:

```
2021-12-01T13:37:46.631Z ERROR controller_certificates Failed to update secret for ca certificate {"name": "observability-client-ca-certs", "error": "Operation cannot be fulfilled on secrets \"observability-client-ca-certs\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/secrets/open-cluster-management-observability/observability-client-ca-certs, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4b11225a-6288-43e9-b1c9-e6a30bd42db9, UID in object meta: "}
github.com/open-cluster-management/multicluster-observability-operator/operators/multiclusterobservability/pkg/certificates.onDelete.func1
    /remote-source/app/operators/multiclusterobservability/pkg/certificates/cert_controller.go:184
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.newInformer.func1
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:413
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/delta_fifo.go:544
k8s.io/client-go/tools/cache.(*controller).processLoop
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:183
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:154
```

This appears to be a problem introduced by Kubernetes itself (https://github.com/kubernetes/kubernetes/issues/82130). In the MCO operator, we can add retry logic to work around it (see the sketch below).

For users who run into this problem, the workaround is to delete the secret observability-controller-open-cluster-management.io-observability-signer-client-cert in the open-cluster-management-addon-observability namespace on the managed cluster. The client cert will then be re-generated and signed by the new CA cert, and metrics can be pushed successfully with the new cert.
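To illustrate the retry idea mentioned above, here is a minimal sketch (not the actual operator code in cert_controller.go) of how the CA secret update could be retried with client-go's retry helper. The function name updateCASecretWithRetry and the mutate callback are hypothetical; the point is to re-read the secret from the API server on every attempt instead of updating the copy held by the informer cache, whose stale UID/ResourceVersion is what the "Precondition failed: UID in precondition" error above complains about.

```go
package certificates

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateCASecretWithRetry retries the update on "Operation cannot be fulfilled"
// (conflict) errors. Fetching a fresh copy inside the retry closure ensures the
// UID and ResourceVersion sent to the API server match the current object,
// even if the secret was deleted and re-created underneath us.
func updateCASecretWithRetry(ctx context.Context, client kubernetes.Interface,
	namespace, name string, mutate func(*corev1.Secret)) error {

	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always read the live object rather than the cached one.
		current, err := client.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Apply the desired change, e.g. appending the old CA cert so that
		// client certs signed by either CA remain verifiable.
		mutate(current)
		_, err = client.CoreV1().Secrets(namespace).Update(ctx, current, metav1.UpdateOptions{})
		return err
	})
}
```

This is only a sketch of the general pattern; the actual fix in the MCO operator may differ in detail, but the key design choice is the same: never reuse the object from the informer's delete event for the update, since its metadata no longer matches what the API server stores.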
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Advanced Cluster Management 2.5 security updates, images, and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:4956