Bug 2028647 - Clusters are in 'Degraded' status with upgrade env due to obs-controller not working properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Core Services / Observability
Version: rhacm-2.4.z
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: rhacm-2.5
Assignee: Chunlin Yang
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-02 21:18 UTC by Xiang Yin
Modified: 2024-09-24 05:00 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-09 02:07:01 UTC
Target Upstream Version:
Embargoed:
bot-tracker-sync: rhacm-2.5+




Links
System ID Private Priority Status Summary Last Updated
Github open-cluster-management backlog issues 18289 0 None None None 2021-12-05 21:25:15 UTC
Red Hat Product Errata RHSA-2022:4956 0 None None None 2022-06-09 02:07:34 UTC

Description Xiang Yin 2021-12-02 21:18:00 UTC
Description of the problem:

After upgrading from 2.4.0 to 2.4.1 RC1, all imported clusters were in good status for a while. After one day, many clusters went into 'Degraded' status because obs-controller was not working properly.

Release version: 2.4.1

Operator snapshot version: 2.4.1 RC1

OCP version: 4.9.7

Browser Info: Chrome, Firefox

Steps to reproduce:
1. Launch ACM
2. Go to clusters
3. Check cluster status

Actual results:

Clusters are in 'Degraded' status, and the corresponding obs-controller is also in 'Degraded' status.

Expected results:

Additional info:

Comment 1 llan 2021-12-02 21:32:59 UTC
Checked the environment: the metrics-collector on some managed clusters cannot push metrics to the hub server, because the client CA cert used to sign the client certs cannot be verified on the server side.
On the server side, the client CA cert secret was removed recently. Ideally, the newly generated client CA cert secret should include the old CA cert, so that the server side can verify requests made with client certs signed by either the old or the new CA cert. But in the mco operator there is an error message saying it failed to update that secret to include the old cert.
The error message is as below:
```
2021-12-01T13:37:46.631Z	ERROR	controller_certificates	Failed to update secret for ca certificate	{"name": "observability-client-ca-certs", "error": "Operation cannot be fulfilled on secrets \"observability-client-ca-certs\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/secrets/open-cluster-management-observability/observability-client-ca-certs, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4b11225a-6288-43e9-b1c9-e6a30bd42db9, UID in object meta: "}
github.com/open-cluster-management/multicluster-observability-operator/operators/multiclusterobservability/pkg/certificates.onDelete.func1
	/remote-source/app/operators/multiclusterobservability/pkg/certificates/cert_controller.go:184
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
	/remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.newInformer.func1
	/remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:413
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
	/remote-source/app/vendor/k8s.io/client-go/tools/cache/delta_fifo.go:544
k8s.io/client-go/tools/cache.(*controller).processLoop
	/remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:183
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run
	/remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:154
```

This seems to be a problem introduced by Kubernetes code (https://github.com/kubernetes/kubernetes/issues/82130): an update can fail with a UID precondition error when the object was deleted and re-created underneath the client's cached copy.
In the mco operator, we can add retry logic to work around this problem.

Users who run into this problem can delete the secret observability-controller-open-cluster-management.io-observability-signer-client-cert in the open-cluster-management-addon-observability namespace on the managed cluster. The client cert will then be re-generated and signed by the new CA cert, and metrics can be pushed successfully with the new cert.
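The workaround above amounts to a single command run against the managed cluster (shown here with kubectl; adjust the kubeconfig/context for your environment). Deleting the secret is safe in this case because the addon re-creates it with a cert signed by the new CA.

```shell
# Run against the MANAGED cluster (not the hub). The addon controller will
# re-generate the secret with a client cert signed by the new CA.
kubectl -n open-cluster-management-addon-observability delete secret \
  observability-controller-open-cluster-management.io-observability-signer-client-cert
```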

Comment 5 errata-xmlrpc 2022-06-09 02:07:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Advanced Cluster Management 2.5 security updates, images, and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4956

