Description of the problem: After upgrade from 2.4.0 to 2.4.1 RC1, all imported clusters were in good status for a while. After one day, many clusters were in 'Degraded' status because obs-controller was not working properly.

Release version: 2.4.1
Operator snapshot version: 2.4.1 RC1
OCP version: 4.9.7
Browser Info: Chrome, Firefox

Steps to reproduce:
1. Launch ACM
2. Go to clusters
3. Check cluster status

Actual results: Clusters are in 'Degraded' status and the corresponding obs-controller is also in 'Degraded' status

Expected results:

Additional info:
Checked the environment: the metrics-collector in some clusters cannot push metrics to the hub server because the client CA cert used to sign the client certs cannot be verified on the server side. On the server side, the client CA cert secret was removed recently. Ideally, the newly generated client CA cert secret should also include the old CA cert, so that the server side can verify requests using client certs signed by either the old or the new CA cert. But in the MCO operator there is an error message saying it failed to update that secret to include the old cert. The error message is as below:

```
2021-12-01T13:37:46.631Z ERROR controller_certificates Failed to update secret for ca certificate {"name": "observability-client-ca-certs", "error": "Operation cannot be fulfilled on secrets \"observability-client-ca-certs\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/secrets/open-cluster-management-observability/observability-client-ca-certs, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4b11225a-6288-43e9-b1c9-e6a30bd42db9, UID in object meta: "}
github.com/open-cluster-management/multicluster-observability-operator/operators/multiclusterobservability/pkg/certificates.onDelete.func1
    /remote-source/app/operators/multiclusterobservability/pkg/certificates/cert_controller.go:184
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.newInformer.func1
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:413
k8s.io/client-go/tools/cache.(*DeltaFIFO).Pop
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/delta_fifo.go:544
k8s.io/client-go/tools/cache.(*controller).processLoop
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:183
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
    /remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*controller).Run
    /remote-source/app/vendor/k8s.io/client-go/tools/cache/controller.go:154
```

This appears to be a problem introduced by Kubernetes itself (https://github.com/kubernetes/kubernetes/issues/82130). In the MCO operator, we can add retry logic to work around it (see the sketch below).

For users who run into this problem, the workaround is to delete the secret observability-controller-open-cluster-management.io-observability-signer-client-cert in the open-cluster-management-addon-observability namespace on the managed cluster. The client cert will then be re-generated and signed by the new CA cert, and metrics can be pushed successfully with the new cert.
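To illustrate the retry idea mentioned above, here is a minimal sketch (not the actual operator code in cert_controller.go) of how the CA secret update could be retried with client-go's retry helper. The function name updateCASecretWithRetry and the mutate callback are hypothetical; the point is to re-read the secret from the API server on every attempt instead of updating the copy held by the informer cache, whose stale UID/ResourceVersion is what the "Precondition failed: UID in precondition" error above complains about.

```go
package certificates

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateCASecretWithRetry retries the update on "Operation cannot be fulfilled"
// (conflict) errors. Fetching a fresh copy inside the retry closure ensures the
// UID and ResourceVersion sent to the API server match the current object,
// even if the secret was deleted and re-created underneath us.
func updateCASecretWithRetry(ctx context.Context, client kubernetes.Interface,
	namespace, name string, mutate func(*corev1.Secret)) error {

	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always read the live object rather than the cached one.
		current, err := client.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Apply the desired change, e.g. appending the old CA cert so that
		// client certs signed by either CA remain verifiable.
		mutate(current)
		_, err = client.CoreV1().Secrets(namespace).Update(ctx, current, metav1.UpdateOptions{})
		return err
	})
}
```

This is only a sketch of the general pattern; the actual fix in the MCO operator may differ in detail, but the key design choice is the same: never reuse the object from the informer's delete event for the update, since its metadata no longer matches what the API server stores.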
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Advanced Cluster Management 2.5 security updates, images, and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:4956