Bug 2042441
| Summary: | degraded/stuck Monitoring deployment after cluster upgrade 4.8->4.9 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kai-Uwe Rommel <kai-uwe.rommel> |
| Component: | Monitoring | Assignee: | Philip Gough <pgough> |
| Status: | CLOSED DUPLICATE | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.9 | CC: | amuller, anpicker, aos-bugs, erooth |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-01-26 16:59:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Kai-Uwe Rommel
2022-01-19 14:19:59 UTC
Perhaps during upgrade it makes sense that the operator deletes and recreates the statefulsets right away, without attempting to update them? It does not seem that it caused problems in my specific case.

Thanks for the report! It will be hard for us to understand precisely what happened. Do you have any logs from the cluster monitoring operator and/or prometheus-operator pods? I suspect that the operator entered a hot loop deleting the statefulsets, similar to what's described in bug 2030539.

Unfortunately, I don't have the logs any more. When the problem occurred I was under some time pressure to get the cluster up and running again, so after I saw the repeated messages in the operator's log about it trying to update fields on the statefulset that cannot be updated, I quickly decided to delete the statefulsets so that the operator could recreate them. When I afterwards decided I should report this (actually, Vadim asked me to do so), the operator had also updated itself and restarted, so the pod log of the previous instance was already gone. But unlike in the other bug you mention, it was not looping deleting and recreating the statefulsets; it was in a loop trying (and failing) to update them.

Ah, of course, since I have a logging deployment on the cluster, I can simply search for this through Kibana!
The messages were all repeating this:

level=info ts=2022-01-18T08:31:24.51133541Z caller=operator.go:804 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"

level=info ts=2022-01-18T08:26:19.421860032Z caller=operator.go:1306 component=prometheusoperator key=openshift-monitoring/k8s statefulset=prometheus-k8s shard=0 msg="recreating StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"

However, the operator apparently did NOT in fact recreate the statefulset but kept retrying the update indefinitely. Only when I manually deleted the statefulsets did it recreate them.

Thanks for reporting, I believe this is in fact a duplicate of bug 2030539. Your logs map to the following:

1. There is an Update request. The 'recreating' log occurs just before a request to delete the StatefulSet, which we must do since the update request cannot proceed. [1]
2. The Delete request is made, specifying foreground deletion. [2]
3. The sync occurs again and the operator loops through steps 1-3.

How did you delete the statefulset? I will assume via kubectl (but please clarify) without passing any `--cascade=` flag, thereby defaulting to background deletion [3], which deleted the statefulset and removed it from the API straight away, which in turn broke the loop and allowed the next sync to succeed.

[1] https://github.com/prometheus-operator/prometheus-operator/blob/main/pkg/alertmanager/operator.go#L815
[2] https://kubernetes.io/docs/concepts/architecture/garbage-collection/#foreground-deletion
[3] https://kubernetes.io/docs/concepts/architecture/garbage-collection/#background-deletion

I just did manual "oc delete statefulset <name>", nothing more.

*** This bug has been marked as a duplicate of bug 2030539 ***
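
For readers trying to picture the loop discussed in this thread, here is a rough toy model in Python. It is not prometheus-operator code and does not talk to a real cluster. `FakeAPI`, `sync`, and `run` are hypothetical names invented for illustration. The sketch assumes that a foreground-deleted object stays visible in the API for the whole simulated window, while a background delete removes it immediately.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StatefulSet:
    name: str
    immutable_spec: str        # stands in for fields that cannot be updated in place
    terminating: bool = False  # set once a foreground delete has been requested


@dataclass
class FakeAPI:
    """Toy stand-in for the API server (hypothetical, not a real client)."""
    objects: Dict[str, StatefulSet] = field(default_factory=dict)

    def delete(self, name: str, propagation: str) -> None:
        if propagation == "Foreground":
            # Assumption: dependents never finish terminating within the
            # simulated window, so the object remains visible.
            self.objects[name].terminating = True
        else:
            # "Background": the object is removed from the API right away.
            self.objects.pop(name, None)


def sync(api: FakeAPI, name: str, desired_spec: str, log: List[str]) -> None:
    """One hypothetical reconcile pass."""
    current = api.objects.get(name)
    if current is None:
        api.objects[name] = StatefulSet(name, desired_spec)
        log.append("created")
        return
    if current.immutable_spec == desired_spec:
        log.append("in sync")
        return
    # The in-place update is forbidden, so the pass falls back to a delete.
    log.append("recreating StatefulSet because the update operation wasn't possible")
    api.delete(name, propagation="Foreground")


def run(manual_background_delete: bool, passes: int = 5) -> List[str]:
    api = FakeAPI({"prometheus-k8s": StatefulSet("prometheus-k8s", "old")})
    log: List[str] = []
    for i in range(passes):
        if manual_background_delete and i == 2:
            # Models a manual delete using the default background propagation.
            api.delete("prometheus-k8s", propagation="Background")
            log.append("manual delete")
        sync(api, "prometheus-k8s", "new", log)
    return log
```

Calling `run(manual_background_delete=False)` yields only repeated "recreating" entries, which mirrors the repeated log messages quoted above. With `run(manual_background_delete=True)` the object disappears before the third pass, so that pass creates it fresh and later passes report it as in sync.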