Description of problem:
Yesterday I upgraded an OCP cluster from 4.8.23 to 4.9.13. The upgrade went almost fine, but at the end the monitoring deployment was degraded. Checking this, I found that all the pods for Prometheus and Alertmanager were missing. The statefulsets were there but had 0 (zero) instances. The monitoring cluster operator was logging that it could not update the statefulsets because it was attempting to update fields that are not updatable. The solution was to delete the statefulsets; the operator then immediately recreated them, the pods were started, and the error condition was cleared.

Version-Release number of selected component (if applicable):
4.9.13

How reproducible:
Cannot say.

Steps to Reproduce:
1. Upgrade the cluster 4.8 -> 4.9.
2. When the upgrade is finished, the monitoring deployment is degraded.

Actual results:
When the upgrade is finished, the monitoring deployment is degraded.

Expected results:
This should not happen.

Additional info:
Since I was able to solve the problem and get monitoring up and running again, there is no error state any longer, so I cannot capture logs etc.
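For reference, the workaround was essentially the following (a sketch; the statefulset names are assumed here to be alertmanager-main and prometheus-k8s in the openshift-monitoring namespace, i.e. the ones created by the monitoring operator):

oc -n openshift-monitoring delete statefulset alertmanager-main
oc -n openshift-monitoring delete statefulset prometheus-k8s

The operator recreated both statefulsets right afterwards and the pods came back up.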
Perhaps during an upgrade it would make sense for the operator to delete and recreate the statefulsets right away, without attempting to update them? It does not seem that it caused any problems in my specific case.
Thanks for the report! It will be hard for us to understand precisely what happened. Do you have any logs from the cluster monitoring operator and/or prometheus-operator pods? I suspect that the operator entered a hot loop deleting the statefulsets, similar to what's described in bug 2030539.
Unfortunately, I don't have the logs any more. When the problem occurred I was under some time pressure to get things up and running again, so after I saw the repeated messages in the operator's log about it trying to update fields on the statefulset that cannot be updated, I quickly decided to delete the statefulsets so that the operator could recreate them. And when I afterwards decided that I had better report this (actually, Vadim asked me to do so), the operator had also updated itself and restarted, so the pod log of the previous instance was already gone. But unlike in the other bug you mention, it was not looping deleting+recreating the statefulsets; it was in a loop trying (and failing) to update them.
Ah, of course, since I have a logging deployment on the cluster, I can simply search for this through Kibana! The messages were all repeating this:

level=info ts=2022-01-18T08:31:24.51133541Z caller=operator.go:804 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"

level=info ts=2022-01-18T08:26:19.421860032Z caller=operator.go:1306 component=prometheusoperator key=openshift-monitoring/k8s statefulset=prometheus-k8s shard=0 msg="recreating StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"

However, the operator did in fact NOT recreate the statefulsets; apparently it was retrying the update forever. Only when I manually deleted the statefulsets did it recreate them.
Thanks for reporting, I believe this is in fact a duplicate of bug 2030539. Your logs map to the following sequence:

1. There is an update request. The 'recreating' log occurs just before a request to delete the StatefulSet, which we must do since the update request cannot proceed. [1]
2. The delete request is made, specifying foreground deletion. [2]
3. The sync occurs again and the operator loops back through steps 1-3.

How did you delete the statefulset? I will assume it was via kubectl (but please clarify) without passing any `--cascade=` flag, thereby defaulting to background deletion [3], which deleted the statefulset and removed it from the API straight away. This in turn broke the loop and allowed the next sync to succeed.

[1] https://github.com/prometheus-operator/prometheus-operator/blob/main/pkg/alertmanager/operator.go#L815
[2] https://kubernetes.io/docs/concepts/architecture/garbage-collection/#foreground-deletion
[3] https://kubernetes.io/docs/concepts/architecture/garbage-collection/#background-deletion
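To make the difference concrete, a sketch (assuming the prometheus-k8s statefulset in openshift-monitoring; the string values for --cascade are supported by kubectl/oc 1.20+):

# Roughly equivalent to the operator's delete request: with foreground cascading,
# the StatefulSet stays visible in the API (with a deletionTimestamp) until its
# dependents are gone, so the next sync still finds it and the loop continues.
oc -n openshift-monitoring delete statefulset prometheus-k8s --cascade=foreground

# Roughly what a plain "oc delete statefulset" does: background cascading (the
# default), which removes the StatefulSet from the API straight away, so the
# next sync recreates it and the loop is broken.
oc -n openshift-monitoring delete statefulset prometheus-k8s --cascade=background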
I just did a manual "oc delete statefulset <name>", nothing more.
*** This bug has been marked as a duplicate of bug 2030539 ***