Description of problem:
Yesterday I upgraded an OCP cluster from 4.8.23 to 4.9.13. The upgrade went almost fine, but at the end the monitoring deployment was degraded. Checking this, I found that all the pods for Prometheus and Alertmanager were missing. The statefulsets were there but had 0 (zero) instances. The monitoring cluster operator was logging that it could not update the statefulsets because it was attempting to update fields that are not updatable. The solution was to delete the statefulsets; the operator then immediately recreated them, the pods were started, and the error condition was cleared.

Version-Release number of selected component (if applicable):
4.9.13

How reproducible:
Cannot say.

Steps to Reproduce:
1. Upgrade the cluster 4.8 -> 4.9.
2. When the upgrade is finished, the monitoring deployment is degraded.

Actual results:
When the upgrade is finished, the monitoring deployment is degraded.

Expected results:
This should not happen.

Additional info:
Since I was able to solve the problem and get monitoring up and running again, there is no error state any longer, so I cannot capture logs etc.
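For reference, the workaround was essentially the following (a sketch; the statefulset names are assumed here to be alertmanager-main and prometheus-k8s in the openshift-monitoring namespace, i.e. the ones created by the monitoring operator):

oc -n openshift-monitoring delete statefulset alertmanager-main
oc -n openshift-monitoring delete statefulset prometheus-k8s

The operator recreated both statefulsets right afterwards and the pods came back up.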
Perhaps during an upgrade it would make sense for the operator to delete and recreate the statefulsets right away, without attempting to update them? It does not seem that it caused any problems in my specific case.
Thanks for the report! It will be hard for us to understand precisely what happened. Do you have any logs from the cluster monitoring operator and/or prometheus-operator pods? I suspect that the operator entered a hot loop deleting the statefulsets, similar to what's described in bug 2030539.
Unfortunately, I don't have the logs any more. When the problem occurred I was under some time pressure to get things up and running again, so after I saw the repeated messages in the operator's log about it trying to update fields on the statefulset that cannot be updated, I quickly decided to delete the statefulsets so that the operator could recreate them. And when I afterwards decided that I had better report this (actually, Vadim asked me to do so), the operator had also updated itself and restarted, so the pod log of the previous instance was already gone. But unlike in the other bug you mention, it was not looping deleting+recreating the statefulsets; it was in a loop trying (and failing) to update them.
Ah, of course, since I have a logging deployment on the cluster, I can simply search for this through Kibana! The messages were all repeating this:

level=info ts=2022-01-18T08:31:24.51133541Z caller=operator.go:804 component=alertmanageroperator key=openshift-monitoring/main msg="recreating AlertManager StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"

level=info ts=2022-01-18T08:26:19.421860032Z caller=operator.go:1306 component=prometheusoperator key=openshift-monitoring/k8s statefulset=prometheus-k8s shard=0 msg="recreating StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy' and 'minReadySeconds' are forbidden"

However, the operator did in fact NOT recreate the statefulsets; apparently it was retrying the update forever. Only when I manually deleted the statefulsets did it recreate them.
Thanks for reporting, I believe this is in fact a duplicate of bug 2030539. Your logs map to the following sequence:

1. There is an update request. The 'recreating' log occurs just before a request to delete the StatefulSet, which we must do since the update request cannot proceed. [1]
2. The delete request is made, specifying foreground deletion. [2]
3. The sync occurs again and the operator loops back through steps 1-3.

How did you delete the statefulset? I will assume it was via kubectl (but please clarify) without passing any `--cascade=` flag, thereby defaulting to background deletion [3], which deleted the statefulset and removed it from the API straight away. This in turn broke the loop and allowed the next sync to succeed.

[1] https://github.com/prometheus-operator/prometheus-operator/blob/main/pkg/alertmanager/operator.go#L815
[2] https://kubernetes.io/docs/concepts/architecture/garbage-collection/#foreground-deletion
[3] https://kubernetes.io/docs/concepts/architecture/garbage-collection/#background-deletion
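To make the difference concrete, a sketch (assuming the prometheus-k8s statefulset in openshift-monitoring; the string values for --cascade are supported by kubectl/oc 1.20+):

# Roughly equivalent to the operator's delete request: with foreground cascading,
# the StatefulSet stays visible in the API (with a deletionTimestamp) until its
# dependents are gone, so the next sync still finds it and the loop continues.
oc -n openshift-monitoring delete statefulset prometheus-k8s --cascade=foreground

# Roughly what a plain "oc delete statefulset" does: background cascading (the
# default), which removes the StatefulSet from the API straight away, so the
# next sync recreates it and the loop is broken.
oc -n openshift-monitoring delete statefulset prometheus-k8s --cascade=background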
I just did a manual "oc delete statefulset <name>", nothing more.
*** This bug has been marked as a duplicate of bug 2030539 ***