Created attachment 1850495 [details]
pv/pvc/sc info

Description of problem:
The alertmanager replica count changed from 3 to 2 in 4.10.

4.9 has 3 replicas:
https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/assets/alertmanager/alertmanager.yaml#L121

4.10 has 2 replicas:
https://github.com/openshift/cluster-monitoring-operator/blob/release-4.10/assets/alertmanager/alertmanager.yaml#L151

On a 4.9.13 cluster, bind PVs for alertmanager via the config map below, then upgrade to 4.10.0-fc.0:
******************************
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      volumeClaimTemplate:
        metadata:
          name: alertmanager
        spec:
          volumeMode: Filesystem
          resources:
            requests:
              storage: 4Gi
******************************

# oc get sc
NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
standard (default)   kubernetes.io/gce-pd    Delete          WaitForFirstConsumer   true                   23h
standard-csi         pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   23h

After the upgrade, PVC alertmanager-alertmanager-main-2 is still Bound, but no pod uses it:
# oc -n openshift-monitoring get pvc | grep alertmanager
alertmanager-alertmanager-main-0   Bound   pvc-8a3c4a5c-29b6-4a36-809d-a4850d78e2a7   4Gi   RWO   standard   20h
alertmanager-alertmanager-main-1   Bound   pvc-4da37a8e-f0e4-4f74-b838-c99adc031752   4Gi   RWO   standard   20h
alertmanager-alertmanager-main-2   Bound   pvc-215f87e7-0913-4163-8840-5b14784f38f3   4Gi   RWO   standard   20h

# oc -n openshift-monitoring get pod | grep alertmanager-main
alertmanager-main-0   6/6   Running   0   18h
alertmanager-main-1   6/6   Running   0   18h

# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep persistentVolumeClaim -A1
    persistentVolumeClaim:
      claimName: alertmanager-alertmanager-main-0
# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep persistentVolumeClaim -A1
    persistentVolumeClaim:
      claimName: alertmanager-alertmanager-main-1

Version-Release number of selected component (if applicable):
4.9.13 cluster with PVs bound for alertmanager via the config map above, upgraded to 4.10.0-fc.0

How reproducible:
always

Steps to Reproduce:
1. On a 4.9.13 cluster, bind PVs for alertmanager via the config map above and upgrade to 4.10.0-fc.0
2.
3.

Actual results:
PVC alertmanager-alertmanager-main-2 is still Bound, but no pod uses it.

Expected results:
PVC alertmanager-alertmanager-main-2 should be recycled; leaving it Bound is confusing because it suggests a pod is still using the PVC.

Master Log:

Node Log (of failed PODs):

PV Dump:
see the attached file

PVC Dump:
see the attached file

StorageClass Dump (if StorageClass used by PV/PVC):
see the attached file

Additional info:
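To confirm that the leftover claim is really orphaned after the upgrade, one quick check (a sketch only, reusing the grep approach and the claim name from this report) is to search all pod specs in the namespace for the claim; it should return nothing:

# oc -n openshift-monitoring get pods -oyaml | grep 'claimName: alertmanager-alertmanager-main-2'

An empty result, combined with the Bound status shown above, is exactly what makes the state confusing.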
Kubernetes does not delete PVCs created by a StatefulSet when the StatefulSet is scaled down, because it cannot know whether the user will scale it back up. There is an upstream KEP to add automatic deletion as an opt-in feature; it's alpha in 1.23 and will take a few releases to reach GA. Moving to the monitoring team to consider whether they want cluster-monitoring-operator to delete the PVC automatically during/after the upgrade, or just document it as a post-upgrade step.
Forgot a link to the upstream KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/1847-autoremove-statefulset-pvcs
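For illustration only, the KEP adds an opt-in retention policy to the StatefulSet spec (alpha in 1.23 behind the StatefulSetAutoDeletePVC feature gate). A minimal sketch of what opting in would look like, assuming the field names from KEP-1847; this is not something cluster-monitoring-operator sets today:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager-main
  namespace: openshift-monitoring
spec:
  replicas: 2
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # keep PVCs if the StatefulSet itself is deleted
    whenScaled: Delete    # delete the PVCs of replicas removed by a scale-down
  ...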
IMO we need at least a note in the OCP documentation. Ideally the cluster monitoring operator should clean this up, but:
1. It might be tricky to work out exactly which volume to delete. The best approach is probably to get the alertmanager-main-2 pod definition (if it exists), find the bound PVC, and delete it before scaling down the statefulset (a rough sketch of that lookup follows below).
2. The operator deleting user data automatically is a bit scary to me.
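A rough sketch of that lookup, reusing the commands from the report above (it assumes the alertmanager-main-2 pod still exists when the operator, or an admin, runs it):

# oc -n openshift-monitoring get pod alertmanager-main-2 -oyaml | grep persistentVolumeClaim -A1

which, in the environment from this report, would show something like:

    persistentVolumeClaim:
      claimName: alertmanager-alertmanager-main-2

That claim name is the one to delete once the statefulset has been scaled down to 2 replicas.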
The 4.10 release notes mention the "issue" and explain how it should be fixed: https://docs.openshift.com/container-platform/4.10/release_notes/ocp-4-10-release-notes.html#ocp-4-10-monitoring-added-hard-anti-affinity-rules-and-pod-distruption-budgets

Junqi, I'm not sure how we want to proceed with this bug. Would you move it to VERIFIED directly?
Updated the doc and added a note for this issue.
Added the text in the Doc Text field above to the "Known Issues" section of the OCP 4.10 Release Notes: https://docs.openshift.com/container-platform/4.10/release_notes/ocp-4-10-release-notes.html#ocp-4-10-known-issues
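For reference, in an environment like the one in this report the manual post-upgrade cleanup comes down to deleting the orphaned claim (the claim name depends on the volumeClaimTemplate name configured; the one below matches this report):

# oc -n openshift-monitoring delete pvc alertmanager-alertmanager-main-2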