2021097 – CMO should report `Upgradeable: false` when HA workload is incorrectly spread

Summary: CMO should report `Upgradeable: false` when HA workload is incorrectly spread

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.9.z
Assignee:	Jan Fajerski
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	1995924
Blocks:

Reported:	2021-11-08 11:32 UTC by Simon Pasquier
Modified:	2023-03-13 15:38 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1995924
Environment:
Last Closed:	2021-11-22 21:47:05 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1472	None	Merged	Bug 2021097: Set Upgradeable: false when HA workloads are incorrectly spread	2021-12-09 00:25:36 UTC
Red Hat Knowledge Base (Solution)	6959436	None	None	None	2022-05-20 14:09:16 UTC
Red Hat Product Errata	RHBA-2021:4712	None	None	None	2021-11-22 21:47:18 UTC

Comment 1 Junqi Zhao 2021-11-18 03:45:34 UTC

Tested with the PR: bound PVs for Prometheus and scheduled both Prometheus pods onto the same node. Upgradeable is now False.
# oc -n openshift-monitoring get pod -o wide |grep prometheus-k8s
prometheus-k8s-0                               7/7     Running   0             116s   10.128.2.25    ip-10-0-246-211.us-west-1.compute.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   0             116s   10.128.2.26    ip-10-0-246-211.us-west-1.compute.internal   <none>           <none>
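
A minimal sketch for confirming the same-node placement from a listing like the one above. The function name `distinct_prometheus_nodes` is hypothetical, and the sketch assumes the NODE column is the seventh whitespace-separated field, which holds only when RESTARTS is a single token as it is here.

```shell
# Count the distinct nodes hosting prometheus-k8s-N pods.
# Reads `oc get pod -o wide` output on stdin.
# Assumption: NODE is field 7 (RESTARTS is a single token such as "0").
distinct_prometheus_nodes() {
  awk '
    $1 ~ /^prometheus-k8s-[0-9]+$/ { nodes[$7] = 1 }
    END {
      n = 0
      for (k in nodes) n++
      print n
    }
  '
}
```

Piping `oc -n openshift-monitoring get pod -o wide` into the function prints 1 when every replica sits on the same node, which is the condition this bug is about.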

# oc -n openshift-monitoring get pvc
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-k8s-db-prometheus-k8s-0   Bound    pvc-64ba0456-9b74-4537-97df-93459b2d3bcf   10Gi       RWO            gp2            2m17s
prometheus-k8s-db-prometheus-k8s-1   Bound    pvc-b2fa154d-2fd3-4df4-9172-14607b9b9623   10Gi       RWO            gp2            2m17s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts'|jq
...
    "alerts": [
      {
        "labels": {
          "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
          "namespace": "openshift-monitoring",
          "severity": "warning",
          "workload": "prometheus-k8s"
        },
        "annotations": {
          "description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
          "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
        },
        "state": "pending",
        "activeAt": "2021-11-18T03:40:35.421613011Z",
        "value": "1e+00"
      },

# oc adm upgrade
Cluster version is 4.9.0-0.ci.test-2021-11-18-023919-ci-ln-9b5cn5t-latest

Upgradeable=False

  Reason: WorkloadSinglePointOfFailure
  Message: Cluster operator monitoring should not be upgraded between minor versions: Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-11-18T03:40:10Z"
    message: |-
      Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
      Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].
    reason: WorkloadSinglePointOfFailure
    status: "False"
    type: Upgradeable
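
For scripted checks, the same condition can be read without scanning the full YAML. The following is a sketch only: the function name and the `OC` override variable are hypothetical and exist so the command can be dry-run; the JSONPath filter is the standard `oc`/`kubectl` syntax.

```shell
# Print the status ("True"/"False") of the Upgradeable condition
# on the monitoring cluster operator.
# OC is a hypothetical override hook; it defaults to the real `oc` binary.
monitoring_upgradeable_status() {
  ${OC:-oc} get co monitoring \
    -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}'
}
```

On a cluster in the state shown above, calling `monitoring_upgradeable_status` would be expected to print `False`.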

Comment 6 errata-xmlrpc 2021-11-22 21:47:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4712

Comment 7 hongyan li 2021-11-23 10:41:08 UTC

I suppose this bug needs a documentation update: annotating the PVC with map["openshift.io/cluster-monitoring-drop-pvc":"yes"] alone does not make Upgradeable true, and the PVC is recreated quickly.

Comment 8 hongyan li 2021-11-23 12:43:09 UTC

Correction to comment 7: annotating a PVC with map["openshift.io/cluster-monitoring-drop-pvc":"yes"] does make Upgradeable true. I added the annotation by editing one PVC.
