Bug 2021097 - CMO should report `Upgradeable: false` when HA workload is incorrectly spread
Summary: CMO should report `Upgradeable: false` when HA workload is incorrectly spread
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.9.z
Assignee: Jan Fajerski
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1995924
Blocks:
Reported: 2021-11-08 11:32 UTC by Simon Pasquier
Modified: 2023-03-13 15:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1995924
Environment:
Last Closed: 2021-11-22 21:47:05 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1472 0 None Merged Bug 2021097: Set Upgradeable: false when HA workloads are incorrectly spread 2021-12-09 00:25:36 UTC
Red Hat Knowledge Base (Solution) 6959436 0 None None None 2022-05-20 14:09:16 UTC
Red Hat Product Errata RHBA-2021:4712 0 None None None 2021-11-22 21:47:18 UTC

Comment 1 Junqi Zhao 2021-11-18 03:45:34 UTC
Tested with the PR: bound PVs for Prometheus and scheduled both Prometheus pods onto the same node. Upgradeable is now False.
# oc -n openshift-monitoring get pod -o wide |grep prometheus-k8s
prometheus-k8s-0                               7/7     Running   0             116s   10.128.2.25    ip-10-0-246-211.us-west-1.compute.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   0             116s   10.128.2.26    ip-10-0-246-211.us-west-1.compute.internal   <none>           <none>
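
The co-location visible in the listing above can also be checked mechanically. The following is a minimal sketch, not part of the original test: it assumes the default `oc get pod -o wide` column layout, with the node name in the seventh column and a plain number in the RESTARTS column, and the function name `colocated_nodes` is hypothetical.

```shell
# Hypothetical helper: read `oc get pod -o wide` output on stdin and print
# every node that hosts more than one prometheus-k8s pod, with the pod count.
# Assumption: NODE is the 7th whitespace-separated column.
colocated_nodes() {
  awk '$1 ~ /^prometheus-k8s-[0-9]+$/ { count[$7]++ }
       END { for (node in count) if (count[node] > 1) print node, count[node] }'
}
```

Usage would be `oc -n openshift-monitoring get pod -o wide | colocated_nodes`; empty output means no two prometheus-k8s pods share a node.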

# oc -n openshift-monitoring get pvc
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-k8s-db-prometheus-k8s-0   Bound    pvc-64ba0456-9b74-4537-97df-93459b2d3bcf   10Gi       RWO            gp2            2m17s
prometheus-k8s-db-prometheus-k8s-1   Bound    pvc-b2fa154d-2fd3-4df4-9172-14607b9b9623   10Gi       RWO            gp2            2m17s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts'|jq
...
    "alerts": [
      {
        "labels": {
          "alertname": "HighlyAvailableWorkloadIncorrectlySpread",
          "namespace": "openshift-monitoring",
          "severity": "warning",
          "workload": "prometheus-k8s"
        },
        "annotations": {
          "description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
          "summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
        },
        "state": "pending",
        "activeAt": "2021-11-18T03:40:35.421613011Z",
        "value": "1e+00"
      },

# oc adm upgrade
Cluster version is 4.9.0-0.ci.test-2021-11-18-023919-ci-ln-9b5cn5t-latest

Upgradeable=False

  Reason: WorkloadSinglePointOfFailure
  Message: Cluster operator monitoring should not be upgraded between minor versions: Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].

warning: Cannot display available updates:
  Reason: NoChannel
  Message: The update channel has not been configured.

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2021-11-18T03:40:10Z"
    message: |-
      Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
      Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].
    reason: WorkloadSinglePointOfFailure
    status: "False"
    type: Upgradeable
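
To read only the Upgradeable status instead of scanning the full YAML, a JSONPath filter can be used. This is a sketch under assumptions: the function name `monitoring_upgradeable` and the `OC` override variable are hypothetical and do not come from this bug report.

```shell
# Hypothetical helper: print the status ("True" or "False") of the
# Upgradeable condition on the monitoring cluster operator.
# OC is an assumed override for the client binary; it defaults to `oc`.
OC="${OC:-oc}"
monitoring_upgradeable() {
  "$OC" get clusteroperator monitoring \
    -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}'
}
```

In the state shown above, calling `monitoring_upgradeable` would be expected to print `False`.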

Comment 6 errata-xmlrpc 2021-11-22 21:47:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4712

Comment 7 hongyan li 2021-11-23 10:41:08 UTC
I think this bug needs a documentation update: annotating the PVC with map["openshift.io/cluster-monitoring-drop-pvc":"yes"] alone does not make Upgradeable true, and the PVC will be recreated quickly.

Comment 8 hongyan li 2021-11-23 12:43:09 UTC
Correction to comment 7: annotating the PVC with map["openshift.io/cluster-monitoring-drop-pvc":"yes"] does make Upgradeable true. I added the annotation by editing one PVC.

