Bug 1995924
| Summary: | CMO should report `Upgradeable: false` when HA workload is incorrectly spread | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Damien Grisonnet <dgrisonn> |
| Component: | Monitoring | Assignee: | Damien Grisonnet <dgrisonn> |
| Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.9 | CC: | amuller, anpicker, aos-bugs, erooth, hongyli, juzhao, spasquie, ychoukse |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2021097 (view as bug list) | Environment: | |
| Last Closed: | 2022-03-10 16:05:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1933847, 1949262, 2021097 | | |
Description
Damien Grisonnet
2021-08-20 07:40:18 UTC
Updating this bugzilla to urgent priority/severity to be in line with its dependent bug: https://bugzilla.redhat.com/show_bug.cgi?id=1933847

Tested on a 4.10.0-0.nightly-2021-10-12-002740 cluster: cordoned all workers except one, bound PVs for Prometheus, and scheduled both prometheus-k8s pods onto that single node. Upgradeable is now False.
# oc get node | grep worker
ip-10-0-132-142.us-east-2.compute.internal Ready,SchedulingDisabled worker 29m v1.22.1+9312243
ip-10-0-174-65.us-east-2.compute.internal Ready,SchedulingDisabled worker 33m v1.22.1+9312243
ip-10-0-255-240.us-east-2.compute.internal Ready worker 33m v1.22.1+9312243
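For reference, a minimal sketch of how this setup can be produced (the exact ConfigMap used in this run is an assumption; the storage class and size below match the PVCs listed next): cordon all but one worker, then enable persistent storage for prometheus-k8s via the cluster-monitoring-config ConfigMap.
# oc adm cordon ip-10-0-132-142.us-east-2.compute.internal ip-10-0-174-65.us-east-2.compute.internal
# cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          resources:
            requests:
              storage: 10Gi
EOF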
# oc -n openshift-monitoring get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
prometheus-prometheus-k8s-0 Bound pvc-38fd8f1c-5698-4ef3-8fd0-a9c2232c2a7f 10Gi RWO gp2 8m12s
prometheus-prometheus-k8s-1 Bound pvc-44e78437-3f4f-40fa-88cf-9397604cf1a1 10Gi RWO gp2 8m12s
# oc -n openshift-monitoring get pod -o wide | grep prometheus-k8s
prometheus-k8s-0 7/7 Running 0 7m41s 10.128.2.17 ip-10-0-255-240.us-east-2.compute.internal <none> <none>
prometheus-k8s-1 7/7 Running 0 7m41s 10.128.2.18 ip-10-0-255-240.us-east-2.compute.internal <none> <none>
# oc get co monitoring
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
monitoring 4.10.0-0.nightly-2021-10-12-002740 True False False 26m
# oc get co monitoring -oyaml
...
status:
conditions:
- lastTransitionTime: "2021-10-12T07:49:48Z"
status: "False"
type: Progressing
- lastTransitionTime: "2021-10-12T08:08:23Z"
status: "False"
type: Degraded
- lastTransitionTime: "2021-10-12T08:08:23Z"
message: |-
Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].
reason: WorkloadSinglePointOfFailure
status: "False"
type: Upgradeable
- lastTransitionTime: "2021-10-12T07:49:48Z"
message: Successfully rolled out the stack.
reason: RollOutDone
status: "True"
type: Available
# oc adm upgrade
Cluster version is 4.10.0-0.nightly-2021-10-12-002740
Upgradeable=False
Reason: ClusterOperatorsNotUpgradeable
Message: Multiple cluster operators should not be upgraded between minor versions:
* Cluster operator monitoring should not be upgraded between minor versions: WorkloadSinglePointOfFailure: Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].
* Cluster operator machine-config should not be upgraded between minor versions: PoolUpdating: One or more machine config pools are updating, please see `oc get mcp` for further details
warning: Cannot display available updates:
Reason: NoChannel
Message: The update channel has not been configured.
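The condition message points to the openshift.io/cluster-monitoring-drop-pvc annotation as the manual remediation. A hedged sketch of applying it to one of the PVCs listed above (whether a single annotation per workload is sufficient is revisited later in this bug):
# oc -n openshift-monitoring annotate pvc prometheus-prometheus-k8s-0 openshift.io/cluster-monitoring-drop-pvc=yes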
Reverting the changes since we have noticed a lot of CI failures [1] caused by this addition.

[1] https://search.ci.openshift.org/?search=monitoring.*Upgradeable%3DFalse&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Setting the bug back to ASSIGNED, as the changed code has been reverted.

Tested with 4.10.0-0.nightly-2021-10-13-001151: bound PVs for the alertmanager/prometheus pods and scheduled them onto the same node; Upgradeable stays True since the code has been reverted. If we want to track the revert itself, please file a new bug.
# oc -n openshift-monitoring get pod -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0 5/5 Running 0 19m 10.128.2.10 ip-10-0-174-1.ec2.internal <none> <none>
alertmanager-main-1 5/5 Running 0 19m 10.128.2.14 ip-10-0-174-1.ec2.internal <none> <none>
alertmanager-main-2 5/5 Running 0 19m 10.128.2.11 ip-10-0-174-1.ec2.internal <none> <none>
prometheus-k8s-0 7/7 Running 0 19m 10.128.2.12 ip-10-0-174-1.ec2.internal <none> <none>
prometheus-k8s-1 7/7 Running 0 19m 10.128.2.13 ip-10-0-174-1.ec2.internal <none> <none>
# oc -n openshift-monitoring get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
alertmanager-alertmanager-main-0 Bound pvc-15e0bbd5-4331-4991-8038-d0a42bf02fa7 4Gi RWO gp2 20m
alertmanager-alertmanager-main-1 Bound pvc-a76783eb-1ac6-4b22-ab4c-d1beb1d411d3 4Gi RWO gp2 20m
alertmanager-alertmanager-main-2 Bound pvc-7321a16a-0fd5-4346-9d92-dea3e0aef6d3 4Gi RWO gp2 20m
prometheus-prometheus-k8s-0 Bound pvc-c4a23f2d-29c5-4050-9dd3-778ce12a30e4 10Gi RWO gp2 21m
prometheus-prometheus-k8s-1 Bound pvc-f21e76af-42fe-48a6-8580-8a88c2eeeee0 10Gi RWO gp2 21m
ALERTS{alertname="HighlyAvailableWorkloadIncorrectlySpread"}
ALERTS{alertname="HighlyAvailableWorkloadIncorrectlySpread", alertstate="pending", namespace="openshift-monitoring", severity="warning", workload="alertmanager-main"} 1
ALERTS{alertname="HighlyAvailableWorkloadIncorrectlySpread", alertstate="pending", namespace="openshift-monitoring", severity="warning", workload="prometheus-k8s"} 1
# oc get co monitoring -oyaml
...
status:
conditions:
- lastTransitionTime: "2021-10-13T04:03:17Z"
status: "False"
type: Degraded
- lastTransitionTime: "2021-10-13T03:24:27Z"
reason: AsExpected
status: "True"
type: Upgradeable
- lastTransitionTime: "2021-10-13T03:36:48Z"
message: Successfully rolled out the stack.
reason: RollOutDone
status: "True"
type: Available
- lastTransitionTime: "2021-10-13T03:36:48Z"
status: "False"
type: Progressing
As the case title is `CMO should report Upgradeable: false when HA workload is incorrectly spread`, moving the case back to ASSIGNED is expected.

Tested with PR openshift/cluster-monitoring-operator pull 1431 on a 4.9.0-0.ci.test-2021-10-18-074221-ci-ln-2w3bhdk-latest cluster:
hongyli@hongyli-mac Downloads % oc -n openshift-monitoring get pod -owide |grep -E 'alertmanager-main|prometheus-k8s'
alertmanager-main-0 5/5 Running 0 26m 10.131.0.36 ci-ln-2w3bhdk-f76d1-vsbkj-worker-a-6c8xq <none> <none>
alertmanager-main-1 5/5 Running 0 26m 10.131.0.34 ci-ln-2w3bhdk-f76d1-vsbkj-worker-a-6c8xq <none> <none>
alertmanager-main-2 5/5 Running 0 26m 10.131.0.35 ci-ln-2w3bhdk-f76d1-vsbkj-worker-a-6c8xq <none> <none>
prometheus-k8s-0 6/6 Running 0 26m 10.131.0.32 ci-ln-2w3bhdk-f76d1-vsbkj-worker-a-6c8xq <none> <none>
prometheus-k8s-1 6/6 Running 0 26m 10.131.0.33 ci-ln-2w3bhdk-f76d1-vsbkj-worker-a-6c8xq <none> <none>
hongyli@hongyli-mac Downloads % oc -n openshift-monitoring get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
alertmanager-main-db-alertmanager-main-0 Bound pvc-3726479e-6e6a-4583-8420-73a4457b6bc9 1Gi RWO standard 26m
alertmanager-main-db-alertmanager-main-1 Bound pvc-c5a4d85c-ec52-41b0-8ced-eba55281a769 1Gi RWO standard 26m
alertmanager-main-db-alertmanager-main-2 Bound pvc-d80f9ae3-2509-483f-aa67-542b14bb096f 1Gi RWO standard 26m
prometheus-k8s-db-prometheus-k8s-0 Bound pvc-f1c04c5a-b22e-43a2-8442-9ce670b116bc 2Gi RWO standard 26m
prometheus-k8s-db-prometheus-k8s-1 Bound pvc-4baf3853-6a1e-4332-8da2-6b82e0c86047 2Gi RWO standard 26m
% oc get co monitoring
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
monitoring 4.9.0-0.ci.test-2021-10-18-074221-ci-ln-2w3bhdk-latest True False False 40m
% oc get co monitoring -oyaml
...
status:
conditions:
- lastTransitionTime: "2021-10-18T08:07:46Z"
status: "False"
type: Progressing
- lastTransitionTime: "2021-10-18T08:26:24Z"
status: "False"
type: Degraded
- lastTransitionTime: "2021-10-18T08:27:55Z"
message: |-
Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"alertmanager"] and persistent storage enabled has a single point of failure.
Highly-available workload in namespace openshift-user-workload-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Highly-available workload in namespace openshift-user-workload-monitoring, with label map["app.kubernetes.io/name":"thanos-ruler"] and persistent storage enabled has a single point of failure.
Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].
reason: WorkloadSinglePointOfFailure
status: "False"
type: Upgradeable
- lastTransitionTime: "2021-10-18T08:07:46Z"
message: Successfully rolled out the stack.
reason: RollOutDone
status: "True"
type: Available
% oc adm upgrade
Cluster version is 4.9.0-0.ci.test-2021-10-18-074221-ci-ln-2w3bhdk-latest
Upgradeable=False
Reason: WorkloadSinglePointOfFailure
Message: Cluster operator monitoring should not be upgraded between minor versions: Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"alertmanager"] and persistent storage enabled has a single point of failure.
Highly-available workload in namespace openshift-user-workload-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Highly-available workload in namespace openshift-user-workload-monitoring, with label map["app.kubernetes.io/name":"thanos-ruler"] and persistent storage enabled has a single point of failure.
Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].
warning: Cannot display available updates:
Reason: NoChannel
Message: The update channel has not been configured.
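The alerts API query below assumes $token holds a bearer token authorized against the Prometheus API; one way to obtain it on a 4.9/4.10-era cluster (a sketch, not necessarily how it was fetched here) is via the prometheus-k8s service account:
# token=$(oc -n openshift-monitoring sa get-token prometheus-k8s)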
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/alerts'|jq
...
{
"labels": {
"alertname": "HighlyAvailableWorkloadIncorrectlySpread",
"namespace": "openshift-monitoring",
"severity": "warning",
"workload": "alertmanager-main"
},
"annotations": {
"description": "Workload openshift-monitoring/alertmanager-main is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
"runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
"summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
},
"state": "pending",
"activeAt": "2021-10-18T08:27:35.421613011Z",
"value": "1e+00"
},
{
"labels": {
"alertname": "HighlyAvailableWorkloadIncorrectlySpread",
"namespace": "openshift-monitoring",
"severity": "warning",
"workload": "prometheus-k8s"
},
"annotations": {
"description": "Workload openshift-monitoring/prometheus-k8s is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
"runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
"summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
},
"state": "pending",
"activeAt": "2021-10-18T08:27:35.421613011Z",
"value": "1e+00"
},
{
"labels": {
"alertname": "HighlyAvailableWorkloadIncorrectlySpread",
"namespace": "openshift-user-workload-monitoring",
"severity": "warning",
"workload": "prometheus-user-workload"
},
"annotations": {
"description": "Workload openshift-user-workload-monitoring/prometheus-user-workload is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
"runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
"summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
},
"state": "pending",
"activeAt": "2021-10-18T08:28:05.421613011Z",
"value": "1e+00"
},
{
"labels": {
"alertname": "HighlyAvailableWorkloadIncorrectlySpread",
"namespace": "openshift-user-workload-monitoring",
"severity": "warning",
"workload": "thanos-ruler-user-workload"
},
"annotations": {
"description": "Workload openshift-user-workload-monitoring/thanos-ruler-user-workload is incorrectly spread across multiple nodes which breaks high-availability requirements. Since the workload is using persistent volumes, manual intervention is needed. Please follow the guidelines provided in the runbook of this alert to fix this issue.",
"runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/HighlyAvailableWorkloadIncorrectlySpread.md",
"summary": "Highly-available workload is incorrectly spread across multiple nodes and manual intervention is needed."
},
"state": "pending",
"activeAt": "2021-10-18T08:28:05.421613011Z",
"value": "1e+00"
}
]
}
}
Checked with 4.10.0-0.nightly-2021-11-11-170956: bound PVs for Prometheus and scheduled both prometheus-k8s pods onto the same node; Upgradeable is now False.
oc -n openshift-monitoring get pod -o wide |grep prometheus-k8s
prometheus-k8s-0 6/6 Running 0 6m11s 10.129.2.44 ip-10-0-185-227.us-east-2.compute.internal <none> <none>
prometheus-k8s-1 6/6 Running 0 6m11s 10.129.2.45 ip-10-0-185-227.us-east-2.compute.internal <none> <none>
# oc -n openshift-monitoring get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
prometheus-prometheus-k8s-0 Bound pvc-8df5236e-0142-4ff4-972e-98ead2aee5f4 10Gi RWO gp2 7m2s
prometheus-prometheus-k8s-1 Bound pvc-b17950b3-af81-4bcb-9d22-b29c921d89f8 10Gi RWO gp2 7m2s
# oc get co monitoring -oyaml
...
- lastTransitionTime: "2021-11-12T08:52:33Z"
message: |-
Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].
reason: WorkloadSinglePointOfFailure
status: "False"
type: Upgradeable
# oc adm upgrade
Cluster version is 4.10.0-0.nightly-2021-11-11-170956
Upgradeable=False
Reason: WorkloadSinglePointOfFailure
Message: Cluster operator monitoring should not be upgraded between minor versions: Highly-available workload in namespace openshift-monitoring, with label map["app.kubernetes.io/name":"prometheus"] and persistent storage enabled has a single point of failure.
Manual intervention is needed to upgrade to the next minor version. For each highly-available workload that has a single point of failure please mark at least one of their PersistentVolumeClaim for deletion by annotating them with map["openshift.io/cluster-monitoring-drop-pvc":"yes"].
Upstream: https://amd64.ocp.releases.ci.openshift.org/graph
Channel: stable-4.10
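For a quick check of just the relevant condition, the Upgradeable status can be extracted directly with a jsonpath query (a convenience sketch, not part of the original verification steps):
# oc get co monitoring -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}{"\n"}'
False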
I suppose this bug needs a documentation update: only annotating the PVC with map["openshift.io/cluster-monitoring-drop-pvc":"yes"] can't set Upgradeable back to True, and the PVC is recreated quickly.

Correcting comment 18: annotating the PVC with map["openshift.io/cluster-monitoring-drop-pvc":"yes"] can set Upgradeable back to True; I added the annotation by editing one PVC.

See https://bugzilla.redhat.com/show_bug.cgi?id=2008540#c8; HighlyAvailableWorkloadIncorrectlySpread is removed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056