Bug 1967614 - prometheus-k8s pods can't be scheduled due to volume node affinity conflict
Summary: prometheus-k8s pods can't be scheduled due to volume node affinity conflict
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jayapriya Pai
QA Contact: Junqi Zhao
URL:
Whiteboard: UpdateRecommendationsBlocked
Depends On:
Blocks: 1967966
 
Reported: 2021-06-03 13:20 UTC by Simon Pasquier
Modified: 2021-11-16 10:45 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1967966 (view as bug list)
Environment:
Last Closed: 2021-07-27 23:11:25 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1198 0 None Merged Bug 1967614: Revert anti-affinity to soft 2021-12-09 00:25:24 UTC
Github openshift cluster-monitoring-operator pull 1204 0 None Merged Bug 1967614: Remove PDB for prometheus and alertmanager 2021-12-09 00:25:20 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:11:42 UTC

Description Simon Pasquier 2021-06-03 13:20:09 UTC
Bug originating from bug 1956308.

Looking at [1], the initial error reported by CMO is indeed the same: "creating Deployment object failed after update failed: object is being deleted: deployments.apps "prometheus-operator" already exists". But the subsequent reconciliations fail for different reasons [2].

The next failure is because the prometheus operator isn't ready yet and the admission webhook fails (see bug 1949840):

E0602 05:01:20.514522       1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"

Then CMO fails repeatedly because the prometheus-k8s statefulset never converges to the desired state: 

E0602 05:06:24.595880       1 operator.go:400] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas

Now looking at the pods [3], there's no node on which the prometheus-k8s-1 pod can be scheduled:

            "status": {
                "conditions": [
                    {
                        "lastProbeTime": null,
                        "lastTransitionTime": "2021-06-02T05:02:17Z",
                        "message": "0/6 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
                        "reason": "Unschedulable",
                        "status": "False",
                        "type": "PodScheduled"
                    }
                ],


IIUC the mis-scheduling happens because some worker nodes have moved out of the 'us-east-1b' zone while the prometheus PVs are still bound to that zone.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes.json | jq '.items| map(select( .metadata.labels["node-role.kubernetes.io/worker"] == "" )) | map( .metadata.name + ": " + .metadata.labels["topology.ebs.csi.aws.com/zone"] )' 
[
  "ip-10-0-144-164.ec2.internal: us-east-1a",
  "ip-10-0-176-27.ec2.internal: us-east-1a",
  "ip-10-0-207-235.ec2.internal: us-east-1b"
]

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/persistentvolumes.json  | jq '.items | map( .metadata.name + ": " + .spec.claimRef.name + ": " + .metadata.labels["failure-domain.beta.kubernetes.io/zone"])'
[
  "pvc-4b072738-5319-4337-8a3c-d819f28c4bf5: prometheus-data-prometheus-k8s-1: us-east-1b",
  "pvc-64a7fce5-45c3-4551-a955-5c7bb3cd5a89: prometheus-data-prometheus-k8s-0: us-east-1b"
]

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-7556c4b9c6-7vlhj_cluster-monitoring-operator.log
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json
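
For reference, the same zone comparison can be run against a live cluster with oc instead of the CI artifacts; this is a sketch, assuming the same zone labels shown in the outputs above:

  $ oc get -o json nodes | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/worker"] == "") | .labels["topology.kubernetes.io/zone"] + " " + .name' | sort
  $ oc get -o json persistentvolumes | jq -r '.items[] | select(.spec.claimRef.namespace == "openshift-monitoring") | .metadata.name + ": " + .spec.claimRef.name + ": " + (.metadata.labels["failure-domain.beta.kubernetes.io/zone"] // "unknown")'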

Comment 1 W. Trevor King 2021-06-03 20:00:51 UTC
Timeline for [1]:

* 4:13Z, prometheus-k8s-0 on ip-10-0-207-235 [2], which is the only node in us-east-1b [3].
* 4:13Z, prometheus-k8s-1 also on ip-10-0-207-235 [2].
* 5:02Z, prometheus-k8s-0 drained and rescheduled on ip-10-0-207-235 [2].
* 5:02Z, prometheus-k8s-1 drained, but sticks [4]: it can no longer land on ip-10-0-207-235 (hard anti-affinity [5], bug 1949262, recently ported back to 4.7 as bug 1957703 and about to go out with 4.7.14 [6]), and it can't abandon the persistent volume it had been using, which is in us-east-1b and unavailable from the other worker nodes, which are both in us-east-1a [3].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616
[2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.reason == "Scheduled" and .metadata.namespace == "openshift-monitoring" and (.involvedObject.name | contains("prometheus-k8s"))) | .eventTime + " " + .message' | sort
     2021-06-02T04:13:44.787896Z Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-207-235.ec2.internal
     2021-06-02T04:13:44.906128Z Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-207-235.ec2.internal
     2021-06-02T05:02:17.225115Z Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-207-235.ec2.internal
[3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/worker"] == "") | .labels["failure-domain.beta.kubernetes.io/zone"] + " " + .name' | sort
     us-east-1a ip-10-0-144-164.ec2.internal
     us-east-1a ip-10-0-176-27.ec2.internal
     us-east-1b ip-10-0-207-235.ec2.internal
[4]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1399935465968111616/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-1").status'
     {
       "conditions": [
         {
           "lastProbeTime": null,
           "lastTransitionTime": "2021-06-02T05:02:17Z",
           "message": "0/6 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
           "reason": "Unschedulable",
           "status": "False",
           "type": "PodScheduled"
         }
       ],
       "phase": "Pending",
       "qosClass": "Burstable"
     }
[5]: https://github.com/openshift/cluster-monitoring-operator/pull/1135
[6]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.7.14
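
For context, "hard anti-affinity" here means the prometheus-k8s statefulset carries a requiredDuringSchedulingIgnoredDuringExecution term instead of the preferredDuringSchedulingIgnoredDuringExecution stanza shown later in comment 6. Roughly (a sketch, not the exact manifest from PR 1135):

  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    # Hard requirement: the scheduler may never place two prometheus-k8s replicas on the
    # same hostname, so a replica whose PV is pinned to a zone with a single worker node
    # can become permanently unschedulable.
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus
          prometheus: k8s
      namespaces:
      - openshift-monitoring
      topologyKey: kubernetes.io/hostname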

Comment 2 W. Trevor King 2021-06-03 23:21:38 UTC
Ok, as comment 1 pointed out, hard anti-affinity was recently ported back to 4.7 as bug 1957703 and is in 4.7.14.  Testing with a cluster-bot cluster [1]:

  $ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
  4.7.13
  $ oc -n openshift-monitoring get -o wide pods | grep prometheus-k8s
  prometheus-k8s-0                               7/7     Running   1          50m   10.131.0.23    ip-10-0-184-219.us-west-1.compute.internal   <none>           <none>
  prometheus-k8s-1                               7/7     Running   1          50m   10.131.0.25    ip-10-0-184-219.us-west-1.compute.internal   <none>           <none>
  $ oc get -o json nodes | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/worker"] == "") | .labels["failure-domain.beta.kubernetes.io/zone"] + " " + .name' | sort
  us-west-1a ip-10-0-129-15.us-west-1.compute.internal
  us-west-1a ip-10-0-184-219.us-west-1.compute.internal
  us-west-1b ip-10-0-253-132.us-west-1.compute.internal

I want to squeeze them down onto that 1b node:

  $ oc adm cordon ip-10-0-129-15.us-west-1.compute.internal
  $ oc adm cordon ip-10-0-184-219.us-west-1.compute.internal
  $ oc -n openshift-monitoring delete pod prometheus-k8s-0
  $ oc -n openshift-monitoring delete pod prometheus-k8s-1
  $ oc -n openshift-monitoring get -o wide pods | grep prometheus-k8s
  prometheus-k8s-0                               7/7     Running   1          31s   10.129.2.8     ip-10-0-253-132.us-west-1.compute.internal   <none>           <none>
  prometheus-k8s-1                               7/7     Running   1          25s   10.129.2.9     ip-10-0-253-132.us-west-1.compute.internal   <none>           <none>

Give them a PV, following [2]:

  $ cat <<EOF >manifest_cluster-monitoring-pvc.yml
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cluster-monitoring-config
    namespace: openshift-monitoring
  data:
    config.yaml: |
      prometheusK8s:
        volumeClaimTemplate:
          metadata:
            name: pvc
          spec:
            resources:
              requests:
                storage: 5Gi
  EOF
  $ oc apply -f manifest_cluster-monitoring-pvc.yml
  $ oc -n openshift-monitoring get -o wide pods | grep prometheus-k8s
  prometheus-k8s-0                               0/7     ContainerCreating   0          16s   <none>         ip-10-0-253-132.us-west-1.compute.internal   <none>           <none>
  prometheus-k8s-1                               0/7     ContainerCreating   0          16s   <none>         ip-10-0-253-132.us-west-1.compute.internal   <none>           <none>

Uncordon:

  $ oc adm uncordon ip-10-0-129-15.us-west-1.compute.internal
  $ oc adm uncordon ip-10-0-184-219.us-west-1.compute.internal

Update:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.7"}]'
  $ oc adm upgrade --to 4.7.14

Wait a while.  Confirm that, as expected, it hung:

  $ oc adm upgrade            
  info: An upgrade is in progress. Working towards 4.7.14: 524 of 669 done (78% complete), waiting on monitoring
  ...
  $ oc get -o json clusteroperator monitoring | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-06-03T22:10:36Z Available=False : 
  2021-06-03T23:01:16Z Progressing=True RollOutInProgress: Rolling out the stack.
  2021-06-03T22:10:36Z Degraded=True UpdatingPrometheusK8SFailed: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas
  2021-06-03T23:01:16Z Upgradeable=True RollOutInProgress: Rollout of the monitoring stack is in progress. Please wait until it finishes.
  $ oc -n openshift-monitoring get -o wide pods | grep prometheus-k8s
  prometheus-k8s-0                               7/7     Running   1          94m   10.129.2.10    ip-10-0-253-132.us-west-1.compute.internal   <none>           <none>
  prometheus-k8s-1                               0/7     Pending   0          60m   <none>         <none>                                       <none>           <none>
  $ oc -n openshift-monitoring get -o json pod prometheus-k8s-1 | jq .status
  {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2021-06-03T22:05:41Z",
        "message": "0/6 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
        "reason": "Unschedulable",
        "status": "False",
        "type": "PodScheduled"
      }
    ],
    "phase": "Pending",
    "qosClass": "Burstable"
  }

Check the PVs:

  $ oc get persistentvolumes
  NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                       STORAGECLASS   REASON   AGE
  pvc-9e6673ed-31d4-4e6c-ab64-bd801210c126   5Gi        RWO            Delete           Bound    openshift-monitoring/pvc-prometheus-k8s-0   gp2                     95m
  pvc-bc4f1b64-c233-4c51-a78e-5ac793b6c025   5Gi        RWO            Delete           Bound    openshift-monitoring/pvc-prometheus-k8s-1   gp2                     95m

Hopefully unstick by deleting the stuck pod's PV:

  $ oc -n openshift-monitoring delete persistentvolume pvc-bc4f1b64-c233-4c51-a78e-5ac793b6c025
  
This didn't actually complete; it just moved the PV to Terminating status, because the PVC blocks PV deletion [3].  Remove the PVC too:

  $ oc -n openshift-monitoring delete persistentvolumeclaim pvc-prometheus-k8s-1

Ok, that unblocked the PV:

  $ oc get persistentvolumes
  NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                       STORAGECLASS   REASON   AGE
  pvc-9e6673ed-31d4-4e6c-ab64-bd801210c126   5Gi        RWO            Delete           Bound    openshift-monitoring/pvc-prometheus-k8s-0   gp2                     103m

But not the pod:

  $ oc -n openshift-monitoring get -o json pod prometheus-k8s-1 | jq .status
  {
    "conditions": [
      {
        "lastProbeTime": null,
        "lastTransitionTime": "2021-06-03T22:05:41Z",
        "message": "0/6 nodes are available: 6 persistentvolumeclaim \"pvc-prometheus-k8s-1\" not found.",
        "reason": "Unschedulable",
        "status": "False",
        "type": "PodScheduled"
      }
    ],
    "phase": "Pending",
    "qosClass": "Burstable"
  }

Delete the pod to get a fresh replacement:

  $ oc -n openshift-monitoring delete pod prometheus-k8s-1                             
  pod "prometheus-k8s-1" deleted

Hooray:

  $ oc -n openshift-monitoring get -o wide pods | grep prometheus-k8s
  prometheus-k8s-0                               7/7     Running   1          105m   10.129.2.10    ip-10-0-253-132.us-west-1.compute.internal   <none>           <none>
  prometheus-k8s-1                               6/7     Running   1          40s    10.131.0.56    ip-10-0-184-219.us-west-1.compute.internal   <none>           <none>

And the update is flowing again:

  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.7.14: 531 of 669 done (79% complete)
  ...

So, for folks who are running 4.7.13 or earlier (and thus have soft anti-affinity), who happen to have their Prom pods scheduled to the same node, and who happen to have only that node as a possible PV attachment point (e.g. because their storage provider pins PVs to a single availability zone and they only have one worker node in that zone), the update to 4.7.14 can get stuck, and recovery requires manual intervention on the order of three 'oc ... delete ...' calls (it's possible the three I used can be optimized).  I'm moving the affected version back to 4.7 and setting the Regression keyword, and we can sort out whether we think this corner case is large enough to be worth tombstoning 4.7.14.
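
For an affected cluster, the recovery above boils down to roughly the following; the names are from this reproducer (substitute the stuck replica's PV and PVC), and note that deleting the PV/PVC discards that replica's stored metrics:

  $ oc get persistentvolumes        # find the PV bound to the stuck replica's claim
  $ oc -n openshift-monitoring delete persistentvolume pvc-bc4f1b64-c233-4c51-a78e-5ac793b6c025
  $ oc -n openshift-monitoring delete persistentvolumeclaim pvc-prometheus-k8s-1
  $ oc -n openshift-monitoring delete pod prometheus-k8s-1   # statefulset recreates it with a fresh volume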

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1400546094630309888
[2]: https://github.com/openshift/release/pull/11546/files
[3]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#storage-object-in-use-protection

Comment 3 W. Trevor King 2021-06-04 04:48:00 UTC
Also interesting from the comment 2 reproducer, here's the pod that stuck:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1400546094630309888/artifacts/launch/events.json | jq -r '.items[] | select(tostring | (contains("prometheus-k8s-1") and (contains("pvc-") or contains("Scheduled")))) | (.lastTimestamp // .eventTime) + " " + .reason + ": " + .message' | sort
  2021-06-03T20:35:35.514455Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-184-219.us-west-1.compute.internal
  2021-06-03T21:28:44.430634Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-253-132.us-west-1.compute.internal
  2021-06-03T21:31:11Z SuccessfulCreate: create Claim pvc-prometheus-k8s-1 Pod prometheus-k8s-1 in StatefulSet prometheus-k8s success
  2021-06-03T21:31:11Z WaitForFirstConsumer: waiting for first consumer to be created before binding
  2021-06-03T21:31:17.993036Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-253-132.us-west-1.compute.internal
  2021-06-03T21:31:17Z ProvisioningSucceeded: Successfully provisioned volume pvc-bc4f1b64-c233-4c51-a78e-5ac793b6c025 using kubernetes.io/aws-ebs
  2021-06-03T21:31:22Z SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-bc4f1b64-c233-4c51-a78e-5ac793b6c025" 
  2021-06-03T23:14:42.080514Z FailedScheduling: 0/6 nodes are available: 6 persistentvolumeclaim "pvc-prometheus-k8s-1" not found.
  2021-06-03T23:14:52.410034Z FailedScheduling: 0/6 nodes are available: 6 persistentvolumeclaim "pvc-prometheus-k8s-1" not found.
  2021-06-03T23:16:13Z SuccessfulCreate: create Claim pvc-prometheus-k8s-1 Pod prometheus-k8s-1 in StatefulSet prometheus-k8s success
  2021-06-03T23:16:13Z WaitForFirstConsumer: waiting for first consumer to be created before binding
  2021-06-03T23:16:18Z ProvisioningSucceeded: Successfully provisioned volume pvc-6f62185b-198a-4846-aa74-145c948d89ed using kubernetes.io/aws-ebs
  2021-06-03T23:16:19.549401Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-184-219.us-west-1.compute.internal
  2021-06-03T23:16:24Z SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-6f62185b-198a-4846-aa74-145c948d89ed" 
  2021-06-03T23:41:40.236834Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-129-15.us-west-1.compute.internal
  2021-06-03T23:41:40Z FailedAttachVolume: Multi-Attach error for volume "pvc-6f62185b-198a-4846-aa74-145c948d89ed" Volume is already exclusively attached to one node and can't be attached to another
  2021-06-03T23:41:53Z SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-6f62185b-198a-4846-aa74-145c948d89ed" 

Here's the other Prom pod:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1400546094630309888/artifacts/launch/events.json | jq -r '.items[] | select(tostring | (contains("prometheus-k8s-0") and (contains("pvc-") or contains("Scheduled")))) | (.lastTimestamp // .eventTime) + " " + .reason + ": " + .message' | sort
  2021-06-03T20:35:35.384820Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-184-219.us-west-1.compute.internal
  2021-06-03T21:28:38.436446Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-253-132.us-west-1.compute.internal
  2021-06-03T21:31:11Z SuccessfulCreate: create Claim pvc-prometheus-k8s-0 Pod prometheus-k8s-0 in StatefulSet prometheus-k8s success
  2021-06-03T21:31:11Z WaitForFirstConsumer: waiting for first consumer to be created before binding
  2021-06-03T21:31:17.836268Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-253-132.us-west-1.compute.internal
  2021-06-03T21:31:17Z ProvisioningSucceeded: Successfully provisioned volume pvc-9e6673ed-31d4-4e6c-ab64-bd801210c126 using kubernetes.io/aws-ebs
  2021-06-03T21:31:20Z SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-9e6673ed-31d4-4e6c-ab64-bd801210c126" 
  2021-06-03T23:17:01.551702Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-253-132.us-west-1.compute.internal
  2021-06-03T23:17:09Z SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-9e6673ed-31d4-4e6c-ab64-bd801210c126" 
  2021-06-03T23:35:19.165148Z Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-253-132.us-west-1.compute.internal
  2021-06-03T23:35:21Z SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-9e6673ed-31d4-4e6c-ab64-bd801210c126" 

And here are the nodes:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1400546094630309888/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "default" and (.reason | match("Node.*Ready"))) | .lastTimestamp + " " + .reason + " " + .message' | sort
  2021-06-03T20:26:48Z NodeReady Node ip-10-0-189-50.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T20:26:50Z NodeReady Node ip-10-0-140-103.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T20:34:20Z NodeReady Node ip-10-0-184-219.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T20:37:31Z NodeReady Node ip-10-0-129-15.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T20:37:31Z NodeReady Node ip-10-0-253-132.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T23:35:09Z NodeNotReady Node ip-10-0-140-103.us-west-1.compute.internal status is now: NodeNotReady
  2021-06-03T23:36:09Z NodeReady Node ip-10-0-140-103.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T23:39:24Z NodeNotReady Node ip-10-0-189-50.us-west-1.compute.internal status is now: NodeNotReady
  2021-06-03T23:40:24Z NodeNotReady Node ip-10-0-129-15.us-west-1.compute.internal status is now: NodeNotReady
  2021-06-03T23:40:51Z NodeReady Node ip-10-0-189-50.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T23:41:07Z NodeReady Node ip-10-0-129-15.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T23:44:27Z NodeNotReady Node ip-10-0-184-219.us-west-1.compute.internal status is now: NodeNotReady
  2021-06-03T23:44:27Z NodeNotReady Node ip-10-0-204-184.us-west-1.compute.internal status is now: NodeNotReady
  2021-06-03T23:44:33Z NodeReady Node ip-10-0-184-219.us-west-1.compute.internal status is now: NodeReady
  2021-06-03T23:45:35Z NodeReady Node ip-10-0-204-184.us-west-1.compute.internal status is now: NodeReady

Putting those all together:

* 20:35Z was the initial prometheus-k8s-0 and prometheus-k8s-1 installs.
* 21:28Z I push prometheus-k8s-0 and prometheus-k8s-1 onto node 132, in zone 1b, using cordons.
* 21:31Z I configure the volumeClaimTemplate and the initial persistent volume creation.
* 23:14Z the scheduler gets mad about prometheus-k8s-1 after I'd deleted the persistent volume and claim, but before I'd deleted the pod.
* 23:16Z the new prometheus-k8s-1, persistent volume (now pvc-6f62...), and claim are created after I'd deleted the pod too.  Interestingly, the new pod came back up on node 219, in zone 1a.
* 23:17Z new prometheus-k8s-0 too, as it gets bumped to the 4.7.14 images.
* 23:35Z new prometheus-k8s-0, still on node 132, in zone 1b.  More on this below.
* 23:37Z node 15 goes down.
* 23:41Z node 15 comes back.
* 23:41Z prometheus-k8s-1 is rescheduled during the pool-rolling period onto node 15, also in 1a, presumably because 219 is being cordoned and drained and node 15 is in the same zone and can handle the volume attachment.
* 23:44Z node 219 finishes draining and goes NodeNotReady.

I dunno why we don't have Node*Ready events for node 132 around 23:35Z.  But events are best-effort.  However, the node conditions also have old timestamps:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1400546094630309888/artifacts/launch/nodes.json | jq -r '.items[] | select(.metadata.name == "ip-10-0-253-132.us-west-1.compute.internal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
  2021-06-03T20:36:31Z DiskPressure=False KubeletHasNoDiskPressure: kubelet has no disk pressure
  2021-06-03T20:36:31Z MemoryPressure=False KubeletHasSufficientMemory: kubelet has sufficient memory available
  2021-06-03T20:36:31Z PIDPressure=False KubeletHasSufficientPID: kubelet has sufficient PID available
  2021-06-03T20:37:31Z Ready=True KubeletReady: kubelet is posting ready status

Node journals show we actually did reboot:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1400546094630309888/artifacts/launch/nodes/ip-10-0-253-132.us-west-1.compute.internal/journal | gunzip | grep -A6 'Starting Reboot'
  Jun 03 23:33:49.058030 ip-10-0-253-132 systemd[1]: Starting Reboot...
  Jun 03 23:33:49.066748 ip-10-0-253-132 systemd[1]: Shutting down.
  Jun 03 23:33:49.124865 ip-10-0-253-132 systemd-shutdown[1]: Syncing filesystems and block devices.
  Jun 03 23:33:49.186939 ip-10-0-253-132 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
  Jun 03 23:33:49.195302 ip-10-0-253-132 systemd-journald[845]: Journal stopped
  -- Logs begin at Thu 2021-06-03 20:30:10 UTC, end at Thu 2021-06-03 23:47:29 UTC. --
  Jun 03 23:34:39.524762 localhost kernel: Linux version 4.18.0-240.22.1.el8_3.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Mar 25 14:36:04 EDT 2021

So maybe some sort of kubelet or API-server bug in getting the node object updated?  Anyhow, this lack of node 132 condition changes seems unrelated to this affinity bug.

Comment 6 Junqi Zhao 2021-06-07 07:43:49 UTC
The revert (back to soft anti-affinity) does not help on its own; the pod still hits the volume node affinity conflict once its node goes away:
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep podAntiAffinity
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: prometheus
                  app.kubernetes.io/name: prometheus
                  app.kubernetes.io/part-of: openshift-monitoring
                  prometheus: k8s
              namespaces:
              - openshift-monitoring
              topologyKey: kubernetes.io/hostname
            weight: 100
...

# oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-143-99.us-west-2.compute.internal    Ready    master   100m   v1.21.0-rc.0+2dfc46b
ip-10-0-146-41.us-west-2.compute.internal    Ready    master   99m    v1.21.0-rc.0+2dfc46b
ip-10-0-169-141.us-west-2.compute.internal   Ready    worker   93m    v1.21.0-rc.0+2dfc46b
ip-10-0-171-63.us-west-2.compute.internal    Ready    worker   92m    v1.21.0-rc.0+2dfc46b
ip-10-0-206-43.us-west-2.compute.internal    Ready    worker   92m    v1.21.0-rc.0+2dfc46b
ip-10-0-218-125.us-west-2.compute.internal   Ready    master   100m   v1.21.0-rc.0+2dfc46b

# oc -n openshift-monitoring get pod -o wide | grep prometheus-k8s
prometheus-k8s-0                              7/7     Running   1          13m    10.129.2.37    ip-10-0-171-63.us-west-2.compute.internal    <none>           <none>
prometheus-k8s-1                              7/7     Running   1          13m    10.128.2.13    ip-10-0-206-43.us-west-2.compute.internal    <none>           <none>

Remove the node that prometheus-k8s-1 is scheduled on:
# oc delete node ip-10-0-206-43.us-west-2.compute.internal
node "ip-10-0-206-43.us-west-2.compute.internal" deleted


# oc -n openshift-monitoring get pod -o wide | grep prometheus-k8s
prometheus-k8s-0                              7/7     Running    1          24m     10.129.2.37    ip-10-0-171-63.us-west-2.compute.internal    <none>           <none>
prometheus-k8s-1                              0/7     Pending    0          7m29s   <none>         <none>                                       <none>           <none>

# oc -n openshift-monitoring describe pod prometheus-k8s-1 
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <invalid>  default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  <invalid>  default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  <invalid>  default-scheduler  0/5 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Comment 7 Junqi Zhao 2021-06-07 07:44:57 UTC
This is the PVC used by monitoring
# oc -n openshift-monitoring get pvc
NAME                               STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-alertmanager-main-0   Bound         pvc-979311f1-e47a-45e1-8571-a3c255f2141a   4Gi        RWO            gp2            25m
alertmanager-alertmanager-main-1   Bound         pvc-3a17a889-0f57-4c0b-9fc0-f4f7da3b448a   4Gi        RWO            gp2            25m
alertmanager-alertmanager-main-2   Bound         pvc-ad50e689-57a3-41b5-b53c-7a6b0bfe68c1   4Gi        RWO            gp2            25m
prometheus-prometheus-k8s-0        Bound         pvc-aababb58-2840-46b9-b653-f252d3036d86   10Gi       RWO            gp2            25m
prometheus-prometheus-k8s-1        Bound         pvc-66347a06-8766-4d43-bce6-319c6d410fcf   10Gi       RWO            gp2            25m

Comment 10 Junqi Zhao 2021-06-07 13:40:35 UTC
The machine-config and monitoring operators try to upgrade to 4.8 but hit errors. We still see the volume node affinity conflict error when PVs are attached to the alertmanager/prometheus pods and those pods all land on the same node, so the issue remains in this case.

# oc -n openshift-monitoring get po -o wide | grep -E "alertmanager-main|prometheus-k8s"
alertmanager-main-0                            5/5     Running   0          59m   10.128.2.53    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
alertmanager-main-1                            5/5     Running   0          59m   10.128.2.54    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
alertmanager-main-2                            0/5     Pending   0          38m   <none>         <none>                                       <none>           <none>
prometheus-k8s-0                               0/7     Pending   0          38m   <none>         <none>                                       <none>           <none>
prometheus-k8s-1                               7/7     Running   1          59m   10.128.2.58    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>

# oc get co monitoring machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring       4.8.0-0.nightly-2021-06-07-034343   False       True          True       23m
machine-config   4.7.0-0.nightly-2021-06-06-160728   False       True          True       45m

# oc get no ip-10-0-205-141.us-east-2.compute.internal
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-205-141.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   141m   v1.20.0+2817867

# oc -n openshift-monitoring describe pod alertmanager-main-2
  Warning  FailedScheduling  7m23s  default-scheduler  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling  4m32s  default-scheduler  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling  105s   default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

# oc -n openshift-monitoring describe pod prometheus-k8s-0 
  Warning  FailedScheduling  8m34s  default-scheduler  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling  5m45s  default-scheduler  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling  2m58s  default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Comment 11 Junqi Zhao 2021-06-07 13:44:34 UTC
This is the state before the upgrade; the cluster version is 4.7.0-0.nightly-2021-06-06-160728. Comment 10 shows the result after upgrading to 4.8.0-0.nightly-2021-06-07-034343.
# oc -n openshift-monitoring get po -o wide | grep -E "alertmanager-main|prometheus-k8s"
alertmanager-main-0                            5/5     Running   0          12m   10.128.2.53    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
alertmanager-main-1                            5/5     Running   0          12m   10.128.2.54    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
alertmanager-main-2                            5/5     Running   0          12m   10.128.2.55    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
prometheus-k8s-0                               7/7     Running   1          11m   10.128.2.52    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   1          11m   10.128.2.58    ip-10-0-205-141.us-east-2.compute.internal   <none>           <none>
# oc -n openshift-monitoring get pvc
NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-alertmanager-main-0   Bound    pvc-98e390a7-5dba-423a-8424-f83df96e51f4   4Gi        RWO            gp2            49m
alertmanager-alertmanager-main-1   Bound    pvc-4fbc08a3-5a91-4cbf-949e-13dc20249684   4Gi        RWO            gp2            49m
alertmanager-alertmanager-main-2   Bound    pvc-00c9e5dc-c531-4610-adf1-e27e4eea0b67   4Gi        RWO            gp2            49m
prometheus-prometheus-k8s-0        Bound    pvc-fc0c50f4-0d8e-4725-b00d-76336d382dae   10Gi       RWO            gp2            48m
prometheus-prometheus-k8s-1        Bound    pvc-8b12b298-693e-488f-93dc-d218f706b20a   10Gi       RWO            gp2            48m

Comment 12 Junqi Zhao 2021-06-08 07:12:12 UTC
followed steps
• OCP 4.7 cluster
• node A is in AZ 1, nodes B and C in AZ 2
• prom0 and prom1 scheduled on node A with persistent volumes
• upgrade to 4.8
• CMO goes unavailable/degraded because the hard affinity makes it impossible to schedule prom1 (or prom0) on nodes which are in AZ 2 (the PV sticks to AZ 1)


Tried again, binding PVs only for the prometheus pods; the "volume node affinity conflict" issue still occurs.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-06-06-160728   True        False         3m49s   Cluster version is 4.7.0-0.nightly-2021-06-06-160728


# oc get node --show-labels
NAME                                         STATUS   ROLES    AGE   VERSION           LABELS
ip-10-0-156-55.us-west-1.compute.internal    Ready    master   31m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-1,failure-domain.beta.kubernetes.io/zone=us-west-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-156-55,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-west-1a,topology.kubernetes.io/region=us-west-1,topology.kubernetes.io/zone=us-west-1a
ip-10-0-167-70.us-west-1.compute.internal    Ready    master   31m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-1,failure-domain.beta.kubernetes.io/zone=us-west-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-167-70,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-west-1a,topology.kubernetes.io/region=us-west-1,topology.kubernetes.io/zone=us-west-1a
ip-10-0-177-19.us-west-1.compute.internal    Ready    worker   24m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-1,failure-domain.beta.kubernetes.io/zone=us-west-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-177-19,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-west-1a,topology.kubernetes.io/region=us-west-1,topology.kubernetes.io/zone=us-west-1a
ip-10-0-178-86.us-west-1.compute.internal    Ready    worker   22m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-1,failure-domain.beta.kubernetes.io/zone=us-west-1a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-178-86,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-west-1a,topology.kubernetes.io/region=us-west-1,topology.kubernetes.io/zone=us-west-1a
ip-10-0-216-216.us-west-1.compute.internal   Ready    worker   22m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-1,failure-domain.beta.kubernetes.io/zone=us-west-1b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-216-216,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-west-1b,topology.kubernetes.io/region=us-west-1,topology.kubernetes.io/zone=us-west-1b
ip-10-0-253-73.us-west-1.compute.internal    Ready    master   31m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-west-1,failure-domain.beta.kubernetes.io/zone=us-west-1b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-253-73,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-west-1b,topology.kubernetes.io/region=us-west-1,topology.kubernetes.io/zone=us-west-1b

ip-10-0-216-216.us-west-1.compute.internal is the worker with topology.kubernetes.io/zone=us-west-1b; the other workers are in topology.kubernetes.io/zone=us-west-1a. Attach PVs for the prometheus pods and schedule the prometheus pods onto ip-10-0-216-216.us-west-1.compute.internal:
# oc -n openshift-monitoring get po -o wide | grep "prometheus-k8s"
prometheus-k8s-0                               7/7     Running   1          38s   10.129.2.22    ip-10-0-216-216.us-west-1.compute.internal   <none>           <none>
prometheus-k8s-1                               7/7     Running   1          38s   10.129.2.21    ip-10-0-216-216.us-west-1.compute.internal   <none>           <none>

# oc -n openshift-monitoring get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-prometheus-k8s-0   Bound    pvc-d3af3885-d7a1-48b0-a8f9-776f0b4debc2   10Gi       RWO            gp2            56s
prometheus-prometheus-k8s-1   Bound    pvc-b7b330f1-1ffe-43de-ba00-0eb11e995cdc   10Gi       RWO            gp2            56s

upgrade to 4.8.0-0.nightly-2021-06-07-034343
# oc -n openshift-monitoring get po -o wide | grep "prometheus-k8s"
prometheus-k8s-0                               0/7     Pending   0          47m   <none>         <none>                                       <none>           <none>
prometheus-k8s-1                               7/7     Running   1          82m   10.129.2.35    ip-10-0-216-216.us-west-1.compute.internal   <none>           <none>

# oc -n openshift-monitoring describe pod prometheus-k8s-0
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  24m   default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  24m   default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  24m   default-scheduler  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling  21m   default-scheduler  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling  18m   default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

# oc get co machine-config -oyaml
...
  - lastTransitionTime: "2021-06-08T06:52:43Z"
    message: Cluster not available for 4.8.0-0.nightly-2021-06-07-034343
    status: "False"
    type: Available
  extension:
    master: all 3 nodes are at latest configuration rendered-master-311d2a1cedd6275cd6ff0f9e6e7f355c
    worker: 'pool is degraded because nodes fail with "1 nodes are reporting degraded
      status on sync": "Node ip-10-0-216-216.us-west-1.compute.internal is reporting:
      \"failed to drain node (5 tries): timed out waiting for the condition: error
      when evicting pods/\\\"prometheus-k8s-1\\\" -n \\\"openshift-monitoring\\\":
      global timeout reached: 1m30s\""'

# oc get co monitoring machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring       4.8.0-0.nightly-2021-06-07-034343   False       True          True       41m
machine-config   4.8.0-0.nightly-2021-06-07-034343   False       False         True       17m

Comment 14 Junqi Zhao 2021-06-08 07:22:12 UTC
Continuing from Comment 12: after the upgrade to 4.8, ip-10-0-216-216.us-west-1.compute.internal is SchedulingDisabled, because machine-config must cordon the node (SchedulingDisabled) in order to upgrade it to 4.8.
# oc get no ip-10-0-216-216.us-west-1.compute.internal
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-216-216.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   150m   v1.20.0+2817867

Comment 16 W. Trevor King 2021-06-08 21:28:19 UTC
We tombstoned 4.7.14 on this issue [1].  That's not technically a blocked edge, but it's pretty similar, so I'll mark up this bug as if it was a blocked edge [2].

[1]: https://github.com/openshift/cincinnati-graph-data/pull/839
[2]: https://github.com/openshift/enhancements/pull/475

Comment 19 Junqi Zhao 2021-06-09 08:10:19 UTC
On a 4.7.0-0.nightly-2021-06-07-203428 cluster, ip-10-0-209-88.us-east-2.compute.internal is the worker node with topology.kubernetes.io/zone=us-east-2b, which differs from the other worker nodes. Bind PVs for the alertmanager/prometheus pods, schedule those pods onto ip-10-0-209-88.us-east-2.compute.internal, then upgrade to 4.8.0-0.nightly-2021-06-09-000526: no "volume node affinity conflict" error now.

# oc get node --show-labels
NAME                                         STATUS   ROLES    AGE   VERSION           LABELS
ip-10-0-132-29.us-east-2.compute.internal    Ready    worker   39m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-132-29,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-148-89.us-east-2.compute.internal    Ready    master   46m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-148-89,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-151-181.us-east-2.compute.internal   Ready    worker   40m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-151-181,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-164-105.us-east-2.compute.internal   Ready    master   46m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-164-105,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-209-88.us-east-2.compute.internal    Ready    worker   40m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-209-88.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m4.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-217-206.us-east-2.compute.internal   Ready    master   46m   v1.20.0+2817867   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-217-206,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b


# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-06-07-203428   True        False         4m21s   Cluster version is 4.7.0-0.nightly-2021-06-07-203428

# oc -n openshift-monitoring get po -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0                            5/5     Running   0          2m18s   10.131.0.34    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
alertmanager-main-1                            5/5     Running   0          2m18s   10.131.0.35    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
alertmanager-main-2                            5/5     Running   0          2m18s   10.131.0.36    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
prometheus-k8s-0                               7/7     Running   1          2m18s   10.131.0.32    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
prometheus-k8s-1                               7/7     Running   1          2m18s   10.131.0.33    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-9fb04e7e-6498-450a-a77b-421a02b55870   4Gi        RWO            gp2            2m25s
alertmanager-main-db-alertmanager-main-1   Bound    pvc-7022640d-3bda-45ec-bbf1-0f572804323e   4Gi        RWO            gp2            2m25s
alertmanager-main-db-alertmanager-main-2   Bound    pvc-4a62d6ab-18a6-4650-a92b-c8f831e61c34   4Gi        RWO            gp2            2m25s
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-ba0ec8c9-c0aa-412f-b8ed-2d696588d2eb   10Gi       RWO            gp2            2m25s
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-79c19466-d33a-47f4-8841-d2ee5ec98a1b   10Gi       RWO            gp2            2m25s


upgrade to 4.8.0-0.nightly-2021-06-09-000526
# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-06-09-000526 --allow-explicit-upgrade=true --force

after upgrade
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-09-000526   True        False         3m12s   Cluster version is 4.8.0-0.nightly-2021-06-09-000526

# oc -n openshift-monitoring get po -o wide | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0                            5/5     Running   0          21m     10.131.0.14    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
alertmanager-main-1                            5/5     Running   0          21m     10.131.0.16    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
alertmanager-main-2                            5/5     Running   0          21m     10.131.0.15    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
prometheus-k8s-0                               7/7     Running   1          21m     10.131.0.12    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
prometheus-k8s-1                               7/7     Running   1          21m     10.131.0.11    ip-10-0-209-88.us-east-2.compute.internal    <none>           <none>
# oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-9fb04e7e-6498-450a-a77b-421a02b55870   4Gi        RWO            gp2            75m
alertmanager-main-db-alertmanager-main-1   Bound    pvc-7022640d-3bda-45ec-bbf1-0f572804323e   4Gi        RWO            gp2            75m
alertmanager-main-db-alertmanager-main-2   Bound    pvc-4a62d6ab-18a6-4650-a92b-c8f831e61c34   4Gi        RWO            gp2            75m
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-ba0ec8c9-c0aa-412f-b8ed-2d696588d2eb   10Gi       RWO            gp2            75m
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-79c19466-d33a-47f4-8841-d2ee5ec98a1b   10Gi       RWO            gp2            75m

Comment 21 Clayton Coleman 2021-06-22 14:04:01 UTC
These clusters are "broken".  How are we getting customers to fix these clusters, and how are we alerting them that there is a problem?

Two prometheus replicas tied to one instance due to volumes is a high-severity bug, and the admin needs to take corrective action.  Are we alerting on this situation now?  The PDB is what we want: these users are wasting resources (they expect prometheus to be HA) and are not able to fix it.  The product bug is not the PDB; the bug is that we allowed the cluster to get into this state and didn't notify the admin why.

I expect us to 

a) deliver an alert that flags this with corrective action
b) once that alert rate is down, redeliver the PDB in 4.9 to fix the issue
c) potentially broaden the alert if necessary to other similar cases

Comment 22 W. Trevor King 2021-06-22 16:59:13 UTC
(In reply to Clayton Coleman from comment #21)
> a) deliver an alert that flags this with corrective action

Bug 1974832 has been created to track this.
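
Purely as an illustration of what such an alert could look like (the actual rule and thresholds are being worked out under bug 1974832; the names and expression below are assumptions), a PrometheusRule flagging colocated prometheus replicas might be roughly:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: prometheus-colocation-example   # hypothetical name, illustration only
    namespace: openshift-monitoring
  spec:
    groups:
    - name: prometheus-colocation
      rules:
      - alert: PrometheusReplicasColocated
        # kube_pod_info reports the node each pod runs on; more than one prometheus-k8s
        # replica on the same node means the pair is not actually highly available.
        expr: 'count by (node) (kube_pod_info{namespace="openshift-monitoring", pod=~"prometheus-k8s-.*"}) > 1'
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: Multiple prometheus-k8s replicas are scheduled on the same node.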

> b) once that alert rate is down, redeliver the PDB in 4.9 to fix the issue

With the reversions from this bug landing, bug 1949262 was re-opened to track this redelivery.
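
For reference, the PDB in question is essentially a standard PodDisruptionBudget over the prometheus-k8s pods; a sketch (not the exact manifest removed in PR 1204):

  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: prometheus-k8s
    namespace: openshift-monitoring
  spec:
    # With two replicas, minAvailable: 1 means voluntary evictions (e.g. node drains during
    # an upgrade) can only take one replica down at a time.
    minAvailable: 1
    selector:
      matchLabels:
        app.kubernetes.io/name: prometheus
        prometheus: k8s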

> c) potentially broaden the alert if necessary to other similar cases

If this needs to be extended, it sounds like a separate bug too.

With (a) and (b) being tracked in other places, and this bug's revert already being taken back to 4.7.16 via bug 1967966, I'm going to move this back to VERIFIED.

Comment 24 errata-xmlrpc 2021-07-27 23:11:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

