Bug 2089574 - UWM prometheus-operator pod can't start up due to no master node in hypershift cluster
Summary: UWM prometheus-operator pod can't start up due to no master node in hypershift cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joao Marcal
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-24 03:54 UTC by Junqi Zhao
Modified: 2022-08-10 11:13 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:13:36 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-monitoring-operator pull 1679 (open): Bug 2089574: Removes master node selector in UWM PO when running with HostedControlPlane external (last updated 2022-05-26 15:03:44 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 11:13:54 UTC)

Description Junqi Zhao 2022-05-24 03:54:38 UTC
Description of problem:
Log in to a 4.11.0-0.nightly-2022-05-20-213928 hypershift cluster with the guest cluster kubeconfig; there are only 2 worker nodes and no master node.
Although bug 2089224 exists, we can still enable UWM first.
# oc get node --show-labels
NAME                                         STATUS   ROLES    AGE   VERSION           LABELS
ip-10-0-131-0.us-east-2.compute.internal     Ready    worker   24h   v1.23.3+ad897c4   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-131-0.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-142-131.us-east-2.compute.internal   Ready    worker   24h   v1.23.3+ad897c4   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-142-131.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a

The prometheus-operator pod fails to start because no node matches its node selector:
# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-67fd5dfd46-cfq9b   0/2     Pending   0          70s

# oc -n openshift-user-workload-monitoring describe pod prometheus-operator-67fd5dfd46-cfq9b
...
Node-Selectors:              kubernetes.io/os=linux
                             node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                            From                Message
  ----     ------             ----                           ----                -------
  Warning  FailedScheduling   <invalid>                      default-scheduler   0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling   <invalid>                      default-scheduler   0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.
  Normal   NotTriggerScaleUp  <invalid> (x9 over <invalid>)  cluster-autoscaler  pod didn't trigger scale-up:

# oc get node -l kubernetes.io/os=linux -l node-role.kubernetes.io/master=
No resources found


Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-20-213928 hypershift cluster with Guest cluster kubeconfig

How reproducible:
always

Steps to Reproduce:
1. On a 4.11.0-0.nightly-2022-05-20-213928 hypershift cluster, using the guest cluster kubeconfig, enable UWM (see the ConfigMap sketch below).
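
UWM is enabled through the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace; a minimal sketch for step 1 (standard CMO configuration, shown only to make the reproduction self-contained):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Enables the user workload monitoring stack, which creates the
    # prometheus-operator in openshift-user-workload-monitoring.
    enableUserWorkload: true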

Actual results:
The prometheus-operator pod fails to start because no node matches its node selector.

Expected results:
The prometheus-operator pod is scheduled and starts without errors.

Additional info:
Please help confirm whether UWM is supported in hypershift clusters.

Comment 1 Cesar Wong 2022-05-25 15:56:06 UTC
User workload monitoring is supported, but a requirement for HyperShift clusters is that you specify the nodeSelector for each component of user workload monitoring in the user workload monitoring configMap.
The node selector can be something like:
  kubernetes.io/os: linux

See:
https://github.com/openshift/cluster-monitoring-operator/blob/2584b3b1694fb3cc86afa1e63193effb5c356934/pkg/manifests/config.go#L149
https://github.com/openshift/cluster-monitoring-operator/blob/2584b3b1694fb3cc86afa1e63193effb5c356934/pkg/manifests/config.go#L562
https://github.com/openshift/cluster-monitoring-operator/blob/2584b3b1694fb3cc86afa1e63193effb5c356934/pkg/manifests/config.go#L258

Comment 2 Joao Marcal 2022-05-25 18:48:55 UTC
@cewong IIUC you want to provide a node selector to the UWM components; AFAIK this can already be done through the ConfigMap in the UWM namespace.

E.g:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      nodeSelector:
        kubernetes.io/os: "linux"

So if I'm not missing anything, the hypershift operator just has to start managing the UWM ConfigMap. With this in mind I'm going to send this bug to the hypershift team; please send it back in case I've missed something.
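
Extending that example to every UWM component, as Comment 1 requires (prometheusOperator, prometheus, and thanosRuler are the standard keys in this ConfigMap; the selector value is illustrative and should match labels actually present on the worker nodes):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      nodeSelector:
        kubernetes.io/os: linux   # example; any label carried by the worker nodes works
    prometheus:
      nodeSelector:
        kubernetes.io/os: linux
    thanosRuler:
      nodeSelector:
        kubernetes.io/os: linux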

Comment 3 Cesar Wong 2022-05-25 19:38:45 UTC
@jmarcal my comment was that this can work if you set these node selectors yourself. 

So the way I see it, we have 2 options:
1) document that if you want to enable UWM on a hypershift cluster you must set the node selectors in the UWM config so that UWM pods can be scheduled.
OR
2) the CMO can modify the default node selector based on the value of infrastructure.status.controlPlaneTopology. If it equals "External", then default to a selector that doesn't include masters.

IMHO #2 is a better UX option for end users.
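
For context, option #2 keys off the cluster-scoped Infrastructure resource (oc get infrastructure cluster -o yaml); in a hypershift guest cluster it reports roughly the following (trimmed, with illustrative values apart from controlPlaneTopology):

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  # HyperShift guest clusters run the control plane outside the cluster,
  # so a master node selector has nothing to match.
  controlPlaneTopology: External
  infrastructureTopology: HighlyAvailable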

Comment 4 Joao Marcal 2022-05-26 15:00:51 UTC
@cewong ah, sorry, I didn't understand you the first time. I agree with you that #2 is the best UX; I've implemented that approach for the team to review and approve.
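
Assuming the linked PR follows option #2, the UWM prometheus-operator Deployment on an external-control-plane cluster would default to a node selector without the master label, roughly like this (illustrative fragment, not the exact manifest from the PR):

spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      # node-role.kubernetes.io/master= is dropped when
      # infrastructure.status.controlPlaneTopology is External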

Comment 13 errata-xmlrpc 2022-08-10 11:13:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

