Description of problem:
After logging in to a 4.11.0-0.nightly-2022-05-20-213928 HyperShift cluster with the guest cluster kubeconfig, there are only 2 worker nodes and no master nodes. Although there is bug 2089224, we can enable UWM first:

# oc get node --show-labels
NAME                                         STATUS   ROLES    AGE   VERSION           LABELS
ip-10-0-131-0.us-east-2.compute.internal     Ready    worker   24h   v1.23.3+ad897c4   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-131-0.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-142-131.us-east-2.compute.internal   Ready    worker   24h   v1.23.3+ad897c4   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-142-131.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a

The prometheus-operator pod fails to start because no node matches its node selector:

# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-67fd5dfd46-cfq9b   0/2     Pending   0          70s

# oc -n openshift-user-workload-monitoring describe pod prometheus-operator-67fd5dfd46-cfq9b
...
Node-Selectors:  kubernetes.io/os=linux
                 node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                            From                Message
  ----     ------             ---                            ----                -------
  Warning  FailedScheduling   <invalid>                      default-scheduler   0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling   <invalid>                      default-scheduler   0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.
  Normal   NotTriggerScaleUp  <invalid> (x9 over <invalid>)  cluster-autoscaler  pod didn't trigger scale-up:

# oc get node -l kubernetes.io/os=linux -l node-role.kubernetes.io/master=
No resources found

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-20-213928 HyperShift cluster with guest cluster kubeconfig

How reproducible:
always

Steps to Reproduce:
1. On a 4.11.0-0.nightly-2022-05-20-213928 HyperShift cluster, using the guest cluster kubeconfig, enable UWM
2.
3.

Actual results:
The prometheus-operator pod fails to start because no node matches its node selector.

Expected results:
No error.

Additional info:
Please help to confirm whether we support UWM in HyperShift clusters.
User workload monitoring is supported, but a requirement for HyperShift clusters is that you specify the nodeSelector for each component of user workload monitoring in the user workload monitoring ConfigMap. The node selector can be something like:

kubernetes.io/os: linux

See:
https://github.com/openshift/cluster-monitoring-operator/blob/2584b3b1694fb3cc86afa1e63193effb5c356934/pkg/manifests/config.go#L149
https://github.com/openshift/cluster-monitoring-operator/blob/2584b3b1694fb3cc86afa1e63193effb5c356934/pkg/manifests/config.go#L562
https://github.com/openshift/cluster-monitoring-operator/blob/2584b3b1694fb3cc86afa1e63193effb5c356934/pkg/manifests/config.go#L258
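To illustrate the workaround, a sketch of such a ConfigMap covering the UWM components (the `prometheusOperator`, `prometheus`, and `thanosRuler` stanzas follow the config schema in the linked config.go; adjust to the components you actually run):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      nodeSelector:
        kubernetes.io/os: linux
    prometheus:
      nodeSelector:
        kubernetes.io/os: linux
    thanosRuler:
      nodeSelector:
        kubernetes.io/os: linux
```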
@cewong IIUC you want to provide a node selector to the UWM components. AFAIK this can already be done through the ConfigMap in the UWM namespace, e.g.:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      nodeSelector:
        kubernetes.io/os: "linux"

So if I'm not missing anything, the hypershift operator now just has to start managing the UWM ConfigMap. With this in mind I'm going to send this bug to the HyperShift team; please send it back in case I've missed something.
@jmarcal my comment was that this can work if you set these node selectors yourself. The way I see it, we have 2 options:

1) Document that if you want to enable UWM on a HyperShift cluster, you must set the node selectors in the UWM config so that UWM pods can be scheduled.

OR

2) The CMO can modify the default node selector based on the value of infrastructure.status.controlPlaneTopology. If it equals "External", then default to a selector that doesn't include masters.

IMHO #2 is a better UX option for end users.
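Option #2 could be sketched roughly like this (a hypothetical helper, not the actual CMO code; the real logic would live in cluster-monitoring-operator's manifests package and read the topology from the Infrastructure resource):

```go
package main

import "fmt"

// defaultUWMNodeSelector returns the default node selector for UWM's
// prometheus-operator based on the cluster's control plane topology.
// Hypothetical sketch: the real CMO implementation differs in detail.
func defaultUWMNodeSelector(controlPlaneTopology string) map[string]string {
	if controlPlaneTopology == "External" {
		// Control plane nodes are not part of the guest cluster
		// (e.g. HyperShift), so don't pin pods to masters.
		return map[string]string{"kubernetes.io/os": "linux"}
	}
	// Standalone clusters: keep the historical default of scheduling
	// UWM's prometheus-operator on control plane nodes.
	return map[string]string{
		"kubernetes.io/os":               "linux",
		"node-role.kubernetes.io/master": "",
	}
}

func main() {
	fmt.Println(defaultUWMNodeSelector("External"))
	fmt.Println(defaultUWMNodeSelector("HighlyAvailable"))
}
```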
@cewong ahh sorry, I didn't understand you the first time. I agree with you, #2 seems like the best UX. I've implemented that approach; it remains to be seen if the team approves.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069