Created attachment 1898664 [details]
context deadline exceeded for kube-controller-manager and kube-scheduler targets

Description of problem:
Found in an IPI-IBMCloud cluster: "context deadline exceeded" for the kube-controller-manager and kube-scheduler targets. The same issue appears in other 4.11 IBMCloud clusters; it was not seen on other IaaS platforms.

# oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090
Browse to http://localhost:9090/graph; the error is visible on the Prometheus targets page.

$ oc -n openshift-kube-controller-manager get ep
NAME                      ENDPOINTS                                               AGE
kube-controller-manager   10.242.0.7:10257,10.242.129.4:10257,10.242.64.6:10257   164m

$ oc -n openshift-kube-scheduler get ep
NAME        ENDPOINTS                                               AGE
scheduler   10.242.0.7:10259,10.242.129.4:10259,10.242.64.6:10259   164m

$ oc get node -o wide | grep -E "10.242.0.7|10.242.129.4|10.242.64.6"
juzhao-0722-f6stv-master-0   Ready   master   167m   v1.24.0+9546431   10.242.0.7     10.242.0.7     Red Hat Enterprise Linux CoreOS 411.86.202207150124-0 (Ootpa)   4.18.0-372.16.1.el8_6.x86_64   cri-o://1.24.1-11.rhaos4.11.gitb0d2ef3.el8
juzhao-0722-f6stv-master-1   Ready   master   167m   v1.24.0+9546431   10.242.64.6    10.242.64.6    Red Hat Enterprise Linux CoreOS 411.86.202207150124-0 (Ootpa)   4.18.0-372.16.1.el8_6.x86_64   cri-o://1.24.1-11.rhaos4.11.gitb0d2ef3.el8
juzhao-0722-f6stv-master-2   Ready   master   167m   v1.24.0+9546431   10.242.129.4   10.242.129.4   Red Hat Enterprise Linux CoreOS 411.86.202207150124-0 (Ootpa)   4.18.0-372.16.1.el8_6.x86_64   cri-o://1.24.1-11.rhaos4.11.gitb0d2ef3.el8

This causes the KubeControllerManagerDown/KubeSchedulerDown and TargetDown alerts to fire for kube-controller-manager and kube-scheduler.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-07-19-104004

$ oc get clusterversion version -o jsonpath="{.spec.clusterID}"
53d9c358-e978-4794-9f47-9fcdb351003d

$ oc get infrastructures/cluster -o jsonpath="{.spec.platformSpec.type}"
IBMCloud

$ oc -n openshift-kube-controller-manager get servicemonitor
NAME                      AGE
kube-controller-manager   3h15m

$ oc -n openshift-kube-controller-manager get servicemonitor kube-controller-manager -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-07-22T04:01:29Z"
  generation: 1
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: openshift-kube-controller-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: fb10af8c-54b0-4736-bef7-a382a180260b
  resourceVersion: "1659"
  uid: 7b7ef618-4a3a-4f1f-b67e-7cb4dcd3ab7c
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: etcd_(debugging|disk|request|server).*
      sourceLabels:
      - __name__
    - action: drop
      regex: rest_client_request_latency_seconds_(bucket|count|sum)
      sourceLabels:
      - __name__
    - action: drop
      regex: root_ca_cert_publisher_sync_duration_seconds_(bucket|count|sum)
      sourceLabels:
      - __name__
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
      serverName: kube-controller-manager.openshift-kube-controller-manager.svc
  namespaceSelector:
    matchNames:
    - openshift-kube-controller-manager
  selector:
    matchLabels:
      prometheus: kube-controller-manager

$ oc -n openshift-kube-scheduler get servicemonitor
NAME             AGE
kube-scheduler   3h15m

$ oc -n openshift-kube-scheduler get servicemonitor kube-scheduler -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-07-22T04:01:31Z"
  generation: 1
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: openshift-kube-scheduler
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: fb10af8c-54b0-4736-bef7-a382a180260b
  resourceVersion: "1716"
  uid: f64e6b22-8ff2-45d8-8908-d123a45028bc
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: scheduler.openshift-kube-scheduler.svc
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 30s
    path: /metrics/resources
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
      serverName: scheduler.openshift-kube-scheduler.svc
  namespaceSelector:
    matchNames:
    - openshift-kube-scheduler
  selector:
    matchLabels:
      prometheus: kube-scheduler

How reproducible:
Often reproducible in IBMCloud clusters.

Steps to Reproduce:
1. See the description.

Actual results:
"context deadline exceeded" for the kube-controller-manager and kube-scheduler endpoints.

Expected results:
No such issue.

Additional info:
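A minimal non-browser spot-check of the same thing via the Prometheus targets API (a sketch that assumes the port-forward above is still running and that jq is available locally):

$ curl -s http://localhost:9090/api/v1/targets \
    | jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, scrapeUrl: .scrapeUrl, lastError: .lastError}'

The unhealthy kube-controller-manager and kube-scheduler targets show up here with lastError set to "context deadline exceeded".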
Reassigning to the kube-scheduler component since the service monitor and alert are managed by https://github.com/openshift/cluster-kube-scheduler-operator/.
I've duplicated this on an IPI 4.11 cluster on IBM Cloud. I am investigating whether network traffic is being limited by the IBM Cloud SecurityGroups and SecurityGroupRules that are set up as part of IPI on IBM Cloud. I will attempt to determine whether this limitation is the cause of the traffic issue and whether we need to adjust the installer code for the SecurityGroups/SecurityGroupRules that get created in IBM Cloud.
It looks like the metrics traffic on ports 10257-10259 is limited to traffic originating from master nodes only, per the *-sg-cp-internal SecurityGroup: https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L232-L237 https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L250-L259 Allowing traffic from any node within the cluster (master and worker) lets the Prometheus pods reach the KCM and KS metrics endpoints on the master nodes. I will update the SecurityGroupRule to allow traffic from the cluster-wide SecurityGroup: https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L250-L259 https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L12-L17 A quick way to confirm the restriction from inside the cluster is sketched below.
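A minimal sketch of how the restriction can be confirmed (the worker node name is a placeholder, the 10.242.0.7 master IP is taken from the report above, and curl is assumed to be available on the RHCOS host):

# From a worker node, a connection to the KCM metrics port on a master should time out
# while the rule only allows traffic originating from masters (curl prints 000 on timeout):
$ oc debug node/<worker-node> -- chroot /host \
    curl -sk -o /dev/null -w '%{http_code}\n' --max-time 5 https://10.242.0.7:10257/healthz

# The same request from another master should print an HTTP status code, since
# master-to-master traffic on 10257-10259 is permitted:
$ oc debug node/juzhao-0722-f6stv-master-1 -- chroot /host \
    curl -sk -o /dev/null -w '%{http_code}\n' --max-time 5 https://10.242.0.7:10257/healthz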
*** Bug 2109827 has been marked as a duplicate of this bug. ***
Thank you, Christopher, for figuring out the root cause. Moving the report under the installer component so the root cause can be properly addressed. Thank you everyone for all the help.
A PR to resolve this issue has been opened for 4.12/master https://github.com/openshift/installer/pull/6208
Checked with 4.12.0-0.nightly-2022-10-05-053337: the kube-controller-manager and kube-scheduler targets display OK on the Prometheus Targets page.
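For reference, one way to spot-check the same result from the command line, with the port-forward from the description re-established (the job label names are assumed from the service names above, e.g. the kube-scheduler target is exported under job="scheduler"):

$ curl -s http://localhost:9090/api/v1/query \
    --data-urlencode 'query=up{job=~"kube-controller-manager|scheduler"}' | jq '.data.result'

All returned series should have the value 1.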
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399