Bug 2109800 - [IBMCloud] context deadline exceeded for kube-scheduler targets
Summary: [IBMCloud] context deadline exceeded for kube-scheduler targets
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Christopher J Schaefer
QA Contact: MayXu
Docs Contact: Mike Pytlak
URL:
Whiteboard:
Duplicates: 2109827 (view as bug list)
Depends On:
Blocks: 2109827
Reported: 2022-07-22 07:20 UTC by Junqi Zhao
Modified: 2023-01-17 19:53 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, `kube-controller-manager` and `kube-scheduler` metrics were not available for an IBM Cloud VPC cluster due to network traffic restrictions. This resulted in alerts that these services were down. With this update, all metrics are reported as expected. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2109800[*BZ#2109800*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:53:12 UTC
Target Upstream Version:
Embargoed:


Attachments
context deadline exceeded for kube-controller-manager and kube-scheduler targets (1.64 MB, image/png)
2022-07-22 07:20 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 6208 0 None open Bug 2109800: IBMCloud: Allow metrics traffic 2022-08-11 14:33:00 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:53:25 UTC

Description Junqi Zhao 2022-07-22 07:20:50 UTC
Created attachment 1898664 [details]
context deadline exceeded for kube-controller-manager and kube-scheduler targets

Description of problem:
Found in an IPI IBMCloud cluster: context deadline exceeded for the kube-controller-manager and kube-scheduler targets. The same issue appears in other 4.11 IBMCloud clusters; it does not occur on other IaaS platforms.
# oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090
Browse to http://localhost:9090/graph and the error is visible on the Prometheus Targets page.
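The failing scrapes can also be listed from the Prometheus API while the port-forward above is running. This is a hypothetical check (it assumes `jq` is installed and the port-forward is active); the jq filter simply prints every target whose health is not "up" together with its last scrape error:

```shell
# List unhealthy Prometheus targets and their last scrape error.
# Assumes the port-forward from the previous command is still running.
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[]
           | select(.health != "up")
           | "\(.labels.job): \(.lastError)"'
```

On an affected cluster this should print lines mentioning "context deadline exceeded" for the kube-controller-manager and kube-scheduler jobs.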

$ oc -n openshift-kube-controller-manager get ep
NAME                      ENDPOINTS                                               AGE
kube-controller-manager   10.242.0.7:10257,10.242.129.4:10257,10.242.64.6:10257   164m
$ oc -n openshift-kube-scheduler get ep
NAME        ENDPOINTS                                               AGE
scheduler   10.242.0.7:10259,10.242.129.4:10259,10.242.64.6:10259   164m
$ oc get node -o wide | grep -E "10.242.0.7|10.242.129.4|10.242.64.6"
juzhao-0722-f6stv-master-0         Ready    master   167m   v1.24.0+9546431   10.242.0.7     10.242.0.7     Red Hat Enterprise Linux CoreOS 411.86.202207150124-0 (Ootpa)   4.18.0-372.16.1.el8_6.x86_64   cri-o://1.24.1-11.rhaos4.11.gitb0d2ef3.el8
juzhao-0722-f6stv-master-1         Ready    master   167m   v1.24.0+9546431   10.242.64.6    10.242.64.6    Red Hat Enterprise Linux CoreOS 411.86.202207150124-0 (Ootpa)   4.18.0-372.16.1.el8_6.x86_64   cri-o://1.24.1-11.rhaos4.11.gitb0d2ef3.el8
juzhao-0722-f6stv-master-2         Ready    master   167m   v1.24.0+9546431   10.242.129.4   10.242.129.4   Red Hat Enterprise Linux CoreOS 411.86.202207150124-0 (Ootpa)   4.18.0-372.16.1.el8_6.x86_64   cri-o://1.24.1-11.rhaos4.11.gitb0d2ef3.el8

This causes KubeControllerManagerDown, KubeSchedulerDown, and TargetDown alerts for kube-controller-manager and kube-scheduler.
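The firing alerts can be confirmed from the same port-forwarded Prometheus API. Another hypothetical check (again assumes `jq` and the active port-forward):

```shell
# Print the names of currently firing alerts; on an affected cluster this
# should include KubeControllerManagerDown, KubeSchedulerDown and TargetDown.
curl -s http://localhost:9090/api/v1/alerts \
  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
  | sort -u
```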

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-07-19-104004
$ oc get clusterversion version -o jsonpath="{.spec.clusterID}"
53d9c358-e978-4794-9f47-9fcdb351003d
$ oc get infrastructures/cluster -o jsonpath="{.spec.platformSpec.type}"
IBMCloud

$ oc -n openshift-kube-controller-manager get servicemonitor
NAME                      AGE
kube-controller-manager   3h15m
$ oc -n openshift-kube-controller-manager get servicemonitor kube-controller-manager -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-07-22T04:01:29Z"
  generation: 1
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: openshift-kube-controller-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: fb10af8c-54b0-4736-bef7-a382a180260b
  resourceVersion: "1659"
  uid: 7b7ef618-4a3a-4f1f-b67e-7cb4dcd3ab7c
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: etcd_(debugging|disk|request|server).*
      sourceLabels:
      - __name__
    - action: drop
      regex: rest_client_request_latency_seconds_(bucket|count|sum)
      sourceLabels:
      - __name__
    - action: drop
      regex: root_ca_cert_publisher_sync_duration_seconds_(bucket|count|sum)
      sourceLabels:
      - __name__
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
      serverName: kube-controller-manager.openshift-kube-controller-manager.svc
  namespaceSelector:
    matchNames:
    - openshift-kube-controller-manager
  selector:
    matchLabels:
      prometheus: kube-controller-manager
$ oc -n openshift-kube-scheduler get servicemonitor
NAME             AGE
kube-scheduler   3h15m
$ oc -n openshift-kube-scheduler get servicemonitor kube-scheduler -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-07-22T04:01:31Z"
  generation: 1
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: openshift-kube-scheduler
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: fb10af8c-54b0-4736-bef7-a382a180260b
  resourceVersion: "1716"
  uid: f64e6b22-8ff2-45d8-8908-d123a45028bc
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: scheduler.openshift-kube-scheduler.svc
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 30s
    path: /metrics/resources
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
      serverName: scheduler.openshift-kube-scheduler.svc
  namespaceSelector:
    matchNames:
    - openshift-kube-scheduler
  selector:
    matchLabels:
      prometheus: kube-scheduler

How reproducible:
Often reproducible in IBMCloud clusters.

Steps to Reproduce:
1. See the description.

Actual results:
context deadline exceeded for the kube-controller-manager and kube-scheduler targets

Expected results:
No such errors; both targets are scraped successfully.

Additional info:

Comment 3 Simon Pasquier 2022-07-22 09:07:19 UTC
Reassigning to the kube-scheduler component since the service monitor and alert are managed by https://github.com/openshift/cluster-kube-scheduler-operator/.

Comment 10 Christopher J Schaefer 2022-07-26 21:06:34 UTC
I've reproduced this on IPI 4.11 for IBM Cloud. I am investigating whether network traffic is being limited by the IBM Cloud SecurityGroups and SecurityGroupRules that are set up as part of IPI on IBM Cloud.

I will attempt to determine whether this restriction is the cause of the traffic issue and whether we need to adjust the installer code for the SecurityGroup/SecurityGroupRules that get created in IBM Cloud.

Comment 12 Christopher J Schaefer 2022-07-27 14:54:40 UTC
It looks like the metrics traffic, on ports 10257-10259, is limited to traffic originating from master nodes only, per the *-sg-cp-internal SecurityGroup:
https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L232-L237
https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L250-L259


Allowing traffic from any node within the cluster (master and worker) lets the Prometheus pods reach the KCM and KS metrics endpoints on the master nodes.

I will make an update to the SecurityGroupRule to allow traffic from the cluster-wide SecurityGroup
https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L250-L259
https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L12-L17
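As a rough sketch of the kind of rule change described above (the resource and reference names below are illustrative assumptions, not the actual contents of the fix; the real change lives in the installer's security-groups.tf): the inbound rule for ports 10257-10259 on the control-plane SecurityGroup would use the cluster-wide SecurityGroup as its remote, rather than restricting the source to masters.

```terraform
# Hypothetical sketch; names are illustrative, not taken from the actual PR.
resource "ibm_is_security_group_rule" "cp_internal_inbound_metrics" {
  group     = ibm_is_security_group.cp_internal.id
  direction = "inbound"
  # Was restricted to the control-plane group; using the cluster-wide
  # group lets worker-hosted Prometheus pods scrape the metrics endpoints.
  remote    = ibm_is_security_group.cluster_wide.id

  tcp {
    port_min = 10257 # kube-controller-manager / kube-scheduler metrics
    port_max = 10259
  }
}
```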

Comment 13 Jan Chaloupka 2022-07-28 09:13:35 UTC
*** Bug 2109827 has been marked as a duplicate of this bug. ***

Comment 14 Jan Chaloupka 2022-07-28 09:18:42 UTC
Thank you, Christopher, for figuring out the root cause.

Moving the report under the Installer component so the root cause can be properly addressed. Thank you everyone for all the help.

Comment 15 Christopher J Schaefer 2022-08-09 13:26:39 UTC
A PR to resolve this issue has been opened for 4.12/master
https://github.com/openshift/installer/pull/6208

Comment 21 MayXu 2022-10-09 08:14:04 UTC
Checked with 4.12.0-0.nightly-2022-10-05-053337: kube-controller-manager and kube-scheduler display OK on the Prometheus Targets page.

Comment 25 errata-xmlrpc 2023-01-17 19:53:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

