Bug 2109800
| Field | Value |
|---|---|
| Summary | [IBMCloud] context deadline exceeded for kube-scheduler targets |
| Product | OpenShift Container Platform |
| Component | Installer |
| Installer sub component | openshift-installer |
| Reporter | Junqi Zhao <juzhao> |
| Assignee | Christopher J Schaefer <cschaefe> |
| QA Contact | MayXu <maxu> |
| Docs Contact | Mike Pytlak <mpytlak> |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| CC | anpicker, cschaefe, mfojtik, mifiedle, mpytlak, rdossant |
| Version | 4.11 |
| Target Release | 4.12.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Doc Text | Previously, `kube-controller-manager` and `kube-scheduler` metrics were not available for an IBM Cloud VPC cluster due to network traffic restrictions. This resulted in alerts that these services were down. With this update, all metrics are reported as expected. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2109800[*BZ#2109800*]) |
| Duplicates | 2109827 |
| Bug Blocks | 2109827 |
| Type | Bug |
| Last Closed | 2023-01-17 19:53:12 UTC |

Description
Junqi Zhao
2022-07-22 07:20:50 UTC
Reassigning to the kube-scheduler component since the service monitor and alert are managed by https://github.com/openshift/cluster-kube-scheduler-operator/.

I've reproduced this on IPI 4.11 for IBM Cloud. I am investigating whether network traffic is being limited by the IBM Cloud SecurityGroups and SecurityGroupRules that are set up as part of IPI on IBM Cloud. I will attempt to determine whether this limitation is the cause of the traffic issue and whether we need to adjust the installer code that creates these SecurityGroups and SecurityGroupRules in IBM Cloud.

It looks like the metrics traffic, on ports 10257-10259, is limited to traffic originating from master nodes only, per the `*-sg-cp-internal` SecurityGroup:
https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L232-L237
https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L250-L259

Allowing traffic from any node within the cluster (master and worker) lets the prometheus pods reach the KCM and KS metrics endpoints on the master nodes. I will update the SecurityGroupRule to allow traffic from the cluster-wide SecurityGroup:
https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L250-L259
https://github.com/openshift/installer/blob/2aa8651d244dc3f6342b39f818e2f4c87ac0a623/data/data/ibmcloud/network/vpc/security-groups.tf#L12-L17

*** Bug 2109827 has been marked as a duplicate of this bug. ***

Thank you, Christopher, for figuring out the root cause. Moving the report under the installer component so the root cause can be properly addressed.

Thank you, everyone, for all the help.

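
For readers following the root-cause analysis above, the security-group reasoning can be sketched as a small model. This is an illustration only, not the installer's Terraform. The group names `sg-cp-internal` and `sg-cluster-wide`, the `InboundRule` type, and the `is_allowed` helper are hypothetical, and the memberships (masters in both groups, workers only in the cluster-wide one) are an assumption drawn from the comments. The ports are the 10257-10259 range named in the comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    """Allow inbound TCP on port_min..port_max from members of remote_group."""
    remote_group: str
    port_min: int
    port_max: int


def is_allowed(rules, source_groups, port):
    """Return True if any rule admits traffic to `port` from a source
    that belongs to at least one of `source_groups`."""
    return any(
        rule.remote_group in source_groups and rule.port_min <= port <= rule.port_max
        for rule in rules
    )


# Hypothetical names standing in for the installer-created security groups.
CP_INTERNAL = "sg-cp-internal"
CLUSTER_WIDE = "sg-cluster-wide"

# Assumed memberships: masters are in both groups, workers only in the
# cluster-wide group (this is where the prometheus pods are assumed to run).
MASTER_GROUPS = {CP_INTERNAL, CLUSTER_WIDE}
WORKER_GROUPS = {CLUSTER_WIDE}

# Metrics-port rule on the control plane, before and after the proposed change.
RULES_BEFORE = [InboundRule(CP_INTERNAL, 10257, 10259)]
RULES_AFTER = [InboundRule(CLUSTER_WIDE, 10257, 10259)]

KCM_PORT = 10257  # kube-controller-manager secure port
KS_PORT = 10259   # kube-scheduler secure port

if __name__ == "__main__":
    for label, rules in (("before", RULES_BEFORE), ("after", RULES_AFTER)):
        for port in (KCM_PORT, KS_PORT):
            print(
                label, port,
                "worker:", is_allowed(rules, WORKER_GROUPS, port),
                "master:", is_allowed(rules, MASTER_GROUPS, port),
            )
```

Under these assumptions, a scrape from a worker node is rejected by the original rule and admitted once the rule's remote is the cluster-wide group, while master-to-master traffic is allowed in both cases.
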
A PR to resolve this issue has been opened for 4.12/master:
https://github.com/openshift/installer/pull/6208

Checked with 4.12.0-0.nightly-2022-10-05-053337: kube-controller-manager and kube-scheduler display OK in Prometheus Targets.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:7399