Bug 1854434
Summary: | Traffic spikes every 10 minutes to master API | ||||||
---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Dirk Porter <dporter> | ||||
Component: | kube-controller-manager | Assignee: | Mike Dame <mdame> | ||||
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 3.11.0 | CC: | antgarci, aos-bugs, asoni, bbennett, bshirren, dmellado, gmarkley, jolee, knarra, maszulik, mdame, mfojtik, nchavan, nelluri, openshift-bugs-escalate, rcarrier, rkant, rvanderp, scuppett, suchaudh, svaughn, tsmetana, vlaad | ||||
Target Milestone: | --- | ||||||
Target Release: | 4.6.0 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: |
Cause: Various inconsistencies in storage and comparison of endpoints led to the endpoints controller incorrectly assuming there were many more updates that needed to be made to the cluster than there actually were.
Consequence: Endpoints controller would send large amounts of API requests at every informer re-sync interval, causing degradation in large clusters with many endpoints.
Fix: Update storage and comparison functions for endpoints to be more consistent.
Result: Endpoints controller now only sends updates when necessary.
|
Story Points: | --- | ||||
Clone Of: | |||||||
: | 1870540 1882030 (view as bug list) | Environment: | |||||
Last Closed: | 2020-10-27 16:12:46 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1870540, 1882030 | ||||||
Attachments: |
|
Description
Dirk Porter
2020-07-07 13:40:58 UTC
*** Bug 1854435 has been marked as a duplicate of this bug. *** Previously mentioned data has been uploaded. Based on comment 6, the traffic is coming from the system:serviceaccount:kube-system:endpoint-controller which sounds like the kube controller manager... reassigning. Created attachment 1711463 [details]
Requested increase log levels
Upon thorough inspection of the code and debugging trials we found behavior similar to what was described still exists in the current code (4.6/1.19). It's obviously surprising that such an issue has gone undetected for so long, but we isolated the behavior to a key section of code that checks for changes in Services and Endpoints in the cluster. This code could result in a false-negative due to comparison of sorted vs unsorted lists. I have opened a 3.11 PR to fix the bug at https://github.com/openshift/origin/pull/25424 I have also opened an upstream PR for the latest Kubernetes at https://github.com/kubernetes/kubernetes/pull/94112. The community maintainers will also provide review for these changes, and I expect that the change will be backported through supported versions. The 3.11 PR has moved to https://github.com/openshift/origin/pull/25426 which includes more code changes that help with this Update about the current progress: The proposed patch has been posted for an upstream review and it turned out the problem lies actually in the way kubernetes is hashing the endpoint structures: https://github.com/kubernetes/kubernetes/pull/94112#discussion_r478397569 and the proposed fix only mitigated the real problem. We need to fix the root issue first. I'm in constant contact with all the people responsible for the endpoints controller, I provided them with possible solutions to the problem discovered by Jordan yesterday. As soon as we get final agreement on the approach I'll be backporting this to all necessary versions of OpenShift. Hello Tomas and Maciej, Is there any type of workaround or temporary fix we can provide to a customer? My customer is experiencing major critical production issues across their environments. Their case was opened on 6/26 and they have been dealing with constant application downtime across their entire network. Please let me know if there is anything we can do to help them work around this issue until a proper errata is released. Regards, Tony Garcia Red Hat OpenShift TAM Both the Upstream PR (https://github.com/kubernetes/kubernetes/pull/94112) and the 3.11 backport (https://github.com/openshift/origin/pull/25426) have been approved and are in the merge queue. These should merge later today if CI passes. I'm adding UpcomingSprint, because the work for this bug is done and waiting to merge Moving the bug to verified state as i see that updates are less after the bug fix has happened. Below are the steps followed to verify the bug. [ramakasturinarra@dhcp35-60 ~]$ oc version Client Version: 4.6.0-202009150116.p0-3db46dc Server Version: 4.6.0-0.nightly-2020-09-14-221526 Kubernetes Version: v1.19.0+35ab7c5 Before Fix: =================== 1) Install 4.5 cluster with nightly 2) set kubecontrollermanager logLevel to Trace so that it runs at -v=6 3) Now run the command below to see the frequency of updates happening i.e for every 10 mins. I0916 05:28:23.909056 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/node-exporter, ready: 6 not ready: 0 I0916 05:28:23.909662 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/alertmanager-main, ready: 6 not ready: 0 I0916 05:28:23.914992 1 endpoints_controller.go:510] Update endpoints for openshift-machine-api/cluster-autoscaler-operator, ready: 2 not ready: 0 I0916 05:28:23.913246 1 endpoints_controller.go:510] Update endpoints for openshift-marketplace/marketplace-operator-metrics, ready: 2 not ready: 0 I0916 05:28:23.919947 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/prometheus-operator, ready: 1 not ready: 0 I0916 05:28:23.923106 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/openshift-state-metrics, ready: 2 not ready: 0 I0916 05:28:23.924974 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/prometheus-operated, ready: 4 not ready: 0 I0916 05:28:23.926701 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/alertmanager-operated, ready: 9 not ready: 0 I0916 05:28:23.930065 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/cluster-monitoring-operator, ready: 1 not ready: 0 I0916 05:28:23.930581 1 endpoints_controller.go:510] Update endpoints for openshift-image-registry/image-registry-operator, ready: 1 not ready: 0 I0916 05:28:23.934368 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/kube-state-metrics, ready: 2 not ready: 0 I0916 05:28:23.936261 1 endpoints_controller.go:510] Update endpoints for openshift-dns/dns-default, ready: 18 not ready: 0 I0916 05:28:23.939285 1 endpoints_controller.go:510] Update endpoints for openshift-sdn/sdn, ready: 6 not ready: 0 I0916 05:28:23.943335 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/telemeter-client, ready: 1 not ready: 0 I0916 05:28:23.945914 1 endpoints_controller.go:510] Update endpoints for openshift-cluster-machine-approver/machine-approver, ready: 1 not ready: 0 I0916 05:28:23.947878 1 endpoints_controller.go:510] Update endpoints for openshift-multus/multus-admission-controller, ready: 6 not ready: 0 I0916 05:28:23.951329 1 endpoints_controller.go:510] Update endpoints for openshift-ingress/router-internal-default, ready: 6 not ready: 0 I0916 05:28:23.951379 1 endpoints_controller.go:510] Update endpoints for openshift-cluster-samples-operator/metrics, ready: 1 not ready: 0 I0916 05:38:23.915181 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/kube-state-metrics, ready: 2 not ready: 0 I0916 05:38:23.916672 1 endpoints_controller.go:510] Update endpoints for openshift-dns/dns-default, ready: 18 not ready: 0 I0916 05:38:23.917024 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/telemeter-client, ready: 1 not ready: 0 I0916 05:38:23.918275 1 endpoints_controller.go:510] Update endpoints for openshift-sdn/sdn, ready: 6 not ready: 0 I0916 05:38:23.928198 1 endpoints_controller.go:510] Update endpoints for openshift-cluster-machine-approver/machine-approver, ready: 1 not ready: 0 I0916 05:38:23.932283 1 endpoints_controller.go:510] Update endpoints for openshift-multus/multus-admission-controller, ready: 6 not ready: 0 I0916 05:38:23.932546 1 endpoints_controller.go:510] Update endpoints for openshift-ingress/router-internal-default, ready: 6 not ready: 0 I0916 05:38:23.932961 1 endpoints_controller.go:510] Update endpoints for openshift-cluster-samples-operator/metrics, ready: 1 not ready: 0 I0916 05:38:23.934526 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/node-exporter, ready: 6 not ready: 0 I0916 05:38:23.938472 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/alertmanager-main, ready: 6 not ready: 0 I0916 05:38:23.943770 1 endpoints_controller.go:510] Update endpoints for openshift-marketplace/marketplace-operator-metrics, ready: 2 not ready: 0 I0916 05:38:23.946585 1 endpoints_controller.go:510] Update endpoints for openshift-machine-api/cluster-autoscaler-operator, ready: 2 not ready: 0 I0916 05:38:23.947588 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/openshift-state-metrics, ready: 2 not ready: 0 I0916 05:38:23.948295 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/prometheus-operated, ready: 4 not ready: 0 I0916 05:38:23.950412 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/alertmanager-operated, ready: 9 not ready: 0 I0916 05:38:23.953437 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/cluster-monitoring-operator, ready: 1 not ready: 0 I0916 05:38:23.956470 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/prometheus-operator, ready: 1 not ready: 0 I0916 05:38:23.960298 1 endpoints_controller.go:510] Update endpoints for openshift-image-registry/image-registry-operator, ready: 1 not ready: 0 I0916 05:48:23.922660 1 endpoints_controller.go:510] Update endpoints for openshift-sdn/sdn, ready: 6 not ready: 0 I0916 05:48:23.923202 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/telemeter-client, ready: 1 not ready: 0 I0916 05:48:23.924504 1 endpoints_controller.go:510] Update endpoints for openshift-cluster-machine-approver/machine-approver, ready: 1 not ready: 0 I0916 05:48:23.935735 1 endpoints_controller.go:510] Update endpoints for openshift-multus/multus-admission-controller, ready: 6 not ready: 0 I0916 05:48:23.935916 1 endpoints_controller.go:510] Update endpoints for openshift-ingress/router-internal-default, ready: 6 not ready: 0 I0916 05:48:23.935922 1 endpoints_controller.go:510] Update endpoints for openshift-cluster-samples-operator/metrics, ready: 1 not ready: 0 I0916 05:48:23.940551 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/node-exporter, ready: 6 not ready: 0 I0916 05:48:23.940787 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/alertmanager-main, ready: 6 not ready: 0 I0916 05:48:23.945877 1 endpoints_controller.go:510] Update endpoints for openshift-marketplace/marketplace-operator-metrics, ready: 2 not ready: 0 I0916 05:48:23.946141 1 endpoints_controller.go:510] Update endpoints for openshift-machine-api/cluster-autoscaler-operator, ready: 2 not ready: 0 I0916 05:48:23.946793 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/prometheus-operator, ready: 1 not ready: 0 I0916 05:48:23.951126 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/prometheus-operated, ready: 4 not ready: 0 I0916 05:48:23.951178 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/openshift-state-metrics, ready: 2 not ready: 0 I0916 05:48:23.954907 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/alertmanager-operated, ready: 9 not ready: 0 I0916 05:48:23.955575 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/cluster-monitoring-operator, ready: 1 not ready: 0 I0916 05:48:23.959774 1 endpoints_controller.go:510] Update endpoints for openshift-image-registry/image-registry-operator, ready: 1 not ready: 0 I0916 05:48:23.962264 1 endpoints_controller.go:510] Update endpoints for openshift-monitoring/kube-state-metrics, ready: 2 not ready: 0 I0916 05:48:23.962831 1 endpoints_controller.go:510] Update endpoints for openshift-dns/dns-default, ready: 18 not ready: 0 [ramakasturinarra@dhcp35-60 ~]$ oc get services --all-namespaces | wc -l 67 [ramakasturinarra@dhcp35-60 ~]$ oc logs pod/kube-controller-manager-ip-10-0-204-252.ap-northeast-1.compute.internal -n openshift-kube-controller-manager -c kube-controller-manager | grep endpoints_controller | grep -c "About to update endpoints" 244 [ramakasturinarra@dhcp35-60 ~]$ oc logs pod/kube-controller-manager-ip-10-0-204-252.ap-northeast-1.compute.internal -n openshift-kube-controller-manager -c kube-controller-manager | grep endpoints_controller | grep -c "Update endpoints for" 72 After the Fix: =================== 1) Install latest 4.6 cluster which has the fix 2) set kube-controller-manager loglevel to Trace so that it runs at -v=6 3) you can see that there are no updates in about 10 minutes. [ramakasturinarra@dhcp35-60 ~]$ oc get services --all-namespaces | wc -l 76 [ramakasturinarra@dhcp35-60 ~]$ oc logs pod/kube-controller-manager-ip-10-0-211-68.us-east-2.compute.internal -n openshift-kube-controller-manager -c kube-controller-manager | grep endpoints_controller | grep -c "About to update endpoints" 213 [ramakasturinarra@dhcp35-60 ~]$ oc logs pod/kube-controller-manager-ip-10-0-211-68.us-east-2.compute.internal -n openshift-kube-controller-manager -c kube-controller-manager | grep endpoints_controller | grep -c "Update endpoints for" 0 spec: logLevel: Trace Based on the above moving bug to verified state. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196 |