Bug 1897604
| Summary: | Machine API deployment fails: Kube-Controller-Manager can't reach API: "Unauthorized" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Christian Glombek <cglombek> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.7 | CC: | akashem, aos-bugs, jspeed, mfojtik, mnewby, mrogers, sdodson, skuznets, wking, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:33:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Christian Glombek
2020-11-13 15:51:05 UTC
Reassigning to Kube Controller Manager team, could you please take a look?

For context, our test is failing because KCM becomes non-functional. About 3-6 minutes into the test run, suddenly the KCM credentials are bad and all API requests fail with `Unauthorized`. I have no idea why or how the credentials would suddenly become invalid, and I can't really see why our test suite would be causing this. Perhaps you can provide some insight into how the KCM credentials work and why they might be consistently failing as they are in this case.

This is causing other errors as well; I've just caught up on all of our team's PRs over the weekend that have been looping on failures related to this issue. Seeing lots of issues on the standard AWS e2e suite as well. E.g. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/750/pull-ci-openshift-machine-api-operator-master-e2e-aws/1328200472544153600 has the same underlying problem, KCM going unauthorized.

I checked the logs of this PR specifically, https://github.com/openshift/machine-api-operator/pull/745, in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/745/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/1328148122945196032/artifacts/e2e-aws-operator/ and noticed that KCM got its first Unauthorized at 01:58:

> E1116 01:58:43.195359 1 cronjob_controller.go:124] Failed to extract job list: Unauthorized

and never recovered. Looking through kas I see similar issues with auth around that time:

> E1116 01:58:43.194769 140 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]

which never recovered until the end. I checked auth-operator; it had temporary issues with kas availability:

> E1116 01:57:38.076803 1 base_controller.go:250] "WellKnownReadyController" controller failed to sync "key", err: need at least 3 kube-apiservers, got 2

but it recovered within the next minute. Stefan claims there were recent changes to P&F that might affect this, and I did notice a lot of requests being throttled, so I'm sending this over to the api&auth team for investigation.

We've also experienced some CI failures where our operator deployment never scales up, and we noticed the 'Unauthorized' failures with KCM: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_compliance-operator/469/pull-ci-openshift-compliance-operator-master-e2e-aws/1328377291310370816

We can rule P&F out of the equation:
- I checked the audit logs, no trace of 429
- I loaded the prometheus metrics, no request rejected by P&F.
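For reference, a minimal sketch of how that P&F check can be reproduced; the audit-log path follows the artifact layout quoted later in this bug, and the grep/jq/metric commands are illustrative rather than the exact ones used here:

> # 429s recorded in the kube-apiserver audit logs (none were found in this run)
> $ grep -h '"code":429' audit-logs/logs/audit_logs/kube-apiserver/audit/* | jq -r .requestURI | sort | uniq -c
> # against a live cluster, P&F rejections are also exposed as an apiserver counter
> $ oc get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total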
Analyzing the "Unable to authenticate" error: > $ grep -rni "Unable to authenticate the request due to an error" logs/kas-logs/* | wc -l > 1831 I also see the following in the audit - kcm watch request is panicking > {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"30b73054-d008-4356-8e7a-b71f5a78447b","stage":"Panic","requestURI":"/apis/security.openshift.io/v1/rangeallocations?allowWatchBookmarks=true\u0026resourceVersion=20961\u0026timeout=5m44s\u0026timeoutSeconds=344\u0026watch=true","verb":"watch","user":{"username":"system:kube-controller-manager","groups":["system:authenticated"]},"sourceIPs":["10.0.236.239"],"userAgent":"kube-controller-manager/v1.19.2 (linux/amd64) kubernetes/4abb4a7/kube-controller-manager","objectRef":{"resource":"rangeallocations","apiGroup":"security.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","message":"APIServer panic'd: net/http: abort Handler","reason":"InternalError","code":500},"requestReceivedTimestamp":"2020-11-16T01:54:19.714769Z","stageTimestamp":"2020-11-16T01:58:36.485236Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:kube-controller-manager\" of ClusterRole \"system:kube-controller-manager\" to User \"system:kube-controller-manager\""}} and there are quite a few of these > $ grep -rni "APIServer panic'd" audit-logs/logs/audit_logs/kube-apiserver/audit/* | grep 'system:kube-controller-manager' | wc -l > 84 The request was received at > "requestReceivedTimestamp":"2020-11-16T01:54:19.714769Z" kas logs do not have a timestamp that overlap "01:54:19", not sure how to explain this. was there a roll out of the apiserver or the logs rotated? > $ cat logs/kas-logs/openshift-kube-apiserver_kube-apiserver-ip-10-0-181-173.ec2.internal_kube-apiserver.log | head -n 10 > I1116 01:55:51.629792 1 loader.go:375] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig > $ cat logs/kas-logs/openshift-kube-apiserver_kube-apiserver-ip-10-0-234-30.ec2.internal_kube-apiserver.log | head -n 10 > I1116 01:58:39.065024 1 loader.go:375] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig > $ cat logs/kas-logs/openshift-kube-apiserver_kube-apiserver-ip-10-0-234-30.ec2.internal_kube-apiserver.log | head -n 10 > I1116 01:58:39.065024 1 loader.go:375] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig If you check the events, you will see plenty of the following errors around the same time window: > 20:43:10 (4) "openshift-apiserver" Unhealthy Readiness probe failed: Get "https://10.129.0.6:8443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) > 20:54:28 (3) "openshift-kube-controller-manager" Unhealthy Startup probe failed: Get "https://10.0.190.206:10357/healthz": dial tcp 10.0.190.206:10357: connect: connection refused This is a test. This is not risking any production cluster. Reducing priority and severity. Can't see any failed after the fixed PR merged. so will move to verified status. 
https://search.ci.openshift.org/?search=.*Machine+API+operator+deployment+should+reconcile+controllers+deployment.*&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633