Description of problem:

The MAPI operator deployment CI tests on master have started to fail at a high rate since yesterday [1]. In the failing runs, the kube-controller-manager is unable to reach the API and gets an "Unauthorized" error [2]:

cronjob_controller.go:124] Failed to extract job list: Unauthorized

Version-Release number of selected component (if applicable):

master (4.7)

Additional info:

[1] https://search.ci.openshift.org/?search=.*Machine+API+operator+deployment+should+reconcile+controllers+deployment.*&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/733/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/1327248565881278464/artifacts/e2e-aws-operator/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-140-195.ec2.internal_kube-controller-manager.log
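For anyone triaging the linked artifacts locally, a quick way to confirm the pattern is to download the kube-controller-manager log from [2] and look for the Unauthorized errors; a minimal sketch (kcm.log is just a placeholder name for the downloaded log file):

> $ grep -m1 "Failed to extract job list: Unauthorized" kcm.log
> $ grep -c "Unauthorized" kcm.log

The first match gives the timestamp at which the credentials went bad; the count shows how many requests failed afterwards.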
Reassigning to the Kube Controller Manager team, could you please take a look?

For context, our test is failing because KCM becomes non-functional. About 3-6 minutes into the test run, the KCM credentials suddenly go bad and all API requests fail with `Unauthorized`. I have no idea why or how the credentials would suddenly become invalid, and I can't see how our test suite could be causing this. Perhaps you can provide some insight into how the KCM credentials work and why they might be consistently failing as they are in this case.
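In case it helps narrow this down, here is a minimal sketch of how one could check whether a given bearer token is still accepted by the apiserver, using the TokenReview API (the token value below is a placeholder, not something taken from these runs):

> $ cat <<'EOF' | oc create -f - -o yaml
> apiVersion: authentication.k8s.io/v1
> kind: TokenReview
> spec:
>   token: "<bearer token copied from the failing client>"
> EOF

The returned status.authenticated / status.error fields show whether the apiserver still recognizes the token; an expired or deleted token would come back unauthenticated, which would line up with the blanket `Unauthorized` responses.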
This is causing other errors as well; I've just caught up on all of our team's PRs from over the weekend that have been looping on failures related to this issue. I'm seeing lots of issues on the standard AWS e2e suite too, e.g. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/750/pull-ci-openshift-machine-api-operator-master-e2e-aws/1328200472544153600 has the same underlying problem: KCM going Unauthorized.
I checked the logs of this PR specifically, https://github.com/openshift/machine-api-operator/pull/745, in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/745/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/1328148122945196032/artifacts/e2e-aws-operator/

I noticed that KCM got its first Unauthorized at 01:58 and never recovered:

E1116 01:58:43.195359 1 cronjob_controller.go:124] Failed to extract job list: Unauthorized

Looking through the kube-apiserver (kas) logs, I see similar authentication issues around that time, which also persisted until the end of the run:

E1116 01:58:43.194769 140 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]

I checked the auth operator; it had temporary issues with kas availability, but it recovered within the next minute:

E1116 01:57:38.076803 1 base_controller.go:250] "WellKnownReadyController" controller failed to sync "key", err: need at least 3 kube-apiservers, got 2

Stefan claims there were recent changes to API Priority and Fairness (P&F) that might affect this, and I did notice a lot of requests being throttled, so I'm sending this over to the API & auth team for investigation.
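For the P&F angle, one quick way to confirm whether requests were actually being rejected by priority & fairness (rather than just queued) is to check the upstream flow-control metrics; a sketch, assuming curl is available in the prometheus-k8s pod:

> $ # query the in-cluster Prometheus for P&F rejections (assumption: curl exists in the pod)
> $ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
>     curl -s http://localhost:9090/api/v1/query \
>     --data-urlencode 'query=sum(apiserver_flowcontrol_rejected_requests_total) by (reason)'

A zero or absent result here would mean P&F never rejected anything.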
We've also experienced some CI failures where our operator deployment never scales up, and we noticed the same 'Unauthorized' failures from KCM: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_compliance-operator/469/pull-ci-openshift-compliance-operator-master-e2e-aws/1328377291310370816
We can rule P&F out of the equation:
- I checked the audit logs: no trace of 429s.
- I loaded the Prometheus metrics: no requests rejected by P&F.

Analyzing the "Unable to authenticate" error:

> $ grep -rni "Unable to authenticate the request due to an error" logs/kas-logs/* | wc -l
> 1831

I also see the following in the audit logs - a KCM watch request is panicking:

> {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"30b73054-d008-4356-8e7a-b71f5a78447b","stage":"Panic","requestURI":"/apis/security.openshift.io/v1/rangeallocations?allowWatchBookmarks=true\u0026resourceVersion=20961\u0026timeout=5m44s\u0026timeoutSeconds=344\u0026watch=true","verb":"watch","user":{"username":"system:kube-controller-manager","groups":["system:authenticated"]},"sourceIPs":["10.0.236.239"],"userAgent":"kube-controller-manager/v1.19.2 (linux/amd64) kubernetes/4abb4a7/kube-controller-manager","objectRef":{"resource":"rangeallocations","apiGroup":"security.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","message":"APIServer panic'd: net/http: abort Handler","reason":"InternalError","code":500},"requestReceivedTimestamp":"2020-11-16T01:54:19.714769Z","stageTimestamp":"2020-11-16T01:58:36.485236Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:kube-controller-manager\" of ClusterRole \"system:kube-controller-manager\" to User \"system:kube-controller-manager\""}}

and there are quite a few of these:

> $ grep -rni "APIServer panic'd" audit-logs/logs/audit_logs/kube-apiserver/audit/* | grep 'system:kube-controller-manager' | wc -l
> 84

The request was received at:

> "requestReceivedTimestamp":"2020-11-16T01:54:19.714769Z"

but the kas logs do not have any timestamps that overlap "01:54:19", and I'm not sure how to explain that. Was there a rollout of the apiserver, or were the logs rotated?

> $ cat logs/kas-logs/openshift-kube-apiserver_kube-apiserver-ip-10-0-181-173.ec2.internal_kube-apiserver.log | head -n 10
> I1116 01:55:51.629792 1 loader.go:375] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig
> $ cat logs/kas-logs/openshift-kube-apiserver_kube-apiserver-ip-10-0-234-30.ec2.internal_kube-apiserver.log | head -n 10
> I1116 01:58:39.065024 1 loader.go:375] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig

If you check the events, you will see plenty of the following errors around the same time window:

> 20:43:10 (4) "openshift-apiserver" Unhealthy Readiness probe failed: Get "https://10.129.0.6:8443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
> 20:54:28 (3) "openshift-kube-controller-manager" Unhealthy Startup probe failed: Get "https://10.0.190.206:10357/healthz": dial tcp 10.0.190.206:10357: connect: connection refused
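For anyone re-checking the audit data, a sketch of a jq filter that pulls out the panicking KCM requests from the per-node audit files (same path as in the grep above):

> $ jq -c 'select(.stage == "Panic" and .user.username == "system:kube-controller-manager")
>          | {requestURI, requestReceivedTimestamp, stageTimestamp, code: .responseStatus.code}' \
>       audit-logs/logs/audit_logs/kube-apiserver/audit/*

This keeps only the Panic-stage events from system:kube-controller-manager and prints the request URI plus the received/stage timestamps, which makes it easy to line the panics up against the kas log windows above.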
This only affects tests; it is not risking any production cluster. Reducing priority and severity.
I can't see any failures since the fix PR merged, so I'll move this to VERIFIED. https://search.ci.openshift.org/?search=.*Machine+API+operator+deployment+should+reconcile+controllers+deployment.*&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633