Description of problem:

The MAPI operator deployment CI tests on master have started to fail at a high rate since yesterday [1]. In the failing runs, the kube-controller-manager is unable to reach the API and gets an "Unauthorized" error [2]:

cronjob_controller.go:124] Failed to extract job list: Unauthorized

Version-Release number of selected component (if applicable):

master (4.7)

Additional info:

[1] https://search.ci.openshift.org/?search=.*Machine+API+operator+deployment+should+reconcile+controllers+deployment.*&maxAge=48h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/733/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/1327248565881278464/artifacts/e2e-aws-operator/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-140-195.ec2.internal_kube-controller-manager.log
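For anyone triaging the linked artifacts locally, a quick way to confirm the pattern is to download the kube-controller-manager log from [2] and look for the Unauthorized errors; a minimal sketch (kcm.log is just a placeholder name for the downloaded log file):

> $ grep -m1 "Failed to extract job list: Unauthorized" kcm.log
> $ grep -c "Unauthorized" kcm.log

The first match gives the timestamp at which the credentials went bad; the count shows how many requests failed afterwards.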
Reassigning to the Kube Controller Manager team, could you please take a look?

For context, our test is failing because KCM becomes non-functional. About 3-6 minutes into the test run, the KCM credentials suddenly go bad and all API requests fail with `Unauthorized`. I have no idea why or how the credentials would suddenly become invalid, and I can't see how our test suite could be causing this. Perhaps you can provide some insight into how the KCM credentials work and why they might be consistently failing as they are in this case.
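In case it helps narrow this down, here is a minimal sketch of how one could check whether a given bearer token is still accepted by the apiserver, using the TokenReview API (the token value below is a placeholder, not something taken from these runs):

> $ cat <<'EOF' | oc create -f - -o yaml
> apiVersion: authentication.k8s.io/v1
> kind: TokenReview
> spec:
>   token: "<bearer token copied from the failing client>"
> EOF

The returned status.authenticated / status.error fields show whether the apiserver still recognizes the token; an expired or deleted token would come back unauthenticated, which would line up with the blanket `Unauthorized` responses.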
This is causing other errors as well; I've just caught up on all of our team's PRs from over the weekend that have been looping on failures related to this issue. I'm seeing lots of issues on the standard AWS e2e suite too, e.g. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/750/pull-ci-openshift-machine-api-operator-master-e2e-aws/1328200472544153600 has the same underlying problem: KCM going Unauthorized.
I checked the logs of this PR specifically, https://github.com/openshift/machine-api-operator/pull/745, in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/745/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/1328148122945196032/artifacts/e2e-aws-operator/

I noticed that KCM got its first Unauthorized at 01:58 and never recovered:

E1116 01:58:43.195359 1 cronjob_controller.go:124] Failed to extract job list: Unauthorized

Looking through the kube-apiserver (kas) logs, I see similar authentication issues around that time, which also persisted until the end of the run:

E1116 01:58:43.194769 140 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]

I checked the auth operator; it had temporary issues with kas availability, but it recovered within the next minute:

E1116 01:57:38.076803 1 base_controller.go:250] "WellKnownReadyController" controller failed to sync "key", err: need at least 3 kube-apiservers, got 2

Stefan claims there were recent changes to API Priority and Fairness (P&F) that might affect this, and I did notice a lot of requests being throttled, so I'm sending this over to the API & auth team for investigation.
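For the P&F angle, one quick way to confirm whether requests were actually being rejected by priority & fairness (rather than just queued) is to check the upstream flow-control metrics; a sketch, assuming curl is available in the prometheus-k8s pod:

> $ # query the in-cluster Prometheus for P&F rejections (assumption: curl exists in the pod)
> $ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
>     curl -s http://localhost:9090/api/v1/query \
>     --data-urlencode 'query=sum(apiserver_flowcontrol_rejected_requests_total) by (reason)'

A zero or absent result here would mean P&F never rejected anything.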
We've also experienced some CI failures where our operator deployment never scales up, and we noticed the same 'Unauthorized' failures from KCM: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_compliance-operator/469/pull-ci-openshift-compliance-operator-master-e2e-aws/1328377291310370816
We can rule P&F out of the equation:
- I checked the audit logs: no trace of 429s.
- I loaded the Prometheus metrics: no requests rejected by P&F.

Analyzing the "Unable to authenticate" error:

> $ grep -rni "Unable to authenticate the request due to an error" logs/kas-logs/* | wc -l
> 1831

I also see the following in the audit logs - a KCM watch request is panicking:

> {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"30b73054-d008-4356-8e7a-b71f5a78447b","stage":"Panic","requestURI":"/apis/security.openshift.io/v1/rangeallocations?allowWatchBookmarks=true\u0026resourceVersion=20961\u0026timeout=5m44s\u0026timeoutSeconds=344\u0026watch=true","verb":"watch","user":{"username":"system:kube-controller-manager","groups":["system:authenticated"]},"sourceIPs":["10.0.236.239"],"userAgent":"kube-controller-manager/v1.19.2 (linux/amd64) kubernetes/4abb4a7/kube-controller-manager","objectRef":{"resource":"rangeallocations","apiGroup":"security.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","message":"APIServer panic'd: net/http: abort Handler","reason":"InternalError","code":500},"requestReceivedTimestamp":"2020-11-16T01:54:19.714769Z","stageTimestamp":"2020-11-16T01:58:36.485236Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:kube-controller-manager\" of ClusterRole \"system:kube-controller-manager\" to User \"system:kube-controller-manager\""}}

and there are quite a few of these:

> $ grep -rni "APIServer panic'd" audit-logs/logs/audit_logs/kube-apiserver/audit/* | grep 'system:kube-controller-manager' | wc -l
> 84

The request was received at:

> "requestReceivedTimestamp":"2020-11-16T01:54:19.714769Z"

but the kas logs do not have any timestamps that overlap "01:54:19", and I'm not sure how to explain that. Was there a rollout of the apiserver, or were the logs rotated?

> $ cat logs/kas-logs/openshift-kube-apiserver_kube-apiserver-ip-10-0-181-173.ec2.internal_kube-apiserver.log | head -n 10
> I1116 01:55:51.629792 1 loader.go:375] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig
> $ cat logs/kas-logs/openshift-kube-apiserver_kube-apiserver-ip-10-0-234-30.ec2.internal_kube-apiserver.log | head -n 10
> I1116 01:58:39.065024 1 loader.go:375] Config loaded from file: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig

If you check the events, you will see plenty of the following errors around the same time window:

> 20:43:10 (4) "openshift-apiserver" Unhealthy Readiness probe failed: Get "https://10.129.0.6:8443/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
> 20:54:28 (3) "openshift-kube-controller-manager" Unhealthy Startup probe failed: Get "https://10.0.190.206:10357/healthz": dial tcp 10.0.190.206:10357: connect: connection refused
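For anyone re-checking the audit data, a sketch of a jq filter that pulls out the panicking KCM requests from the per-node audit files (same path as in the grep above):

> $ jq -c 'select(.stage == "Panic" and .user.username == "system:kube-controller-manager")
>          | {requestURI, requestReceivedTimestamp, stageTimestamp, code: .responseStatus.code}' \
>       audit-logs/logs/audit_logs/kube-apiserver/audit/*

This keeps only the Panic-stage events from system:kube-controller-manager and prints the request URI plus the received/stage timestamps, which makes it easy to line the panics up against the kas log windows above.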
This only affects tests; it is not risking any production cluster. Reducing priority and severity.
I can't see any failures since the fix PR merged, so I'll move this to VERIFIED. https://search.ci.openshift.org/?search=.*Machine+API+operator+deployment+should+reconcile+controllers+deployment.*&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633