Bug 1668632

Summary:	[Nextgen] "Unable to authenticate the request due to an error ... x509: certificate signed by unknown authority"
Product:	OpenShift Container Platform	Reporter:	Xingxing Xia <xxia>
Component:	Monitoring	Assignee:	Sergiusz Urbaniak <surbania>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.1.0	CC:	akrzos, aos-bugs, emoss, fbranczy, jiazha, jokerman, juzhao, mmccomas, pruan, sponnaga, surbania
Target Milestone:	---
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-06-04 10:42:08 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Xingxing Xia 2019-01-23 08:38:42 UTC

Description of problem:
Environments often report "metrics.k8s.io/v1beta1: the server is currently unable to handle the request".
Debugging with `oc get apiservices -o=custom-columns="name:.metadata.name,namespace:.spec.service.namespace,status:.status.conditions[0].status"` shows that
v1beta1.metrics.k8s.io has status False (all other apiservices have status True).
The log of its backend pod contains:
"Unable to authenticate the request due to an error ... x509: certificate signed by unknown authority". The same log entry was found in the env for https://bugzilla.redhat.com/show_bug.cgi?id=1625194#c9 , in the env for https://bugzilla.redhat.com/show_bug.cgi?id=1667030 , and in my env today.

I noticed https://bugzilla.redhat.com/show_bug.cgi?id=1665842#c25 , so I tried `oc delete pod prometheus-adapter-... -n openshift-monitoring`, after which the problem went away.
Several days have passed since that fix, but the error still appears in the metrics pod, hence this bug.
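
To narrow the apiservice listing above down to the failing entries, the table can be piped through a small filter. This is a minimal sketch, not part of the original report: the function name `unavailable_apiservices` is hypothetical, and it assumes the three-column layout (name, namespace, status) produced by the custom-columns command quoted in the description, including its one-line header.

```shell
# Hypothetical helper (name is an assumption, not an oc feature).
# Reads the name/namespace/status table on stdin, skips the header row,
# and prints the name of every apiservice whose status column is not "True".
unavailable_apiservices() {
  awk 'NR > 1 && $3 != "True" { print $1 }'
}

# Usage against a live cluster (not executed here):
#   oc get apiservices -o=custom-columns="name:.metadata.name,namespace:.spec.service.namespace,status:.status.conditions[0].status" | unavailable_apiservices
```

Like the original command, this relies on `.status.conditions[0]` being the condition of interest; if an apiservice ever carries several conditions, the first one may not be the availability condition.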

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-01-18-115403   True        False         1d      Cluster version is 4.0.0-0.nightly-2019-01-18-115403

How reproducible:
Seems to occur often; the exact conditions needed to reproduce it are not clear.

Steps to Reproduce:
1. Create a nextgen env

2. Check `oc api-resources`,
or check `oc logs ds/apiserver -n openshift-apiserver`.
When this issue occurs, `oc api-resources` will show:
"unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request".

`oc logs ds/apiserver -n openshift-apiserver` will show:
"E0123 06:08:29.395219 1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request"

3. Then check backend pods of v1beta1.metrics.k8s.io
oc get apiservices v1beta1.metrics.k8s.io -o yaml
...
service:
  name: prometheus-adapter
  namespace: openshift-monitoring

oc logs deployment/prometheus-adapter -n openshift-monitoring
...
E0123 04:17:43.075225 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]
E0123 04:17:59.170792 1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]
...
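
To check quickly whether an adapter log is affected, the authentication failures shown in step 3 can be counted. This is a rough sketch under stated assumptions: the function name `count_x509_auth_errors` is hypothetical, and the pattern simply matches the error text quoted in the log lines above.

```shell
# Hypothetical helper (name is an assumption).
# Reads log text on stdin and prints how many lines contain the
# "Unable to authenticate ... x509: certificate signed by unknown authority" error.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the
# function usable in scripts that run with "set -e".
count_x509_auth_errors() {
  grep -c 'Unable to authenticate the request due to an error.*x509: certificate signed by unknown authority' || true
}

# Usage against a live cluster (not executed here):
#   oc logs deployment/prometheus-adapter -n openshift-monitoring | count_x509_auth_errors
```

A non-zero count means the symptom described in this bug is present in that log; the count is per log line, not per certificate error listed inside a line.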

Actual results:
3. The metrics backend pod log shows the error "Unable to authenticate the request due to an error", which causes other issues: https://bugzilla.redhat.com/show_bug.cgi?id=1625194#c9 and https://bugzilla.redhat.com/show_bug.cgi?id=1667030

Expected results:
Apiservice/v1beta1.metrics.k8s.io and its backend pod are in good condition, without the error.

Additional info:

Comment 6 Jian Zhang 2019-01-25 03:18:01 UTC

Service catalog has the same issue; details are in bug 1668534.

[core@ip-10-0-8-244 ~]$ oc get pods
NAME                                 READY     STATUS             RESTARTS   AGE
apiserver-849f76f4b6-n7dnr           2/2       Running            3          17h
caddy-docker                         1/1       Running            0          17h
centos-pod                           1/1       Running            0          17h
controller-manager-64b8dd67d-59msf   0/1       CrashLoopBackOff   17         1h
[core@ip-10-0-8-244 ~]$ oc logs apiserver-849f76f4b6-n7dnr -c apiserver
...
E0125 02:59:46.196362       1 authentication.go:62] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I0125 02:59:46.196411       1 wrap.go:42] GET /: (79.437µs) 401 [Go-http-client/2.0 10.130.0.1:44610]
I0125 02:59:50.689977       1 run_server.go:127] etcd checker called

Comment 9 Xingxing Xia 2019-01-29 02:52:58 UTC

(In reply to Xingxing Xia from comment #0)
> Noticed https://bugzilla.redhat.com/show_bug.cgi?id=1665842#c25 , thus I
> tried `oc delete pod prometheus-adapter-... -n openshift-monitoring`, the
> problem then is gone.
The issue occurs again in the latest payload, 4.0.0-0.nightly-2019-01-25-205123. Although this workaround resolves the problem, the issue itself is serious enough, so I am adding beta2blocker.

Comment 18 Xingxing Xia 2019-02-19 02:17:36 UTC

Verified in 4.0.0-0.nightly-2019-02-17-024922, which contains the above PR, by following the steps in comment 0.

Comment 21 errata-xmlrpc 2019-06-04 10:42:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758