Bug 1668632

Summary: [Nextgen] "Unable to authenticate the request due to an error ... x509: certificate signed by unknown authority"
Product: OpenShift Container Platform
Reporter: Xingxing Xia <xxia>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Priority: high
Version: 4.1.0
CC: akrzos, aos-bugs, emoss, fbranczy, jiazha, jokerman, juzhao, mmccomas, pruan, sponnaga, surbania
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2019-06-04 10:42:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description Xingxing Xia 2019-01-23 08:38:42 UTC
Description of problem:
I have often encountered environments that report "metrics.k8s.io/v1beta1: the server is currently unable to handle the request".
I debugged this with `oc get apiservices -o=custom-columns="name:.metadata.name,namespace:.spec.service.namespace,status:.status.conditions[0].status"`.
The output shows that v1beta1.metrics.k8s.io has status False (all others have status True).
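
The status check can be narrowed to just the failing entries with a small filter. This is only a sketch: `unavailable_apiservices` is a hypothetical helper name, and it assumes the three-column table (name, namespace, status) printed by the custom-columns query above, header row included.

```shell
# Print the names of API services whose status column is not "True".
# Reads the name/namespace/status table from stdin and skips the header row.
unavailable_apiservices() {
    awk 'NR > 1 && $3 != "True" { print $1 }'
}

# Usage (needs a logged-in `oc` session):
#   oc get apiservices -o=custom-columns="name:.metadata.name,namespace:.spec.service.namespace,status:.status.conditions[0].status" | unavailable_apiservices
```

On an affected environment this would be expected to list v1beta1.metrics.k8s.io.
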
The backend pod log of v1beta1.metrics.k8s.io shows:
"Unable to authenticate the request due to an error ... x509: certificate signed by unknown authority". The same log message was found in the environment for https://bugzilla.redhat.com/show_bug.cgi?id=1625194#c9, in the environment for https://bugzilla.redhat.com/show_bug.cgi?id=1667030, and in my environment today.

After noticing https://bugzilla.redhat.com/show_bug.cgi?id=1665842#c25, I tried `oc delete pod prometheus-adapter-... -n openshift-monitoring`, and the problem went away.
Several days have passed since that fix, but the error is still found in the metrics pod, hence this bug.
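
The workaround above can be scripted so the generated pod name does not have to be looked up by hand. This is a rough sketch, not a fix: both function names are hypothetical, and it assumes the pod names start with `prometheus-adapter-` and that the pod is recreated after deletion (the adapter runs as deployment/prometheus-adapter).

```shell
# Keep only the prometheus-adapter entries from `oc get pods -o name` output.
adapter_pods() {
    grep '^pod/prometheus-adapter-' || true
}

# Delete every prometheus-adapter pod in openshift-monitoring.
# Needs a logged-in `oc` session; not called automatically.
restart_prometheus_adapter() {
    oc get pods -n openshift-monitoring -o name | adapter_pods |
        while read -r pod; do
            oc delete "$pod" -n openshift-monitoring
        done
}
```

Calling `restart_prometheus_adapter` only clears the symptom; as noted above, the error came back in later environments.
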

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.nightly-2019-01-18-115403   True        False         1d        Cluster version is 4.0.0-0.nightly-2019-01-18-115403

How reproducible:
It seems to happen often, but the exact conditions for reproducing it are not clear.

Steps to Reproduce:
1. Create a nextgen env

2. Check `oc api-resources`,
or check `oc logs ds/apiserver -n openshift-apiserver`.
When this issue occurs, `oc api-resources` will show:
"unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request".

`oc logs ds/apiserver -n openshift-apiserver` will show:
"E0123 06:08:29.395219       1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request"

3. Then check backend pods of v1beta1.metrics.k8s.io
oc get apiservices v1beta1.metrics.k8s.io -o yaml
...
  service:
    name: prometheus-adapter
    namespace: openshift-monitoring

oc logs deployment/prometheus-adapter -n openshift-monitoring
...
E0123 04:17:43.075225       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]
E0123 04:17:59.170792       1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, x509: certificate signed by unknown authority]
...
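
To tell whether the adapter is still rejecting requests, the matching log lines can be counted. A minimal sketch, assuming the log format shown above; `count_x509_auth_errors` is a hypothetical helper name.

```shell
# Count log lines that report the x509 authentication failure.
# Reads the log from stdin; prints 0 when there are no matches.
count_x509_auth_errors() {
    grep -c 'Unable to authenticate the request due to an error.*x509: certificate signed by unknown authority' || true
}

# Usage (needs a logged-in `oc` session):
#   oc logs deployment/prometheus-adapter -n openshift-monitoring | count_x509_auth_errors
```

A count that keeps growing between runs suggests the pod still does not trust the certificate authority that signed the client certificate.
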

Actual results:
3. The metrics backend pod log shows the error "Unable to authenticate the request due to an error", which causes other issues: https://bugzilla.redhat.com/show_bug.cgi?id=1625194#c9 and https://bugzilla.redhat.com/show_bug.cgi?id=1667030

Expected results:
Apiservice/v1beta1.metrics.k8s.io and its backend pod are in good condition, without the error.

Additional info:

Comment 6 Jian Zhang 2019-01-25 03:18:01 UTC
Service catalog has the same issue; details are in bug 1668534.

[core@ip-10-0-8-244 ~]$ oc get pods
NAME                                 READY     STATUS             RESTARTS   AGE
apiserver-849f76f4b6-n7dnr           2/2       Running            3          17h
caddy-docker                         1/1       Running            0          17h
centos-pod                           1/1       Running            0          17h
controller-manager-64b8dd67d-59msf   0/1       CrashLoopBackOff   17         1h
[core@ip-10-0-8-244 ~]$ oc logs apiserver-849f76f4b6-n7dnr -c apiserver
...
E0125 02:59:46.196362       1 authentication.go:62] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I0125 02:59:46.196411       1 wrap.go:42] GET /: (79.437µs) 401 [Go-http-client/2.0 10.130.0.1:44610]
I0125 02:59:50.689977       1 run_server.go:127] etcd checker called

Comment 9 Xingxing Xia 2019-01-29 02:52:58 UTC
(In reply to Xingxing Xia from comment #0)
> Noticed https://bugzilla.redhat.com/show_bug.cgi?id=1665842#c25 , thus I
> tried `oc delete pod prometheus-adapter-... -n openshift-monitoring`, the
> problem then is gone.
I hit this again in the latest payload, 4.0.0-0.nightly-2019-01-25-205123. Although the workaround solves the problem, the issue itself is serious enough, so I am adding beta2blocker.

Comment 18 Xingxing Xia 2019-02-19 02:17:36 UTC
Verified in 4.0.0-0.nightly-2019-02-17-024922, which contains the above PR, following the steps in comment 0.

Comment 21 errata-xmlrpc 2019-06-04 10:42:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758