Bug 1736800 - openshift-apiserver is down due to "x509: certificate signed by unknown authority" error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Standa Laznicka
QA Contact: Xingxing Xia
URL:
Whiteboard:
Duplicates: 1737591
Depends On:
Blocks:
Reported: 2019-08-02 04:31 UTC by Junqi Zhao
Modified: 2020-02-10 23:22 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:34:29 UTC
Target Upstream Version:
Embargoed:


Attachments
openshift-apiserver pod logs (358.47 KB, application/gzip)
2019-08-02 04:31 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:34:39 UTC

Description Junqi Zhao 2019-08-02 04:31:38 UTC
Created attachment 1599485 [details]
openshift-apiserver pod logs

Description of problem:
After the cluster has been running for about a day, openshift-apiserver goes down, and the pod logs show the error "x509: certificate signed by unknown authority".

#  oc get co openshift-apiserver
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.2.0-0.nightly-2019-07-31-162901   False       False         False      6h31m


#  oc get co openshift-apiserver -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-08-01T00:52:47Z"
  generation: 1
  name: openshift-apiserver
  resourceVersion: "1220345"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  uid: ae44a781-b3f6-11e9-817f-02ee1b45aa8e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-08-01T00:55:27Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-08-01T00:57:06Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-08-01T19:52:27Z"
    message: |-
      Available: apiservice/v1.apps.openshift.io: not available: failing or missing response from https://10.128.0.29:8443: bad status from https://10.128.0.29:8443: 401
      Available: apiservice/v1.authorization.openshift.io: not available: failing or missing response from https://10.130.0.39:8443: bad status from https://10.130.0.39:8443: 401
      Available: apiservice/v1.build.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.image.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.oauth.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.project.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.quota.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.route.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.security.openshift.io: not available: failing or missing response from https://10.128.0.29:8443: bad status from https://10.128.0.29:8443: 401
      Available: apiservice/v1.template.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.user.openshift.io: not available: failing or missing response from https://10.130.0.39:8443: bad status from https://10.130.0.39:8443: 401
    reason: AvailableMultiple
    status: "False"
    type: Available
  - lastTransitionTime: "2019-08-01T00:52:48Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: openshiftapiservers
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-apiserver-operator
    resource: namespaces
  - group: ""
    name: openshift-apiserver
    resource: namespaces
  - group: apiregistration.k8s.io
    name: v1.apps.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.authorization.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.build.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.image.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.oauth.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.project.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.quota.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.route.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.security.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.template.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.user.openshift.io
    resource: apiservices
  versions:
  - name: operator
    version: 4.2.0-0.nightly-2019-07-31-162901
  - name: openshift-apiserver
    version: ""
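
The Available message above names only three distinct backend endpoints, which match the three apiserver pod IPs listed below, so every pod is rejecting requests with 401. A minimal sketch for pulling those endpoints out of the operator status; the function name `failing_endpoints` is hypothetical, and it only reads text from stdin:

```shell
# Hypothetical helper: list the distinct endpoints that answered 401
# in the ClusterOperator "Available" condition message (read from stdin).
failing_endpoints() {
  grep -o 'bad status from https://[0-9.]*:[0-9]*: 401' \
    | sed -e 's/^bad status from //' -e 's/: 401$//' \
    | sort -u
}

# Assumed usage against a live cluster:
#   oc get co openshift-apiserver -o yaml | failing_endpoints
```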

# oc -n openshift-apiserver get pod -o wide
NAME              READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
apiserver-dqzvv   1/1     Running   0          25h   10.130.0.39   ip-10-0-152-57.us-east-2.compute.internal    <none>           <none>
apiserver-qgnzr   1/1     Running   0          25h   10.128.0.29   ip-10-0-171-67.us-east-2.compute.internal    <none>           <none>
apiserver-zd2sb   1/1     Running   0          25h   10.129.0.27   ip-10-0-138-201.us-east-2.compute.internal   <none>           <none>

#  oc -n openshift-apiserver logs apiserver-dqzvv
E0802 04:27:35.918516       1 authentication.go:65] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
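
To see how often each pod logs this error, the logs can be filtered for the message. This is only a sketch: the function name `count_x509_errors` is hypothetical, and it reads log text from stdin so it works on `oc logs` output as well as on the attached log file.

```shell
# Hypothetical helper: count log lines that carry the x509 authentication error.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit status clean.
count_x509_errors() {
  grep -c 'x509: certificate signed by unknown authority' || true
}

# Assumed usage against a live cluster (pod names will differ):
#   for pod in $(oc -n openshift-apiserver get pod -o name); do
#     printf '%s: ' "$pod"
#     oc -n openshift-apiserver logs "$pod" | count_x509_errors
#   done
```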

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-31-162901

How reproducible:
Always


Comment 2 Jian Zhang 2019-08-02 05:43:20 UTC
I encountered the same issue; it works again after recreating the apiserver pods.

mac:~ jianzhang$ oc project
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-operator-lifecycle-manager)
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-07-31-162901   True        False         27h     Cluster version is 4.2.0-0.nightly-2019-07-31-162901

mac:~ jianzhang$ oc delete pods --all -n openshift-apiserver 
pod "apiserver-2gs2q" deleted
pod "apiserver-5qxlk" deleted
pod "apiserver-m5cbs" deleted
mac:~ jianzhang$ oc get pods -n openshift-apiserver 
NAME              READY   STATUS    RESTARTS   AGE
apiserver-jg6mp   1/1     Running   0          18s
apiserver-n2lwb   1/1     Running   0          15s
apiserver-pbkgb   1/1     Running   0          15s
mac:~ jianzhang$ oc project
Using project "openshift-operator-lifecycle-manager" on server "https://api.zhsun7.qe.devcluster.openshift.com:6443".
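
The workaround shown in this session can be wrapped in a small function. This is a sketch of the same two commands, not the fix for the underlying bug; the function name and the `OC` override variable are assumptions added for illustration, so that the commands can be dry-run with `OC=echo`.

```shell
# OC is an assumed override hook; it defaults to the real oc client.
OC="${OC:-oc}"

# Hypothetical helper: delete all openshift-apiserver pods so they are
# recreated, then list the new pods.
restart_openshift_apiserver() {
  "$OC" delete pods --all -n openshift-apiserver &&
    "$OC" get pods -n openshift-apiserver
}

# Dry run that only prints the commands:
#   OC=echo; restart_openshift_apiserver
```

Since the pods ran for more than a day before the error appeared in the original report, this should be treated as a temporary measure only.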

Comment 6 Mike Fiedler 2019-08-06 12:03:34 UTC
Adding TestBlocker - this blocks the long running reliability tests for 4.2.

Comment 7 Doug Hellmann 2019-08-06 13:22:42 UTC
This seems related: https://bugzilla.redhat.com/show_bug.cgi?id=1737611

Comment 8 Standa Laznicka 2019-08-07 06:57:47 UTC
*** Bug 1737591 has been marked as a duplicate of this bug. ***

Comment 9 Standa Laznicka 2019-08-07 15:06:10 UTC
Simple reproducer - force cert rotation in the openshift-kube-apiserver namespace:

oc get secret -n openshift-kube-apiserver -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | .!=null and fromdateiso8601<='$( date --date='+1year' +%s )') | "-n \(.metadata.namespace) \(.metadata.name)"' | xargs -n3 oc patch secret -p='{"metadata": {"annotations": {"auth.openshift.io/certificate-not-after": null}}}'

Comment 10 Sebastian Jug 2019-08-08 13:00:19 UTC
I'm seeing a similar issue on a new bare metal server running 4.1.8.

After being installed and running for over 24 hours all `oc` commands return:
```
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-apiserver-lb-signer")
```

Comment 11 Michal Fojtik 2019-08-08 13:01:56 UTC
Fix: https://github.com/openshift/openshift-apiserver/pull/16

Comment 13 Michal Fojtik 2019-08-09 07:35:47 UTC
(In reply to Sebastian Jug from comment #10)
> I'm seeing a similar issue on a new bare metal server running 4.1.8.
> 
> After being installed and running for over 24 hours all `oc` commands return:
> ```
> Unable to connect to the server: x509: certificate signed by unknown
> authority (possibly because of "crypto/rsa: verification error" while trying
> to verify candidate authority certificate "kube-apiserver-lb-signer")
> ```

This issue is different.

Comment 14 Xingxing Xia 2019-08-10 14:24:27 UTC
Verified in 4.2.0-0.nightly-2019-08-09-000333: after watching the cluster for 35 hours, the issue did not occur.

Comment 15 Sebastian Jug 2019-08-12 15:51:17 UTC
(In reply to Michal Fojtik from comment #13)
> (In reply to Sebastian Jug from comment #10)
> > I'm seeing a similar issue on a new bare metal server running 4.1.8.
> > 
> > After being installed and running for over 24 hours all `oc` commands return:
> > ```
> > Unable to connect to the server: x509: certificate signed by unknown
> > authority (possibly because of "crypto/rsa: verification error" while trying
> > to verify candidate authority certificate "kube-apiserver-lb-signer")
> > ```
> 
> This issue is different.

Correct, thank you Michal

Comment 16 Ryan Phillips 2019-08-15 16:55:22 UTC
*** Bug 1736168 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2019-10-16 06:34:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
