Bug 1736800

Summary: openshift-apiserver is down due to "x509: certificate signed by unknown authority" error
Product: OpenShift Container Platform
Component: openshift-apiserver
Version: 4.2.0
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Keywords: Regression, TestBlocker
Type: Bug
Reporter: Junqi Zhao <juzhao>
Assignee: Standa Laznicka <slaznick>
QA Contact: Xingxing Xia <xxia>
CC: akamra, anusaxen, aos-bugs, decarr, dhellmann, dmoessne, eminguez, gklein, jhou, jiazha, mfojtik, mifiedle, mkarg, nagrawal, rsandu, scuppett, sejug, slaznick, wabouham, wking, yprokule
Last Closed: 2019-10-16 06:34:29 UTC
Attachments:
  openshift-apiserver pod logs (flags: none)

Description Junqi Zhao 2019-08-02 04:31:38 UTC
Created attachment 1599485
openshift-apiserver pod logs

Description of problem:
After the cluster has been running for about a day, openshift-apiserver goes down, and the pod logs show the error "x509: certificate signed by unknown authority".

#  oc get co openshift-apiserver
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.2.0-0.nightly-2019-07-31-162901   False       False         False      6h31m


#  oc get co openshift-apiserver -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-08-01T00:52:47Z"
  generation: 1
  name: openshift-apiserver
  resourceVersion: "1220345"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  uid: ae44a781-b3f6-11e9-817f-02ee1b45aa8e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-08-01T00:55:27Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-08-01T00:57:06Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-08-01T19:52:27Z"
    message: |-
      Available: apiservice/v1.apps.openshift.io: not available: failing or missing response from https://10.128.0.29:8443: bad status from https://10.128.0.29:8443: 401
      Available: apiservice/v1.authorization.openshift.io: not available: failing or missing response from https://10.130.0.39:8443: bad status from https://10.130.0.39:8443: 401
      Available: apiservice/v1.build.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.image.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.oauth.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.project.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.quota.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.route.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.security.openshift.io: not available: failing or missing response from https://10.128.0.29:8443: bad status from https://10.128.0.29:8443: 401
      Available: apiservice/v1.template.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.user.openshift.io: not available: failing or missing response from https://10.130.0.39:8443: bad status from https://10.130.0.39:8443: 401
    reason: AvailableMultiple
    status: "False"
    type: Available
  - lastTransitionTime: "2019-08-01T00:52:48Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: openshiftapiservers
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-apiserver-operator
    resource: namespaces
  - group: ""
    name: openshift-apiserver
    resource: namespaces
  - group: apiregistration.k8s.io
    name: v1.apps.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.authorization.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.build.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.image.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.oauth.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.project.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.quota.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.route.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.security.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.template.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.user.openshift.io
    resource: apiservices
  versions:
  - name: operator
    version: 4.2.0-0.nightly-2019-07-31-162901
  - name: openshift-apiserver
    version: ""

# oc -n openshift-apiserver get pod -o wide
NAME              READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
apiserver-dqzvv   1/1     Running   0          25h   10.130.0.39   ip-10-0-152-57.us-east-2.compute.internal    <none>           <none>
apiserver-qgnzr   1/1     Running   0          25h   10.128.0.29   ip-10-0-171-67.us-east-2.compute.internal    <none>           <none>
apiserver-zd2sb   1/1     Running   0          25h   10.129.0.27   ip-10-0-138-201.us-east-2.compute.internal   <none>           <none>

#  oc -n openshift-apiserver logs apiserver-dqzvv
E0802 04:27:35.918516       1 authentication.go:65] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
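
The 401 responses in the Available condition can be matched to the apiserver pods by IP. Here is a minimal sketch of that, assuming the message format is exactly what the ClusterOperator status above shows. The regular expression and the `failing_endpoints` helper are illustrative and not part of any OpenShift tooling; the pod names and IPs are copied from the `oc get pod -o wide` output above.

```python
import re
from collections import defaultdict

# One line of the Available condition message, as seen in this report.
# The format is inferred from the output above, not from a documented API.
LINE_RE = re.compile(
    r"Available: apiservice/(?P<apiservice>\S+): not available: "
    r"failing or missing response from https://(?P<ip>[\d.]+):(?P<port>\d+): "
    r"bad status from \S+: (?P<status>\d+)"
)


def failing_endpoints(message):
    """Group unavailable apiservices by the endpoint IP that returned the bad status."""
    by_ip = defaultdict(list)
    for line in message.splitlines():
        match = LINE_RE.search(line)
        if match:
            by_ip[match.group("ip")].append(
                (match.group("apiservice"), int(match.group("status")))
            )
    return dict(by_ip)


# Two lines taken from the condition message above.
sample = """\
Available: apiservice/v1.apps.openshift.io: not available: failing or missing response from https://10.128.0.29:8443: bad status from https://10.128.0.29:8443: 401
Available: apiservice/v1.user.openshift.io: not available: failing or missing response from https://10.130.0.39:8443: bad status from https://10.130.0.39:8443: 401
"""

# Pod IPs from the `oc get pod -o wide` output above.
pods = {
    "10.130.0.39": "apiserver-dqzvv",
    "10.128.0.29": "apiserver-qgnzr",
    "10.129.0.27": "apiserver-zd2sb",
}

for ip, failures in failing_endpoints(sample).items():
    print(pods.get(ip, "unknown pod"), failures)
```

Applied to the full message, every one of the three pod IPs shows up with status 401, which fits all three pods rejecting the request with the x509 error rather than a single bad replica.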

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-31-162901

How reproducible:
Always

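Steps to Reproduce:
1. Run a cluster on 4.2.0-0.nightly-2019-07-31-162901.
2. Let the cluster run for about a day.
3. Run `oc get co openshift-apiserver` and check the openshift-apiserver pod logs.

Actual results:
openshift-apiserver reports Available=False, and the pod logs show "x509: certificate signed by unknown authority".

Expected results:
openshift-apiserver stays available.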

Additional info:

Comment 2 Jian Zhang 2019-08-02 05:43:20 UTC
I encountered the same issue. It works again after deleting the apiserver pods so that they are recreated.

mac:~ jianzhang$ oc project
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-operator-lifecycle-manager)
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-07-31-162901   True        False         27h     Cluster version is 4.2.0-0.nightly-2019-07-31-162901

mac:~ jianzhang$ oc delete pods --all -n openshift-apiserver 
pod "apiserver-2gs2q" deleted
pod "apiserver-5qxlk" deleted
pod "apiserver-m5cbs" deleted
mac:~ jianzhang$ oc get pods -n openshift-apiserver 
NAME              READY   STATUS    RESTARTS   AGE
apiserver-jg6mp   1/1     Running   0          18s
apiserver-n2lwb   1/1     Running   0          15s
apiserver-pbkgb   1/1     Running   0          15s
mac:~ jianzhang$ oc project
Using project "openshift-operator-lifecycle-manager" on server "https://api.zhsun7.qe.devcluster.openshift.com:6443".

Comment 6 Mike Fiedler 2019-08-06 12:03:34 UTC
Adding TestBlocker - this blocks the long running reliability tests for 4.2.

Comment 7 Doug Hellmann 2019-08-06 13:22:42 UTC
This seems related: https://bugzilla.redhat.com/show_bug.cgi?id=1737611

Comment 8 Standa Laznicka 2019-08-07 06:57:47 UTC
*** Bug 1737591 has been marked as a duplicate of this bug. ***

Comment 9 Standa Laznicka 2019-08-07 15:06:10 UTC
Simple reproducer - force cert rotation in the openshift-kube-apiserver namespace:

oc get secret -n openshift-kube-apiserver -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | .!=null and fromdateiso8601<='$( date --date='+1year' +%s )') | "-n \(.metadata.namespace) \(.metadata.name)"' | xargs -n3 oc patch secret -p='{"metadata": {"annotations": {"auth.openshift.io/certificate-not-after": null}}}'
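
For readers less familiar with jq, the selection step of the one-liner above can be sketched in Python. This only mirrors the filter: keep secrets that carry the `auth.openshift.io/certificate-not-after` annotation with a timestamp no later than one year from now. It does not patch anything. The function name `secrets_to_rotate`, the 365-day approximation of `+1year`, and the sample secret names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

ANNOTATION = "auth.openshift.io/certificate-not-after"


def secrets_to_rotate(secret_list, now=None, horizon=timedelta(days=365)):
    """Return (namespace, name) for secrets whose not-after annotation is
    set and falls on or before now + horizon.

    secret_list is the parsed JSON of `oc get secret ... -o json`.
    The 365-day horizon approximates `date --date='+1year'`.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now + horizon
    selected = []
    for item in secret_list.get("items", []):
        meta = item.get("metadata", {})
        not_after = (meta.get("annotations") or {}).get(ANNOTATION)
        if not_after is None:
            continue  # same as the `.!=null` guard in the jq filter
        # jq's fromdateiso8601 expects this exact timestamp layout.
        expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
        if expiry <= cutoff:
            selected.append((meta["namespace"], meta["name"]))
    return selected


# Hypothetical secrets; the names are made up for this example.
sample = {
    "items": [
        {"metadata": {"namespace": "openshift-kube-apiserver",
                      "name": "example-expiring",
                      "annotations": {ANNOTATION: "2019-08-08T00:00:00Z"}}},
        {"metadata": {"namespace": "openshift-kube-apiserver",
                      "name": "example-long-lived",
                      "annotations": {ANNOTATION: "2029-08-01T00:00:00Z"}}},
        {"metadata": {"namespace": "openshift-kube-apiserver",
                      "name": "example-unannotated"}},
    ]
}

fixed_now = datetime(2019, 8, 7, tzinfo=timezone.utc)
print(secrets_to_rotate(sample, now=fixed_now))
```

In the one-liner, each selected pair is passed to `oc patch secret`, which sets the annotation to null. That is what triggers the forced rotation described in this comment.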

Comment 10 Sebastian Jug 2019-08-08 13:00:19 UTC
I'm seeing a similar issue on a new bare metal server running 4.1.8.

After being installed and running for over 24 hours all `oc` commands return:
```
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-apiserver-lb-signer")
```

Comment 11 Michal Fojtik 2019-08-08 13:01:56 UTC
Fix: https://github.com/openshift/openshift-apiserver/pull/16

Comment 13 Michal Fojtik 2019-08-09 07:35:47 UTC
(In reply to Sebastian Jug from comment #10)
> I'm seeing a similar issue on a new bare metal server running 4.1.8.
> 
> After being installed and running for over 24 hours all `oc` commands return:
> ```
> Unable to connect to the server: x509: certificate signed by unknown
> authority (possibly because of "crypto/rsa: verification error" while trying
> to verify candidate authority certificate "kube-apiserver-lb-signer")
> ```

This issue is different.

Comment 14 Xingxing Xia 2019-08-10 14:24:27 UTC
Verified in 4.2.0-0.nightly-2019-08-09-000333: after watching the cluster for 35 hours, the issue did not occur.

Comment 15 Sebastian Jug 2019-08-12 15:51:17 UTC
(In reply to Michal Fojtik from comment #13)
> (In reply to Sebastian Jug from comment #10)
> > I'm seeing a similar issue on a new bare metal server running 4.1.8.
> > 
> > After being installed and running for over 24 hours all `oc` commands return:
> > ```
> > Unable to connect to the server: x509: certificate signed by unknown
> > authority (possibly because of "crypto/rsa: verification error" while trying
> > to verify candidate authority certificate "kube-apiserver-lb-signer")
> > ```
> 
> This issue is different.

Correct, thank you, Michal.

Comment 16 Ryan Phillips 2019-08-15 16:55:22 UTC
*** Bug 1736168 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2019-10-16 06:34:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922