Created attachment 1599485 [details]
openshift-apiserver pod logs

Description of problem:
After the cluster runs for about a day, openshift-apiserver becomes unavailable with "x509: certificate signed by unknown authority" errors in the pod logs.

# oc get co openshift-apiserver
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-apiserver   4.2.0-0.nightly-2019-07-31-162901   False       False         False      6h31m

# oc get co openshift-apiserver -oyaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-08-01T00:52:47Z"
  generation: 1
  name: openshift-apiserver
  resourceVersion: "1220345"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-apiserver
  uid: ae44a781-b3f6-11e9-817f-02ee1b45aa8e
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-08-01T00:55:27Z"
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-08-01T00:57:06Z"
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-08-01T19:52:27Z"
    message: |-
      Available: apiservice/v1.apps.openshift.io: not available: failing or missing response from https://10.128.0.29:8443: bad status from https://10.128.0.29:8443: 401
      Available: apiservice/v1.authorization.openshift.io: not available: failing or missing response from https://10.130.0.39:8443: bad status from https://10.130.0.39:8443: 401
      Available: apiservice/v1.build.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.image.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.oauth.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.project.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.quota.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.route.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.security.openshift.io: not available: failing or missing response from https://10.128.0.29:8443: bad status from https://10.128.0.29:8443: 401
      Available: apiservice/v1.template.openshift.io: not available: failing or missing response from https://10.129.0.27:8443: bad status from https://10.129.0.27:8443: 401
      Available: apiservice/v1.user.openshift.io: not available: failing or missing response from https://10.130.0.39:8443: bad status from https://10.130.0.39:8443: 401
    reason: AvailableMultiple
    status: "False"
    type: Available
  - lastTransitionTime: "2019-08-01T00:52:48Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: openshiftapiservers
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-apiserver-operator
    resource: namespaces
  - group: ""
    name: openshift-apiserver
    resource: namespaces
  - group: apiregistration.k8s.io
    name: v1.apps.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.authorization.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.build.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.image.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.oauth.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.project.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.quota.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.route.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.security.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.template.openshift.io
    resource: apiservices
  - group: apiregistration.k8s.io
    name: v1.user.openshift.io
    resource: apiservices
  versions:
  - name: operator
    version: 4.2.0-0.nightly-2019-07-31-162901
  - name: openshift-apiserver
    version: ""

# oc -n openshift-apiserver get pod -o wide
NAME              READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
apiserver-dqzvv   1/1     Running   0          25h   10.130.0.39   ip-10-0-152-57.us-east-2.compute.internal    <none>           <none>
apiserver-qgnzr   1/1     Running   0          25h   10.128.0.29   ip-10-0-171-67.us-east-2.compute.internal    <none>           <none>
apiserver-zd2sb   1/1     Running   0          25h   10.129.0.27   ip-10-0-138-201.us-east-2.compute.internal   <none>           <none>

# oc -n openshift-apiserver logs apiserver-dqzvv
E0802 04:27:35.918516       1 authentication.go:65] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-31-162901

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
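The Available condition packs every failing APIService into one multi-line message. A minimal sketch of extracting the failing services and the backend pod IPs that returned 401 (the sample message below is abbreviated from the status above; this is an illustration, not part of the operator):

```python
import re

# Abbreviated copy of the Available condition message shown above.
message = (
    "Available: apiservice/v1.apps.openshift.io: not available: "
    "failing or missing response from https://10.128.0.29:8443: "
    "bad status from https://10.128.0.29:8443: 401\n"
    "Available: apiservice/v1.build.openshift.io: not available: "
    "failing or missing response from https://10.129.0.27:8443: "
    "bad status from https://10.129.0.27:8443: 401"
)

# Each line names the APIService and the pod IP that returned 401.
pattern = re.compile(
    r"apiservice/(?P<svc>\S+): not available: .*"
    r"https://(?P<ip>[\d.]+):8443: 401")
failures = {m.group("svc"): m.group("ip") for m in pattern.finditer(message)}
print(failures)
```

Mapping each failing APIService back to a pod IP this way makes it easy to see (as in the `oc get pod -o wide` output above) that all three apiserver pods were rejecting requests, which points at a shared trust problem rather than a single bad pod.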
Encountered the same issue; it works well after re-creating the apiserver pods.

mac:~ jianzhang$ oc project
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-operator-lifecycle-manager)
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-07-31-162901   True        False         27h     Cluster version is 4.2.0-0.nightly-2019-07-31-162901
mac:~ jianzhang$ oc delete pods --all -n openshift-apiserver
pod "apiserver-2gs2q" deleted
pod "apiserver-5qxlk" deleted
pod "apiserver-m5cbs" deleted
mac:~ jianzhang$ oc get pods -n openshift-apiserver
NAME              READY   STATUS    RESTARTS   AGE
apiserver-jg6mp   1/1     Running   0          18s
apiserver-n2lwb   1/1     Running   0          15s
apiserver-pbkgb   1/1     Running   0          15s
mac:~ jianzhang$ oc project
Using project "openshift-operator-lifecycle-manager" on server "https://api.zhsun7.qe.devcluster.openshift.com:6443".
Adding TestBlocker - this blocks the long-running reliability tests for 4.2.
This seems related: https://bugzilla.redhat.com/show_bug.cgi?id=1737611
*** Bug 1737591 has been marked as a duplicate of this bug. ***
Simple reproducer - force cert rotation in the openshift-kube-apiserver namespace:

oc get secret -n openshift-kube-apiserver -A -o json \
  | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | .!=null and fromdateiso8601<='$( date --date='+1year' +%s )') | "-n \(.metadata.namespace) \(.metadata.name)"' \
  | xargs -n3 oc patch secret -p='{"metadata": {"annotations": {"auth.openshift.io/certificate-not-after": null}}}'
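The jq selection in the reproducer keeps only secrets whose `auth.openshift.io/certificate-not-after` annotation exists and falls within the next year, then nulls that annotation so the operator re-issues the certs. A minimal sketch of the same selection logic (the secret names and dates below are hypothetical sample data, not from a real cluster):

```python
import datetime

# Hypothetical sample of what `oc get secret -A -o json` returns
# (only the fields the jq filter looks at are shown).
secrets = [
    {"metadata": {"namespace": "openshift-kube-apiserver",
                  "name": "serving-cert",
                  "annotations": {
                      "auth.openshift.io/certificate-not-after":
                      "2019-09-01T00:00:00Z"}}},
    {"metadata": {"namespace": "openshift-kube-apiserver",
                  "name": "unrelated-secret",
                  "annotations": {}}},
]

UTC = datetime.timezone.utc
# Fixed "now" so the example is reproducible; a real script would
# use datetime.datetime.now(UTC).
NOW = datetime.datetime(2019, 8, 2, tzinfo=UTC)

def expiring_within_a_year(secret, now=NOW):
    """Mirror the jq filter: annotation present and expiry before
    one year from now."""
    ann = secret["metadata"].get("annotations") or {}
    not_after = ann.get("auth.openshift.io/certificate-not-after")
    if not_after is None:
        return False
    expiry = datetime.datetime.strptime(
        not_after, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=UTC)
    return expiry <= now + datetime.timedelta(days=365)

# These are the "-n <namespace> <name>" argument triples that the
# reproducer pipes into `xargs -n3 oc patch secret`.
targets = [f"-n {s['metadata']['namespace']} {s['metadata']['name']}"
           for s in secrets if expiring_within_a_year(s)]
print(targets)
```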
I'm seeing a similar issue on a new bare metal server running 4.1.8. After being installed and running for over 24 hours, all `oc` commands return: ``` Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-apiserver-lb-signer") ```
Fix: https://github.com/openshift/openshift-apiserver/pull/16
(In reply to Sebastian Jug from comment #10) > I'm seeing a similar issue on a new bare metal server running 4.1.8. > > After being installed and running for over 24 hours all `oc` commands return: > ``` > Unable to connect to the server: x509: certificate signed by unknown > authority (possibly because of "crypto/rsa: verification error" while trying > to verify candidate authority certificate "kube-apiserver-lb-signer") > ``` This issue is different.
Verified in 4.2.0-0.nightly-2019-08-09-000333: after a 35-hour watch, the issue has not recurred.
(In reply to Michal Fojtik from comment #13) > (In reply to Sebastian Jug from comment #10) > > I'm seeing a similar issue on a new bare metal server running 4.1.8. > > > > After being installed and running for over 24 hours all `oc` commands return: > > ``` > > Unable to connect to the server: x509: certificate signed by unknown > > authority (possibly because of "crypto/rsa: verification error" while trying > > to verify candidate authority certificate "kube-apiserver-lb-signer") > > ``` > > This issue is different. Correct, thank you Michal
*** Bug 1736168 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922