Description of problem:
After cert expiry DR, co/operator-lifecycle-manager-packageserver is stuck at Available=False with "Unable to authenticate the request due to an error: x509: certificate signed by unknown authority".

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-17-073141

How reproducible:
Not sure

Steps to Reproduce:
1. Launched the env (IPI on AWS, FIPS on). Checked that pods, COs, and nodes were all healthy.
2. Shut down all master and worker nodes from the AWS console.
3. After the cluster age exceeded 25h, restarted all master and worker nodes from the AWS console. Approved the Pending CSRs with `oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve`.
4. Checked pods, COs, and nodes.

Actual results:
4. All pods are healthy and all nodes are Ready. All COs are healthy except the one below, which is stuck at Available=False:
$ oc get co operator-lifecycle-manager-packageserver
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-17-073141   False   True   False   97m

Checked oc get apiservice and found:
v1.packages.operators.coreos.com   openshift-operator-lifecycle-manager/packageserver-service   False (FailedDiscoveryCheck)   27h

$ oc get apiservice v1.packages.operators.coreos.com -o yaml
...
spec:
  caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJhRENDQVE2Z0F3SUJBZ0lJQU5McDd5OEhMSDh3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TURBNU1UY3dPREl3TlRSYUZ3MHlNakE1TVRjd09ESXdOVFJhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUnpuLzhWQmV2QXpnMDVQQkl3Nm03c3VGaDRPL3FybDNVVGlRMHY4WnpDUjRmSjRGWXgyVUc1SjFWdEZ3SUUKOVlYZnNsUXVCY09GMi94QWRwMjd0S0I4bzBJd1FEQU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3Q2dZSUtvWkl6ajBFCkF3SURTQUF3UlFJZ000Y0duTUNKV2R3QjM5MXg1YnVqNDhWVWgvdTJBdGYwOVpqd0xyejFCYTBDSVFDWktQOVAKMzFCWkdzNXZGSmt4dC9RZUc3RUhkQW4rbWNvSXh3eFhkbS9VU1E9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  group: packages.operators.coreos.com
  groupPriorityMinimum: 2000
  service:
    name: packageserver-service
    namespace: openshift-operator-lifecycle-manager
    port: 5443
  version: v1
  versionPriority: 15
status:
  conditions:
  - lastTransitionTime: "2020-09-18T09:34:10Z"
    message: 'failing or missing response from https://10.129.0.21:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.129.0.21:5443/apis/packages.operators.coreos.com/v1: 401'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available

$ oc get po -A -o wide | grep 10.129.0.21
openshift-operator-lifecycle-manager   packageserver-768769fcf4-xxhq2   1/1   Running   0   27h   10.129.0.21   ip-10-0-165-20.ap-northeast-2.compute.internal   ...

$ oc get po -n openshift-operator-lifecycle-manager -o wide | grep packageserver
packageserver-768769fcf4-t5vrh   1/1   Running   0   27h   10.130.0.6   ip-10-0-205-215.ap-northeast-2.compute.internal   ...
packageserver-768769fcf4-xxhq2   1/1   Running   0   27h   10.129.0.21   ip-10-0-165-20.ap-northeast-2.compute.internal   ...

The logs of both pods contain many x509 errors like the ones below:
$ oc logs -n openshift-operator-lifecycle-manager packageserver-768769fcf4-t5vrh
...
E0918 12:10:51.202746 1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I0918 12:10:51.202797 1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.072µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.129.0.1:51868":
...

Checked the kube-apiserver logs; they contain many errors like these:
2020-09-18T12:13:21.247706425Z E0918 12:13:21.247654 17 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from https://10.129.0.21:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.129.0.21:5443/apis/packages.operators.coreos.com/v1: 401
...
2020-09-18T12:13:21.277274586Z E0918 12:13:21.277217 17 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from https://10.130.0.6:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.130.0.6:5443/apis/packages.operators.coreos.com/v1: 401

The certs have already been renewed, though:
$ oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+24hours" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before") \(.metadata.annotations."auth.openshift.io/certificate-not-after") \(.metadata.namespace)\t\(.metadata.name)"'
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-config-managed   kube-controller-manager-client-cert-key
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-config-managed   kube-scheduler-client-cert-key
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-apiserver-operator   aggregator-client-signer
2020-09-18T09:35:14Z   2020-09-18T21:35:13Z   openshift-kube-apiserver   aggregator-client
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-apiserver   check-endpoints-client-cert-key
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-apiserver   control-plane-node-admin-client-cert-key
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-apiserver   external-loadbalancer-serving-certkey
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-apiserver   internal-loadbalancer-serving-certkey
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-apiserver   kubelet-client
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-apiserver   kubelet-client-7
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-apiserver   localhost-serving-cert-certkey
2020-09-18T09:35:11Z   2020-09-18T21:35:12Z   openshift-kube-apiserver   service-network-serving-certkey
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-controller-manager   kube-controller-manager-client-cert-key
2020-09-18T09:35:12Z   2020-09-18T21:35:13Z   openshift-kube-scheduler   kube-scheduler-client-cert-key

Expected results:
4. No CO should be in an abnormal state.

Additional info:
The caBundle in apiservice v1.packages.operators.coreos.com is (supposed to be) managed by the operator that owns the package apiserver. The kube-apiserver only uses it to verify the serving cert that the package server returns on access. Moving over to OLM.
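
In case it helps whoever picks this up, here is a minimal sketch for inspecting that caBundle (issuer, subject, validity dates). The helper names ca_bundle_of, decode_ca_bundle, and describe_cert are made up for illustration; the sketch assumes a logged-in `oc`, a `base64` that accepts `-d`, and `openssl` on the PATH. It only shows what the bundle contains and does not by itself confirm the root cause.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting an APIService caBundle.

# Print the base64-encoded .spec.caBundle of the named APIService.
# Assumes `oc` is logged in to the affected cluster.
ca_bundle_of() {
  oc get apiservice "$1" -o jsonpath='{.spec.caBundle}'
}

# Decode a base64 bundle read from stdin into PEM text.
# Assumes a base64 implementation that supports -d (e.g. GNU coreutils).
decode_ca_bundle() {
  base64 -d
}

# Print subject, issuer, and validity dates of the first PEM cert on stdin.
describe_cert() {
  openssl x509 -noout -subject -issuer -dates
}
```

Against the cluster from this report, the helpers would be chained as `ca_bundle_of v1.packages.operators.coreos.com | decode_ca_bundle | describe_cert`.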