Description of problem:
The packageserver is unavailable due to "x509: certificate signed by unknown authority", see:

[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-19-060512   True        False         False      4h13m
...
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-19-060512   True        False         False      29h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-19-060512   True        False         False      29h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-19-060512   False       True          False      4h17m

176119 E0921 07:18:11.647452       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
176120 I0921 07:18:11.647749       1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=516.686µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.130.0.2:37892":
176121 E0921 07:18:14.204188       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority

Version-Release number of selected component (if applicable):
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-19-060512   True        False         28h     Error while reconciling 4.6.0-0.nightly-2020-09-19-060512: the cluster operator operator-lifecycle-manager-packageserver has not yet successfully rolled out

[root@preserve-olm-env data]# oc exec catalog-operator-587c77c6d4-lrrhp -- olm --version
OLM version: 0.16.1
git commit: 5dafa75811d6682e0df44d9eff8aac9ec3bf2c21

How reproducible:
Not sure; encountered it once so far, in a disconnected environment (ipi-on-aws, OVN, etcd_encryption).

Steps to Reproduce:
1. Install a 4.6 cluster, for example: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/113324/artifact/workdir/install-dir/auth/kubeconfig/*view*/
2. Leave it running for several days.
3. Check the OLM status.

Actual results:

Expected results:

Additional info:
1. The CSV stays in the Installing phase because the APIService is unavailable.
[root@preserve-olm-env data]# oc get csv
NAME            DISPLAY          VERSION   REPLACES   PHASE
packageserver   Package Server   0.16.1               Installing

[root@preserve-olm-env data]# oc get apiservice v1.packages.operators.coreos.com -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  creationTimestamp: "2020-09-20T01:38:01Z"
  labels:
    olm.owner: packageserver
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operator-lifecycle-manager
  name: v1.packages.operators.coreos.com
  resourceVersion: "396973"
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1.packages.operators.coreos.com
  uid: 542c8de4-e0b8-4912-b324-80ec6c9f8fed
spec:
  caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJhVENDQVE2Z0F3SUJBZ0lJYlQvZ3R5T1B3NGN3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TURBNU1qQXdNVE00TURCYUZ3MHlNakE1TWpBd01UTTRNREJhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUk9ibUVucVFSdk9oTC93U0llaVZnMzhiZmpNL2dDYkEyQUVjSVZDVzZJVUVQSk5xQ3MyTkV0aGNMcXJMN2YKbXIvd1hBMjRPVmVDMVBObURTSkgzOWRUbzBJd1FEQU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3Q2dZSUtvWkl6ajBFCkF3SURTUUF3UmdJaEFOMHdkbUFGSjZlejJGY0tEeHZoSW92OTA2ejZ2eVRjdjJrbTZqbzBOZjJHQWlFQXJpLzIKcXhQcGFpaWdZQ1g1dUdlcVVNQ3JxeW9QQTQ1MVlNSnp3WXRDZ1ZjPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  group: packages.operators.coreos.com
  groupPriorityMinimum: 2000
  service:
    name: packageserver-service
    namespace: openshift-operator-lifecycle-manager
    port: 5443
  version: v1
  versionPriority: 15
status:
  conditions:
  - lastTransitionTime: "2020-09-21T02:40:26Z"
    message: 'failing or missing response from https://10.129.0.22:5443/apis/packages.operators.coreos.com/v1:
      bad status from https://10.129.0.22:5443/apis/packages.operators.coreos.com/v1: 401'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available

2. Check the packageserver pod logs.

[root@preserve-olm-env data]# oc get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
catalog-operator-587c77c6d4-lrrhp   1/1     Running   0          28h   10.129.0.28   ip-10-0-52-4.us-east-2.compute.internal     <none>           <none>
olm-operator-6b5d47c646-t9jsp       1/1     Running   0          28h   10.129.0.23   ip-10-0-52-4.us-east-2.compute.internal     <none>           <none>
packageserver-5d88d79964-mbffr      1/1     Running   0          28h   10.129.0.22   ip-10-0-52-4.us-east-2.compute.internal     <none>           <none>
packageserver-5d88d79964-zzfvr      1/1     Running   0          28h   10.128.0.30   ip-10-0-68-137.us-east-2.compute.internal   <none>           <none>

176124 E0921 07:18:14.204439       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
176125 I0921 07:18:14.204809       1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=517.698µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.128.0.2:52974":
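For reference, the caBundle in the APIService above can be decoded to see which serving CA the aggregator currently trusts and its validity window. A quick sketch (output not captured here):

oc get apiservice v1.packages.operators.coreos.com -o jsonpath='{.spec.caBundle}' \
  | base64 -d | openssl x509 -noout -subject -startdate -enddate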
More information:
1. The cluster was shut down after 24 hours.
2. I restarted it (restarted the instances manually from the AWS console). Some CAs had expired, so CA auto-recovery kicked in; after that, all the other cluster operators came back healthy, and only operator-lifecycle-manager-packageserver ran into this error.

Here is the cluster for your debugging: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/113324/artifact/workdir/install-dir/auth/kubeconfig/*view*/
I'm not sure why the CA rotation didn't trigger new pods to be created.

Workaround: delete the old packageserver pods.

[root@preserve-olm-env data]# oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-587c77c6d4-lrrhp   1/1     Running   0          28h
olm-operator-6b5d47c646-t9jsp       1/1     Running   0          28h
packageserver-5d88d79964-mbffr      1/1     Running   0          28h
packageserver-5d88d79964-zzfvr      1/1     Running   0          28h
[root@preserve-olm-env data]# oc delete pods packageserver-5d88d79964-mbffr packageserver-5d88d79964-zzfvr
pod "packageserver-5d88d79964-mbffr" deleted
pod "packageserver-5d88d79964-zzfvr" deleted
[root@preserve-olm-env data]# oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-587c77c6d4-lrrhp   1/1     Running   0          28h
olm-operator-6b5d47c646-t9jsp       1/1     Running   0          28h
packageserver-5d88d79964-mbm8v      1/1     Running   0          52s
packageserver-5d88d79964-v5xnc      1/1     Running   0          51s

[root@preserve-olm-env data]# oc get apiservice v1.packages.operators.coreos.com -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  creationTimestamp: "2020-09-20T01:38:01Z"
  labels:
    olm.owner: packageserver
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operator-lifecycle-manager
  name: v1.packages.operators.coreos.com
  resourceVersion: "424960"
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1.packages.operators.coreos.com
  uid: 542c8de4-e0b8-4912-b324-80ec6c9f8fed
spec:
  caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJhVENDQVE2Z0F3SUJBZ0lJYlQvZ3R5T1B3NGN3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TURBNU1qQXdNVE00TURCYUZ3MHlNakE1TWpBd01UTTRNREJhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUk9ibUVucVFSdk9oTC93U0llaVZnMzhiZmpNL2dDYkEyQUVjSVZDVzZJVUVQSk5xQ3MyTkV0aGNMcXJMN2YKbXIvd1hBMjRPVmVDMVBObURTSkgzOWRUbzBJd1FEQU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3Q2dZSUtvWkl6ajBFCkF3SURTUUF3UmdJaEFOMHdkbUFGSjZlejJGY0tEeHZoSW92OTA2ejZ2eVRjdjJrbTZqbzBOZjJHQWlFQXJpLzIKcXhQcGFpaWdZQ1g1dUdlcVVNQ3JxeW9QQTQ1MVlNSnp3WXRDZ1ZjPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  group: packages.operators.coreos.com
  groupPriorityMinimum: 2000
  service:
    name: packageserver-service
    namespace: openshift-operator-lifecycle-manager
    port: 5443
  version: v1
  versionPriority: 15
status:
  conditions:
  - lastTransitionTime: "2020-09-21T07:48:36Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available

[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-19-060512   True        False         False      89s
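An equivalent workaround should be to restart the whole deployment rather than deleting individual pods (not verified here; the deployment name is assumed from the pod names above):

oc -n openshift-operator-lifecycle-manager rollout restart deployment packageserver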
*** Bug 1880396 has been marked as a duplicate of this bug. ***
conditions:
- lastTransitionTime: "2020-09-21T02:40:26Z"
  message: 'failing or missing response from https://10.129.0.22:5443/apis/packages.operators.coreos.com/v1:
    bad status from https://10.129.0.22:5443/apis/packages.operators.coreos.com/v1: 401'
  reason: FailedDiscoveryCheck
  status: "False"
  type: Available

This indicates that the root cause is the failed discovery check, which prevents the CA from rotating. I'm not sure why discovery would return 401 - it seems like it may be an issue with rotating the secret for the operator's service account. The failing request is one from the olm-operator pod (with the olm-operator service account) to apiserver discovery - I wouldn't expect that to fail with a 401 under normal cluster conditions.

I'm moving this to apiserver/auth to help triage - is there a known issue with service account tokens if a cluster has been shut down for >24hrs?
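For what it's worth, the failing discovery request can be reproduced on demand from the CLI; the request goes through the kube-apiserver aggregation layer and is proxied to the packageserver, so it exercises the same path (sketch; I'd expect it to fail while the APIService is unavailable):

oc get --raw /apis/packages.operators.coreos.com/v1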
As expected, the kube-apiserver gets a 401 from the packages server:

2020-09-22T03:41:25.857583249Z E0922 03:41:25.857527      18 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from https://10.128.0.7:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.128.0.7:5443/apis/packages.operators.coreos.com/v1: 401

and the packages server shows that it does not understand the client cert:

I0922 08:13:22.801683       1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=278.075µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.130.0.1:38640":
E0922 08:13:22.802062       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority

Chance is very high that the packages server does not honor kube-system/extension-apiserver-authentication ConfigMap.
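If so, the CA the packageserver should be trusting is presumably the requestheader (front-proxy) client CA currently published in that ConfigMap. A quick way to inspect it (sketch; note the value may contain more than one certificate, and openssl only prints the first):

oc -n kube-system get configmap extension-apiserver-authentication \
  -o jsonpath='{.data.requestheader-client-ca-file}' \
  | openssl x509 -noout -subject -startdate -enddate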
I'm setting this to low and moving it to 4.7 since it doesn't seem to occur that often -- there's also a very simple workaround: delete the packageserver pods.

> Chance is very high that the packages server does not honor kube-system/extension-apiserver-authentication ConfigMap.

Yes, maybe we're not honoring **changes** to the extension-apiserver-authentication ConfigMap after packageserver has started -- if that happens during an apiserver cert rotation while packageserver is still running, OLM may get stuck. Further investigation is needed.
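One way to support that theory would be to compare the notBefore of the current requestheader client CA with the packageserver pods' start times -- if the CA was reissued after the pods started, they could not have picked it up at startup. A sketch (the app=packageserver label is assumed):

oc -n kube-system get configmap extension-apiserver-authentication \
  -o jsonpath='{.data.requestheader-client-ca-file}' | openssl x509 -noout -startdate
oc -n openshift-operator-lifecycle-manager get pods -l app=packageserver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'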
Hi Nick,

Thanks! But I'm setting it to medium, since it happens regularly in our daily tests.