Bug 1880928 - The packageserver is unavailable after the cluster restarted due to the x509: certificate signed by unknown authority
Summary: The packageserver is unavailable after the cluster restarted due to the x509...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Daniel Sover
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1880396
Depends On:
Blocks:
 
Reported: 2020-09-21 07:33 UTC by Jian Zhang
Modified: 2021-07-21 11:14 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-05 21:14:39 UTC
Target Upstream Version:
Embargoed:



Description Jian Zhang 2020-09-21 07:33:54 UTC
Description of problem:
The packageserver is unavailable due to "x509: certificate signed by unknown authority" errors; see:
[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-19-060512   True        False         False      4h13m
...
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-19-060512   True        False         False      29h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-19-060512   True        False         False      29h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-19-060512   False       True          False      4h17m


176119 E0921 07:18:11.647452       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
176120 I0921 07:18:11.647749       1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=516.686µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.130.0.2:37892":
176121 E0921 07:18:14.204188       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
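
For reference, a quick way to pull these errors from both packageserver pods (a sketch; the app=packageserver label is an assumption about the default OLM deployment):

# grep recent authentication failures out of every packageserver pod
oc -n openshift-operator-lifecycle-manager logs -l app=packageserver --tail=200 | grep -E 'x509|resp=401'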


Version-Release number of selected component (if applicable):
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-19-060512   True        False         28h     Error while reconciling 4.6.0-0.nightly-2020-09-19-060512: the cluster operator operator-lifecycle-manager-packageserver has not yet successfully rolled out

[root@preserve-olm-env data]# oc exec catalog-operator-587c77c6d4-lrrhp -- olm --version
OLM version: 0.16.1
git commit: 5dafa75811d6682e0df44d9eff8aac9ec3bf2c21


How reproducible:
Not sure; encountered once in a disconnected env (ipi-on-aws, OVN, etcd_encryption).

Steps to Reproduce:
1. Install a 4.6 cluster, for example, https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/113324/artifact/workdir/install-dir/auth/kubeconfig/*view*/

2. Let the cluster run for a day or more, then shut it down and restart the instances (see comment 1).

3. Check the OLM status.

Actual results:
The operator-lifecycle-manager-packageserver ClusterOperator stays unavailable; requests to the packageserver fail with 401 "x509: certificate signed by unknown authority".

Expected results:
All ClusterOperators, including operator-lifecycle-manager-packageserver, report Available=True after the cluster restart.

Additional info:
1, The CSV stays in the Installing phase because the APIService is unavailable (a sketch for decoding the caBundle is included after these excerpts).
[root@preserve-olm-env data]# oc get csv
NAME            DISPLAY          VERSION   REPLACES   PHASE
packageserver   Package Server   0.16.1               Installing

[root@preserve-olm-env data]# oc get apiservice v1.packages.operators.coreos.com -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  creationTimestamp: "2020-09-20T01:38:01Z"
  labels:
    olm.owner: packageserver
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operator-lifecycle-manager
  name: v1.packages.operators.coreos.com
  resourceVersion: "396973"
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1.packages.operators.coreos.com
  uid: 542c8de4-e0b8-4912-b324-80ec6c9f8fed
spec:
  caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJhVENDQVE2Z0F3SUJBZ0lJYlQvZ3R5T1B3NGN3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TURBNU1qQXdNVE00TURCYUZ3MHlNakE1TWpBd01UTTRNREJhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUk9ibUVucVFSdk9oTC93U0llaVZnMzhiZmpNL2dDYkEyQUVjSVZDVzZJVUVQSk5xQ3MyTkV0aGNMcXJMN2YKbXIvd1hBMjRPVmVDMVBObURTSkgzOWRUbzBJd1FEQU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3Q2dZSUtvWkl6ajBFCkF3SURTUUF3UmdJaEFOMHdkbUFGSjZlejJGY0tEeHZoSW92OTA2ejZ2eVRjdjJrbTZqbzBOZjJHQWlFQXJpLzIKcXhQcGFpaWdZQ1g1dUdlcVVNQ3JxeW9QQTQ1MVlNSnp3WXRDZ1ZjPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  group: packages.operators.coreos.com
  groupPriorityMinimum: 2000
  service:
    name: packageserver-service
    namespace: openshift-operator-lifecycle-manager
    port: 5443
  version: v1
  versionPriority: 15
status:
  conditions:
  - lastTransitionTime: "2020-09-21T02:40:26Z"
    message: 'failing or missing response from https://10.129.0.22:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.129.0.22:5443/apis/packages.operators.coreos.com/v1: 401'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available

2, Check the packageserver pod logs.
[root@preserve-olm-env data]# oc get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
catalog-operator-587c77c6d4-lrrhp   1/1     Running   0          28h   10.129.0.28   ip-10-0-52-4.us-east-2.compute.internal     <none>           <none>
olm-operator-6b5d47c646-t9jsp       1/1     Running   0          28h   10.129.0.23   ip-10-0-52-4.us-east-2.compute.internal     <none>           <none>
packageserver-5d88d79964-mbffr      1/1     Running   0          28h   10.129.0.22   ip-10-0-52-4.us-east-2.compute.internal     <none>           <none>
packageserver-5d88d79964-zzfvr      1/1     Running   0          28h   10.128.0.30   ip-10-0-68-137.us-east-2.compute.internal   <none>           <none>

176124 E0921 07:18:14.204439       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
176125 I0921 07:18:14.204809       1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=517.698µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.128.0.2:52974":
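
A sketch for inspecting the CA recorded on the APIService above (it only decodes what is already in the output; openssl prints the first certificate of the bundle):

# decode the APIService caBundle and show its issuer and validity window
oc get apiservice v1.packages.operators.coreos.com -o jsonpath='{.spec.caBundle}' \
  | base64 -d | openssl x509 -noout -issuer -dates

Comparing the notBefore/notAfter window against the time of the failure shows whether the bundle predates the restart.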

Comment 1 Jian Zhang 2020-09-21 07:45:03 UTC
More information:
1, The cluster was shut down after 24 hours.
2, It was restarted (the instances were restarted manually from the AWS console).

Some CAs expired and were then recovered automatically. In the end, all other cluster operators came back healthy; only operator-lifecycle-manager-packageserver ran into this error. Here is the cluster for debugging: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/113324/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Comment 2 Jian Zhang 2020-09-21 07:55:48 UTC
I'm not sure why the CA rotation didn't trigger new pods to be generated. Workaround: delete the old packageserver pods.

[root@preserve-olm-env data]# oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-587c77c6d4-lrrhp   1/1     Running   0          28h
olm-operator-6b5d47c646-t9jsp       1/1     Running   0          28h
packageserver-5d88d79964-mbffr      1/1     Running   0          28h
packageserver-5d88d79964-zzfvr      1/1     Running   0          28h


[root@preserve-olm-env data]# oc delete pods packageserver-5d88d79964-mbffr packageserver-5d88d79964-zzfvr
pod "packageserver-5d88d79964-mbffr" deleted
pod "packageserver-5d88d79964-zzfvr" deleted

[root@preserve-olm-env data]# oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-587c77c6d4-lrrhp   1/1     Running   0          28h
olm-operator-6b5d47c646-t9jsp       1/1     Running   0          28h
packageserver-5d88d79964-mbm8v      1/1     Running   0          52s
packageserver-5d88d79964-v5xnc      1/1     Running   0          51s

[root@preserve-olm-env data]# oc get apiservice v1.packages.operators.coreos.com -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  creationTimestamp: "2020-09-20T01:38:01Z"
  labels:
    olm.owner: packageserver
    olm.owner.kind: ClusterServiceVersion
    olm.owner.namespace: openshift-operator-lifecycle-manager
  name: v1.packages.operators.coreos.com
  resourceVersion: "424960"
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1.packages.operators.coreos.com
  uid: 542c8de4-e0b8-4912-b324-80ec6c9f8fed
spec:
  caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJhVENDQVE2Z0F3SUJBZ0lJYlQvZ3R5T1B3NGN3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TURBNU1qQXdNVE00TURCYUZ3MHlNakE1TWpBd01UTTRNREJhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUk9ibUVucVFSdk9oTC93U0llaVZnMzhiZmpNL2dDYkEyQUVjSVZDVzZJVUVQSk5xQ3MyTkV0aGNMcXJMN2YKbXIvd1hBMjRPVmVDMVBObURTSkgzOWRUbzBJd1FEQU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3Q2dZSUtvWkl6ajBFCkF3SURTUUF3UmdJaEFOMHdkbUFGSjZlejJGY0tEeHZoSW92OTA2ejZ2eVRjdjJrbTZqbzBOZjJHQWlFQXJpLzIKcXhQcGFpaWdZQ1g1dUdlcVVNQ3JxeW9QQTQ1MVlNSnp3WXRDZ1ZjPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  group: packages.operators.coreos.com
  groupPriorityMinimum: 2000
  service:
    name: packageserver-service
    namespace: openshift-operator-lifecycle-manager
    port: 5443
  version: v1
  versionPriority: 15
status:
  conditions:
  - lastTransitionTime: "2020-09-21T07:48:36Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available


[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-19-060512   True        False         False      89s
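
For reference, the same workaround without copying pod names (a sketch; the app=packageserver label and the packageserver Deployment name are assumptions based on the pod names above):

# delete all packageserver pods in one go
oc -n openshift-operator-lifecycle-manager delete pods -l app=packageserver
# or roll the deployment, which also replaces the pods
oc -n openshift-operator-lifecycle-manager rollout restart deployment packageserver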

Comment 3 Jian Zhang 2020-09-21 08:02:44 UTC
*** Bug 1880396 has been marked as a duplicate of this bug. ***

Comment 5 Evan Cordell 2020-09-21 17:23:31 UTC
  conditions:
  - lastTransitionTime: "2020-09-21T02:40:26Z"
    message: 'failing or missing response from https://10.129.0.22:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.129.0.22:5443/apis/packages.operators.coreos.com/v1: 401'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available

This indicates that the root cause is the failed discovery check, which prevents the CA from rotating. 

I'm not sure why discovery would return 401 - it seems like it may be an issue with rotating the secret for the operator's service account. The failing request is one from the olm-operator pod (with the olm-operator service account) to apiserver discovery - I wouldn't expect that to fail with a 401 under normal cluster conditions.

I'm moving this to apiserver/auth to help triage - is there a known issue with service account tokens if a cluster has been shut down for >24hrs?
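
One way to check whether a request made with the olm-operator service account token also gets a 401 from the aggregated API (a sketch; it assumes the service account is named olm-operator in openshift-operator-lifecycle-manager and uses `oc sa get-token`, which is still available on 4.6):

# replay the discovery call with the olm-operator service account token
TOKEN=$(oc -n openshift-operator-lifecycle-manager sa get-token olm-operator)
oc --token="$TOKEN" get --raw /apis/packages.operators.coreos.com/v1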

Comment 9 Stefan Schimanski 2020-09-22 08:16:28 UTC
As expected, the kube-apiserver gets a 401 from the packages server:

  2020-09-22T03:41:25.857583249Z E0922 03:41:25.857527      18 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from 
  https://10.128.0.7:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.128.0.7:5443/apis/packages.operators.coreos.com/v1: 401

and the packages server shows that it does not understand the client cert:

  I0922 08:13:22.801683       1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=278.075µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.130.0.1:38640":
  E0922 08:13:22.802062       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority

The chance is very high that the packageserver does not honor the kube-system/extension-apiserver-authentication ConfigMap.
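
A sketch for checking what that ConfigMap currently carries (which key matters here is an assumption; requestheader-client-ca-file is the one an aggregated API server normally uses to verify the kube-apiserver's front-proxy client certificate, and openssl prints only the first certificate in the bundle):

# show the subject and validity of the front-proxy client CA from the ConfigMap
oc -n kube-system get configmap extension-apiserver-authentication \
  -o jsonpath='{.data.requestheader-client-ca-file}' | openssl x509 -noout -subject -dates

If the packageserver loaded an older CA at startup, the 401s above would be the expected result.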

Comment 10 Nick Hale 2020-09-22 15:52:14 UTC
I'm setting this to low and moving it to 4.7 since it doesn't seem to occur that often; there's also a very simple workaround: delete the packageserver pods.

> Chance is very high that the packages server does not honor kube-system/extension-apiserver-authentication ConfigMap.

Yes, maybe we're not honoring **changes** to the extension-apiserver-authentication configmap after packageserver has started; if that occurs during an apiserver cert rotation, while packageserver is still running, OLM may get stuck. Further investigation is needed.

Comment 11 Jian Zhang 2020-09-23 01:25:35 UTC
Hi Nick,

Thanks! But I'm setting it to medium since it always happens in our daily tests.

