Bug 1880396 - After cert expiry DR, co/operator-lifecycle-manager-packageserver Available is False with "Unable to authenticate the request due to an error: x509: certificate signed by unknown authority"
Summary: After cert expiry DR, co/operator-lifecycle-manager-packageserver Available is False with "Unable to authenticate the request due to an error: x509: certificate signed by unknown authority"
Keywords:
Status: CLOSED DUPLICATE of bug 1880928
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Evan Cordell
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-18 12:24 UTC by Xingxing Xia
Modified: 2020-09-21 08:47 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-21 08:02:44 UTC
Target Upstream Version:



Description Xingxing Xia 2020-09-18 12:24:05 UTC
Description of problem:
After cert expiry DR, co/operator-lifecycle-manager-packageserver Available is False with "Unable to authenticate the request due to an error: x509: certificate signed by unknown authority"

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-09-17-073141

How reproducible:
Not sure

Steps to Reproduce:
1. Launch the cluster (IPI on AWS with FIPS enabled). Verify that pods, cluster operators (COs), and nodes are all healthy.
2. Shut down all master and worker nodes from the AWS console.
3. After the cluster age exceeds 25h, restart all master and worker nodes from the AWS console. Approve pending CSRs with `oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve` (repeated as needed; see the note after the steps).
4. Check pods, COs, and nodes.
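
Note on step 3: during cert recovery the CSRs typically arrive in waves (kubelet client CSRs first, then kubelet serving CSRs once the nodes re-register), so the approval command generally has to be re-run until no Pending CSRs remain. A minimal sketch, assuming the same `oc` context as above:

$ while oc get csr | grep -q Pending; do
    oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
    sleep 30
  done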

Actual results:
4. All pods are running and all nodes are Ready. All COs are healthy except the one below, which is stuck at Available=False:
$ oc get co operator-lifecycle-manager-packageserver
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-17-073141   False   True    False   97m

Checking `oc get apiservice` shows:
v1.packages.operators.coreos.com              openshift-operator-lifecycle-manager/packageserver-service   False (FailedDiscoveryCheck) 27h

$ oc get apiservice v1.packages.operators.coreos.com -o yaml
...
spec:
  caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJhRENDQVE2Z0F3SUJBZ0lJQU5McDd5OEhMSDh3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TURBNU1UY3dPREl3TlRSYUZ3MHlNakE1TVRjd09ESXdOVFJhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUnpuLzhWQmV2QXpnMDVQQkl3Nm03c3VGaDRPL3FybDNVVGlRMHY4WnpDUjRmSjRGWXgyVUc1SjFWdEZ3SUUKOVlYZnNsUXVCY09GMi94QWRwMjd0S0I4bzBJd1FEQU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3Q2dZSUtvWkl6ajBFCkF3SURTQUF3UlFJZ000Y0duTUNKV2R3QjM5MXg1YnVqNDhWVWgvdTJBdGYwOVpqd0xyejFCYTBDSVFDWktQOVAKMzFCWkdzNXZGSmt4dC9RZUc3RUhkQW4rbWNvSXh3eFhkbS9VU1E9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
  group: packages.operators.coreos.com
  groupPriorityMinimum: 2000
  service:
    name: packageserver-service
    namespace: openshift-operator-lifecycle-manager
    port: 5443
  version: v1
  versionPriority: 15
status:
  conditions:
  - lastTransitionTime: "2020-09-18T09:34:10Z"
    message: 'failing or missing response from https://10.129.0.21:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.129.0.21:5443/apis/packages.operators.coreos.com/v1: 401'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
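
For reference, the CA bundle the aggregator uses to verify the packageserver serving certificate can be decoded straight from the APIService object; a minimal sketch using standard `oc` and `openssl` (nothing OLM-specific assumed):

$ oc get apiservice v1.packages.operators.coreos.com -o jsonpath='{.spec.caBundle}' \
    | base64 -d | openssl x509 -noout -subject -issuer -dates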

$ oc get po -A -o wide | grep 10.129.0.21
openshift-operator-lifecycle-manager               packageserver-768769fcf4-xxhq2                                             1/1     Running     0          27h    10.129.0.21    ip-10-0-165-20.ap-northeast-2.compute.internal ...

$ oc get po -n openshift-operator-lifecycle-manager -o wide | grep packageserver
packageserver-768769fcf4-t5vrh      1/1     Running   0          27h   10.130.0.6    ip-10-0-205-215.ap-northeast-2.compute.internal ...
packageserver-768769fcf4-xxhq2      1/1     Running   0          27h   10.129.0.21   ip-10-0-165-20.ap-northeast-2.compute.internal ...

Both pods' logs contain many x509 errors like the one below:
$ oc logs -n openshift-operator-lifecycle-manager packageserver-768769fcf4-t5vrh
...
E0918 12:10:51.202746       1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I0918 12:10:51.202797       1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.072µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.129.0.1:51868":
...
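
The "certificate signed by unknown authority" on incoming requests points at client-certificate (front-proxy) authentication of the aggregator. As a cross-check, the requestheader client CA currently published to aggregated API servers can be inspected via the standard extension-apiserver-authentication configmap; a sketch (note that `openssl x509` only prints the first certificate of the bundle):

$ oc get cm -n kube-system extension-apiserver-authentication \
    -o jsonpath='{.data.requestheader-client-ca-file}' \
    | openssl x509 -noout -subject -dates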

The kube-apiserver logs contain many similar errors:
2020-09-18T12:13:21.247706425Z E0918 12:13:21.247654      17 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from https://10.129.0.21:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.129.0.21:5443/apis/packages.operators.coreos.com/v1: 401
...
2020-09-18T12:13:21.277274586Z E0918 12:13:21.277217      17 available_controller.go:437] v1.packages.operators.coreos.com failed with: failing or missing response from https://10.130.0.6:5443/apis/packages.operators.coreos.com/v1: bad status from https://10.130.0.6:5443/apis/packages.operators.coreos.com/v1: 401

The certificates have already been renewed, though:
$ oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+24hours" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before")  \(.metadata.annotations."auth.openshift.io/certificate-not-after")  \(.metadata.namespace)\t\(.metadata.name)"'
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-config-managed    kube-controller-manager-client-cert-key
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-config-managed    kube-scheduler-client-cert-key
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-apiserver-operator   aggregator-client-signer
2020-09-18T09:35:14Z  2020-09-18T21:35:13Z  openshift-kube-apiserver    aggregator-client
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-apiserver    check-endpoints-client-cert-key
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-apiserver    control-plane-node-admin-client-cert-key
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-apiserver    external-loadbalancer-serving-certkey
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-apiserver    internal-loadbalancer-serving-certkey
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-apiserver    kubelet-client
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-apiserver    kubelet-client-7
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-apiserver    localhost-serving-cert-certkey
2020-09-18T09:35:11Z  2020-09-18T21:35:12Z  openshift-kube-apiserver    service-network-serving-certkey
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-controller-manager   kube-controller-manager-client-cert-key
2020-09-18T09:35:12Z  2020-09-18T21:35:13Z  openshift-kube-scheduler    kube-scheduler-client-cert-key
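
The aggregator-client secret listed above holds the client certificate the kube-apiserver presents to aggregated API servers such as packageserver; as a sketch (assuming the usual tls.crt key in that secret), its issuer and validity can be dumped for comparison with whatever CA the packageserver pods still trust:

$ oc get secret -n openshift-kube-apiserver aggregator-client \
    -o jsonpath='{.data.tls\.crt}' | base64 -d \
    | openssl x509 -noout -issuer -dates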

Expected results:
4. There should be no abnormal cluster operators; all COs should report Available=True.

Additional info:

Comment 2 Stefan Schimanski 2020-09-18 12:54:34 UTC
The caBundle in the apiservice v1.packages.operators.coreos.com is (supposed to be) managed by the operator that owns the package apiserver. The kube-apiserver only matches it against the serving certificate returned on access. Moving over to OLM.
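
One hedged way to verify that match is to compare the decoded spec.caBundle (see the sketch in the description) with the serving certificate the packageserver actually uses, assuming the serving key pair lives in a secret named packageserver-service-cert with the usual tls.crt key (the exact name may differ by release):

$ oc get secret -n openshift-operator-lifecycle-manager packageserver-service-cert \
    -o jsonpath='{.data.tls\.crt}' | base64 -d \
    | openssl x509 -noout -issuer -dates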

