Bug 1866818 - Upgrades from 4.5.2 and later fail with RouterCertsDegraded
Summary: Upgrades from 4.5.2 and later fail with RouterCertsDegraded
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Standa Laznicka
QA Contact: pmali
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-06 13:22 UTC by Martin Gencur
Modified: 2020-09-15 12:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments
authentication-operator-log-before-pod-restart.log (31.89 KB, text/plain)
2020-08-06 13:22 UTC, Martin Gencur
Install config (5.58 KB, text/plain)
2020-08-06 15:53 UTC, Martin Gencur


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-authentication-operator pull 330 None closed Bug 1866818: routercerts: always trust the default ingress ca 2020-09-13 02:26:22 UTC

Description Martin Gencur 2020-08-06 13:22:41 UTC
Created attachment 1710654
authentication-operator-log-before-pod-restart.log

Description of problem:

Upgrading from OCP 4.5.2, 4.5.3 or 4.5.4 to a later version fails. `oc get clusterversion` returns:

'Unable to apply 4.5.4: the cluster operator authentication is degraded'

'Cluster operator authentication is reporting a failure: RouterCertsDegraded:
        secret/v4-0-config-system-router-certs.spec.data[apps.ocf-rollup-55-rolling-upgrade.openshift-aws.rhocf-dev.net]
        -n openshift-authentication: certificate could not validate route hostname
        oauth-openshift.apps.ocf-rollup-55-rolling-upgrade.openshift-aws.rhocf-dev.net:
        x509: certificate signed by unknown authority'
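
For triage, the failing condition can be pulled out of the operator's status programmatically. The following is a minimal Python sketch, assuming the object comes from `oc get clusteroperator authentication -o json`. The `sample` object is a hypothetical, trimmed-down stand-in modeled on the error above, not output captured from the affected cluster.

```python
def degraded_conditions(cluster_operator):
    """Return the status conditions that report Degraded=True."""
    status = cluster_operator.get("status") or {}
    conditions = status.get("conditions") or []
    return [
        c for c in conditions
        if c.get("type") == "Degraded" and c.get("status") == "True"
    ]


# Hypothetical sample; the real object carries many more fields.
sample = {
    "kind": "ClusterOperator",
    "metadata": {"name": "authentication"},
    "status": {
        "conditions": [
            {"type": "Available", "status": "True", "message": ""},
            {
                "type": "Degraded",
                "status": "True",
                "message": "RouterCertsDegraded: x509: certificate signed by unknown authority",
            },
        ]
    },
}

for condition in degraded_conditions(sample):
    print(condition["message"])
```

In practice the dictionary would be loaded with `json.load` from the command's output rather than written by hand.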


How reproducible:

Steps to Reproduce:
1. Install 4.5.2 or later
2. Try to upgrade via `oc adm upgrade`

Actual results:

The upgrade gets stuck at "Unable to apply 4.5.4: the cluster operator authentication is degraded"

Expected results: 

Upgrade goes smoothly.

Additional info:

Attaching the logs from the authentication-operator in the openshift-authentication-operator namespace before I restarted the pod.
Attaching the output of must-gather after restarting the authentication-operator pod (which didn't help).

Comment 1 Martin Gencur 2020-08-06 13:42:14 UTC
The output of must-gather can be found here: https://drive.google.com/file/d/13ue0MiB1knEZZAP_TVRnqrGBoj_QWneq/view?usp=sharing (please let me know if there's a problem with permissions)

Comment 2 Martin Gencur 2020-08-06 15:53:53 UTC
Created attachment 1710683 [details]
Install config

We set up a custom certificate for the Ingress by following the docs at https://docs.openshift.com/container-platform/4.5/networking/configuring-a-custom-pki.html . I'm attaching the install-config.

Comment 4 Miciah Dashiel Butler Masters 2020-08-20 02:03:34 UTC
The attached install-config.yaml specifies an "additionalTrustBundle" stanza; however, it does not specify a "proxy" stanza, and therefore the installer does not configure the proxy.  Note that the "spec.trustedCA.name" field in the cluster proxy config is blank:

    % cat cluster-scoped-resources/config.openshift.io/proxies/cluster.yaml
    ---
    apiVersion: config.openshift.io/v1
    kind: Proxy
    metadata:
      creationTimestamp: "2020-08-06T10:08:25Z"
      generation: 1
      managedFields:
      - apiVersion: config.openshift.io/v1
        fieldsType: FieldsV1
        fieldsV1:
          f:spec:
            .: {}
            f:trustedCA:
              .: {}
              f:name: {}
          f:status: {}
        manager: cluster-bootstrap
        operation: Update
        time: "2020-08-06T10:08:26Z"
      name: cluster
      resourceVersion: "490"
      selfLink: /apis/config.openshift.io/v1/proxies/cluster
      uid: f7d34432-06fa-4d2c-924a-aa34b837aa81
    spec:
      trustedCA:
        name: ""
    status: {}
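
The same check can be scripted against a must-gather dump. This is a rough Python sketch that assumes the proxy config has already been converted to a dictionary (for example with the `yaml2json` helper used in this comment); `trusted_ca_name` is a hypothetical helper name, not part of any OpenShift tooling.

```python
def trusted_ca_name(proxy):
    """Return spec.trustedCA.name from a Proxy object, or "" if it is unset."""
    spec = proxy.get("spec") or {}
    trusted_ca = spec.get("trustedCA") or {}
    return trusted_ca.get("name") or ""


# Trimmed copy of the proxy config shown above.
proxy = {
    "apiVersion": "config.openshift.io/v1",
    "kind": "Proxy",
    "metadata": {"name": "cluster"},
    "spec": {"trustedCA": {"name": ""}},
    "status": {},
}

if not trusted_ca_name(proxy):
    print("spec.trustedCA.name is blank")
```

The helper treats a missing `spec`, a missing `trustedCA`, and an empty `name` the same way, since all three mean that no extra bundle is referenced.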

The installer did put the signing certificate in the "user-ca-bundle" configmap, and the default ingresscontroller is correctly configured to use a custom certificate that is signed using this signing certificate:

    % yaml2json() { python -c 'import json,sys,yaml;json.dump(yaml.safe_load(sys.stdin.read()),sys.stdout)' }
    % secret_name=$(cat namespaces/openshift-ingress-operator/operator.openshift.io/ingresscontrollers/default.yaml | yaml2json | jq -r .spec.defaultCertificate.name)
    % cat namespaces/openshift-ingress/core/secrets.yaml | yaml2json | jq -r ".items|.[]|select(.metadata.name==\"$secret_name\").data[\"tls.crt\"]" | base64 -d > tls.crt
    % cat namespaces/openshift-config/core/configmaps.yaml | yaml2json | jq -r '.items|.[]|select(.metadata.name=="user-ca-bundle").data["ca-bundle.crt"]' > user-ca-bundle.crt
    % openssl verify -verbose -CAfile user-ca-bundle.crt tls.crt
    tls.crt: OK

However, the authentication operator does not use the "user-ca-bundle" configmap; rather, it uses the "trusted-ca-bundle" configmap, which is missing the signing certificate:

    % cat namespaces/openshift-config-managed/core/configmaps.yaml | yaml2json | jq -r '.items|.[]|select(.metadata.name=="trusted-ca-bundle").data["ca-bundle.crt"]' > trusted-ca-bundle.crt
    % openssl verify -verbose -CAfile trusted-ca-bundle.crt tls.crt
    tls.crt: C = US, ST = North Carolina, L = Raleigh, O = Red Hat Inc., OU = RHOSS-QE, CN = ocf-rollup-55-rolling-upgrade
    error 20 at 0 depth lookup:unable to get local issuer certificate
    zsh: exit 2     openssl verify -verbose -CAfile trusted-ca-bundle.crt tls.crt
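
A quick way to see which certificates never made it into the merged bundle is to compare the PEM blocks of the two files. The sketch below is only a textual comparison under that assumption; it does not validate a chain the way `openssl verify` does. The certificate bodies are placeholders, not real certificates.

```python
import re

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def pem_certificates(bundle):
    """Return the set of certificate blocks in a PEM bundle, whitespace removed."""
    return {"".join(block.split()) for block in PEM_CERT.findall(bundle)}


def missing_certificates(user_bundle, trusted_bundle):
    """Return certificates present in user_bundle but absent from trusted_bundle."""
    return pem_certificates(user_bundle) - pem_certificates(trusted_bundle)


# Placeholder blocks standing in for user-ca-bundle.crt and trusted-ca-bundle.crt.
signer = "-----BEGIN CERTIFICATE-----\nU0lHTkVS\n-----END CERTIFICATE-----\n"
system = "-----BEGIN CERTIFICATE-----\nU1lTVEVN\n-----END CERTIFICATE-----\n"

# Signer not merged: one certificate is reported as missing.
print(len(missing_certificates(signer, system)))
# Signer merged into the trusted bundle: nothing is missing.
print(len(missing_certificates(signer, system + signer)))
```

With the files extracted above, the two arguments would be the contents of `user-ca-bundle.crt` and `trusted-ca-bundle.crt`.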

When you install a new cluster, be sure to specify a nonempty "proxy" stanza, which should cause the installer to configure cluster proxy config with "spec.trustedCA.name" set to "user-ca-bundle".  With the proxy so configured, cluster-network-operator should merge the certificate in the "user-ca-bundle" configmap into the "trusted-ca-bundle" configmap, which should resolve the problem.

To fix the cluster after installation, it should suffice to patch the cluster proxy config as follows:

    oc patch proxies.config.openshift.io/cluster --type=merge --patch='{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'

Does that resolve the issue?
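
To illustrate what the patch above changes, here is a small Python sketch of JSON merge patch semantics (RFC 7386), which is what `--type=merge` applies. The `proxy` dictionary is a trimmed copy of the config shown earlier; the function is an illustration, not the API server's implementation.

```python
import copy


def json_merge_patch(target, patch):
    """Apply an RFC 7386 merge patch and return the result; inputs are not mutated."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return copy.deepcopy(patch)
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in a merge patch deletes the key.
            result.pop(key, None)
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result


proxy = {"spec": {"trustedCA": {"name": ""}}, "status": {}}
patch = {"spec": {"trustedCA": {"name": "user-ca-bundle"}}}

patched = json_merge_patch(proxy, patch)
print(patched["spec"]["trustedCA"]["name"])
```

Only the `name` field is overwritten; sibling fields such as `status` are left as they were.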

Comment 5 Martin Gencur 2020-08-20 16:09:17 UTC
Thanks, Miciah!
Patching the cluster proxy helped and I was able to perform a cluster upgrade.

Comment 6 Martin Gencur 2020-08-20 16:20:05 UTC
However, I don't want to be configuring the cluster proxy so I'm wondering what could be the "nonempty" proxy stanza in my case so that it doesn't break other functionality.
Would it be just something like this?
```
proxy:
  noProxy: example.com
```

Comment 7 Miciah Dashiel Butler Masters 2020-08-21 16:41:56 UTC
(In reply to Martin Gencur from comment #6)
> However, I don't want to be configuring the cluster proxy so I'm wondering
> what could be the "nonempty" proxy stanza in my case so that it doesn't
> break other functionality.
> Would it be just something like this?
> ```
> proxy:
>   noProxy: example.com
> ```

I was going to suggest using "proxy: {}", but the installer does not allow this; when I tried specifying "proxy: {}" in install-config.yaml, the installer reported, "invalid 'install-config.yaml' file: proxy: Required value: must include httpProxy or httpsProxy".  

I think the installer should be fixed, either (1) to allow specifying "proxy: {}" or (2) to set spec.trustedCA in proxy.config/cluster if install-config.yaml specifies additionalTrustBundle irrespective of whether install-config.yaml specifies proxy.  I'll open a PR to do (2).  With this change, the install-config.yaml that you provided would work as is.  
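
The difference between the current behavior and option (2) can be sketched as follows. This is Python pseudologic for illustration only: `planned_trusted_ca` and its `proxy_required` flag are hypothetical names, and the code is not taken from the installer.

```python
def planned_trusted_ca(install_config, proxy_required):
    """Return the name to write into spec.trustedCA.name, or "" to leave it blank.

    proxy_required=True models the behavior described in this bug (a proxy
    stanza is needed); proxy_required=False models option (2).
    """
    has_bundle = bool(install_config.get("additionalTrustBundle"))
    has_proxy = bool(install_config.get("proxy"))
    if has_bundle and (has_proxy or not proxy_required):
        return "user-ca-bundle"
    return ""


# install-config with a trust bundle but no proxy stanza, as in this report.
# The bundle value is a placeholder.
install_config = {"additionalTrustBundle": "<PEM bundle>"}

print(repr(planned_trusted_ca(install_config, proxy_required=True)))
print(repr(planned_trusted_ca(install_config, proxy_required=False)))
```

Under option (2) the reporter's install-config would yield "user-ca-bundle" without any proxy settings.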

We'll try to get this fixed in the upcoming sprint.

Comment 8 Miciah Dashiel Butler Masters 2020-08-21 19:40:38 UTC
On further investigation, the authentication operator should not be using the trusted CA bundle to validate the router's default certificate; instead, it should trust the default certificate implicitly (as described in https://github.com/openshift/enhancements/blob/master/enhancements/network/default-ingress-cert-configmap.md).  

However, the authentication operator does check that the certificate in the "default-ingress-cert" configmap matches the certificate in the "router-certs" secret.  The must-gather archive includes the configmap but not the secret; would you be able to share the secret?  `oc -n openshift-config-managed get secrets/router-certs`
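
For anyone who wants to compare the two objects by hand, the following Python sketch decodes a secret value and checks whether it shares a certificate with the configmap's PEM text. "Shares at least one certificate" is an assumed reading of "matches"; the operator's actual comparison may differ. The PEM bodies are placeholders.

```python
import base64
import re

# Matches any PEM block and captures its label (e.g. CERTIFICATE) and body.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----(.*?)-----END \1-----", re.DOTALL
)


def pem_blocks(text):
    """Return (label, body) pairs for each PEM block, whitespace removed from bodies."""
    return [(label, "".join(body.split())) for label, body in PEM_BLOCK.findall(text)]


def certificates(text):
    """Return only the CERTIFICATE bodies, skipping private keys and other blocks."""
    return [body for label, body in pem_blocks(text) if label == "CERTIFICATE"]


def secret_shares_cert_with_configmap(secret_value_b64, configmap_pem):
    """True if the base64-encoded secret value and the configmap share a certificate."""
    secret_pem = base64.b64decode(secret_value_b64).decode("ascii")
    return bool(set(certificates(secret_pem)) & set(certificates(configmap_pem)))


# Placeholder data: the secret value holds a certificate followed by its key.
cert = "-----BEGIN CERTIFICATE-----\nQ0VSVA==\n-----END CERTIFICATE-----\n"
key = "-----BEGIN RSA PRIVATE KEY-----\nS0VZ\n-----END RSA PRIVATE KEY-----\n"
secret_value = base64.b64encode((cert + key).encode("ascii")).decode("ascii")

print(secret_shares_cert_with_configmap(secret_value, cert))
```

The secret value would be the entry under `data` keyed by the apps domain, and the configmap text would come from the "default-ingress-cert" configmap.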

Comment 9 Martin Gencur 2020-08-24 14:23:30 UTC
I'm not sure this will help you because the original cluster was killed a long time ago. However, if I use the same install-config and start a new cluster the secret looks like this:
oc -n openshift-config-managed get secrets/router-certs -oyaml
apiVersion: v1
data:
  apps.ocf-rollup-139-rolling-upgrade.openshift-aws.rhocf-dev.net: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZQekNDQkNlZ0F3SUJBZ0lRRi9rM2lYRElYWmhQSzlUWllBeUdtVEFOQmdrcWhraUc5dzBCQVFzRkFEQ0IKb2pFTE1Ba0dBMVVFQmhNQ1ZWTXhGekFWQmdOVkJBZ1REazV2Y25Sb0lFTmhjbTlzYVc1aE1SQXdEZ1lEVlFRSApFd2RTWVd4bGFXZG9NUlV3RXdZRFZRUUtFd3hTWldRZ1NHRjBJRWx1WXk0eEVUQVBCZ05WQkFzVENGSklUMU5UCkxWRkZNVDR3UEFZRFZRUURFelZQY0dWdVUyaHBablFnVTJWeWRtVnliR1Z6Y3lBb1VraFBVMU1wSUZGRklFTmwKY25ScFptbGpZWFJsSUVGMWRHaHZjbWwwZVRBZUZ3MHlNREE0TWpReE1URTVNRFJhRncweU1EQTVNRGN4TVRFNQpNRFJhTUlHTE1Rc3dDUVlEVlFRR0V3SlZVekVYTUJVR0ExVUVDQk1PVG05eWRHZ2dRMkZ5YjJ4cGJtRXhFREFPCkJnTlZCQWNUQjFKaGJHVnBaMmd4RlRBVEJnTlZCQW9UREZKbFpDQklZWFFnU1c1akxqRVJNQThHQTFVRUN4TUkKVWtoUFUxTXRVVVV4SnpBbEJnTlZCQU1USG05alppMXliMnhzZFhBdE1UTTVMWEp2Ykd4cGJtY3RkWEJuY21GawpaVENDQVNJd0RRWUpLb1pJaHZjTkFRRUJCUUFEZ2dFUEFEQ0NBUW9DZ2dFQkFPTXh2dzhTSXN5NCs2YXdneEJFCjBHNUdXYldlaWVOVnlPQVpWcVM0QmRhM0hyK3RQc29xNnZBWCtZYkRhaVl4TGcyVGlWejN3cGdVVnhud0hOR0IKSlZ1SjBPQ3ZWTmVQSlRzV3FQcUM2OGJ2UWI0bzBGTkI4aWlaay9tWEFtRytLNXZZSnpWNlBPdUJYeVA0Y1VLcQplTXRsVlY2Sm5Ydzh5YVdId3d6bTQ0VkdXbEdyOWc1UEJzRzZhbWlOOUpoRkZzZlkxYkFZYURQbjdmVjkrRFZkCnJCWDMwZ3FFSkVDckszRVVLZEExS0QrcFA3K0xVT2hkME9aZHk3Y2Z0bWZheFVaTzdBSm1LaUZuQlJaN1lyMVoKdzFDc2N2MmtiVGJHeGlRY2k1a3o1MkIzM0pZS2xMZnBUT0wwRy9yWmdGay9kZ1BPa2w0M1JyVHoxNEFFS1FwSgozT2NDQXdFQUFhT0NBWVF3Z2dHQU1BNEdBMVVkRHdFQi93UUVBd0lGb0RBZEJnTlZIU1VFRmpBVUJnZ3JCZ0VGCkJRY0RBUVlJS3dZQkJRVUhBd0l3SFFZRFZSME9CQllFRkxRZ1lSNExkSWQ2TFJwZXRqa0I3R0JVc1p3OU1COEcKQTFVZEl3UVlNQmFBRkppcTh1cmZlTFZ3cmZwRVN1NTRxWmRzbWE5T01JSUJEUVlEVlIwUkJJSUJCRENDQVFDQwpPbTlqWmkxeWIyeHNkWEF0TVRNNUxYSnZiR3hwYm1jdGRYQm5jbUZrWlM1dmNHVnVjMmhwWm5RdFlYZHpMbkpvCmIyTm1MV1JsZGk1dVpYU0NQbUZ3YVM1dlkyWXRjbTlzYkhWd0xURXpPUzF5YjJ4c2FXNW5MWFZ3WjNKaFpHVXUKYjNCbGJuTm9hV1owTFdGM2N5NXlhRzlqWmkxa1pYWXVibVYwZ2o5aGNIQnpMbTlqWmkxeWIyeHNkWEF0TVRNNQpMWEp2Ykd4cGJtY3RkWEJuY21Ga1pTNXZjR1Z1YzJocFpuUXRZWGR6TG5Kb2IyTm1MV1JsZGk1dVpYU0NRU291CllYQndjeTV2WTJZdGNtOXNiSFZ3TFRFek9TMXliMnhzYVc1bkxYVndaM0poWkdVdWIzQmxibk5vY
VdaMExXRjMKY3k1eWFHOWpaaTFrWlhZdWJtVjBNQTBHQ1NxR1NJYjNEUUVCQ3dVQUE0SUJBUUNPY0xKTmNZY2c1Tkc3YXlJTQpoU21nTlU0Yk1jZFROU2lCUmV5eEMrV3lyeVJQdXFMdXY0NnIzeXpaNTVOR1o3TDZPT3NmdVhRdThqd2NzRktyCmNDNStqQU92Z0VEcThocVJSdkQ1OE1LMTZtaExsZnlIUEdxMklPaFpneTV6aXR1Q1FuaWdHTks4T0NYLy94U3QKY3RZUWlIa3RQcmJ6RU1mSitlWlRFWlkrVmlYSUc4eEhjUzFJZloxNHJiVzJCRUlyNTNvRUFacHdNWWhnTnFCZQp0ZGFBMlZDRXAydmd0OWZ0OG9xd2hNR1I5TmFoR0JleVVTcFdCdHhMOFpQQW5oS21WV1YvMGRkeDMralkvc2ZXCkVSTVI3WFQzMFZzNGZBVndjVndWRjd1UUE2VmpCSVZVQmptcDU1NjA1MzBPS3dFbkliVS8vOG9mUzIreGRCZEoKMzEyWQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCi0tLS0tQkVHSU4gUlNBIFBSSVZBVEUgS0VZLS0tLS0KTUlJRXBBSUJBQUtDQVFFQTR6Ry9EeElpekxqN3ByQ0RFRVRRYmtaWnRaNko0MVhJNEJsV3BMZ0YxcmNldjYwKwp5aXJxOEJmNWhzTnFKakV1RFpPSlhQZkNtQlJYR2ZBYzBZRWxXNG5RNEs5VTE0OGxPeGFvK29Mcnh1OUJ2aWpRClUwSHlLSm1UK1pjQ1liNHJtOWduTlhvODY0RmZJL2h4UXFwNHkyVlZYb21kZkR6SnBZZkRET2JqaFVaYVVhdjIKRGs4R3dicHFhSTMwbUVVV3g5alZzQmhvTStmdDlYMzROVjJzRmZmU0NvUWtRS3NyY1JRcDBEVW9QNmsvdjR0UQo2RjNRNWwzTHR4KzJaOXJGUms3c0FtWXFJV2NGRm50aXZWbkRVS3h5L2FSdE5zYkdKQnlMbVRQbllIZmNsZ3FVCnQrbE00dlFiK3RtQVdUOTJBODZTWGpkR3RQUFhnQVFwQ2tuYzV3SURBUUFCQW9JQkFHN0xXN2tseHdLL1V6bSsKNnF1TVkzampwZXdFSElwWTAxVTJCaUxkK3pyeW9uUW5NRysyN2t1WDVYL3EzR0V6cXBuRVVVQ2RNckNuZXJLVApmZnBOV01LRE92SFhqekJ3Qm1BQ2RQVjEwelY0aUQ4TCtFd2g1TTRYMXlub2txakg3TXhiWlFPWFVRNG9VUlZoCm14by91Qmk0bWlFNFN5ekRHRE01T2MyWTYydWFNQmYvcWFud2E1dWlNMEdlYVJYRzQ2OCsyYWJsa0xRbFV2eXIKRmQ1WnB4SzZxNUc5bHBmeEZyaVI0MlkvenRWR1N0MG00WHl2eE0xQjJUZzNGVmlmVjRTZzVLeGJiVlhHaHVnUgplb0dzR0VBTGt6TVBXT2VUb1JmWXVDVlpaZzlsbnhjMFo3LysrTUxOUTAzbTBKdWdCVXhOeVlRcTdDeTl2L0RvCnFMek9mSUVDZ1lFQSt4cFl0K2JINC93TlVQL09sQ2pPOTMycXVvTWl5TnFVOGcvTFNmcnk5blRMbkpxcXh0d0gKbmltSjJ5dUFtcHoxMUJoRUhPd1F3ZnI2bE96ODFlUC9nM09DdkM3a3hzK1B0dmsrcW01bXRyL1Y1WFJnNThoOQpGN1VWNVVjcVFpS1I2RVJvNTJWOW5LQXRoUUhpalc3c2lxbERBMzh6clR2L2xRVGh5R2doZ3M4Q2dZRUE1NkFJCnRFNG14Mnp4V252azUrNW5nQkRPNEQxdkw0UDBCaks4MlZlMGpPZFRrVXJkbDJmSUJpOXc5cmx4Q1VUNWhlWVEKQ245RlN1d2FibVVEOGlPeTRySmVxZHRTaUNmM1BseWIvSWRKZWsxRVhEU1g4bkNPOGxTc3JtajNTby9rTEJnU
ApReXVJeGxxeGhjWVJiUWhPOEdHaUlib2g2Qm40MWdFT2tFUG02bWtDZ1lCMjBrelJHUi9Wdmx2K3pFM1F4azdKCnhtbVh3SjRoTlczdDdaTmcrcU1tQkxhazhIdUhobThFWk51YkhzYklZeVhncTJydjFMVkpWWjVtQW83U0dBVzkKQ2xmKy9LRzlnbEtiWHU1TWI5bWkrTHdhekN0ZkF2eE96NTRBMU9BbVUzMS96Mzlrb0I0RWs3ZDJqU0hMazRYVApSNjB5Wm1ycHVzNkNrY0RWdUpEQytRS0JnUUM3VU9pNUtCcWtYSzR6QnM3djRoVkJ0RllaY3BWZ1Q4NGcxUmQwCmpVRXVVa1Y2MHBpeHdQUTZURk9HdENGOTVaSUZmekNwekpNMUxBdVVDNDFOWFNGbHcrcGFZMHd6WUY3S3lBbysKQndxZEphK0xBZDEvNnhjdlV0cnprVitycFFKWnhudFJUdnVscmVLeTFLTnpFYTBGS1cvODVwSlZLZXZhNWEvcAphNEJyUVFLQmdRRGliV2JNdkJ6MjlxOFQrRjVVUWhPRjEreHl4dkdkT1lhc2ExcGw0OFl0R0IxTXdaMXUvNFdBCmRyWW5jNGEzUnlkbHRtSGsvcnMrditzNlppOG5PbVRlMU1ET0JtalpkQWVIZXU4V3BIdVVGaGZjOVgwa05MdTQKQk5wVElLcGlvYmxXcjFXcjdxdUZqU1dYNExEZWtydysrS1lMYTRjS0FNL0ZMSWRIL0xOTzNnPT0KLS0tLS1FTkQgUlNBIFBSSVZBVEUgS0VZLS0tLS0K
kind: Secret
metadata:
  creationTimestamp: "2020-08-24T11:37:45Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:apps.ocf-rollup-139-rolling-upgrade.openshift-aws.rhocf-dev.net: {}
      f:type: {}
    manager: ingress-operator
    operation: Update
    time: "2020-08-24T11:50:37Z"
  name: router-certs
  namespace: openshift-config-managed
  resourceVersion: "24232"
  selfLink: /api/v1/namespaces/openshift-config-managed/secrets/router-certs
  uid: c00c811e-a5d9-4aa2-af81-53cfb535343a
type: Opaque

Comment 10 Abhinav Dahiya 2020-08-31 17:35:22 UTC
The attached PR is on the authentication operator so moving to that component.

