Bug 1777593 - Metrics endpoints of catalog-operator and olm-operator are potentially broken by service CA rotation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: Jeff Peeler
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On: 1771811
Blocks: 1775250
Reported: 2019-11-27 22:02 UTC by Evan Cordell
Modified: 2020-01-23 11:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1771811
Environment:
Last Closed: 2020-01-23 11:14:45 UTC
Target Upstream Version:
Embargoed:


Links
System: Github
ID: operator-framework operator-lifecycle-manager pull 1165
Private: 0
Priority: None
Status: closed
Summary: [release-4.3] bug 1777593: make certificate updates live upon update
Last Updated: 2020-01-23 20:29:54 UTC

Description Evan Cordell 2019-11-27 22:02:45 UTC
+++ This bug was initially created as a clone of Bug #1771811 +++

A serving cert supplied by the service CA operator appears to be used to secure the /metrics endpoints of catalog-operator and olm-operator. Neither operator appears to reload the key material if it were to change. When the serving cert is regenerated (i.e. when the service CA is rotated), the endpoints may cease to work until the operators are restarted.

The 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service CA operator.

Note that CA rotation can be manually triggered in any 4.x release by removing the signing secret. Automated rotation is likely to be introduced in a future z-stream release. 

References: 

Enhancement for automated service CA rotation: 

https://github.com/openshift/enhancements/blob/master/enhancements/automated-service-ca-rotation.md

Operator compatibility with service CA rotation:

https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit

Comment 2 Jian Zhang 2019-12-05 10:03:48 UTC
Hi, Jeff

I tested it in a cluster without the fix PR, but I couldn't reproduce the issue. Details as follows:
Cluster version is 4.3.0-0.nightly-2019-12-03-032607
The OLM version does not include the fix PR:
mac:~ jianzhang$ oc exec catalog-operator-6cfdcd86fd-xwpsh -- olm --version
OLM version: 0.13.0
git commit: ba10413e72cfe23724edc588ff25f36dfdbeb37e

1. Delete the olm-operator-serving-cert and catalog-operator-serving-cert secrets.
mac:~ jianzhang$ oc get secret
NAME                                          TYPE                                  DATA   AGE
builder-dockercfg-lv7jr                       kubernetes.io/dockercfg               1      23h
builder-token-2k476                           kubernetes.io/service-account-token   4      23h
builder-token-zj659                           kubernetes.io/service-account-token   4      23h
catalog-operator-serving-cert                 kubernetes.io/tls                     2      6m19s
default-dockercfg-kzn65                       kubernetes.io/dockercfg               1      23h
default-token-lbgz5                           kubernetes.io/service-account-token   4      23h
default-token-x554m                           kubernetes.io/service-account-token   4      23h
deployer-dockercfg-pdrmt                      kubernetes.io/dockercfg               1      23h
deployer-token-mclc7                          kubernetes.io/service-account-token   4      23h
deployer-token-q9jtd                          kubernetes.io/service-account-token   4      23h
olm-operator-serviceaccount-dockercfg-zqfnc   kubernetes.io/dockercfg               1      23h
olm-operator-serviceaccount-token-4vtnf       kubernetes.io/service-account-token   4      23h
olm-operator-serviceaccount-token-vgfxq       kubernetes.io/service-account-token   4      23h
olm-operator-serving-cert                     kubernetes.io/tls                     2      6m19s
v1.packages.operators.coreos.com-cert         kubernetes.io/tls                     2      23h

2. Forward the port to localhost.
mac:~ jianzhang$ oc port-forward catalog-operator-6cfdcd86fd-xwpsh 8081:8081
Forwarding from 127.0.0.1:8081 -> 8081
Forwarding from [::1]:8081 -> 8081
Handling connection for 8081

3. In another terminal, run `openssl s_client -connect`; it works well.
mac:~ jianzhang$ openssl s_client -connect localhost:8081
CONNECTED(00000005)
depth=1 CN = openshift-service-serving-signer@1575455354
verify error:num=19:self signed certificate in certificate chain
verify return:0
---
Certificate chain
 0 s:/CN=catalog-operator-metrics.openshift-operator-lifecycle-manager.svc
   i:/CN=openshift-service-serving-signer@1575455354
 1 s:/CN=openshift-service-serving-signer@1575455354
   i:/CN=openshift-service-serving-signer@1575455354
---
Server certificate
-----BEGIN CERTIFICATE-----
...
    Start Time: 1575538998
    Timeout   : 7200 (sec)
    Verify return code: 19 (self signed certificate in certificate chain)

4. Check the metrics in Prometheus; they work well. See the screenshot: https://user-images.githubusercontent.com/15416633/70224969-1cd68280-1789-11ea-8aa9-669a4c9c9f0d.png

So, what are the steps to reproduce this issue?

Comment 3 Jeff Peeler 2019-12-05 15:32:56 UTC
My reference to using openssl s_client was a pointer to get started, not the entire test itself. Without the PR, the original certificate will stay in use until the container is restarted. I don't see much value in testing anything without the PR, but if you really want to, you can verify that the certificate is still the same after you delete the certs in the OLM namespace.


With the PR, do something like this after you've set up the port forwarding you had before:

$ echo | openssl s_client -connect localhost:8081 2>&1 | sed --quiet '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > olm.crt
$ openssl x509 -in olm.crt -purpose -noout -text

Do the above before and after deleting the certificate in the OLM namespace. The result should be that the certificate is different and I assume the validity (not before / not after) will be slightly different too.
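
A minimal sketch of how the before/after comparison could be scripted, assuming the two captures were saved as PEM files with the s_client command above. The function names (cert_fingerprint, cert_dates, certs_differ) are illustrative and are not part of the PR; only standard `openssl x509` options are used.

```shell
# Print the SHA-256 fingerprint of the first certificate in a PEM file.
cert_fingerprint() {
    openssl x509 -in "$1" -noout -fingerprint -sha256
}

# Print the validity window (the notBefore= and notAfter= lines).
cert_dates() {
    openssl x509 -in "$1" -noout -dates
}

# Exit 0 when the two PEM files hold different certificates,
# 1 when they are identical, and 2 when either file cannot be parsed.
certs_differ() {
    before=$(cert_fingerprint "$1") || return 2
    after=$(cert_fingerprint "$2") || return 2
    [ "$before" != "$after" ]
}
```

For example, with hypothetical captures named olm-before.crt and olm-after.crt, `certs_differ olm-before.crt olm-after.crt && echo "serving cert changed"` reports whether the endpoint picked up a new certificate, and running `cert_dates` on each file shows whether the validity window moved.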

Comment 4 Jian Zhang 2019-12-06 06:53:55 UTC
Hi Jeff,

Many thanks for the information! I tested it in a cluster that includes the fix PR. Details as follows:
Cluster version is 4.3.0-0.nightly-2019-12-05-213858
mac:~ jianzhang$ oc exec catalog-operator-8fcc9bc76-bjzz6 -- olm --version
OLM version: 0.13.0
git commit: 7dfd4517e5368fa19c48dab9b9e126798f3c3f40


mac:~ jianzhang$ oc port-forward catalog-operator-8fcc9bc76-kvctw  8081:8081 
Forwarding from 127.0.0.1:8081 -> 8081
Forwarding from [::1]:8081 -> 8081
Handling connection for 8081
...
mac:~ jianzhang$ echo | openssl s_client -connect localhost:8081 2>&1 | gsed --quiet '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > olm4.crt

Delete only the secrets catalog-operator-serving-cert and olm-operator-serving-cert:
mac:~ jianzhang$ oc get secret
NAME                                          TYPE                                  DATA   AGE
builder-dockercfg-mpcgh                       kubernetes.io/dockercfg               1      42m
builder-token-4dpzn                           kubernetes.io/service-account-token   4      43m
builder-token-p8vm9                           kubernetes.io/service-account-token   4      43m
catalog-operator-serving-cert                 kubernetes.io/tls                     2      64s
default-dockercfg-4zkbl                       kubernetes.io/dockercfg               1      42m
default-token-dhdst                           kubernetes.io/service-account-token   4      51m
default-token-j2vdg                           kubernetes.io/service-account-token   4      43m
deployer-dockercfg-v979w                      kubernetes.io/dockercfg               1      42m
deployer-token-54pmq                          kubernetes.io/service-account-token   4      43m
deployer-token-tr248                          kubernetes.io/service-account-token   4      43m
olm-operator-serviceaccount-dockercfg-ldx5g   kubernetes.io/dockercfg               1      43m
olm-operator-serviceaccount-token-kbshw       kubernetes.io/service-account-token   4      43m
olm-operator-serviceaccount-token-knwvr       kubernetes.io/service-account-token   4      51m
olm-operator-serving-cert                     kubernetes.io/tls                     2      64s
v1.packages.operators.coreos.com-cert         kubernetes.io/tls                     2      47m

mac:~ jianzhang$ echo | openssl s_client -connect localhost:8081 2>&1 | gsed --quiet '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > olm5.crt

Check whether olm4.crt and olm5.crt are the same:
mac:~ jianzhang$ diff olm4.crt olm5.crt 
2c2
< MIIEVjCCAz6gAwIBAgIIKAf5qP8BYvcwDQYJKoZIhvcNAQELBQAwNjE0MDIGA1UE
---
> MIIEVjCCAz6gAwIBAgIIBZzzc3WJ7kYwDQYJKoZIhvcNAQELBQAwNjE0MDIGA1UE
4c4
< Fw0xOTEyMDYwNTM0NDRaFw0yMTEyMDUwNTM0NDVaMEwxSjBIBgNVBAMTQWNhdGFs
---
> Fw0xOTEyMDYwNTQzMDlaFw0yMTEyMDUwNTQzMTBaMEwxSjBIBgNVBAMTQWNhdGFs
6,14c6,14
< LW1hbmFnZXIuc3ZjMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAxxpz
< w7/wDb5aJGsu9cVGzF08wVpXMWYW6VfFa0oiipLO/RttLOgm8UUsjqgH+w/bwaCl
< X1zxdVBbpqvHX3NDxvb72GM24qhTKoWXuQX0Vt6pzn8vhzzvnzFcy4sjXx7fOmC2
< tc4b4dGiwmYh9hqy/Jtv19QTU7LI+/Prk+2oYe/fRK5PDH1UEFLWx3nfzmjstZGE
< 9aRnh5wTba2iCnmP8i/BYa9yVdt58Mb7touBA+/Nj3iTL0KgNBkJQLEoiIcmuE7C
< jgUQRMxRfRVVdXR7XMHrQerr96tajZwnSjbcM4SYEcigoRJVa+o/g019mRfktajH
< o8d+6fuf6AHt8uNA0QIDAQABo4IBUDCCAUwwDgYDVR0PAQH/BAQDAgWgMBMGA1Ud
< JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHQYDVR0OBBYEFO5v1gBY0qlr
< f59f05f4V6IEMAP/MB8GA1UdIwQYMBaAFNxpLrspblVMh04UbIaYlAYneWO2MIGf
---
> LW1hbmFnZXIuc3ZjMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsagD
> npKrLfRay3Re7RMBpRl4MtCoZrqR9I5Aps575G8k0uBGwXf2F4YURHjpXvD0zfly
> mbTy3U/oeStX+HDQ54mfLjDhGqkizpFmYHASwtqXdDxsrRbeGRKzWCYsYWaBZTAq
> KrniFtPCiAOCEAbJBvUmcv2ahR6CVajXNiUSz9j+ptPoGCyfpQ4CO1kSF6X0Y5Gy
> R8kTExhXua6bs30jpdhE9vcENpc8YjGrh/81HtMZRohwWyZNeAz3dwbIxuX1YfVB
> dz1AT9O5ebciy3cs4EaU5wr5bj6/63I4DF5rQa7NZJPLlCurBFLYpR5F4Mk0a1TD
> LQ3c4DQRM+6wgLSqmwIDAQABo4IBUDCCAUwwDgYDVR0PAQH/BAQDAgWgMBMGA1Ud
> JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHQYDVR0OBBYEFHAevST4r1DZ
> WJAsxGXaL/OxucL/MB8GA1UdIwQYMBaAFNxpLrspblVMh04UbIaYlAYneWO2MIGf
19,25c19,25
< NDllOS05ZDQ5LWEyMzI4ODc5NDM1YTANBgkqhkiG9w0BAQsFAAOCAQEAW3QGBOxR
< 7dzGifds6qnei4JjFx85Jgq6eLUKZSvz3RLfToKtWs96LCQIp0cxPdnJtFAzfzEO
< 3vk04ZXfgG2FnomlQ0h7SOZQH03+khwErVjIwfoHyHvVIzLXEI9p6yyHCWArkS3L
< YrIqbCMN+hP6BNi9+iFXRuF80H0POMwXIz96Sk6hOxZOqg6lb8NBiJusf2Av6Np0
< DduWZJC/Xef9paiDkLKzXJkginNNQ0MZCWnTgl5+weXJJYeQauk8zUyGunDu4Os6
< hSYi16xPKHryIlsWEPMnMdKlye8pn3UDT4E+5xKjBf26ML5kiPSYbCav/pt7olkF
< DbxbG5OYu9KKWQ==
---
> NDllOS05ZDQ5LWEyMzI4ODc5NDM1YTANBgkqhkiG9w0BAQsFAAOCAQEADUdpNgTW
> HjwfQorMzRKVMYdvSGC/Ku/SaSBJd65mbQFexNeYiloX+UcogM5IawFqDw6haK6m
> DJlG5hR+uBgdSIgSYlRvUPkLU/iRgtUXnMydb8OTOs3cxTFTEloaaA4BzJNz7qn8
> M0TggdR5jKDHa29h1IyO30jvQnz52mMpLfXt+QrRoWQ+Gs+Pv1mLjomMUkPcgxOS
> s5JKJ0AVcrEQmQbZPuLTmispVtZ3v1YD4mvI4Fc5HsMRXSQwIYVOioimC9ownK0n
> 6ldi9gDEPE/JjaDOj53McVP2TSnaEaGdDksVPei5Y45Y+MmrHqWlTIcKfnax53R+
> Ec7NcsKMB0QCPg==

They are different. LGTM, marking it as verified.

Comment 6 errata-xmlrpc 2020-01-23 11:14:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

