Bug 1771811 - Metrics endpoints of catalog-operator and olm-operator are potentially broken by service CA rotation [NEEDINFO]
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Evan Cordell
QA Contact: Bruno Andrade
Depends On:
Blocks: 1777593
Reported: 2019-11-13 03:15 UTC by Maru Newby
Modified: 2020-05-13 21:52 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1775250 1775253 1777593
Last Closed: 2020-05-13 21:52:42 UTC
Target Upstream Version:
mnewby: needinfo? (ecordell)


System ID Private Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 1151 0 'None' closed bug 1771811: make certificate updates live upon update 2020-05-12 16:00:30 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-13 21:52:44 UTC

Description Maru Newby 2019-11-13 03:15:56 UTC
A serving cert supplied by the service CA operator appears to be used to secure the /metrics endpoints of catalog-operator and olm-operator. Neither operator appears to reload the key material when it changes. When the serving cert is regenerated (i.e. when the service CA is rotated), the endpoints may cease to work until the operators are restarted.

The 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service CA operator.

Note that CA rotation can be manually triggered in any 4.x release by removing the signing secret. Automated rotation is likely to be introduced in a future z-stream release. 
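
For reference, a minimal sketch of the manual trigger mentioned above. It assumes the signing secret is named signing-key and lives in the openshift-service-ca namespace; neither name is stated in this report, so verify both against your release before running it. OC is a hypothetical override that allows the command to be previewed first.

```shell
# Sketch only: trigger a manual service CA rotation by deleting the signing secret.
# Assumption (not stated in this bug): the secret is "signing-key" in the
# "openshift-service-ca" namespace.
# OC is a hypothetical override so the command can be previewed, e.g. OC=echo.
rotate_service_ca() {
  "${OC:-oc}" delete secret signing-key -n openshift-service-ca
}
```

Running `OC=echo rotate_service_ca` prints the command instead of contacting the cluster, which is a cheap way to double-check the names before deleting anything.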


Enhancement for automated service CA rotation: 


Operator compatibility with service ca rotation:


Comment 3 Jeff Peeler 2019-12-04 15:34:05 UTC
To test this functionality, delete the olm-operator-serving-cert and catalog-operator-serving-cert secrets in the openshift-operator-lifecycle-manager namespace, wait for the service CA operator to regenerate them, and then confirm that the metrics are being served using the newly generated certificates. Metrics on both operators are served on port 8081. You might consider looking at just the certificates using openssl s_connect...

Comment 4 Jeff Peeler 2019-12-04 15:37:13 UTC
Sorry, that's "openssl s_client -connect"...
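
The "wait for the CA operator to regenerate the certificates" step from comment 3 could be scripted roughly as follows. This is only a sketch: wait_until and wait_for_serving_certs are hypothetical helper names, and the 120-second default is an arbitrary assumption. Note that the secrets reappearing only shows that the service CA operator regenerated them; whether the operators actually serve the new certificates still has to be checked with openssl s_client -connect.

```shell
# Sketch only: retry a command once per second until it succeeds or the
# limit (in seconds) is reached. wait_until is a hypothetical helper name.
wait_until() {
  limit=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Intended to run after the two serving-cert secrets have been deleted.
# Optional first argument: per-secret timeout in seconds (assumed default 120).
wait_for_serving_certs() {
  ns=openshift-operator-lifecycle-manager
  for name in olm-operator-serving-cert catalog-operator-serving-cert; do
    wait_until "${1:-120}" oc get secret "$name" -n "$ns" || return 1
  done
}
```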

Comment 5 Jian Zhang 2019-12-05 06:31:49 UTC
Changing the version to 4.4, since we already have bug 1777593 for 4.3.

Comment 6 Maru Newby 2019-12-05 09:04:19 UTC
Jeff's instructions are valid, though I'm not sure it's worth testing this functionality manually. My intention is to add a periodic rotation job that checks that metrics from all operators are collected after a combination of CA expiry and manual rotation:


Comment 10 Bruno Andrade 2020-01-24 18:24:35 UTC
Worked as expected: the certs were rotated. Marking as VERIFIED.

Steps used to reproduce:

OLM version: 0.13.0
git commit: 30838b7abce35c2d0d24bcf91596fc31db50755b
Cluster Version: 4.4.0-0.nightly-2020-01-24-113037

oc port-forward catalog-operator-6795f76457-6tn6g  -n openshift-operator-lifecycle-manager 8081:8081
Forwarding from 127.0.0.1:8081 -> 8081
Forwarding from [::1]:8081 -> 8081
Handling connection for 8081

echo | openssl s_client -connect localhost:8081 2>&1 | sed --quiet '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > olm4.crt 

Delete only the secrets catalog-operator-serving-cert and olm-operator-serving-cert:
oc delete secret olm-operator-serving-cert catalog-operator-serving-cert -n openshift-operator-lifecycle-manager
secret "olm-operator-serving-cert" deleted
secret "catalog-operator-serving-cert" deleted

echo | openssl s_client -connect localhost:8081 2>&1 | sed --quiet '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > olm5.crt 

diff olm4.crt olm5.crt                                                                                                    
< DSe3F5IBivhC/+H4ooz/TDZRX0eCoik0cU492w+bQ7dnjdlvsj0k2SmFltTgA2gP
< eMhz4YMoO1/T5qQ2HfxBtcJ3sronSgll+k/RPGHCb8JCqDd9hE82ITlLd7WqZuI8
< U1WrRG7JQNknk/+OgIoAMHNVmSlp2hJNbMx5pzcPMv5BvAiNa8FJ7/39yxdYaCGa
< EzEeWrezzn6H2xuX1oRUKoT67GZPM+ZkQZrl7PcSd+gTwtTbCaofsER5CG+4ydJ4
< nIMRKvALBlnbPNA9pJCjVc4nJrGgNVlcNskxluJ5XWtuYGUcTzLQ0M+lZ5Ztxf2S
> aKHd/nd+pSkmJuCXSh6FUbW5tssmgGVj+ICPxCbJLDaHAf3xmG8nx5p8sWTAj1bJ
> LvW4BZ394lV4YCYvr97WpE5AVJZ0R9/xcFLb3XOkClOy2GxfuFW3SS4zXjHZHaNl
> 03YTZoYT7JhacNhcUgKxDdmTvfkIbOlTvHicx5dhyN9ObnAzEZ72ZDnVPWAL/5YR
> 8Aii4CWhT94gIfs8GAcuxmxnfA7PBTrMYOmNh/6GHa2cjG8fHLOBOkGOPwwm2KsT
> O+WzxV0T9/tvc6im/w0RZ0GhBzXIrF9AGYQ+fPt7zrKiVYLEc0GkwwwAGcyfstQa
< QPfNk6gGvmQtdXxSfHAvfex1SE2bM+KLyDcpkHDFOHiTOxGD/jseEtyznzRcr/fF
< yCSJQxQnAaUi7Wq5JP/5f65AkcLdQDv/HLqm2awQsghKSS7Zb9CN2/vEWyeKyNVX
< KYCkxR4obomrVhYF43l85VU/FI2cqBDXcxo3mFaPrJtFvUdtwwwGXVVicKYy97Xo
< 6IDsT1pc8hT7xRFs81uVSo4zkcQ4VRwMmbyCkIaZYZfnbX4dzW50ydYLJk4vjg6D
< GhVPQVxN88koTAWszvlAvXDXn+rn1WYKuxFNnjYm+63+0VTqt6M8C0GWT4GjuiZd
< jrQnzAFU1DTEDQ==
> quIt7eD3xI6dNi3Nb1lcRInI+c0y0RSlNwuNDdIRk5BOFF3p6eXD9rf3TnmoyqcT
> Q1dMCDUjIKQwPb2L4Z1Ok3cV5H8QwK3YIxmcqyBT5qM3kJKRWbsxGXTd5pxT/ZTK
> qbvj7l6BHQYSiZsjbm0pWH8fOoIeQo6YvpvJkycSm9lZVTwnTS5tMq19dcalCw6t
> O5CnDdLjVgb5Zo+yc9nczdtKjK930PiA2+/wi2dxS6JcCFHMuNKJUkz13D7l/UmN
> LcOS/Mtan+az3pa565tfhP1AM/N3G4LTGhDPLI540h5Aaoy1p+RQKg3Ok5DHyIwT
> NwH66/C4SrDrqg==
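
The before/after comparison above could be wrapped in a few helpers, sketched below. The function names are hypothetical, and the sketch assumes the port-forward to 8081 is still running. It also guards against one pitfall of a plain diff: if the connection fails, both files are empty and diff reports no difference.

```shell
# Keep only the PEM certificate block from the text on stdin.
extract_pem() {
  sed -n '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p'
}

# Save the certificate presented at host:port ($1) to a file ($2).
# Assumes something is listening there, e.g. the port-forward to 8081.
fetch_serving_cert() {
  openssl s_client -connect "$1" </dev/null 2>&1 | extract_pem > "$2"
}

# Succeed only if both files are non-empty and their contents differ.
cert_rotated() {
  [ -s "$1" ] && [ -s "$2" ] && ! cmp -s "$1" "$2"
}

# Intended use (not executed here):
#   fetch_serving_cert localhost:8081 before.crt
#   ... delete the serving-cert secrets and wait for regeneration ...
#   fetch_serving_cert localhost:8081 after.crt
#   cert_rotated before.crt after.crt && echo "serving cert was rotated"
```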

Comment 12 Maru Newby 2020-02-17 20:55:10 UTC
I'm in the process of backporting CA rotation to 4.2 and 4.3. Would it make sense for the fix for this BZ to be backported as well, to preclude failure of metrics collection in the event that the operator is not restarted after CA rotation and before expiry of the pre-rotation CA?

Comment 14 errata-xmlrpc 2020-05-13 21:52:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

