Bug 1771783 - Template router metrics listener is potentially broken by service CA rotation [NEEDINFO]
Summary: Template router metrics listener is potentially broken by service CA rotation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ---
: 4.3.z
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1809398
TreeView+ depends on / blocked
 
Reported: 2019-11-12 23:42 UTC by Maru Newby
Modified: 2020-03-03 04:20 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1809398 (view as bug list)
Environment:
Last Closed: 2020-01-23 11:12:22 UTC
Target Upstream Version:
mnewby: needinfo? (dmace)


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift router pull 61 'None' closed Bug 1771783: Dynamically reload metrics certificate/key file changes 2020-05-26 19:25:13 UTC
Red Hat Product Errata RHBA-2020:0062 None None None 2020-01-23 11:12:32 UTC

Description Maru Newby 2019-11-12 23:42:10 UTC
The template router controller is supplied with the names of the serving cert and key files via env var: 

ROUTER_METRICS_TLS_CERT_FILE, ROUTER_METRICS_TLS_KEY_FILE

The key material is loaded when the router runs and used to configure the metrics listener:

https://github.com/openshift/router/blob/master/pkg/cmd/infra/router/template.go#L420

These vars are set in the deployment to a serving cert and key injected by the service ca controller: 

https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/deployment.go#L241

In the event of the serving cert and key changing after service CA rotation, the operator (or router) should be capable of automatically detecting the change and refreshing to use the latest key material. The 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service ca operator.

Note that CA rotation can be manually triggered in any 4.x release by removing the service CA signing secret. Automated rotation is likely to be introduced in a future z-stream release. 

Reference: 

Enhancement for automated service CA rotation: 

https://github.com/openshift/enhancements/blob/master/enhancements/automated-service-ca-rotation.md

Operator compatibility with service ca rotation:

https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit

Comment 1 Dan Mace 2019-11-18 17:55:16 UTC
Miciah wants to try a kube-rbac-proxy approach to replace my implementation in https://github.com/openshift/router/pull/61, which I think is a good idea. I'm going to re-assign and Miciah will supersede my PR with a new one.

Comment 2 Dan Mace 2019-11-18 20:12:12 UTC
Further discussion revealed that the kube-rbac-proxy approach is incompatible with HostNetwork-exposed ingresscontrollers, so we're going back to my solution for now.

Comment 4 Hongan Li 2019-11-22 06:16:09 UTC
verified with 4.3.0-0.nightly-2019-11-19-122017 and issue has been fixed.

delete the secret and router can reload the new one.

# oc delete secret router-metrics-certs-default
# oc logs router-default-xx-xx
I1122 05:07:51.514632       1 template.go:630] router "level"=0 "msg"="reloaded metrics certificate"  "cert"="/etc/pki/tls/metrics-certs/tls.crt" "key"="/etc/pki/tls/metrics-certs/tls.key"

Comment 6 errata-xmlrpc 2020-01-23 11:12:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 7 Maru Newby 2020-02-17 20:44:48 UTC
I'm in the process of backporting service CA rotation to 4.2 and 4.3. Would it make sense to for this fix to similarly be backported to preclude failure of metrics collection in the event that the ingress operator is not restarted after CA rotation and before expiry of the pre-rotation CA?

Comment 8 Dan Mace 2020-02-17 21:14:22 UTC
(In reply to Maru Newby from comment #7)
> I'm in the process of backporting service CA rotation to 4.2 and 4.3. Would
> it make sense to for this fix to similarly be backported to preclude failure
> of metrics collection in the event that the ingress operator is not
> restarted after CA rotation and before expiry of the pre-rotation CA?

I think the router reload function would be a nice addition to the rotation backport. I'll bet the patch applies cleanly through the cherry-pick automation. Could be an easy win.

Comment 9 Maru Newby 2020-03-03 04:20:07 UTC
Updating the target release to satisfy the bugzilla bot. Not sure why it should care what the target release is for something that is long fixed.


Note You need to log in before you can comment on or make changes to this bug.