Bug 1771783

Summary: Template router metrics listener is potentially broken by service CA rotation
Product: OpenShift Container Platform Reporter: Maru Newby <mnewby>
Component: RoutingAssignee: Dan Mace <dmace>
Status: CLOSED ERRATA QA Contact: Hongan Li <hongli>
Severity: urgent Docs Contact:
Priority: high    
Version: 4.2.0CC: aos-bugs, dmace
Target Milestone: ---Flags: mnewby: needinfo? (dmace)
Target Release: 4.3.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1809398 (view as bug list) Environment:
Last Closed: 2020-01-23 11:12:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1809398    

Description Maru Newby 2019-11-12 23:42:10 UTC
The template router controller is supplied with the names of the serving cert and key files via env var: 

ROUTER_METRICS_TLS_CERT_FILE, ROUTER_METRICS_TLS_KEY_FILE

The key material is loaded when the router runs and used to configure the metrics listener:

https://github.com/openshift/router/blob/master/pkg/cmd/infra/router/template.go#L420

These vars are set in the deployment to a serving cert and key injected by the service ca controller: 

https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/deployment.go#L241

In the event of the serving cert and key changing after service CA rotation, the operator (or router) should be capable of automatically detecting the change and refreshing to use the latest key material. The 'Refresh Strategies' section of the linked compatibility doc catalogs potential strategies for responding to changes in key material supplied by the service ca operator.

Note that CA rotation can be manually triggered in any 4.x release by removing the service CA signing secret. Automated rotation is likely to be introduced in a future z-stream release. 

Reference: 

Enhancement for automated service CA rotation: 

https://github.com/openshift/enhancements/blob/master/enhancements/automated-service-ca-rotation.md

Operator compatibility with service ca rotation:

https://docs.google.com/document/d/1NB2wUf9e8XScfVM6jFBl8VuLYG6-3uV63eUpqmYE8Ts/edit

Comment 1 Dan Mace 2019-11-18 17:55:16 UTC
Miciah wants to try a kube-rbac-proxy approach to replace my implementation in https://github.com/openshift/router/pull/61, which I think is a good idea. I'm going to re-assign and Miciah will supersede my PR with a new one.

Comment 2 Dan Mace 2019-11-18 20:12:12 UTC
Further discussion revealed that the kube-rbac-proxy approach is incompatible with HostNetwork-exposed ingresscontrollers, so we're going back to my solution for now.

Comment 4 Hongan Li 2019-11-22 06:16:09 UTC
verified with 4.3.0-0.nightly-2019-11-19-122017 and issue has been fixed.

delete the secret and router can reload the new one.

# oc delete secret router-metrics-certs-default
# oc logs router-default-xx-xx
I1122 05:07:51.514632       1 template.go:630] router "level"=0 "msg"="reloaded metrics certificate"  "cert"="/etc/pki/tls/metrics-certs/tls.crt" "key"="/etc/pki/tls/metrics-certs/tls.key"

Comment 6 errata-xmlrpc 2020-01-23 11:12:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 7 Maru Newby 2020-02-17 20:44:48 UTC
I'm in the process of backporting service CA rotation to 4.2 and 4.3. Would it make sense to for this fix to similarly be backported to preclude failure of metrics collection in the event that the ingress operator is not restarted after CA rotation and before expiry of the pre-rotation CA?

Comment 8 Dan Mace 2020-02-17 21:14:22 UTC
(In reply to Maru Newby from comment #7)
> I'm in the process of backporting service CA rotation to 4.2 and 4.3. Would
> it make sense to for this fix to similarly be backported to preclude failure
> of metrics collection in the event that the ingress operator is not
> restarted after CA rotation and before expiry of the pre-rotation CA?

I think the router reload function would be a nice addition to the rotation backport. I'll bet the patch applies cleanly through the cherry-pick automation. Could be an easy win.

Comment 9 Maru Newby 2020-03-03 04:20:07 UTC
Updating the target release to satisfy the bugzilla bot. Not sure why it should care what the target release is for something that is long fixed.