Bug 1691488 - authentication operator is failing due to Failing: x509: certificate signed by unknown authority
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Auth
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Mo
QA Contact: scheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-21 17:51 UTC by Justin Pierce
Modified: 2019-06-04 10:46 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:21 UTC
Target Upstream Version:


Attachments (Terms of Use)
post ingress configuration, authentication operator was monitored for readiness (11.67 KB, text/plain)
2019-03-21 17:51 UTC, Justin Pierce


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:46:28 UTC

Description Justin Pierce 2019-03-21 17:51:57 UTC
Created attachment 1546648 [details]
post ingress configuration, authentication operator was monitored for readiness

Description of problem:
Observed after a 4.0 install & configuration operation. Authentication operator was constantly toggling between Failure=True&Progressing=True and Failure=True&Progressing=False.

Version-Release number of selected component (if applicable):
4.0.0-0.alpha-2019-03-21-045513

Additional info:
See attached listing

Comment 6 Justin Pierce 2019-03-21 20:36:58 UTC
Miciah pointed to https://github.com/openshift/cluster-ingress-operator/pull/168 as a possible fix.

Comment 7 Justin Pierce 2019-03-21 20:47:01 UTC
Update on https://bugzilla.redhat.com/show_bug.cgi?id=1691488#c6 - It appears I already had this fix. 

`oc adm release info` shows `cluster-ingress-operator                      https://github.com/openshift/cluster-ingress-operator                      ee176866c1ca5fb4e6a014bfd13fcdab0cbcae92`  which seems to include PR 168.

Comment 8 Standa Laznicka 2019-03-22 13:19:38 UTC
A couple of interesting findings:

Events in openshift-authentication namespace show errors like:
'Unable to mount volumes for pod "openshift-authentication-545d764f7b-29xnb_openshift-authentication(c1b4117e-4bfc-11e9-802b-02080fc4efd6)":
    timeout expired waiting for volumes to attach or mount for pod "openshift-authentication"/"openshift-authentication-545d764f7b-29xnb"

but also

'Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-authentication-545d764f7b-44ngv_openshift-authentication_ab4c6b6e-4c00-11e9-802b-02080fc4efd6_0(d235df482c762b43ab01ee10322f6fd8106204398a5d7710787728e95d8e1351):
    Multus: Err adding pod to network "openshift-sdn": cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/761/ns/net": no such file or directory'


Another thing: while events in openshift-authentication-operator show that the last update of "Secret/v4-0-config-system-router-certs" was at 16:42:48, the authentication-operator reported it as empty at least until 16:46:06. Note that it started looking for it at 16:21:47, so this secret was unavailable for 25 (!) minutes, and who knows whether it was usable afterwards. It is also not entirely clear what was happening with the openshift-apiserver during this time, as the last logs show that it started at 17:04...

So my current summary is: yes, something is wrong with the cluster, but I am not sure the authentication-operator is causing it, although its attempts to recreate its deployment are certainly not helping, and it probably should not do that.

If you could please provide better steps to reproduce than "The ingresscontroller.operator is configured to use a default certificate" (I may be lame, but I do not know what that means; I only noticed that openshift-ingress:secret/router-certs contains a Let's Encrypt cert and no "tls.key"), I could try it on my cluster, see whether it is 100% reproducible, and maybe add some logging to the authentication-operator so that we know a bit more about what is going on in there.

Comment 9 Justin Pierce 2019-03-22 13:30:44 UTC
- The tls.key is redacted by must-gather to avoid extracting private data from clusters. On the live cluster, I've verified that the key is intact.

- Configuring the default certificate involves applying a cluster resource like:

apiVersion: "operator.openshift.io/v1"
kind: "IngressController"
metadata: 
  name: "default"
  namespace: "openshift-ingress-operator"
spec: 
  defaultCertificate: 
    name: "router-certs"

Where "router-certs" is a secret which exists in the openshift-ingress namespace. 

I don't know whether configuring this value is what breaks the authentication operator, I only know it happened two-out-of-two times while attempting cluster installations yesterday.
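For reference, a minimal sketch of how such a certificate and secret could be produced. The wildcard domain and file names here are made up, and the `oc` commands (shown as comments) assume cluster-admin access to a live cluster:

```shell
# Generate a self-signed wildcard cert/key pair (a stand-in for a real
# externally signed default certificate).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=*.apps.example.com"

# The secret referenced by spec.defaultCertificate would then be created
# in the openshift-ingress namespace, and the IngressController applied:
#   oc -n openshift-ingress create secret tls router-certs --cert=tls.crt --key=tls.key
#   oc apply -f ingresscontroller.yaml
```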

Comment 10 Standa Laznicka 2019-03-22 14:28:29 UTC
Thanks, I see it was more obvious than I expected, I should probably read the actual readme of your operator rather than just ctrl+F through it...

Anyway, I was able to reproduce by simply copying router-certs-default to router-certs, applying the IngressController settings and removing the router-certs from openshift-config-managed.

Comment 11 Standa Laznicka 2019-03-22 16:32:04 UTC
So the reason for the `x509` failures in the authentication-operator is that it is trying to use the serviceaccount CA bundle, but, as you can see, the ConfigMap `openshift-config-managed/router-ca` with the CA that is missing from the bundle is no longer present in the data gathered by `must-gather`, nor in the cluster (I noticed that in my deployment).

I do not think that simply using the `defaultCertificate` option should have this effect.

We will be adding our own fix to this as we probably shouldn't be relying on the SA CA bundle to contain our router-ca. On the other hand, the router-ca CM not being available sounds like a bug of the networking team.

Comment 12 Miciah Dashiel Butler Masters 2019-03-22 19:57:07 UTC
This belongs to either Auth or to Routing (which is the Network Edge team's component).  Not sure which, but it's not Networking (which is the SDN team's component), so I'll assign to Routing for now.

Turns out we (Auth and Network Edge) had a misunderstanding about the contract between cluster-authentication-operator and cluster-ingress-operator regarding the router-ca configmap.  I was under the impression that cluster-ingress-operator should only publish the router-ca configmap if an operator-generated certificate (i.e., one signed by the router CA) were in use, but Mo is telling me we should publish the configmap with whatever CA was used to sign the certificates in use (possibly multiple CAs in the case of multiple ingresscontrollers), be it the router CA for an operator-generated certificate or the CA for the administrator-provided certificate, with an explicit opt-out option.

We need to rehash the contract, and then we can move forward on the issue.

Comment 13 Ben Bennett 2019-03-26 15:20:56 UTC
Auth and routing talked about this:

Scenarios:
1) No default cert given. The ingress-controller makes a CA, signs the cert, and publishes the CA cert for auth to use (and tells the other pods to trust it).
2) A default cert that is signed by a globally trusted CA. We do nothing.
3) A default cert given to us that is externally signed, but not by a trusted CA. We do nothing.

Scenarios 1 and 2 work because in 1, auth adds the ingress-controller-generated CA cert to the trust chain, and in 2, the root CA is already in the trust chain.

In scenario 3, some other CA signed the default cert, but it is not in the trust chain, so pods freak out (if they try to use things that the wildcard cert points at).

Conclusion: Scenario 2 is actually what is broken in BZ#1691488 and Auth will fix the auth server to include the root CAs as well.  BUT we need to think about whether scenario 3 needs to be supported, and we realized we need to work out how auth will support sharded routers.  Does one need to be flagged as default?

Comment 16 scheng 2019-03-28 09:48:03 UTC
Verified.

The authentication operator without that PR:
# oc logs -f openshift-authentication-operator-7b5dbb57c8-7md8v -n openshift-authentication-operator |grep 509
E0328 07:27:22.560871       1 controller.go:130] {🐼 🐼} failed with: x509: certificate signed by unknown authority
I0328 07:27:22.564447       1 status_controller.go:150] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2019-03-28T07:23:03Z","message":"Failing: x509: certificate signed by unknown authority","reason":"Failing","status":"True","type":"Failing"},{"lastTransitionTime":"2019-03-28T07:27:01Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2019-03-28T07:22:25Z","reason":"Available","status":"False","type":"Available"},{"lastTransitionTime":"2019-03-28T07:22:19Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}


The authentication operator with that PR:
# oc logs -f openshift-authentication-operator-7b5d4779bd-7mhhb -n openshift-authentication-operator |grep 509

The x509 error does not occur.

# oc get co authentication
NAME             VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
authentication   4.0.0-0.nightly-2019-03-28-030453   True        False         False     3h33m


As I understand it: for scenario 3, we should add the external CA to the system CA bundle to avoid this error.

Comment 18 errata-xmlrpc 2019-06-04 10:46:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

