Created attachment 1546648 [details]
post ingress configuration, authentication operator was monitored for readiness

Description of problem:
Observed after a 4.0 install & configuration operation. The authentication operator was constantly toggling between Failure=True & Progressing=True and Failure=True & Progressing=False.

Version-Release number of selected component (if applicable):
4.0.0-0.alpha-2019-03-21-045513

Additional info:
See attached listing
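For anyone wanting to observe the same flapping on their own cluster, something like the following should show the condition toggling (a sketch; assumes a cluster-admin kubeconfig and the standard clusteroperator name):

# watch the clusteroperator conditions flip back and forth
oc get clusteroperator authentication -w
# or dump the full condition list at a point in time
oc get clusteroperator authentication -o yaml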
Miciah pointed to https://github.com/openshift/cluster-ingress-operator/pull/168 as a possible fix.
Update on https://bugzilla.redhat.com/show_bug.cgi?id=1691488#c6 - It appears I already had this fix. `oc adm release info` shows `cluster-ingress-operator https://github.com/openshift/cluster-ingress-operator ee176866c1ca5fb4e6a014bfd13fcdab0cbcae92` which seems to include PR 168.
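For reference, the commit can be pulled out of the release payload the cluster is running with something along these lines (assuming you're logged into the affected cluster):

# list the repos/commits baked into the current release
oc adm release info --commits | grep cluster-ingress-operator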
Couple of interesting findings:

Events in the openshift-authentication namespace show errors like:

'Unable to mount volumes for pod "openshift-authentication-545d764f7b-29xnb_openshift-authentication(c1b4117e-4bfc-11e9-802b-02080fc4efd6)": timeout expired waiting for volumes to attach or mount for pod "openshift-authentication"/"openshift-authentication-545d764f7b-29xnb"'

but also

'Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-authentication-545d764f7b-44ngv_openshift-authentication_ab4c6b6e-4c00-11e9-802b-02080fc4efd6_0(d235df482c762b43ab01ee10322f6fd8106204398a5d7710787728e95d8e1351): Multus: Err adding pod to network "openshift-sdn": cannot set "openshift-sdn" ifname to "eth0": no netns: failed to Statfs "/proc/761/ns/net": no such file or directory'

Another thing is that while events in openshift-authentication-operator show that the last update of "Secret/v4-0-config-system-router-certs" was at 16:42:48, the authentication-operator reports it to be empty at least up until 16:46:06. Note that it started looking for it at 16:21:47; that's 25 (!) minutes this secret was not available for use, and who knows if it is afterwards.

Also, it's not entirely clear what was happening with the openshift-apiserver at these times, as the last logs show that it started at 17:04...

So my current summary is: yes, something is wrong with the cluster, but I am not sure it's the authentication-operator that's causing it, although its trying to recreate its deployment sure is not helping and it probably should not do that.

If you could please provide some better steps to reproduce other than "The ingresscontroller.operator is configured to use a default certificate" (I may be lame but I do not know what that means, I only noticed that openshift-ingress:secret/router-certs contains a Let's Encrypt cert and no "tls.key"), I could try to do that with my cluster, see if it's 100% reproducible, and maybe add some logging to the authentication-operator so that we know a bit more about what's going on in there.
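In case it helps anyone retracing the above, these are roughly the queries behind it (a sketch; I'm assuming the v4-0-config-system-router-certs secret lives in the openshift-authentication namespace, adjust if not):

# events in both the operand and operator namespaces
oc get events -n openshift-authentication
oc get events -n openshift-authentication-operator
# the synced router-certs copy the operator keeps reporting as empty
oc get secret v4-0-config-system-router-certs -n openshift-authentication -o yaml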
- The tls.key is redacted by must-gather to avoid extracting private data from clusters. On the live cluster, I've verified that the key is intact.

- Configuring the default certificate involves applying a cluster resource like:

    apiVersion: "operator.openshift.io/v1"
    kind: "IngressController"
    metadata:
      name: "default"
      namespace: "openshift-ingress-operator"
    spec:
      defaultCertificate:
        name: "router-certs"

  Where "router-certs" is a secret which exists in the openshift-ingress namespace. I don't know whether configuring this value is what breaks the authentication operator; I only know it happened two out of two times while attempting cluster installations yesterday.
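For completeness, that secret is just a plain TLS secret created from a cert/key pair; something like the following, where the file names are placeholders for the administrator-provided certificate and key:

# create the custom default-certificate secret where the ingress operator expects it
oc create secret tls router-certs --cert=custom-router.crt --key=custom-router.key -n openshift-ingress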
Thanks, I see it was more obvious than I expected; I should probably read the actual README of your operator rather than just Ctrl+F through it... Anyway, I was able to reproduce by simply copying router-certs-default to router-certs, applying the IngressController settings, and removing the router-certs from openshift-config-managed.
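For anyone repeating this, the reproduction roughly translates to the commands below (a sketch, not exact; I'm assuming router-certs-default is a secret in openshift-ingress, that the published copy in openshift-config-managed is also a secret, and that ingresscontroller.yaml holds the IngressController snippet from the earlier comment):

# extract the operator-generated default cert/key
oc get secret router-certs-default -n openshift-ingress -o jsonpath='{.data.tls\.crt}' | base64 -d > tls.crt
oc get secret router-certs-default -n openshift-ingress -o jsonpath='{.data.tls\.key}' | base64 -d > tls.key
# re-create it under the name the IngressController spec points at
oc create secret tls router-certs --cert=tls.crt --key=tls.key -n openshift-ingress
# apply the IngressController settings
oc apply -f ingresscontroller.yaml
# drop the published copy that auth consumes
oc delete secret router-certs -n openshift-config-managed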
So the reason for the `x509` failures in the authentication-operator is that it's trying to use the serviceaccount CA bundle, but, as you can see, the ConfigMap `openshift-config-managed/router-ca` with the CA that we're missing in that bundle is no longer in the data gathered by `must-gather`, nor is it in the cluster (I noticed that in my deployment). I do not think that simply using the `defaultCertificate` option should have this effect.

We will be adding our own fix for this, as we probably shouldn't be relying on the SA CA bundle to contain our router-ca. On the other hand, the router-ca ConfigMap not being available sounds like a bug for the networking team.
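For anyone checking their own cluster, the missing piece can be confirmed with a single query (on an affected cluster this comes back NotFound):

# the CA the auth operator expects the ingress operator to publish
oc get configmap router-ca -n openshift-config-managed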
This belongs to either Auth or to Routing (which is the Network Edge team's component). Not sure which, but it's not Networking (which is the SDN team's component), so I'll assign to Routing for now.

Turns out we (Auth and Network Edge) had a misunderstanding about the contract between cluster-authentication-operator and cluster-ingress-operator regarding the router-ca configmap. I was under the impression that cluster-ingress-operator should only publish the router-ca configmap if an operator-generated certificate (i.e., one signed by the router CA) were in use, but Mo is telling me we should publish the configmap with whatever CA was used to sign the certificates in use (possibly multiple CAs in the case of multiple ingresscontrollers), be it the router CA for an operator-generated certificate or the CA for the administrator-provided certificate, with an explicit opt-out option. We need to rehash the contract, and then we can move forward on the issue.
Auth and Routing talked about this.

Scenarios:

1) No default cert given. The ingress-controller makes a CA, signs the cert, and publishes the CA cert for auth to use (and tells the other pods to trust it).

2) A default cert that is signed by a globally trusted CA. We do nothing.

3) A default cert given to us that is externally signed, but not by a trusted CA. We do nothing.

Scenarios 1 and 2 work because in 1 auth adds the ingress-controller-generated CA cert to the trust chain, and in 2 the root CA is already in the trust chain. In scenario 3, there's some other CA that signed the default cert... but it is not in the trust chain, so pods freak out (if they are trying to use things that the wildcard cert points at).

Conclusion: Scenario 2 is actually what is broken in BZ#1691488, and Auth will fix the auth server to include the root CAs as well. BUT we need to think about whether scenario 3 needs to be supported, and we realized we need to work out how auth will support sharded routers. Does one need to be flagged as default?
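As a side note, a rough way to tell scenario 2 from scenario 3 for a given cert is to check whether it verifies against the host's trusted CA bundle. The bundle path below is the usual RHEL location and custom-router.crt is a placeholder for the administrator-provided cert; if the chain uses intermediates they'd need to be supplied too (e.g. via -untrusted), so treat this as a quick check only:

# scenario 2: exits 0 because the cert chains to a root already in the system bundle
# scenario 3: fails with "unable to get local issuer certificate"
openssl verify -CAfile /etc/pki/tls/certs/ca-bundle.crt custom-router.crt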
https://github.com/openshift/cluster-authentication-operator/pull/102
Verified.

The authentication operator without that PR:

# oc logs -f openshift-authentication-operator-7b5dbb57c8-7md8v -n openshift-authentication-operator | grep 509
E0328 07:27:22.560871       1 controller.go:130] {🐼 🐼} failed with: x509: certificate signed by unknown authority
I0328 07:27:22.564447       1 status_controller.go:150] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2019-03-28T07:23:03Z","message":"Failing: x509: certificate signed by unknown authority","reason":"Failing","status":"True","type":"Failing"},{"lastTransitionTime":"2019-03-28T07:27:01Z","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2019-03-28T07:22:25Z","reason":"Available","status":"False","type":"Available"},{"lastTransitionTime":"2019-03-28T07:22:19Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}

The authentication operator with that PR:

# oc logs -f openshift-authentication-operator-7b5d4779bd-7mhhb -n openshift-authentication-operator | grep 509

The x509 error does not occur.

# oc get co authentication
NAME             VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
authentication   4.0.0-0.nightly-2019-03-28-030453   True        False         False     3h33m

As I understand it: for scenario 3, we should add the external CA to the system CA trust to avoid this error.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758