Description of problem:

If you configure IngressController/default to have a defaultCertificate but do not create the corresponding secret, the ingress operator does not report any error. The failure presents non-intuitively by breaking the authentication operator:

Failing: failed handling the route: router secret is empty: &v1.Secret{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"v4-0-config-system-router-certs", GenerateName:"", Namespace:"openshift-authentication", SelfLink:"/api/v1/namespaces/openshift-authentication/secrets/v4-0-config-system-router-certs", UID:"6ecb402f-57b7-11e9-8678-023af517f40a", ResourceVersion:"2088903", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63690074896, loc:(*time.Location)(0x2c627a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string][]uint8(nil), StringData:map[string]string(nil), Type:"Opaque"}

Version-Release number of selected component (if applicable): 4.0.0-0.9

How reproducible: 100%

Steps to Reproduce:
1. In the openshift-ingress-operator/ingresscontroller.operator/default instance, set spec.defaultCertificate to name a secret that does not exist.

Actual results:

The ingress operator does not report Failing. The authentication operator eventually starts reporting it:

NAME                 VERSION     AVAILABLE   PROGRESSING   FAILING   SINCE
authentication       4.0.0-0.9   False       False         True      20m
cloud-credential     4.0.0-0.9   True        False         False     5d
cluster-autoscaler   4.0.0-0.9   True        False         False     5d
console              4.0.0-0.9   True        False         False     4d23h
dns                  4.0.0-0.9   True        False         False     5d
image-registry       4.0.0-0.9   True        False         False     4d23h
ingress              4.0.0-0.9   True        False         False     4d23h
...

Expected results:

The ingress operator should detect and report the problem: Available=True, Progressing=False, Failing=True. Ideally it would not apply the partial/invalid configuration, leaving authentication available until the configuration became valid.

Additional info:
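For reference, a minimal sketch of the misconfiguration (the secret name custom-certs-default is only illustrative; what matters is that no secret with that name exists in the openshift-ingress namespace, where the default certificate secret is expected to live):

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: custom-certs-default   # no matching secret exists in openshift-ingress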
See http://post-office.corp.redhat.com/archives/aos-devel/2019-August/msg00085.html for a discussion about how status should be reported in this case (and cases like it).
https://github.com/openshift/cluster-ingress-operator/pull/283 will address this issue in the most generic sense. In the reported scenario, the router deployment will eventually become failed, which will cause the ingresscontroller to be degraded, which will cause the operator to become degraded. Fixing the secret reference will cause the degraded condition to become false once again.

Here's an example of the degraded condition on the ingresscontroller:

  - lastTransitionTime: "2019-08-07T15:05:03Z"
    message: 'The deployment failed (reason: ProgressDeadlineExceeded) with message:
      ReplicaSet "router-default-858c66c5c9" has timed out progressing.'
    reason: DeploymentFailed
    status: "True"
    type: Degraded

And here's the corresponding degraded condition on the clusteroperator:

  - lastTransitionTime: "2019-08-07T15:05:03Z"
    message: 'Some ingresscontrollers are degraded: default'
    reason: IngressControllersDegraded
    status: "True"
    type: Degraded

The ingresscontroller and operator are still available in this case because minimum deployment availability is maintained.
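A quick way to inspect just those conditions (a sketch; the jsonpath filters simply select the Degraded condition type shown above):

$ oc get ingresscontrollers.operator.openshift.io default -n openshift-ingress-operator -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}'
$ oc get co/ingress -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}'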
While I do agree we should make the status reporting even smarter (since we can do deeper analysis to detect why the deployment is stuck), we'll need to address that as a follow-up enhancement. My rationale is that failing to report Degraded given a failed operand deployment is the underlying blocker.
Verified with 4.2.0-0.nightly-2019-08-08-103722; the issue has been fixed.

1. Configure the ingresscontroller to use a nonexistent secret as the default certificate:

$ oc patch --type=merge --namespace openshift-ingress-operator ingresscontrollers/default --patch '{"spec":{"defaultCertificate":{"name":"custom-certs-default"}}}'

2. Wait about 10 minutes, then check the pod, ingresscontroller, and clusteroperator status:

$ oc get pod -n openshift-ingress
NAME                             READY   STATUS              RESTARTS   AGE
router-default-586884c57-nbf2r   0/1     ContainerCreating   0          17m
router-default-7d59f5b55-z8nqx   1/1     Running             0          3h24m

$ oc get ingresscontrollers.operator.openshift.io default -n openshift-ingress-operator -o yaml
<---snip--->
  - lastTransitionTime: "2019-08-09T05:40:39Z"
    message: 'The deployment failed (reason: ProgressDeadlineExceeded) with message:
      ReplicaSet "router-default-586884c57" has timed out progressing.'
    reason: DeploymentFailed
    status: "True"
    type: Degraded

$ oc get co/ingress -o yaml
<---snip--->
status:
  conditions:
  - lastTransitionTime: "2019-08-09T05:40:39Z"
    message: 'Some ingresscontrollers are degraded: default'
    reason: IngressControllersDegraded
    status: "True"
    type: Degraded
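As a follow-on check (a sketch based on the earlier comment that fixing the secret reference clears the condition; tls.crt and tls.key are placeholder paths to a valid certificate and key), creating the referenced secret in the openshift-ingress namespace should flip Degraded back to False:

$ oc create secret tls custom-certs-default --cert=tls.crt --key=tls.key -n openshift-ingress
$ oc get co/ingress -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'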
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922