Bug 1698562 - ingress operator does not complain about missing defaultCertificate secret
Summary: ingress operator does not complain about missing defaultCertificate secret
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-10 15:34 UTC by Justin Pierce
Modified: 2022-08-04 22:24 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:05 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 283 0 'None' closed Bug 1698562: status: introduce ingresscontroller degraded condition 2020-08-11 21:20:57 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:28:22 UTC

Description Justin Pierce 2019-04-10 15:34:10 UTC
Description of problem:
If you configure IngressController/default to have a defaultCertificate, but do not create the corresponding secret, the ingress operator does not report any error.
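The misconfiguration looks roughly like this (the spec.defaultCertificate.name field is the real IngressController API field; the secret name here is just an example):

```yaml
# IngressController referencing a secret that was never created.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: custom-certs-default   # no secret with this name exists
```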

The failure manifests non-intuitively by breaking the authentication operator:

Failing: failed handling the route: router secret is empty: &v1.Secret{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"v4-0-config-system-router-certs", GenerateName:"", Namespace:"openshift-authentication", SelfLink:"/api/v1/namespaces/openshift-authentication/secrets/v4-0-config-system-router-certs", UID:"6ecb402f-57b7-11e9-8678-023af517f40a", ResourceVersion:"2088903", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63690074896, loc:(*time.Location)(0x2c627a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string][]uint8(nil), StringData:map[string]string(nil), Type:"Opaque"}

Version-Release number of selected component (if applicable):
4.0.0-0.9

How reproducible:
100%

Steps to Reproduce:
1. Configure the openshift-ingress-operator/ingresscontroller.operator/default instance with a defaultCertificate that names a secret that does not exist.

Actual results:
The ingress operator does not report a failure. The authentication operator eventually starts reporting one.

NAME                                 VERSION     AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.0.0-0.9   False       False         True      20m
cloud-credential                     4.0.0-0.9   True        False         False     5d
cluster-autoscaler                   4.0.0-0.9   True        False         False     5d
console                              4.0.0-0.9   True        False         False     4d23h
dns                                  4.0.0-0.9   True        False         False     5d
image-registry                       4.0.0-0.9   True        False         False     4d23h
ingress                              4.0.0-0.9   True        False         False     4d23h
...

Expected results:
The ingress operator should detect and report the problem: Available=True, Progressing=False, Failing=True. Ideally it would not apply the partial/invalid configuration, leaving authentication available until the configuration is valid.

Additional info:

Comment 1 Dan Mace 2019-08-06 19:29:07 UTC
See http://post-office.corp.redhat.com/archives/aos-devel/2019-August/msg00085.html for a discussion about how status should be reported in this case (and cases like it).

Comment 2 Dan Mace 2019-08-07 15:34:38 UTC
https://github.com/openshift/cluster-ingress-operator/pull/283 will address this issue in the most generic sense. In the reported scenario, the router deployment will eventually fail, which will cause the ingresscontroller to be marked degraded, which in turn will cause the operator to become degraded. Fixing the secret reference will cause the degraded condition to become false again.
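The propagation described above can be sketched as follows. This is an illustrative simplification, not the operator's actual code: the real implementation in cluster-ingress-operator works on appsv1.DeploymentCondition and the operator/config API condition types, and the type and function names below are invented for the sketch.

```go
package main

import "fmt"

// Condition is a simplified stand-in for the Kubernetes/OpenShift
// condition structs used by the real operator.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// degradedFromDeployment marks the ingresscontroller Degraded when the
// router deployment reports Progressing=False with reason
// ProgressDeadlineExceeded, mirroring the first YAML example below.
func degradedFromDeployment(deploymentConds []Condition) Condition {
	for _, c := range deploymentConds {
		if c.Type == "Progressing" && c.Status == "False" &&
			c.Reason == "ProgressDeadlineExceeded" {
			return Condition{
				Type:   "Degraded",
				Status: "True",
				Reason: "DeploymentFailed",
				Message: fmt.Sprintf(
					"The deployment failed (reason: %s) with message: %s",
					c.Reason, c.Message),
			}
		}
	}
	return Condition{Type: "Degraded", Status: "False"}
}

// operatorDegraded rolls per-ingresscontroller Degraded conditions up to
// the clusteroperator, mirroring the second YAML example below.
func operatorDegraded(byController map[string]Condition) Condition {
	var degraded []string
	for name, c := range byController {
		if c.Type == "Degraded" && c.Status == "True" {
			degraded = append(degraded, name)
		}
	}
	if len(degraded) == 0 {
		return Condition{Type: "Degraded", Status: "False"}
	}
	return Condition{
		Type:    "Degraded",
		Status:  "True",
		Reason:  "IngressControllersDegraded",
		Message: fmt.Sprintf("Some ingresscontrollers are degraded: %v", degraded),
	}
}

func main() {
	ic := degradedFromDeployment([]Condition{{
		Type: "Progressing", Status: "False",
		Reason:  "ProgressDeadlineExceeded",
		Message: `ReplicaSet "router-default-858c66c5c9" has timed out progressing.`,
	}})
	co := operatorDegraded(map[string]Condition{"default": ic})
	fmt.Println(ic.Status, ic.Reason) // True DeploymentFailed
	fmt.Println(co.Status, co.Reason) // True IngressControllersDegraded
}
```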

Here's an example of the degraded condition on the ingresscontroller:

  - lastTransitionTime: "2019-08-07T15:05:03Z"
    message: 'The deployment failed (reason: ProgressDeadlineExceeded) with message:
      ReplicaSet "router-default-858c66c5c9" has timed out progressing.'
    reason: DeploymentFailed
    status: "True"
    type: Degraded

And here's the corresponding degraded condition on the clusteroperator:

  - lastTransitionTime: "2019-08-07T15:05:03Z"
    message: 'Some ingresscontrollers are degraded: default'
    reason: IngressControllersDegraded
    status: "True"
    type: Degraded

The ingresscontroller and operator are still available in this case because minimum deployment availability is maintained.
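The availability rule mentioned above can be sketched as a simple threshold check. This is an assumption-laden simplification of the operator's actual availability computation; the function name and threshold parameter are illustrative only.

```go
package main

import "fmt"

// available sketches "minimum deployment availability is maintained":
// the ingresscontroller stays Available while the router deployment keeps
// at least minAvailable healthy replicas, even though the rollout of the
// broken pod spec (the missing secret) is stuck.
func available(availableReplicas, minAvailable int32) bool {
	return availableReplicas >= minAvailable
}

func main() {
	fmt.Println(available(1, 1)) // one old healthy router pod: true
	fmt.Println(available(0, 1)) // no healthy pods left: false
}
```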

Comment 3 Dan Mace 2019-08-07 15:36:48 UTC
While I do agree we should make the status reporting even smarter (since we can do deeper analysis to detect why the deployment is stuck), we'll need to address that as a followup enhancement. My rationale is that failing to report degraded given a failed operand deployment is the underlying blocker.

Comment 5 Hongan Li 2019-08-09 06:17:00 UTC
Verified with 4.2.0-0.nightly-2019-08-08-103722; the issue has been fixed.

1. Configure the ingresscontroller to use a nonexistent secret as the default certificate:
   oc patch --type=merge --namespace openshift-ingress-operator ingresscontrollers/default --patch '{"spec":{"defaultCertificate":{"name":"custom-certs-default"}}}'

2. wait about 10min then check pod, ingresscontroller and clusteroperator status

$ oc get pod -n openshift-ingress
NAME                             READY   STATUS              RESTARTS   AGE
router-default-586884c57-nbf2r   0/1     ContainerCreating   0          17m
router-default-7d59f5b55-z8nqx   1/1     Running             0          3h24m

$ oc get ingresscontrollers.operator.openshift.io default -n openshift-ingress-operator -o yaml
<---snip--->
  - lastTransitionTime: "2019-08-09T05:40:39Z"
    message: 'The deployment failed (reason: ProgressDeadlineExceeded) with message:
      ReplicaSet "router-default-586884c57" has timed out progressing.'
    reason: DeploymentFailed
    status: "True"
    type: Degraded

$ oc get co/ingress -o yaml
<---snip--->
status:
  conditions:
  - lastTransitionTime: "2019-08-09T05:40:39Z"
    message: 'Some ingresscontrollers are degraded: default'
    reason: IngressControllersDegraded
    status: "True"
    type: Degraded

Comment 6 errata-xmlrpc 2019-10-16 06:28:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

