Description of problem:

If you configure IngressController/default to have a defaultCertificate but do not create the corresponding secret, the ingress operator does not report any error. The failure presents non-intuitively by breaking the authentication operator:

Failing: failed handling the route: router secret is empty: &v1.Secret{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"v4-0-config-system-router-certs", GenerateName:"", Namespace:"openshift-authentication", SelfLink:"/api/v1/namespaces/openshift-authentication/secrets/v4-0-config-system-router-certs", UID:"6ecb402f-57b7-11e9-8678-023af517f40a", ResourceVersion:"2088903", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63690074896, loc:(*time.Location)(0x2c627a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string][]uint8(nil), StringData:map[string]string(nil), Type:"Opaque"}

Version-Release number of selected component (if applicable): 4.0.0-0.9

How reproducible: 100%

Steps to Reproduce:
1. In the openshift-ingress-operator/ingresscontroller.operator/default instance, set spec.defaultCertificate to name a secret that does not exist.

Actual results:

The ingress operator does not report Failing. The authentication operator eventually starts reporting it:

NAME                 VERSION     AVAILABLE   PROGRESSING   FAILING   SINCE
authentication       4.0.0-0.9   False       False         True      20m
cloud-credential     4.0.0-0.9   True        False         False     5d
cluster-autoscaler   4.0.0-0.9   True        False         False     5d
console              4.0.0-0.9   True        False         False     4d23h
dns                  4.0.0-0.9   True        False         False     5d
image-registry       4.0.0-0.9   True        False         False     4d23h
ingress              4.0.0-0.9   True        False         False     4d23h
...

Expected results:

The ingress operator should detect and report the problem: Available=True, Progressing=False, Failing=True. Ideally it would not apply the partial/invalid configuration, leaving authentication available until the configuration became valid.

Additional info:
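For reference, a minimal sketch of the misconfiguration (the secret name custom-certs-default is only illustrative; what matters is that no secret with that name exists in the openshift-ingress namespace, where the default certificate secret is expected to live):

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: custom-certs-default   # no matching secret exists in openshift-ingress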
See http://post-office.corp.redhat.com/archives/aos-devel/2019-August/msg00085.html for a discussion about how status should be reported in this case (and cases like it).
https://github.com/openshift/cluster-ingress-operator/pull/283 will address this issue in the most generic sense. In the reported scenario, the router deployment will eventually become failed, which will cause the ingresscontroller to be degraded, which will cause the operator to become degraded. Fixing the secret reference will cause the degraded condition to become false once again.

Here's an example of the degraded condition on the ingresscontroller:

  - lastTransitionTime: "2019-08-07T15:05:03Z"
    message: 'The deployment failed (reason: ProgressDeadlineExceeded) with message:
      ReplicaSet "router-default-858c66c5c9" has timed out progressing.'
    reason: DeploymentFailed
    status: "True"
    type: Degraded

And here's the corresponding degraded condition on the clusteroperator:

  - lastTransitionTime: "2019-08-07T15:05:03Z"
    message: 'Some ingresscontrollers are degraded: default'
    reason: IngressControllersDegraded
    status: "True"
    type: Degraded

The ingresscontroller and operator are still available in this case because minimum deployment availability is maintained.
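A quick way to inspect just those conditions (a sketch; the jsonpath filters simply select the Degraded condition type shown above):

$ oc get ingresscontrollers.operator.openshift.io default -n openshift-ingress-operator -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}'
$ oc get co/ingress -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}'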
While I do agree we should make the status reporting even smarter (since we can do deeper analysis to detect why the deployment is stuck), we'll need to address that as a follow-up enhancement. My rationale is that failing to report Degraded given a failed operand deployment is the underlying blocker.
Verified with 4.2.0-0.nightly-2019-08-08-103722; the issue has been fixed.

1. Configure the ingresscontroller to use a nonexistent secret as the default certificate:

$ oc patch --type=merge --namespace openshift-ingress-operator ingresscontrollers/default --patch '{"spec":{"defaultCertificate":{"name":"custom-certs-default"}}}'

2. Wait about 10 minutes, then check the pod, ingresscontroller, and clusteroperator status:

$ oc get pod -n openshift-ingress
NAME                             READY   STATUS              RESTARTS   AGE
router-default-586884c57-nbf2r   0/1     ContainerCreating   0          17m
router-default-7d59f5b55-z8nqx   1/1     Running             0          3h24m

$ oc get ingresscontrollers.operator.openshift.io default -n openshift-ingress-operator -o yaml
<---snip--->
  - lastTransitionTime: "2019-08-09T05:40:39Z"
    message: 'The deployment failed (reason: ProgressDeadlineExceeded) with message:
      ReplicaSet "router-default-586884c57" has timed out progressing.'
    reason: DeploymentFailed
    status: "True"
    type: Degraded

$ oc get co/ingress -o yaml
<---snip--->
status:
  conditions:
  - lastTransitionTime: "2019-08-09T05:40:39Z"
    message: 'Some ingresscontrollers are degraded: default'
    reason: IngressControllersDegraded
    status: "True"
    type: Degraded
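As a follow-on check (a sketch based on the earlier comment that fixing the secret reference clears the condition; tls.crt and tls.key are placeholder paths to a valid certificate and key), creating the referenced secret in the openshift-ingress namespace should flip Degraded back to False:

$ oc create secret tls custom-certs-default --cert=tls.crt --key=tls.key -n openshift-ingress
$ oc get co/ingress -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'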
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922