Bug 1683515

Summary: Deleting a ClusterIngress before a DNS Alias record is created causes the operation to hang.
Product: OpenShift Container Platform Reporter: Daneyon Hansen <dhansen>
Component: Networking Assignee: Daneyon Hansen <dhansen>
Networking sub component: router QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: medium CC: aos-bugs, mmasters
Version: 4.1.0   
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:44:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Daneyon Hansen 2019-02-27 05:23:11 UTC
Description of problem:
Deleting a ClusterIngress before its DNS Alias record is created causes the delete operation to hang, which leaves dependent resources orphaned. The source of the issue appears to be a reconciliation error when the operator tries to delete the DNS record for the router service's load balancer. Because the record does not exist, the delete call returns an error, reconciliation fails, and the operator stops trying to reconcile the resource.

Version-Release number of selected component (if applicable):
$ git log --oneline
1b4fa5a5 Merge pull request #132 from pravisankar/fix-retry-controller

How reproducible:
always

Steps to Reproduce:
1. Create a clusteringress.

2. Before the clusteringress DNS record is created and associated with the router's service (type: LoadBalancer), delete the clusteringress.

Actual results:
The deletion event hangs. The DeletionTimestamp is applied to the clusteringress, but the resource and its dependent resources are not deleted.

Expected results:
The deletion event completes successfully.

Additional info:

Relevant Operator Log Messages:
2019-02-26T21:12:39.934-0800	INFO	operator.controller	controller/controller.go:82	reconciling	{"request": "openshift-ingress-operator/test1"}
2019-02-26T21:12:41.721-0800	ERROR	operator.init.kubebuilder.controller	controller/controller.go:217	Reconciler error	{"controller": "operator-controller", "request": "openshift-ingress-operator/test1", "error": "failed to ensure ingress deletion: failed to finalize load balancer service for test1: [failed to delete DNS record &{{ map[Name:danehans-9nggd-int kubernetes.io/cluster/danehans-9nggd:owned]} ALIAS *.tests1.danehans.devcluster.openshift.com -> a53a055213a4c11e9adeb0a6b3bd6b3e-1017019273.us-west-2.elb.amazonaws.com} for ingress openshift-ingress-operator/test1: failed to update alias in zone ZYPD4B0DM135S: couldn't update DNS record in zone ZYPD4B0DM135S: InvalidChangeBatch: [Tried to delete resource record set [name='\\052.tests1.danehans.devcluster.openshift.com.', type='A'] but it was not found]\n\tstatus code: 400, request id: 4e6f07e1-3a4e-11e9-95f1-3f28621c9566, failed to delete DNS record &{{Z3URY6TWQ91KVV map[]} ALIAS *.tests1.danehans.devcluster.openshift.com -> a53a055213a4c11e9adeb0a6b3bd6b3e-1017019273.us-west-2.elb.amazonaws.com} for ingress openshift-ingress-operator/test1: failed to update alias in zone Z3URY6TWQ91KVV: couldn't update DNS record in zone Z3URY6TWQ91KVV: InvalidChangeBatch: [Tried to delete resource record set [name='\\052.tests1.danehans.devcluster.openshift.com.', type='A'] but it was not found]\n\tstatus code: 400, request id: 4ebcd95b-3a4e-11e9-9626-4f65dfa62605]", "errorCauses": [{"error": "failed to ensure ingress deletion: failed to finalize load balancer service for test1: [failed to delete DNS record &{{ map[Name:danehans-9nggd-int kubernetes.io/cluster/danehans-9nggd:owned]} ALIAS *.tests1.danehans.devcluster.openshift.com -> a53a055213a4c11e9adeb0a6b3bd6b3e-1017019273.us-west-2.elb.amazonaws.com} for ingress openshift-ingress-operator/test1: failed to update alias in zone ZYPD4B0DM135S: couldn't update DNS record in zone ZYPD4B0DM135S: InvalidChangeBatch: [Tried to delete resource record set 
[name='\\052.tests1.danehans.devcluster.openshift.com.', type='A'] but it was not found]\n\tstatus code: 400, request id: 4e6f07e1-3a4e-11e9-95f1-3f28621c9566, failed to delete DNS record &{{Z3URY6TWQ91KVV map[]} ALIAS *.tests1.danehans.devcluster.openshift.com -> a53a055213a4c11e9adeb0a6b3bd6b3e-1017019273.us-west-2.elb.amazonaws.com} for ingress openshift-ingress-operator/test1: failed to update alias in zone Z3URY6TWQ91KVV: couldn't update DNS record in zone Z3URY6TWQ91KVV: InvalidChangeBatch: [Tried to delete resource record set [name='\\052.tests1.danehans.devcluster.openshift.com.', type='A'] but it was not found]\n\tstatus code: 400, request id: 4ebcd95b-3a4e-11e9-9626-4f65dfa62605]"}]}
github.com/openshift/cluster-ingress-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
2019-02-26T21:12:42.723-0800	INFO	operator.controller	controller/controller.go:82	reconciling	{"request": "openshift-ingress-operator/test0"}
2019-02-26T21:12:43.356-0800	INFO	operator.dns	aws/dns.go:271	skipping DNS record update	{"record": {"Zone":{"tags":{"Name":"danehans-9nggd-int","kubernetes.io/cluster/danehans-9nggd":"owned"}},"Type":"ALIAS","Alias":{"Domain":"*.tests.danehans.devcluster.openshift.com","Target":"a56efc2813a4c11e9adeb0a6b3bd6b3e-1598625165.us-west-2.elb.amazonaws.com"}}}
2019-02-26T21:12:43.356-0800	INFO	operator.controller	controller/controller_dns.go:26	ensured DNS record for clusteringress	{"namespace": "openshift-ingress-operator", "name": "test0", "record": {"Zone":{"tags":{"Name":"danehans-9nggd-int","kubernetes.io/cluster/danehans-9nggd":"owned"}},"Type":"ALIAS","Alias":{"Domain":"*.tests.danehans.devcluster.openshift.com","Target":"a56efc2813a4c11e9adeb0a6b3bd6b3e-1598625165.us-west-2.elb.amazonaws.com"}}}
2019-02-26T21:12:43.356-0800	INFO	operator.dns	aws/dns.go:271	skipping DNS record update	{"record": {"Zone":{"id":"Z3URY6TWQ91KVV"},"Type":"ALIAS","Alias":{"Domain":"*.tests.danehans.devcluster.openshift.com","Target":"a56efc2813a4c11e9adeb0a6b3bd6b3e-1598625165.us-west-2.elb.amazonaws.com"}}}
2019-02-26T21:12:43.356-0800	INFO	operator.controller	controller/controller_dns.go:26	ensured DNS record for clusteringress	{"namespace": "openshift-ingress-operator", "name": "test0", "record": {"Zone":{"id":"Z3URY6TWQ91KVV"},"Type":"ALIAS","Alias":{"Domain":"*.tests.danehans.devcluster.openshift.com","Target":"a56efc2813a4c11e9adeb0a6b3bd6b3e-1598625165.us-west-2.elb.amazonaws.com"}}}
2019-02-26T21:12:44.164-0800	DEBUG	operator.init.kubebuilder.controller

Comment 1 Daneyon Hansen 2019-02-27 17:02:00 UTC
The issue also exists if you create a clusteringress and the service LoadBalancer EXTERNAL-IP gets stuck in pending:

$ oc get svc -n openshift-ingress | grep test1
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
router-internal-test1     ClusterIP      172.30.161.222   <none>                                                                   80/TCP,443/TCP,1936/TCP      6m31s
router-test1              LoadBalancer   172.30.87.116    <pending>                                                                80:32051/TCP,443:30474/TCP   6m33s

You are unable to delete the associated clusteringress.

2019-02-27T08:58:08.204-0800	INFO	operator.controller	controller/controller.go:82	reconciling	{"request": "openshift-ingress-operator/test1"}
2019-02-27T08:58:09.137-0800	ERROR	operator.init.kubebuilder.controller	controller/controller.go:217	Reconciler error	{"controller": "operator-controller", "request": "openshift-ingress-operator/test1", "error": "failed to ensure ingress deletion: failed to finalize load balancer service for test1: no load balancer is assigned to service openshift-ingress/router-test1", "errorCauses": [{"error": "failed to ensure ingress deletion: failed to finalize load balancer service for test1: no load balancer is assigned to service openshift-ingress/router-test1"}]}
github.com/openshift/cluster-ingress-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until
	/Users/daneyonhansen/code/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88

Comment 3 Daneyon Hansen 2019-03-13 00:31:40 UTC
If ingresscontroller 'delete' occurs during a DNS update:
2019-03-12T16:31:11.638-0700	ERROR	operator.init.kubebuilder.controller	controller/controller.go:217	Reconciler error	{"controller": "operator-controller", "request": "openshift-ingress-operator/test0", "error": "failed to ensure ingress deletion: failed to finalize load balancer service for test0: [failed to delete DNS record &{{ map[Name:danehans-wpwp4-int kubernetes.io/cluster/danehans-wpwp4:owned]} ALIAS *.tests0.danehans.devcluster.openshift.com -> adfb4c320451e11e99bb706fb156f538-250013580.us-west-2.elb.amazonaws.com} for ingress openshift-ingress-operator/test0: failed to delete alias in zone Z1O11RGK05PNBT: couldn't update DNS record in zone Z1O11RGK05PNBT: InvalidChangeBatch: [Tried to delete resource record set [name='\\052.tests0.danehans.devcluster.openshift.com.', type='A'] but it was not found]\n\tstatus code: 400, request id: eb4454ee-451e-11e9-b04e-9b8331d6fb61, failed to delete DNS record &{{Z3URY6TWQ91KVV map[]} ALIAS *.tests0.danehans.devcluster.openshift.com -> adfb4c320451e11e99bb706fb156f538-250013580.us-west-2.elb.amazonaws.com} for ingress openshift-ingress-operator/test0: failed to delete alias in zone Z3URY6TWQ91KVV: couldn't update DNS record in zone Z3URY6TWQ91KVV: InvalidChangeBatch: [Tried to delete resource record set [name='\\052.tests0.danehans.devcluster.openshift.com.', type='A'] but it was not found]\n\tstatus code: 400, request id: eb7bdfdf-451e-11e9-83a1-a146a63c1bc2]", "errorCauses": [{"error": "failed to ensure ingress deletion: failed to finalize load balancer service for test0: [failed to delete DNS record &{{ map[Name:danehans-wpwp4-int kubernetes.io/cluster/danehans-wpwp4:owned]} ALIAS *.tests0.danehans.devcluster.openshift.com -> adfb4c320451e11e99bb706fb156f538-250013580.us-west-2.elb.amazonaws.com} for ingress openshift-ingress-operator/test0: failed to delete alias in zone Z1O11RGK05PNBT: couldn't update DNS record in zone Z1O11RGK05PNBT: InvalidChangeBatch: [Tried to delete resource record set 
[name='\\052.tests0.danehans.devcluster.openshift.com.', type='A'] but it was not found]\n\tstatus code: 400, request id: eb4454ee-451e-11e9-b04e-9b8331d6fb61, failed to delete DNS record &{{Z3URY6TWQ91KVV map[]} ALIAS *.tests0.danehans.devcluster.openshift.com -> adfb4c320451e11e99bb706fb156f538-250013580.us-west-2.elb.amazonaws.com} for ingress openshift-ingress-operator/test0: failed to delete alias in zone Z3URY6TWQ91KVV: couldn't update DNS record in zone Z3URY6TWQ91KVV: InvalidChangeBatch: [Tried to delete resource record set [name='\\052.tests0.danehans.devcluster.openshift.com.', type='A'] but it was not found]\n\tstatus code: 400, request id: eb7bdfdf-451e-11e9-83a1-a146a63c1bc2]"}]}

If ingresscontroller 'delete' occurs during a finalize:
2019-03-12T17:24:46.296-0700	ERROR	operator.init.kubebuilder.controller	controller/controller.go:217	Reconciler error	{"controller": "operator-controller", "request": "openshift-ingress-operator/test0", "error": "failed to ensure ingress deletion: failed to finalize load balancer service for test0: no load balancer is assigned to service openshift-ingress/router-test0", "errorCauses": [{"error": "failed to ensure ingress deletion: failed to finalize load balancer service for test0: no load balancer is assigned to service openshift-ingress/router-test0"}]}

Comment 4 Daneyon Hansen 2019-03-13 15:28:08 UTC
PR to fix bug: https://github.com/openshift/cluster-ingress-operator/pull/164

Comment 6 Hongan Li 2019-03-20 06:07:58 UTC
Verified with 4.0.0-0.nightly-2019-03-19-004004; the issue has been fixed.

Comment 8 errata-xmlrpc 2019-06-04 10:44:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758