Bug 1798282

Summary: deletion gets stuck when repeatedly deleting the default ingresscontroller
Product: OpenShift Container Platform
Reporter: Hongan Li <hongli>
Component: Networking
Networking sub component: router
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: medium
CC: afield, aos-bugs, bbennett, mgencur, mmasters, yhe
Version: 4.3.0
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The service controller was changed in OpenShift 3.10 to prevent unnecessary "GetLoadBalancer" cloud-provider API calls when non-"LoadBalancer" Services were created or deleted. A subsequent change in Kubernetes 1.15 prevented the unnecessary API calls in a different way. An interaction between these two changes broke the service controller's clean-up logic for Services of type "LoadBalancer".
Consequence: Deletion of a "LoadBalancer"-type Service (or of an IngressController with the "LoadBalancerService" endpoint publishing strategy type) would never complete; the Service would remain present indefinitely, in "pending" state.
Fix: The change that had been added in OpenShift 3.10 was dropped.
Result: Deletion of "LoadBalancer"-type Services and of "LoadBalancerService"-type IngressControllers can now complete.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-13 17:14:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1814044    
Bug Blocks:    

Description Hongan Li 2020-02-05 03:39:50 UTC
Created attachment 1657732 [details]
LB in Azure console

Description of problem:
The default ingresscontroller is recreated after you delete it, but if you keep deleting it, the deletion eventually gets stuck.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-03-163409

How reproducible:
sometimes

Steps to Reproduce:
1. Delete the default ingresscontroller.
2. Wait until the ingresscontroller is recreated.
3. Delete the default ingresscontroller again (the LB Service might still be in pending status); see the command sketch after this list.
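
The same steps as a command sketch (names as in this report; the wait in step 2 is manual):

$ oc -n openshift-ingress-operator delete ingresscontroller default
$ oc -n openshift-ingress-operator get ingresscontroller default    # repeat until the operator has recreated it
$ oc -n openshift-ingress-operator delete ingresscontroller default # may hang while the LB Service is still pending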

Actual results:
After repeating the above steps several times, the deletion gets stuck.
Checking the resources in the openshift-ingress namespace shows that only the LB Service is still there; the others (Deployment, Pods, etc.) have been removed.

$ oc -n openshift-ingress get all
NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/router-default   LoadBalancer   172.30.103.74   <pending>     80:30118/TCP,443:30504/TCP   19h

Checking the Azure console shows that the LB is still there (see attachment).


Expected results:
Deleting the ingresscontroller should not get stuck (even if the LoadBalancer Service is still in pending status).

Additional info:
$ oc -n openshift-ingress get svc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2020-02-04T07:30:32Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2020-02-04T07:31:08Z"
    finalizers:
    - service.kubernetes.io/load-balancer-cleanup
    labels:
      app: router
      ingresscontroller.operator.openshift.io/owning-ingresscontroller: default
      router: router-default
    name: router-default
    namespace: openshift-ingress
    ownerReferences:
    - apiVersion: apps/v1
      controller: true
      kind: Deployment
      name: router-default
      uid: 8c194624-a5ef-413d-9647-d4c57cb4f0d1
    resourceVersion: "128124"
    selfLink: /api/v1/namespaces/openshift-ingress/services/router-default
    uid: 7d70fa44-e1dd-4637-bee9-c95b9ab0464a
  spec:
    clusterIP: 172.30.103.74
    externalTrafficPolicy: Local
    healthCheckNodePort: 32374
    ports:
    - name: http
      nodePort: 30118
      port: 80
      protocol: TCP
      targetPort: http
    - name: https
      nodePort: 30504
      port: 443
      protocol: TCP
      targetPort: https
    selector:
      ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
    sessionAffinity: None
    type: LoadBalancer
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
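
Note the combination of deletionTimestamp and the service.kubernetes.io/load-balancer-cleanup finalizer above: the Service cannot go away until the service controller removes that finalizer, and that clean-up step is exactly what is broken here. A quick check of just those two fields (diagnostic sketch):

$ oc -n openshift-ingress get svc router-default -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'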

$ oc -n openshift-ingress-operator get ingresscontroller -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    creationTimestamp: "2020-02-04T07:30:32Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2020-02-04T07:31:00Z"
    finalizers:
    - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
    generation: 2
    name: default
    namespace: openshift-ingress-operator
    resourceVersion: "128037"
    selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
    uid: cdf90ff4-8823-4f77-aa38-07bb3ba4fbe0
  spec:
    replicas: 2
  status:
    availableReplicas: 0
    conditions:
    - lastTransitionTime: "2020-02-04T07:30:32Z"
      reason: Valid
      status: "True"
      type: Admitted
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: 'The deployment is unavailable: Deployment does not have minimum availability.'
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: 'The deployment has Available status condition set to False (reason:
        MinimumReplicasUnavailable) with message: Deployment does not have minimum
        availability.'
      reason: DeploymentUnavailable
      status: "True"
      type: DeploymentDegraded
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: The endpoint publishing strategy supports a managed load balancer
      reason: WantedByEndpointPublishingStrategy
      status: "True"
      type: LoadBalancerManaged
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: The LoadBalancer service is pending
      reason: LoadBalancerPending
      status: "False"
      type: LoadBalancerReady
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: DNS management is supported and zones are specified in the cluster
        DNS config.
      reason: Normal
      status: "True"
      type: DNSManaged
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: The wildcard record resource was not found.
      reason: RecordNotFound
      status: "False"
      type: DNSReady
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      status: "False"
      type: Degraded
    domain: apps.hongli-az44.qe.azure.devcluster.openshift.com
    endpointPublishingStrategy:
      loadBalancer:
        scope: External
      type: LoadBalancerService
    observedGeneration: 1
    selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
    tlsProfile:
      ciphers:
      - TLS_AES_128_GCM_SHA256
      - TLS_AES_256_GCM_SHA384
      - TLS_CHACHA20_POLY1305_SHA256
      - ECDHE-ECDSA-AES128-GCM-SHA256
      - ECDHE-RSA-AES128-GCM-SHA256
      - ECDHE-ECDSA-AES256-GCM-SHA384
      - ECDHE-RSA-AES256-GCM-SHA384
      - ECDHE-ECDSA-CHACHA20-POLY1305
      - ECDHE-RSA-CHACHA20-POLY1305
      - DHE-RSA-AES128-GCM-SHA256
      - DHE-RSA-AES256-GCM-SHA384
      minTLSVersion: VersionTLS12
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
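
The IngressController above is likewise held by its ingresscontroller.operator.openshift.io/finalizer-ingresscontroller finalizer, which the ingress operator removes only after its managed resources (including the LB Service) are cleaned up. Not part of the original report, but as a manual workaround (assuming cluster-admin access, and accepting that the cloud load balancer may be orphaned and have to be deleted by hand in the Azure console), the stuck Service's finalizer can be cleared:

$ oc -n openshift-ingress patch svc router-default --type=merge -p '{"metadata":{"finalizers":[]}}'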

Comment 2 Martin Gencur 2020-02-13 11:09:46 UTC
I also reproduced this issue on OCP 4.3, but only with one of the latest builds:
works fine: 4.3.0-0.ci-2020-02-11-113848
fails:      4.3.0-0.ci-2020-02-12-231746
(build taken from https://openshift-release.svc.ci.openshift.org/)

We have a LoadBalancer Service, and when we try to delete the namespace that contains it, the deletion gets stuck; the namespace is not removed even after one hour.
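
A generic way to see which resources are holding up a terminating namespace (sketch, not specific to this bug; substitute the namespace for <namespace>):

$ oc api-resources --verbs=list --namespaced -o name | xargs -n1 oc get -n <namespace> --ignore-not-found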

Comment 3 Miciah Dashiel Butler Masters 2020-02-19 18:07:59 UTC
OCP 4.3 is based on Kubernetes 1.16, which enabled finalizer protection for service load balancers; that protection is the suspected cause of this issue.  Once https://github.com/openshift/origin/pull/24532 is verified to fix the problem on 4.4, we can backport it to 4.3.

Comment 4 Miciah Dashiel Butler Masters 2020-03-03 19:48:36 UTC
The issue here represents a regression that was introduced in 4.3.  As it is not a new regression in 4.4, it is not a blocker for 4.4.  We will fix it in 4.5 first and then in the 4.3 and 4.4 z-streams.

Comment 5 Miciah Dashiel Butler Masters 2020-05-08 19:52:06 UTC
A proposed fix is posted in https://github.com/openshift/origin/pull/24532, awaiting approval.

Comment 6 Miciah Dashiel Butler Masters 2020-05-19 18:33:54 UTC
Ben, can you approve https://github.com/openshift/origin/pull/24532?

Comment 7 Miciah Dashiel Butler Masters 2020-05-20 17:59:11 UTC
https://github.com/openshift/origin/pull/24532 has been approved.

Comment 10 Hongan Li 2020-05-26 11:17:27 UTC
Verified with 4.5.0-0.nightly-2020-05-25-212133; the issue has been fixed.

Comment 12 errata-xmlrpc 2020-07-13 17:14:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 13 yhe 2021-02-18 02:27:07 UTC
Hi, our customer is facing the same issue in their OCP 4.6 environment, and I managed to reproduce it in an OCP 4.6 AWS environment too.

$ oc version
Client Version: 4.6.16
Server Version: 4.6.16
Kubernetes Version: v1.19.0+e49167a

$ oc get svc
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP                                                                    PORT(S)                      AGE
router-default            LoadBalancer   172.30.50.42    a38b88ea9b0534b0588d01546c878825-1822330431.ap-northeast-1.elb.amazonaws.com   80:32048/TCP,443:32384/TCP   2d
router-internal-default   ClusterIP      172.30.244.90   <none>                                                                         80/TCP,443/TCP,1936/TCP      2d

$ oc delete svc router-default
service "router-default" deleted
^C

$ oc get svc
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.50.42    <pending>     80:32048/TCP,443:32384/TCP   2d
router-internal-default   ClusterIP      172.30.244.90   <none>        80/TCP,443/TCP,1936/TCP      2d

$ oc get events
<..snip..>
4s          Warning   SyncLoadBalancerFailed   service/router-default                Error syncing load balancer: failed to add load balancer cleanup finalizer: Service "router-default" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"service.kubernetes.io/load-balancer-cleanup"}
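
One way to watch for this failing sync (diagnostic sketch using the names above) is to filter events for the Service:

$ oc -n openshift-ingress get events --field-selector involvedObject.name=router-default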

Comment 15 Miciah Dashiel Butler Masters 2021-02-19 17:08:26 UTC
Hi!  The problem you are describing looks like bug 1914127.

Comment 16 yhe 2021-02-20 00:26:06 UTC
Hi, thank you for the information!