Created attachment 1657732 [details]
LB in Azure console

Description of problem:
The default ingresscontroller is recreated after you delete it, but if you keep deleting it the deletion eventually gets stuck.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-03-163409

How reproducible:
sometimes

Steps to Reproduce:
1. delete the default ingresscontroller
2. wait for the ingresscontroller to be recreated
3. delete the default ingresscontroller again (the LB svc might still be in pending status)

Actual results:
After repeating the above steps several times, the deletion gets stuck. Checking the resources in the openshift-ingress namespace, only the LB service is still there; the others (deployment, pods, etc.) have been removed.

$ oc -n openshift-ingress get all
NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/router-default   LoadBalancer   172.30.103.74   <pending>     80:30118/TCP,443:30504/TCP   19h

Checking the Azure console, the LB is still there (see attachment).

Expected results:
Deleting the ingresscontroller should not get stuck (even if the LoadBalancer service is in pending status).

Additional info:

$ oc -n openshift-ingress get svc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2020-02-04T07:30:32Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2020-02-04T07:31:08Z"
    finalizers:
    - service.kubernetes.io/load-balancer-cleanup
    labels:
      app: router
      ingresscontroller.operator.openshift.io/owning-ingresscontroller: default
      router: router-default
    name: router-default
    namespace: openshift-ingress
    ownerReferences:
    - apiVersion: apps/v1
      controller: true
      kind: Deployment
      name: router-default
      uid: 8c194624-a5ef-413d-9647-d4c57cb4f0d1
    resourceVersion: "128124"
    selfLink: /api/v1/namespaces/openshift-ingress/services/router-default
    uid: 7d70fa44-e1dd-4637-bee9-c95b9ab0464a
  spec:
    clusterIP: 172.30.103.74
    externalTrafficPolicy: Local
    healthCheckNodePort: 32374
    ports:
    - name: http
      nodePort: 30118
      port: 80
      protocol: TCP
      targetPort: http
    - name: https
      nodePort: 30504
      port: 443
      protocol: TCP
      targetPort: https
    selector:
      ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
    sessionAffinity: None
    type: LoadBalancer
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc -n openshift-ingress-operator get ingresscontroller -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    creationTimestamp: "2020-02-04T07:30:32Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2020-02-04T07:31:00Z"
    finalizers:
    - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
    generation: 2
    name: default
    namespace: openshift-ingress-operator
    resourceVersion: "128037"
    selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
    uid: cdf90ff4-8823-4f77-aa38-07bb3ba4fbe0
  spec:
    replicas: 2
  status:
    availableReplicas: 0
    conditions:
    - lastTransitionTime: "2020-02-04T07:30:32Z"
      reason: Valid
      status: "True"
      type: Admitted
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: 'The deployment is unavailable: Deployment does not have minimum availability.'
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: 'The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.'
      reason: DeploymentUnavailable
      status: "True"
      type: DeploymentDegraded
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: The endpoint publishing strategy supports a managed load balancer
      reason: WantedByEndpointPublishingStrategy
      status: "True"
      type: LoadBalancerManaged
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: The LoadBalancer service is pending
      reason: LoadBalancerPending
      status: "False"
      type: LoadBalancerReady
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: DNS management is supported and zones are specified in the cluster DNS config.
      reason: Normal
      status: "True"
      type: DNSManaged
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      message: The wildcard record resource was not found.
      reason: RecordNotFound
      status: "False"
      type: DNSReady
    - lastTransitionTime: "2020-02-04T07:30:33Z"
      status: "False"
      type: Degraded
    domain: apps.hongli-az44.qe.azure.devcluster.openshift.com
    endpointPublishingStrategy:
      loadBalancer:
        scope: External
      type: LoadBalancerService
    observedGeneration: 1
    selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
    tlsProfile:
      ciphers:
      - TLS_AES_128_GCM_SHA256
      - TLS_AES_256_GCM_SHA384
      - TLS_CHACHA20_POLY1305_SHA256
      - ECDHE-ECDSA-AES128-GCM-SHA256
      - ECDHE-RSA-AES128-GCM-SHA256
      - ECDHE-ECDSA-AES256-GCM-SHA384
      - ECDHE-RSA-AES256-GCM-SHA384
      - ECDHE-ECDSA-CHACHA20-POLY1305
      - ECDHE-RSA-CHACHA20-POLY1305
      - DHE-RSA-AES128-GCM-SHA256
      - DHE-RSA-AES256-GCM-SHA384
      minTLSVersion: VersionTLS12
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
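For anyone who needs to unblock a cluster stuck in this state, a possible manual workaround (not the fix, and based only on the Service YAML above) is to inspect and then clear the stuck cleanup finalizer by hand. Note that this skips the cloud provider's load-balancer cleanup, so the Azure LB shown in the attachment may still need to be deleted manually:

# Confirm the Service is only being held by the cleanup finalizer.
$ oc -n openshift-ingress get svc router-default \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

# Workaround only: drop the finalizer so the Service (and then the
# ingresscontroller) can finish deleting.
$ oc -n openshift-ingress patch svc router-default --type=merge \
    -p '{"metadata":{"finalizers":null}}'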
I reproduced this issue on OCP 4.3 as well, but only on one of the latest builds:

works fine: 4.3.0-0.ci-2020-02-11-113848
fails:      4.3.0-0.ci-2020-02-12-231746

(builds taken from https://openshift-release.svc.ci.openshift.org/)

We have a LoadBalancer service, and when we try to delete the namespace that contains this service, the deletion gets stuck and the namespace is not removed even after one hour.
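For the stuck-namespace variant, a generic way to see what is still holding the namespace (not specific to this bug; the namespace name below is a placeholder) is to look at the namespace deletion conditions and then list the leftover objects:

# A terminating namespace records in its status conditions why it cannot
# finish deleting.
$ oc get namespace <stuck-namespace> \
    -o jsonpath='{range .status.conditions[*]}{.type}: {.message}{"\n"}{end}'

# List every namespaced object still present; the LoadBalancer Service holding
# the load-balancer-cleanup finalizer should show up here.
$ oc api-resources --verbs=list --namespaced -o name \
    | xargs -n 1 oc get --show-kind --ignore-not-found -n <stuck-namespace>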
OCP 4.3 is based on Kubernetes 1.16, which enabled finalizer protection for service load balancers; that is the suspected cause of this issue. Once https://github.com/openshift/origin/pull/24532 is verified to fix the problem on 4.4, we can backport it to 4.3.
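For context on that suspected cause, the finalizer protection can be seen directly on the Service: with it enabled, the service controller puts a cleanup finalizer on LoadBalancer Services, and deletion is held until the cloud load balancer is confirmed gone. A quick check, using the namespace and name from this report:

# With service LB finalizer protection active (Kubernetes 1.16+), this should
# print: service.kubernetes.io/load-balancer-cleanup
$ oc -n openshift-ingress get svc router-default \
    -o jsonpath='{.metadata.finalizers[*]}{"\n"}'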
The issue here is a regression that was introduced in 4.3. As it is not a new regression in 4.4, it is not a blocker for 4.4. We will fix it in 4.5 first and then fix it in the 4.3 and 4.4 z-streams.
A proposed fix is posted in https://github.com/openshift/origin/pull/24532, awaiting approval.
Ben, can you approve https://github.com/openshift/origin/pull/24532?
https://github.com/openshift/origin/pull/24532 has been approved.
Verified with 4.5.0-0.nightly-2020-05-25-212133; the issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
Hi, our customer is facing the same issue in their OCP 4.6 environment, and I managed to reproduce it in an OCP 4.6 AWS environment too.

$ oc version
Client Version: 4.6.16
Server Version: 4.6.16
Kubernetes Version: v1.19.0+e49167a

$ oc get svc
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP                                                                     PORT(S)                      AGE
router-default            LoadBalancer   172.30.50.42    a38b88ea9b0534b0588d01546c878825-1822330431.ap-northeast-1.elb.amazonaws.com   80:32048/TCP,443:32384/TCP   2d
router-internal-default   ClusterIP      172.30.244.90   <none>                                                                          80/TCP,443/TCP,1936/TCP      2d

$ oc delete svc router-default
service "router-default" deleted
^C

$ oc get svc
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.50.42    <pending>     80:32048/TCP,443:32384/TCP   2d
router-internal-default   ClusterIP      172.30.244.90   <none>        80/TCP,443/TCP,1936/TCP      2d

$ oc get events
<..snip..>
4s   Warning   SyncLoadBalancerFailed   service/router-default   Error syncing load balancer: failed to add load balancer cleanup finalizer: Service "router-default" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"service.kubernetes.io/load-balancer-cleanup"}
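A quick check that confirms the state the SyncLoadBalancerFailed event describes (the Service already has a deletionTimestamp, so the cloud controller is no longer allowed to add its cleanup finalizer); the namespace is assumed to be openshift-ingress as in the original report:

$ oc -n openshift-ingress get svc router-default \
    -o jsonpath='deletionTimestamp: {.metadata.deletionTimestamp}{"\n"}finalizers: {.metadata.finalizers}{"\n"}'
# A set deletionTimestamp together with an empty finalizer list matches the
# "no new finalizers can be added if the object is being deleted" error above.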
Hi! The problem you are describing looks like bug 1914127.
Hi, thank you for the information!