Description of problem:
The service.kubernetes.io/load-balancer-cleanup finalizer is not removed from a LoadBalancer service upon CR removal if sessionAffinity: ClientIP. Regression against previous releases:
* crc version: 1.1.0+95966a9; OpenShift version: 4.2.2 (embedded in binary)
* OCP 4.1.20

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-11-115927 (https://projects.engineering.redhat.com/browse/LPINTEROP-680)

How reproducible:
Always

Steps to Reproduce:

Log in to the OCP cluster as a user with cluster-admin permission.

1. Install the WildFly operator:

git clone https://github.com/wildfly/wildfly-operator.git
cd wildfly-operator
oc apply -f deploy/service_account.yaml
oc apply -f deploy/role.yaml
oc apply -f deploy/role_binding.yaml
oc apply -f deploy/crds/wildfly_v1alpha1_wildflyserver_crd.yaml
oc apply -f deploy/operator.yaml

# Make sure that the operator pod is up and running
$ oc get pods -w
NAME                                READY   STATUS
wildfly-operator-7f555b86d5-2947c   1/1     Running

2. Create a WildFlyServer CR (see https://github.com/wildfly/wildfly-operator/blob/master/doc/apis.adoc):

cat << EOF > wildfly-operator.yaml
apiVersion: wildfly.org/v1alpha1
kind: WildFlyServer
metadata:
  name: wildfly
spec:
  applicationImage: "quay.io/wildfly/wildfly-centos7:18.0"
  replicas: 1
  sessionAffinity: true
EOF
oc apply -f wildfly-operator.yaml

# Wait until the wildfly pod is up and running
$ oc get pods/wildfly-0 -w
NAME        READY   STATUS
wildfly-0   1/1     Running

3. Delete the CR object:

$ oc delete wildflyserver wildfly
wildflyserver.wildfly.org "wildfly" deleted

Actual results:
The service is still there after the CR deletion, and oc delete gets stuck due to a finalizer; the service cannot be removed without manually editing it to remove the finalizer field.

Expected results:
$ oc get service
No resources found.
Additional info:
Probably related to https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#garbage-collecting-load-balancers

In case sessionAffinity: None, the finalizer is removed and the service is deleted with the CR removal.

##############
# sessionAffinity: true
$ oc get service wildfly-loadbalancer -o yaml
metadata:
  ...
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  ...
spec:
  clusterIP: 172.30.211.55
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 31746
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/managed-by: wildfly-operator
    app.kubernetes.io/name: wildfly
    app.openshift.io/runtime: wildfly
    wildfly.org/operated-by-loadbalancer: active
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: LoadBalancer

##############
# sessionAffinity: false
$ oc get service wildfly-loadbalancer -o yaml
metadata:
  ...
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  ...
spec:
  clusterIP: 172.30.197.110
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 32128
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/managed-by: wildfly-operator
    app.kubernetes.io/name: wildfly
    app.openshift.io/runtime: wildfly
    wildfly.org/operated-by-loadbalancer: active
  sessionAffinity: None
  type: LoadBalancer
How was this cluster created and on what platform? If it wasn't created using the OpenShift installer with a supported IPI/UPI configuration, it's highly unlikely we're going to take any action here.
Hi, at the time I reported the issue, the cluster was created by the FlexyWrapper installer on AWS, version 4.3.0-0.nightly-2019-11-11-115927.

I retested now on a cluster created by the OpenShift Installer on OpenStack, version 4.3.0-0.nightly-2019-12-10-034925. I'm no longer able to reproduce the issue on this setup, thus we can close this.
I am reopening this. This would be an AWS-specific bug, as Services of type LoadBalancer use the external load balancer of the cloud provider.

We use the Flexy wrapper tool [1], which is a wrapper around the Flexy tool, which is used by the OpenShift QE team as well.

[1] https://docs.engineering.redhat.com/pages/viewpage.action?pageId=63298965
[2] https://mojo.redhat.com/docs/DOC-1074220
This is a minimalistic reproducer for the problem:

1.
cat << EOF > service.yaml
apiVersion: v1
kind: Service
metadata:
  name: example7
spec:
  clusterIP: 172.30.211.53
  ports:
  - name: http
    nodePort: 31744
    port: 8080
    protocol: TCP
    targetPort: 8080
  sessionAffinity: ClientIP
  type: LoadBalancer
EOF

oc apply -f service.yaml

2.
oc delete services example7
service "example7" deleted
<Prompt>

3.
But the service is not actually deleted.
(In reply to mchoma from comment #4)
> This is minimalistic reproducer to problem
>
> 1.
> cat << EOF > service.yaml
> apiVersion: v1
> kind: Service
> metadata:
>   name: example7
> spec:
>   clusterIP: 172.30.211.53
>   ports:
>   - name: http
>     nodePort: 31744
>     port: 8080
>     protocol: TCP
>     targetPort: 8080
>   sessionAffinity: ClientIP
>   type: LoadBalancer
> EOF
>
> oc apply -f service.yaml
>
> 2.
> oc delete services example7
> service "example7" deleted
> <Prompt>
>
> 3.
> But service is not actually deleted.

What platform are you on? How did you create the cluster, and with what version of OCP?

ClientIP session affinity for LoadBalancer services isn't supported on AWS [1]. Sadly, it's not actually documented upstream on what platforms it is supported.

When trying this on AWS, you can see the service controller reporting failure to provision the LB by looking at events in the service's namespace:

33s  Warning  SyncLoadBalancerFailed  service/loadbalancer  Error syncing load balancer: failed to ensure load balancer: unsupported load balancer affinity: ClientIP

In this case, the LB will perpetually fail provisioning, and the finalizer won't be removed from the Service. The only way I see to delete the service is to patch it to remove the finalizer manually.

I'm not sure whether this would be considered a bug upstream in the service controller, but I think we could make a case that it is. However, given you shouldn't have created the service in the first place (as it was destined to fail), and given the workaround (patch away the finalizer), it seems low priority. If you all want to keep the bug open for the deletion bug, I won't object, but the likelihood of us spending attention upstream on the problem is very low.

If the actual concern is getting ClientIP session affinity working on a platform where it's not currently supported upstream (e.g. AWS), that would be an issue to pursue upstream and isn't a bug in OpenShift.

Is there a reproducer where:

1. The LoadBalancer service is actually successfully provisioned
2. After successful provisioning, the service can't be deleted

That would be a higher-impact problem, IMO.

[1] https://github.com/kubernetes/kubernetes/issues/13892
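The manual workaround mentioned above (patching away the finalizer) can be sketched as follows. This is only an illustrative sketch using the `example7` service name from the reproducer; the JSON validation step is a local sanity check, and the actual `oc patch` call (shown as a comment) assumes a logged-in cluster with sufficient permissions:

```shell
# Merge patch that empties metadata.finalizers on the stuck Service.
PATCH='{"metadata":{"finalizers":[]}}'

# Sanity-check that the patch body is well-formed JSON before sending it.
python3 -c 'import json, sys; json.loads(sys.argv[1]); print("patch is valid JSON")' "$PATCH"

# Against the cluster holding the stuck service (not run here):
#   oc patch service example7 --type=merge -p "$PATCH"
#   oc get service example7   # deletion should proceed once the finalizer is gone
```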
Our cluster is of version 4.3.0-0.nightly-2020-01-06-101556 and is running on AWS.

I can confirm I see the same error event on OCP 4.2. On OCP 4.2 the finalizer was not present, so the Service was deleted and we hadn't noticed the problem.

To be honest, I saw that event before, but it did not ring a bell for me that it was the source of the problem.

We do not have a strong use case for this combination of parameters to work on AWS that would justify creating an RFE for Kubernetes/OpenShift; we just tried it because it was a possible combination.

So now the problem is: how should I as a user know that this combination of parameters is not supported on AWS? I think that should be documented properly somewhere.

Couldn't OpenShift be smart enough to:
- validate the Service YAML input and prevent the user from creating a buggy service? That means that when running on AWS, the combination of LoadBalancer and ClientIP is prohibited.
- or not retry creating the AWS load balancer, as it will never succeed because of "unsupported load balancer affinity: ClientIP"?

[1] https://github.com/kubernetes/kubernetes/issues/13892
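The first suggestion (rejecting the bad combination up front) could be prototyped outside the API server as a simple pre-flight check on the manifest before applying it. This is a hypothetical sketch, not an existing oc feature (a real implementation would belong in a validating admission webhook); the `check_service` function and the `/tmp/service-check.yaml` path are illustrative names:

```shell
# Hypothetical pre-flight check: flag a manifest that combines
# type: LoadBalancer with sessionAffinity: ClientIP (unsupported on AWS).
check_service() {
  if grep -qE '^[[:space:]]*type: LoadBalancer' "$1" \
     && grep -qE '^[[:space:]]*sessionAffinity: ClientIP' "$1"; then
    echo "REJECT: ClientIP session affinity is unsupported for LoadBalancer services on AWS"
  else
    echo "OK"
  fi
}

# Manifest reduced to the two relevant spec fields from the reproducer.
cat << EOF > /tmp/service-check.yaml
apiVersion: v1
kind: Service
metadata:
  name: example7
spec:
  sessionAffinity: ClientIP
  type: LoadBalancer
EOF

check_service /tmp/service-check.yaml
```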
What is interesting: when I create a "sessionAffinity: None" service and then update it to "sessionAffinity: ClientIP", the Service object can be deleted.
(In reply to mchoma from comment #6)
> our cluster is of version 4.3.0-0.nightly-2020-01-06-101556 and is running
> on AWS
>
> I can confirm I see the same error event on OCP 4.2. On OCP4.2 finalizer was
> not present so Service was deleted and we haven't noticed the problem.
>
> To be honest I saw that event before. But it did not ring the bell for me
> that is the source of problem.
>
> We do not have strong use case for this combination of parameters to work on
> AWS to create RFE for kubernetes/openshift, we just tried it because it was
> possible combination
>
> So now problem is. How should I as a user now that this combination of
> parameters is not supported on AWS? I think that should be documented
> properly somewhere.
>
> Couldn't openshift be smart enough ?
> - and be able to validate Service yaml input and prevent user from creating
> buggy service. It means when running on AWS combination of LoadBalancer and
> ClientIP is prohibited.
> - or does not retry of creating AWS load balancer as it wont succeed never
> because of "unsupported load balancer affinity: ClientIP"
>
> [1] https://github.com/kubernetes/kubernetes/issues/13892

I agree the experience needs to be improved. Keep in mind the problem is with Kubernetes itself, in the service controller and cloud provider implementations. You could reproduce this on a vanilla Kube installation outside of OpenShift. Any improvement here will need to start upstream.
We can keep this open for now, but I doubt we're willing to block a release on it. Moving to 4.5.
Given the existing events Kube reports (https://bugzilla.redhat.com/show_bug.cgi?id=1772879#c5) and the low overall impact, I think it's unlikely we're going to dedicate any resources to improving the upstream status reporting for this issue in the foreseeable future. I'm going to close the bug to avoid setting false expectations of action on our part. If there's some strong business justification, we can discuss and re-open later.