Bug 1772879

Summary:	Finalizer of Loadbalancer referenced by CR is not removed upon deletion (sessionAffinity: ClientIP)
Product:	OpenShift Container Platform	Reporter:	Petr Kremensky <pkremens>
Component:	Networking	Assignee:	Dan Mace <dmace>
Networking sub component:	router	QA Contact:	Hongan Li <hongli>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	low
Priority:	low	CC:	aos-bugs, eparis, jokerman, mchoma, mfojtik
Version:	4.3.0	Keywords:	Reopened
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-04-06 14:23:00 UTC	Type:	Bug

Description Petr Kremensky 2019-11-15 12:35:24 UTC

Description of problem:
The service.kubernetes.io/load-balancer-cleanup finalizer is not removed from the LoadBalancer service upon CR removal if sessionAffinity is ClientIP.

Regression against previous releases
 * crc version: 1.1.0+95966a9; OpenShift version: 4.2.2 (embedded in binary)
 * OCP 4.1.20

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-11-115927 (https://projects.engineering.redhat.com/browse/LPINTEROP-680)

How reproducible:
Always

Steps to Reproduce:
Log in to OCP cluster as a user with ! cluster-admin permission !

1. Install the Wildfly operator
git clone https://github.com/wildfly/wildfly-operator.git
cd wildfly-operator
oc apply -f deploy/service_account.yaml
oc apply -f deploy/role.yaml
oc apply -f deploy/role_binding.yaml
oc apply -f deploy/crds/wildfly_v1alpha1_wildflyserver_crd.yaml
oc apply -f deploy/operator.yaml

# Make sure that the operator pod is up and running
$ oc get pods -w
NAME                                READY   STATUS 
wildfly-operator-7f555b86d5-2947c   1/1     Running

2. Create WildFlyServer CR - https://github.com/wildfly/wildfly-operator/blob/master/doc/apis.adoc
cat << EOF > wildfly-operator.yaml
apiVersion: wildfly.org/v1alpha1
kind: WildFlyServer
metadata:
  name: wildfly
spec:
  applicationImage: "quay.io/wildfly/wildfly-centos7:18.0"
  replicas: 1
  sessionAffinity: true
EOF

oc apply -f wildfly-operator.yaml

# Wait until the wildfly pod is up & running
$ oc get pods/wildfly-0 -w
NAME        READY   STATUS
wildfly-0   1/1     Running

3. Delete the CR object
$ oc delete wildflyserver wildfly 
wildflyserver.wildfly.org "wildfly" deleted

Actual results:
The service still exists after the CR deletion, and oc delete gets stuck because of a finalizer. The service cannot be removed without a manual edit (removing the finalizers field).

Expected results:
$ oc get service
No resources found.

Additional info:
Probably related to https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#garbage-collecting-load-balancers

When sessionAffinity is None, the finalizer is removed and the service is deleted along with the CR.

############## # sessionAffinity: true 
$ oc get service wildfly-loadbalancer -o yaml
metadata:
...
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
...
spec:
  clusterIP: 172.30.211.55
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 31746
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/managed-by: wildfly-operator
    app.kubernetes.io/name: wildfly
    app.openshift.io/runtime: wildfly
    wildfly.org/operated-by-loadbalancer: active
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: LoadBalancer
############## # sessionAffinity: false
$ oc get service wildfly-loadbalancer -o yaml

metadata:
...
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
...
spec:
  clusterIP: 172.30.197.110
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 32128
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/managed-by: wildfly-operator
    app.kubernetes.io/name: wildfly
    app.openshift.io/runtime: wildfly
    wildfly.org/operated-by-loadbalancer: active
  sessionAffinity: None
  type: LoadBalancer

Comment 1 Dan Mace 2019-12-09 16:26:20 UTC

How was this cluster created and on what platform?

If it wasn't using the OpenShift installer with a supported IPI/UPI configuration, it's highly unlikely we're going to take any action here.

Comment 2 Petr Kremensky 2019-12-10 09:47:36 UTC

Hi, 

At the time I reported the issue, the cluster had been created by the FlexyWrapper installer on AWS, version 4.3.0-0.nightly-2019-11-11-115927.

I have now retested on a cluster created by the OpenShift installer on OpenStack, version 4.3.0-0.nightly-2019-12-10-034925. I am no longer able to reproduce the issue on this setup, so we can close this.

Comment 3 mchoma 2020-01-08 07:47:23 UTC

I am reopening this. This looks like an AWS-specific bug, since a Service of type LoadBalancer uses the external load balancer of the cloud provider.

We use the Flexy wrapper tool [1], which is a wrapper around the Flexy tool that the OpenShift QE team uses as well.

[1] https://docs.engineering.redhat.com/pages/viewpage.action?pageId=63298965
[2] https://mojo.redhat.com/docs/DOC-1074220

Comment 4 mchoma 2020-01-08 08:32:01 UTC

This is a minimal reproducer for the problem:

1.
cat << EOF > service.yaml
apiVersion: v1
kind: Service
metadata:
  name: example7
spec:
  clusterIP: 172.30.211.53
  ports:
  - name: http
    nodePort: 31744
    port: 8080
    protocol: TCP
    targetPort: 8080
  sessionAffinity: ClientIP
  type: LoadBalancer
EOF

oc apply -f service.yaml

2. 
oc delete services example7
service "example7" deleted
<Prompt>

3.
However, the service is not actually deleted.
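
One way to confirm that the Service is stuck rather than gone is to look for a deletion timestamp together with a remaining finalizer. The following is a minimal sketch, assuming the example7 service from step 1; stuck_in_deletion is a hypothetical helper name, not part of oc, and it uses plain text matching rather than a YAML parser.

```shell
# Hypothetical helper: reads a Service manifest (YAML) on stdin and reports
# whether it is marked for deletion but still holds a finalizer.
stuck_in_deletion() {
  manifest=$(cat)
  if printf '%s\n' "$manifest" | grep -q '^ *deletionTimestamp:' &&
     printf '%s\n' "$manifest" | grep -q '^ *finalizers:'; then
    echo "stuck"
  else
    echo "not stuck"
  fi
}

# Usage against a live cluster (assumes the example7 service above and a
# logged-in oc client):
#   oc get service example7 -o yaml | stuck_in_deletion
```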

Comment 5 Dan Mace 2020-01-08 15:11:03 UTC

(In reply to mchoma from comment #4)
> This is minimalistic reproducer to problem
> 
> 1.
> cat << EOF > service.yaml
> apiVersion: v1
> kind: Service
> metadata:
>   name: example7
> spec:
>   clusterIP: 172.30.211.53
>   ports:
>   - name: http
>     nodePort: 31744
>     port: 8080
>     protocol: TCP
>     targetPort: 8080
>   sessionAffinity: ClientIP
>   type: LoadBalancer
> EOF
> 
> oc apply -f service.yaml
> 
> 2. 
> oc delete services example7
> service "example7" deleted
> <Prompt>
> 
> 3.
> But service is not actually deleted.

What platform are you on? How did you create the cluster and with what version of OCP?

ClientIP session affinity for LoadBalancer services isn't supported on AWS[1]. Sadly, upstream doesn't actually document which platforms do support it.

When trying this on AWS, you can see the service controller reporting failure to provision the LB by looking at events in the service's namespace:

33s         Warning   SyncLoadBalancerFailed                service/loadbalancer                              Error syncing load balancer: failed to ensure load balancer: unsupported load balancer affinity: ClientIP
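
To pull just these failures out of the event list, something like the following could be used. This is only a sketch: lb_sync_failures is a hypothetical helper, and the service name "loadbalancer" is taken from the event shown above.

```shell
# Hypothetical helper: keep only load balancer sync failures from
# `oc get events` output supplied on stdin. Prints nothing if there are none.
lb_sync_failures() {
  grep 'SyncLoadBalancerFailed' || true
}

# Usage against a live cluster (requires a logged-in oc client; adjust the
# service name for your own service):
#   oc get events --field-selector involvedObject.kind=Service,involvedObject.name=loadbalancer | lb_sync_failures
```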

In this case, the LB will perpetually fail provisioning, and the finalizer won't be removed from the Service. The only way I see to delete the service is to patch it to remove the finalizer manually. I'm not sure whether this would be considered a bug upstream in the service controller, but I think we could make a case that it is. However, given you shouldn't have created the service in the first place (as it was destined to fail), and given the workaround (patch away the finalizer), it seems low priority.
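
The manual workaround mentioned above (patching away the finalizer) could look roughly like this. It is a sketch under assumptions: remove_service_finalizers is a hypothetical wrapper name, and the merge patch clears every finalizer on the Service, not only service.kubernetes.io/load-balancer-cleanup.

```shell
# Merge patch that clears every finalizer on the object.
finalizer_patch() {
  printf '%s' '{"metadata":{"finalizers":null}}'
}

# Hypothetical wrapper: remove the finalizers from the named Service so that
# a pending deletion can complete. Requires a logged-in oc client.
remove_service_finalizers() {
  oc patch service "$1" --type=merge -p "$(finalizer_patch)"
}

# Usage (assumes the example7 service from comment 4):
#   remove_service_finalizers example7
```

Since provisioning never succeeded in this scenario, there should be no cloud load balancer left behind, but that is an assumption worth verifying before removing a cleanup finalizer by hand.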

If you all want to keep the bug open for the deletion bug, I won't object, but the likelihood of us spending attention upstream for the problem is very low.

If the actual concern is getting ClientIP session affinity working on a platform where it's not currently supported upstream (e.g. AWS), that would be an issue to pursue upstream and isn't a bug in OpenShift.

Is there a reproducer where:

1. The LoadBalancer service is actually successfully provisioned
2. After successful provisioning, the service can't be deleted

That would be a higher impact problem, IMO.

[1] https://github.com/kubernetes/kubernetes/issues/13892

Comment 6 mchoma 2020-01-09 08:14:28 UTC

Our cluster is version 4.3.0-0.nightly-2020-01-06-101556 and is running on AWS.

I can confirm I see the same error event on OCP 4.2. On OCP 4.2 the finalizer was not present, so the Service was deleted and we didn't notice the problem.

To be honest, I had seen that event before, but it didn't ring a bell that it was the source of the problem.

We do not have a strong enough use case for this combination of parameters on AWS to file an RFE for kubernetes/openshift; we just tried it because it was a possible combination.

So now the problem is: how should I, as a user, know that this combination of parameters is not supported on AWS? I think that should be documented properly somewhere.

Couldn't OpenShift be smart enough to do one of the following?
- Validate the Service YAML input and prevent the user from creating a broken service, meaning that on AWS the combination of LoadBalancer and ClientIP is prohibited.
- Stop retrying the creation of the AWS load balancer, as it will never succeed because of "unsupported load balancer affinity: ClientIP".
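
Until something like that exists, a client-side pre-flight check is one option. The sketch below is not an OpenShift feature: check_lb_affinity is a hypothetical helper that does plain text matching on a single Service manifest (no real YAML parsing), and it assumes the target cluster runs on AWS.

```shell
# Hypothetical pre-flight check: reads a single Service manifest (YAML) on
# stdin and returns non-zero if it combines type LoadBalancer with ClientIP
# session affinity.
check_lb_affinity() {
  manifest=$(cat)
  if printf '%s\n' "$manifest" | grep -Eq '^ *type: *LoadBalancer *$' &&
     printf '%s\n' "$manifest" | grep -Eq '^ *sessionAffinity: *ClientIP *$'; then
    echo "unsupported on AWS: LoadBalancer with sessionAffinity ClientIP" >&2
    return 1
  fi
  return 0
}

# Usage (requires a logged-in oc client):
#   check_lb_affinity < service.yaml && oc apply -f service.yaml
```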

[1] https://github.com/kubernetes/kubernetes/issues/13892

Comment 7 mchoma 2020-01-09 10:16:25 UTC

Interestingly, when I create a service with "sessionAffinity: None" and then update it to "sessionAffinity: ClientIP", the Service object can be deleted.
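
For anyone wanting to repeat this observation, the sequence could be sketched as follows. The exact commands used were not recorded here, so this is an assumption: affinity_patch is a hypothetical helper, and service.yaml is assumed to be the manifest from comment 4 with sessionAffinity changed to None.

```shell
# Build a merge patch that sets spec.sessionAffinity to the given value.
affinity_patch() {
  printf '{"spec":{"sessionAffinity":"%s"}}' "$1"
}

# Hypothetical sequence (requires a logged-in oc client and a service.yaml
# that declares sessionAffinity: None):
#   oc apply -f service.yaml
#   oc patch service example7 --type=merge -p "$(affinity_patch ClientIP)"
#   oc delete service example7
```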

Comment 8 Dan Mace 2020-01-09 14:24:31 UTC

(In reply to mchoma from comment #6)
> our cluster is of version 4.3.0-0.nightly-2020-01-06-101556 and is running
> on AWS
> 
> I can confirm I see the same error event on OCP 4.2. On OCP4.2 finalizer was
> not present so Service was deleted and we haven't noticed the problem.
> 
> To be honest I saw that event before. But it did not ring the bell for me
> that is the source of problem.
> 
> We do not have strong use case for this combination of parameters to work on
> AWS to create RFE for kubernetes/openshift, we just tried it because it was
> possible combination
> 
> So now problem is. How should I as a user now that this combination of
> parameters is not supported on AWS? I think that should be documented
> properly somewhere.
> 
> Couldn't openshift be smart enough ?
> - and be able to validate Service yaml input and prevent user from creating
> buggy service. It means when running on AWS combination of LoadBalancer and
> ClientIP is prohibited.
> - or does not retry of creating AWS load balancer as it wont succeed never
> because of "unsupported load balancer affinity: ClientIP"
> 
> [1] https://github.com/kubernetes/kubernetes/issues/13892

I agree the experience needs to be improved. Keep in mind the problem is with Kubernetes itself, in the service controller and cloud provider implementations. You could reproduce this on a vanilla Kube installation outside of OpenShift. Any improvement here will need to start upstream.

Comment 9 Dan Mace 2020-01-31 15:16:42 UTC

We can keep this open for now, but I doubt we're willing to block a release on it. Moving to 4.5.

Comment 10 Dan Mace 2020-04-06 14:23:00 UTC

Given the existing events Kube reports (https://bugzilla.redhat.com/show_bug.cgi?id=1772879#c5) and the low overall impact, I think it's unlikely we're going to dedicate any resources to improving the upstream status reporting for this issue in the foreseeable future. I'm going to close the bug to avoid setting false expectations of action on our part.

If there's some strong business justification, we can discuss and re-open later.