
Bug 2039507

Summary:	Second time creating a loadbalancer service (different IP) doesn't work - the endpoint isn't reachable
Product:	OpenShift Container Platform	Reporter:	Alexander Chuzhoy <sasha>
Component:	Networking	Assignee:	Mohamed Mahmoud <mmahmoud>
Networking sub component:	Metal LB	QA Contact:	Arti Sood <asood>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	unspecified	CC:	fpaoline, fpercoco, mmahmoud, ohochman
Version:	4.9	Flags:	mmahmoud: needinfo- mmahmoud: needinfo- mmahmoud: needinfo-
Target Milestone:	---
Target Release:	4.9.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-01-24 16:50:18 UTC	Type:	Bug
Bug Depends On:	2040394
Bug Blocks:

Description Alexander Chuzhoy 2022-01-11 19:43:45 UTC

Attempt to re-create a load balancer svc with a different IP fails. The endpoint remains exposed via the previously set IP.


Scenario:

I successfully used MetalLB to expose the cluster's API via an IP.

Later, I wanted to change the IP for the API, so I deleted the service and created a new one (with an IP from the same subnet). The endpoint does not become reachable via the new IP and remains reachable via the old IP.


oc get addresspool -n metallb-system api-addresspool -o yaml
apiVersion: metallb.io/v1alpha1
kind: AddressPool
metadata:
  creationTimestamp: "2022-01-11T04:07:19Z"
  generation: 3
  name: api-addresspool
  namespace: metallb-system
  resourceVersion: "657147"
  uid: 7d24174b-6681-454c-ac26-f8e4a22736c9
spec:
  addresses:
  - 172.22.0.230-172.22.0.230
  autoAssign: true
  protocol: layer2


############################################################
oc get network cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2022-01-11T02:57:17Z"
  generation: 4
  name: cluster
  resourceVersion: "59121"
  uid: a7a62170-f7e5-40b3-b153-3c18b0634a15
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy:
      allowedCIDRs:
      - 172.22.0.0/24
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
status:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  clusterNetworkMTU: 1450
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16

############################################################


oc get svc -n openshift-kube-apiserver metallb-api-service
NAME                  TYPE           CLUSTER-IP      EXTERNAL-IP                 PORT(S)          AGE
metallb-api-service   LoadBalancer   172.30.86.108   172.22.0.200,172.22.0.230   6443:32589/TCP   11m
[root@sealusa34 ansible_tests]# oc get svc -n openshift-kube-apiserver metallb-api-service -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/address-pool: api-addresspool
  creationTimestamp: "2022-01-11T19:29:40Z"
  name: metallb-api-service
  namespace: openshift-kube-apiserver
  resourceVersion: "657223"
  uid: 2999c4c8-21e8-4b78-9c03-f7ff91e640cb
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.86.108
  clusterIPs:
  - 172.30.86.108
  externalIPs:
  - 172.22.0.230
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    nodePort: 32589
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    app: openshift-kube-apiserver
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 172.22.0.200


Previously the service was created for IP 172.22.0.200 (and it worked).
Then I deleted the service and created a new one with 172.22.0.230... No success reaching the new endpoint.


curl 172.22.0.200:6443   # Original endpoint - works
Client sent an HTTP request to an HTTPS server.

curl 172.22.0.230:6443   # new endpoint - doesn't work
curl: (7) Failed to connect to 172.22.0.230 port 6443: No route to host

Comment 1 Alexander Chuzhoy 2022-01-11 19:58:07 UTC

Version:
OCP 4.9.12
metallb: metallb-operator.4.9.0-202112142229

Comment 4 Federico Paolinelli 2022-01-12 08:20:36 UTC

When you say you change the IP of the service, how do you do that?
I don't see the spec.loadBalancerIP field, so MetalLB will take a random IP from the address pool and use it.
Can you share the manifests?

Comment 5 Federico Paolinelli 2022-01-12 10:23:27 UTC

I see 

externalIPs:
  - 172.22.0.230

being used here as well. The right field to use is loadBalancerIP.
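
To make the distinction concrete, here is a minimal Python sketch that inspects a Service spec (as a plain dict, e.g. loaded from `oc get svc -o yaml`) and reports addresses that were placed under externalIPs instead of being requested through loadBalancerIP. The function names `requested_lb_ip` and `misplaced_ips` are hypothetical helpers for illustration, not part of MetalLB or the Kubernetes client.

```python
from typing import List, Optional


def requested_lb_ip(spec: dict) -> Optional[str]:
    """Return the address requested via spec.loadBalancerIP, or None if unset."""
    return spec.get("loadBalancerIP")


def misplaced_ips(spec: dict) -> List[str]:
    """Return addresses listed under spec.externalIPs that are not the
    address requested via spec.loadBalancerIP."""
    wanted = requested_lb_ip(spec)
    return [ip for ip in spec.get("externalIPs", []) if ip != wanted]


# Spec fragments taken from the service dumps in this report.
spec_from_description = {"type": "LoadBalancer", "externalIPs": ["172.22.0.230"]}
spec_from_comment_6 = {"type": "LoadBalancer", "loadBalancerIP": "172.22.0.250"}

print(requested_lb_ip(spec_from_description), misplaced_ips(spec_from_description))
print(requested_lb_ip(spec_from_comment_6), misplaced_ips(spec_from_comment_6))
```

With the spec from the description, no load balancer address is requested at all, which matches the observation in comment 4 that MetalLB would pick an address from the pool on its own.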

Comment 6 Alexander Chuzhoy 2022-01-12 17:48:21 UTC

I just tried creating the service and addresspool and then patching both with a new IP several times.
Observations:

1. Upon creation - everything works as expected. (172.22.0.200)
2. Attempted to change to a new IP both the svc and addresspool - didn't work. (172.22.0.210)
3. Attempted to change again to a new IP both the svc and addresspool - worked as expected. (172.22.0.220)
4. Attempted to change to a new IP both the svc and addresspool - didn't work. (172.22.0.230)
5. Attempted to change again to a new IP both the svc and addresspool - worked as expected. (172.22.0.240)
6. Attempted to change to a new IP both the svc and addresspool - didn't work. (172.22.0.250)

oc get svc -n openshift-kube-apiserver metallb-api-service -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/address-pool: api-addresspool
  creationTimestamp: "2022-01-12T17:33:40Z"
  name: metallb-api-service
  namespace: openshift-kube-apiserver
  resourceVersion: "813537"
  uid: 86fdd641-6f45-4d02-8ade-05f8a80b89a8
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.4.43
  clusterIPs:
  - 172.30.4.43
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerIP: 172.22.0.250
  ports:
  - name: http
    nodePort: 31115
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    app: openshift-kube-apiserver
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer: {}


oc get addresspools.metallb.io -o yaml
apiVersion: v1
items:
- apiVersion: metallb.io/v1alpha1
  kind: AddressPool
  metadata:
    creationTimestamp: "2022-01-12T17:33:39Z"
    generation: 8
    name: api-addresspool
    namespace: metallb-system
    resourceVersion: "813489"
    uid: 2f68235f-3e82-4f56-8128-ae401b9329bb
  spec:
    addresses:
    - 172.22.0.250-172.22.0.250
    autoAssign: true
    protocol: layer2
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


7. Attempted to change again to a new IP both the svc and addresspool - worked as expected. (172.22.0.150)

[root@sealusa34 ansible_tests]# oc get addresspools.metallb.io -o yaml
apiVersion: v1
items:
- apiVersion: metallb.io/v1alpha1
  kind: AddressPool
  metadata:
    creationTimestamp: "2022-01-12T17:33:39Z"
    generation: 9
    name: api-addresspool
    namespace: metallb-system
    resourceVersion: "816879"
    uid: 2f68235f-3e82-4f56-8128-ae401b9329bb
  spec:
    addresses:
    - 172.22.0.150-172.22.0.150
    autoAssign: true
    protocol: layer2
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


oc get svc -n openshift-kube-apiserver metallb-api-service -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    metallb.universe.tf/address-pool: api-addresspool
  creationTimestamp: "2022-01-12T17:33:40Z"
  name: metallb-api-service
  namespace: openshift-kube-apiserver
  resourceVersion: "816933"
  uid: 86fdd641-6f45-4d02-8ade-05f8a80b89a8
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.4.43
  clusterIPs:
  - 172.30.4.43
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerIP: 172.22.0.150
  ports:
  - name: http
    nodePort: 31115
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    app: openshift-kube-apiserver
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 172.22.0.150



So it works every second time.

Comment 8 Mohamed Mahmoud 2022-01-12 18:09:06 UTC

You realize that your address pool range is just one IP, so every time you change the svc IP you need to update the pool range.
If I take one of the failing runs, I see this error in the speaker log:
{"caller":"main.go:237","error":"assigned IP not allowed by config","ip":"172.22.0.220","msg":"IP allocated by controller not allowed by config","op":"setBalancer","service":"openshift-kube-apiserver/metallb-api-service","ts":"2022-01-12T17:40:26.3772146Z"}
and there should be an info event showing this error.

This error comes from the speaker main loop if it can't find a pool for the loadBalancerIP.
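
The check behind "assigned IP not allowed by config" amounts to asking whether an address falls inside any range of any pool. A minimal Python sketch of such a lookup follows; it is an illustration using the standard `ipaddress` module, not MetalLB's actual Go code, and `parse_range` and `pool_for_ip` are hypothetical names. It returns an empty string when no pool matches.

```python
import ipaddress


def parse_range(entry):
    """Parse 'first-last' or CIDR notation into a (first, last) address pair."""
    if "-" in entry:
        first, last = (ipaddress.ip_address(p.strip()) for p in entry.split("-", 1))
        return first, last
    net = ipaddress.ip_network(entry, strict=False)
    return net[0], net[-1]


def pool_for_ip(ip, pools):
    """Return the name of the first pool whose ranges contain ip, or ''."""
    addr = ipaddress.ip_address(ip)
    for name, entries in pools.items():
        for entry in entries:
            first, last = parse_range(entry)
            # Only compare addresses of the same IP version.
            if first.version == addr.version and first <= addr <= last:
                return name
    return ""


# Pool definition from the description of this bug.
pools = {"api-addresspool": ["172.22.0.230-172.22.0.230"]}
print(repr(pool_for_ip("172.22.0.230", pools)))
print(repr(pool_for_ip("172.22.0.220", pools)))
```

With a single-address pool, any requested IP other than that one address finds no pool, so the pool and the service have to be changed together.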

Comment 10 Alexander Chuzhoy 2022-01-12 18:43:39 UTC

Exactly the same method (automated - first we create/change the addresspool and then the service) was used 6 times with the following IPs: 
172.22.0.210
172.22.0.220
172.22.0.230
172.22.0.240
172.22.0.250
172.22.0.150

The log entry for "172.22.0.220"  - this one actually worked as expected (same for 172.22.0.240 and 172.22.0.150).

Comment 12 Mohamed Mahmoud 2022-01-12 22:41:08 UTC

Here is what happens:
We have an addresspool with just one IP, and we apply a svc with loadBalancerIP set to that IP, which ends up allocating this IP.
For example:
address-pool:
- 172.22.0.250-172.22.0.250
Then the svc will have this IP and everything is good.

update addresspool 1st then svc
===============================
Now we edit the same addresspool and change it to 172.22.0.251-172.22.0.251. The MetalLB controller will invoke SetConfig()->SetPools().
Since it's the same addresspool, the controller treats it as already allocated, and the pool now holds 172.22.0.251. We then try to find the pool that has 172.22.0.250 (the already allocated IP) in its CIDR definition, which now holds only 172.22.0.251, so we return "" (we can't find such a pool) and this error comes up:
{"caller":"level.go:63","configmap":"metallb-system/config","error":"new config not compatible with assigned IPs: service \"default/nginx\" cannot own [\"172.22.0.250\"] under new config","level":"error","msg":"applying new configuration failed","op":"setConfig","ts":"2022-01-12T21:30:37.873683013Z"}

The svc stays in the pending state, and after ~3sec k8s will delete the service:
{"caller":"level.go:63","event":"serviceDeleted","level":"info","msg":"service deleted","service":"default/nginx","ts":"2022-01-12T21:32:40.212428145Z"}
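
The log lines quoted in this comment are JSON objects, one per line, so the failing operations can be picked out mechanically. The following Python sketch assumes only what is visible in the quoted lines (the keys "error", "op" and "ts"); `failed_entries` is a hypothetical helper name.

```python
import json


def failed_entries(log_text):
    """Return (ts, op, error) for every JSON log line that carries an 'error' key."""
    rows = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object line
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if "error" in entry:
            rows.append((entry.get("ts"), entry.get("op"), entry["error"]))
    return rows


if __name__ == "__main__":
    import sys

    # Usage: pipe a saved controller or speaker log into this script.
    for ts, op, error in failed_entries(sys.stdin.read()):
        print(ts, op, error)
```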

update svc 1st then addresspool
===============================
However, if we update the svc first, it will invoke SetBalancer(), which triggers IPAM address allocation. That allocation fails and gets rid of the previously allocated IP. When we then update the addresspool to match the svc IP, the controller goes ahead and allocates it:

{"caller":"level.go:63","error":"[\"172.22.0.251\"] is not allowed in config","level":"error","msg":"IP allocation failed","op":"allocateIPs","service":"default/nginx","ts":"2022-01-12T22:29:03.854318793Z"}

{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-01-12T22:29:41.381103058Z"}
{"caller":"level.go:63","event":"ipAllocated","ip":["172.22.0.251"],"level":"info","msg":"IP address assigned by controller","service":"default/nginx","ts":"2022-01-12T22:29:41.386954354Z"}
{"caller":"level.go:63","event":"serviceUpdated","level":"info","msg":"updated service object","service":"default/nginx","ts":"2022-01-12T22:29:41.395010516Z"}

Comment 13 Flavio Percoco 2022-01-13 11:35:38 UTC

It looks like there's a fix for this upstream https://github.com/metallb/metallb/pull/1028

Is there a chance for this fix to be backported to 4.9?

Comment 17 Alexander Chuzhoy 2022-01-13 16:56:15 UTC

If we first reconfigure the service and then the addresspool, the issue is not observed.
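
The ordering effect described in comments 12 and 17 can be reproduced with a toy model. The Python sketch below is not MetalLB's code: `ToyAllocator`, `change_ip` and `run` are hypothetical names, and the model rests on two assumptions. The first, taken from comment 12, is that a failed allocation releases the previously allocated IP. The second, which is only a guess, is that a rejected pool update leaves the previous range in force.

```python
import ipaddress


class ToyAllocator:
    """Toy model of one single-range address pool; not MetalLB's real code."""

    def __init__(self):
        self.pool = None     # (first, last) addresses of the pool range
        self.wanted = {}     # service -> requested loadBalancerIP
        self.assigned = {}   # service -> IP currently allocated

    @staticmethod
    def _parse(rng):
        first, last = (ipaddress.ip_address(p.strip()) for p in rng.split("-"))
        return first, last

    @staticmethod
    def _allowed(ip, pool):
        return pool is not None and pool[0] <= ipaddress.ip_address(ip) <= pool[1]

    def _allocate(self, svc):
        ip = self.wanted[svc]
        if self._allowed(ip, self.pool):
            self.assigned[svc] = ip
            return True
        return False

    def set_pool(self, rng):
        """Config update: refused while an assigned IP is outside the new range."""
        new_pool = self._parse(rng)
        if any(not self._allowed(ip, new_pool) for ip in self.assigned.values()):
            return False          # assumption: the previous range stays in force
        self.pool = new_pool
        for svc in self.wanted:   # retry services that are still pending
            if svc not in self.assigned:
                self._allocate(svc)
        return True

    def set_balancer(self, svc, ip):
        """Service update: release a stale assignment, then try to allocate."""
        self.wanted[svc] = ip
        if self.assigned.get(svc) != ip:
            self.assigned.pop(svc, None)
        return self._allocate(svc)


def change_ip(alloc, svc, ip, pool_first):
    """Apply one IP change in the given order; True if svc ends up holding ip."""
    rng = f"{ip}-{ip}"
    if pool_first:
        alloc.set_pool(rng)
        alloc.set_balancer(svc, ip)
    else:
        alloc.set_balancer(svc, ip)
        alloc.set_pool(rng)
    return alloc.assigned.get(svc) == ip


def run(pool_first):
    """Create the service on .200, then change the IP four times."""
    alloc = ToyAllocator()
    change_ip(alloc, "api", "172.22.0.200", pool_first=True)  # initial creation
    ips = ["172.22.0.210", "172.22.0.220", "172.22.0.230", "172.22.0.240"]
    return [change_ip(alloc, "api", ip, pool_first) for ip in ips]


print("addresspool first:", run(pool_first=True))
print("service first:    ", run(pool_first=False))
```

Under these assumptions, changing the addresspool first succeeds only on every second attempt (the failed attempt clears the stale allocation, which lets the next pool update through), while changing the service first succeeds every time. That is consistent with the pattern reported in comment 6, but the model is only a sketch of the explanation given in this thread.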

Comment 22 errata-xmlrpc 2022-01-24 16:50:18 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.17 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0195