Bug 1995507

Summary: Kuryr controller error prevents OCP service of type LoadBalancer from obtaining externalIP/FIP.
Product: OpenShift Container Platform Reporter: Mohammad <mahmad>
Component: Networking Assignee: Michał Dulko <mdulko>
Networking sub component: kuryr QA Contact: Itzik Brown <itbrown>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: rbarrott, rcernin
Version: 3.11.0 Keywords: Triaged
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: x86_64   
OS: Linux   
Last Closed: 2021-09-15 19:21:16 UTC Type: Bug

Description Mohammad 2021-08-19 09:47:40 UTC
Description of problem: A Kuryr controller error prevents an OCP service of type LoadBalancer from obtaining an externalIP/FIP.


Version-Release number of selected component (if applicable): 3.11.452


How reproducible:

1- Create a service of type LoadBalancer:

[openshift@master-2 mowork]$ cat lb-momohttpd-40.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
  name: lb-momohttpd-40
  namespace: momo
spec:
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    nodePort: 30040
    port: 80
    protocol: TCP
    targetPort: 8080
  - name: https
    nodePort: 31040
    port: 443
    protocol: TCP
    targetPort: 8443
  selector:
    k8s-app: momohttpd-40
  sessionAffinity: None
  type: LoadBalancer

2- Monitor service creation:
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
lb-momohttpd-40   LoadBalancer   XX.XX.26.44   <pending>     80:30040/TCP,443:31040/TCP   0s
#####Thu Aug 19 19:01:20 AEST 2021####
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
lb-momohttpd-40   LoadBalancer   XX.XX.26.44   <pending>     80:30040/TCP,443:31040/TCP   1m
#####Thu Aug 19 19:02:21 AEST 2021####
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
lb-momohttpd-40   LoadBalancer   XX.XX.26.44   <pending>     80:30040/TCP,443:31040/TCP   2m
#####Thu Aug 19 19:03:22 AEST 2021####
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
lb-momohttpd-40   LoadBalancer   XX.XX.26.44   <pending>     80:30040/TCP,443:31040/TCP   3m
#####Thu Aug 19 19:04:23 AEST 2021####
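The monitoring in step 2 can also be scripted. The following is only a minimal sketch, assuming the `oc` client is on the PATH and logged in to the cluster; the helper names `oc_external_ip` and `wait_for_external_ip`, and the timeout and interval defaults, are hypothetical and not part of the original report.

```python
import subprocess
import time
from typing import Callable, Optional


def oc_external_ip(name: str, namespace: str) -> Optional[str]:
    """Return the service's first load-balancer ingress IP, or None while pending.

    Assumption: `oc` is installed and authenticated against the cluster.
    """
    result = subprocess.run(
        ["oc", "get", "svc", name, "-n", namespace,
         "-o", "jsonpath={.status.loadBalancer.ingress[0].ip}"],
        capture_output=True, text=True, check=False,
    )
    ip = result.stdout.strip()
    return ip or None


def wait_for_external_ip(
    get_ip: Callable[[], Optional[str]],
    timeout_s: float = 300.0,   # hypothetical default
    interval_s: float = 10.0,   # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll get_ip() until it returns an IP or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        ip = get_ip()
        if ip:
            return ip
        if clock() >= deadline:
            return None  # still <pending> after the timeout
        sleep(interval_s)
```

For the service in this report, the call would look like `wait_for_external_ip(lambda: oc_external_ip("lb-momohttpd-40", "momo"))`; a `None` result corresponds to the `<pending>` state shown above.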

3- Wait until this error appears in the controller log (attachment kuryr-controller-5c96d54d78-7vkbz_logs_while_FIP_pending.txt):

2021-08-19 09:02:30.047 1 ERROR kuryr_kubernetes.controller.drivers.lbaasv2 [-] Error when creating loadbalancer: {"debuginfo": null, "faultcode": "Server", "faultstring": "Provider 'amphora' reports error: IP address XX.XX.26.44 already allocated in subnet 0ada9494-f2fc-4a7a-b7df-8926209f4cc7\nNeutron server returns request_ids: ['req-98c7eb09-6cf2-4d91-9724-2037040c3ffe']"}
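The fault body in this log line is JSON, and the IP it reports as already allocated is the same address shown as the service's CLUSTER-IP in step 2. When scanning a long controller log for this failure, a small parser can pull out the conflicting IP and subnet. This is only a sketch; the function name `parse_lb_error` and the regular expression are assumptions based on the single message above, not on Kuryr or Octavia code.

```python
import json
import re
from typing import Optional, Tuple

# The log line from this report, kept raw so the "\n" stays a JSON escape.
LOG_LINE = r'''2021-08-19 09:02:30.047 1 ERROR kuryr_kubernetes.controller.drivers.lbaasv2 [-] Error when creating loadbalancer: {"debuginfo": null, "faultcode": "Server", "faultstring": "Provider 'amphora' reports error: IP address XX.XX.26.44 already allocated in subnet 0ada9494-f2fc-4a7a-b7df-8926209f4cc7\nNeutron server returns request_ids: ['req-98c7eb09-6cf2-4d91-9724-2037040c3ffe']"}'''

# Pattern inferred from the one message above (an assumption, not an API contract).
_ALLOCATED = re.compile(
    r"IP address (\S+) already allocated in subnet ([0-9a-f-]{36})"
)


def parse_lb_error(line: str) -> Optional[Tuple[str, str]]:
    """Return (ip, subnet_id) from an 'already allocated' fault line, else None."""
    start = line.find("{")
    end = line.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        fault = json.loads(line[start:end + 1])
    except json.JSONDecodeError:
        return None
    match = _ALLOCATED.search(fault.get("faultstring") or "")
    return (match.group(1), match.group(2)) if match else None
```

Because the JSON is located between the first `{` and the last `}`, trailing terminal color codes on the line do not interfere with parsing.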


4- Restart Kuryr controller

Once the controller restarts, the FIP gets assigned. The logs taken after the restart (kuryr-controller-5c96d54d78-f9rx8_logs_after_restarting_kuryr.txt) do not show any errors.

5- Check service creation:
#####Thu Aug 19 19:04:23 AEST 2021####
NAME              TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
lb-momohttpd-40   LoadBalancer   XX.XX.26.44   XXX.ZZZ.129.142   80:30040/TCP,443:31040/TCP   3m
#####Thu Aug 19 19:04:34 AEST 2021####

Actual results:
The FIP/ExternalIP is not assigned until the Kuryr controller is manually restarted.

Expected results:
The FIP/ExternalIP should be assigned without restarting the Kuryr controller.

Comment 4 Mohammad 2021-08-19 10:00:34 UTC
 $ openstack loadbalancer list |grep 26.44
| 3fe16450-8034-4ee6-bdab-da0c2b123c62 | momo/lb-momohttpd-40                                     | f7b96553d2fd4e26a05beb87c85c67c9 | XXX.XXX.26.44   | ACTIVE              | amphora  |

Comment 13 Michał Dulko 2021-08-27 14:28:00 UTC
A possible fix for the problem has been merged. You can easily try it on a live cluster by checking which section of the kuryr-config ConfigMap (kuryr namespace) contains lbaas_activation_timeout. If it doesn't help (or the affected cluster already has lbaas_activation_timeout in the [neutron_defaults] section), please reopen the bug and we'll continue to investigate.
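The section check can be done by eye, but here is a rough sketch for scripting it once the kuryr.conf text has been extracted from the ConfigMap. It assumes the file is plain INI that Python's configparser can read; the function name `sections_with_option` and the sample contents (including the `debug` option) are made up for illustration.

```python
import configparser
from typing import List


def sections_with_option(conf_text: str, option: str) -> List[str]:
    """List every section of an INI-style config that sets the given option."""
    parser = configparser.ConfigParser(
        interpolation=None,            # do not treat '%' specially
        strict=False,                  # tolerate repeated sections or options
        allow_no_value=True,
        default_section="__unused__",  # make [DEFAULT] behave like any section
    )
    parser.read_string(conf_text)
    return [s for s in parser.sections() if parser.has_option(s, option)]


# Hypothetical excerpt, NOT taken from a real cluster.
SAMPLE_CONF = """\
[DEFAULT]
debug = false

[neutron_defaults]
lbaas_activation_timeout = 1200
"""

print(sections_with_option(SAMPLE_CONF, "lbaas_activation_timeout"))
```

Renaming configparser's default section matters here: otherwise an option placed in [DEFAULT] would be reported as present in every section.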

Comment 16 Itzik Brown 2021-09-01 06:07:28 UTC
Version: v3.11.515

Verified that lbaas_activation_timeout = 1200 appears under the [neutron_defaults] section of kuryr.conf by running: oc get cm -n kuryr -o yaml | grep -A 20 neutron_defaults
The kuryr_tempest_plugin.tests.scenario.test_service.TestLoadBalancerServiceScenario tests passed.

Comment 19 errata-xmlrpc 2021-09-15 19:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.521 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3424