Bug 2014538 - Kuryr controller crash looping on self._get_vip_port(loadbalancer).id 'NoneType' object has no attribute 'id'
Summary: Kuryr controller crash looping on self._get_vip_port(loadbalancer).id 'NoneType' object has no attribute 'id'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Robin Cernin
QA Contact: rlobillo
URL:
Whiteboard:
Depends On:
Blocks: 2018129
 
Reported: 2021-10-15 14:03 UTC by Andreas Karis
Modified: 2022-03-16 07:35 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If the LB already exists, Octavia should throw a 500 exception, but that wasn't always the case. Consequence: A new duplicate LB with the same name would be created. Fix: Check whether the LB exists by searching for it first; if it does, update the existing LB. Result: No duplicate LBs are created.
Clone Of:
: 2018129 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:20:08 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift kuryr-kubernetes pull 583 0 None Merged Bug 2015154: Support for ports and primarySubnet fields in machine 2021-10-28 09:59:45 UTC
OpenStack gerrit 814248 0 None NEW Ensure KLB is updated 2021-10-16 07:43:37 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:20:25 UTC

Description Andreas Karis 2021-10-15 14:03:43 UTC
Description of problem:
Under specific circumstances, when OVN load balancers go into ERROR state during creation, Kuryr crashes with the following stack trace, in a crash loop. Along the way it creates an unlimited number of load balancers in OVN: every time the controller comes back up after dying, it adds another load balancer.

The failed load balancer creation stems from a different problem for which we do not have the root cause yet; that underlying problem was fixed after a database repair.

However, due to the failed load balancer creation, the loadbalancer object comes back as null, and our code is not checking for that. Instead, it blindly assumes that loadbalancer is an object. The KuryrLoadBalancerHandlers then crash one after another, leading to a full stop of the entire Kuryr process.

The Kuryr controller is then killed by Kubernetes and restarted, and the same dance starts again, filling up the Neutron database with load balancers in ERROR state. That exacerbates any preexisting problem by adding a new issue (namely the generation of an unlimited number of invalid load balancers in the database).

I checked upstream, and "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py", line 507 has no try/except logic, nor does it check for type safety, even in upstream master. Regardless of the initial issue, this needs to be fixed; otherwise KuryrLoadBalancerHandler threads simply die at that point because a load balancer went into ERROR state at a specific moment during creation.

~~~
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry [-] Report handler unhealthy KuryrLoadBalancerHandler: AttributeError: 'NoneType' object has no attribute 'id'
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry Traceback (most recent call last):
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 81, in __call__
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     self._handler(event, *args, **kwargs)
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 84, in __call__
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     self.on_present(obj)
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/loadbalancer.py", line 79, in on_present
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     if self._sync_lbaas_members(loadbalancer_crd):
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/loadbalancer.py", line 178, in _sync_lbaas_members
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     if self._sync_lbaas_pools(loadbalancer_crd):
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/loadbalancer.py", line 439, in _sync_lbaas_pools
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     if self._sync_lbaas_listeners(loadbalancer_crd):
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/loadbalancer.py", line 534, in _sync_lbaas_listeners
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     if self._sync_lbaas_loadbalancer(loadbalancer_crd):
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/loadbalancer.py", line 687, in _sync_lbaas_loadbalancer
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     provider=loadbalancer_crd['spec'].get('provider'))
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py", line 147, in ensure_loadbalancer
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     response = self._ensure_loadbalancer(request)
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py", line 704, in _ensure_loadbalancer
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     result = self._create_loadbalancer(loadbalancer)
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py", line 507, in _create_loadbalancer
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry     loadbalancer['port_id'] = self._get_vip_port(loadbalancer).id
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry AttributeError: 'NoneType' object has no attribute 'id'
2021-10-14 16:38:05.094 1 ERROR kuryr_kubernetes.handlers.retry
2021-10-14 16:38:05.096 1 ERROR kuryr_kubernetes.handlers.logging [-] Failed to handle event {'type': 'MODIFIED', 'object': {'apiVersion': 'openstack.org/v1', 'kind': 'KuryrLoadBalancer', 'metadata': {'creationTimestamp': '2021-10-12T18:31:50Z', 'finalizers': ['kuryr.openstack.org/kuryrloadbalancer-finalizers'], 'generation': 5680, 'managedFields': [{'apiVersion': 'openstack.org/v1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {'.': {}, 'v:"kuryr.openstack.org/kuryrloadbalancer-finalizers"': {}}}, 'f:spec': {'.': {}, 'f:endpointSlices': {}, 'f:ip': {}, 'f:ports': {}, 'f:project_id': {}, 'f:provider': {}, 'f:security_groups_ids': {}, 'f:subnet_id': {}, 'f:type': {}}, 'f:status': {}}, 'manager': 'python-requests', 'operation': 'Update', 'time': '2021-10-14T16:37:50Z'}],...
~~~
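
For illustration only, a minimal sketch (not the actual kuryr-kubernetes code; the helper name and error handling are made up) of the kind of None check that line 507 of lbaasv2.py is missing and that the traceback above runs into:

~~~
# Hedged sketch, not the real driver code: _get_vip_port() can return None
# when the VIP port is gone or the LB ended up in ERROR state, so the .id
# access must be guarded instead of assumed to succeed.

def assign_vip_port(driver, loadbalancer):
    """Hypothetical helper showing the missing guard."""
    vip_port = driver._get_vip_port(loadbalancer)
    if vip_port is None:
        # Fail loudly (or fall back to looking up the existing LB) rather
        # than letting the KuryrLoadBalancerHandler thread die on
        # AttributeError and crash-loop the whole controller.
        raise RuntimeError(
            "No VIP port found for load balancer %s"
            % loadbalancer.get('name'))
    loadbalancer['port_id'] = vip_port.id
    return loadbalancer
~~~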

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 6 Robin Cernin 2021-10-16 00:27:22 UTC
I have reproduced your issue by removing the neutron port:

~~~
$ openstack port delete a5ac648d-435c-41e2-b2ce-7aaa5a721aa1
[stack@standalone ~]$ openstack loadbalancer list | grep my-service
| f3b68a75-2352-48ef-b77a-1dd1e6c618c8 | openshift-kuryr/my-service                                                   | b63936b22ced4c9b862bf02d4a9f6dc1 | 172.30.98.166  | ACTIVE              | ovn      |
| 256ce386-f207-49ef-b7c0-e21fb778e8ea | openshift-kuryr/my-service                                                   | b63936b22ced4c9b862bf02d4a9f6dc1 | 172.30.98.166  | ACTIVE              | ovn      |
| 66e57fa2-e238-4d4d-9f4b-4136dd648ac3 | openshift-kuryr/my-service                                                   | b63936b22ced4c9b862bf02d4a9f6dc1 | 172.30.98.166  | ACTIVE              | ovn      |
~~~

This resulted in the exact same traceback as we have seen in your environment.

~~~
2021-10-16 00:20:08.658 20014 ERROR kuryr_kubernetes.handlers.logging     loadbalancer['port_id'] = self._get_vip_port(loadbalancer).id
2021-10-16 00:20:08.658 20014 ERROR kuryr_kubernetes.handlers.logging AttributeError: 'NoneType' object has no attribute 'id'
~~~

We actually fixed this issue in 3.11 some time ago: https://github.com/openshift/kuryr-kubernetes/pull/548/files#diff-b59c31f9085285396c2952d6f8b61d995485be667f87b37cb539a56d0da2d443R1063

There we basically switched to calling _find_loadbalancer prior to the creation. We will do the same for OCP 4; this will resolve the issue.

Comment 7 Robin Cernin 2021-10-16 07:27:12 UTC
OK, I think this was intended to prevent exactly this from happening: https://github.com/openshift/kuryr-kubernetes/commit/153a16e80f3de9a364b24a4ebf0ae95f49078d23. When everything works, it does prevent the problem.

If you delete the port before _get_vip_port is entered, it throws a 500 exception, which skips the raise, moves on to the _find_loadbalancer function and updates the KLB, so it doesn't create a duplicate LB. At the same time, though, this increments the retry failure counter until it kills the KLB handler, which restarts the controller. I was able to reproduce the crash only when the LB was created, the port was deleted before _get_vip_port, and the Kuryr controller restarted before _find_loadbalancer was run.

As _find_loadbalancer updates the KLB with the ID, this 100% solves the problem, so I believe moving _find_loadbalancer is the solution, and I will go ahead with this patch.
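
For illustration, a minimal sketch of that ordering (a hypothetical wrapper, not the literal upstream patch; only _find_loadbalancer and _create_loadbalancer are method names taken from the driver):

~~~
# Hedged sketch of the "find before create" ordering described above, not
# the actual patch: look the LB up first and reuse it, so a restarting
# controller never stacks up duplicate load balancers in Octavia/OVN.

def ensure_loadbalancer(driver, loadbalancer):
    # An LB with this name/VIP may already exist; if so, reuse it and let
    # the caller update the KLB CRD from it instead of creating a duplicate.
    existing = driver._find_loadbalancer(loadbalancer)
    if existing is not None:
        return existing
    # Only create a brand-new LB when nothing was found.
    return driver._create_loadbalancer(loadbalancer)
~~~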

Comment 8 Robin Cernin 2021-10-16 07:43:12 UTC
This patch fixes this behaviour: https://review.opendev.org/c/openstack/kuryr-kubernetes/+/814248. Once it is merged, we will create downstream backports when possible.

Comment 10 Michał Dulko 2021-10-27 11:37:21 UTC
Alright, I'm okay with allowing a backport of this down to 4.6.

Comment 11 Michał Dulko 2021-10-28 09:59:14 UTC
This got fixed alongside https://github.com/openshift/kuryr-kubernetes/pull/583, so moving to ON_QA.

Comment 17 rlobillo 2021-11-12 13:31:04 UTC
Verified in 4.10.0-0.nightly-2021-11-09-181140 on top of OSP16.1 (RHOS-16.1-RHEL-8-20210903.n.0) with OVN-Octavia.

The LB creation logic has changed from: 

1. Create LB
2. Populate result with that LB
3. If there was a failure during "Create LB"
4. Find LB
5. Update KLB CRD.

to:

1. Check if LB exists, and if it does, re-use the same LB.
2. If it doesn't exist, then create new LB
3. Populate result with that LB
4. Update KLB CRD

This new logic is working as expected, as confirmed by a successful installation and by running the kuryr-tempest-plugin tests (logs attached).

It was agreed with the BZ assignee that this is enough to validate the fix, as systematic reproduction would require code modification.

Comment 20 ShiftStack Bugwatcher 2022-03-05 07:07:15 UTC
Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Comment 22 errata-xmlrpc 2022-03-10 16:20:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

