Bug 2014538
Summary: | Kuryr controller crash looping on self._get_vip_port(loadbalancer).id 'NoneType' object has no attribute 'id' | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Andreas Karis <akaris> | |
Component: | Networking | Assignee: | Robin Cernin <rcernin> | |
Networking sub component: | kuryr | QA Contact: | rlobillo | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | dhill, gkadam, ldenny, mdemaced, mdulko, rcernin, rlobillo | |
Version: | 4.6 | |||
Target Milestone: | --- | |||
Target Release: | 4.10.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause:
If the LB already exists, Octavia should throw a 500 exception. That wasn't always the case.
Consequence:
A duplicate LB with the same name would be created.
Fix:
Check whether the LB exists by searching for it first. If it does, update the existing LB instead of creating a new one.
Result:
No duplicate LB should be created.
|
Story Points: | --- | |
Clone Of: | ||||
: | 2018129 (view as bug list) | Environment: | ||
Last Closed: | 2022-03-10 16:20:08 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2018129 |
Description
Andreas Karis
2021-10-15 14:03:43 UTC
I have reproduced your issue by removing the neutron port:

~~~
$ openstack port delete a5ac648d-435c-41e2-b2ce-7aaa5a721aa1
[stack@standalone ~]$ openstack loadbalancer list | grep my-service
| f3b68a75-2352-48ef-b77a-1dd1e6c618c8 | openshift-kuryr/my-service | b63936b22ced4c9b862bf02d4a9f6dc1 | 172.30.98.166 | ACTIVE | ovn |
| 256ce386-f207-49ef-b7c0-e21fb778e8ea | openshift-kuryr/my-service | b63936b22ced4c9b862bf02d4a9f6dc1 | 172.30.98.166 | ACTIVE | ovn |
| 66e57fa2-e238-4d4d-9f4b-4136dd648ac3 | openshift-kuryr/my-service | b63936b22ced4c9b862bf02d4a9f6dc1 | 172.30.98.166 | ACTIVE | ovn |
~~~

This resulted in the exact same traceback as we have seen in your environment.

~~~
2021-10-16 00:20:08.658 20014 ERROR kuryr_kubernetes.handlers.logging loadbalancer['port_id'] = self._get_vip_port(loadbalancer).id
2021-10-16 00:20:08.658 20014 ERROR kuryr_kubernetes.handlers.logging AttributeError: 'NoneType' object has no attribute 'id'
~~~

We actually fixed this issue in 3.11 some time ago:
https://github.com/openshift/kuryr-kubernetes/pull/548/files#diff-b59c31f9085285396c2952d6f8b61d995485be667f87b37cb539a56d0da2d443R1063

There we basically moved the _find_loadbalancer call to before the creation. I will do the same for OCP4, which will resolve this issue.

OK, I think this commit was intended to prevent this from happening:
https://github.com/openshift/kuryr-kubernetes/commit/153a16e80f3de9a364b24a4ebf0ae95f49078d23

If everything works, it does prevent the problem. If you delete the port before entering _get_vip_port, it throws a 500 exception, which skips the raise, moves on to the find_loadbalancer function, and updates the KLB so that no duplicate LB is created. At the same time, however, this increments the retry failure handler until it kills the KLB handler, which restarts the controller.

I was able to reproduce it only if the LB was created, the port was deleted before _get_vip_port, and the Kuryr controller restarted before _find_loadbalancer was run. As _find_loadbalancer updates the KLB with the ID, this 100% solves the problem, so I believe moving _find_loadbalancer is the solution and I will go ahead with this patch.

This patch fixes this behaviour:
https://review.opendev.org/c/openstack/kuryr-kubernetes/+/814248

Once it is merged, I will create downstream backports when possible.

Alright, I'm okay with allowing a backport of this down to 4.6.

This got fixed alongside https://github.com/openshift/kuryr-kubernetes/pull/583, so moving to ON_QA.

Verified in 4.10.0-0.nightly-2021-11-09-181140 on top of OSP16.1 (RHOS-16.1-RHEL-8-20210903.n.0) with OVN-Octavia.

The LB creation logic has changed from:
1. Create LB
2. Populate result with that LB
3. If there was a failure during "Create LB"
4. Find LB
5. Update KLB CRD

to:
1. Check if LB exists, and if it does, re-use the same LB
2. If it doesn't exist, then create new LB
3. Populate result with that LB
4. Update KLB CRD

This new logic is working as expected, as confirmed by installation and by running the kuryr-tempest-plugin tests (logs attached). It is agreed with the BZ assignee that this is enough to validate the fix, as a systematic reproduction requires code modification.

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:0056
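
The change in ordering described above (look the LB up first, create only if nothing is found) can be sketched in Python roughly as follows. This is a minimal illustration, not the kuryr-kubernetes code: `FakeLBApi`, `ensure_lb_create_first`, and `ensure_lb_find_first` are hypothetical names, and the in-memory API simply assumes that creating an LB with an existing name does not raise, which is the condition the Doc Text describes.

```python
from itertools import count
from typing import Dict, List, Optional

LB = Dict[str, str]


class FakeLBApi:
    """Hypothetical in-memory stand-in for a load balancer API (not the Octavia client)."""

    def __init__(self) -> None:
        self.lbs: List[LB] = []
        self._next_id = count()

    def find(self, name: str) -> Optional[LB]:
        """Return the first LB with this name, or None if there is none."""
        return next((lb for lb in self.lbs if lb["name"] == name), None)

    def create(self, name: str) -> LB:
        """Create an LB. Assumption: a duplicate name does NOT raise an error."""
        lb = {"id": f"lb-{next(self._next_id)}", "name": name}
        self.lbs.append(lb)
        return lb


def ensure_lb_create_first(api: FakeLBApi, name: str) -> Optional[LB]:
    """Old order: create, and only search if the create call failed."""
    try:
        return api.create(name)
    except Exception:
        return api.find(name)


def ensure_lb_find_first(api: FakeLBApi, name: str) -> LB:
    """New order: reuse an existing LB, create only when none is found."""
    existing = api.find(name)
    return existing if existing is not None else api.create(name)


# Simulate three handler retries for the same service.
old_api, new_api = FakeLBApi(), FakeLBApi()
for _ in range(3):
    ensure_lb_create_first(old_api, "openshift-kuryr/my-service")
    ensure_lb_find_first(new_api, "openshift-kuryr/my-service")

print(len(old_api.lbs), len(new_api.lbs))  # prints: 3 1
```

With the create-first order, every retry adds another LB with the same name whenever the create call does not fail. The find-first order keeps returning the LB that already exists, so repeated runs are idempotent.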