Bug 1884655

Summary: KeyError on self._existing_vifs[port_id]
Product: OpenShift Container Platform Reporter: rlobillo
Component: NetworkingAssignee: Luis Tomas Bolivar <ltomasbo>
Networking sub component: kuryr QA Contact: GenadiC <gcheresh>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: bbennett, mdemaced
Version: 4.6   
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:22:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1886696    
Attachments:
Description Flags
kuryr-controller logs
none
NP+Conformance results with the fix none

Description rlobillo 2020-10-02 15:13:23 UTC
Created attachment 1718465 [details]
kuryr-controller logs

Description of problem:

While running conformance tests, Kuryr-controller got restarted with below exception:

2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry [-] Report handler unhealthy KuryrPortHandler: KeyError: 'd7cc9b11-13e8-481b-b494-6be3eb0ed57a'
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry Traceback (most recent call last):
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 81, in __call__
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry     self._handler(event, *args, **kwargs)
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 90, in __call__
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry     self.on_present(obj)
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 66, in on_present
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry     if not self.get_vifs(kuryrport_crd):
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 231, in get_vifs
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry     pod, project_id, subnets, security_groups)
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 1181, in request_vif
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry     pod, project_id, subnets, security_groups)
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 210, in request_vif
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry     tuple(sorted(security_groups)))
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 867, in _get_port_from_pool
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry     return self._existing_vifs[port_id]
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry KeyError: 'd7cc9b11-13e8-481b-b494-6be3eb0ed57a'
2020-10-02 12:16:16.302 1 ERROR kuryr_kubernetes.handlers.retry [00m

Version-Release number of selected component (if applicable):

$ cat core_puddle_version 
RHOS-16.1-RHEL-8-20200917.n.3
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-02-001427   True        False         6h31m   Cluster version is 4.6.0-0.nightly-2020-10-02-001427


How reproducible:


Steps to Reproduce:
1. Install latest OCP on OSP16.1+OVN-Octavia
2. Run Conformance tests
3.

Actual results: kuryr-controller is restarted.


Expected results: No restarts.


Additional info: kuryr-controller logs attached

Comment 3 rlobillo 2020-10-23 10:14:23 UTC
Verified on 4.7.0-0.nightly-2020-10-22-044506 over OSP16.1 with OVN-octavia (RHOS-16.1-RHEL-8-20201007.n.0).

Conformance tests run without kuryr-controller restarts and with expected results (Attached):

$ cat conformance_results/kuryr_pods.txt 
# Kuryr pods before running conformance tests - ANSIBLE MANAGED BLOCK
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-chrzk                     1/1     Running   0          143m
kuryr-cni-drhbk                     1/1     Running   0          143m
kuryr-cni-pclf8                     1/1     Running   0          140m
kuryr-cni-w5tf9                     1/1     Running   0          142m
kuryr-cni-zfgzt                     1/1     Running   0          140m
kuryr-cni-zl6wv                     1/1     Running   0          141m
kuryr-controller-74584966c5-689jm   1/1     Running   1          118m
# END ANSIBLE MANAGED BLOCK
# Kuryr pods after running conformance tests - ANSIBLE MANAGED BLOCK
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-chrzk                     1/1     Running   0          3h18m
kuryr-cni-drhbk                     1/1     Running   2          3h19m
kuryr-cni-pclf8                     1/1     Running   0          3h15m
kuryr-cni-w5tf9                     1/1     Running   2          3h17m
kuryr-cni-zfgzt                     1/1     Running   2          3h16m
kuryr-cni-zl6wv                     1/1     Running   0          3h17m
kuryr-controller-74584966c5-689jm   1/1     Running   1          174m
# END ANSIBLE MANAGED BLOCK


Furthermore, downscaling workers is deleting properly the subports for the already deleted worker:

(shiftstack) [stack@undercloud-0 ~]$ openstack port list --device-owner=""  --tags="openshiftClusterID=ostest-rmrft" | grep -v ostest
+--------------------------------------+---------------------------+-------------------+-------------------------------------------------------------------------------+--------+
| ID                                   | Name                      | MAC Address       | Fixed IP Addresses                                                            | Status |
+--------------------------------------+---------------------------+-------------------+-------------------------------------------------------------------------------+--------+
| 035c6cef-5aa0-4104-a049-3c5b0ba73444 |                           | fa:16:3e:ab:15:43 | ip_address='10.128.106.136', subnet_id='c7dcf66b-31e5-4d81-b10b-9073c56710b7' | DOWN   |
| 07c8b907-5e1f-4d9c-b920-b0154a01ed6f |                           | fa:16:3e:9b:cb:cb | ip_address='10.128.93.30', subnet_id='0c38384b-15e0-4505-938a-56b5c8d91ce5'   | DOWN   |
| 0c892901-b58b-4bcf-9232-1e403edff30e |                           | fa:16:3e:47:6f:8b | ip_address='10.128.92.79', subnet_id='0c38384b-15e0-4505-938a-56b5c8d91ce5'   | DOWN   |
| 150d2f75-0125-4a6b-86de-81887d0c6d18 |                           | fa:16:3e:61:dc:a3 | ip_address='10.128.10.104', subnet_id='12a657ee-189a-4e56-822e-d507836e3830'  | DOWN   |
| 1802138a-8d14-4882-98c4-f204a5ba0a92 |                           | fa:16:3e:78:f2:17 | ip_address='10.128.35.224', subnet_id='0e71d23e-689f-4289-b103-39c357dfa989'  | DOWN   |
| 1f3850fd-73e9-4cf6-b065-ae6ec7df86fe |                           | fa:16:3e:9d:b8:48 | ip_address='10.128.8.188', subnet_id='d886942e-a82d-4629-974f-ececa12a6aa0'   | DOWN   |
| 378f221f-88f9-4b84-943a-cf0344e32029 |                           | fa:16:3e:29:89:fa | ip_address='10.128.9.217', subnet_id='d886942e-a82d-4629-974f-ececa12a6aa0'   | DOWN   |
| 39bb6b06-997a-48f0-919b-988ba25ca89e |                           | fa:16:3e:53:40:8a | ip_address='10.128.8.148', subnet_id='d886942e-a82d-4629-974f-ececa12a6aa0'   | DOWN   |
| 3bad459b-e9ff-41ce-b6b4-60a499bbe33a |                           | fa:16:3e:8f:13:bf | ip_address='10.128.106.21', subnet_id='c7dcf66b-31e5-4d81-b10b-9073c56710b7'  | DOWN   |
| 4b601c09-106b-4589-8045-8d90c8301274 |                           | fa:16:3e:0b:aa:3b | ip_address='10.128.10.68', subnet_id='12a657ee-189a-4e56-822e-d507836e3830'   | DOWN   |
| 5e749451-ae31-438d-963c-8095e45c843d |                           | fa:16:3e:5d:02:fa | ip_address='10.128.34.36', subnet_id='0e71d23e-689f-4289-b103-39c357dfa989'   | DOWN   |
| 6653e09b-2cf5-457a-9144-9cb6ed091aca |                           | fa:16:3e:53:4c:a2 | ip_address='10.128.66.105', subnet_id='fa481858-2961-430a-973a-df53c36c1517'  | DOWN   |
| 7b6f020d-8284-4809-aa37-f46553b88c3c |                           | fa:16:3e:f7:40:15 | ip_address='10.128.35.160', subnet_id='0e71d23e-689f-4289-b103-39c357dfa989'  | DOWN   |
| 85493a5f-4a1d-4023-920d-6ce617094a9c |                           | fa:16:3e:3a:2e:17 | ip_address='10.128.66.12', subnet_id='fa481858-2961-430a-973a-df53c36c1517'   | DOWN   |
| 860aee23-9397-472d-ac12-433227532e58 |                           | fa:16:3e:f5:c2:d3 | ip_address='10.128.10.186', subnet_id='12a657ee-189a-4e56-822e-d507836e3830'  | DOWN   |
| 91a36e95-2233-4bca-bca0-92676a350be4 |                           | fa:16:3e:21:57:78 | ip_address='10.128.8.6', subnet_id='d886942e-a82d-4629-974f-ececa12a6aa0'     | DOWN   |
| 976cc808-e52e-4b98-a2c6-94743a51823f |                           | fa:16:3e:d8:55:51 | ip_address='10.128.107.229', subnet_id='c7dcf66b-31e5-4d81-b10b-9073c56710b7' | DOWN   |
| 98e80a49-83df-409c-b384-2b8e807528b2 |                           | fa:16:3e:02:ff:25 | ip_address='10.128.92.121', subnet_id='0c38384b-15e0-4505-938a-56b5c8d91ce5'  | DOWN   |
| 9ea6f35c-4efa-44fe-9dec-ad6882401273 |                           | fa:16:3e:6a:62:09 | ip_address='10.128.66.115', subnet_id='fa481858-2961-430a-973a-df53c36c1517'  | DOWN   |
| b144491c-1717-4fd8-b0db-6711a4753519 |                           | fa:16:3e:02:50:29 | ip_address='10.128.31.186', subnet_id='f78f8a86-295b-470a-ac80-9388a2c3bfdb'  | DOWN   |
| b2c8ba93-8550-4b5a-973c-5275af1f8eb3 |                           | fa:16:3e:96:e7:ff | ip_address='10.128.66.107', subnet_id='fa481858-2961-430a-973a-df53c36c1517'  | DOWN   |
| b2e4ccbf-bc71-4d19-9025-293019683175 |                           | fa:16:3e:4e:f6:c9 | ip_address='10.128.92.217', subnet_id='0c38384b-15e0-4505-938a-56b5c8d91ce5'  | DOWN   |
| b3b4aafb-3129-46c9-8b66-bddd3dc8e3e3 |                           | fa:16:3e:77:6f:ce | ip_address='10.128.67.139', subnet_id='fa481858-2961-430a-973a-df53c36c1517'  | DOWN   |
| bb8b5b25-334d-4947-bde4-da1744797251 |                           | fa:16:3e:52:a5:ff | ip_address='10.128.8.240', subnet_id='d886942e-a82d-4629-974f-ececa12a6aa0'   | DOWN   |
| bcc7861d-4e05-41a2-b224-36e03b8d4239 |                           | fa:16:3e:8c:eb:fc | ip_address='10.128.66.248', subnet_id='fa481858-2961-430a-973a-df53c36c1517'  | DOWN   |
| cb873486-ff6b-4f3a-bf2b-60b50df16fb8 |                           | fa:16:3e:f1:86:56 | ip_address='10.128.92.195', subnet_id='0c38384b-15e0-4505-938a-56b5c8d91ce5'  | DOWN   |
| d3d75cd4-e816-45d3-9bbe-14ac0f80cf5b |                           | fa:16:3e:93:32:0c | ip_address='10.128.31.32', subnet_id='f78f8a86-295b-470a-ac80-9388a2c3bfdb'   | DOWN   |
| dbcdaeb4-b6fc-473d-b0f6-7a6a21e3236f |                           | fa:16:3e:cc:31:eb | ip_address='10.128.31.152', subnet_id='f78f8a86-295b-470a-ac80-9388a2c3bfdb'  | DOWN   |
| e697f816-a90a-44fa-89a7-e5db90abb77e |                           | fa:16:3e:1e:25:27 | ip_address='10.128.8.11', subnet_id='d886942e-a82d-4629-974f-ececa12a6aa0'    | DOWN   |
| fa3f335a-fee4-4911-bed9-427c44567a2a |                           | fa:16:3e:0f:09:d6 | ip_address='10.128.93.67', subnet_id='0c38384b-15e0-4505-938a-56b5c8d91ce5'   | DOWN   |
+--------------------------------------+---------------------------+-------------------+-------------------------------------------------------------------------------+--------+

>> Wait 20 minutes.

(shiftstack) [stack@undercloud-0 ~]$ openstack port list --device-owner=""  --tags="openshiftClusterID=ostest-rmrft" | grep -v ostest
+--------------------------------------+---------------------------+-------------------+---------------------------------------------------------------------------+--------+
| ID                                   | Name                      | MAC Address       | Fixed IP Addresses                                                        | Status |
+--------------------------------------+---------------------------+-------------------+---------------------------------------------------------------------------+--------+
+--------------------------------------+---------------------------+-------------------+---------------------------------------------------------------------------+--------+
(shiftstack) [stack@undercloud-0 ~]$ 

>> Keeping the stability of kuryr pods:

(shiftstack) [stack@undercloud-0 ~]$ oc get pods -n openshift-kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-chrzk                     1/1     Running   0          19h
kuryr-cni-pclf8                     1/1     Running   0          19h
kuryr-cni-w5tf9                     1/1     Running   2          19h
kuryr-cni-zfgzt                     1/1     Running   2          19h
kuryr-cni-zl6wv                     1/1     Running   0          19h
kuryr-controller-74584966c5-689jm   1/1     Running   1          19h

Comment 4 rlobillo 2020-10-23 10:15:13 UTC
Created attachment 1723754 [details]
NP+Conformance results with the fix

Comment 7 errata-xmlrpc 2021-02-24 15:22:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633