Bug 1873449 - [Kuryr] Cannot terminate namespaces due to error removing ports
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Luis Tomas Bolivar
QA Contact: GenadiC
URL:
Whiteboard:
Duplicates: 1876434
Depends On:
Blocks: 1874840
Reported: 2020-08-28 11:42 UTC by Jon Uriarte
Modified: 2020-10-27 16:35 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:35:38 UTC
Target Upstream Version:
Embargoed:


Attachments
Kuryr controller logs (1.08 MB, text/plain)
2020-08-28 11:42 UTC, Jon Uriarte
conformance test result (790.87 KB, application/gzip)
2020-09-07 15:27 UTC, rlobillo
NP test results (189.08 KB, application/gzip)
2020-09-07 15:28 UTC, rlobillo


Links
System ID Private Priority Status Summary Last Updated
Github openshift kuryr-kubernetes pull 333 0 None closed Bug 1873449: Ensure proper cleanup of subports 2020-09-28 08:51:44 UTC
Github openshift kuryr-kubernetes pull 339 0 None closed Bug 1873449: Delete ports without device_owner on ns deletion 2020-09-28 08:51:44 UTC
OpenStack gerrit 748670 0 None MERGED Ensure proper cleanup of subports 2020-09-28 08:51:44 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:35:55 UTC

Description Jon Uriarte 2020-08-28 11:42:18 UTC
Created attachment 1712944
Kuryr controller logs

Description of problem:

The kuryr-controller pod stays in a crash loop after running the tempest and NP (network policy) tests on a 4.5 UPI deployment on OSP 13.
The namespaces created during those tests cannot be deleted because removing their ports fails:

ERROR kuryr_kubernetes.controller.drivers.vif_pool [-] Error removing the port 01a994ca-5286-45c1-b6f6-ba6cb663837a: openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://10.46.22.24:13696/v2.0/ports/01a994ca-5286-45c1-b6f6-ba6cb663837a, Port 01a994ca-5286-45c1-b6f6-ba6cb663837a is currently a subport for trunk 1fb6802c-69d5-469a-95ee-9b157b0d608d.

The ports are in ACTIVE status. It happens with the trunks on all the worker nodes.
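
To see which trunks are holding the ports, the error lines in the attached controller log can be grouped by trunk ID. The following is a minimal sketch, not part of Kuryr: the regular expression is an assumption based only on the 409 message quoted above, and `stuck_subports_by_trunk` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pattern derived from the Neutron 409 message quoted in this report;
# other Neutron versions may word the message differently (assumption).
SUBPORT_CONFLICT = re.compile(
    r"Port (?P<port>[0-9a-f-]{36}) is currently a subport "
    r"for trunk (?P<trunk>[0-9a-f-]{36})"
)


def stuck_subports_by_trunk(lines):
    """Group port IDs that failed deletion by the trunk that still owns them."""
    by_trunk = defaultdict(set)
    for line in lines:
        match = SUBPORT_CONFLICT.search(line)
        if match:
            by_trunk[match.group("trunk")].add(match.group("port"))
    # Sorted lists make the result stable and easy to compare.
    return {trunk: sorted(ports) for trunk, ports in by_trunk.items()}
```

Passing an open log file (for example `stuck_subports_by_trunk(open(path))`, where `path` is wherever the attachment was saved) yields one entry per trunk, which can then be compared against the `openstack network trunk list` output.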

$ oc -n openshift-kuryr get pods
NAME                                   READY   STATUS    RESTARTS   AGE
kuryr-cni-2hn5w                        1/1     Running   0          18h
kuryr-cni-2zm85                        1/1     Running   0          18h
kuryr-cni-5jgtv                        1/1     Running   1          18h
kuryr-cni-9dr4x                        1/1     Running   0          18h
kuryr-cni-g9hq9                        1/1     Running   0          18h
kuryr-cni-k4rvv                        1/1     Running   0          18h
kuryr-controller-857bb8dc46-ps4xs      1/1     Running   106        18h
kuryr-dns-admission-controller-48hpl   1/1     Running   0          18h
kuryr-dns-admission-controller-9hdrb   1/1     Running   0          18h
kuryr-dns-admission-controller-b7qkb   1/1     Running   0          18h

$ oc get ns
NAME                                               STATUS        AGE
default                                            Active        20h
kube-node-lease                                    Active        20h
kube-public                                        Active        20h
kube-system                                        Active        20h
kuryr-namespace-2107688107                         Terminating   17h
network-policy-1136                                Terminating   16h
network-policy-1217                                Terminating   15h
network-policy-1649                                Terminating   15h
network-policy-1678                                Terminating   15h
network-policy-2176                                Terminating   16h
network-policy-2578                                Terminating   16h
network-policy-3199                                Terminating   16h
network-policy-3312                                Terminating   15h
network-policy-3340                                Terminating   16h
network-policy-5163                                Terminating   15h
network-policy-7220                                Terminating   16h
network-policy-7736                                Terminating   16h
network-policy-8173                                Terminating   15h
network-policy-8267                                Terminating   16h
network-policy-8403                                Terminating   16h
network-policy-8568                                Terminating   16h
network-policy-9343                                Terminating   16h
network-policy-9624                                Terminating   16h
network-policy-b-2382                              Terminating   15h
network-policy-b-2597                              Terminating   16h
network-policy-b-4786                              Terminating   15h
network-policy-b-512                               Terminating   16h
network-policy-b-5566                              Terminating   16h
network-policy-b-8452                              Terminating   16h
network-policy-c-6442                              Terminating   16h
openshift                                          Active        19h
openshift-apiserver                                Active        19h


Version-Release number of selected component (if applicable):

4.5.0-0.nightly-2020-08-27-110054
OSP 13 2020-08-05.1


How reproducible: don't have enough data


Steps to Reproduce:
1. Install 4.5 UPI on OSP 13 with Kuryr
2. Run tempest and NP tests

Actual results: kuryr-controller in crashloop and namespaces in Terminating status


Expected results: no crashloops and successful namespace removals


Additional info:

$ openstack network trunk list
+--------------------------------------+-----------------------------+--------------------------------------+-------------+
| ID                                   | Name                        | Parent Port                          | Description |
+--------------------------------------+-----------------------------+--------------------------------------+-------------+
| 1fb6802c-69d5-469a-95ee-9b157b0d608d | ostest-6tf5m-worker-trunk-1 | 691f1761-3599-4c1e-86aa-e008aafce806 |             |
| 6122a1f3-a7ee-4cde-93a4-8ee5cef478dc | ostest-6tf5m-worker-trunk-2 | 9ca45ccb-a009-4aa6-b702-d4648e604a01 |             |
| 7a5eb902-73ec-415f-bcd0-d193d1fc0521 | ostest-6tf5m-master-trunk-0 | d5fe7a41-ef2f-4dea-a4b3-e95745c0bb44 |             |
| 9ac571e6-ce6d-4f50-b313-bcab8f0e6c00 | ostest-6tf5m-worker-trunk-0 | f5f614cb-098c-40e9-9ca0-eb37b00b5e15 |             |
| b563e48c-5a66-43ff-87d5-96fa661201f0 | ostest-6tf5m-master-trunk-1 | f604d46d-c0a0-4546-9bb9-29c91c00aa10 |             |
| dc2eaf3f-98e8-4bb8-b618-c9275e402a81 | ostest-6tf5m-master-trunk-2 | 786eef48-7dab-41ac-8d1f-85345758fe98 |             |
+--------------------------------------+-----------------------------+--------------------------------------+-------------+

Comment 1 Luis Tomas Bolivar 2020-08-28 14:25:16 UTC
Error on the kuryr controller looks like:
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool [-] Error removing the port fe7832c7-954e-454b-91c1-bc7a5fc57458: openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://10.46.22.24:13696/v2.0/ports/fe7832c7-954e-454b-91c1-bc7a5fc57458, Port fe7832c7-954e-454b-91c1-bc7a5fc57458 is currently a subport for trunk 1fb6802c-69d5-469a-95ee-9b157b0d608d.
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool Traceback (most recent call last):
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 898, in _precreated_ports
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     os_net.delete_port(port_id)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 1749, in delete_port
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     if_revision=if_revision)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/proxy.py", line 46, in check
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     return method(self, expected, actual, *args, **kwargs)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 75, in _delete
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     rv = res.delete(self, if_revision=if_revision)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/resource.py", line 1622, in delete
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     self._translate_response(response, has_body=False, **kwargs)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/resource.py", line 1113, in _translate_response
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     exceptions.raise_from_response(response, error_message=error_message)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/exceptions.py", line 235, in raise_from_response
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     http_status=http_status, request_id=request_id
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://10.46.22.24:13696/v2.0/ports/fe7832c7-954e-454b-91c1-bc7a5fc57458, Port fe7832c7-954e-454b-91c1-bc7a5fc57458 is currently a subport for trunk 1fb6802c-69d5-469a-95ee-9b157b0d608d.
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool

Comment 2 Luis Tomas Bolivar 2020-08-31 06:35:54 UTC
The problem is caused by wrong tagging of the worker node parent ports. The patch https://review.opendev.org/#/c/748670/ will help, since it ensures namespaces are deleted anyway, but it won't solve the wrong tagging, which is the actual culprit: it breaks the ports pool functionality because the already created ports cannot be re-discovered.
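
The general idea of cleaning up a subport before deleting it can be sketched as follows. This is only an illustration under stated assumptions and is not the content of the patch above: `delete_port_detaching_subport` and `find_trunk_for_subport` are hypothetical helper names, and `os_net` is assumed to be an openstacksdk network proxy (the object that `os_net.delete_port(port_id)` is called on in the traceback). The `trunks()` and `delete_trunk_subports()` calls and the `sub_ports` attribute are used as I understand the openstacksdk network API; check them against the SDK version in use.

```python
try:
    from openstack.exceptions import ConflictException
except ImportError:
    # Stand-in so the sketch can run without openstacksdk installed.
    class ConflictException(Exception):
        pass


def find_trunk_for_subport(os_net, port_id):
    """Return the trunk that lists port_id as a subport, or None."""
    for trunk in os_net.trunks():
        for subport in (trunk.sub_ports or []):
            if subport.get("port_id") == port_id:
                return trunk
    return None


def delete_port_detaching_subport(os_net, port_id):
    """Delete a port; on a 409 conflict, detach it from its trunk and retry."""
    try:
        os_net.delete_port(port_id)
        return
    except ConflictException:
        trunk = find_trunk_for_subport(os_net, port_id)
        if trunk is None:
            # The conflict has some other cause; do not hide it.
            raise
    # Detach the port from the trunk, then retry the deletion once.
    os_net.delete_trunk_subports(trunk.id, [{"port_id": port_id}])
    os_net.delete_port(port_id)
```

Scanning every trunk on each conflict is simple but not cheap; a real controller would more likely look the trunk up from the node the port belongs to.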

Comment 5 rlobillo 2020-09-03 15:19:34 UTC
Failed on 4.6.0-0.nightly-2020-09-03-063148 over RHOS-16.1-RHEL-8-20200821.n.0 (with ovn-octavia).

After installing with IPI and running the NP and conformance tests, namespaces were again stuck in Terminating state:

$ oc get pods -n openshift-kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-4f6bh                     1/1     Running   6          4h33m
kuryr-cni-6sqt7                     1/1     Running   1          4h51m
kuryr-cni-fpql4                     1/1     Running   1          4h51m
kuryr-cni-jxdf7                     1/1     Running   0          4h51m
kuryr-cni-ssw4q                     1/1     Running   7          4h31m
kuryr-cni-v7s7r                     1/1     Running   7          4h32m
kuryr-controller-846bff6c86-7qnhd   1/1     Running   21         4h51m

$ oc get namespaces | grep Terminating
e2e-configmap-5444                                 Terminating   118m
e2e-dns-8169                                       Terminating   90m
e2e-emptydir-3568                                  Terminating   116m
e2e-gc-4183                                        Terminating   105m
e2e-kubectl-19                                     Terminating   98m
e2e-services-7416                                  Terminating   87m
e2e-statefulset-6426                               Terminating   90m
e2e-webhook-82                                     Terminating   127m
network-policy-487                                 Terminating   3h6m
network-policy-7073                                Terminating   3h18m


$ openstack subnet list | grep e2e-dns-8169
| 614b29fb-8d0d-40a0-9f64-72d592c1d70d | ns/e2e-dns-8169-subnet                                     | 94538656-a62f-456c-b53d-8fccf7aa6d8a | 10.128.156.0/23 |

The port linked to that namespace is DOWN and its device_owner is empty:

$ openstack port list | grep 614b29fb-8d0d-40a0-9f64-72d592c1d70d
| 48b1bbc6-cd06-47e9-8128-03dd107dd568 |                                                      | fa:16:3e:6a:0b:38 | ip_address='10.128.156.55', subnet_id='614b29fb-8d0d-40a0-9f64-72d592c1d70d'  | DOWN   |

$ openstack port show 48b1bbc6-cd06-47e9-8128-03dd107dd568 -f yaml
admin_state_up: true
allowed_address_pairs: []
binding_host_id: null
binding_profile: null
binding_vif_details: null
binding_vif_type: null
binding_vnic_type: normal
created_at: '2020-09-03T13:40:10Z'
data_plane_status: null
description: ''
device_id: ''
device_owner: ''
dns_assignment:
- fqdn: host-10-128-156-55.shiftstack.com.
  hostname: host-10-128-156-55
  ip_address: 10.128.156.55
dns_domain: ''
dns_name: ''
extra_dhcp_opts: []
fixed_ips:
- ip_address: 10.128.156.55
  subnet_id: 614b29fb-8d0d-40a0-9f64-72d592c1d70d
id: 48b1bbc6-cd06-47e9-8128-03dd107dd568
location:
  cloud: ''
  project:
    domain_id: null
    domain_name: Default
    id: a429f89224cf4940a0be7ae306cbe53f
    name: shiftstack
  region_name: regionOne
  zone: null
mac_address: fa:16:3e:6a:0b:38
name: ''
network_id: 94538656-a62f-456c-b53d-8fccf7aa6d8a
port_security_enabled: true
project_id: a429f89224cf4940a0be7ae306cbe53f
propagate_uplink_status: null
qos_policy_id: null
resource_request: null
revision_number: 8
security_group_ids:
- f9096ae0-1850-4f7f-96c1-78c6a48ffd77
status: DOWN
tags:
- openshiftClusterID=ostest-cbn5w
trunk_details: null
updated_at: '2020-09-03T13:45:01Z'


As a result, the kuryr-controller is not able to delete it and keeps crash-looping.
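
Leftover ports like the one above can be picked out of a port listing by checking for an empty device_owner on the namespace subnet. The snippet below is a hedged sketch of that selection only, not the fix that was merged: `orphaned_ports` is a hypothetical helper, and the port objects are assumed to expose `fixed_ips` and `device_owner` attributes matching the fields in the `openstack port show` output.

```python
def orphaned_ports(ports, subnet_id):
    """Return the ports on subnet_id that have no device_owner set."""
    selected = []
    for port in ports:
        on_subnet = any(
            ip.get("subnet_id") == subnet_id for ip in (port.fixed_ips or [])
        )
        # An empty string or None both count as "no owner".
        if on_subnet and not port.device_owner:
            selected.append(port)
    return selected
```

With an openstacksdk network proxy, the input could come from something like `os_net.ports(network_id=...)`, and each selected port would then be removed with `os_net.delete_port(port.id)`.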

Comment 7 Luis Tomas Bolivar 2020-09-07 10:46:13 UTC
*** Bug 1876434 has been marked as a duplicate of this bug. ***

Comment 8 rlobillo 2020-09-07 15:26:13 UTC
Verified on 4.6.0-0.nightly-2020-09-05-015624 over RHOS-16.1-RHEL-8-20200831.n.1 with OVN-Octavia.

After installing with IPI and running NP+Conformance, namespaces were successfully terminated:

$ oc get namespaces | grep Terminating
$


The NP and conformance test results were as expected:

$ grep msg np_results/np_kubetest.log | grep PASSED | wc -l
23

$ grep ^passed conformance_results/conformance_ocp-tests.log | wc -l 
289

Test logs attached.

Comment 9 rlobillo 2020-09-07 15:27:29 UTC
Created attachment 1713979
conformance test result

Comment 10 rlobillo 2020-09-07 15:28:02 UTC
Created attachment 1713980
NP test results

Comment 12 errata-xmlrpc 2020-10-27 16:35:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

