Created attachment 1712944 [details]
Kuryr controller logs

Description of problem:

kuryr-controller pod remains in crashloop after running tempest and NP tests on 4.5 UPI deployment in OSP 13. The namespaces created during tempest and NP tests cannot be deleted due to error removing the ports:

ERROR kuryr_kubernetes.controller.drivers.vif_pool [-] Error removing the port 01a994ca-5286-45c1-b6f6-ba6cb663837a: openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://10.46.22.24:13696/v2.0/ports/01a994ca-5286-45c1-b6f6-ba6cb663837a, Port 01a994ca-5286-45c1-b6f6-ba6cb663837a is currently a subport for trunk 1fb6802c-69d5-469a-95ee-9b157b0d608d.

The ports are in active status. It happens with the trunks in all the worker nodes.

$ oc -n openshift-kuryr get pods
NAME                                   READY   STATUS    RESTARTS   AGE
kuryr-cni-2hn5w                        1/1     Running   0          18h
kuryr-cni-2zm85                        1/1     Running   0          18h
kuryr-cni-5jgtv                        1/1     Running   1          18h
kuryr-cni-9dr4x                        1/1     Running   0          18h
kuryr-cni-g9hq9                        1/1     Running   0          18h
kuryr-cni-k4rvv                        1/1     Running   0          18h
kuryr-controller-857bb8dc46-ps4xs      1/1     Running   106        18h
kuryr-dns-admission-controller-48hpl   1/1     Running   0          18h
kuryr-dns-admission-controller-9hdrb   1/1     Running   0          18h
kuryr-dns-admission-controller-b7qkb   1/1     Running   0          18h

$ oc get ns
NAME                         STATUS        AGE
default                      Active        20h
kube-node-lease              Active        20h
kube-public                  Active        20h
kube-system                  Active        20h
kuryr-namespace-2107688107   Terminating   17h
network-policy-1136          Terminating   16h
network-policy-1217          Terminating   15h
network-policy-1649          Terminating   15h
network-policy-1678          Terminating   15h
network-policy-2176          Terminating   16h
network-policy-2578          Terminating   16h
network-policy-3199          Terminating   16h
network-policy-3312          Terminating   15h
network-policy-3340          Terminating   16h
network-policy-5163          Terminating   15h
network-policy-7220          Terminating   16h
network-policy-7736          Terminating   16h
network-policy-8173          Terminating   15h
network-policy-8267          Terminating   16h
network-policy-8403          Terminating   16h
network-policy-8568          Terminating   16h
network-policy-9343          Terminating   16h
network-policy-9624          Terminating   16h
network-policy-b-2382        Terminating   15h
network-policy-b-2597        Terminating   16h
network-policy-b-4786        Terminating   15h
network-policy-b-512         Terminating   16h
network-policy-b-5566        Terminating   16h
network-policy-b-8452        Terminating   16h
network-policy-c-6442        Terminating   16h
openshift                    Active        19h
openshift-apiserver          Active        19h

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-08-27-110054
OSP 13 2020-08-05.1

How reproducible:
don't have enough data

Steps to Reproduce:
1. Install 4.5 UPI on OSP 13 with Kuryr
2. Run tempest and NP tests

Actual results:
kuryr-controller in crashloop and namespaces in Terminating status

Expected results:
no crashloops and successful namespace removals

Additional info:

$ openstack network trunk list
+--------------------------------------+-----------------------------+--------------------------------------+-------------+
| ID                                   | Name                        | Parent Port                          | Description |
+--------------------------------------+-----------------------------+--------------------------------------+-------------+
| 1fb6802c-69d5-469a-95ee-9b157b0d608d | ostest-6tf5m-worker-trunk-1 | 691f1761-3599-4c1e-86aa-e008aafce806 |             |
| 6122a1f3-a7ee-4cde-93a4-8ee5cef478dc | ostest-6tf5m-worker-trunk-2 | 9ca45ccb-a009-4aa6-b702-d4648e604a01 |             |
| 7a5eb902-73ec-415f-bcd0-d193d1fc0521 | ostest-6tf5m-master-trunk-0 | d5fe7a41-ef2f-4dea-a4b3-e95745c0bb44 |             |
| 9ac571e6-ce6d-4f50-b313-bcab8f0e6c00 | ostest-6tf5m-worker-trunk-0 | f5f614cb-098c-40e9-9ca0-eb37b00b5e15 |             |
| b563e48c-5a66-43ff-87d5-96fa661201f0 | ostest-6tf5m-master-trunk-1 | f604d46d-c0a0-4546-9bb9-29c91c00aa10 |             |
| dc2eaf3f-98e8-4bb8-b618-c9275e402a81 | ostest-6tf5m-master-trunk-2 | 786eef48-7dab-41ac-8d1f-85345758fe98 |             |
+--------------------------------------+-----------------------------+--------------------------------------+-------------+
Error on the kuryr controller looks like:

2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool [-] Error removing the port fe7832c7-954e-454b-91c1-bc7a5fc57458: openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://10.46.22.24:13696/v2.0/ports/fe7832c7-954e-454b-91c1-bc7a5fc57458, Port fe7832c7-954e-454b-91c1-bc7a5fc57458 is currently a subport for trunk 1fb6802c-69d5-469a-95ee-9b157b0d608d.
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool Traceback (most recent call last):
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/vif_pool.py", line 898, in _precreated_ports
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     os_net.delete_port(port_id)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 1749, in delete_port
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     if_revision=if_revision)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/proxy.py", line 46, in check
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     return method(self, expected, actual, *args, **kwargs)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 75, in _delete
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     rv = res.delete(self, if_revision=if_revision)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/resource.py", line 1622, in delete
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     self._translate_response(response, has_body=False, **kwargs)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/resource.py", line 1113, in _translate_response
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     exceptions.raise_from_response(response, error_message=error_message)
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool   File "/usr/local/lib/python3.6/site-packages/openstack/exceptions.py", line 235, in raise_from_response
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool     http_status=http_status, request_id=request_id
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://10.46.22.24:13696/v2.0/ports/fe7832c7-954e-454b-91c1-bc7a5fc57458, Port fe7832c7-954e-454b-91c1-bc7a5fc57458 is currently a subport for trunk 1fb6802c-69d5-469a-95ee-9b157b0d608d.
2020-08-28 14:19:55.415 1 ERROR kuryr_kubernetes.controller.drivers.vif_pool
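To see how many ports are stuck on each trunk, the conflict messages in the attached controller log can be grouped by trunk ID. The following is only a rough triage sketch, not part of Kuryr: the function name `stuck_subports_by_trunk` is made up for illustration, and the regular expression assumes the message wording stays exactly as quoted in the log above.

```python
import re
from collections import defaultdict

# Matches the Neutron conflict text quoted in the controller log.
# Assumption: port and trunk IDs are 36-character lowercase UUIDs.
SUBPORT_CONFLICT = re.compile(
    r"Port (?P<port>[0-9a-f-]{36}) is currently a subport "
    r"for trunk (?P<trunk>[0-9a-f-]{36})"
)


def stuck_subports_by_trunk(log_lines):
    """Return {trunk_id: sorted list of port IDs that failed deletion}."""
    by_trunk = defaultdict(set)
    for line in log_lines:
        match = SUBPORT_CONFLICT.search(line)
        if match:
            # A set removes repeats: the same failure is logged on every retry.
            by_trunk[match.group("trunk")].add(match.group("port"))
    return {trunk: sorted(ports) for trunk, ports in by_trunk.items()}


if __name__ == "__main__":
    import sys

    # Usage: python triage.py < kuryr-controller.log
    for trunk, ports in stuck_subports_by_trunk(sys.stdin).items():
        print(trunk, len(ports), "stuck port(s)")
```

The trunk IDs in the result can then be matched against the `openstack network trunk list` output to confirm that only worker trunks are affected.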
The problem is caused by wrong tagging of the worker nodes' parent ports. The patch https://review.opendev.org/#/c/748670/ will help, as it ensures namespaces are deleted anyway, but it won't solve the wrong tagging, which is the actual culprit: it breaks the ports pool functionality, because the controller is not able to rediscover the ports that were already created.
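For illustration only, here is one way a cleanup path could tolerate the 409 seen in the traceback: detach the port from its trunk, then retry the delete. This is a sketch under stated assumptions and is not the content of the linked patch. `delete_port_detaching_subport` is a hypothetical helper; `os_net` is assumed to be an openstacksdk network proxy (the object the traceback calls `os_net`), whose `delete_port` and `delete_trunk_subports` methods are used; `trunk_id` is assumed to be known already, for example taken from the conflict message; and `conflict_exc` would be `openstack.exceptions.ConflictException` in a real deployment.

```python
def delete_port_detaching_subport(os_net, port_id, trunk_id, conflict_exc):
    """Delete a port, detaching it from its trunk first if Neutron refuses.

    Returns a short string describing which path was taken.
    """
    try:
        os_net.delete_port(port_id)
        return "deleted"
    except conflict_exc:
        # Neutron answers 409 while the port is still a subport of a trunk,
        # so remove it from the trunk and retry the delete once.
        os_net.delete_trunk_subports(trunk_id, [{"port_id": port_id}])
        os_net.delete_port(port_id)
        return "detached-then-deleted"
```

A second conflict on the retry is deliberately left to propagate, so the caller still sees a real failure instead of looping.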
Failed on 4.6.0-0.nightly-2020-09-03-063148 over RHOS-16.1-RHEL-8-20200821.n.0 (with ovn-octavia).

After installing with IPI and running NP+Conformance, namespaces also remained stuck in Terminating state:

$ oc get pods -n openshift-kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-4f6bh                     1/1     Running   6          4h33m
kuryr-cni-6sqt7                     1/1     Running   1          4h51m
kuryr-cni-fpql4                     1/1     Running   1          4h51m
kuryr-cni-jxdf7                     1/1     Running   0          4h51m
kuryr-cni-ssw4q                     1/1     Running   7          4h31m
kuryr-cni-v7s7r                     1/1     Running   7          4h32m
kuryr-controller-846bff6c86-7qnhd   1/1     Running   21         4h51m

$ oc get namespaces | grep Terminating
e2e-configmap-5444     Terminating   118m
e2e-dns-8169           Terminating   90m
e2e-emptydir-3568      Terminating   116m
e2e-gc-4183            Terminating   105m
e2e-kubectl-19         Terminating   98m
e2e-services-7416      Terminating   87m
e2e-statefulset-6426   Terminating   90m
e2e-webhook-82         Terminating   127m
network-policy-487     Terminating   3h6m
network-policy-7073    Terminating   3h18m

$ openstack subnet list | grep e2e-dns-8169
| 614b29fb-8d0d-40a0-9f64-72d592c1d70d | ns/e2e-dns-8169-subnet | 94538656-a62f-456c-b53d-8fccf7aa6d8a | 10.128.156.0/23 |

The port linked to that namespace is DOWN and its device_owner is empty:

$ openstack port list | grep 614b29fb-8d0d-40a0-9f64-72d592c1d70d
| 48b1bbc6-cd06-47e9-8128-03dd107dd568 | | fa:16:3e:6a:0b:38 | ip_address='10.128.156.55', subnet_id='614b29fb-8d0d-40a0-9f64-72d592c1d70d' | DOWN |

$ openstack port show 48b1bbc6-cd06-47e9-8128-03dd107dd568 -f yaml
admin_state_up: true
allowed_address_pairs: []
binding_host_id: null
binding_profile: null
binding_vif_details: null
binding_vif_type: null
binding_vnic_type: normal
created_at: '2020-09-03T13:40:10Z'
data_plane_status: null
description: ''
device_id: ''
device_owner: ''
dns_assignment:
- fqdn: host-10-128-156-55.shiftstack.com.
  hostname: host-10-128-156-55
  ip_address: 10.128.156.55
dns_domain: ''
dns_name: ''
extra_dhcp_opts: []
fixed_ips:
- ip_address: 10.128.156.55
  subnet_id: 614b29fb-8d0d-40a0-9f64-72d592c1d70d
id: 48b1bbc6-cd06-47e9-8128-03dd107dd568
location:
  cloud: ''
  project:
    domain_id: null
    domain_name: Default
    id: a429f89224cf4940a0be7ae306cbe53f
    name: shiftstack
  region_name: regionOne
  zone: null
mac_address: fa:16:3e:6a:0b:38
name: ''
network_id: 94538656-a62f-456c-b53d-8fccf7aa6d8a
port_security_enabled: true
project_id: a429f89224cf4940a0be7ae306cbe53f
propagate_uplink_status: null
qos_policy_id: null
resource_request: null
revision_number: 8
security_group_ids:
- f9096ae0-1850-4f7f-96c1-78c6a48ffd77
status: DOWN
tags:
- openshiftClusterID=ostest-cbn5w
trunk_details: null
updated_at: '2020-09-03T13:45:01Z'

As a result, the kuryr-controller is not able to delete it and keeps crash looping.
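Ports in this state can be picked out of a larger listing by checking the same fields shown in the `openstack port show` output above. The sketch below works on plain dictionaries with those keys (for example, loaded from the YAML or JSON output); `is_leftover_candidate` is a hypothetical name, and a match only means the port deserves a manual look, since whether it is really leftover depends on the deployment.

```python
def is_leftover_candidate(port, cluster_tag):
    """True if a port looks like the unbound, DOWN port reported above.

    `port` is a dict using the field names printed by `openstack port show`.
    """
    tags = port.get("tags") or []
    return (
        port.get("status") == "DOWN"
        and not port.get("device_owner")  # empty string or None
        and not port.get("device_id")
        and cluster_tag in tags
    )


def leftover_candidates(ports, cluster_tag):
    """Return the IDs of all ports that match the pattern, in input order."""
    return [p["id"] for p in ports if is_leftover_candidate(p, cluster_tag)]
```

For the port shown above, the cluster tag to pass would be `openshiftClusterID=ostest-cbn5w`.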
*** Bug 1876434 has been marked as a duplicate of this bug. ***
Verified on 4.6.0-0.nightly-2020-09-05-015624 over RHOS-16.1-RHEL-8-20200831.n.1 with OVN-Octavia.

After installing with IPI and running NP+Conformance, namespaces were successfully terminated:

$ oc get namespaces | grep Terminating
$

NP and conformance test results were as expected:

$ grep msg np_results/np_kubetest.log | grep PASSED | wc -l
23
$ grep ^passed conformance_results/conformance_ocp-tests.log | wc -l
289

Test logs attached.
Created attachment 1713979 [details] conformance test result
Created attachment 1713980 [details] NP test results
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196