Description of problem:
While running the OCP functional automation non-destructive apiserver test suite, new pods eventually could not be added to the network and the kuryr-controller pod started crash looping.

From the application namespace side, the events are:

Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_hello-pod-kf8j7_abcd_bc418dd8-e521-11e9-bebb-fa163e8e25e7_0(ea5342ae4b2a716fb1a7abc6fa8fbb1a37a4639b09843d39234546ec77885d71): Multus: Err adding pod to network "kuryr": Multus: error in invoke Delegate add - "kuryr-cni": CNI Daemon returned error 504

In the kuryr-controller pod there are ResourceNotReady and "Asynchronous handler terminated abnormally" messages, and the pod restarts continuously. I'll attach the kuryr pod logs and events and follow up with a private comment on how to access the cluster.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-09-27-191818

How reproducible:
Unknown

Steps to Reproduce:
1. Cluster was installed by the OSP QE team.
2. Run the OCP automated test suites. Several suites ran successfully, but the issue happened during an apiserver suite. The exact test case it got stuck on, due to pod creation failing, was:

   # @case_id OCP-11189
   @admin
   Scenario: Limit to create pod to access hostPID

   which doesn't explicitly have anything to do with networking.
3. oc get pods -n openshift-kuryr

Actual results:
Application pods are all stuck in ContainerCreating with the above events. The kuryr-controller pod is endlessly crash looping.

Expected results:
OCP non-destructive automation runs successfully.

Additional info:
There is a repeated pattern of errors in the kuryr-controller log like the one below. I'll attach the full pod logs and events.
2019-10-02 15:20:49.864 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for 79a0e32a-e527-11e9-bebb-fa163e8e25e7: GreenletExit
2019-10-02 15:20:49.864 1 ERROR root [-] Original exception being dropped:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__
    self._handler(event)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__
    self.on_present(obj)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/namespace.py", line 95, in on_present
    project_id)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 141, in create_namespace_network
    "project_id": project_id
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 817, in create_network
    return self.post(self.networks_path, body=body)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 359, in post
    headers=headers, params=params)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 294, in do_request
    self._handle_fault_response(status_code, replybody, resp)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 269, in _handle_fault_response
    exception_handler_v20(status_code, error_body)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 93, in exception_handler_v20
    request_ids=request_ids)
OverQuotaClient: Quota exceeded for resources: ['network'].
Neutron server returns request_ids: ['req-a8e52966-573d-4ca1-9139-3eaa035f6c02']

2019-10-02 15:20:49.865 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for b82ca3d7-e521-11e9-bebb-fa163e8e25e7: GreenletExit
2019-10-02 15:20:49.865 1 ERROR root [-] Original exception being dropped:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__
    self._handler(event)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__
    self.on_present(obj)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present
    subnets = self._drv_subnets.get_subnets(pod, project_id)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets
    return self.get_namespace_subnet(pod_namespace)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet
    subnet_id = self._get_namespace_subnet_id(namespace)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id
    raise exceptions.ResourceNotReady(namespace)
ResourceNotReady: Resource not ready: u'1qhj6'

[The identical ResourceNotReady traceback then repeats at 15:20:49.866 and 15:20:49.867 for events 7a7f8a55-e527-11e9-a4b6-fa163ecc0ba0 and 7b46e8c5-e527-11e9-bebb-fa163e8e25e7, and once more for namespace u'abcd'; see the attached full log.]

2019-10-02 15:20:49.868 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopped
Created attachment 1621917 [details] kuryr pod logs + events. Could not run oc adm must-gather, so I collected the kuryr pod logs instead.
This is definitely a quota issue - I see the network and subnet quotas at capacity on this setup. The quotas we document at https://github.com/openshift/installer/blob/master/docs/user/openstack/kuryr.md#requirements-when-enabling-kuryr are the bare minimum, i.e. the minimal numbers needed to complete a successful installation. As you can see, those don't allow many more namespaces to be created. Please bump the OpenStack network and subnet quotas to 150 (or more, if your tests create a higher number of namespaces) and see if that helps.
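For reference, this is roughly how the quota check and bump above could be done with the OpenStackClient CLI. The project name is a placeholder (substitute the project the cluster actually runs in), and the 150 figure is just the value suggested in this comment:

```shell
# PROJECT is a placeholder - use the OpenStack project/tenant of the cluster.
PROJECT=my-openshift-project

# Inspect current quota limits for the project.
openstack quota show "$PROJECT"

# Count networks and subnets currently consumed in the project.
openstack network list --project "$PROJECT" -f value -c ID | wc -l
openstack subnet list --project "$PROJECT" -f value -c ID | wc -l

# Raise the network and subnet quotas to 150 (or higher, as needed).
openstack quota set --networks 150 --subnets 150 "$PROJECT"
```

This needs credentials with admin rights on the OpenStack cloud, since regular members typically cannot change their own quotas.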
Okay, I also see that those networks and subnets linger even though the namespaces/projects related to them have been removed. This might still be a symptom of a real problem, then. Anyway, to unblock the test runs, just bump the quotas for now.
Maybe some sort of resource leak? The tests do constantly create/delete projects and pods among other resources.
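One quick way to check the leak hypothesis would be to compare the namespaces that still exist against the Kuryr-created Neutron networks. Note the `ns/<namespace>-net` naming pattern below is an assumption about how the namespace subnet driver names its networks - check the actual `openstack network list` output on the cluster and adjust the sed expression accordingly:

```shell
# Live Kubernetes namespaces, one name per line.
oc get namespaces -o name | sed 's|^namespace/||' | sort > /tmp/live-ns.txt

# Namespaces implied by Kuryr-created network names. The "ns/<namespace>-net"
# pattern is an assumption; adjust it to match the real naming on the cluster.
openstack network list -f value -c Name \
  | sed -n 's|^ns/\(.*\)-net$|\1|p' | sort > /tmp/net-ns.txt

# Networks whose namespace no longer exists are leak candidates.
comm -13 /tmp/live-ns.txt /tmp/net-ns.txt
```

Any names printed by the final `comm` would be networks left behind after their namespace was deleted, which would support the leak theory.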
Setting this to track the master (4.3) fix. I will clone it to track 4.2, but the fix is likely to only make 4.2.1.
*** This bug has been marked as a duplicate of bug 1759095 ***