Bug 1757876 - kuryr-controller stuck in CrashLoopBackOff and no pod creation possible after running OCP functional automation.
Keywords:
Status: CLOSED DUPLICATE of bug 1759095
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.3.0
Assignee: Luis Tomas Bolivar
QA Contact: GenadiC
URL:
Whiteboard:
Depends On:
Blocks: 1757916
 
Reported: 2019-10-02 15:37 UTC by Mike Fiedler
Modified: 2019-10-17 15:40 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1757916 (view as bug list)
Environment:
Last Closed: 2019-10-17 15:40:43 UTC
Target Upstream Version:
Embargoed:


Attachments
kuryr pod logs + events (1.29 MB, application/gzip)
2019-10-02 15:40 UTC, Mike Fiedler

Description Mike Fiedler 2019-10-02 15:37:22 UTC
Description of problem:

While running the OCP functional automation non-destructive apiserver test suite, new pods eventually could not be added to the network and the kuryr-controller pod started crash looping. From the application namespace side, the events are:

Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_hello-pod-kf8j7_abcd_bc418dd8-e521-11e9-bebb-fa163e8e25e7_0(ea5342ae4b2a716fb1a7abc6fa8fbb1a37a4639b09843d39234546ec77885d71): Multus: Err adding pod to network "kuryr": Multus: error in invoke Delegate add - "kuryr-cni": CNI Daemon returned error 504

In the kuryr-controller pod logs there are ResourceNotReady and "Asynchronous handler terminated abnormally" messages, and the pod restarts continuously.


I'll attach the kuryr pod logs and events and follow up with a private comment on how to access the cluster.


Version-Release number of selected component (if applicable): 4.2.0-0.nightly-2019-09-27-191818


How reproducible: Unknown


Steps to Reproduce:
1. Cluster was installed by the OSP QE team.
2. Run the OCP automated test suites. Several suites ran successfully, but the issue occurred during an apiserver suite. The exact test case that got stuck because pod creation failed was:
# @case_id OCP-11189
  @admin
  Scenario: Limit to create pod to access hostPID
which has nothing explicitly to do with networking.

3. oc get pods -n openshift-kuryr

Actual results:

Application pods are all stuck in ContainerCreating with the above events. The kuryr-controller pod is endlessly crash looping.

Expected results:  OCP non-destructive automation runs successfully


Additional info:

There is a repeated pattern of errors in the kuryr-controller log like the one below. I'll attach the full pod logs and events.

2019-10-02 15:20:49.864 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for 79a0e32a-e527-11e9-bebb-fa163e8e25e7: GreenletExit
2019-10-02 15:20:49.864 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/namespace.py", line 95, in on_present\n    project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 141, in create_namespace_network\n    "project_id": project_id\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 817, in create_network\n    return self.post(self.networks_path, body=body)\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 359, in post\n    headers=headers, params=params)\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 294, in do_request\n    self._handle_fault_response(status_code, replybody, resp)\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 269, in _handle_fault_response\n    exception_handler_v20(status_code, error_body)\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 93, in exception_handler_v20\n    request_ids=request_ids)\n', "OverQuotaClient: Quota exceeded for resources: ['network'].\nNeutron server returns request_ids: ['req-a8e52966-573d-4ca1-9139-3eaa035f6c02']\n"]: GreenletExit
2019-10-02 15:20:49.865 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for b82ca3d7-e521-11e9-bebb-fa163e8e25e7: GreenletExit
2019-10-02 15:20:49.865 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present\n    subnets = self._drv_subnets.get_subnets(pod, project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets\n    return self.get_namespace_subnet(pod_namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet\n    subnet_id = self._get_namespace_subnet_id(namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id\n    raise exceptions.ResourceNotReady(namespace)\n', "ResourceNotReady: Resource not ready: u'1qhj6'\n"]: GreenletExit
2019-10-02 15:20:49.866 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for 7a7f8a55-e527-11e9-a4b6-fa163ecc0ba0: GreenletExit
2019-10-02 15:20:49.866 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present\n    subnets = self._drv_subnets.get_subnets(pod, project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets\n    return self.get_namespace_subnet(pod_namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet\n    subnet_id = self._get_namespace_subnet_id(namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id\n    raise exceptions.ResourceNotReady(namespace)\n', "ResourceNotReady: Resource not ready: u'1qhj6'\n"]: GreenletExit
2019-10-02 15:20:49.867 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for 7b46e8c5-e527-11e9-bebb-fa163e8e25e7: GreenletExit
2019-10-02 15:20:49.867 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present\n    subnets = self._drv_subnets.get_subnets(pod, project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets\n    return self.get_namespace_subnet(pod_namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet\n    subnet_id = self._get_namespace_subnet_id(namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id\n    raise exceptions.ResourceNotReady(namespace)\n', "ResourceNotReady: Resource not ready: u'abcd'\n"]: GreenletExit
2019-10-02 15:20:49.867 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present\n    subnets = self._drv_subnets.get_subnets(pod, project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets\n    return self.get_namespace_subnet(pod_namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet\n    subnet_id = self._get_namespace_subnet_id(namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id\n    raise exceptions.ResourceNotReady(namespace)\n', "ResourceNotReady: Resource not ready: u'abcd'\n"]: GreenletExit
2019-10-02 15:20:49.868 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopped
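
The dump above interleaves two distinct failures: OverQuotaClient, where Neutron refuses to create a new network, and ResourceNotReady, for namespaces whose subnet never got created. A quick way to separate them when reading the attached logs (the log file name is an assumption; a two-line sample stands in for the real file here):

```shell
# Separate the two failure modes seen in the kuryr-controller log.
# kuryr-controller.log is a stand-in name; the sample lines mimic the dump above.
cat > kuryr-controller.log <<'EOF'
ERROR root [-] ... OverQuotaClient: Quota exceeded for resources: ['network'].
ERROR root [-] ... ResourceNotReady: Resource not ready: u'abcd'
EOF
grep -c 'OverQuotaClient' kuryr-controller.log    # Neutron quota rejections
grep -c 'ResourceNotReady' kuryr-controller.log   # namespaces without a subnet
```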

Comment 1 Mike Fiedler 2019-10-02 15:40:52 UTC
Created attachment 1621917 [details]
kuryr pod logs + events

Could not run oc adm must-gather, so collected kuryr pod logs.
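
Since must-gather wasn't runnable, a by-hand collection along these lines works (the namespace is from this report; the loop structure and file naming are a sketch, not the exact commands used):

```shell
# Collect logs and events from the openshift-kuryr namespace by hand.
NS=openshift-kuryr
for pod in $(oc get pods -n "$NS" -o name 2>/dev/null); do
  # one log file per pod, e.g. kuryr-controller-xxxxx.log
  oc logs -n "$NS" "$pod" --all-containers > "${pod#pod/}.log" 2>/dev/null
done
oc get events -n "$NS" > events.log 2>/dev/null
```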

Comment 3 Michał Dulko 2019-10-02 15:58:27 UTC
This is definitely a quota issue - I see the networks and subnets quotas at capacity on the setup. The quotas we document at https://github.com/openshift/installer/blob/master/docs/user/openstack/kuryr.md#requirements-when-enabling-kuryr are the "bare minimum", meaning they are the minimal numbers needed for a successful installation. As you can see, they do not allow many more namespaces to be created.

Please bump the OpenStack subnet and network quotas to 150 (or more if the number of namespaces your tests create is higher) and see if it helps.
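
The 150 figure above can be read as a sizing rule: with Kuryr namespace isolation, each OpenShift namespace consumes one Neutron network and one subnet on top of what the installer creates. A minimal sketch (the baseline, namespace count, and headroom values are illustrative assumptions; `openstack quota set --networks/--subnets` are the standard OpenStackClient flags):

```shell
# Rough Neutron quota sizing for Kuryr: one network + one subnet per namespace.
# BASELINE, EXPECTED_NAMESPACES, and HEADROOM are illustrative assumptions.
BASELINE=20              # networks/subnets the installation itself uses
EXPECTED_NAMESPACES=120  # namespaces the test suites may create
HEADROOM=10              # slack for transient or leaked resources
QUOTA=$((BASELINE + EXPECTED_NAMESPACES + HEADROOM))
echo "$QUOTA"            # 150, matching the value suggested above
# Apply it (requires OpenStack admin credentials, against the cluster's project):
#   openstack quota set --networks "$QUOTA" --subnets "$QUOTA" <project>
```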

Comment 5 Michał Dulko 2019-10-02 16:11:37 UTC
Okay, I also see that those networks and subnets linger even though the namespaces/projects related to them were removed. This might still be a symptom of a real problem, then. In any case, to unblock the tests, just bump the quotas.

Comment 6 Mike Fiedler 2019-10-02 17:33:14 UTC
Maybe some sort of resource leak? The tests do constantly create/delete projects and pods, among other resources.
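
The leak described in comments 5 and 6 can be checked by diffing the Neutron side against the live namespaces. A sketch using the two namespace names visible in the log above ('abcd', '1qhj6'); in a real run the two lists would come from `openstack network list` and `oc get namespaces`, and the assumption that each Kuryr network maps 1:1 to a namespace name is mine:

```shell
# Stand-in lists; on a live cluster they would be generated with e.g.:
#   openstack network list -f value -c Name   (Neutron side)
#   oc get namespaces -o name                 (OpenShift side)
printf '%s\n' abcd 1qhj6 openshift-kuryr | sort > neutron-ns.txt
printf '%s\n' openshift-kuryr | sort > ocp-ns.txt
# Lines only in the Neutron list = networks whose namespace is gone (leaks).
comm -23 neutron-ns.txt ocp-ns.txt
```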

Comment 7 Ben Bennett 2019-10-02 17:45:28 UTC
Setting this to be the master (4.3) fix. I will clone it to track for 4.2, but the fix is likely only able to make 4.2.1.

Comment 8 Luis Tomas Bolivar 2019-10-17 15:40:43 UTC

*** This bug has been marked as a duplicate of bug 1759095 ***

