Bug 1757876

Summary: kuryr-controller stuck in CrashLoopBackOff and no pod creation possible after running OCP functional automation.
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Networking
Networking sub component: kuryr
Assignee: Luis Tomas Bolivar <ltomasbo>
QA Contact: GenadiC <gcheresh>
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
CC: bbennett, juriarte, mdulko
Version: 4.2.0
Keywords: TestBlocker
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-10-17 15:40:43 UTC
Type: Bug
Cloned To: 1757916
Bug Blocks: 1757916
Attachments: kuryr pod logs + events

Description Mike Fiedler 2019-10-02 15:37:22 UTC
Description of problem:

While running the OCP functional automation non-destructive apiserver test suite, new pods eventually could not be added to the network and the kuryr-controller pod started crash looping. From the application namespace side, the events are:

Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_hello-pod-kf8j7_abcd_bc418dd8-e521-11e9-bebb-fa163e8e25e7_0(ea5342ae4b2a716fb1a7abc6fa8fbb1a37a4639b09843d39234546ec77885d71): Multus: Err adding pod to network "kuryr": Multus: error in invoke Delegate add - "kuryr-cni": CNI Daemon returned error 504

In the kuryr-controller pod logs there are ResourceNotReady and "Asynchronous handler terminated abnormally" messages, and the pod restarts continuously.
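
For reference, confirming the crash loop and pulling the controller's pre-restart log would look roughly like this (the pod name is a placeholder):

oc -n openshift-kuryr get pods
oc -n openshift-kuryr logs --previous <kuryr-controller-pod>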


I'll attach the kuryr pod logs and events and follow up with a private comment on how to access the cluster.


Version-Release number of selected component (if applicable): 4.2.0-0.nightly-2019-09-27-191818


How reproducible: Unknown


Steps to Reproduce:
1.  Cluster was installed by OSP QE team
2.  Running OCP automated test suites. Several suites ran successfully, but the issue happened during an apiserver suite. The exact test case that got stuck due to pod creation failing was:
# @case_id OCP-11189
  @admin
  Scenario: Limit to create pod to access hostPID
This scenario has nothing explicitly to do with networking.

3. oc get pods -n openshift-kuryr

Actual results:

Application pods are all stuck in ContainerCreating with the above events. The kuryr-controller pod is endlessly crash looping.

Expected results:  OCP non-destructive automation runs successfully


Additional info:

There is a repeated pattern of errors in the kuryr-controller log like the excerpt below. I'll attach the full pod logs and events.

2019-10-02 15:20:49.864 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for 79a0e32a-e527-11e9-bebb-fa163e8e25e7: GreenletExit^[[00m
2019-10-02 15:20:49.864 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/namespace.py", line 95, in on_present\n    project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 141, in create_namespace_network\n    "project_id": project_id\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 817, in create_network\n    return self.post(self.networks_path, body=body)\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 359, in post\n    headers=headers, params=params)\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 294, in do_request\n    self._handle_fault_response(status_code, replybody, resp)\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 269, in _handle_fault_response\n    exception_handler_v20(status_code, error_body)\n', '  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 93, in exception_handler_v20\n    request_ids=request_ids)\n', "OverQuotaClient: Quota exceeded for resources: ['network'].\nNeutron server returns request_ids: ['req-a8e52966-573d-4ca1-9139-3eaa035f6c02']\n"]: GreenletExit^[[00m
2019-10-02 15:20:49.865 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for b82ca3d7-e521-11e9-bebb-fa163e8e25e7: GreenletExit^[[00m
2019-10-02 15:20:49.865 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present\n    subnets = self._drv_subnets.get_subnets(pod, project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets\n    return self.get_namespace_subnet(pod_namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet\n    subnet_id = self._get_namespace_subnet_id(namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id\n    raise exceptions.ResourceNotReady(namespace)\n', "ResourceNotReady: Resource not ready: u'1qhj6'\n"]: GreenletExit^[[00m
2019-10-02 15:20:49.866 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for 7a7f8a55-e527-11e9-a4b6-fa163ecc0ba0: GreenletExit^[[00m
2019-10-02 15:20:49.866 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present\n    subnets = self._drv_subnets.get_subnets(pod, project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets\n    return self.get_namespace_subnet(pod_namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet\n    subnet_id = self._get_namespace_subnet_id(namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id\n    raise exceptions.ResourceNotReady(namespace)\n', "ResourceNotReady: Resource not ready: u'1qhj6'\n"]: GreenletExit^[[00m
2019-10-02 15:20:49.867 1 CRITICAL kuryr_kubernetes.handlers.asynchronous [-] Asynchronous handler terminated abnormally; 1 events dropped for 7b46e8c5-e527-11e9-bebb-fa163e8e25e7: GreenletExit^[[00m
2019-10-02 15:20:49.867 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present\n    subnets = self._drv_subnets.get_subnets(pod, project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets\n    return self.get_namespace_subnet(pod_namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet\n    subnet_id = self._get_namespace_subnet_id(namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id\n    raise exceptions.ResourceNotReady(namespace)\n', "ResourceNotReady: Resource not ready: u'abcd'\n"]: GreenletExit^[[00m
2019-10-02 15:20:49.867 1 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__\n    self._handler(event)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__\n    self.on_present(obj)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present\n    subnets = self._drv_subnets.get_subnets(pod, project_id)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets\n    return self.get_namespace_subnet(pod_namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet\n    subnet_id = self._get_namespace_subnet_id(namespace)\n', '  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 69, in _get_namespace_subnet_id\n    raise exceptions.ResourceNotReady(namespace)\n', "ResourceNotReady: Resource not ready: u'abcd'\n"]: GreenletExit^[[00m
2019-10-02 15:20:49.868 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopped^[[00m

Comment 1 Mike Fiedler 2019-10-02 15:40:52 UTC
Created attachment 1621917 [details]
kuryr pod logs + events

Could not run oc adm must-gather, so collected kuryr pod logs.
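
For reference, collecting these manually amounts to something like the following (pod and namespace names are placeholders):

oc -n openshift-kuryr logs <kuryr-controller-pod> > kuryr-controller.log
oc -n openshift-kuryr logs <kuryr-cni-pod> > kuryr-cni.log
oc -n <application-namespace> get events > events.txt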

Comment 3 Michał Dulko 2019-10-02 15:58:27 UTC
This is definitely a quota issue - I see the networks and subnets quotas at capacity on this setup. The quotas we document at https://github.com/openshift/installer/blob/master/docs/user/openstack/kuryr.md#requirements-when-enabling-kuryr are the "bare minimum", i.e. the smallest values that allow a successful installation. As you can see, they don't leave room for many more namespaces to be created.

Please bump the OpenStack subnet and network quotas to 150 (or more if the number of namespaces your tests create is higher) and see if it helps.
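
With the OpenStack CLI, bumping them would look roughly like this (the project name is a placeholder for the project the cluster runs in):

openstack quota set --networks 150 --subnets 150 <project>
openstack quota show <project>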

Comment 5 Michał Dulko 2019-10-02 16:11:37 UTC
Okay, I also see that those networks and subnets stick around even though the namespaces/projects related to them got removed. This might still be a symptom of a real problem then. Anyway, to unblock the test runs, just bump the quotas.
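
A quick way to spot such leftovers is to list the Neutron resources in the cluster's project and compare them against the namespaces that still exist (the project ID is a placeholder; the exact network naming depends on the Kuryr namespace driver):

openstack network list --project <project-id>
openstack subnet list --project <project-id>
oc get namespaces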

Comment 6 Mike Fiedler 2019-10-02 17:33:14 UTC
Maybe some sort of resource leak? The tests constantly create/delete projects and pods, among other resources.

Comment 7 Ben Bennett 2019-10-02 17:45:28 UTC
Setting this to be the master (4.3) fix. I will clone it to track for 4.2, but the fix is likely to only make 4.2.1.

Comment 8 Luis Tomas Bolivar 2019-10-17 15:40:43 UTC

*** This bug has been marked as a duplicate of bug 1759095 ***