Bug 2002909
| Summary: | [Kuryr][3.11] Don't block Kuryr if one subnet runs out of IPs | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Robin Cernin <rcernin> |
| Component: | Networking | Assignee: | Robin Cernin <rcernin> |
| Networking sub component: | kuryr | QA Contact: | Jon Uriarte <juriarte> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | CC: | itbrown, mdemaced, mdulko |
| Priority: | unspecified | Keywords: | Triaged |
| Version: | 3.11.0 | Target Milestone: | --- |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Last Closed: | 2021-12-02 22:01:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
## Description

Robin Cernin, 2021-09-10 03:10:11 UTC
## Problem description

`eventlet.spawn` swallows the exception from Neutron, so `kuryr_kubernetes/handlers/retry.py` always fails with a `ResourceNotReady` exception:

~~~
try:
    return self._get_port_from_pool(pool_key, pod, subnets,
                                    tuple(sorted(security_groups)))
except exceptions.ResourceNotReady:
    LOG.warning("Ports pool does not have available ports!")
    eventlet.spawn(self._populate_pool, pool_key, pod, subnets,
                   tuple(sorted(security_groups)))
    raise
~~~

This means that the retry below will always see `ResourceNotReady`, regardless of what Neutron actually returned:

~~~
try:
    self._handler(event)
    break
except n_exc.OverQuotaClient:
    with excutils.save_and_reraise_exception() as ex:
        if self._sleep(deadline, attempt, ex.value):
            ex.reraise = False
except self._exceptions:
    with excutils.save_and_reraise_exception() as ex:
        if self._sleep(deadline, attempt, ex.value):
            ex.reraise = False
        else:
            LOG.debug('Report handler unhealthy %s', self._handler)
            self._handler.set_liveness(alive=False)
except Exception:
    LOG.exception('Report handler unhealthy %s', self._handler)
    self._handler.set_liveness(alive=False)
    raise
~~~

Thus `OverQuotaClient`, or any other specific exception, is effectively ignored.

I ran into this when I wanted to exclude the new exception `IpAddressGenerationFailureClient`: it doesn't make much sense to restart the Kuryr controller when no more IP addresses are available in the pool, since Kuryr doesn't allow updating the Neutron subnet. The sysadmins will be aware of the issue anyway, as their developers are unable to create any new pods in the namespace. The Kuryr controller itself still works; there is no problem with it as such. Restarting it only delays other handlers from being processed and puts more pressure on OpenStack. The `/alive` endpoint will block on `VIFHandler`, and it may take time before Kubernetes actually triggers the restart of the container. This extends the time during which the Kuryr controller is unusable and creates more headaches for system admins working with Kuryr.
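The masking effect can be shown with a minimal, self-contained sketch. The stub exception classes below stand in for the real `neutronclient` and Kuryr exceptions; the function bodies are illustrative, not the actual Kuryr code:

```python
# Sketch of the problem: the pool driver converts every failure into
# ResourceNotReady, and the real Neutron failure happens later on a
# detached eventlet greenthread, so the retry handler's specific
# except-clauses can never match.

class ResourceNotReady(Exception):
    """Stands in for kuryr_kubernetes.exceptions.ResourceNotReady."""

class IpAddressGenerationFailureClient(Exception):
    """Stands in for the neutronclient exception."""

def get_port_from_pool():
    # The real driver re-raises ResourceNotReady here; the Neutron
    # error is only ever raised inside eventlet.spawn(...), on a
    # greenthread nobody joins, so the caller never sees it.
    raise ResourceNotReady("pod")

def retry_handler():
    """Mimics the except-chain in kuryr_kubernetes/handlers/retry.py."""
    try:
        get_port_from_pool()
    except IpAddressGenerationFailureClient:
        return "retry quietly"           # the branch we would like to hit
    except ResourceNotReady:
        return "mark handler unhealthy"  # the branch we always hit

print(retry_handler())  # always "mark handler unhealthy"
```

Because the wrapper exception is raised unconditionally, no amount of tuning the retry handler's exception list helps until the original Neutron exception is allowed to propagate.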
## Reproducer

~~~
# Check the original range
$ openstack subnet show ns/momo-subnet -f value -c allocation_pools
192.168.3.2-192.168.3.62

# Update the subnet allocation pool
$ neutron subnet-update --allocation-pool start=192.168.3.2,end=192.168.3.4 ns/momo-subnet

# Check the updated range
$ openstack subnet show ns/momo-subnet -f value -c allocation_pools
192.168.3.2-192.168.3.4
~~~

Scale the pods beyond the range:

~~~
oc scale deployment echo --replicas=25
~~~

Check the pods:

~~~
$ oc get pods
NAME                    READY   STATUS              RESTARTS   AGE
echo-6b477b8fd7-26vjw   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-49hnf   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-5jbvk   1/1     Running             0          12m
echo-6b477b8fd7-6svm9   1/1     Running             0          12m
echo-6b477b8fd7-7wln8   1/1     Running             0          15d
echo-6b477b8fd7-9wvp2   1/1     Running             0          12m
echo-6b477b8fd7-bpd7f   1/1     Running             0          13m
echo-6b477b8fd7-cms48   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-d64zm   1/1     Running             0          13m
echo-6b477b8fd7-dtfsq   1/1     Running             0          12m
echo-6b477b8fd7-f7zqg   1/1     Running             0          13m
echo-6b477b8fd7-g6tw4   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-jcqcf   1/1     Running             0          12m
echo-6b477b8fd7-jpbbh   1/1     Running             0          12m
echo-6b477b8fd7-k2kf8   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-l775r   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-lbjrr   1/1     Running             0          12m
echo-6b477b8fd7-n8tq2   1/1     Running             0          12m
echo-6b477b8fd7-nlsbr   1/1     Running             0          12m
echo-6b477b8fd7-s96m2   1/1     Running             0          12m
echo-6b477b8fd7-vrchb   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-wgjnj   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-wsvz6   0/1     ContainerCreating   0          12m
echo-6b477b8fd7-xdph2   1/1     Running             0          12m
echo-6b477b8fd7-zxztf   0/1     ContainerCreating   0          12m
~~~

We can see that Neutron fails to create more ports:

~~~
2021-09-10 03:50:54.354 4487 ERROR kuryr_kubernetes.controller.drivers.nested_vlan_vif [-] Error creating bulk ports: IpAddressGenerationFailureClient: No more IP addresses available on network 5dd1dacd-e11e-47d2-ace7-20b497384092.
2021-09-10 03:49:29.352 Traceback (most recent call last):
2021-09-10 03:49:29.352   File "/home/cloud-user/stack/kuryr-kubernetes/kuryr_kubernetes/controller/drivers/nested_vlan_vif.py", line 84, in request_vifs
2021-09-10 03:49:29.352     ports = neutron.create_port(bulk_port_rq).get('ports')
2021-09-10 03:49:29.352   File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 794, in create_port
2021-09-10 03:49:29.352     return self.post(self.ports_path, body=body)
2021-09-10 03:49:29.352   File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 359, in post
2021-09-10 03:49:29.352     headers=headers, params=params)
2021-09-10 03:49:29.352   File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 294, in do_request
2021-09-10 03:49:29.352     self._handle_fault_response(status_code, replybody, resp)
2021-09-10 03:49:29.352   File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 269, in _handle_fault_response
2021-09-10 03:49:29.352     exception_handler_v20(status_code, error_body)
2021-09-10 03:49:29.352   File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 93, in exception_handler_v20
2021-09-10 03:49:29.352     request_ids=request_ids)
2021-09-10 03:49:29.352 IpAddressGenerationFailureClient: No more IP addresses available on network 5dd1dacd-e11e-47d2-ace7-20b497384092.
~~~

And eventually the VIF handler fails:

~~~
2021-09-10 03:56:06.167 4487 DEBUG kuryr_kubernetes.handlers.retry [-] Handler VIFHandler failed (attempt 9; ResourceNotReady: Resource not ready: ...)
2021-09-10 03:56:06.168 4487 DEBUG kuryr_kubernetes.handlers.retry [-] Report handler unhealthy VIFHandler __call__ /home/cloud-user/stack/kuryr-kubernetes/kuryr_kubernetes/handlers/retry.py:89
~~~

And liveness is set to `False`:

~~~
curl -v 127.0.0.1:8082/alive
* About to connect() to 127.0.0.1 port 8082 (#0)
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8082 (#0)
> GET /alive HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 127.0.0.1:8082
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 500 INTERNAL SERVER ERROR
< Connection: close
< Content-Type: text/html; charset=utf-8
< Content-Length: 0
< Server: Werkzeug/0.14.1 Python/2.7.5
< Date: Fri, 10 Sep 2021 07:56:29 GMT
<
* Closing connection 0
~~~

## Pull Request

https://github.com/openshift/kuryr-kubernetes/pull/559

Without this patch:

```
2021-09-10 03:56:06.167 4487 DEBUG kuryr_kubernetes.handlers.retry [-] Handler VIFHandler failed (attempt 9; ResourceNotReady: Resource not ready:
```

With this patch:

```
2021-09-10 16:34:17.581 6985 DEBUG kuryr_kubernetes.handlers.retry [-] Handler VIFHandler failed (attempt 8; IpAddressGenerationFailureClient: No more IP addresses available on network 5dd1dacd-e11e-47d2-ace7-20b497384092.
```

It seems the fix can still be defeated, though, by this code path:

```
def _get_port_from_pool(self, pool_key, pod, subnets, security_groups):
    try:
        pool_ports = self._available_ports_pools[pool_key]
    except (KeyError, AttributeError):
        raise exceptions.ResourceNotReady(pod)
```

:-/ We could either remove the `ResourceNotReady` exception there or capture the exceptions from Neutron directly; the VIF handler would then fail on those. I will write it all down and we can probably discuss.

How to verify:
1. Create a project and a deployment of 3 pods.
2. Decrease the subnet range that corresponds to the namespace to 3.
3. Scale the deployment to 10.
4. Make sure the controller is not restarted and you see the error 'No more IP addresses available on network...'.
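The direction of the fix can be sketched in a self-contained way: let the original Neutron exception reach the retry handler and treat it as retryable, without flipping the liveness flag. The classes below are illustrative stubs, not the actual PR diff:

```python
# Sketch: a retry handler that retries on the "subnet is full" error
# instead of reporting the controller unhealthy. Stub classes only;
# IpAddressGenerationFailureClient here mimics the neutronclient one.

class IpAddressGenerationFailureClient(Exception):
    pass

class Retry:
    def __init__(self, handler, attempts=3):
        self._handler = handler
        self._attempts = attempts
        self.alive = True  # what /alive would report

    def __call__(self, event):
        for attempt in range(self._attempts):
            try:
                return self._handler(event)
            except IpAddressGenerationFailureClient:
                # Retryable: the subnet is exhausted, but the
                # controller itself is healthy -- keep liveness True.
                continue
            except Exception:
                # Anything unexpected still reports unhealthy.
                self.alive = False
                raise

def exhausted_subnet(event):
    raise IpAddressGenerationFailureClient("No more IP addresses")

retry = Retry(exhausted_subnet)
retry("pod-event")
print(retry.alive)  # True: controller stays alive while the subnet is full
```

The point of the design is that Kubernetes then never restarts the controller for a condition only the cloud admin can fix, while genuinely unexpected failures still trip the liveness probe.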
(In reply to Itzik Brown from comment #12)
> How to verify:
> 1. Create a project and a deployment of 3 pods
> 2. Decrease the subnet range that corresponds to the namespace to 3
> 3. Scale the deployment to 10
> 4. Make sure the controller is not restarted and you see the error 'No more
> IP addresses available on network...'

I don't think you can change a Neutron subnet's CIDR. Just deploy 3.11 with the default prefix length for the Kuryr subnet pool set to e.g. 28 [1], so you can only create 14 pods per namespace, and try creating 15. Note that for that to be used you have to set `openshift_kuryr_subnet_driver=namespace`.

[1] https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/templates/heat_stack.yaml.j2#L294-L295

Verified with version: v3.11.550

~~~
$ openstack subnet pool list
+--------------------------------------+---------------------------------------------------------+--------------+
| ID                                   | Name                                                    | Prefixes     |
+--------------------------------------+---------------------------------------------------------+--------------+
| c3b3f82b-3561-4ba1-951e-1b10d4d3b146 | openshift-ansible-openshift.example.com-pod-subnet-pool | 10.11.0.0/16 |
+--------------------------------------+---------------------------------------------------------+--------------+

$ openstack subnet pool set --default-prefix-length 28 c3b3f82b-3561-4ba1-951e-1b10d4d3b146
$ oc new-project demo

$ cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
  labels:
    app: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: kuryr/demo
        ports:
        - containerPort: 8080

$ oc create -f deployment.yaml
$ oc scale --replicas=15 deployment/demo

$ oc get deployment
NAME   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
demo   15        15        15           10          6m

$ oc get pods -n kuryr
NAME                                READY   STATUS    RESTARTS   AGE
kuryr-cni-ds-2fl86                  2/2     Running   0          14h
kuryr-cni-ds-84ck7                  2/2     Running   0          14h
kuryr-cni-ds-ftf2v                  2/2     Running   0          14h
kuryr-cni-ds-g8qqv                  2/2     Running   0          14h
kuryr-cni-ds-gntts                  2/2     Running   1          14h
kuryr-cni-ds-kj9ld                  2/2     Running   0          14h
kuryr-cni-ds-qxfnp                  2/2     Running   0          14h
kuryr-cni-ds-zpg7v                  2/2     Running   1          14h
kuryr-controller-567485999f-lp4cm   1/1     Running   0          14h
~~~

In the Kuryr logs:

~~~
IpAddressGenerationFailureClient: No more IP addresses available on network a267ad65-2f1d-4967-a08c-744dfe9fcb83. Neutron server returns request_ids: ['req-84c2863b-446f-4abc-9b70-754c3b490988']
~~~

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 3.11.569 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4827
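The "/28 gives 14 pods per namespace" figure in the verification above can be checked with Python's standard `ipaddress` module: a /28 holds 16 addresses, and removing the network and broadcast addresses leaves 14 usable hosts (the `10.11.0.0/28` subnet here is just an example carved from the pod subnet pool):

```python
import ipaddress

# A /28 carved out of the 10.11.0.0/16 pod subnet pool holds 16
# addresses; .hosts() excludes the network and broadcast addresses,
# leaving the 14 usable ones behind the "14 pods per namespace" limit.
subnet = ipaddress.ip_network("10.11.0.0/28")
usable = len(list(subnet.hosts()))
print(usable)  # 14
```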