Bug 2002909
| Summary: | [Kuryr][3.11] don't block kuryr if one subnet runs out of IPs | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Robin Cernin <rcernin> |
| Component: | Networking | Assignee: | Robin Cernin <rcernin> |
| Networking sub component: | kuryr | QA Contact: | Jon Uriarte <juriarte> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | itbrown, mdemaced, mdulko |
| Version: | 3.11.0 | Keywords: | Triaged |
| Target Milestone: | --- | ||
| Target Release: | 3.11.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-12-02 22:01:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Robin Cernin
2021-09-10 03:10:11 UTC
## Problem description
`eventlet.spawn` swallows the exception from Neutron, so the retry logic in `kuryr_kubernetes/handlers/retry.py` always fails with a `ResourceNotReady` exception:
~~~
try:
return self._get_port_from_pool(pool_key, pod, subnets,
tuple(sorted(security_groups)))
except exceptions.ResourceNotReady:
LOG.warning("Ports pool does not have available ports!")
eventlet.spawn(self._populate_pool, pool_key, pod, subnets,
tuple(sorted(security_groups)))
raise
~~~
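The effect can be illustrated with a minimal, self-contained sketch (all names hypothetical; a plain thread stands in for the greenthread): the failure inside the spawned function dies with it, and the caller only ever sees the generic `ResourceNotReady`.

```python
# Sketch only: threading.Thread stands in for eventlet.spawn. An exception
# raised inside the spawned function never propagates to the caller.
import threading

# Silence the default thread excepthook so the swallowed error is truly
# invisible, mirroring how the spawned greenthread hides it from the caller.
threading.excepthook = lambda args: None


class ResourceNotReady(Exception):
    pass


def populate_pool():
    # Stand-in for self._populate_pool; imagine Neutron failing here.
    raise RuntimeError("No more IP addresses available on network")


def request_port():
    t = threading.Thread(target=populate_pool)
    t.start()
    t.join()
    # The RuntimeError above is already gone; only this generic
    # exception propagates, exactly as in the pool code quoted above.
    raise ResourceNotReady("Ports pool does not have available ports!")
```

Whatever Neutron actually reported, the caller can only ever observe `ResourceNotReady`.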
This means that the following retry loop will always see `ResourceNotReady`, regardless of what Neutron actually returned:
~~~
try:
self._handler(event)
break
except n_exc.OverQuotaClient:
with excutils.save_and_reraise_exception() as ex:
if self._sleep(deadline, attempt, ex.value):
ex.reraise = False
except self._exceptions:
with excutils.save_and_reraise_exception() as ex:
if self._sleep(deadline, attempt, ex.value):
ex.reraise = False
else:
LOG.debug('Report handler unhealthy %s', self._handler)
self._handler.set_liveness(alive=False)
except Exception:
LOG.exception('Report handler unhealthy %s', self._handler)
self._handler.set_liveness(alive=False)
raise
~~~
As a result, `OverQuotaClient` (and every other Neutron exception) is effectively ignored.
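The masking can be reproduced with a self-contained sketch of the retry loop above (hypothetical exception classes stand in for the Neutron client ones): because the handler converts every failure into `ResourceNotReady`, the `OverQuotaClient` branch is dead code.

```python
# Sketch of the retry loop: the OverQuotaClient branch can never fire
# while the handler wraps every failure in ResourceNotReady.
import time


class ResourceNotReady(Exception):
    pass


class OverQuotaClient(Exception):
    pass


def retry(handler, retried=(ResourceNotReady,), attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            return handler()
        except OverQuotaClient:
            # Never reached: the real quota error was already swallowed
            # and replaced by ResourceNotReady further down the stack.
            time.sleep(0)
        except retried:
            if attempt == attempts:
                raise  # the real code reports the handler unhealthy here


def handler():
    try:
        raise OverQuotaClient("quota exceeded in Neutron")
    except OverQuotaClient:
        # The pool code discards the real cause and raises its own error.
        raise ResourceNotReady("Resource not ready")
```

Running `retry(handler)` exhausts all attempts and raises `ResourceNotReady`, never surfacing the quota error.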
I ran into this when I wanted to exclude a new exception, `IpAddressGenerationFailureClient`, because restarting the Kuryr controller makes little sense when no more IP addresses are available in the pool: Kuryr does not allow updating the Neutron subnet anyway.
Sysadmins must be made aware of the issue, because their developers are unable to create any new pods in the namespace. The Kuryr controller itself still works; there is no problem with it as such. We only delay the processing of some handlers and put more pressure on OpenStack.
The `/alive` probe will block on `VIFHandler`, and it may take some time before Kubernetes actually triggers the restart of the container.
This extends the period during which the Kuryr controller is unusable and creates more headaches for the system admins operating Kuryr.
## Reproducer
~~~
# Check original range
openstack subnet show ns/momo-subnet -f value -c allocation_pools
192.168.3.2-192.168.3.62
# Update the subnet allocation pool
neutron subnet-update --allocation-pool start=192.168.3.2,end=192.168.3.4 ns/momo-subnet
# Check the updated range
openstack subnet show ns/momo-subnet -f value -c allocation_pools
192.168.3.2-192.168.3.4
~~~
Scale the pods beyond the range
~~~
oc scale deployment echo --replicas=25
~~~
Check the pods
~~~
$ oc get pods
NAME                    READY     STATUS              RESTARTS   AGE
echo-6b477b8fd7-26vjw   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-49hnf   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-5jbvk   1/1       Running             0          12m
echo-6b477b8fd7-6svm9   1/1       Running             0          12m
echo-6b477b8fd7-7wln8   1/1       Running             0          15d
echo-6b477b8fd7-9wvp2   1/1       Running             0          12m
echo-6b477b8fd7-bpd7f   1/1       Running             0          13m
echo-6b477b8fd7-cms48   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-d64zm   1/1       Running             0          13m
echo-6b477b8fd7-dtfsq   1/1       Running             0          12m
echo-6b477b8fd7-f7zqg   1/1       Running             0          13m
echo-6b477b8fd7-g6tw4   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-jcqcf   1/1       Running             0          12m
echo-6b477b8fd7-jpbbh   1/1       Running             0          12m
echo-6b477b8fd7-k2kf8   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-l775r   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-lbjrr   1/1       Running             0          12m
echo-6b477b8fd7-n8tq2   1/1       Running             0          12m
echo-6b477b8fd7-nlsbr   1/1       Running             0          12m
echo-6b477b8fd7-s96m2   1/1       Running             0          12m
echo-6b477b8fd7-vrchb   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-wgjnj   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-wsvz6   0/1       ContainerCreating   0          12m
echo-6b477b8fd7-xdph2   1/1       Running             0          12m
echo-6b477b8fd7-zxztf   0/1       ContainerCreating   0          12m
~~~
We can see that Neutron fails to create more ports:
~~~
2021-09-10 03:50:54.354 4487 ERROR kuryr_kubernetes.controller.drivers.nested_vlan_vif [-] Error creating bulk ports:
IpAddressGenerationFailureClient:
No more IP addresses available on network 5dd1dacd-e11e-47d2-ace7-20b497384092.
2021-09-10 03:49:29.352 Traceback (most recent call last):
2021-09-10 03:49:29.352 File "/home/cloud-user/stack/kuryr-kubernetes/kuryr_kubernetes/controller/drivers/nested_vlan_vif.py", line 84, in request_vifs
2021-09-10 03:49:29.352 ports = neutron.create_port(bulk_port_rq).get('ports')
2021-09-10 03:49:29.352 File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 794, in create_port
2021-09-10 03:49:29.352 return self.post(self.ports_path, body=body)
2021-09-10 03:49:29.352 File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 359, in post
2021-09-10 03:49:29.352 headers=headers, params=params)
2021-09-10 03:49:29.352 File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 294, in do_request
2021-09-10 03:49:29.352 self._handle_fault_response(status_code, replybody, resp)
2021-09-10 03:49:29.352 File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 269, in _handle_fault_response
2021-09-10 03:49:29.352 exception_handler_v20(status_code, error_body)
2021-09-10 03:49:29.352 File "/home/cloud-user/stack/kuryr-kubernetes/.venv/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 93, in exception_handler_v20
2021-09-10 03:49:29.352 request_ids=request_ids)
2021-09-10 03:49:29.352 IpAddressGenerationFailureClient: No more IP addresses available on network 5dd1dacd-e11e-47d2-ace7-20b497384092.
~~~
And we can see that the VIF handler eventually fails:
~~~
2021-09-10 03:56:06.167 4487 DEBUG kuryr_kubernetes.handlers.retry [-] Handler VIFHandler failed (attempt 9; ResourceNotReady: Resource not ready: ...)
2021-09-10 03:56:06.168 4487 DEBUG kuryr_kubernetes.handlers.retry [-] Report handler unhealthy VIFHandler __call__ /home/cloud-user/stack/kuryr-kubernetes/kuryr_kubernetes/handlers/retry.py:89
~~~
And liveness is set to `False`
~~~
curl -v 127.0.0.1:8082/alive
* About to connect() to 127.0.0.1 port 8082 (#0)
* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8082 (#0)
> GET /alive HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 127.0.0.1:8082
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 500 INTERNAL SERVER ERROR
< Connection: close
< Content-Type: text/html; charset=utf-8
< Content-Length: 0
< Server: Werkzeug/0.14.1 Python/2.7.5
< Date: Fri, 10 Sep 2021 07:56:29 GMT
<
* Closing connection 0
~~~
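The probe behaviour above can be sketched as a tiny WSGI app (hypothetical, not the actual Kuryr health server): once any handler calls `set_liveness(alive=False)`, `/alive` answers 500 with an empty body until the process is restarted.

```python
# Minimal WSGI sketch of the liveness probe: healthy -> 200, unhealthy
# or unknown path -> 500, always with an empty body as in the curl
# output above.
class HealthServer:
    def __init__(self):
        self._alive = True

    def set_liveness(self, alive):
        # Called by a handler that has given up retrying.
        self._alive = alive

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO") == "/alive" and self._alive:
            start_response("200 OK", [("Content-Length", "0")])
        else:
            start_response("500 INTERNAL SERVER ERROR",
                           [("Content-Length", "0")])
        return [b""]
```

Kubernetes only restarts the container after its liveness probe has failed repeatedly, which is why the unusable window can be long.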
## Pull Request
https://github.com/openshift/kuryr-kubernetes/pull/559
Without this patch:
~~~
2021-09-10 03:56:06.167 4487 DEBUG kuryr_kubernetes.handlers.retry [-] Handler VIFHandler failed (attempt 9; ResourceNotReady: Resource not ready:
~~~
With this patch:
~~~
2021-09-10 16:34:17.581 6985 DEBUG kuryr_kubernetes.handlers.retry [-] Handler VIFHandler failed (attempt 8; IpAddressGenerationFailureClient: No more IP addresses available on network 5dd1dacd-e11e-47d2-ace7-20b497384092.
~~~
It seems `ResourceNotReady` can still be raised from here, regardless:
```
def _get_port_from_pool(self, pool_key, pod, subnets, security_groups):
try:
pool_ports = self._available_ports_pools[pool_key]
except (KeyError, AttributeError):
raise exceptions.ResourceNotReady(pod)
```
:-/
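A hedged sketch of one possible direction (all names hypothetical; this is not the actual PR diff): record the original Neutron failure and surface it instead of the generic `ResourceNotReady`, so the retry handler sees the real cause and can avoid restarting the controller on an out-of-IPs condition.

```python
# Sketch only: surface a recorded Neutron out-of-IPs error instead of
# always converting it into the generic ResourceNotReady.
class ResourceNotReady(Exception):
    pass


class IpAddressGenerationFailureClient(Exception):
    # Stand-in for neutronclient's exception of the same name.
    pass


def get_port_from_pool(pool, key):
    # Mirrors the snippet above: an empty/unknown pool means "not ready".
    try:
        return pool[key].pop()
    except (KeyError, AttributeError, IndexError):
        raise ResourceNotReady(key)


def request_port(pool, key, last_neutron_error=None):
    try:
        return get_port_from_pool(pool, key)
    except ResourceNotReady:
        # If pool population already failed because the subnet ran out
        # of IPs, re-raise that instead of the generic error so the
        # retry handler can report the real cause without a restart.
        if isinstance(last_neutron_error, IpAddressGenerationFailureClient):
            raise last_neutron_error
        raise
```

With this shape, the handler log shows `IpAddressGenerationFailureClient` rather than an opaque `ResourceNotReady`, matching the "with this patch" output above.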
We could either remove the `ResourceNotReady` exception or capture the exceptions from Neutron directly; the VIF handler would then fail on those. I will write it all down and we can probably discuss.

How to verify:
1. Create a project and a deployment of 3 pods
2. Decrease the subnet range that corresponds to the namespace to 3
3. Scale the deployment to 10
4. Make sure the controller is not restarted and you see the error 'No more IP addresses available on network...'

(In reply to Itzik Brown from comment #12)
> How to verify:
> 1. Create a project and a deployment of 3 pods
> 2. Decrease the subnet range that corresponds to the namespace to 3
> 3. Scale the deployment to 10
> 4. Make sure the controller is not restarted and you see the error 'No more
> IP addresses available on network...'

I don't think you can change a Neutron subnet CIDR. Just deploy 3.11 with the default prefix length for the Kuryr subnet pool set to e.g. 28 [1], so you can only create 14 pods per namespace, and try creating 15. Note that for this to be used you have to set `openshift_kuryr_subnet_driver=namespace`.

[1] https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/templates/heat_stack.yaml.j2#L294-L295

Verified with version: v3.11.550
~~~
$ openstack subnet pool list
+--------------------------------------+---------------------------------------------------------+--------------+
| ID                                   | Name                                                    | Prefixes     |
+--------------------------------------+---------------------------------------------------------+--------------+
| c3b3f82b-3561-4ba1-951e-1b10d4d3b146 | openshift-ansible-openshift.example.com-pod-subnet-pool | 10.11.0.0/16 |
+--------------------------------------+---------------------------------------------------------+--------------+
$ openstack subnet pool set --default-prefix-length 28 c3b3f82b-3561-4ba1-951e-1b10d4d3b146
$ oc new-project demo
$ cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
  labels:
    app: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: kuryr/demo
        ports:
        - containerPort: 8080
$ oc create -f deployment.yaml
$ oc scale --replicas=15 deployment/demo
$ oc get deployment
NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
demo      15        15        15           10          6m
$ oc get pods -n kuryr
NAME                                READY     STATUS    RESTARTS   AGE
kuryr-cni-ds-2fl86                  2/2       Running   0          14h
kuryr-cni-ds-84ck7                  2/2       Running   0          14h
kuryr-cni-ds-ftf2v                  2/2       Running   0          14h
kuryr-cni-ds-g8qqv                  2/2       Running   0          14h
kuryr-cni-ds-gntts                  2/2       Running   1          14h
kuryr-cni-ds-kj9ld                  2/2       Running   0          14h
kuryr-cni-ds-qxfnp                  2/2       Running   0          14h
kuryr-cni-ds-zpg7v                  2/2       Running   1          14h
kuryr-controller-567485999f-lp4cm   1/1       Running   0          14h
~~~
In Kuryr logs:
~~~
IpAddressGenerationFailureClient: No more IP addresses available on network a267ad65-2f1d-4967-a08c-744dfe9fcb83.
Neutron server returns request_ids: ['req-84c2863b-446f-4abc-9b70-754c3b490988']
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 3.11.569 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4827 |