Created attachment 1754413 [details]
kuryr-controller logs

Description of problem:

Thanks to https://github.com/openshift/openshift-docs/pull/27508/files?short_path=767f21b#diff-767f21b9e191fd96fbaef306b9aedf9c70aa327890b0b5537b04ef50bf799824 it is possible to update the Kuryr configuration on a running OCP cluster.

When enablePortPoolsPrepopulation is set to true, kuryr-controller is restarted in a loop until it ends up in CrashLoopBackOff status:

2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry [-] Report handler unhealthy KuryrNetworkPopulationHandler: openstack.exceptions.HttpException: No Subnet found for 01a8b91e-fb5e-4c97-9f86-708ffd30dd63: Client Error for url: https://10.46.22.204:13696/v2.0/subnets/01a8b91e-fb5e-4c97-9f86-708ffd30dd63,
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry Traceback (most recent call last):
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 81, in __call__
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     self._handler(event, *args, **kwargs)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 89, in __call__
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     self.on_added(obj)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrnetwork_population.py", line 56, in on_added
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     subnets = self._drv_subnets.get_namespace_subnet(namespace, subnet_id)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 53, in get_namespace_subnet
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return {subnet_id: utils.get_subnet(subnet_id)}
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 1270, in decorate
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     should_cache_fn)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 864, in get_or_create
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     async_creator) as value:
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/lock.py", line 186, in __enter__
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return self._enter()
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/lock.py", line 93, in _enter
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     generated = self._enter_create(value, createdtime)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/lock.py", line 179, in _enter_create
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return self.creator()
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 831, in gen_value
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     created_value = creator()
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 1266, in creator
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return fn(*arg, **kw)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/utils.py", line 213, in get_subnet
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     n_subnet = os_net.get_subnet(subnet_id)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 3538, in get_subnet
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return self._get(_subnet.Subnet, subnet)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 46, in check
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return method(self, expected, actual, *args, **kwargs)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 508, in _get
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     resource_type=resource_type.__name__, value=value))
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1416, in fetch
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     self._translate_response(response, **kwargs)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1113, in _translate_response
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     exceptions.raise_from_response(response, error_message=error_message)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/exceptions.py", line 235, in raise_from_response
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     http_status=http_status, request_id=request_id
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry openstack.exceptions.HttpException: No Subnet found for 01a8b91e-fb5e-4c97-9f86-708ffd30dd63: Client Error for url: https://10.46.22.204:13696/v2.0/subnets/01a8b91e-fb5e-4c97-9f86-708ffd30dd63, 408 Request Time-out: Your browser didn't send a complete r

Version-Release number of selected component (if applicable):
OCP4.6 (4.6.0-0.nightly-2021-01-30-211400) over OSP13 (2021-01-20.1) with Amphora provider.

How reproducible:
Always.

Steps to Reproduce:
1. Install OSP13 + OCP4.6 with the Amphora provider.
2. Run $ oc edit networks.operator.openshift.io cluster and define the kuryrConfig section below:

spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    kuryrConfig:
      enablePortPoolsPrepopulation: true
      poolBatchPorts: 3
      poolMaxPorts: 7
      poolMinPorts: 1
    type: Kuryr

3. Wait until the openshift-kuryr pods are restarted with the new configuration and monitor kuryr-controller.

Actual results:
kuryr-controller goes to CrashLoopBackOff status.

Expected results:
kuryr-controller is stable.

Additional info:
kuryr-controller logs attached.
sos report: http://rhos-release.virt.bos.redhat.com/log/bz1924075
Created attachment 1754821 [details]
error on kuryr_controller when sos_report was taken
Assigning it to me as I started taking a look at this one.
Failed on OCP4.8.0-0.nightly-2021-02-21-102854 over OSP13 (2021-01-20.1) with Amphora provider.

Once the kuryrConfig is changed, the kuryr pods remain stable. However, while running tempest tests, kuryr-controller is restarted from time to time with the exception below:

2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging [-] Failed to handle event {'type': 'ADDED', 'object': {'apiVersion': 'openstack.org/v1', 'kind': 'KuryrNetwork', 'metadata': {'creationTimestamp': '2021-02-22T14:40:45Z', 'finalizers': ['kuryrnetwork.finalizers.kuryr.openstack.org'], 'generation': 1, 'managedFields': [{'apiVersion': 'openstack.org/v1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {'.': {}, 'v:"kuryrnetwork.finalizers.kuryr.openstack.org"': {}}}, 'f:spec': {'.': {}, 'f:nsLabels': {}, 'f:nsName': {}, 'f:projectId': {}}}, 'manager': 'python-requests', 'operation': 'Update', 'time': '2021-02-22T14:40:45Z'}], 'name': 'kuryr-namespace-1148497519', 'namespace': 'kuryr-namespace-1148497519', 'resourceVersion': '491628', 'uid': 'd381b3e2-0382-43e3-93c3-f7b4a9f381dd'}, 'spec': {'nsLabels': {}, 'nsName': 'kuryr-namespace-1148497519', 'projectId': '15145598c81a4fe3a49d9b1676bb2b0f'}}}: KeyError: 'status'
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging Traceback (most recent call last):
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 37, in __call__
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging     self._handler(event, *args, **kwargs)
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 80, in __call__
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging     self._handler(event, *args, **kwargs)
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 89, in __call__
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging     self.on_added(obj)
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrnetwork_population.py", line 45, in on_added
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging     subnet_id = kuryrnet_crd['status'].get('subnetId')
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging KeyError: 'status'

This is presumably leading to the observed failures in the kuryr-tempest tests.

The config is:

$ oc get networks.operator.openshift.io cluster -o yaml | yq .spec
{
  "clusterNetwork": [
    {
      "cidr": "10.128.0.0/14",
      "hostPrefix": 23
    }
  ],
  "defaultNetwork": {
    "kuryrConfig": {
      "enablePortPoolsPrepopulation": true,
      "poolBatchPorts": 3,
      "poolMaxPorts": 7,
      "poolMinPorts": 1
    },
    "type": "Kuryr"
  },
  "disableNetworkDiagnostics": false,
  "logLevel": "Normal",
  "managementState": "Managed",
  "observedConfig": null,
  "operatorLogLevel": "Normal",
  "serviceNetwork": [
    "172.30.0.0/16"
  ],
  "unsupportedConfigOverrides": null
}

To reproduce it, for instance, run the test below:

$ python -m testtools.run kuryr_tempest_plugin.tests.scenario.test_network_policy.NetworkPolicyScenario.test_ipblock_network_policy_allow_except

must_gather: http://file.rdu.redhat.com/rlobillo/must_gather_1924075.tgz
sos_report: http://rhos-release.virt.bos.redhat.com/log/bz1924075_2
The issue last reported in comment #5 is unrelated to the issue first reported on this Bugzilla. The new issue is a race condition: the status field is not yet present on the KuryrNetwork CRD (the CRD spec and status are filled in by two different operations), while the KuryrNetworkPopulation handler expects the CRD to already contain it.
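In code terms, a guard along the lines of the sketch below avoids the race. This is illustrative only, based on the on_added() shape visible in the traceback from comment #5, not the actual patch that was merged:

    # Sketch only; shape of on_added() taken from the comment #5 traceback.
    def on_added(self, kuryrnet_crd):
        # The KuryrNetwork spec and status are filled in by two separate
        # API operations, so an ADDED event may arrive before 'status'
        # exists on the CRD.
        status = kuryrnet_crd.get('status') or {}
        subnet_id = status.get('subnetId')
        if not subnet_id:
            # Nothing to prepopulate yet; a later event, once the status
            # is written, can retrigger the handler.
            return
        # ... continue with the existing port pool population logic ...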
Verified on OCP4.8.0-0.nightly-2021-04-17-044339 over OSP16.1 (RHOS-16.1-RHEL-8-20210323.n.0) with the OVN-Octavia provider. The configuration change is applied and kuryr-controller remains stable. The kuryr-tempest tests passed except for test_port_pool, which still requires a corresponding change in the test itself.
Created attachment 1773297 [details]
kuryr tempest results with the fix
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438