Bug 1924075 - kuryr-controller restart when enablePortPoolsPrepopulation = true
Summary: kuryr-controller restart when enablePortPoolsPrepopulation = true
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Maysa Macedo
QA Contact: rlobillo
URL:
Whiteboard:
Depends On:
Blocks: 1929066
 
Reported: 2021-02-02 14:36 UTC by rlobillo
Modified: 2021-07-27 22:39 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:37:56 UTC
Target Upstream Version:
Embargoed:


Attachments
kuryr-controller logs (77.50 KB, text/plain)
2021-02-02 14:36 UTC, rlobillo
error on kuryr_controller when sos_report was taken (73.82 KB, text/plain)
2021-02-03 17:29 UTC, rlobillo
kuryr tempest results with the fix (125.07 KB, application/xhtml+xml)
2021-04-19 14:30 UTC, rlobillo


Links
GitHub openshift/kuryr-kubernetes pull 452 (open): Bug 1924075: Fix port pools (last updated 2021-02-15 11:34:27 UTC)
GitHub openshift/kuryr-kubernetes pull 470 (open): Bug 1924075: Skip pool pre population if no Status is present on CRD (last updated 2021-03-03 23:50:53 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:39:28 UTC)

Description rlobillo 2021-02-02 14:36:51 UTC
Created attachment 1754413 [details]
kuryr-controller logs

Description of problem:

Thanks to https://github.com/openshift/openshift-docs/pull/27508/files?short_path=767f21b#diff-767f21b9e191fd96fbaef306b9aedf9c70aa327890b0b5537b04ef50bf799824 it is possible to update the kuryr configuration on a running OCP cluster.

When enablePortPoolsPrepopulation is set to true, kuryr-controller restarts in a loop until it ends up in CrashLoopBackOff status.

2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry [-] Report handler unhealthy KuryrNetworkPopulationHandler: openstack.exceptions.HttpException: No Subnet found for 01a8b91e-fb5e-4c97-9f86-708ffd30dd63: Client Error for url: https://10.46.22.204:13696/v2.0/subnets/01a8b91e-fb5e-4c97-9f86-708ffd30dd63,
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry Traceback (most recent call last):
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 81, in __call__
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     self._handler(event, *args, **kwargs)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 89, in __call__
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     self.on_added(obj)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrnetwork_population.py", line 56, in on_added
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     subnets = self._drv_subnets.get_namespace_subnet(namespace, subnet_id)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 53, in get_namespace_subnet
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return {subnet_id: utils.get_subnet(subnet_id)}
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 1270, in decorate
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     should_cache_fn)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 864, in get_or_create
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     async_creator) as value:
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/lock.py", line 186, in __enter__
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return self._enter()
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/lock.py", line 93, in _enter
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     generated = self._enter_create(value, createdtime)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/lock.py", line 179, in _enter_create
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return self.creator()
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 831, in gen_value
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     created_value = creator()
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/dogpile/cache/region.py", line 1266, in creator
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return fn(*arg, **kw)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/utils.py", line 213, in get_subnet
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     n_subnet = os_net.get_subnet(subnet_id)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 3538, in get_subnet
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return self._get(_subnet.Subnet, subnet)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 46, in check
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     return method(self, expected, actual, *args, **kwargs)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 508, in _get
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     resource_type=resource_type.__name__, value=value))
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1416, in fetch
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     self._translate_response(response, **kwargs)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1113, in _translate_response
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     exceptions.raise_from_response(response, error_message=error_message)
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python3.6/site-packages/openstack/exceptions.py", line 235, in raise_from_response
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry     http_status=http_status, request_id=request_id
2021-02-02 14:19:11.515 1 ERROR kuryr_kubernetes.handlers.retry openstack.exceptions.HttpException: No Subnet found for 01a8b91e-fb5e-4c97-9f86-708ffd30dd63: Client Error for url: https://10.46.22.204:13696/v2.0/subnets/01a8b91e-fb5e-4c97-9f86-708ffd30dd63, 408 Request Time-out: Your browser didn't send a complete r
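
For triage, one way to check whether the subnet from the error still exists on the underlying OSP cloud (a sketch, assuming access to the overcloud credentials and the openstack CLI):

$ openstack subnet show 01a8b91e-fb5e-4c97-9f86-708ffd30dd63

Note that the 408 body in the error ("Your browser didn't send a complete request") appears to be an HAProxy timeout page, so this may be a transient API timeout rather than a genuinely missing subnet.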


Version-Release number of selected component (if applicable): OCP4.6 (4.6.0-0.nightly-2021-01-30-211400) over OSP13 (2021-01-20.1) with Amphora provider.


How reproducible: Always.

Steps to Reproduce:

1. Install OSP13 + OCP4.6 with amphora provider.
2. Run 

$ oc edit networks.operator.openshift.io cluster

and define the kuryrConfig section below:

spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    kuryrConfig:
      enablePortPoolsPrepopulation: true
      poolBatchPorts: 3
      poolMaxPorts: 7
      poolMinPorts: 1
    type: Kuryr

3. Wait until the openshift-kuryr pods are restarted with the new configuration and monitor kuryr-controller (see the watch command below).
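
A minimal way to watch the rollout and any controller restarts (a sketch; openshift-kuryr is the namespace of the Kuryr pods mentioned above):

$ oc -n openshift-kuryr get pods -w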

Actual results: kuryr-controller goes to CrashLoopBackOff status.


Expected results: kuryr-controller is stable.


Additional info: kuryr-controller logs attached.

Comment 1 rlobillo 2021-02-03 17:11:57 UTC
sos report: http://rhos-release.virt.bos.redhat.com/log/bz1924075

Comment 2 rlobillo 2021-02-03 17:29:15 UTC
Created attachment 1754821 [details]
error on kuryr_controller when sos_report was taken

Comment 3 Maysa Macedo 2021-02-05 14:45:23 UTC
Assigning it to myself as I have started taking a look at this one.

Comment 5 rlobillo 2021-02-22 15:20:09 UTC
Failed on OCP4.8.0-0.nightly-2021-02-21-102854 over OSP13 (2021-01-20.1) with Amphora provider.

Once the kuryrConfig is changed, the kuryr pods remain stable. However, while running tempest tests, kuryr-controller is restarted from time to time with the exception below:


2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging [-] Failed to handle event {'type': 'ADDED', 'object': {'apiVersion': 'openstack.org/v1', 'kind': 'KuryrNetwork', 'metadata': {'creationTimestamp': '2021-02-22T14:40:45Z', 'finalizers': ['kuryrnetwork.finalizers.kuryr.openstack.org'], 'generation': 1, 'managedFields': [{'apiVersion': 'openstack.org/v1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {'.': {}, 'v:"kuryrnetwork.finalizers.kuryr.openstack.org"': {}}}, 'f:spec': {'.': {}, 'f:nsLabels': {}, 'f:nsName': {}, 'f:projectId': {}}}, 'manager': 'python-requests', 'operation': 'Update', 'time': '2021-02-22T14:40:45Z'}], 'name': 'kuryr-namespace-1148497519', 'namespace': 'kuryr-namespace-1148497519', 'resourceVersion': '491628', 'uid': 'd381b3e2-0382-43e3-93c3-f7b4a9f381dd'}, 'spec': {'nsLabels': {}, 'nsName': 'kuryr-namespace-1148497519', 'projectId': '15145598c81a4fe3a49d9b1676bb2b0f'}}}: KeyError: 'status'
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging Traceback (most recent call last):
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 37, in __call__
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging     self._handler(event, *args, **kwargs)
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 80, in __call__
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging     self._handler(event, *args, **kwargs)
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 89, in __call__
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging     self.on_added(obj)
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrnetwork_population.py", line 45, in on_added
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging     subnet_id = kuryrnet_crd['status'].get('subnetId')
2021-02-22 14:40:45.740 1 ERROR kuryr_kubernetes.handlers.logging KeyError: 'status'

This is presumably what causes the observed failures in the kuryr-tempest tests.

The config is:

$ oc get networks.operator.openshift.io cluster -o yaml | yq .spec
{
  "clusterNetwork": [
    {
      "cidr": "10.128.0.0/14",
      "hostPrefix": 23
    }
  ],
  "defaultNetwork": {
    "kuryrConfig": {
      "enablePortPoolsPrepopulation": true,
      "poolBatchPorts": 3,
      "poolMaxPorts": 7,
      "poolMinPorts": 1
    },
    "type": "Kuryr"
  },
  "disableNetworkDiagnostics": false,
  "logLevel": "Normal",
  "managementState": "Managed",
  "observedConfig": null,
  "operatorLogLevel": "Normal",
  "serviceNetwork": [
    "172.30.0.0/16"
  ],
  "unsupportedConfigOverrides": null
}

To reproduce it, for instance, run the test below:

$ python -m testtools.run kuryr_tempest_plugin.tests.scenario.test_network_policy.NetworkPolicyScenario.test_ipblock_network_policy_allow_except

must_gather: http://file.rdu.redhat.com/rlobillo/must_gather_1924075.tgz
sos_report: http://rhos-release.virt.bos.redhat.com/log/bz1924075_2

Comment 6 Maysa Macedo 2021-02-22 15:40:51 UTC
The issue reported in comment #5 is unrelated to the issue originally reported in this Bugzilla.

The new issue is a race condition: the KuryrNetwork CRD's spec and status are filled in two different operations, so the status field may not yet be present on the CRD when the KuryrPrePopulation handler runs, even though the handler expects the CRD to already contain it.
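
A minimal sketch of the kind of guard the fix implies, based on the title of PR 470 ("Skip pool pre population if no Status is present on CRD") and the traceback in comment #5; the handler and field names follow that traceback, the rest is illustrative rather than the actual patch:

    def on_added(self, kuryrnet_crd):
        # The CRD's spec and status are written in two separate API
        # operations, so a freshly ADDED KuryrNetwork event may not
        # carry a 'status' field yet.
        status = kuryrnet_crd.get('status')
        if not status or not status.get('subnetId'):
            # Nothing to pre-populate yet; skip instead of raising
            # KeyError. The handler runs again once status is set.
            return
        subnet_id = status['subnetId']
        namespace = kuryrnet_crd['spec']['nsName']
        subnets = self._drv_subnets.get_namespace_subnet(namespace, subnet_id)
        # ... continue with port pool pre-population ...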

Comment 9 rlobillo 2021-04-19 14:28:49 UTC
Verified on OCP4.8.0-0.nightly-2021-04-17-044339 over OSP16.1 (RHOS-16.1-RHEL-8-20210323.n.0) with OVN-Octavia.

Configuration change is applied and kuryr-controller remains stable.

The kuryr-tempest tests passed, except for test_port_pool, which still requires a change to the test itself.

Comment 10 rlobillo 2021-04-19 14:30:10 UTC
Created attachment 1773297 [details]
kuryr tempest results with the fix

Comment 13 errata-xmlrpc 2021-07-27 22:37:56 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

