Bug 1869294
| Summary: | [kuryr] Network policy fails to get applied or removed when there's a pending load balancer being created | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michał Dulko <mdulko> |
| Component: | Networking | Assignee: | Michał Dulko <mdulko> |
| Networking sub component: | kuryr | QA Contact: | GenadiC <gcheresh> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | rlobillo |
| Version: | 4.6 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:28:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Verified on OCP4.6.0-0.nightly-2020-09-07-224533 over OSP13 2020-09-03.2 with Amphora provider.
The load balancer and the NP are created sequentially, and the kuryrloadbalancer resource gets populated before the load balancer reaches ACTIVE provisioning_status. The NP is fully operational once the load balancer becomes ready.
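To observe this sequencing directly, the Octavia state and the KuryrNetworkPolicy podSelector can be polled together; a minimal sketch, assuming overcloudrc has been sourced, oc is logged into the cluster, and the test/demo and np names from the steps below:
$ # poll the load balancer provisioning state and the CRD podSelector side by side
$ while true; do
>   openstack loadbalancer show test/demo -c provisioning_status -f value
>   oc get knp -n test np -o jsonpath='{.status.podSelector}{"\n"}'
>   sleep 10
> done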
1. Create environment:
$ oc new-project test
$ oc run --image kuryr/demo demo-allowed-caller
$ oc run --image kuryr/demo demo-caller
$ oc run --image kuryr/demo demo
$
$ cat np_resource.yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: np
spec:
  podSelector:
    matchLabels:
      run: demo
  ingress:
  - from:
    - podSelector:
        matchLabels:
          run: demo-allowed-caller
2. Create the service and apply the NP to it:
$ oc expose pod/demo --port 80 --target-port 8080 && sleep 1 && oc apply -f np_resource.yaml
service/demo exposed
networkpolicy.networking.k8s.io/np created
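A quick sanity check right after this step, assuming the knp short name resolves to the Kuryr-side KuryrNetworkPolicy CRD the same way it does in step 3:
$ oc get svc/demo networkpolicy/np -n test    # both Kubernetes objects should exist immediately
$ oc get knp -n test np                       # the CRD Kuryr creates for the policy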
3. Load balancer is still pending creation while the kuryrloadbalancer resource already shows the correct status:
(shiftstack) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2020-09-09T09:49:38 |
| description | openshiftClusterID=ostest-jhtjg |
| flavor | |
| id | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners | |
| name | test/demo |
| operating_status | OFFLINE |
| pools | |
| project_id | abf184ea0ec84b70ab13de3bfd1ed0cc |
| provider | octavia |
| provisioning_status | PENDING_CREATE |
| updated_at | None |
| vip_address | 172.30.97.133 |
| vip_network_id | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id | None |
| vip_subnet_id | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ oc get knp -n test np -o json | jq ".status.podSelector"
{
"matchLabels": {
"run": "demo"
}
}
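Besides podSelector, the same CRD can be inspected for the security-group side of the policy; a hedged example, assuming the KuryrNetworkPolicy status also carries a securityGroupRules field (the field name is an assumption, not verified in this run):
$ oc get knp -n test np -o json | jq ".status.securityGroupRules"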
(overcloud) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2020-09-09T09:49:38 |
| description | openshiftClusterID=ostest-jhtjg |
| flavor | |
| id | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners | |
| name | test/demo |
| operating_status | OFFLINE |
| pools | |
| project_id | abf184ea0ec84b70ab13de3bfd1ed0cc |
| provider | octavia |
| provisioning_status | PENDING_CREATE |
| updated_at | None |
| vip_address | 172.30.97.133 |
| vip_network_id | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id | None |
| vip_subnet_id | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2020-09-09T09:49:38 |
| description | openshiftClusterID=ostest-jhtjg |
| flavor | |
| id | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners | e30985ec-193a-4231-bbfc-97ff2c6fb4d1 |
| name | test/demo |
| operating_status | ONLINE |
| pools | 5c4aefe2-3d1d-42ac-b159-c3c08de98956 |
| project_id | abf184ea0ec84b70ab13de3bfd1ed0cc |
| provider | octavia |
| provisioning_status | ACTIVE |
| updated_at | 2020-09-09T09:51:05 |
| vip_address | 172.30.97.133 |
| vip_network_id | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id | None |
| vip_subnet_id | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+
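Rather than re-running the show command by hand until Octavia finishes, a small polling loop can wait for the final state; a sketch, assuming the -c/-f formatter options used above:
$ # block until the Octavia load balancer finishes provisioning
$ until [ "$(openstack loadbalancer show test/demo -c provisioning_status -f value)" = "ACTIVE" ]; do
>   sleep 10
> done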
4. NP is correctly applied to the service:
(overcloud) [stack@undercloud-0 ~]$ oc get all
NAME READY STATUS RESTARTS AGE
pod/demo 1/1 Running 0 11m
pod/demo-allowed-caller 1/1 Running 0 3m41s
pod/demo-caller 1/1 Running 0 11m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/demo ClusterIP 172.30.97.133 <none> 80/TCP 2m38s
(overcloud) [stack@undercloud-0 ~]$ oc rsh pod/demo-allowed-caller curl 172.30.97.133
demo: HELLO! I AM ALIVE!!!
(overcloud) [stack@undercloud-0 ~]$ oc rsh pod/demo-caller curl 172.30.97.133
^Ccommand terminated with exit code 130
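The blocked call above had to be interrupted with Ctrl-C; the same check can be made non-interactive by giving curl a timeout, so the expected failure is a timeout exit instead of a hang (a sketch; the kuryr/demo image is assumed to ship a curl that supports --max-time):
$ oc rsh pod/demo-caller curl --max-time 5 172.30.97.133   # expected to time out: traffic is not allowed by the NP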
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196
Description of problem:
We're seeing that in the gate at the moment, test_update_network_policy is failing very often with:
Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/kuryr_tempest_plugin/tests/scenario/base_network_policy.py", line 259, in test_update_network_policy
    self.assertIsNotNone(crd_pod_selector)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/testtools/testcase.py", line 439, in assertIsNotNone
    self.assertThat(observed, matcher, message)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/testtools/testcase.py", line 502, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: None matches Is(None)
And the culprit is this circling around in the logs until it times out:
2020-08-04 20:33:36.495 1 DEBUG kuryr_kubernetes.controller.drivers.lbaasv2 [-] KuryrLoadBalancer for service default/kuryr-service-1926940643 not populated yet. update_lbaas_sg /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py:769
2020-08-04 20:33:36.495 1 DEBUG kuryr_kubernetes.handlers.retry [-] Handler KuryrNetworkPolicyHandler failed (attempt 2; ResourceNotReady: Resource not ready: 'kuryr-service-1926940643') _sleep /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py:101
Point is, the handler is technically right that the LB is not yet provisioned, yet we wait for it before applying the SG. This should not be necessary: when the members get added, they will have the correct SG rules applied.
Version-Release number of selected component (if applicable):
How reproducible:
Only on environments with very slow Octavia Amphora creation; it probably never happens on proper OSP environments, we've only seen it on OpenStack gates.
Steps to Reproduce:
1. Create a service.
2. While the load balancer for that service is still being created, try creating a NetworkPolicy (a quick check is sketched below).
Actual results:
The KuryrNetworkPolicy CRD doesn't get status.podSelector populated until that LB is created (or never, if LB creation takes more than several minutes).
Expected results:
The KuryrNetworkPolicy CRD gets status.podSelector populated correctly even before the LB is up.
Additional info:
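As a rough way to exercise the reproduction steps above, the commands from the verification section can be reused to create the service and the NetworkPolicy back to back and then inspect both sides while the amphora is still building; a sketch, assuming the same test/demo service and np policy names:
$ oc expose pod/demo --port 80 --target-port 8080
$ oc apply -f np_resource.yaml
$ # while Octavia still reports PENDING_CREATE for the load balancer...
$ openstack loadbalancer show test/demo -c provisioning_status -f value
$ # ...status.podSelector should already be populated on the KuryrNetworkPolicy CRD
$ oc get knp -n test np -o jsonpath='{.status.podSelector}{"\n"}'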