Bug 1869294

Summary: [kuryr] Network policy fails to get applied or removed when there's a pending load balancer being created
Product: OpenShift Container Platform
Component: Networking
Networking sub component: kuryr
Version: 4.6
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: Michał Dulko <mdulko>
Assignee: Michał Dulko <mdulko>
QA Contact: GenadiC <gcheresh>
CC: rlobillo
Type: Bug
Last Closed: 2020-10-27 16:28:36 UTC

Description Michał Dulko 2020-08-17 12:41:13 UTC
Description of problem:
We're currently seeing in the gate that test_update_network_policy fails very often with:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/kuryr_tempest_plugin/tests/scenario/base_network_policy.py", line 259, in test_update_network_policy
    self.assertIsNotNone(crd_pod_selector)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/testtools/testcase.py", line 439, in assertIsNotNone
    self.assertThat(observed, matcher, message)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/testtools/testcase.py", line 502, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: None matches Is(None)

The culprit is the following, which keeps circling around in the logs until the handler times out:

2020-08-04 20:33:36.495 1 DEBUG kuryr_kubernetes.controller.drivers.lbaasv2 [-] KuryrLoadBalancer for service default/kuryr-service-1926940643 not populated yet. update_lbaas_sg /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py:769
2020-08-04 20:33:36.495 1 DEBUG kuryr_kubernetes.handlers.retry [-] Handler KuryrNetworkPolicyHandler failed (attempt 2; ResourceNotReady: Resource not ready: 'kuryr-service-1926940643') _sleep /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py:101

The point is that the handler is technically right: that LB is not yet provisioned, so we keep waiting before applying the SG to it. This should not be necessary; when the members get added later, they will get the correct SG rules applied anyway.
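
To illustrate the intended behavior, here is a minimal standalone sketch (not the actual kuryr-kubernetes code; the CRD layout, the status.loadbalancer field and the function name are assumptions): instead of raising ResourceNotReady and retrying, the SG update could simply skip KuryrLoadBalancer CRDs whose load balancer has not been provisioned yet.

def klbs_ready_for_sg_update(klb_crds):
    """Filter KuryrLoadBalancer CRDs (plain dicts) to those whose LB exists.

    Assumes a populated CRD carries the provisioned LB under
    status.loadbalancer; the exact field name is an assumption here.
    """
    ready = []
    for crd in klb_crds:
        if not crd.get('status', {}).get('loadbalancer'):
            # LB still being created: skip it instead of retrying. Members
            # added once the LB is up will get the correct SG rules anyway.
            continue
        ready.append(crd)
    return ready

# Example: the pending LB is skipped, so the NP handler is not blocked by it.
pending = {'metadata': {'name': 'kuryr-service-1926940643'}, 'status': {}}
done = {'metadata': {'name': 'demo'},
        'status': {'loadbalancer': {'id': '4397f78d'}}}
print([c['metadata']['name'] for c in klbs_ready_for_sg_update([pending, done])])
# -> ['demo']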

Version-Release number of selected component (if applicable):


How reproducible:
Only on environments with very slow Octavia Amphora creation; it probably never happens on proper OSP environments. We've only seen it on the OpenStack gates.

Steps to Reproduce:
1. Create a service.
2. While the load balancer for that service is still being created, create a NetworkPolicy.

Actual results:
The KuryrNetworkPolicy CRD doesn't get status.podSelector populated until that LB is created (or never, if LB creation takes more than several minutes).

Expected results:
The KuryrNetworkPolicy CRD gets status.podSelector populated correctly even before the LB is up.

Additional info:

Comment 3 rlobillo 2020-09-09 10:11:11 UTC
Verified on OCP 4.6.0-0.nightly-2020-09-07-224533 over OSP 13 2020-09-03.2 with the Amphora provider.

The load balancer and the NP are created sequentially, and the kuryrloadbalancer resource gets populated before the load balancer reaches ACTIVE operating_status. The NP is fully operational once the load balancer becomes ready.

1. Create environment:
$ oc new-project test
$ oc run --image kuryr/demo demo-allowed-caller 
$ oc run --image kuryr/demo demo-caller 
$ oc run --image kuryr/demo demo
$ cat np_resource.yaml 
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: np
spec:
  podSelector:
    matchLabels:
      run: demo
  ingress:
  - from:
    - podSelector:
        matchLabels:
          run: demo-allowed-caller

2. Create service and apply NP on it:
$ oc expose pod/demo --port 80 --target-port 8080 && sleep 1 && oc apply -f np_resource.yaml
service/demo exposed
networkpolicy.networking.k8s.io/np created

3. The load balancer is still pending creation, while the KuryrNetworkPolicy (knp) already shows the correct status:

(shiftstack) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| admin_state_up      | True                                 |
| created_at          | 2020-09-09T09:49:38                  |
| description         | openshiftClusterID=ostest-jhtjg      |
| flavor              |                                      |
| id                  | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners           |                                      |
| name                | test/demo                            |
| operating_status    | OFFLINE                              |
| pools               |                                      |
| project_id          | abf184ea0ec84b70ab13de3bfd1ed0cc     |
| provider            | octavia                              |
| provisioning_status | PENDING_CREATE                       |
| updated_at          | None                                 |
| vip_address         | 172.30.97.133                        |
| vip_network_id      | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id         | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id   | None                                 |
| vip_subnet_id       | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ oc get knp -n test np -o json | jq ".status.podSelector"
{
  "matchLabels": {
    "run": "demo"
  }
}
(overcloud) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| admin_state_up      | True                                 |
| created_at          | 2020-09-09T09:49:38                  |
| description         | openshiftClusterID=ostest-jhtjg      |
| flavor              |                                      |
| id                  | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners           |                                      |
| name                | test/demo                            |
| operating_status    | OFFLINE                              |
| pools               |                                      |
| project_id          | abf184ea0ec84b70ab13de3bfd1ed0cc     |
| provider            | octavia                              |
| provisioning_status | PENDING_CREATE                       |
| updated_at          | None                                 |
| vip_address         | 172.30.97.133                        |
| vip_network_id      | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id         | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id   | None                                 |
| vip_subnet_id       | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| admin_state_up      | True                                 |
| created_at          | 2020-09-09T09:49:38                  |
| description         | openshiftClusterID=ostest-jhtjg      |
| flavor              |                                      |
| id                  | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners           | e30985ec-193a-4231-bbfc-97ff2c6fb4d1 |
| name                | test/demo                            |
| operating_status    | ONLINE                               |
| pools               | 5c4aefe2-3d1d-42ac-b159-c3c08de98956 |
| project_id          | abf184ea0ec84b70ab13de3bfd1ed0cc     |
| provider            | octavia                              |
| provisioning_status | ACTIVE                               |
| updated_at          | 2020-09-09T09:51:05                  |
| vip_address         | 172.30.97.133                        |
| vip_network_id      | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id         | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id   | None                                 |
| vip_subnet_id       | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+

4. The NP is correctly applied to the service:

(overcloud) [stack@undercloud-0 ~]$ oc get all
NAME                      READY   STATUS    RESTARTS   AGE
pod/demo                  1/1     Running   0          11m
pod/demo-allowed-caller   1/1     Running   0          3m41s
pod/demo-caller           1/1     Running   0          11m

NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/demo   ClusterIP   172.30.97.133   <none>        80/TCP    2m38s
(overcloud) [stack@undercloud-0 ~]$ oc rsh pod/demo-allowed-caller curl 172.30.97.133
demo: HELLO! I AM ALIVE!!!
(overcloud) [stack@undercloud-0 ~]$ oc rsh pod/demo-caller curl 172.30.97.133
^Ccommand terminated with exit code 130

(The curl from demo-caller hangs and has to be interrupted, as expected: that pod is not allowed by the NP.)

Comment 5 errata-xmlrpc 2020-10-27 16:28:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196