Bug 1869294
| Summary: | [kuryr] Network policy fails to get applied or removed when there's a pending load balancer being created | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michał Dulko <mdulko> |
| Component: | Networking | Assignee: | Michał Dulko <mdulko> |
| Networking sub component: | kuryr | QA Contact: | GenadiC <gcheresh> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | rlobillo |
| Version: | 4.6 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:28:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Verified on OCP4.6.0-0.nightly-2020-09-07-224533 over OSP13 2020-09-03.2 with Amphora provider.
The load balancer and the NP are created sequentially, and the kuryrloadbalancer resource gets populated before the load balancer reaches ACTIVE provisioning_status. The NP is fully operational once the load balancer becomes ready.
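To observe this sequencing directly, the Octavia state and the KuryrNetworkPolicy podSelector can be polled together; a minimal sketch, assuming overcloudrc has been sourced, oc is logged into the cluster, and the test/demo and np names from the steps below:
$ # poll the load balancer provisioning state and the CRD podSelector side by side
$ while true; do
>   openstack loadbalancer show test/demo -c provisioning_status -f value
>   oc get knp -n test np -o jsonpath='{.status.podSelector}{"\n"}'
>   sleep 10
> done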
1. Create environment:
$ oc new-project test
$ oc run --image kuryr/demo demo-allowed-caller
$ oc run --image kuryr/demo demo-caller
$ oc run --image kuryr/demo demo
$
$ cat np_resource.yaml
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: np
spec:
  podSelector:
    matchLabels:
      run: demo
  ingress:
  - from:
    - podSelector:
        matchLabels:
          run: demo-allowed-caller
2. Create the service and apply the NP to it:
$ oc expose pod/demo --port 80 --target-port 8080 && sleep 1 && oc apply -f np_resource.yaml
service/demo exposed
networkpolicy.networking.k8s.io/np created
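A quick sanity check right after this step, assuming the knp short name resolves to the Kuryr-side KuryrNetworkPolicy CRD the same way it does in step 3:
$ oc get svc/demo networkpolicy/np -n test    # both Kubernetes objects should exist immediately
$ oc get knp -n test np                       # the CRD Kuryr creates for the policy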
3. Load balancer is still pending creation while the kuryrloadbalancer resource already shows the correct status:
(shiftstack) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2020-09-09T09:49:38 |
| description | openshiftClusterID=ostest-jhtjg |
| flavor | |
| id | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners | |
| name | test/demo |
| operating_status | OFFLINE |
| pools | |
| project_id | abf184ea0ec84b70ab13de3bfd1ed0cc |
| provider | octavia |
| provisioning_status | PENDING_CREATE |
| updated_at | None |
| vip_address | 172.30.97.133 |
| vip_network_id | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id | None |
| vip_subnet_id | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ oc get knp -n test np -o json | jq ".status.podSelector"
{
"matchLabels": {
"run": "demo"
}
}
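Besides podSelector, the same CRD can be inspected for the security-group side of the policy; a hedged example, assuming the KuryrNetworkPolicy status also carries a securityGroupRules field (the field name is an assumption, not verified in this run):
$ oc get knp -n test np -o json | jq ".status.securityGroupRules"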
(overcloud) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2020-09-09T09:49:38 |
| description | openshiftClusterID=ostest-jhtjg |
| flavor | |
| id | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners | |
| name | test/demo |
| operating_status | OFFLINE |
| pools | |
| project_id | abf184ea0ec84b70ab13de3bfd1ed0cc |
| provider | octavia |
| provisioning_status | PENDING_CREATE |
| updated_at | None |
| vip_address | 172.30.97.133 |
| vip_network_id | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id | None |
| vip_subnet_id | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ . overcloudrc && openstack loadbalancer show test/demo
+---------------------+--------------------------------------+
| Field | Value |
+---------------------+--------------------------------------+
| admin_state_up | True |
| created_at | 2020-09-09T09:49:38 |
| description | openshiftClusterID=ostest-jhtjg |
| flavor | |
| id | 4397f78d-6694-4aea-986c-5e8b4826ec32 |
| listeners | e30985ec-193a-4231-bbfc-97ff2c6fb4d1 |
| name | test/demo |
| operating_status | ONLINE |
| pools | 5c4aefe2-3d1d-42ac-b159-c3c08de98956 |
| project_id | abf184ea0ec84b70ab13de3bfd1ed0cc |
| provider | octavia |
| provisioning_status | ACTIVE |
| updated_at | 2020-09-09T09:51:05 |
| vip_address | 172.30.97.133 |
| vip_network_id | 62d90e55-a172-4da0-8366-0348bbdf88e6 |
| vip_port_id | 6e73ae1a-7da5-443e-ae59-276459f91c43 |
| vip_qos_policy_id | None |
| vip_subnet_id | d164c66c-e705-4200-a5bb-6243d4bd5f9e |
+---------------------+--------------------------------------+
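Rather than re-running the show command by hand until Octavia finishes, a small polling loop can wait for the final state; a sketch, assuming the -c/-f formatter options used above:
$ # block until the Octavia load balancer finishes provisioning
$ until [ "$(openstack loadbalancer show test/demo -c provisioning_status -f value)" = "ACTIVE" ]; do
>   sleep 10
> done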
4. NP is correctly applied to the service:
(overcloud) [stack@undercloud-0 ~]$ oc get all
NAME READY STATUS RESTARTS AGE
pod/demo 1/1 Running 0 11m
pod/demo-allowed-caller 1/1 Running 0 3m41s
pod/demo-caller 1/1 Running 0 11m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/demo ClusterIP 172.30.97.133 <none> 80/TCP 2m38s
(overcloud) [stack@undercloud-0 ~]$ oc rsh pod/demo-allowed-caller curl 172.30.97.133
demo: HELLO! I AM ALIVE!!!
(overcloud) [stack@undercloud-0 ~]$ oc rsh pod/demo-caller curl 172.30.97.133
^Ccommand terminated with exit code 130
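The blocked call above had to be interrupted with Ctrl-C; the same check can be made non-interactive by giving curl a timeout, so the expected failure is a timeout exit instead of a hang (a sketch; the kuryr/demo image is assumed to ship a curl that supports --max-time):
$ oc rsh pod/demo-caller curl --max-time 5 172.30.97.133   # expected to time out: traffic is not allowed by the NP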
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4196
Description of problem:
We're seeing that in the gate at the moment, test_update_network_policy is failing very often with:
Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/kuryr_tempest_plugin/tests/scenario/base_network_policy.py", line 259, in test_update_network_policy
    self.assertIsNotNone(crd_pod_selector)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/testtools/testcase.py", line 439, in assertIsNotNone
    self.assertThat(observed, matcher, message)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/testtools/testcase.py", line 502, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: None matches Is(None)
And the culprit is this circling around in the logs until it times out:
2020-08-04 20:33:36.495 1 DEBUG kuryr_kubernetes.controller.drivers.lbaasv2 [-] KuryrLoadBalancer for service default/kuryr-service-1926940643 not populated yet. update_lbaas_sg /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py:769
2020-08-04 20:33:36.495 1 DEBUG kuryr_kubernetes.handlers.retry [-] Handler KuryrNetworkPolicyHandler failed (attempt 2; ResourceNotReady: Resource not ready: 'kuryr-service-1926940643') _sleep /usr/local/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py:101
Point is, the handler is technically right that the LB is not yet provisioned, yet we wait for it before applying the SG. This should not be necessary: when the members get added, they will have the correct SG rules applied.
Version-Release number of selected component (if applicable):
How reproducible:
Only on environments with very slow Octavia Amphora creation; it probably never happens on proper OSP environments, we've only seen it on OpenStack gates.
Steps to Reproduce:
1. Create a service.
2. While the load balancer for that service is still being created, try creating a NetworkPolicy (a quick check is sketched below).
Actual results:
The KuryrNetworkPolicy CRD doesn't get status.podSelector populated until that LB is created (or never, if LB creation takes more than several minutes).
Expected results:
The KuryrNetworkPolicy CRD gets status.podSelector populated correctly even before the LB is up.
Additional info:
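As a rough way to exercise the reproduction steps above, the commands from the verification section can be reused to create the service and the NetworkPolicy back to back and then inspect both sides while the amphora is still building; a sketch, assuming the same test/demo service and np policy names:
$ oc expose pod/demo --port 80 --target-port 8080
$ oc apply -f np_resource.yaml
$ # while Octavia still reports PENDING_CREATE for the load balancer...
$ openstack loadbalancer show test/demo -c provisioning_status -f value
$ # ...status.podSelector should already be populated on the KuryrNetworkPolicy CRD
$ oc get knp -n test np -o jsonpath='{.status.podSelector}{"\n"}'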