This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2187985 - Creating members with an invalid subnet on Edge environment won't allow Octavia to delete them
Summary: Creating members with an invalid subnet on Edge environment won't allow Octavia to delete them
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: zstream
Target Release: 17.1
Assignee: Gregory Thiemonge
QA Contact: Bruna Bonguardo
Docs Contact: Greg Rakauskas
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-19 11:03 UTC by Omer Schwartz
Modified: 2024-12-06 15:44 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Adding a load balancer member whose subnet is not in the Load-balancing service (octavia) availability zone puts the load balancer in `ERROR`. The member cannot be removed because of the `ERROR` status, making the load balancer unusable.
*Workaround:* Delete the load balancer.
Clone Of:
Environment:
Last Closed: 2024-12-06 15:43:32 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Issue Tracker OSP-24366 (last updated 2024-12-06 15:43:31 UTC)
Red Hat Issue Tracker OSP-33158 (last updated 2024-12-06 15:44:22 UTC)

Description Omer Schwartz 2023-04-19 11:03:35 UTC
Description of problem:

1. Creating a member with an invalid subnet on an Edge environment will (obviously) create an invalid member and will not allow Octavia to delete it afterwards.

2. It also causes the created member, and any other member created afterwards, to have "operating_status": "OFFLINE" & "provisioning_status": "ERROR", even if the other members were provided with the correct subnet.

The HandleNetworkDeltas task probably fails; Octavia likely tries to create a port as part of the task and the port operation fails.

I got the following traceback:

2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/octavia/controller/queue/v1/endpoints.py", line 127, in delete_member
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     self.worker.delete_member(member_id)
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/octavia/controller/worker/v1/controller_worker.py", line 508, in delete_member
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     delete_member_tf.run()
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/taskflow/engines/action_engine/engine.py", line 247, in run
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     for _state in self.run_iter(timeout=timeout):
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/taskflow/engines/action_engine/engine.py", line 340, in run_iter
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     failure.Failure.reraise_if_any(er_failures)
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/taskflow/types/failure.py", line 339, in reraise_if_any
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     failures[0].reraise()
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/taskflow/types/failure.py", line 346, in reraise
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     six.reraise(*self._exc_info)
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/six.py", line 709, in reraise
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     raise value
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     result = task.execute(**arguments)
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/octavia/controller/worker/v1/tasks/network_tasks.py", line 420, in execute
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     ret = handle_delta.execute(amphorae[amp_id], delta)
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/octavia/controller/worker/v1/tasks/network_tasks.py", line 337, in execute
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     port = self.network_driver.plug_fixed_ip(
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/octavia/network/drivers/neutron/base.py", line 294, in plug_fixed_ip
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server     raise base.NetworkException(str(e))
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server octavia.network.base.NetworkException: Invalid input for operation: Failed to create port on network d7f7de6c-0e84-49e2-9042-697fa85d2532, because fixed_ips included invalid subnet 086e650b-0c78-43db-811c-5dfcd64423b6.
2023-04-17 14:07:21.866 13 ERROR oslo_messaging.rpc.server Neutron server returns request_ids: ['req-26d5bb1c-be0f-41fb-83be-4fa6e6394b60']
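The traceback points at the delta-handling path: HandleNetworkDeltas asks the Neutron base driver to plug a fixed IP for the member, Neutron rejects the port operation because the subnet is not valid for that network, and the driver re-raises the error as a generic NetworkException, which fails the whole flow. A minimal sketch of that wrapping pattern (illustrative only, not the actual Octavia source; the class name, signatures and the neutron client call are assumptions):

# Illustrative sketch of the error wrapping implied by the traceback.
# Not the real octavia/network/drivers/neutron/base.py code.
class NetworkException(Exception):
    """Stand-in for octavia.network.base.NetworkException."""

class SketchNeutronDriver:
    def __init__(self, neutron_client):
        self.client = neutron_client

    def plug_fixed_ip(self, port_id, subnet_id, ip_address=None):
        fixed_ip = {'subnet_id': subnet_id}
        if ip_address:
            fixed_ip['ip_address'] = ip_address
        try:
            # If subnet_id does not belong to the port's network, Neutron
            # answers "Invalid input for operation: ... invalid subnet".
            return self.client.update_port(
                port_id, {'port': {'fixed_ips': [fixed_ip]}})
        except Exception as e:
            # The original Neutron error is flattened into a generic
            # exception, so TaskFlow only sees NetworkException and reverts
            # the member flow.
            raise NetworkException(str(e))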


I will add the exact steps below.

Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20230404.n.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy an Edge environment with both nova & neutron availability zones
2. Deploy the Octavia service
3. Create a load balancer with one or two members, the first one with an invalid subnet

Actual results:
- None of the members are deletable, although deleting the LB with --cascade does delete them successfully
- Members that were provided with the correct subnet share the same provisioning_status and operating_status as the first, invalid member.

Expected results:
- Both members should be deletable
- Members that were provided with the correct subnet should be deployed successfully

Additional info: the commands I ran

I used this d/s job to deploy the environment
https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/view/QE/view/OSP17.1/job/DFG-edge-deployment-17.1-rhel-virthost-ipv4-3cont-2comp-2leafs-x-2comp-tls_everywhere-routed_provider_nets-ovn-naz/

To deploy the Octavia service I created a file (octavia-dcn-parameters.yaml) with the following content:

   octavia_controller_availability_zone: az-central
   octavia_availability_zones:
     az-central: # no cidr needed, it uses the already existing subnet
     az-dcn1:
       lb_mgmt_subnet_cidr: 172.47.0.0/16
     az-dcn2:
       lb_mgmt_subnet_cidr: 172.48.0.0/16
   octavia_backbone_tenant_cidr: 172.49.0.0/16

And ran the following playbook:
ansible-playbook -i overcloud-deploy/central/config-download/central/tripleo-ansible-inventory.yaml /usr/share/ansible/tripleo-playbooks/octavia-dcn-deployment.yaml -e @octavia-dcn-parameters.yaml -e stack=central -v


# I created a security group with the following rules:
openstack security group rule create --protocol tcp --dst-port 22 sg1
openstack security group rule create --protocol tcp --dst-port 80 sg1
openstack security group rule create --protocol tcp --dst-port 8080 sg1
openstack security group rule create --protocol tcp --dst-port 443 sg1
openstack security group rule create --protocol icmp sg1

# I created 2 nova VMs on the public network, each one in a different AZ with the default security group, and I added the new security group that I created to the servers
openstack server create --wait --flavor m1.tiny --image cirros-0.5.2-x86_64 --network public --availability-zone az-dcn1 vm1
openstack server add security group vm1 sg1
openstack server create --wait --flavor m1.tiny --image cirros-0.5.2-x86_64 --network public --availability-zone az-dcn2 vm2
openstack server add security group vm2 sg1

# I created 3 availability zone profiles & availability zones
openstack loadbalancer availabilityzoneprofile create --provider amphora --name azp-dcn1 --availability-zone-data '{"compute_zone": "az-dcn1", "management_network": "<mgmt-network-id-of-lb-mgmt-az-dcn1-net>"}'
openstack loadbalancer availabilityzoneprofile create --provider amphora --name azp-dcn2 --availability-zone-data '{"compute_zone": "az-dcn2", "management_network": "<mgmt-network-id-of-lb-mgmt-az-dcn2-net>"}'
openstack loadbalancer availabilityzoneprofile create --provider amphora --name azp-central --availability-zone-data '{"compute_zone": "az-central", "management_network": "<mgmt-network-id-of-lb-mgmt-net>"}'

openstack loadbalancer availabilityzone create --availabilityzoneprofile azp-dcn1 --name az-dcn1
openstack loadbalancer availabilityzone create --availabilityzoneprofile azp-dcn2 --name az-dcn2
openstack loadbalancer availabilityzone create --availabilityzoneprofile azp-central --name az-central

# I created a loadbalancer using the segment1 subnet, which is the subnet that was created on the central AZ and is part of the public network I used for the nova servers
# And an HTTP listener & pool
openstack loadbalancer create --name lb1 --vip-subnet-id segment1 --availability-zone az-central --wait
openstack loadbalancer listener create --wait --protocol-port 80 --protocol HTTP --name listener1 lb1
openstack loadbalancer pool create --name pool1 --lb-algorithm ROUND_ROBIN --listener listener1 --protocol HTTP

# I deployed the members using segment2, which is the subnet that is deployed on az-dcn2 - not the same AZ that the LB is using
openstack loadbalancer member create --name member1 --address 10.101.20.212 --protocol-port 8080 --subnet-id segment2 pool1
openstack loadbalancer member create --name member2 --address 10.101.30.229 --protocol-port 8080 --subnet-id segment2 pool1

# I created an HTTP healthmonitor
openstack loadbalancer healthmonitor create --delay 10 --max-retries 4 --timeout 5 --type HTTP --name http_hm1 pool1

# My playbook also ran the Octavia test server on the VM servers, but I don't think that it matters in this case

Comment 4 Gregory Thiemonge 2023-04-25 11:39:59 UTC
One important thing to add:

If the controller fails to add a member, the revert function of MemberToErrorOnRevertTask is executed:
https://opendev.org/openstack/octavia/src/branch/stable/wallaby/octavia/controller/worker/v1/tasks/lifecycle_tasks.py#L140-L145

It sets the member to ERROR and also sets the listener/pool/LB to ACTIVE, which means that the user can still add another member.
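Roughly, the revert behavior described above looks like this (a simplified, free-standing sketch of the linked lifecycle_tasks.py code, not a verbatim copy; the task_utils helper names are taken from memory and should be checked against the source):

# Simplified sketch of the revert described above; not a verbatim copy of
# octavia/controller/worker/v1/tasks/lifecycle_tasks.py.
def revert_member_to_error(task_utils, member, listeners, loadbalancer, pool):
    # The failed member is flagged ERROR...
    task_utils.mark_member_prov_status_error(member.id)
    # ...but the listener/pool/LB are put back to ACTIVE, so the API keeps
    # accepting new members for a pool whose network plumbing is broken.
    for listener in listeners:
        task_utils.mark_listener_prov_status_active(listener.id)
    task_utils.mark_pool_prov_status_active(pool.id)
    task_utils.mark_loadbalancer_prov_status_active(loadbalancer.id)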

In the octavia-worker, the handling of the member's subnets works with deltas (what is needed vs what is plugged), which may mean that when adding a 2nd member, the controller will also apply some changes for the first one if needed (maybe because it failed during the 1st call).

So maybe we should consider one of the following (see the sketch after this list for the second option):

- either there's something wrong and adding new members will not fix the situation, so we may set the LB to ERROR to avoid other issues
- or we should check the provisioning_status of the members in the worker when plugging/unplugging subnets, and skip members that are in ERROR
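A hypothetical illustration of that second option (not actual Octavia code; the helper name and the attribute access are assumptions): when the worker computes which member subnets need to be plugged on an amphora, it would simply skip members already in ERROR, so one broken member cannot keep poisoning the network deltas of the healthy ones.

# Hypothetical sketch only, not actual Octavia code.
def subnets_required_for_pool(pool):
    """Return the member subnets that should be plugged on the amphora,
    skipping members whose provisioning_status is already ERROR."""
    return {
        member.subnet_id
        for member in pool.members
        if member.provisioning_status != 'ERROR'
    }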

Comment 7 Lukas Svaty 2023-06-16 08:13:29 UTC
Bulk moving target milestone to GA after the release of Beta on 14th June '23.

