Description of problem:

Setup: OCP on OSP

History: Two weeks ago the client live-migrated instances out of a compute node that has a bad memory stick that needs to be replaced. Some of the instances failed during live migration. The instances would come back alive, but would still show as running on the source compute node (openstack server show).

Today we were cleaning up duplicated port bindings in their database:

~~~
# from their db dump
INSERT INTO `ml2_port_bindings` VALUES
[..]
('09d0d2b2-04ba-4a01-ac2a-e0990d3d125e','compute1','ovs','normal','{}','{\"port_filter\": true}','INACTIVE'),
('09d0d2b2-04ba-4a01-ac2a-e0990d3d125e','compute4','ovs','normal','','{\"port_filter\": true}','ACTIVE'),
('10084c02-ce48-42de-ae90-8dc04c0ef1b2','compute1','unbound','normal','{\"migrating_to\": \"compute1\"}','','INACTIVE'),
('10084c02-ce48-42de-ae90-8dc04c0ef1b2','compute4','ovs','normal','','{\"port_filter\": true}','ACTIVE')
...
/* The list is long */
~~~

We were following the procedure suggested to us in this BZ [1]:

curl -X DELETE -H "X-Auth-Token: $TOKEN" ${NEUTRON_API_URL}/v2.0/ports/f6d54c4c-8bde-4550-8a3c-83aed3f97c73/bindings/compute-0.redhat.local

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2097160#c19

After 5-6 port bindings had been deleted, the client's monitoring showed two OCP workers as down.

Instance we are working on: 48b534fd-0a20-4bdc-882e-d0a33c0750db
Port of this instance: 09d0d2b2-04ba-4a01-ac2a-e0990d3d125e

Indeed, the openstack port show output shows the port as "DOWN". We ran tcpdump on all interfaces and could see the packets arriving, but not on the TAP interface. At that point I wrongly believed that the MAC_Binding in the SB database was wrong, because the TAP interfaces have MAC addresses starting with FE while everywhere else they start with FA, but neutron engineering told me this is normal and expected: https://bugzilla.redhat.com/show_bug.cgi?id=2103688#c1

Because of that I ran 'ovn-sbctl destroy mac_binding _uuid' on the compute node. I learned today that this is a possible behavior when live-migrating OCP workers, as it triggers known issues. So we tried a cold migration instead, but that is not working either.
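For reference, the duplicated bindings above can be listed with something like the following query. This is only a sketch: the Neutron database name "ovs_neutron" is an assumption based on a default director deployment and may differ in this environment.

~~~
# Sketch: list ports that have more than one row in ml2_port_bindings,
# together with each binding's host and status.
# "ovs_neutron" is the assumed Neutron database name (director default).
mysql ovs_neutron -e "
  SELECT port_id, host, vif_type, status
  FROM ml2_port_bindings
  WHERE port_id IN (
    SELECT port_id FROM ml2_port_bindings
    GROUP BY port_id HAVING COUNT(*) > 1)
  ORDER BY port_id, status;"
~~~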
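And a sketch of the binding cleanup step from [1] applied to the affected port; the GET is only there to confirm which host's binding is INACTIVE before anything is deleted. TOKEN and NEUTRON_API_URL are assumed to be set as in the command above, and compute1 is taken from the db dump.

~~~
# Sketch of the procedure from [1], applied to the affected port.
# TOKEN and NEUTRON_API_URL are assumed to be set as in the command above.
PORT=09d0d2b2-04ba-4a01-ac2a-e0990d3d125e

# List all bindings for the port to confirm which host is INACTIVE:
curl -s -H "X-Auth-Token: $TOKEN" \
  ${NEUTRON_API_URL}/v2.0/ports/${PORT}/bindings | python3 -m json.tool

# Then delete only the stale (INACTIVE) binding -- compute1 per the db dump above:
curl -X DELETE -H "X-Auth-Token: $TOKEN" \
  ${NEUTRON_API_URL}/v2.0/ports/${PORT}/bindings/compute1
~~~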
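For completeness, the MAC_Binding rows in the southbound DB can also be inspected before destroying any of them. Sketch only; the IP used with find is a hypothetical placeholder.

~~~
# Sketch: inspect MAC_Binding rows in the OVN southbound DB before
# destroying any of them.
ovn-sbctl --columns=_uuid,logical_port,ip,mac list MAC_Binding
# Or narrow down to a single IP (192.168.0.10 is a hypothetical placeholder):
ovn-sbctl find MAC_Binding ip=192.168.0.10
~~~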
The following is from a cold migration we tried on the second problematic instance (the second OCP worker):

| events | [{'event': 'compute_resize_instance', 'start_time': '2022-07-04T17:29:11.000000', 'finish_time': '2022-07-04T17:29:14.000000', 'result': 'Error', 'traceback': ' File "/usr/lib/python3.6/site-packages/nova/compute/utils.py", line 1372, in decorated_function\n return function(self, context, *args, **kwargs)\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 219, in decorated_function\n kwargs[\'instance\'], e, sys.exc_info())\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise\n raise value\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 207, in decorated_function\n return function(self, context, *args, **kwargs)\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 4886, in resize_instance\n self._revert_allocation(context, instance, migration)\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise\n raise value\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 4883, in resize_instance\n instance_type, clean_shutdown, request_spec)\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 4922, in _resize_instance\n timeout, retry_interval)\n File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 10053, in migrate_disk_and_power_off\n shared_storage)\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise\n raise value\n File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 10007, in migrate_disk_and_power_off\n os.rename(inst_base, inst_base_resize)\n', 'host': 'compute4', 'hostId': '3f902eada5c3e888f627d70f900b007a814dc7f8eae15fc46277a2ad'},
{'event': 'compute_prep_resize', 'start_time': '2022-07-04T17:29:10.000000', 'finish_time': '2022-07-04T17:29:11.000000', 'result': 'Success', 'traceback': None, 'host': 'compute1', 'hostId': '8f6c1bfef47c44c4fb581e682453c45a803cd296ecd0e2bba239d844'},
{'event': 'cold_migrate', 'start_time': '2022-07-04T17:29:07.000000', 'finish_time': '2022-07-04T17:29:10.000000', 'result': 'Success', 'traceback': None, 'host': 'controller3', 'hostId': 'ad9f04a6f829df18243a00495deb4cb935d488c331eb9cff57f10b29'},
{'event': 'conductor_migrate_server', 'start_time': '2022-07-04T17:29:07.000000', 'finish_time': '2022-07-04T17:29:10.000000', 'result': 'Success', 'traceback': None, 'host': 'controller3', 'hostId': 'ad9f04a6f829df18243a00495deb4cb935d488c331eb9cff57f10b29'}] |

After talking to the OpenShift team, we were told to increase the worker count and then manually delete the broken workers, but even new workers can't be spawned right now.
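The failing frame is the os.rename(inst_base, inst_base_resize) call in migrate_disk_and_power_off. One common reason for that rename to fail is a leftover "<uuid>_resize" directory from an earlier failed resize/migration attempt on the source compute node. A quick check, as a sketch only: /var/lib/nova/instances is the default instances path and may differ, and the UUID shown is the first affected worker and should be substituted with the instance actually being migrated.

~~~
# Sketch: check the source compute node for a leftover "<uuid>_resize"
# directory left behind by a previous failed resize/migration attempt,
# which would make os.rename(inst_base, inst_base_resize) fail.
# /var/lib/nova/instances is the default instances path and may differ.
INSTANCE=48b534fd-0a20-4bdc-882e-d0a33c0750db   # substitute the UUID of the instance being migrated
ls -ld /var/lib/nova/instances/${INSTANCE} /var/lib/nova/instances/${INSTANCE}_resize
~~~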
Here's the error we saw in the logs:

nova/nova-compute.log:10603:2022-07-04 18:48:15.466 7 WARNING nova.virt.libvirt.driver [req-84d58adf-2968-4871-a71d-5e4c11e57d42 05db5113a5784028872fa16f75fbd37d 18efe6874f004d6e91c722942ad11592 - default default] [instance: d3522518-443a-4a54-a9cc-15029a883e1b] Timeout waiting for [('network-vif-plugged', '429fc50e-fe1a-44da-a8f8-c980fbd3865d')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds

So the current situation right now is:
#1 We have 2 instances whose ports are listed as DOWN and we can't bring them up.
#2 We can't migrate them off to new compute nodes.
#3 We can't create new instances on this cloud.

I feel like all these problems are related in some way, but maybe not. If you need us to split the issues into separate BZs, please let us know and we will do that for you.

We have the following from the customer:
sosreport  # from the compute node where the instances live
sosreport  # from the 3 controller nodes
mysqldump  # from an overcloud node
ovn databases (nb and sb)
output.log  # containing openstack server show / port show / server event list and show for the two instances having issues

If you need anything else please let us know! Please reach out to us on IRC (ggrimaux/alisci) for anything. Thank you!

Version-Release number of selected component (if applicable):
OSP 16.1 z6

How reproducible:
100%
Can't migrate
Can't create instances

Steps to Reproduce:
1. Try to migrate an instance
2. Try to create a new instance

Actual results:
Traffic is down for two instances.
Can't migrate.
Can't create new instances.

Expected results:
Fix the existing ports.
Be able to migrate instances.
Be able to create new instances.

Additional info:
sosreport  # from the compute node where the instances live
sosreport  # from the 3 controller nodes
mysqldump  # from an overcloud node
ovn databases (nb and sb)
output.log  # containing openstack server show / port show / server event list and show for the two instances having issues
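For the DOWN ports and the network-vif-plugged timeout, this is roughly how we are comparing Neutron's view of the port with the OVN southbound database. Sketch only: run on a controller with access to the OVN SB DB, using the port UUID of the first affected worker from the description above.

~~~
# Sketch: compare Neutron's view of the port with the OVN southbound DB.
PORT=09d0d2b2-04ba-4a01-ac2a-e0990d3d125e   # port of the first affected worker (see above)

# Neutron side: status and binding host of the port
openstack port show ${PORT} -c status -c binding_host_id -c binding_vif_type

# OVN side: which chassis (if any) has claimed the corresponding Port_Binding
ovn-sbctl find Port_Binding logical_port=${PORT}
ovn-sbctl list chassis
~~~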