Description of problem:

Setup: OCP on OSP

History: Two weeks ago the client live-migrated instances out of a compute node that has a bad memory stick that needs to be replaced. Some of the instances failed during live migration. The instances would come back alive, but would still show as running on the source compute node (openstack server show).

Today we were cleaning up duplicated port bindings in their database:

~~~
# from their db dump
INSERT INTO `ml2_port_bindings` VALUES
[..]
('09d0d2b2-04ba-4a01-ac2a-e0990d3d125e','compute1','ovs','normal','{}','{\"port_filter\": true}','INACTIVE'),
('09d0d2b2-04ba-4a01-ac2a-e0990d3d125e','compute4','ovs','normal','','{\"port_filter\": true}','ACTIVE'),
('10084c02-ce48-42de-ae90-8dc04c0ef1b2','compute1','unbound','normal','{\"migrating_to\": \"compute1\"}','','INACTIVE'),
('10084c02-ce48-42de-ae90-8dc04c0ef1b2','compute4','ovs','normal','','{\"port_filter\": true}','ACTIVE')
...
/* The list is long */
~~~

We were following the procedure suggested to us in this BZ [1]:

curl -X DELETE -H "X-Auth-Token: $TOKEN" ${NEUTRON_API_URL}/v2.0/ports/f6d54c4c-8bde-4550-8a3c-83aed3f97c73/bindings/compute-0.redhat.local

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2097160#c19

After 5-6 port bindings had been deleted, the client's monitoring showed two OCP workers as down.

Instance we are working on: 48b534fd-0a20-4bdc-882e-d0a33c0750db
Port of this instance: 09d0d2b2-04ba-4a01-ac2a-e0990d3d125e

Indeed, the openstack port show output shows the port as "DOWN". We ran tcpdump on all interfaces and could see the packets arriving, but not on the TAP interface. At that point I wrongly believed that the MAC_Binding in the SB database was wrong, because the TAP interfaces have MAC addresses starting with FE while everywhere else they start with FA, but neutron engineering told me this is normal and expected: https://bugzilla.redhat.com/show_bug.cgi?id=2103688#c1

Because of that I ran 'ovn-sbctl destroy mac_binding _uuid' on the compute node. I learned today that this is a possible behavior when live-migrating OCP workers, as it triggers known issues. So we tried a cold migration instead, but that is not working either.
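For reference, the duplicated bindings above can be listed with something like the following query. This is only a sketch: the Neutron database name "ovs_neutron" is an assumption based on a default director deployment and may differ in this environment.

~~~
# Sketch: list ports that have more than one row in ml2_port_bindings,
# together with each binding's host and status.
# "ovs_neutron" is the assumed Neutron database name (director default).
mysql ovs_neutron -e "
  SELECT port_id, host, vif_type, status
  FROM ml2_port_bindings
  WHERE port_id IN (
    SELECT port_id FROM ml2_port_bindings
    GROUP BY port_id HAVING COUNT(*) > 1)
  ORDER BY port_id, status;"
~~~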
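And a sketch of the binding cleanup step from [1] applied to the affected port; the GET is only there to confirm which host's binding is INACTIVE before anything is deleted. TOKEN and NEUTRON_API_URL are assumed to be set as in the command above, and compute1 is taken from the db dump.

~~~
# Sketch of the procedure from [1], applied to the affected port.
# TOKEN and NEUTRON_API_URL are assumed to be set as in the command above.
PORT=09d0d2b2-04ba-4a01-ac2a-e0990d3d125e

# List all bindings for the port to confirm which host is INACTIVE:
curl -s -H "X-Auth-Token: $TOKEN" \
  ${NEUTRON_API_URL}/v2.0/ports/${PORT}/bindings | python3 -m json.tool

# Then delete only the stale (INACTIVE) binding -- compute1 per the db dump above:
curl -X DELETE -H "X-Auth-Token: $TOKEN" \
  ${NEUTRON_API_URL}/v2.0/ports/${PORT}/bindings/compute1
~~~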
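For completeness, the MAC_Binding rows in the southbound DB can also be inspected before destroying any of them. Sketch only; the IP used with find is a hypothetical placeholder.

~~~
# Sketch: inspect MAC_Binding rows in the OVN southbound DB before
# destroying any of them.
ovn-sbctl --columns=_uuid,logical_port,ip,mac list MAC_Binding
# Or narrow down to a single IP (192.168.0.10 is a hypothetical placeholder):
ovn-sbctl find MAC_Binding ip=192.168.0.10
~~~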
The following is from a cold migration we tried on the second problematic instance (the second OCP worker):

| events | [{'event': 'compute_resize_instance', 'start_time': '2022-07-04T17:29:11.000000', 'finish_time': '2022-07-04T17:29:14.000000', 'result': 'Error', 'traceback': ' File "/usr/lib/python3.6/site-packages/nova/compute/utils.py", line 1372, in decorated_function\n return function(self, context, *args, **kwargs)\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 219, in decorated_function\n kwargs[\'instance\'], e, sys.exc_info())\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise\n raise value\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 207, in decorated_function\n return function(self, context, *args, **kwargs)\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 4886, in resize_instance\n self._revert_allocation(context, instance, migration)\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise\n raise value\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 4883, in resize_instance\n instance_type, clean_shutdown, request_spec)\n File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 4922, in _resize_instance\n timeout, retry_interval)\n File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 10053, in migrate_disk_and_power_off\n shared_storage)\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n self.force_reraise()\n File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise\n raise value\n File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 10007, in migrate_disk_and_power_off\n os.rename(inst_base, inst_base_resize)\n', 'host': 'compute4', 'hostId': '3f902eada5c3e888f627d70f900b007a814dc7f8eae15fc46277a2ad'},
{'event': 'compute_prep_resize', 'start_time': '2022-07-04T17:29:10.000000', 'finish_time': '2022-07-04T17:29:11.000000', 'result': 'Success', 'traceback': None, 'host': 'compute1', 'hostId': '8f6c1bfef47c44c4fb581e682453c45a803cd296ecd0e2bba239d844'},
{'event': 'cold_migrate', 'start_time': '2022-07-04T17:29:07.000000', 'finish_time': '2022-07-04T17:29:10.000000', 'result': 'Success', 'traceback': None, 'host': 'controller3', 'hostId': 'ad9f04a6f829df18243a00495deb4cb935d488c331eb9cff57f10b29'},
{'event': 'conductor_migrate_server', 'start_time': '2022-07-04T17:29:07.000000', 'finish_time': '2022-07-04T17:29:10.000000', 'result': 'Success', 'traceback': None, 'host': 'controller3', 'hostId': 'ad9f04a6f829df18243a00495deb4cb935d488c331eb9cff57f10b29'}] |

After talking to the OpenShift team, we were told to increase the worker count and then manually delete the broken workers, but even new workers can't be spawned right now.
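The failing frame is the os.rename(inst_base, inst_base_resize) call in migrate_disk_and_power_off. One common reason for that rename to fail is a leftover "<uuid>_resize" directory from an earlier failed resize/migration attempt on the source compute node. A quick check, as a sketch only: /var/lib/nova/instances is the default instances path and may differ, and the UUID shown is the first affected worker and should be substituted with the instance actually being migrated.

~~~
# Sketch: check the source compute node for a leftover "<uuid>_resize"
# directory left behind by a previous failed resize/migration attempt,
# which would make os.rename(inst_base, inst_base_resize) fail.
# /var/lib/nova/instances is the default instances path and may differ.
INSTANCE=48b534fd-0a20-4bdc-882e-d0a33c0750db   # substitute the UUID of the instance being migrated
ls -ld /var/lib/nova/instances/${INSTANCE} /var/lib/nova/instances/${INSTANCE}_resize
~~~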
Here's the error we saw in the logs:

nova/nova-compute.log:10603:2022-07-04 18:48:15.466 7 WARNING nova.virt.libvirt.driver [req-84d58adf-2968-4871-a71d-5e4c11e57d42 05db5113a5784028872fa16f75fbd37d 18efe6874f004d6e91c722942ad11592 - default default] [instance: d3522518-443a-4a54-a9cc-15029a883e1b] Timeout waiting for [('network-vif-plugged', '429fc50e-fe1a-44da-a8f8-c980fbd3865d')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds

So the current situation right now is:
#1 We have 2 instances whose ports are listed as DOWN and we can't bring them up.
#2 We can't migrate them off to new compute nodes.
#3 We can't create new instances on this cloud.

I feel like all these problems are related in some way, but maybe not. If you need us to split the issues into separate BZs, please let us know and we will do that for you.

We have the following from the customer:
sosreport  # from the compute node where the instances live
sosreport  # from the 3 controller nodes
mysqldump  # from an overcloud node
ovn databases (nb and sb)
output.log  # containing openstack server show / port show / server event list and show for the two instances having issues

If you need anything else please let us know! Please reach out to us on IRC (ggrimaux/alisci) for anything. Thank you!

Version-Release number of selected component (if applicable):
OSP 16.1 z6

How reproducible:
100%
Can't migrate
Can't create instances

Steps to Reproduce:
1. Try to migrate an instance
2. Try to create a new instance

Actual results:
Traffic is down for two instances.
Can't migrate.
Can't create new instances.

Expected results:
Fix the existing ports.
Be able to migrate instances.
Be able to create new instances.

Additional info:
sosreport  # from the compute node where the instances live
sosreport  # from the 3 controller nodes
mysqldump  # from an overcloud node
ovn databases (nb and sb)
output.log  # containing openstack server show / port show / server event list and show for the two instances having issues
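For the DOWN ports and the network-vif-plugged timeout, this is roughly how we are comparing Neutron's view of the port with the OVN southbound database. Sketch only: run on a controller with access to the OVN SB DB, using the port UUID of the first affected worker from the description above.

~~~
# Sketch: compare Neutron's view of the port with the OVN southbound DB.
PORT=09d0d2b2-04ba-4a01-ac2a-e0990d3d125e   # port of the first affected worker (see above)

# Neutron side: status and binding host of the port
openstack port show ${PORT} -c status -c binding_host_id -c binding_vif_type

# OVN side: which chassis (if any) has claimed the corresponding Port_Binding
ovn-sbctl find Port_Binding logical_port=${PORT}
ovn-sbctl list chassis
~~~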