This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2312196 - Error connecting to RabbitMQ while live migrating VMs
Summary: Error connecting to RabbitMQ while live migrating VMs
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z5
Target Release: 17.1
Assignee: melanie witt
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-09-13 14:01 UTC by Eric Nothen
Modified: 2025-01-14 20:35 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2025-01-14 20:35:50 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   OSP-32824 0 None None None 2025-01-14 20:35:50 UTC
Red Hat Issue Tracker OSP-33456 0 None None None 2025-01-14 20:35:08 UTC
Red Hat Issue Tracker OSPRH-12217 0 None None None 2024-12-06 03:08:25 UTC
Red Hat Issue Tracker OSPRH-12264 0 None None None 2024-12-06 03:08:25 UTC

Description Eric Nothen 2024-09-13 14:01:29 UTC
Description of problem:
While live migrating a VM that is under heavy memory pressure, the connection to RabbitMQ is dropped and the migration is aborted.

Version-Release number of selected component (if applicable):
16.2

How reproducible:
Reported by a customer while live migrating VMs (OCP workers). Reproduced in a lab environment with the same symptoms and consequences.

Steps to Reproduce:
1. Start a VM and run stress-ng on it to fill and continuously update memory (or deploy an OCP worker if preferred)
2. Start the live migration and monitor its progress by running these commands in a loop (a simple wrapper loop is sketched after the block):
~~~
date
openstack server show $server_id -c progress -f yaml
nova server-migration-show $server_id $migration_id | grep memory_ | sed -e 's/| //g;s/|$//g'
~~~
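
A minimal wrapper for the loop mentioned in step 2, assuming $server_id and $migration_id are already set and with an arbitrary sleep interval:
~~~
# Hypothetical monitoring loop around the commands shown above.
while true; do
    date
    openstack server show $server_id -c progress -f yaml
    nova server-migration-show $server_id $migration_id | grep memory_ | sed -e 's/| //g;s/|$//g'
    echo
    sleep 20
done
~~~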


Actual results:

3. Initially, the migration progresses normally:
~~~
Thu Sep 12 16:11:55 CEST 2024
progress: 0
memory_processed_bytes -                                    
memory_remaining_bytes -                                    
memory_total_bytes     -                                    

Thu Sep 12 16:12:21 CEST 2024
progress: 4
memory_processed_bytes 2211554969                           
memory_remaining_bytes 32076713984                          
memory_total_bytes     34364792832            
~~~

4. Eventually, memory is dirtied faster than it can be copied, and progress falls back. At this point the migration has already copied over 12 times the memory size of the VM, but this is known and (I guess) normal:
~~~
Thu Sep 12 17:22:38 CEST 2024
progress: 94
memory_processed_bytes 428854981180                         
memory_remaining_bytes 834887680                            
memory_total_bytes     34364792832                          

Thu Sep 12 17:23:07 CEST 2024
progress: 31
memory_processed_bytes 431996388459                         
memory_remaining_bytes 21890752512                          
memory_total_bytes     34364792832                          

Thu Sep 12 17:23:33 CEST 2024
progress: 41
memory_processed_bytes 434533618219                         
memory_remaining_bytes 19336859648                          
memory_total_bytes     34364792832                          

Thu Sep 12 17:23:58 CEST 2024
progress: 49
memory_processed_bytes 437340261899                         
memory_remaining_bytes 16534937600                          
memory_total_bytes     34364792832
~~~

5. Afterwards, progress and memory metrics are no longer updated, which suggests the database received no updates for ~22 minutes (17:25 to 17:47):
~~~
Thu Sep 12 17:24:25 CEST 2024
progress: 63
memory_processed_bytes 441992684877                         
memory_remaining_bytes 11881574400                          
memory_total_bytes     34364792832                          

Thu Sep 12 17:24:52 CEST 2024
progress: 74
memory_processed_bytes 445978914886                         
memory_remaining_bytes 7899877376                           
memory_total_bytes     34364792832                          

Thu Sep 12 17:25:17 CEST 2024
progress: 81
memory_processed_bytes 448567785763                         
memory_remaining_bytes 5316001792                           
memory_total_bytes     34364792832                          

Thu Sep 12 17:25:41 CEST 2024
progress: 81
memory_processed_bytes 448567785763                         
memory_remaining_bytes 5316001792                           
memory_total_bytes     34364792832                          

--- snip ---

Thu Sep 12 17:47:38 CEST 2024
progress: 81
memory_processed_bytes 448567785763                         
memory_remaining_bytes 5316001792                           
memory_total_bytes     34364792832                          

Thu Sep 12 17:47:59 CEST 2024
progress: 81
memory_processed_bytes 448567785763                         
memory_remaining_bytes 5316001792                           
memory_total_bytes     34364792832                          
~~~

6. During this window, the nova_compute healthcheck fails, already suggesting a problem connecting to RabbitMQ:
~~~
$ sudo grep tripleo_nova_compute_healthcheck.service /var/log/messages
...
Sep 12 17:25:23 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Succeeded.
Sep 12 17:27:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Succeeded.
Sep 12 17:29:28 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:29:28 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:31:17 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:31:17 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:32:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:32:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:34:15 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:34:15 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:35:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:35:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:37:29 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:37:29 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:39:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:39:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:41:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:41:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:43:19 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:43:19 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:44:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:44:26 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:46:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 17:46:25 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Failed with result 'exit-code'.
Sep 12 17:48:24 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Succeeded.
Sep 12 17:50:23 overcloud-novacompute-1 systemd[1]: tripleo_nova_compute_healthcheck.service: Succeeded.
~~~

7. For whatever reason, nova-compute.log is not updated at all during this window, even with debug enabled (note the jump from 17:25 to 17:48):
~~~
2024-09-12 17:25:10.779 8 DEBUG nova.virt.libvirt.migration [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521
2024-09-12 17:25:10.880 8 DEBUG nova.virt.libvirt.driver [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] skipping disk for instance-000000e6 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-12 17:25:10.886 8 DEBUG nova.virt.libvirt.driver [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] skipping disk for instance-000000e9 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-12 17:25:10.891 8 DEBUG nova.virt.libvirt.driver [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] skipping disk for instance-000000c7 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-12 17:48:05.970 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [66ce8f67-540f-41a8-8cc6-ccb361f8b79e] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
~~~

8. Immediately after the RabbitMQ connection error is reported, a traceback follows and the migration is aborted:
~~~
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: ConnectionResetError: [Errno 104] Connection reset by peer
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 382, in _callback
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     self.callback(RabbitMessage(message))
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 497, in __call__
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     message.acknowledge()
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 237, in acknowledge
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     self._raw_message.ack()
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/kombu/message.py", line 126, in ack
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     self.channel.basic_ack(self.delivery_tag, multiple=multiple)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/amqp/channel.py", line 1394, in basic_ack
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     spec.Basic.Ack, argsig, (delivery_tag, multiple),
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/amqp/abstract_channel.py", line 59, in send_method
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     conn.frame_writer(1, self.channel_id, sig, args, content)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/amqp/method_framing.py", line 172, in write_frame
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     write(view[:offset])
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 252, in write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     self._write(s)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 377, in _write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     n = write(s)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 166, in write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     super(GreenSSLSocket, self).write, data)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python3.6/site-packages/eventlet/green/ssl.py", line 147, in _call_trampolining
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     return func(*a, **kw)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python3.6/ssl.py", line 884, in write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     return self._sslobj.write(data)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python3.6/ssl.py", line 601, in write
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit     return self._sslobj.write(data)
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit ConnectionResetError: [Errno 104] Connection reset by peer
2024-09-12 17:48:05.973 8 ERROR oslo.messaging._drivers.impl_rabbit
2024-09-12 17:48:06.056 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [50b4b3a6-d64c-438c-8970-78599a048ddd] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2024-09-12 17:48:06.063 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [0c470004-08a3-4caa-ab53-f1d4a396c013] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
2024-09-12 17:48:06.064 8 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2024-09-12 17:48:06.086 8 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2024-09-12 17:48:06.096 8 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2024-09-12 17:48:06.107 8 WARNING nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Error monitoring migration: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Traceback (most recent call last):
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 425, in get
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     return self._queues[msg_id].get(block=True, timeout=timeout)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     return waiter.wait()
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     return get_hub().switch()
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     return self.greenlet.switch()
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] queue.Empty
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] During handling of the above exception, another exception occurred:
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Traceback (most recent call last):
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9382, in _live_migration
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     finish_event, disk_paths)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9256, in _live_migration_monitor
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     info, remaining)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py", line 557, in save_stats
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     migration.save()
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     ctxt, self, fn.__name__, args, kwargs)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 246, in object_action
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     objmethod=objmethod, args=args, kwargs=kwargs)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     transport_options=self.transport_options)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     transport_options=transport_options)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     transport_options=transport_options)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 662, in _send
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     call_monitor_timeout)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 551, in wait
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     message = self.waiters.get(msg_id, timeout=timeout)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 429, in get
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     'to message ID %s' % msg_id)
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
2024-09-12 17:48:06.107 8 ERROR nova.virt.libvirt.driver [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.153 8 ERROR nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Live Migration failure: operation failed: migration out job: unexpectedly failed: libvirt.libvirtError: operation failed: migration out job: unexpectedly failed
2024-09-12 17:48:06.154 8 DEBUG nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9369
2024-09-12 17:48:06.194 8 DEBUG nova.compute.resource_tracker [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] Hypervisor/Node resource view: name=overcloud-novacompute-1.keller.lab free_ram=29768MB free_disk=1535.8201903514564GB free_vcpus=3 pci_devices=[{"dev_id": "pci_0000_06_00_0", "address": "0000:06:00.0", "product_id": "1045", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1045", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_05_00_0", "address": "0000:05:00.0", "product_id": "1042", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1042", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_02_5", "address": "0000:00:02.5", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_02_2", "address": "0000:00:02.2", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_02_6", "address": "0000:00:02.6", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_01_0", "address": "0000:00:01.0", "product_id": "1111", "vendor_id": "1234", "numa_node": null, "label": "label_1234_1111", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_02_7", "address": "0000:00:02.7", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_02_4", "address": "0000:00:02.4", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_02_1", "address": "0000:00:02.1", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_02_0", "address": "0000:00:02.0", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_02_3", "address": "0000:00:02.3", "product_id": "000c", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000c", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_1f_0", "address": "0000:00:1f.0", "product_id": "2918", "vendor_id": "8086", "numa_node": null, "label": "label_8086_2918", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_1f_3", "address": "0000:00:1f.3", "product_id": "2930", "vendor_id": "8086", "numa_node": null, "label": "label_8086_2930", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_07_00_0", "address": "0000:07:00.0", "product_id": "1044", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1044", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_01_00_0", "address": "0000:01:00.0", "product_id": "1041", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1041", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_00_0", "address": "0000:00:00.0", "product_id": "29c0", "vendor_id": "8086", "numa_node": null, "label": "label_8086_29c0", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_03_00_0", "address": "0000:03:00.0", "product_id": "000d", "vendor_id": "1b36", "numa_node": null, "label": "label_1b36_000d", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_00_1f_2", "address": "0000:00:1f.2", "product_id": "2922", "vendor_id": "8086", "numa_node": null, "label": "label_8086_2922", "dev_type": "type-PCI"}, {"dev_id": "pci_0000_04_00_0", "address": "0000:04:00.0", "product_id": "1043", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1043", "dev_type": "type-PCI"}, 
{"dev_id": "pci_0000_02_00_0", "address": "0000:02:00.0", "product_id": "1041", "vendor_id": "1af4", "numa_node": null, "label": "label_1af4_1041", "dev_type": "type-PCI"}] _report_hypervisor_resource_view /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:1063
2024-09-12 17:48:06.195 8 DEBUG oslo_concurrency.lockutils [req-6a398fec-e21e-469c-a4b2-086cdbe04303 - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2024-09-12 17:48:06.223 8 ERROR nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Failed to cancel migration Requested operation is not valid: no job is active on the domain: libvirt.libvirtError: Requested operation is not valid: no job is active on the domain
2024-09-12 17:48:06.240 8 DEBUG nova.virt.libvirt.driver [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Live migration monitoring is all done _live_migration /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9401
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [-] [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Live migration failed.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Traceback (most recent call last):
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 425, in get
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     return self._queues[msg_id].get(block=True, timeout=timeout)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     return waiter.wait()
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     return get_hub().switch()
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     return self.greenlet.switch()
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] queue.Empty
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] During handling of the above exception, another exception occurred:
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] Traceback (most recent call last):
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 7466, in _do_live_migration
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     block_migration, migrate_data)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8858, in live_migration
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     migrate_data)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9382, in _live_migration
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     finish_event, disk_paths)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9256, in _live_migration_monitor
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     info, remaining)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py", line 557, in save_stats
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     migration.save()
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     ctxt, self, fn.__name__, args, kwargs)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 246, in object_action
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     objmethod=objmethod, args=args, kwargs=kwargs)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     transport_options=self.transport_options)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     transport_options=transport_options)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     transport_options=transport_options)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 662, in _send
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     call_monitor_timeout)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 551, in wait
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     message = self.waiters.get(msg_id, timeout=timeout)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 429, in get
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]     'to message ID %s' % msg_id)
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25] oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
2024-09-12 17:48:06.241 8 ERROR nova.compute.manager [instance: 2ce3ef37-1e9a-4638-8b10-fa95cdd4ed25]
~~~

9. Immediately after the traceback, the service is able to reconnect to Rabbit:
~~~
2024-09-12 17:48:07.072 8 INFO oslo.messaging._drivers.impl_rabbit [-] [66ce8f67-540f-41a8-8cc6-ccb361f8b79e] Reconnected to AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 via [amqp] client with port 44858.
2024-09-12 17:48:07.101 8 INFO oslo.messaging._drivers.impl_rabbit [-] [0c470004-08a3-4caa-ab53-f1d4a396c013] Reconnected to AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 via [amqp] client with port 36704.
2024-09-12 17:48:07.116 8 INFO oslo.messaging._drivers.impl_rabbit [-] [50b4b3a6-d64c-438c-8970-78599a048ddd] Reconnected to AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 via [amqp] client with port 36696.
2024-09-12 17:48:07.119 8 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 389a868aba184d1a9956337bca78a569
~~~

10. Additionally, Nova only reports Rabbit as unreachable exactly during this live migration window, not before or after:
~~~
[root@overcloud-novacompute-1 ~]# egrep "AMQP server on overcloud-controller-..ctlplane.keller.lab:5672 is unreachable" /var/log/containers/nova/nova-compute.log
2024-09-12 17:48:05.970 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [66ce8f67-540f-41a8-8cc6-ccb361f8b79e] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
2024-09-12 17:48:06.056 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [50b4b3a6-d64c-438c-8970-78599a048ddd] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2024-09-12 17:48:06.063 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [0c470004-08a3-4caa-ab53-f1d4a396c013] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
~~~

11. Checking the controllers for possible RabbitMQ errors, I can see that the containers were last restarted days before the migration:
~~~
$ ansible -i inventory.yaml -m shell -a 'podman inspect $(podman ps -q -f name=rabbit) | jq .[].State.StartedAt' -b Controller
overcloud-controller-1 | CHANGED | rc=0 >>
"2024-09-09T14:48:31.592566484+02:00"
overcloud-controller-2 | CHANGED | rc=0 >>
"2024-09-09T14:50:18.414106543+02:00"
overcloud-controller-0 | CHANGED | rc=0 >>
"2024-09-09T14:45:50.453585594+02:00"
~~~

12. Rabbit reports some errors, but not during the live migration window:
~~~
$ ansible -i inventory.yaml -m shell -a 'zgrep error /var/log/containers/stdouts/rabbitmq-bundle.log* |grep "2024-09-12" | cut -d: -f-3 | sort -u' -b Controller 
overcloud-controller-2 | CHANGED | rc=0 >>

overcloud-controller-1 | CHANGED | rc=0 >>
/var/log/containers/stdouts/rabbitmq-bundle.log.2.gz:2024-09-12T16:01
/var/log/containers/stdouts/rabbitmq-bundle.log.2.gz:2024-09-12T16:03
overcloud-controller-0 | CHANGED | rc=0 >>
/var/log/containers/stdouts/rabbitmq-bundle.log.2.gz:2024-09-12T16:01
~~~

13. It seems the migration never reached post_copy; it was aborted before that. Because post_copy is enabled, auto_converge is never triggered either.


Expected results:

1. post_copy should be triggered if enabled
2. Even when a single migration is stuck, nova on the source compute node should not stop logging to nova-compute.log
3. The connection to Rabbit should not break during live migration, and the healthcheck of nova compute should not fail as a result


Additional info:

This issue is not reproduced when post copy is disabled. While the live migration above failed after running for over an hour and a half and copying well over 13 times the RAM size of the VM (the memory_processed metric stops being updated, but the copy kept going the whole time), another VM with the same flavor and workload completes live migration in a fraction of that time, copying less than 2x the RAM size:

~~~
(overcloud) [stack.lab ~]$ ansible -i inventory.yaml -m shell -a 'crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf libvirt live_migration_permit_post_copy False ; systemctl restart tripleo_nova_compute.service' -b Compute
overcloud-novacompute-0 | CHANGED | rc=0 >>

overcloud-novacompute-1 | CHANGED | rc=0 >>

(overcloud) [stack.lab ~]$ 
(overcloud) [stack.lab ~]$ date ; time openstack server migrate --live-migration e81bfb27-89e9-4218-95f2-9fa9104f87f5 --wait ; date
Fri Sep 13 15:43:38 CEST 2024
Progress: 100Complete

real	8m56.001s
user	0m2.376s
sys	0m0.604s
Fri Sep 13 15:52:34 CEST 2024
(overcloud) [stack.lab ~]$ 
~~~

~~~
[tripleo-admin@overcloud-controller-0 ~]$ sudo podman exec $(sudo podman ps -q -f name=galera) mysql nova -e 'select created_at,updated_at,status,memory_total,memory_processed,memory_remaining from migrations where id=3047\G'
*************************** 1. row ***************************
      created_at: 2024-09-13 13:43:56
      updated_at: 2024-09-13 13:52:30
          status: completed
    memory_total: 34364792832
memory_processed: 60154154600
memory_remaining: 141168640
[tripleo-admin@overcloud-controller-0 ~]$ 
~~~

The customer reports the same kind of improvement when disabling post copy in the attached case.

Finally, while a failed live migration is an operation that can be recovered from, what is more serious is that we see other operations complete only partially while the migration is stuck, as shown above. For example, attaching or detaching volumes to/from the VM might complete in Cinder but not in Nova, or the other way around. This causes incidents that have to be fixed with the Cinder CLI in the best case, but frequently require SQL updates.

Comment 1 Eric Nothen 2024-09-13 14:03:21 UTC
Opened BZ#2312197 for the (related) issue of having both post copy and auto converge enabled by default.

Comment 2 melanie witt 2024-09-18 23:25:24 UTC
Thanks for the detailed writeup. There are a few things to mention based on what was captured, some clear and some not so clear.

The fact that post copy was not triggered might be expected behavior depending on how the Nova deployment is configured.

"Permit" is the key term in the live_migration_permit_post_copy [1] and live_migration_permit_auto_converge [2] configuration setting names. Each setting means whether Nova is "permitted" to invoke post copy or auto converge, respectively. It is valid for both to be permitted. Nova will activate either post copy or auto converge based on the following conditions:

From live_migration_permit_post_copy [1]:

"When permitted, post-copy mode will be automatically activated if we reach the timeout defined by live_migration_completion_timeout [3] and live_migration_timeout_action [4] is set to ‘force_complete’. Note if you change to no timeout or choose to use ‘abort’, i.e. live_migration_completion_timeout = 0, then there will be no automatic switch to post-copy."

From live_migration_permit_auto_converge [2]:

"Auto converge throttles down CPU if a progress of on-going live migration is slow. Auto converge will only be used if this flag is set to True and post copy is not permitted or post copy is unavailable due to the version of libvirt and QEMU in use."

So, post copy will only be triggered if live_migration_timeout_action [4] is set to "force_complete", and it appears there is no TripleO configuration parameter for live_migration_timeout_action (defaults to "abort") or for live_migration_completion_timeout (defaults to 800 seconds). So those must be specified as TripleO extra config if desired, something like this taken from an example in a KCS [5]:

nova::migration::libvirt::live_migration_completion_timeout: 10
nova::migration::libvirt::live_migration_timeout_action: 'force_complete'

The above means that if a live migration has not completed within 10 seconds, post copy will be triggered to complete the migration.

If [libvirt]live_migration_timeout_action in nova.conf is not set to 'force_complete', post copy will not be triggered.
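
For a quick test outside of a full TripleO-managed configuration, the equivalent nova.conf options could hypothetically be set directly on the compute nodes, mirroring the crudini pattern used elsewhere in this report (TripleO would revert such manual edits on the next deployment):
~~~
$ ansible -i inventory.yaml -m shell -a 'crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf libvirt live_migration_completion_timeout 10 ; crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf libvirt live_migration_timeout_action force_complete ; systemctl restart tripleo_nova_compute.service' -b Compute
~~~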


As for the RabbitMQ issues, it is unclear how or why they are related to the migration attempts given that live migration activities don't go over the RabbitMQ message queue. There are a couple of live migration performance related configuration options that you could consider trying to help figure out how the migration could be affecting the message queue.

There is a live_migration_bandwidth [6] setting which can be used to limit the maximum bandwidth in MiB/s to be used during live migration. Probably unlikely to be helpful given that you're dealing with migrations that are already not transferring memory fast enough to complete.

Finally, there is a live_migration_inbound_addr [7] setting which can be used to designate a separate network for live migration traffic:

"This option is useful in environments where the live-migration traffic can impact the network plane significantly. A separate network for live-migration traffic can then use this config option and avoids the impact on the management network."

That could be helpful if the network traffic from the migrations is somehow impeding communications over the message queue in your environment.
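
As a sketch only (the address below is a placeholder; each compute node would use its own IP on the dedicated migration network), that would translate to something like:
~~~
# placeholder address; substitute each compute node's IP on the migration network
$ sudo crudini --set /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf libvirt live_migration_inbound_addr 172.16.2.10
$ sudo systemctl restart tripleo_nova_compute.service
~~~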


[1] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_permit_post_copy
[2] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_permit_auto_converge
[3] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_completion_timeout
[4] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_timeout_action
[5] https://access.redhat.com/solutions/7053151
[6] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_bandwidth
[7] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_inbound_addr

Comment 3 Eric Nothen 2024-09-19 11:03:16 UTC
(In reply to melanie witt from comment #2)
> Thanks for the detailed writeup. There are a few things to mention based on
> what was captured, some clear and some not so clear.
> 
> The fact that post copy was not triggered might be expected behavior
> depending on how the Nova deployment is configured.
> 
> "Permit" is the key term in the live_migration_permit_post_copy [1] and
> live_migration_permit_auto_converge [2] configuration setting names. Each
> setting means whether Nova is "permitted" to invoke post copy or auto
> converge, respectively. It is valid for both to be permitted. Nova will
> activate either post copy or auto converge based on the following conditions:
> 

Thank you for the clarification. I understand now that "permitted" doesn't mean it's actually going to be used. 

> From live_migration_permit_post_copy [1]:
> 
> "When permitted, post-copy mode will be automatically activated if we reach
> the timeout defined by live_migration_completion_timeout [3] and
> live_migration_timeout_action [4] is set to ‘force_complete’. Note if you
> change to no timeout or choose to use ‘abort’, i.e.
> live_migration_completion_timeout = 0, then there will be no automatic
> switch to post-copy."
> 
> From live_migration_permit_auto_converge [2]:
> 
> "Auto converge throttles down CPU if a progress of on-going live migration
> is slow. Auto converge will only be used if this flag is set to True and
> post copy is not permitted or post copy is unavailable due to the version of
> libvirt and QEMU in use."
> 
> So, post copy will only be triggered if live_migration_timeout_action [4] is
> set to "force_complete" and it appears there is no TripleO configuration
> parameter for live_migration_timeout_action (defaults to "abort") or for
> live_migration_completion_timeout (defaults to 800) seconds. So those must
> be specified as a TripleO extra config if desired, something like this taken
> from an example in a KCS [5]:
> 
> nova::migration::libvirt::live_migration_completion_timeout: 10
> nova::migration::libvirt::live_migration_timeout_action: 'force_complete'
> 
> The above means that if a live migration has not completed within 10
> seconds, post copy will be triggered to complete the migration.
> 

Maybe I'm not completely understanding the intended default behavior, but what I read on [3] (live_migration_completion_timeout) is:

   Value is per GiB of guest RAM + disk to be transferred, with lower bound of a minimum of 2 GiB

We have a VM with 64GB of RAM and its disk on shared storage (no disk copy), so I assume it will copy at least 64GB plus whatever memory changes during the migration. In the best case (no memory updates at all), that is potentially a total wait of 64*800/3600 > 14 hours before forcing or aborting the migration. If I changed the default timeout to 10 seconds, Nova would still wait potentially at least 10 minutes before calling the timeout action. Am I getting this right?
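
Spelling out that arithmetic in the shell:
~~~
$ echo "default:    $(( 64 * 800 )) s (~$(( 64 * 800 / 3600 )) hours)"
default:    51200 s (~14 hours)
$ echo "timeout=10: $(( 64 * 10 )) s (~$(( 64 * 10 / 60 )) minutes)"
timeout=10: 640 s (~10 minutes)
~~~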


> If [libvirt]live_migration_timeout_action in nova.conf is not set to 
> 'force_complete', post copy will not be triggered.

Ok, so our default configuration permits post copy (because NovaLiveMigrationPermitPostCopy defaults to true) and permits auto converge (because NovaLiveMigrationPermitAutoConverge defaults to true), but won't actually trigger post copy (because timeout action is not set) and won't trigger auto converge either (because post copy is permitted/available). Is this correct?


> As for the RabbitMQ issues, it is unclear how or why they are related to the
> migration attempts given that live migration activities don't go over the
> RabbitMQ message queue. There are a couple of live migration performance
> related configuration options that you could consider trying to help figure
> out how the migration could be affecting the message queue.
> 
> There is a live_migration_bandwidth [6] setting which can be used to limit
> the maximum bandwidth in MiB/s to be used during live migration. Probably
> unlikely to be helpful given that you're dealing with migrations that are
> already not transferring memory fast enough to complete.

> Finally, there is a live_migration_inbound_addr [7] setting which can be
> used to designate a separate network for live migration traffic:
> 
> "This option is useful in environments where the live-migration traffic can
> impact the network plane significantly. A separate network for
> live-migration traffic can then use this config option and avoids the impact
> on the management network."
> 
> That could be helpful if the network traffic from the migrations is somehow
> impeding communications over the message queue in your environment.

I don't think this would help either. I have a very simple setup in my reproducer, so these parameters might help there, but the customer originally ran into this issue in a production environment with a dedicated storage VLAN on a 20 Gb/s link. What they see in Grafana is that even when the memory_processed, memory_remaining and progress fields are no longer updated in the database, the actual network traffic remains stable until the migration completes hours later.

It looks as if libvirt keeps copying memory from source to target for hours. Because the VM at the destination is never activated, the memory at the source keeps being updated and therefore still needs to be copied. Given that the connection to RabbitMQ is somehow temporarily broken (as reported by the healthcheck), we notice some side effects:

1. The database is not reflecting the actual amount of memory that is being copied. It just stops being updated after a while as I showed in the summary.
2. If they attempt to `nova live-migration-force-complete ...` the migration, it makes no difference. The VM does not transition state, and the copy remains visible in Grafana.
3. Unrelated operations on the same VM do not complete successfully. For example, operations adding or removing volumes from the VM while it is being migrated leave a mismatched state in the database, such as cinder.volume_attachment updated but nova.block_device_mapping not.


> 
> [1]
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.
> live_migration_permit_post_copy
> [2]
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.
> live_migration_permit_auto_converge
> [3]
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.
> live_migration_completion_timeout
> [4]
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.
> live_migration_timeout_action
> [5] https://access.redhat.com/solutions/7053151
> [6]
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.
> live_migration_bandwidth
> [7]
> https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.
> live_migration_inbound_addr

Comment 4 melanie witt 2024-09-19 20:15:01 UTC
(In reply to Eric Nothen from comment #3)
> Maybe I'm not understanding completely the intended default behavior, but
> what I read on [3] (live_migration_completion_timeout) is:
> 
>    Value is per GiB of guest RAM + disk to be transferred, with lower bound
> of a minimum of 2 GiB
> 
> We have a VM with 64GB of ram and disk on shared storage (no disk copy), so
> I assume it will copy at least 64GB plus whatever amount of memory that
> changes during the migration. In the best case (no memory updates at all),
> that is potentially a total wait of 64*800/3600 > 14 hours before forcing or
> aborting the migration. If I changed the default timeout to 10 seconds, Nova
> would still wait potentially at least 10 minutes before calling the timeout
> action.  Am I getting this right?

Yes, that is correct. I asked a colleague about these settings and they said that at the time, a 1G link was considered the norm and 10G was especially fast. That explains why these are all "per 1GB of data", I think.

So you'll be adjusting for the speed of your link to pick wait times that make sense for the size of your data (64GB). The clearest example I found with regard to how the settings work is actually in the code [1]:

    '''Calculate downtime value steps and time between increases.

    :param data_gb: total GB of RAM and disk to transfer

    This looks at the total downtime steps and upper bound
    downtime value and uses a linear function.

    For example, with 10 steps, 30 second step delay, 3 GB
    of RAM and 400ms target maximum downtime, the downtime will
    be increased every 90 seconds in the following progression:

    -   0 seconds -> set downtime to  40ms
    -  90 seconds -> set downtime to  76ms
    - 180 seconds -> set downtime to 112ms
    - 270 seconds -> set downtime to 148ms
    - 360 seconds -> set downtime to 184ms
    - 450 seconds -> set downtime to 220ms
    - 540 seconds -> set downtime to 256ms
    - 630 seconds -> set downtime to 292ms
    - 720 seconds -> set downtime to 328ms
    - 810 seconds -> set downtime to 364ms
    - 900 seconds -> set downtime to 400ms

    This allows the guest a good chance to complete migration
    with a small downtime value.
    '''
    downtime = CONF.libvirt.live_migration_downtime
    steps = CONF.libvirt.live_migration_downtime_steps
    delay = CONF.libvirt.live_migration_downtime_delay

    delay = int(delay * data_gb)

    base = downtime / steps
    offset = (downtime - base) / steps

    for i in range(steps + 1):
        yield (int(delay * i), int(base + offset * i))


So you have a few things to potentially set depending on how you want the downtime to be adjusted during the live migration: live_migration_downtime (defaults to 500ms) [2], live_migration_downtime_steps (defaults to 10) [3], and live_migration_downtime_delay (defaults to 75s) [4]. So with the current defaults and 64GB it would look like:

10 steps, 75 second step delay, 64GB RAM, no disk, and 500ms target maximum downtime

-     0 seconds               -> set downtime to  50ms
-  4800 seconds  (1.33 hours) -> set downtime to  95ms
-  9600 seconds  (2.66 hours) -> set downtime to 140ms
- 14400 seconds  (4.00 hours) -> set downtime to 185ms
- 19200 seconds  (5.33 hours) -> set downtime to 230ms
- 24000 seconds  (6.66 hours) -> set downtime to 275ms
- 28800 seconds  (8.00 hours) -> set downtime to 320ms
- 33600 seconds  (9.33 hours) -> set downtime to 365ms
- 38400 seconds (10.66 hours) -> set downtime to 410ms
- 43200 seconds (12.00 hours) -> set downtime to 455ms
- 48000 seconds (13.33 hours) -> set downtime to 500ms
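
The same schedule can be reproduced from a shell with the same arithmetic (a sketch using integer math only, not taken from Nova itself):
~~~
downtime=500; steps=10; delay=75; data_gb=64
delay=$(( delay * data_gb ))
base=$(( downtime / steps ))
offset=$(( (downtime - base) / steps ))
for i in $(seq 0 $steps); do
    # prints e.g. "  4800 seconds -> set downtime to  95ms"
    printf '%6d seconds -> set downtime to %3dms\n' $(( delay * i )) $(( base + offset * i ))
done
~~~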

And the live_migration_completion_timeout setting is indeed multiplied by the size of the data to get the actual timeout [5]:

                completion_timeout = int(
                    CONF.libvirt.live_migration_completion_timeout * data_gb)

So 10 seconds would turn into 64GB * 10 seconds = 640 seconds.

> > If [libvirt]live_migration_timeout_action in nova.conf is not set to 
> > 'force_complete', post copy will not be triggered.
> 
> Ok, so our default configuration permits post copy (because
> NovaLiveMigrationPermitPostCopy defaults to true) and permits auto converge
> (because NovaLiveMigrationPermitAutoConverge defaults to true), but won't
> actually trigger post copy (because timeout action is not set) and won't
> trigger auto converge either (because post copy is permitted/available). Is
> this correct?

Yes, that is correct.

> This, I also don't think would help. I do have a very simple setup in my
> reproducer so this parameters might help there, but originally, the customer
> ran into this issue in a production environment with a dedicated storage
> vlan that has a 20gbs link. What they see in Grafana is that even when the
> memory_processed, memory_remaining and progress fields are not updated any
> more on the database, the actual network traffic remains stable until the
> migration completes hours later.

A note regarding the dedicated storage VLAN with the 20G link: that link will not be used by default (the storage backend is not used by default for live migration). If you want to use the 20G link, you will need to set [libvirt]live_migration_inbound_addr [6] accordingly.

> It looks as if libvirt keeps copying memory from source to target for hours.
> Because the VM at destination is never activated, the memory at source keeps
> being updated and therefore still needs to be copied. Given that somehow the
> connection to Rabbit is temporarily broken (as reported by the healthcheck),
> there are some side effects noticed:
> 
> 1. The database is not reflecting the actual amount of memory that is being
> copied. It just stops being updated after a while as I showed in the summary.
> 2. If they attempt to `nova live-migration-force-complete ...` the
> migration, it makes no difference. The VM does not transition state, and the
> copy remains visible in Grafana.
> 3. Unrelated operations on the same VM do not complete successfully. For
> example, operations adding or removing volumes from the vm, while it's being
> migrated, leave a mismatching state on the database, such as
> cinder.volume_attachment updated but nova.block_device_mapping not.

The side effects you mention make sense if the message queue connection is broken. The database updates from nova-compute would get routed to nova-conductor over the message queue before they are written to the database ... so if the message queue is not reachable, updates wouldn't make it to nova-conductor (and so, won't make it to the database). Requests from the REST API would behave similarly because operations like 2 and 3 would have to go over the message queue to reach nova-compute. Prior to that nova-api had made some updates but nova-compute didn't/couldn't, so the mismatch happens.

To be clear, you saw the same message queue connection drop in your simplified reproducer? So far, I can't think why the post-copy-not-triggered would affect the message queue.


[1] https://github.com/openstack/nova/blob/9135a04591b36c22540030d96df31c754eb4433c/nova/virt/libvirt/migration.py#L736
[2] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_downtime
[3] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_downtime_steps
[4] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_downtime_delay
[5] https://github.com/openstack/nova/blob/9135a04591b36c22540030d96df31c754eb4433c/nova/virt/libvirt/driver.py#L11043
[6] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_inbound_addr

Comment 5 Eric Nothen 2024-09-20 16:22:48 UTC
(In reply to melanie witt from comment #4)
> 
> So you have a few things to potentially set depending on how you want the
> downtime to be adjusted during the live migration: live_migration_downtime
> (defaults to 500ms) [2], live_migration_downtime_steps (defaults to 10) [3],
> and live_migration_downtime_delay (defaults to 75s) [4]. So with the current
> defaults and 64GB it would look like:
> 

Thank you for the extended explanation, but I will put this on hold for now. My customer tested just disabling post copy, and this alone has been a huge improvement (4 hours to 6 minutes to migrate). I would have a very hard time explaining to them that they should enable post copy again and tune some other parameters and potentially get a similar improvement.

Instead, I would like to focus on two things:

1. Why nova breaks (it does stop logging and connecting to Rabbit so something must be going on)
2. The behavior with the default configuration.

More on these below.


> > Ok, so our default configuration permits post copy (because
> > NovaLiveMigrationPermitPostCopy defaults to true) and permits auto converge
> > (because NovaLiveMigrationPermitAutoConverge defaults to true), but won't
> > actually trigger post copy (because timeout action is not set) and won't
> > trigger auto converge either (because post copy is permitted/available). Is
> > this correct?
> 
> Yes, that is correct.
> 

Doesn't sound like our default configuration is optimal. Shouldn't we change something?

> The side effects you mention make sense if the message queue connection is
> broken. The database updates from nova-compute would get routed to
> nova-conductor over the message queue before they are written to the
> database ... so if the message queue is not reachable, updates wouldn't make
> it to nova-conductor (and so, won't make it to the database). Requests from
> the REST API would behave similarly because operations like 2 and 3 would
> have to go over the message queue to reach nova-compute. Prior to that
> nova-api had made some updates but nova-compute didn't/couldn't, so the
> mismatch happens.
> 
> To be clear, you saw the same message queue connection drop in your
> simplified reproducer? So far, I can't think why the post-copy-not-triggered
> would affect the message queue.
> 

I saw the connection closed on Rabbit. Today I enabled post copy again and tried another migration, with the same symptoms. Here's how it goes:

1. migration of vm5 starts at 15:32:03 and makes progress:
~~~
(overcloud) [stack.lab ~]$ scripts/migrate-vm.sh
Fri Sep 20 15:32:03 CEST 2024: Starting migration from overcloud-novacompute-0.keller.lab
Fri Sep 20 15:32:53 CEST 2024: migration_id=3064
Fri Sep 20 15:32:53 CEST 2024
progress: 1
memory_processed_bytes 1227182145                           
memory_remaining_bytes 33129431040                          
memory_total_bytes     34364792832                          

Fri Sep 20 15:33:42 CEST 2024
progress: 23
memory_processed_bytes 8642674068                           
memory_remaining_bytes 25403781120                          
memory_total_bytes     34364792832                          
...
~~~

2. By 16:06 the progress and memory_* fields are stuck, so I know it's broken:
~~~
Fri Sep 20 16:05:44 CEST 2024
progress: 67
memory_processed_bytes 206128218285                         
memory_remaining_bytes 9054208000                           
memory_total_bytes     34364792832                          

Fri Sep 20 16:06:37 CEST 2024
progress: 91
memory_processed_bytes 212163573750                         
memory_remaining_bytes 3030519808                           
memory_total_bytes     34364792832                          

Fri Sep 20 16:07:29 CEST 2024
progress: 91
memory_processed_bytes 212163573750                         
memory_remaining_bytes 3030519808                           
memory_total_bytes     34364792832                          

Fri Sep 20 16:08:30 CEST 2024
progress: 91
memory_processed_bytes 212163573750                         
memory_remaining_bytes 3030519808                           
memory_total_bytes     34364792832                          
~~~

3. Checking on the source compute node, I see that nova-compute stopped logging at 16:06:37:
~~~
[root@overcloud-novacompute-0 ~]# tail -f /var/log/containers/nova/nova-compute.log
2024-09-20 16:06:33.347 8 DEBUG oslo_concurrency.processutils [req-970bc184-5d0d-4f30-ae5a-b3f1c99808f4 - - - - -] Running cmd (subprocess): ceph df --format=json --id openstack --conf /etc/ceph/ceph.conf execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:379
2024-09-20 16:06:33.816 8 DEBUG nova.virt.libvirt.migration [-] [instance: e81bfb27-89e9-4218-95f2-9fa9104f87f5] Current 50 elapsed 2029 steps [(0, 50), (2400, 95), (4800, 140), (7200, 185), (9600, 230), (12000, 275), (14400, 320), (16800, 365), (19200, 410), (21600, 455), (24000, 500)] update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:509
2024-09-20 16:06:33.817 8 DEBUG nova.virt.libvirt.migration [-] [instance: e81bfb27-89e9-4218-95f2-9fa9104f87f5] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521
2024-09-20 16:06:36.110 8 DEBUG oslo_concurrency.processutils [req-970bc184-5d0d-4f30-ae5a-b3f1c99808f4 - - - - -] CMD "ceph df --format=json --id openstack --conf /etc/ceph/ceph.conf" returned: 0 in 2.764s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:416
2024-09-20 16:06:36.868 8 DEBUG nova.virt.libvirt.driver [-] [instance: e81bfb27-89e9-4218-95f2-9fa9104f87f5] Migration running for 725 secs, memory 9% remaining (bytes processed=212163573750, remaining=3030519808, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0). _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9285
2024-09-20 16:06:37.180 8 DEBUG nova.virt.libvirt.driver [req-970bc184-5d0d-4f30-ae5a-b3f1c99808f4 - - - - -] skipping disk for instance-000000f8 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-20 16:06:37.374 8 DEBUG nova.virt.libvirt.driver [req-970bc184-5d0d-4f30-ae5a-b3f1c99808f4 - - - - -] skipping disk for instance-00000101 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-20 16:06:37.375 8 DEBUG nova.virt.libvirt.driver [req-970bc184-5d0d-4f30-ae5a-b3f1c99808f4 - - - - -] skipping disk for instance-00000101 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9918
2024-09-20 16:06:37.487 8 DEBUG nova.virt.libvirt.migration [-] [instance: e81bfb27-89e9-4218-95f2-9fa9104f87f5] Current 50 elapsed 2033 steps [(0, 50), (2400, 95), (4800, 140), (7200, 185), (9600, 230), (12000, 275), (14400, 320), (16800, 365), (19200, 410), (21600, 455), (24000, 500)] update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:509
2024-09-20 16:06:37.487 8 DEBUG nova.virt.libvirt.migration [-] [instance: e81bfb27-89e9-4218-95f2-9fa9104f87f5] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521
~~~

4. A quick check on Rabbit shows it looks like it's alright:
~~~
(overcloud) [stack.lab ~]$ date ; ansible -i inventory.yaml -m shell -a 'podman exec $(podman ps -q -f name=rabbit) rabbitmq-diagnostics -q ping' -b Controller
Fri Sep 20 16:28:26 CEST 2024
overcloud-controller-2 | CHANGED | rc=0 >>
Ping succeeded
overcloud-controller-0 | CHANGED | rc=0 >>
Ping succeeded
overcloud-controller-1 | CHANGED | rc=0 >>
Ping succeeded
(overcloud) [stack.lab ~]$
(overcloud) [stack.lab ~]$ date ; ansible -i inventory.yaml -m shell -a 'podman exec $(podman ps -q -f name=rabbit) rabbitmq-diagnostics -q status' -b Controller
Fri Sep 20 16:29:25 CEST 2024
overcloud-controller-2 | CHANGED | rc=0 >>
Runtime

OS PID: 434
OS: Linux
Uptime (seconds): 5351
Is under maintenance?: false
RabbitMQ version: 3.8.16
Node name: rabbit@overcloud-controller-2
Erlang configuration: Erlang/OTP 23 [erts-11.2.2.17] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]
Erlang processes: 941 used, 1048576 limit
Scheduler run queue: 1
Cluster heartbeat timeout (net_ticktime): 15

Plugins

Enabled plugin file: /etc/rabbitmq/enabled_plugins
Enabled plugins:

 * rabbitmq_management
 * amqp_client
 * rabbitmq_web_dispatch
 * cowboy
 * cowlib
 * rabbitmq_management_agent

Data directory

Node data directory: /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-2
Raft data directory: /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-2/quorum/rabbit@overcloud-controller-2

Config files

 * /etc/rabbitmq/rabbitmq.config

Log file(s)

 * /var/log/rabbitmq/rabbit
 * /var/log/rabbitmq/rabbit

Alarms

(none)

Memory

Total memory used: 0.1579 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 6.6402 gb

other_proc: 0.0485 gb (30.69 %)
allocated_unused: 0.0388 gb (24.55 %)
code: 0.0328 gb (20.79 %)
other_system: 0.017 gb (10.79 %)
plugins: 0.007 gb (4.42 %)
other_ets: 0.0036 gb (2.29 %)
reserved_unallocated: 0.0026 gb (1.66 %)
binary: 0.0021 gb (1.35 %)
atom: 0.0015 gb (0.93 %)
queue_procs: 0.0013 gb (0.85 %)
mgmt_db: 0.001 gb (0.63 %)
mnesia: 0.0008 gb (0.5 %)
metrics: 0.0005 gb (0.3 %)
queue_slave_procs: 0.0003 gb (0.18 %)
quorum_ets: 0.0 gb (0.03 %)
msg_index: 0.0 gb (0.02 %)
connection_other: 0.0 gb (0.0 %)
connection_channels: 0.0 gb (0.0 %)
connection_readers: 0.0 gb (0.0 %)
connection_writers: 0.0 gb (0.0 %)
quorum_queue_procs: 0.0 gb (0.0 %)

File Descriptors

Total: 29, limit: 65439
Sockets: 27, limit: 58893

Free Disk Space

Low free disk space watermark: 0.05 gb
Free disk space: 82.773 gb

Totals

Connection count: 27
Queue count: 205
Virtual host count: 1

Listeners

Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: 192.168.24.15, port: 5672, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
overcloud-controller-0 | CHANGED | rc=0 >>
Runtime

OS PID: 420
OS: Linux
Uptime (seconds): 5405
Is under maintenance?: false
RabbitMQ version: 3.8.16
Node name: rabbit@overcloud-controller-0
Erlang configuration: Erlang/OTP 23 [erts-11.2.2.17] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]
Erlang processes: 1408 used, 1048576 limit
Scheduler run queue: 1
Cluster heartbeat timeout (net_ticktime): 15

Plugins

Enabled plugin file: /etc/rabbitmq/enabled_plugins
Enabled plugins:

 * rabbitmq_management
 * amqp_client
 * rabbitmq_web_dispatch
 * cowboy
 * cowlib
 * rabbitmq_management_agent

Data directory

Node data directory: /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0
Raft data directory: /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0/quorum/rabbit@overcloud-controller-0

Config files

 * /etc/rabbitmq/rabbitmq.config

Log file(s)

 * /var/log/rabbitmq/rabbit
 * /var/log/rabbitmq/rabbit

Alarms

(none)

Memory

Total memory used: 0.1574 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 6.6402 gb

other_proc: 0.0473 gb (29.59 %)
allocated_unused: 0.0389 gb (24.28 %)
code: 0.0328 gb (20.51 %)
other_system: 0.0173 gb (10.8 %)
plugins: 0.0077 gb (4.78 %)
other_ets: 0.0037 gb (2.3 %)
queue_procs: 0.0036 gb (2.22 %)
binary: 0.0031 gb (1.94 %)
queue_slave_procs: 0.0018 gb (1.09 %)
atom: 0.0015 gb (0.92 %)
mgmt_db: 0.0011 gb (0.7 %)
mnesia: 0.0008 gb (0.48 %)
metrics: 0.0005 gb (0.34 %)
quorum_ets: 0.0 gb (0.03 %)
msg_index: 0.0 gb (0.02 %)
connection_other: 0.0 gb (0.0 %)
connection_channels: 0.0 gb (0.0 %)
connection_readers: 0.0 gb (0.0 %)
connection_writers: 0.0 gb (0.0 %)
quorum_queue_procs: 0.0 gb (0.0 %)
reserved_unallocated: 0.0 gb (0.0 %)

File Descriptors

Total: 33, limit: 65439
Sockets: 31, limit: 58893

Free Disk Space

Low free disk space watermark: 0.05 gb
Free disk space: 82.6869 gb

Totals

Connection count: 31
Queue count: 205
Virtual host count: 1

Listeners

Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: 192.168.24.14, port: 5672, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
overcloud-controller-1 | CHANGED | rc=0 >>
Runtime

OS PID: 280
OS: Linux
Uptime (seconds): 5434
Is under maintenance?: false
RabbitMQ version: 3.8.16
Node name: rabbit@overcloud-controller-1
Erlang configuration: Erlang/OTP 23 [erts-11.2.2.17] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]
Erlang processes: 2769 used, 1048576 limit
Scheduler run queue: 1
Cluster heartbeat timeout (net_ticktime): 15

Plugins

Enabled plugin file: /etc/rabbitmq/enabled_plugins
Enabled plugins:

 * rabbitmq_management
 * amqp_client
 * rabbitmq_web_dispatch
 * cowboy
 * cowlib
 * rabbitmq_management_agent

Data directory

Node data directory: /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-1
Raft data directory: /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-1/quorum/rabbit@overcloud-controller-1

Config files

 * /etc/rabbitmq/rabbitmq.config

Log file(s)

 * /var/log/rabbitmq/rabbit
 * /var/log/rabbitmq/rabbit

Alarms

(none)

Memory

Total memory used: 0.2178 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 6.6402 gb

other_proc: 0.077 gb (34.98 %)
allocated_unused: 0.0615 gb (27.94 %)
code: 0.0328 gb (14.92 %)
other_system: 0.0177 gb (8.04 %)
plugins: 0.011 gb (4.98 %)
queue_procs: 0.0047 gb (2.14 %)
mgmt_db: 0.0044 gb (2.02 %)
other_ets: 0.0038 gb (1.75 %)
binary: 0.003 gb (1.36 %)
atom: 0.0015 gb (0.67 %)
metrics: 0.0013 gb (0.59 %)
mnesia: 0.0008 gb (0.38 %)
queue_slave_procs: 0.0005 gb (0.21 %)
quorum_ets: 0.0 gb (0.02 %)
msg_index: 0.0 gb (0.01 %)
connection_other: 0.0 gb (0.0 %)
connection_channels: 0.0 gb (0.0 %)
connection_readers: 0.0 gb (0.0 %)
connection_writers: 0.0 gb (0.0 %)
quorum_queue_procs: 0.0 gb (0.0 %)
reserved_unallocated: 0.0 gb (0.0 %)

File Descriptors

Total: 134, limit: 65439
Sockets: 132, limit: 58893

Free Disk Space

Low free disk space watermark: 0.05 gb
Free disk space: 82.8217 gb

Totals

Connection count: 132
Queue count: 205
Virtual host count: 1

Listeners

Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: 192.168.24.23, port: 5672, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
~~~

5. Creating a VM on the source compute fails, which confirms a RabbitMQ connection issue on that compute:
~~~
(overcloud) [stack.lab ~]$ date ; openstack server create --image cirros-0.5.2 --flavor 1c1g --network testnet vm5 --host overcloud-novacompute-0.keller.lab --wait
Fri Sep 20 16:40:05 CEST 2024
Error creating server: vm5
Error creating server
(overcloud) [stack.lab ~]$
(overcloud) [stack.lab ~]$ openstack server event list vm5
+------------------------------------------+--------------------------------------+--------+----------------------------+
| Request ID                               | Server ID                            | Action | Start Time                 |
+------------------------------------------+--------------------------------------+--------+----------------------------+
| req-96eaa786-0a25-40f3-9b99-dd5f3aa23f2c | f6f6e4bb-9edf-4d1c-a82f-eec508ec2d6a | create | 2024-09-20T14:40:12.000000 |
+------------------------------------------+--------------------------------------+--------+----------------------------+
(overcloud) [stack.lab ~]$ 
(overcloud) [stack.lab ~]$ openstack server event show vm5 req-96eaa786-0a25-40f3-9b99-dd5f3aa23f2c -c events -f value
[{'event': 'conductor_schedule_and_build_instances', 'start_time': '2024-09-20T14:40:43.000000', 'finish_time': '2024-09-20T14:40:45.000000', 'result': 'Error', 'traceback': '  File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 1379, in schedule_and_build_instances\n    instance_uuids, return_alternates=True)\n  File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 839, in _schedule_instances\n    return_alternates=return_alternates)\n  File "/usr/lib/python3.6/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations\n    instance_uuids, return_objects, return_alternates)\n  File "/usr/lib/python3.6/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations\n    return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call\n    transport_options=self.transport_options)\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send\n    transport_options=transport_options)\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send\n    transport_options=transport_options)\n  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 664, in _send\n    raise result\n', 'host': 'overcloud-controller-0.keller.lab', 'hostId': 'c7d7a18634c4f86c107f8840e764e4bcc525dd16115359beddd0fca3'}]
(overcloud) [stack.lab ~]$ 
~~~

6. Creating some VMs on the target compute does succeed:
~~~
(overcloud) [stack.lab ~]$ openstack server create --image cirros-0.5.2 --flavor 1c1g --network testnet vm6 --host overcloud-novacompute-1.keller.lab --wait
...
(overcloud) [stack.lab ~]$ openstack server create --image cirros-0.5.2 --flavor 1c1g --network testnet vm7 --host overcloud-novacompute-1.keller.lab --wait
...
(overcloud) [stack.lab ~]$ openstack server create --image cirros-0.5.2 --flavor 1c1g --network testnet vm8 --host overcloud-novacompute-1.keller.lab --wait
...
(overcloud) [stack.lab ~]$ openstack server list --long --fit
+--------------------+------+--------+------------+-------------+--------------------+--------------+--------------------+--------+-------------------+----------------------+------------+
| ID                 | Name | Status | Task State | Power State | Networks           | Image Name   | Image ID           | Flavor | Availability Zone | Host                 | Properties |
+--------------------+------+--------+------------+-------------+--------------------+--------------+--------------------+--------+-------------------+----------------------+------------+
| 1ce708e0-55d6-48b4 | vm8  | ACTIVE | None       | Running     | testnet=10.0.0.170 | cirros-0.5.2 | 17eb3a79-4e09-443b | 1c1g   | nova              | overcloud-novacomput |            |
| -908a-f71990fdf85f |      |        |            |             |                    |              | -b9c7-0164249fc7d2 |        |                   | e-1.keller.lab       |            |
| f40fb5df-ce5c-40e3 | vm7  | ACTIVE | None       | Running     | testnet=10.0.0.219 | cirros-0.5.2 | 17eb3a79-4e09-443b | 1c1g   | nova              | overcloud-novacomput |            |
| -8442-12f5fbe5ad4e |      |        |            |             |                    |              | -b9c7-0164249fc7d2 |        |                   | e-1.keller.lab       |            |
| 17e11d78-ee73-4a88 | vm6  | ACTIVE | None       | Running     | testnet=10.0.0.88  | cirros-0.5.2 | 17eb3a79-4e09-443b | 1c1g   | nova              | overcloud-novacomput |            |
| -8251-7ceecb91a823 |      |        |            |             |                    |              | -b9c7-0164249fc7d2 |        |                   | e-1.keller.lab       |            |
| f6f6e4bb-9edf-4d1c | vm5  | ERROR  | None       | NOSTATE     |                    | cirros-0.5.2 | 17eb3a79-4e09-443b | 1c1g   |                   | None                 |            |
| -a82f-eec508ec2d6a |      |        |            |             |                    |              | -b9c7-0164249fc7d2 |        |                   |                      |            |
+--------------------+------+--------+------------+-------------+--------------------+--------------+--------------------+--------+-------------------+----------------------+------------+
(overcloud) [stack.lab ~]$ 
~~~

7. On the compute, I check if the nova_compute container is running and responsive, and there's no problem with that:
~~~
[root@overcloud-novacompute-0 ~]# podman exec nova_compute date
Fri Sep 20 16:52:35 CEST 2024
[root@overcloud-novacompute-0 ~]# podman exec nova_compute tail -1 /var/log/nova/nova-compute.log
2024-09-20 16:06:37.487 8 DEBUG nova.virt.libvirt.migration [-] [instance: e81bfb27-89e9-4218-95f2-9fa9104f87f5] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521
[root@overcloud-novacompute-0 ~]#
~~~

8. The RabbitMQ logs on all controllers show that the connections from the source compute were dropped about 2 minutes after nova-compute stopped logging, and that more than an hour later no connection had been re-established from that compute:
~~~
(overcloud) [stack.lab ~]$ date ; ansible -i inventory.yaml -m shell -a 'grep 192.168.24.8 /var/log/containers/rabbitmq/rabbit@overcloud-controller-*.log' -b Controller
Fri Sep 20 17:18:06 CEST 2024
overcloud-controller-2 | CHANGED | rc=0 >>
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:46.094 [info] <0.5282.0> accepting AMQP connection <0.5282.0> (192.168.24.8:59680 -> 192.168.24.15:5672)
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:46.096 [info] <0.5282.0> Connection <0.5282.0> (192.168.24.8:59680 -> 192.168.24.15:5672) has a client-provided name: nova-compute:8:47bef70d-4a9c-4ebc-928c-3e62a939a052
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:46.097 [info] <0.5282.0> connection <0.5282.0> (192.168.24.8:59680 -> 192.168.24.15:5672 - nova-compute:8:47bef70d-4a9c-4ebc-928c-3e62a939a052): user 'guest' authenticated and granted access to vhost '/'
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:44:19.795 [info] <0.9261.0> accepting AMQP connection <0.9261.0> (192.168.24.8:35996 -> 192.168.24.15:5672)
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:44:19.817 [info] <0.9261.0> Connection <0.9261.0> (192.168.24.8:35996 -> 192.168.24.15:5672) has a client-provided name: nova-compute:8:97807bca-83ea-4dec-b7a0-d8aaa842f85b
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:44:19.858 [info] <0.9261.0> connection <0.9261.0> (192.168.24.8:35996 -> 192.168.24.15:5672 - nova-compute:8:97807bca-83ea-4dec-b7a0-d8aaa842f85b): user 'guest' authenticated and granted access to vhost '/'
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:46.310 [error] <0.5282.0> closing AMQP connection <0.5282.0> (192.168.24.8:59680 -> 192.168.24.15:5672 - nova-compute:8:47bef70d-4a9c-4ebc-928c-3e62a939a052):
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:46.366 [info] <0.12600.0> Closing all channels from connection '192.168.24.8:59680 -> 192.168.24.15:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:09:19.993 [error] <0.9261.0> closing AMQP connection <0.9261.0> (192.168.24.8:35996 -> 192.168.24.15:5672 - nova-compute:8:97807bca-83ea-4dec-b7a0-d8aaa842f85b):
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:09:20.047 [info] <0.12711.0> Closing all channels from connection '192.168.24.8:35996 -> 192.168.24.15:5672' because it has been closed
overcloud-controller-1 | CHANGED | rc=0 >>
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.852 [warning] <0.3635.0> closing AMQP connection <0.3635.0> (192.168.24.8:44480 -> 192.168.24.23:5672 - nova-compute:9:2fd2712a-3385-414d-a33f-33e648c085f2, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.853 [warning] <0.3640.0> closing AMQP connection <0.3640.0> (192.168.24.8:44488 -> 192.168.24.23:5672 - nova-compute:9:d15a2557-c441-4bb5-a676-7ff2e832b96a, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.853 [warning] <0.3625.0> closing AMQP connection <0.3625.0> (192.168.24.8:44454 -> 192.168.24.23:5672 - nova-compute:9:9c37e179-9cec-4805-9a2d-22d576e51be1, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.853 [warning] <0.3630.0> closing AMQP connection <0.3630.0> (192.168.24.8:44470 -> 192.168.24.23:5672 - nova-compute:9:6ad19652-17a5-419e-a91d-54ff2749881e, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.858 [warning] <0.2559.0> closing AMQP connection <0.2559.0> (192.168.24.8:50632 -> 192.168.24.23:5672 - nova-compute:9:12864273-c014-46a7-a1fb-1c01f2b56f97, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.860 [warning] <0.3107.0> closing AMQP connection <0.3107.0> (192.168.24.8:44444 -> 192.168.24.23:5672 - nova-compute:9:c843c2d0-dd06-46b3-9aca-b7ea239e6513, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.864 [info] <0.9814.0> Closing all channels from connection '192.168.24.8:44488 -> 192.168.24.23:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.864 [info] <0.9811.0> Closing all channels from connection '192.168.24.8:44480 -> 192.168.24.23:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.869 [info] <0.9831.0> Closing all channels from connection '192.168.24.8:50632 -> 192.168.24.23:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.869 [info] <0.9819.0> Closing all channels from connection '192.168.24.8:44454 -> 192.168.24.23:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.869 [info] <0.9824.0> Closing all channels from connection '192.168.24.8:44470 -> 192.168.24.23:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:41.869 [info] <0.9837.0> Closing all channels from connection '192.168.24.8:44444 -> 192.168.24.23:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:50.685 [info] <0.9865.0> accepting AMQP connection <0.9865.0> (192.168.24.8:49368 -> 192.168.24.23:5672)
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:50.687 [info] <0.9865.0> Connection <0.9865.0> (192.168.24.8:49368 -> 192.168.24.23:5672) has a client-provided name: nova-compute:8:f137cc11-dc61-4f21-a33f-6430a2eaf284
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:50.690 [info] <0.9865.0> connection <0.9865.0> (192.168.24.8:49368 -> 192.168.24.23:5672 - nova-compute:8:f137cc11-dc61-4f21-a33f-6430a2eaf284): user 'guest' authenticated and granted access to vhost '/'
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:20:14.097 [info] <0.10166.0> accepting AMQP connection <0.10166.0> (192.168.24.8:60322 -> 192.168.24.23:5672)
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:20:14.100 [info] <0.10166.0> Connection <0.10166.0> (192.168.24.8:60322 -> 192.168.24.23:5672) has a client-provided name: nova-compute:8:7f4fa811-893f-4f62-bc15-b25e4ae3be3c
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:20:14.102 [info] <0.10166.0> connection <0.10166.0> (192.168.24.8:60322 -> 192.168.24.23:5672 - nova-compute:8:7f4fa811-893f-4f62-bc15-b25e4ae3be3c): user 'guest' authenticated and granted access to vhost '/'
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:14.361 [error] <0.10166.0> closing AMQP connection <0.10166.0> (192.168.24.8:60322 -> 192.168.24.23:5672 - nova-compute:8:7f4fa811-893f-4f62-bc15-b25e4ae3be3c):
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:14.399 [info] <0.20785.0> Closing all channels from connection '192.168.24.8:60322 -> 192.168.24.23:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:50.877 [error] <0.9865.0> closing AMQP connection <0.9865.0> (192.168.24.8:49368 -> 192.168.24.23:5672 - nova-compute:8:f137cc11-dc61-4f21-a33f-6430a2eaf284):
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:50.911 [info] <0.20877.0> Closing all channels from connection '192.168.24.8:49368 -> 192.168.24.23:5672' because it has been closed
overcloud-controller-0 | CHANGED | rc=0 >>
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:46.052 [info] <0.6624.0> accepting AMQP connection <0.6624.0> (192.168.24.8:35560 -> 192.168.24.14:5672)
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:46.053 [info] <0.6624.0> Connection <0.6624.0> (192.168.24.8:35560 -> 192.168.24.14:5672) has a client-provided name: nova-compute:8:8ba62138-4b9e-45ea-8a75-e898991388d5
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:19:46.055 [info] <0.6624.0> connection <0.6624.0> (192.168.24.8:35560 -> 192.168.24.14:5672 - nova-compute:8:8ba62138-4b9e-45ea-8a75-e898991388d5): user 'guest' authenticated and granted access to vhost '/'
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:22:18.467 [info] <0.7175.0> accepting AMQP connection <0.7175.0> (192.168.24.8:49900 -> 192.168.24.14:5672)
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:22:18.484 [info] <0.7175.0> Connection <0.7175.0> (192.168.24.8:49900 -> 192.168.24.14:5672) has a client-provided name: nova-compute:8:cd2b4aca-9dbe-4c44-ae7d-1851e7c0f12d
/var/log/containers/rabbitmq/rabbit:2024-09-20 15:22:18.489 [info] <0.7175.0> connection <0.7175.0> (192.168.24.8:49900 -> 192.168.24.14:5672 - nova-compute:8:cd2b4aca-9dbe-4c44-ae7d-1851e7c0f12d): user 'guest' authenticated and granted access to vhost '/'
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:18.701 [error] <0.7175.0> closing AMQP connection <0.7175.0> (192.168.24.8:49900 -> 192.168.24.14:5672 - nova-compute:8:cd2b4aca-9dbe-4c44-ae7d-1851e7c0f12d):
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:18.723 [info] <0.13317.0> Closing all channels from connection '192.168.24.8:49900 -> 192.168.24.14:5672' because it has been closed
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:46.310 [error] <0.6624.0> closing AMQP connection <0.6624.0> (192.168.24.8:35560 -> 192.168.24.14:5672 - nova-compute:8:8ba62138-4b9e-45ea-8a75-e898991388d5):
/var/log/containers/rabbitmq/rabbit:2024-09-20 16:08:46.409 [info] <0.13373.0> Closing all channels from connection '192.168.24.8:35560 -> 192.168.24.14:5672' because it has been closed
(overcloud) [stack.lab ~]$ 
~~~

9. About 17:27 (11:27 inside the VM... sorry, forgot to change the timezone there), with neither the migration making progress nor nova compute reconnecting, I kill the stress-ng jobs inside the VM:
~~~
[root@vm2 ~]# date ; pkill stress-ng-mem-load
Fri Sep 20 11:27:47 EDT 2024
[root@vm2 ~]# date ; pkill stress-ng
Fri Sep 20 11:27:49 EDT 2024
[root@vm2 ~]# 
~~~

10. Within seconds, nova compute starts logging again on the source compute:
~~~
2024-09-20 16:06:37.487 8 DEBUG nova.virt.libvirt.migration [-] [instance: e81bfb27-89e9-4218-95f2-9fa9104f87f5] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521

2024-09-20 17:28:24.441 8 ERROR oslo.messaging._drivers.impl_rabbit [-] [f137cc11-dc61-4f21-a33f-6430a2eaf284] AMQP server on overcloud-controller-1.ctlplane.keller.lab:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
~~~

11. And rabbitmq shows the re-connection:
~~~
[root@overcloud-controller-0 ~]# tail -f /var/log/containers/rabbitmq/rabbit\@overcloud-controller-0.log | grep 192.168.24.8
2024-09-20 17:28:24.527 [info] <0.23467.0> accepting AMQP connection <0.23467.0> (192.168.24.8:47346 -> 192.168.24.14:5672)
2024-09-20 17:28:24.550 [info] <0.23467.0> Connection <0.23467.0> (192.168.24.8:47346 -> 192.168.24.14:5672) has a client-provided name: nova-compute:8:cd2b4aca-9dbe-4c44-ae7d-1851e7c0f12d
2024-09-20 17:28:24.555 [info] <0.23467.0> connection <0.23467.0> (192.168.24.8:47346 -> 192.168.24.14:5672 - nova-compute:8:cd2b4aca-9dbe-4c44-ae7d-1851e7c0f12d): user 'guest' authenticated and granted access to vhost '/'
2024-09-20 17:28:25.513 [info] <0.23482.0> accepting AMQP connection <0.23482.0> (192.168.24.8:47362 -> 192.168.24.14:5672)
2024-09-20 17:28:25.520 [info] <0.23482.0> Connection <0.23482.0> (192.168.24.8:47362 -> 192.168.24.14:5672) has a client-provided name: nova-compute:8:8ba62138-4b9e-45ea-8a75-e898991388d5
2024-09-20 17:28:25.530 [info] <0.23482.0> connection <0.23482.0> (192.168.24.8:47362 -> 192.168.24.14:5672 - nova-compute:8:8ba62138-4b9e-45ea-8a75-e898991388d5): user 'guest' authenticated and granted access to vhost '/'
~~~

12. At this point, one could think that maybe the compute is too small and the VM is too big (that wouldn't match the customer case anyway, because their computes are huge), but CPU utilization doesn't look like it's anywhere near the limit:
~~~
[root@overcloud-novacompute-0 ~]# sar -s 15:00 -e 18:00
Linux 4.18.0-305.131.1.el8_4.x86_64 (overcloud-novacompute-0.keller.lab) 	09/20/2024 	_x86_64_	(8 CPU)

03:07:39 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
03:10:03 PM     all      0.97      0.00      0.76      0.30      3.60     94.36
03:20:03 PM     all      3.34      0.00      2.63      0.19      5.72     88.11 
03:30:03 PM     all      2.84      0.00      2.39      0.16      4.06     90.55 v-- Migration starts
03:40:03 PM     all     19.36      0.00      5.03      0.44     22.33     52.84
03:50:00 PM     all     24.01      0.10      6.33      0.60     27.86     41.11
04:00:02 PM     all     28.74      0.00      4.50      0.57     28.36     37.83
04:10:03 PM     all     25.99      0.00      4.17      0.60     28.15     41.10
04:20:03 PM     all     25.18      0.00      4.34      0.51     26.34     43.62 v-- Migration gets stuck
04:30:03 PM     all     28.42      0.00      4.89      0.54     27.34     38.81
04:40:03 PM     all     22.91      0.00      4.20      0.53     25.70     46.65
04:50:03 PM     all     25.92      0.00      4.46      0.63     26.15     42.83
05:00:02 PM     all     17.27      0.00      4.65      0.81     28.19     49.08
05:10:03 PM     all     28.39      0.00      4.77      0.62     27.97     38.25
05:20:00 PM     all     25.30      0.00      3.88      0.69     27.34     42.79
05:30:03 PM     all     20.03      0.00      3.62      0.38     21.69     54.27 ^-- stress-ng jobs killed
05:40:03 PM     all     15.62      0.00      1.35      0.32     15.05     67.66
05:50:03 PM     all     14.25      0.02      1.50      0.44     15.79     67.99
Average:        all     20.42      0.01      3.89      0.50     22.28     52.90
[root@overcloud-novacompute-0 ~]# 
~~~

So, I know for sure RabbitMQ is not down on any controller at any point. On the source compute, the nova_compute container is also not down, unresponsive or lagging, but it's neither sending anything to the log file nor attempting to re-connect at all.

Comment 6 melanie witt 2024-09-21 00:45:26 UTC
(In reply to Eric Nothen from comment #5)
> Thank you for the extended explanation, but I will put this on hold for now.
> My customer tested just disabling post copy, and this alone has been a huge
> improvement (4 hours to 6 minutes to migrate). I would have a very hard time
> explaining to them that they should enable post copy again and tune some other
> parameters and potentially get a similar improvement.

That's fair. So in their case they are likely getting auto-convergence triggered since that's permitted by default. Generally speaking, auto-convergence is not desirable vs post-copy because auto-convergence throttles down the instance (and thus the workload) to facilitate the memory copy completion. That is why it's the fallback if post-copy is not available. If the customer's workload can tolerate the throttling down and they find it acceptable, it is valid for them to choose that option over post-copy.

> Instead, I would like to focus on two things:
> 
> 1. Why nova breaks (it does stop logging and connecting to Rabbit so
> something must be going on)
> 2. The behavior with the default configuration.
> 
> More on these below.
> 
> 
> > > Ok, so our default configuration permits post copy (because
> > > NovaLiveMigrationPermitPostCopy defaults to true) and permits auto converge
> > > (because NovaLiveMigrationPermitAutoConverge defaults to true), but won't
> > > actually trigger post copy (because timeout action is not set) and won't
> > > trigger auto converge either (because post copy is permitted/available). Is
> > > this correct?
> > 
> > Yes, that is correct.
> > 
> 
> Doesn't sound like our default configuration is optimal. Shouldn't we change
> something?

From what I understand, post-copy and/or auto-converge are not appropriate for all environments and are workload-specific. So it is offered as something to opt into -- i.e. if you know post-copy and/or auto-converge are appropriate for your workloads, you can choose to use them.

In the case of this customer, they would opt in by setting live_migration_timeout_action=force_complete. The caveat there though is that the live_migration_completion_timeout _must_ be set much much lower than the default for a fast network link, because the timeout gets multiplied by the size of the data being transferred.
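
To make that opt-in concrete, here is a sketch of the relevant nova.conf settings (the values are illustrative placeholders only; the right completion timeout depends on the link speed, since the effective timeout is this value multiplied by data_gb):
~~~
[libvirt]
# post-copy is already permitted in this deployment via
# NovaLiveMigrationPermitPostCopy; shown here only for completeness
live_migration_permit_post_copy = true
# opt in: when the completion timeout fires, switch the migration to
# post-copy instead of aborting it
live_migration_timeout_action = force_complete
# per-GB timeout: 30 here would mean 64GB * 30s = 32 minutes before the
# timeout action fires, versus roughly 14 hours with the default of 800
live_migration_completion_timeout = 30
~~~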

I looked around a bit for other info and found a couple of old spec proposals upstream [1][2]. The gist of these was about providing a way to enable more fine-grained control over post-copy and auto-convergence to be per instance rather than per compute host. I think they help illustrate how specific post-copy and auto-convergence can be and why they are not configured to trigger by default.

That said, I'll put this on the agenda for our team triage call next week to discuss the defaults and whether there are reasonable changes we could make to improve the experience.
 
> I saw the connection closed on Rabbit. Today I enabled post copy again and
> tried another migration, with the same symptoms. Here's how it goes:
> 
> [...]
>
> So, I know for sure RabbitMQ is not down on any controller at any point. On
> the source compute, the nova_compute container is also not down,
> unresponsive or lagging, but it's neither sending anything to the log file
> nor attempting to re-connect at all.

Thanks for providing all this data. It's as though the live migration is somehow blocking all other activity of nova-compute ... That reminds me of a fix that I don't think made it back into the Train release, which fixed service-wide pauses in nova-compute [3]. However if I understand correctly, this patch wouldn't help in this case because the bug had to do with libvirt calls that return lists and the live migration calls don't involve lists returned from libvirt.

I'll share your findings with the team at our triage call and get their input.


[1] https://specs.openstack.org/openstack/nova-specs/specs/train/approved/expose-auto-converge-post-copy.html
[2] https://review.opendev.org/c/openstack/nova-specs/+/693655/3/specs/ussuri/approved/live-migration-without-performance-impact.rst
[3] https://review.opendev.org/c/openstack/nova/+/677736

Comment 7 melanie witt 2024-09-30 22:47:57 UTC
We discussed the post-copy and auto-converge config options and the team confirmed that because these options are so workload specific, they cannot be configured to trigger by default. It really depends on the admin/operator of the deployment to decide whether the workloads can tolerate the downsides of post-copy or auto-convergence. We are taking an action item to add to the documentation with detailed explanations of how each strategy works and what the tradeoffs are, to help admins/operators decide whether they want to trigger either of post-copy or auto-converge.

Regarding the nova-compute blocking behavior, it is not a scenario that anyone on the team had experience with before and there are no obvious potential causes. A colleague suggested collecting a GMR [1] given that you are able to reliably reproduce the behavior. The GMR will show what code the threads are currently running when the process is signaled and maybe that will provide some clues as to why this is happening.

[1] https://docs.openstack.org/nova/latest/reference/gmr.html

Comment 8 Eric Nothen 2024-10-02 10:41:58 UTC
(In reply to melanie witt from comment #7)
> We discussed the post-copy and auto-converge config options and the team
> confirmed that because these options are so workload specific, they cannot
> be configured to trigger by default.

I would challenge why we would want to enable by default two mechanisms that are so workload specific. Moreover, why do we want to enable auto-converge by default if, according to what we discussed here, it's never going to be triggered anyway (unless the customer explicitly disables the other mechanism that we are enabling by default)?

Anyhow, whether the configuration is right or wrong is better discussed in BZ#2312197, which I opened precisely for that.


> It really depends on the admin/operator
> of the deployment to decide whether the workloads can tolerate the downsides
> of post-copy or auto-convergence. We are taking an action item to add to the
> documentation with detailed explanations of how each strategy works and what
> the tradeoffs are, to help admins/operators decide whether they want to
> trigger either of post-copy or auto-converge.

This will be good, thanks.

> Regarding the nova-compute blocking behavior, it is not a scenario that
> anyone on the team had experience with before and there are no obvious
> potential causes. A colleague suggested collecting a GMR [1] given that you
> are able to reliably reproduce the behavior. The GMR will show what code the
> threads are currently running when the process is signaled and maybe that
> will provide some clues as to why this is happening.
> 
> [1] https://docs.openstack.org/nova/latest/reference/gmr.html

I will be uploading this information soon, but be aware that it doesn't happen as you would expect. The report seems to be "queued" when I trigger it and only shows up in the log after the migration fails, not immediately after signaling the process. In contrast, when I trigger the report while no migration is running, or even while a migration is running but not stuck, it comes up in the log right away.

Comment 9 Eric Nothen 2024-10-02 12:32:29 UTC
Repeating the process described above with the objective of capturing a GMR.


1. Migration starts ~10:16:
~~~
(overcloud) [stack.lab ~]$ time scripts/migrate-vm.sh 
Wed Oct  2 10:16:19 CEST 2024: Starting migration from overcloud-novacompute-1.keller.lab
Wed Oct  2 10:16:29 CEST 2024: Getting migration id...
Wed Oct  2 10:16:32 CEST 2024: Migration id is 98
Wed Oct  2 10:16:32 CEST 2024
progress: 0
memory_processed_bytes -                                    
memory_remaining_bytes -                                    
memory_total_bytes     -                                    

Wed Oct  2 10:17:20 CEST 2024
progress: 12
memory_processed_bytes 4948618924                           
memory_remaining_bytes 28995780608                          
memory_total_bytes     34364792832                          
~~~

2. About an hour and 40 minutes later, seeing the migration is copying and not stuck, I trigger a GMR and immediately get the output on the log file:
~~~
[root@overcloud-novacompute-1 ~]# podman exec -ti nova_compute bash
bash-4.4$ pgrep nova-compute
9
bash-4.4$ 
bash-4.4$ 
bash-4.4$ date ; kill -USR2 9
Wed Oct  2 11:57:07 CEST 2024
bash-4.4$ 
bash-4.4$ 
~~~

/var/log/containers/stdouts/nova_compute.log
~~~
2024-10-02T11:57:08.011858995+02:00 stderr F ========================================================================
2024-10-02T11:57:08.011858995+02:00 stderr F ====                        Guru Meditation                         ====
2024-10-02T11:57:08.011858995+02:00 stderr F ========================================================================
2024-10-02T11:57:08.011858995+02:00 stderr F ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2024-10-02T11:57:08.011858995+02:00 stderr F 
2024-10-02T11:57:08.011858995+02:00 stderr F 
...
~~~


3. Some ~20 minutes later (two hours after it started), the migration has copied ~803GB of RAM (25 times the RAM size of the VM being migrated), and is still copying:
~~~
Wed Oct  2 12:16:21 CEST 2024
progress: 90
memory_processed_bytes 863009604188                         
memory_remaining_bytes 2423255040                           
memory_total_bytes     34364792832                          
~~~


4. By 12:18, after more than 2 hours of copying RAM, it gets stuck (memory_processed_bytes doesn't change for 90 seconds or more):
~~~
Wed Oct  2 12:17:08 CEST 2024
progress: 89
memory_processed_bytes 871438397704                         
memory_remaining_bytes 686157824                            
memory_total_bytes     34364792832                          

Wed Oct  2 12:18:03 CEST 2024
progress: 99
memory_processed_bytes 878011974918                         
memory_remaining_bytes 4668698624                           
memory_total_bytes     34364792832                          

Wed Oct  2 12:18:55 CEST 2024
progress: 86
memory_processed_bytes 878011974918                         
memory_remaining_bytes 4668698624                           
memory_total_bytes     34364792832                          

Wed Oct  2 12:19:43 CEST 2024
progress: 86
memory_processed_bytes 878011974918                         
memory_remaining_bytes 4668698624                           
memory_total_bytes     34364792832                          
~~~


5. On the source compute node, I trigger another GMR, but this time nothing appears in the log, so after checking different log files I trigger a second one, and again nothing comes out. It seems as if the container is always responsive, but the running nova process is not.
~~~
bash-4.4$ date ; kill -USR2 9
Wed Oct  2 12:20:21 CEST 2024
bash-4.4$ 
bash-4.4$ 
bash-4.4$ date ; kill -USR2 9
Wed Oct  2 12:25:22 CEST 2024
bash-4.4$ 
~~~

6. While I am still checking the log file, the migration fails and the nova_compute container "unlocks itself", spitting out one GMR followed by a traceback (file to be attached, as it's too big to put inside the comment):
~~~
[root@overcloud-novacompute-1 ~]# cat /var/log/containers/stdouts/nova_compute.log
...
2024-10-02T12:26:02.342034984+02:00 stderr F 
2024-10-02T12:26:02.342034984+02:00 stderr F ========================================================================
2024-10-02T12:26:02.342034984+02:00 stderr F ====                        Guru Meditation                         ====
2024-10-02T12:26:02.342034984+02:00 stderr F ========================================================================
2024-10-02T12:26:02.342034984+02:00 stderr F ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2024-10-02T12:26:02.342034984+02:00 stderr F 
2024-10-02T12:26:02.342034984+02:00 stderr F 
...
~~~

7. Same as reported before, nova-compute stopped logging when the migration got stuck (a gap of almost 8 minutes in nova-compute.log):
~~~
2024-10-02 12:18:16.054 9 DEBUG nova.virt.libvirt.driver [req-3c2e3909-f70e-4195-b254-3309e1eb330b - - - - -] skipping disk for instance-00000020 as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9923
2024-10-02 12:26:02.362 9 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: ConnectionResetError: [Errno 104] Connection reset by peer
~~~

8. The VM is still running on the source, but with status ERROR:
~~~
(overcloud) [stack.lab ~]$ openstack server show vm2 --fit -c OS-EXT-SRV-ATTR:host -c OS-EXT-STS:power_state -cOS-EXT-STS:vm_state -c status -c progress
+------------------------+------------------------------------+
| Field                  | Value                              |
+------------------------+------------------------------------+
| OS-EXT-SRV-ATTR:host   | overcloud-novacompute-1.keller.lab |
| OS-EXT-STS:power_state | Running                            |
| OS-EXT-STS:vm_state    | error                              |
| status                 | ERROR                              |
+------------------------+------------------------------------+
(overcloud) [stack.lab ~]$ 
~~~

9. The database record of the failed migration:
~~~
[root@overcloud-controller-0 ~]# podman exec $(podman ps -q -f name=galera) mysql nova -e "select * from migrations where id=98\G"
*************************** 1. row ***************************
          created_at: 2024-10-02 08:16:23
          updated_at: 2024-10-02 10:26:02
          deleted_at: NULL
                  id: 98
      source_compute: overcloud-novacompute-1.keller.lab
        dest_compute: overcloud-novacompute-0.keller.lab
           dest_host: NULL
              status: error
       instance_uuid: 9420b17f-56fa-43cf-a86a-7adc376b2d33
old_instance_type_id: 17
new_instance_type_id: 17
         source_node: overcloud-novacompute-1.keller.lab
           dest_node: overcloud-novacompute-0.keller.lab
             deleted: 0
      migration_type: live-migration
              hidden: 0
        memory_total: 34364792832
    memory_processed: 878011974918
    memory_remaining: 4668698624
          disk_total: 0
      disk_processed: 0
      disk_remaining: 0
                uuid: 09a6830b-e784-469e-9520-fd43183a5f37
     cross_cell_move: 0
             user_id: dcd17f4df82e4222868a84617345fd7a
          project_id: 3224cb31026b4b7e8caf537904b03a53
[root@overcloud-controller-0 ~]# 

~~~


I don't want to add too much entropy to this bugzilla and would like to focus on the fact that Nova gets stuck/halted/unresponsive/suspended/whatever, only to unlock itself minutes later, but do note that there are side effects after these failed migrations. Two that were reported by the customer, and that I see in my own tests, are:

1. If the vm is booting from volume, deleting the vm does not release the volume. The entry on cinder.volume_attachment remains, but tied to an instance that doesn't exist anymore:

2. When a migration fails, there are placement allocations for it both on the source and the target, with the one on the target being an orphan. This is an issue when scheduling further builds or migrations, as the scheduler skips the host, believing it doesn't have enough resources when in reality it does.

I can open separate BZs for that, unless these are known issues.

Let me know if any other log file is needed.

Comment 11 Eric Nothen 2024-10-03 09:59:08 UTC
I'm adding another set of logs from a failed migration last night. Same result, but a couple of differences:

- This migration goes in the opposite direction (compute-0 to compute-1), just to show that the issue is not localized or limited to migrating VMs from one specific compute
- This migration (id 122) got stuck for a really long time. So long that I went to sleep and checked it again the day after. Here are the details:
~~~
[root@overcloud-controller-0 ~]# podman exec $(podman ps -q -f name=galera) mysql nova -e "select * from migrations where id=122\G"
*************************** 1. row ***************************
          created_at: 2024-10-02 17:02:41
          updated_at: 2024-10-03 04:36:24
          deleted_at: NULL
                  id: 122
      source_compute: overcloud-novacompute-0.keller.lab
        dest_compute: overcloud-novacompute-1.keller.lab
           dest_host: NULL
              status: error
       instance_uuid: 50dcc07a-6659-4473-8548-56274e5bdb2f
old_instance_type_id: 17
new_instance_type_id: 17
         source_node: overcloud-novacompute-0.keller.lab
           dest_node: overcloud-novacompute-1.keller.lab
             deleted: 0
      migration_type: live-migration
              hidden: 0
        memory_total: 34364792832
    memory_processed: 750136367542
    memory_remaining: 7575212032
          disk_total: 0
      disk_processed: 0
      disk_remaining: 0
                uuid: 3350e74a-508d-470a-a62d-cd777822f342
     cross_cell_move: 0
             user_id: dcd17f4df82e4222868a84617345fd7a
          project_id: 3224cb31026b4b7e8caf537904b03a53
[root@overcloud-controller-0 ~]# 
~~~

Timestamps in the DB are UTC (local server time minus 2 hours). Therefore, the pre-migration task started on the source compute at 19:02 local time:
~~~
[root@overcloud-novacompute-0 ~]# zgrep -m 1 req-836beecc-d339-4055-92e1-d1b39886e776 /var/log/containers/nova/nova-compute.log.2.gz 
2024-10-02 19:02:48.593 9 DEBUG nova.virt.libvirt.driver [req-836beecc-d339-4055-92e1-d1b39886e776 dcd17f4df82e4222868a84617345fd7a 3224cb31026b4b7e8caf537904b03a53 - default default] [instance: 50dcc07a-6659-4473-8548-56274e5bdb2f] Check if temp file /var/lib/nova/instances/tmpdjs6pb8j exists to indicate shared storage is being used for migration. Exists? False _check_shared_storage_test_file /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8792
~~~

And it's copying and logging progress until 20:42 (so it ran for 1 hour and 40 minutes already):
~~~
[root@overcloud-novacompute-0 ~]# tail -3 /var/log/containers/nova/nova-compute.log.1
2024-10-02 20:42:51.626 9 DEBUG nova.virt.libvirt.migration [-] [instance: 50dcc07a-6659-4473-8548-56274e5bdb2f] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521
2024-10-02 20:42:51.912 9 DEBUG oslo_concurrency.processutils [req-354d3708-e9ea-4b1f-8be5-68c1063aa83b - - - - -] CMD "ceph df --format=json --id openstack --conf /etc/ceph/ceph.conf" returned: 0 in 1.560s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:416
2024-10-02 20:42:52.691 9 DEBUG nova.virt.libvirt.driver [req-354d3708-e9ea-4b1f-8be5-68c1063aa83b - - - - -] skipping disk for instance-0000002c as it does not have a path _get_instance_disk_info_from_config /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9923
~~~

30 minutes later, I trigger a GMR, but again nothing comes out on the log:
~~~
[root@overcloud-novacompute-0 ~]# podman exec -ti nova_compute bash
bash-4.4$ pgrep nova
9
bash-4.4$ 
bash-4.4$ date ; kill -USR2 9
Wed Oct  2 21:12:12 CEST 2024
bash-4.4$ 
~~~

The following morning, I check the logs again and see that the GMR and traceback came out at 6:36, over 6 hours after I triggered it:
~~~
# grep "Guru Meditation" /var/log/containers/stdouts/nova_compute.log*
/var/log/containers/stdouts/nova_compute.log.1:2024-10-03T06:36:24.505756074+02:00 stderr F ====                        Guru Meditation                         ====
[root@overcloud-novacompute-0 ~]#
[root@overcloud-novacompute-0 ~]# head -1 /var/log/containers/nova/nova-compute.log
2024-10-03 06:36:24.558 9 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: ConnectionResetError: [Errno 104] Connection reset by peer
[root@overcloud-novacompute-0 ~]# 
~~~

In total, nova-compute was stuck without logging anything from 20:42 until 06:36 the next day, almost 10 hours.

Comment 13 melanie witt 2024-10-09 02:04:17 UTC
Thanks for all the information. I haven't been able to spend much time looking at this, but something in the GMR made me wonder whether this hang could indeed be related to the tpool.Proxy issue from the past that I mentioned in an earlier comment:

2024-10-02T12:26:02.342175042+02:00 stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
2024-10-02T12:26:02.342175042+02:00 stderr F     `task(self, context)`
2024-10-02T12:26:02.342175042+02:00 stderr F 
2024-10-02T12:26:02.342175042+02:00 stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
2024-10-02T12:26:02.342175042+02:00 stderr F     `startup=startup)`
2024-10-02T12:26:02.342175042+02:00 stderr F 
2024-10-02T12:26:02.342175042+02:00 stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
2024-10-02T12:26:02.342175042+02:00 stderr F     `startup=startup)`
2024-10-02T12:26:02.342175042+02:00 stderr F 
2024-10-02T12:26:02.342175042+02:00 stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
2024-10-02T12:26:02.342175042+02:00 stderr F     `resources = self.driver.get_available_resource(nodename)`
2024-10-02T12:26:02.342175042+02:00 stderr F 
2024-10-02T12:26:02.342175042+02:00 stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
2024-10-02T12:26:02.342175042+02:00 stderr F     `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
2024-10-02T12:26:02.342175042+02:00 stderr F 
2024-10-02T12:26:02.342175042+02:00 stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
2024-10-02T12:26:02.342175042+02:00 stderr F     `in devices.items() if "pci" in dev.listCaps()]`
2024-10-02T12:26:02.342175042+02:00 stderr F 
2024-10-02T12:26:02.342175042+02:00 stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
2024-10-02T12:26:02.342175042+02:00 stderr F     `in devices.items() if "pci" in dev.listCaps()]`
2024-10-02T12:26:02.342175042+02:00 stderr F 
2024-10-02T12:26:02.342175042+02:00 stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
2024-10-02T12:26:02.342175042+02:00 stderr F     `ret = libvirtmod.virNodeDeviceListCaps(self._o)`

This routine uses libvirt calls that return lists, which was the characteristic of that issue. This periodic task is obviously not part of the live migration, but maybe it could be affecting nova-compute?
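
For context, here is a minimal sketch (not the actual nova code) of the kind of proxying the referenced patch is about: libvirt objects returned inside plain Python lists are not wrapped by eventlet's tpool.Proxy, so later native calls on them, such as listCaps(), run in a green thread and can block the whole process. The conn argument is assumed to already be a proxied libvirt connection.
~~~
# Illustrative sketch only: wrap each libvirt object that comes back inside
# a plain list with eventlet.tpool.Proxy, so that subsequent native calls
# on it run in a real OS thread instead of blocking the eventlet hub.
from eventlet import tpool

def pci_devices(conn):
    devices = conn.listAllDevices()                 # unproxied libvirt objects
    proxied = [tpool.Proxy(dev) for dev in devices]
    return [dev for dev in proxied if "pci" in dev.listCaps()]
~~~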

In the meantime if you are able to, I think it's worth trying to apply this patch and see if there is different behavior. I don't mean to spend a lot of time, just enough to see if nova-compute still gets stuck with the patch applied.

https://review.opendev.org/c/openstack/nova/+/677736 

If it doesn't change anything, I'll continue and enlist some more help investigating this because nothing else seems to be giving any clues.

Comment 14 Eric Nothen 2024-10-16 11:08:41 UTC
(In reply to melanie witt from comment #13)

> In the meantime if you are able to, I think it's worth trying to apply this
> patch and see if there is different behavior. I don't mean to spend a lot of
> time, just enough to see if nova-compute still gets stuck with the patch
> applied.
> 
> https://review.opendev.org/c/openstack/nova/+/677736 
> 
> If it doesn't change anything, I'll continue and enlist some more help
> investigating this because nothing else seems to be giving any clues.

Sorry for the delay, just came back from pto yesterday.

I gave patching host.py a try, but it doesn't apply cleanly:
~~~
[tripleo-admin@overcloud-novacompute-0 ~]$ patch host.py 36ee9c1.diff
patching file host.py
Reversed (or previously applied) patch detected!  Assume -R? [n] y
Hunk #2 FAILED at 110.
Hunk #3 succeeded at 268 (offset 17 lines).
Hunk #4 succeeded at 282 (offset 17 lines).
Hunk #5 succeeded at 654 (offset 20 lines).
1 out of 5 hunks FAILED -- saving rejects to file host.py.rej
[tripleo-admin@overcloud-novacompute-0 ~]$ 
~~~

It would be easier if you could provide an RPM with the changes, so that I can use it with the usual image patching procedure [0].

[0] https://docs.redhat.com/en/documentation/red_hat_openstack_platform/16.2/html-single/advanced_overcloud_customization/index#proc_updating-existing-packages-on-container-images_containerized-services

Comment 16 melanie witt 2024-10-18 07:16:17 UTC
Thanks for trying it. I tried it myself earlier today and found we do in fact have that patch already in 16.x and somehow I managed to not find it before. Sorry about that.

While looking at the code some more, I found a couple of additional places where lists of libvirt objects were not proxied. I have proposed a fix upstream [1]; I'm not sure it will help, but it's the only thing I've found so far that could possibly be related.

[1] https://review.opendev.org/c/openstack/nova/+/932669

Comment 18 Eric Nothen 2024-10-21 17:34:25 UTC
TL;DR: Still fails. Not sure if related or not, but it failed much faster than before.

I've patched the nova-compute image with the patches provided:
~~~
[root@overcloud-novacompute-0 ~]# podman ps |grep nova_compute
a1b3167f7676  director.ctlplane.keller.lab:8787/keller-prod-osp_16_2_cv-rhosp16_containers-nova-compute:16.2-bz2312196  kolla_start           8 hours ago  Up 5 hours ago          nova_compute
[root@overcloud-novacompute-0 ~]# 
[root@overcloud-novacompute-0 ~]# podman exec -ti nova_compute rpm -qa --last | head
openstack-nova-migration-20.6.2-2.20230411134851.el8osttrunk.noarch Mon 21 Oct 2024 08:56:22 AM CEST
openstack-nova-compute-20.6.2-2.20230411134851.el8osttrunk.noarch Mon 21 Oct 2024 08:56:22 AM CEST
python3-nova-20.6.2-2.20230411134851.el8osttrunk.noarch Mon 21 Oct 2024 08:56:21 AM CEST
openstack-nova-common-20.6.2-2.20230411134851.el8osttrunk.noarch Mon 21 Oct 2024 08:56:21 AM CEST
xfsprogs-5.0.0-8.el8.x86_64                   Mon 26 Aug 2024 08:39:24 PM CEST
nvme-cli-1.12-4.el8_4.x86_64                  Mon 26 Aug 2024 08:39:24 PM CEST
ndctl-71.1-2.el8.x86_64                       Mon 26 Aug 2024 08:39:24 PM CEST
targetcli-2.1.53-2.el8.noarch                 Mon 26 Aug 2024 08:39:23 PM CEST
openssh-server-8.0p1-7.el8_4.x86_64           Mon 26 Aug 2024 08:39:23 PM CEST
ceph-common-14.2.22-128.el8cp.x86_64          Mon 26 Aug 2024 08:39:23 PM CEST
[root@overcloud-novacompute-0 ~]# 
~~~

The VM has ID 7bb3e9e1-c698-431e-b68f-b0ea9435e980, and the first migration starts at 15:27:50:
~~~
Mon Oct 21 15:27:50 CEST 2024: Migration id is 185
Mon Oct 21 15:27:50 CEST 2024
progress: 0
memory_processed_bytes -                                    
memory_remaining_bytes -                                    
memory_total_bytes     -                                    

Mon Oct 21 15:28:49 CEST 2024
progress: 10
memory_processed_bytes 1428470784                           
memory_remaining_bytes 29864910848                          
memory_total_bytes     34364792832         
~~~

Progress goes back and forth for almost 2 hours until it fails at about 17:26:
~~~
Mon Oct 21 17:22:40 CEST 2024
progress: 87
memory_processed_bytes 598769502203                         
memory_remaining_bytes 848388096                            
memory_total_bytes     34364792832                          

Mon Oct 21 17:23:29 CEST 2024
progress: 76
memory_processed_bytes 603166568721                         
memory_remaining_bytes 5937819648                           
memory_total_bytes     34364792832                          

Mon Oct 21 17:24:29 CEST 2024
progress: 90
memory_processed_bytes 608323424408                         
memory_remaining_bytes 790798336                            
memory_total_bytes     34364792832                          

Mon Oct 21 17:25:34 CEST 2024
progress: 90
memory_processed_bytes 608323424408                         
memory_remaining_bytes 790798336                            
memory_total_bytes     34364792832                          

Mon Oct 21 17:26:27 CEST 2024
progress: 90
memory_processed_bytes 608323424408                         
memory_remaining_bytes 790798336                            
memory_total_bytes     34364792832               
~~~

At this point, the VM is left in an error state:
~~~
(overcloud) [stack.lab ~]$ openstack server show vm2 --fit
+-------------------------------------+----------------------------------------------------------------------------------------+
| Field                               | Value                                                                                  |
+-------------------------------------+----------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                                 |
| OS-EXT-AZ:availability_zone         | nova                                                                                   |
| OS-EXT-SRV-ATTR:host                | overcloud-novacompute-0.keller.lab                                                     |
| OS-EXT-SRV-ATTR:hostname            | vm2                                                                                    |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-novacompute-0.keller.lab                                                     |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000035                                                                      |
| OS-EXT-SRV-ATTR:kernel_id           |                                                                                        |
| OS-EXT-SRV-ATTR:launch_index        | 0                                                                                      |
| OS-EXT-SRV-ATTR:ramdisk_id          |                                                                                        |
| OS-EXT-SRV-ATTR:reservation_id      | r-6j0kbmf5                                                                             |
| OS-EXT-SRV-ATTR:root_device_name    | /dev/vda                                                                               |
| OS-EXT-SRV-ATTR:user_data           | None                                                                                   |
| OS-EXT-STS:power_state              | Running                                                                                |
| OS-EXT-STS:task_state               | None                                                                                   |
| OS-EXT-STS:vm_state                 | error                                                                                  |
| OS-SRV-USG:launched_at              | 2024-10-21T13:21:11.000000                                                             |
| OS-SRV-USG:terminated_at            | None                                                                                   |
| accessIPv4                          |                                                                                        |
| accessIPv6                          |                                                                                        |
| addresses                           | vmmigrate-net=10.11.1.250, 192.168.122.168                                             |
| config_drive                        |                                                                                        |
| created                             | 2024-10-21T13:21:00Z                                                                   |
| description                         | None                                                                                   |
| flavor                              | disk='11', ephemeral='0', , original_name='4c32g11d', ram='32768', swap='0', vcpus='4' |
| hostId                              | 1f4ef244ca3fe3dc82f22b5881b944b36ee46a44b8543363bc43503f                               |
| host_status                         | UP                                                                                     |
| id                                  | 7bb3e9e1-c698-431e-b68f-b0ea9435e980                                                   |
| image                               |                                                                                        |
| key_name                            | stack                                                                                  |
| locked                              | False                                                                                  |
| locked_reason                       | None                                                                                   |
| name                                | vm2                                                                                    |
| project_id                          | 3224cb31026b4b7e8caf537904b03a53                                                       |
| properties                          |                                                                                        |
| security_groups                     | name='all-open'                                                                        |
| server_groups                       | []                                                                                     |
| status                              | ERROR                                                                                  |
| tags                                | []                                                                                     |
| trusted_image_certificates          | None                                                                                   |
| updated                             | 2024-10-21T15:27:25Z                                                                   |
| user_id                             | dcd17f4df82e4222868a84617345fd7a                                                       |
| volumes_attached                    | delete_on_termination='False', id='f25af806-1cd0-4fd8-9c02-acf148224730'               |
+-------------------------------------+----------------------------------------------------------------------------------------+
(overcloud) [stack.lab ~]$ 
~~~

The migration event and the event output:
~~~
(overcloud) [stack.lab ~]$ openstack server event list vm2 --fit
+------------------------------------------+--------------------------------------+----------------+----------------------------+
| Request ID                               | Server ID                            | Action         | Start Time                 |
+------------------------------------------+--------------------------------------+----------------+----------------------------+
| req-08c59f04-3af4-4479-a952-0a7badd4105d | 7bb3e9e1-c698-431e-b68f-b0ea9435e980 | live-migration | 2024-10-21T13:26:56.000000 |
| req-121c4f5d-c3a5-4f4c-a224-cc4d60d8d86c | 7bb3e9e1-c698-431e-b68f-b0ea9435e980 | create         | 2024-10-21T13:20:56.000000 |
+------------------------------------------+--------------------------------------+----------------+----------------------------+
(overcloud) [stack.lab ~]$ 
(overcloud) [stack.lab ~]$ openstack server event show vm2 req-08c59f04-3af4-4479-a952-0a7badd4105d --fit
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field         | Value                                                                                                                                                                                             |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| action        | live-migration                                                                                                                                                                                    |
| events        | [{'event': 'compute_pre_live_migration', 'start_time': '2024-10-21T13:27:40.000000', 'finish_time': '2024-10-21T13:28:33.000000', 'result': 'Success', 'traceback': None, 'host': 'overcloud-     |
|               | novacompute-1.keller.lab', 'hostId': 'b9a380ea15a66b5905d56c39e84af06ab10c6a88b37dbf53d0ec7b7f'}, {'event': 'compute_live_migration', 'start_time': '2024-10-21T13:27:37.000000', 'finish_time':  |
|               | '2024-10-21T13:27:39.000000', 'result': 'Success', 'traceback': None, 'host': 'overcloud-novacompute-0.keller.lab', 'hostId': '1f4ef244ca3fe3dc82f22b5881b944b36ee46a44b8543363bc43503f'},        |
|               | {'event': 'compute_check_can_live_migrate_source', 'start_time': '2024-10-21T13:27:28.000000', 'finish_time': '2024-10-21T13:27:29.000000', 'result': 'Success', 'traceback': None, 'host':       |
|               | 'overcloud-novacompute-0.keller.lab', 'hostId': '1f4ef244ca3fe3dc82f22b5881b944b36ee46a44b8543363bc43503f'}, {'event': 'compute_check_can_live_migrate_destination', 'start_time':                |
|               | '2024-10-21T13:27:27.000000', 'finish_time': '2024-10-21T13:27:32.000000', 'result': 'Success', 'traceback': None, 'host': 'overcloud-novacompute-1.keller.lab', 'hostId':                        |
|               | 'b9a380ea15a66b5905d56c39e84af06ab10c6a88b37dbf53d0ec7b7f'}, {'event': 'conductor_live_migrate_instance', 'start_time': '2024-10-21T13:27:05.000000', 'finish_time':                              |
|               | '2024-10-21T13:27:37.000000', 'result': 'Success', 'traceback': None, 'host': 'overcloud-controller-1.keller.lab', 'hostId': '9af1740356f183f2490a3d736164cecbd422f08377f2d740807baa1c'}]         |
| instance_uuid | 7bb3e9e1-c698-431e-b68f-b0ea9435e980                                                                                                                                                              |
| message       | None                                                                                                                                                                                              |
| project_id    | 3224cb31026b4b7e8caf537904b03a53                                                                                                                                                                  |
| request_id    | req-08c59f04-3af4-4479-a952-0a7badd4105d                                                                                                                                                          |
| start_time    | 2024-10-21T13:26:56.000000                                                                                                                                                                        |
| updated_at    | 2024-10-21T13:28:33.000000                                                                                                                                                                        |
| user_id       | dcd17f4df82e4222868a84617345fd7a                                                                                                                                                                  |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
(overcloud) [stack.lab ~]$ 
~~~


In the case of this specific migration, the connection error started at 17:24:31:
~~~
2024-10-21 17:24:31.103 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [01307675-8e8c-4584-a5e8-3844aabbe24c] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
~~~

Live migration failed at 17:25:26:
~~~
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [-] [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980] Live migration failed.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 69d73c25cf05406fb5e1baf42e2ae7c1
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980] Traceback (most recent call last):
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 425, in get
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     return self._queues[msg_id].get(block=True, timeout=timeout)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     return waiter.wait()
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     return get_hub().switch()
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     return self.greenlet.switch()
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980] queue.Empty
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980] During handling of the above exception, another exception occurred:
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980] Traceback (most recent call last):
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 7466, in _do_live_migration
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     block_migration, migrate_data)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8858, in live_migration
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     migrate_data)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9382, in _live_migration
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     finish_event, disk_paths)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9256, in _live_migration_monitor
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     info, remaining)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py", line 557, in save_stats
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     migration.save()
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     ctxt, self, fn.__name__, args, kwargs)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 246, in object_action
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     objmethod=objmethod, args=args, kwargs=kwargs)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     transport_options=self.transport_options)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     transport_options=transport_options)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     transport_options=transport_options)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 662, in _send
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     call_monitor_timeout)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 551, in wait
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     message = self.waiters.get(msg_id, timeout=timeout)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 429, in get
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     'to message ID %s' % msg_id)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980] oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 69d73c25cf05406fb5e1baf42e2ae7c1
~~~

And the node eventually reconnected at 17:26:59:
~~~
2024-10-21 17:26:59.277 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [7f4808ff-5c85-4b49-a4e9-6f01bc0c7bcd] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2024-10-21 17:27:00.392 7 INFO oslo.messaging._drivers.impl_rabbit [-] [7f4808ff-5c85-4b49-a4e9-6f01bc0c7bcd] Reconnected to AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 via [amqp] client with port 41074.
2024-10-21 17:27:11.958 7 INFO oslo.messaging._drivers.impl_rabbit [req-5dd64e8d-b14c-4ca0-96d9-4172e1a01246 - - - - -] [8a202219-6f32-4d81-a00c-220d67358576] Reconnected to AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 via [amqp] client with port 36582.
2024-10-21 17:27:12.146 7 INFO oslo.messaging._drivers.impl_rabbit [-] [01307675-8e8c-4584-a5e8-3844aabbe24c] Reconnected to AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 via [amqp] client with port 36580.
~~~

I don't see any new logging added by the patch, so I don't know if there's anything specific in the log that I should be looking for compared to the previous migrations using the GA packages.

I'm attaching the nova* logs from the compute.

Comment 20 melanie witt 2024-10-21 21:14:30 UTC
Thank you for testing that.

Sorry if it wasn't clear, but the goal of that patch was/is to address the "nova-compute freezes and becomes unresponsive" part of the problem -- there is no new logging in it. AFAICT in the logs you attached, nova-compute didn't hang and stop logging or doing other activities during the migration. What we are looking for with the patch is whether it solves nova-compute getting stuck.

Regarding the rabbitmq part, I see this interesting error that I didn't see before (and may have missed):

>2024-10-21 15:33:43.483 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [7f4808ff-5c85-4b49-a4e9-6f01bc0c7bcd] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.: amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
>2024-10-21 15:33:43.488 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [01307675-8e8c-4584-a5e8-3844aabbe24c] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.: amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
>2024-10-21 15:33:44.567 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [7f4808ff-5c85-4b49-a4e9-6f01bc0c7bcd] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
>2024-10-21 15:33:44.606 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [01307675-8e8c-4584-a5e8-3844aabbe24c] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: [Errno 111]
>2024-10-21 15:33:45.600 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [7f4808ff-5c85-4b49-a4e9-6f01bc0c7bcd] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: [Errno 111]
>2024-10-21 15:33:45.625 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [01307675-8e8c-4584-a5e8-3844aabbe24c] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED

This says "CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.: amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'" which sounds like it's saying that the message broker was shut down. Do you see any interesting logs for rabbitmq around this timestamp? If so, could you attach them?

IIUC you aren't using any container manager in your repro environment? Just wondering what things could possibly be shutting down rabbitmq. I hope there is helpful info in the logs.

Comment 21 Eric Nothen 2024-10-22 07:51:50 UTC
(In reply to melanie witt from comment #20)
> Thank you for testing that.
> 
> Sorry if it wasn't clear, but the goal of that patch was/is to address the
> "nova-compute freezes and becomes unresponsive" part of the problem -- there
> is no new logging in it. AFAICT in the logs you attached, nova-compute
> didn't hang and stop logging or doing other activities during the migration.
> What we are looking for with the patch is whether it solves nova-compute
> getting stuck.
> 

Yes, that's a good point. The fact that it didn't freeze actually explains why it failed much more quickly than in the previous try (GA image), which was frozen for 10 hours.


> Regarding the rabbitmq part, I see this interesting error that I didn't see
> before (and may have missed):
> 
> >2024-10-21 15:33:43.483 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [7f4808ff-5c85-4b49-a4e9-6f01bc0c7bcd] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.: amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
> >2024-10-21 15:33:43.488 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [01307675-8e8c-4584-a5e8-3844aabbe24c] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.: amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
> >2024-10-21 15:33:44.567 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [7f4808ff-5c85-4b49-a4e9-6f01bc0c7bcd] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
> >2024-10-21 15:33:44.606 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [01307675-8e8c-4584-a5e8-3844aabbe24c] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: [Errno 111]
> >2024-10-21 15:33:45.600 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [7f4808ff-5c85-4b49-a4e9-6f01bc0c7bcd] AMQP server on overcloud-controller-2.ctlplane.keller.lab:5672 is unreachable: [Errno 111]
> >2024-10-21 15:33:45.625 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [01307675-8e8c-4584-a5e8-3844aabbe24c] AMQP server on overcloud-controller-0.ctlplane.keller.lab:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
> 
> This says "CONNECTION_FORCED - broker forced connection closure with reason
> 'shutdown'. Trying again in 1 seconds.: amqp.exceptions.ConnectionForced:
> (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with
> reason 'shutdown'" which sounds like it's saying that the message broker was
> shut down. Do you see any interesting logs for rabbitmq around this
> timestamp? If so, could you attach them?
> 

Yes, turns out there are errors in rabbit just before that:
~~~
Oct 21 15:33:10 overcloud-controller-2 system.pacemaker Oct 21 15:33:10 overcloud-controller-2.keller.lab pacemaker-controld  [2751] (process_lrm_event) #011error: Result of monitor operation for rabbitmq on rabbitmq-bundle-0: Timed Out | call=429564 key=rabbitmq_monitor_10000 timeout=40000ms
...
Oct 21 15:33:11 overcloud-controller-1 system.pacemaker Oct 21 15:33:11 overcloud-controller-1.keller.lab pacemaker-schedulerd[2810] (log_list_item) #011info: Container bundle set: rabbitmq-bundle [cluster.common.tag/keller-prod-osp_16_2_cv-rhosp16_containers-rabbitmq:pcmklatest]: rabbitmq-bundle-0#011(ocf::heartbeat:rabbitmq-cluster):#011 FAILED overcloud-controller-2
Oct 21 15:33:11 overcloud-controller-1 system.pacemaker Oct 21 15:33:11 overcloud-controller-1.keller.lab pacemaker-schedulerd[2810] (log_list_item) #011info: Container bundle set: rabbitmq-bundle [cluster.common.tag/keller-prod-osp_16_2_cv-rhosp16_containers-rabbitmq:pcmklatest]: rabbitmq-bundle-1#011(ocf::heartbeat:rabbitmq-cluster):#011 Stopped overcloud-controller-0
Oct 21 15:33:11 overcloud-controller-1 system.pacemaker Oct 21 15:33:11 overcloud-controller-1.keller.lab pacemaker-schedulerd[2810] (log_list_item) #011info: Container bundle set: rabbitmq-bundle [cluster.common.tag/keller-prod-osp_16_2_cv-rhosp16_containers-rabbitmq:pcmklatest]: rabbitmq-bundle-2#011(ocf::heartbeat:rabbitmq-cluster):#011 Started overcloud-controller-1
~~~

I'm not sure now if rabbit issues caused the migration to fail, or if the ongoing migration caused rabbit to go down, so I need to start from scratch and check if rabbit would have errors when I do a bunch of other things aside from migrations.


> IIUC you aren't using any container manager in your repro environment? Just
> wondering what things could possibly be shutting down rabbitmq. I hope there
> is helpful info in the logs.

I am not. This is a standard tripleo deployment of RHOSP.

Comment 22 melanie witt 2024-10-23 02:58:51 UTC
(In reply to Eric Nothen from comment #21)
> Yes, turns out there are errors in rabbit just before that:
> ~~~
> Oct 21 15:33:10 overcloud-controller-2 system.pacemaker Oct 21 15:33:10
> overcloud-controller-2.keller.lab pacemaker-controld  [2751]
> (process_lrm_event) #011error: Result of monitor operation for rabbitmq on
> rabbitmq-bundle-0: Timed Out | call=429564 key=rabbitmq_monitor_10000
> timeout=40000ms
> ...
> Oct 21 15:33:11 overcloud-controller-1 system.pacemaker Oct 21 15:33:11
> overcloud-controller-1.keller.lab pacemaker-schedulerd[2810] (log_list_item)
> #011info: Container bundle set: rabbitmq-bundle
> [cluster.common.tag/keller-prod-osp_16_2_cv-rhosp16_containers-rabbitmq:
> pcmklatest]: rabbitmq-bundle-0#011(ocf::heartbeat:rabbitmq-cluster):#011
> FAILED overcloud-controller-2
> Oct 21 15:33:11 overcloud-controller-1 system.pacemaker Oct 21 15:33:11
> overcloud-controller-1.keller.lab pacemaker-schedulerd[2810] (log_list_item)
> #011info: Container bundle set: rabbitmq-bundle
> [cluster.common.tag/keller-prod-osp_16_2_cv-rhosp16_containers-rabbitmq:
> pcmklatest]: rabbitmq-bundle-1#011(ocf::heartbeat:rabbitmq-cluster):#011
> Stopped overcloud-controller-0
> Oct 21 15:33:11 overcloud-controller-1 system.pacemaker Oct 21 15:33:11
> overcloud-controller-1.keller.lab pacemaker-schedulerd[2810] (log_list_item)
> #011info: Container bundle set: rabbitmq-bundle
> [cluster.common.tag/keller-prod-osp_16_2_cv-rhosp16_containers-rabbitmq:
> pcmklatest]: rabbitmq-bundle-2#011(ocf::heartbeat:rabbitmq-cluster):#011
> Started overcloud-controller-1
> ~~~
> 
> I'm not sure now if rabbit issues caused the migration to fail, or if the
> ongoing migration caused rabbit to go down, so I need to start from scratch
> and check if rabbit would have errors when I do a bunch of other things
> aside from migrations.

Yeah, I'm thinking about the same thing. There is a chance the migration "monitoring" greenthread is disrupting rabbit; namely, this excerpt from the traceback:

2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980] Traceback (most recent call last):
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 7466, in _do_live_migration
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     block_migration, migrate_data)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8858, in live_migration
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     migrate_data)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9382, in _live_migration
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     finish_event, disk_paths)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9256, in _live_migration_monitor
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     info, remaining)
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]   File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py", line 557, in save_stats
2024-10-21 17:25:26.732 7 ERROR nova.compute.manager [instance: 7bb3e9e1-c698-431e-b68f-b0ea9435e980]     migration.save()

This is doing a migration.save(), which makes an RPC call to nova-conductor, and nova-conductor then connects to the database to write the update. This save_stats is called every 5 seconds ... so I wonder if that's enough of a bombardment of rabbit to cause failures after a while. Unfortunately, there is no config option to control the save_stats interval, so if we want to try saving less frequently to see if it changes anything, we have to patch the code for now.
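
As an illustration of that interval (a sketch with made-up names, not the actual nova code, assuming the monitor polls roughly every 0.5 seconds, which is what an "every 5 seconds" cadence from an n % 10 == 0 check implies):
~~~
# Illustrative sketch: the live migration monitor persists progress on every
# Nth iteration of a ~0.5 s polling loop; each save is an RPC round trip
# (migration.save() -> RabbitMQ -> nova-conductor -> DB).
import time

def monitor_loop(save_stats, poll_interval=0.5, save_every_n=10):
    n = 0
    while True:
        n += 1
        if (n % save_every_n) == 0:   # 10 -> every ~5 s, 60 -> every ~30 s
            save_stats()
        time.sleep(poll_interval)
~~~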

To be clear, the eventual failure of the migration when [libvirt]live_migration_timeout_action != force_complete and post copy is permitted is expected behavior (as was covered in earlier comments in this BZ), and we don't expect it to complete if the guest is dirtying memory fast enough.

Regarding the configuration defaults we ship, a colleague did note in our last team discussion about this BZ that in OSP 18.0, [libvirt]live_migration_timeout_action defaults to force_complete. At some point there was a decision to turn on post copy by default, so with 18.0 your test scenario should not time out or fail the migration, as post copy would be triggered.
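
For reference, these knobs live in the [libvirt] section of nova.conf on the compute nodes; a hedged example of the post-copy behavior described above (the values are illustrative, not a recommendation for this environment):
~~~
[libvirt]
# Allow switching the migration to post-copy mode when needed.
live_migration_permit_post_copy = true
# On completion timeout, force the migration to finish (post copy is used
# when permitted) instead of aborting it; 'abort' is the default in 16.2.
live_migration_timeout_action = force_complete
# Seconds allowed per GiB of guest data before the timeout action fires.
live_migration_completion_timeout = 800
~~~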

Comment 23 Eric Nothen 2024-10-23 14:58:51 UTC
For lack of a better test case, I have run no fewer than 10 rounds of the following loop, which, to some extent, requires multiple OSP endpoints to interact with each other:

~~~
openstack server create --flavor 1c256m --image cirros-0.5.2 --key-name stack --network testnet -c OS-EXT-STS:task_state --max 20 testbulk
while [ $(openstack server list | grep -c BUILD) -gt 0 ] ;do sleep 10 ;done ; openstack server list --name testbulk --fit
for x in $(seq 1 20) ;do fip=$(openstack floating ip create external -c floating_ip_address -f value); openstack server add floating ip testbulk-$x $fip; done
openstack server list | awk '/testbulk/ {print $2}' | xargs openstack server delete
openstack floating ip list -c ID -c Port | awk '/None/ {print $2}' | xargs openstack floating ip delete
~~~

During this loop, 20 VMs and FIPs are created, assigned to each other, and then deleted. This completes fine without RabbitMQ errors, which I take as a sign of stability.

Then I do the following: on compute-0, I change the code so that it updates progress in the database every 30 seconds instead of every 5 (and restart the container); compute-1 I leave as it is:
~~~
[stack.lab ~]$ ansible -i inventory.yaml -m shell -a 'sudo podman exec nova_compute grep -n -m 1 -B1 "if (n % ." /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py' Compute
overcloud-novacompute-0 | CHANGED | rc=0 >>
9241-                # Testing with updates every 30 seconds
9242:                if (n % 60) == 0:
overcloud-novacompute-1 | CHANGED | rc=0 >>
9239-                # when debug logs are off.
9240:                if (n % 10) == 0:
[stack.lab ~]$ 
~~~

So: migrations from compute-0 to compute-1 update progress every 30 seconds, while migrations from compute-1 to compute-0 update it every 5 seconds.

The VM is currently running on compute-0, so the first migration test goes from compute-0 to compute-1. The filtered log from nova_compute agrees:
~~~
Oct 23 16:18:18 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:18:18.050 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 0 secs, memory 100% remaining (bytes processed=0, remaining=0, total=0); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:18:52 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:18:51.697 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 30 secs, memory 79% remaining (bytes processed=4139905114, remaining=27130343424, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:19:25 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:19:25.261 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 60 secs, memory 68% remaining (bytes processed=8064491563, remaining=23205691392, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:20:03 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:20:02.827 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 90 secs, memory 54% remaining (bytes processed=12609758216, remaining=18669236224, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:20:35 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:20:34.919 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 120 secs, memory 31% remaining (bytes processed=17165948309, remaining=10493169664, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:21:09 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:21:08.801 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 150 secs, memory 17% remaining (bytes processed=21804297899, remaining=5863710720, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:21:45 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:21:45.371 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 180 secs, memory 4% remaining (bytes processed=26137543899, remaining=1538850816, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:22:18 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:22:17.873 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 210 secs, memory 14% remaining (bytes processed=32150579044, remaining=4734943232, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:22:50 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:22:49.847 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 240 secs, memory 15% remaining (bytes processed=37518479372, remaining=5052497920, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:23:23 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:23:23.032 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 270 secs, memory 7% remaining (bytes processed=43495314234, remaining=2398822400, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:23:55 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:23:55.539 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration operation has completed
Oct 23 16:23:55 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:23:55.564 7 DEBUG nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration operation thread has finished _live_migration_operation /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9069
Oct 23 16:23:55 overcloud-novacompute-0 openstack.nova.compute 2024-10-23 16:23:55.565 7 DEBUG nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9369
~~~

The migration completes successfully. Then I migrate it back right away, with no changes to the actual workload. Since it's now coming from compute-1, it sends progress every 5 seconds:
~~~
Oct 23 16:28:34 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 16:28:33.708 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 0 secs, memory 100% remaining (bytes processed=0, remaining=0, total=0); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Oct 23 16:28:41 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 16:28:41.003 7 DEBUG nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 5 secs, memory 90% remaining (bytes processed=377038303, remaining=30911877120, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0). _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9285
Oct 23 16:28:50 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 16:28:49.986 7 DEBUG nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 10 secs, memory 87% remaining (bytes processed=1214147669, remaining=30050676736, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0). _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9285
~~~

15 minutes later, the migration job (same VM, same workload, just the opposite direction) is still running:
~~~
Oct 23 16:53:38 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 16:53:37.702 7 DEBUG nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 890 secs, memory 20% remaining (bytes processed=177259242795, remaining=6954078208, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0). _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9285
Oct 23 16:53:45 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 16:53:44.510 7 DEBUG nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 895 secs, memory 17% remaining (bytes processed=178286347924, remaining=5928939520, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0). _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9285
Oct 23 16:53:52 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 16:53:51.511 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 900 secs, memory 15% remaining (bytes processed=179090433317, remaining=5126352896, total=34364792832); disk 100% remaining (bytes processed=0, remaining=0, total=0).
~~~

I'll report later how/when this job finishes.

Comment 24 Eric Nothen 2024-10-24 05:37:59 UTC
The migration that started at 16:28:34 was finally aborted more than 7 hours later:

~~~
Oct 23 16:28:34 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 16:28:33.708 7 INFO nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration running for 0 secs, memory 100% remaining (bytes processed=0, remaining=0, total=0); disk 100% remaining (bytes processed=0, remaining=0, total=0).
...
Oct 23 23:35:13 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:12.329 7 DEBUG nova.virt.libvirt.migration [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521
Oct 23 23:35:13 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:13.712 7 WARNING nova.virt.libvirt.migration [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Live migration not completed after 25600 sec
Oct 23 23:35:13 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:13.737 7 ERROR nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Live Migration failure: operation aborted: migration out job: canceled by client: libvirt.libvirtError: operation aborted: migration out job: canceled by client
Oct 23 23:35:13 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:13.738 7 DEBUG nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration operation thread notification thread_finished /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9369
Oct 23 23:35:13 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:13.739 7 DEBUG nova.virt.libvirt.migration [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Current 500 elapsed 25601 steps [(0, 50), (2400, 95), (4800, 140), (7200, 185), (9600, 230), (12000, 275), (14400, 320), (16800, 365), (19200, 410), (21600, 455), (24000, 500)] update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:509
Oct 23 23:35:14 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:13.739 7 DEBUG nova.virt.libvirt.migration [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Downtime does not need to change update_downtime /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:521
Oct 23 23:35:14 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:14.254 7 DEBUG nova.virt.libvirt.migration [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] VM running on src, migration failed _log /usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py:427
Oct 23 23:35:14 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:14.255 7 DEBUG nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Fixed incorrect job type to be 4 _live_migration_monitor /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:9183
Oct 23 23:35:14 overcloud-novacompute-1 openstack.nova.compute 2024-10-23 23:35:14.255 7 ERROR nova.virt.libvirt.driver [-] [instance: 8c847a89-14b5-4aeb-a165-c38b7c23d63d] Migration operation has aborted
~~~

According to the last progress update, it had copied over 84 times the size of the RAM:
~~~
Wed Oct 23 23:34:56 CEST 2024
progress: 80
memory_processed_bytes 2896956480518                        
memory_remaining_bytes 6540087296                           
memory_total_bytes     34364792832                  
~~~

I don't see a gap in the logs, nor a traceback, so the patches you submitted are indeed taking care of the freeze. This time, the migration timed out after 25600 seconds, which I take to be 800*32 (live_migration_completion_timeout * <size of RAM in GB>), and is therefore expected.
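
A quick check of that arithmetic (a sketch, assuming the completion timeout scales with guest RAM in GiB as described):
~~~
# Default [libvirt]live_migration_completion_timeout is 800 seconds per GiB.
ram_gib = 32
timeout = 800 * ram_gib
print(timeout)   # 25600 seconds, roughly 7.1 hours
~~~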

Comment 25 melanie witt 2024-10-24 18:23:32 UTC
This is a very useful testing result. I agree the timeout aligns with expectation if the RAM size is 32GB.

For the freeze patch, I will open a bug upstream and link it to the patch: https://review.opendev.org/c/openstack/nova/+/932669 and proceed with reviews. It will be backportable.

I think this also seems to confirm that the frequency of the save_stats (every 5 seconds) is indeed interfering with the migration in the case of rapid dirtying of memory, with quite severe results.

I was thinking the best action to take for that would be to add a new [libvirt] config option for the interval and default it to the 30 seconds you tested, or maybe 20 seconds -- ideally as low as we can set it while still allowing the stress-ng test case to complete. And if an operator wants or needs to change the interval in config, they can.
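
For illustration only, such an option might end up looking like the snippet below; the option name and default are assumptions, not something that exists in nova today:
~~~
[libvirt]
# Hypothetical option (name/default assumed, pending the upstream change
# discussed above): how often, in seconds, the live migration monitor
# persists progress stats via migration.save().
live_migration_stats_update_interval = 30
~~~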

