+++ This bug was initially created as a clone of Bug #1167905 +++
Description of problem:
We are able to do a Live Migration of an instance from compute node 1 to compute node 2 without dropping an active SSH session to the instance.
However, if we then Live Migrate the same instance back to compute node 1, the SSH session hangs.
Version-Release number of selected component (if applicable):
openstack-nova-compute-2014.1.2-1.el7ost.noarch
How reproducible:
GSS has not reproduced the issue, but Lee Yarwood was working on it.
The customer can reproduce the issue reliably.
Steps to Reproduce:
1. Setup an OpenStack environment capable of Live Migration.
2. Launch an instance, and establish an SSH connection.
3. Live Migrate the instance. (The SSH connection will not drop)
4. Live Migrate the instance back to the compute node it came from.
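The steps above roughly correspond to the following Nova CLI invocations (a sketch only -- instance name, hosts, and addresses are placeholders, not values from the report, and the exact invocation the customer used has not been confirmed):

```
# Assumed reproduction flow; test-vm, compute-1/compute-2, and $VM_IP
# are illustrative placeholders.
$ nova boot --flavor m1.small --image <image> test-vm   # started on compute-1
$ ssh cirros@$VM_IP                                     # keep this session open
$ nova live-migration test-vm compute-2                 # session survives
$ nova live-migration test-vm compute-1                 # session hangs here
```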
Actual results:
SSH connection hangs, eventually leading to "broken pipe".
Expected results:
SSH connection should not hang or drop when moving back to the original compute node.
Additional info:
We narrowed the issue to precisely the case where we move the instance back to its original compute node. The customer environment has 3 compute nodes, and we could start on any of the three and move the instance to each of the other two without losing SSH. It is only when moving back to the original node that the issue occurs.
--- Additional comment from Kashyap Chamarthy on 2014-12-02 15:07:19 EST ---
I just extracted the tar file; looking at the Neutron configs, the
setup seems to involve GRE tunneling with the Neutron Open vSwitch plugin
and RabbitMQ. The environment involves 3 Nova Compute nodes and a
Controller node.
Also, some more questions to help diagnose the issue:
- How was live migration performed? If via Nova CLI,
precise invocation would be useful to know.
- What kind of live migration? The environment involves shared
storage, correct?
- What is the exact libvirt error?
- Just to ensure I understand the issue correctly from your original
description: Live migration fails in this case only when you try to
migrate the Nova instance back to the Compute node where the
instance was *originally* launched, right?
If possible to perform the live-migration test again, some log files
that can be useful:
- Sanitized Nova debug logs (preferably contextual on the source and
destination hosts): compute.log, conductor.log, scheduler.log
- /var/log/libvirt/qemu/instance.log -- For both compute nodes involved.
This will give us the QEMU CLI that's being used by libvirt
If the test can be repeated, it'll be useful to enable debug logging
for libvirt on the Compute nodes involved, as libvirt is the
component doing the heavy lifting under the hood. These logs
give us the detailed interactions between libvirt and QEMU on the
source and destination Compute nodes. To enable:
1. In /etc/libvirt/libvirtd.conf, have these two config attributes:
. . .
log_level = 1
log_outputs="1:file:/var/tmp/libvirtd.log"
. . .
NOTE: Do *not* forget to turn off this debugging control once the
test is finished -- it continues to log heavily and might fill up
the disk space, otherwise.
2. Restart libvirtd:
$ systemctl restart libvirtd
3. Repeat your test.
4. Capture the libvirt logs and attach them as plain text to the
bug.
The customer case that led to the original bug was closed after we suggested they set the AliveInterval to a non-zero value. We'll re-open this bug if the customer re-opens the support ticket.
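For reference, "AliveInterval" presumably refers to the OpenSSH keepalive options; the exact setting the customer changed is not stated in the ticket. A sketch of the two places a non-zero keepalive can be set:

```
# Client side (~/.ssh/config or /etc/ssh/ssh_config):
Host *
    ServerAliveInterval 30    # send a keepalive probe every 30s
    ServerAliveCountMax 3     # drop the connection after 3 missed replies

# Server side (/etc/ssh/sshd_config) -- alternative location:
#   ClientAliveInterval 30
#   ClientAliveCountMax 3
```

With keepalives enabled, the client notices a stalled connection instead of hanging indefinitely, which matches the workaround described above.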