Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1167905

Summary: Repeated Live Migration can cause instance sessions to hang.
Product: Red Hat OpenStack
Component: openstack-nova
Version: 5.0 (RHEL 7)
Target Release: 5.0 (RHEL 7)
Hardware: All
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Keywords: ZStream
Reporter: John Trowbridge <jtrowbri>
Assignee: Nikola Dipanov <ndipanov>
QA Contact: nlevinki <nlevinki>
CC: benglish, berrange, dasmith, dgilbert, dmaley, eglynn, jtrowbri, kchamart, laine, ndipanov, pbrady, sbauza, sclewis, sferdjao, sgordon, vromanso, wdaniel, yeylon
Doc Type: Bug Fix
Type: Bug
Clones: 1179934 (view as bug list)
Bug Depends On: 1179934
Last Closed: 2015-04-28 14:07:02 UTC

Description John Trowbridge 2014-11-25 15:26:52 UTC
Description of problem:

We are able to do a Live Migration of an instance from compute node 1 to compute node 2 without dropping an active SSH session to the instance.

However, if we then Live Migrate the same instance back to compute node 1, the SSH session hangs.

Version-Release number of selected component (if applicable):

openstack-nova-compute-2014.1.2-1.el7ost.noarch


How reproducible:
GSS has not yet reproduced the issue; Lee Yarwood is working on it.

The customer can reproduce the issue reliably.

Steps to Reproduce:
1. Setup an OpenStack environment capable of Live Migration.
2. Launch an instance, and establish an SSH connection.
3. Live Migrate the instance. (The SSH connection will not drop)
4. Live Migrate the instance back to the compute node it came from.
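The reproduction steps above, sketched as Nova CLI calls (the instance name, flavor, image, and host names here are illustrative placeholders, not taken from the customer environment):

```
$ nova boot --flavor m1.small --image <image> migration-test   # step 2
$ ssh cloud-user@<instance-ip>                                 # keep this session open
$ nova live-migration migration-test compute-2                 # step 3: migrates cleanly
$ nova live-migration migration-test compute-1                 # step 4: SSH session hangs
```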

Actual results:

SSH connection hangs, eventually leading to "broken pipe".

Expected results:

SSH connection should not hang or drop when moving back to the original compute node.

Additional info:

We narrowed the issue down to precisely the case of moving the instance back to the node it originally ran on. The customer environment has 3 compute nodes; we could start on any of the three and move the instance to each of the other two without losing SSH. It is only when moving back to the original node that the issue occurs.

Comment 3 Kashyap Chamarthy 2014-12-02 20:07:19 UTC
I just extracted the tar file. Looking at the Neutron configs, the
setup seems to involve GRE tunneling with the Neutron Open vSwitch
plugin and RabbitMQ. The environment involves 3 Nova Compute nodes
and a Controller node.

Also, some more questions to help diagnose the issue:

  - How was live migration performed? If via the Nova CLI, the
    precise invocation would be useful to know.

  - What kind of live migration? The environment involves shared
    storage, correct?

  - What is the exact libvirt error?

  - Just to ensure I understand the issue correctly from your original
    description: Live migration fails in this case only when you try to
    migrate the Nova instance back to the Compute node where the
    instance was *originally* launched, right?


If possible to perform the live-migration test again, some log files
that can be useful:

  - Sanitized Nova debug logs (preferably contextual on the source and
    destination hosts): compute.log, conductor.log, scheduler.log
 
  - /var/log/libvirt/qemu/instance.log -- For both compute nodes involved. 
    This will give us the QEMU CLI that's being used by libvirt

    If the test can be repeated, it'll be useful to enable debug logging
    for libvirt on the Compute nodes involved, as libvirt is the
    component doing the heavy lifting under the hood. These logs give us
    the detailed interactions between libvirt and QEMU on the source and
    destination Compute nodes. To enable:

     1. In /etc/libvirt/libvirtd.conf, have these two config attributes:

        . . .
        log_level = 1
        log_outputs="1:file:/var/tmp/libvirtd.log"
        . . .

       NOTE: Do *not* forget to turn off this debugging control once the
       test is finished -- otherwise it continues to log heavily and
       might fill up the disk.
  
     2. Restart libvirtd:

        $ systemctl restart libvirtd

     3. Repeat your test.

     4. Capture the libvirt logs and attach them as plain text to the
        bug.
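Step 1 above can be sketched as a shell snippet. This is a minimal sketch that appends the two settings to a throwaway copy of the config; on a real compute node the target file is /etc/libvirt/libvirtd.conf and root privileges are needed.

```shell
# Demo copy of the config; point this at /etc/libvirt/libvirtd.conf for real use.
conf=$(mktemp)

# Append the two debug settings from step 1.
cat >> "$conf" <<'EOF'
log_level = 1
log_outputs="1:file:/var/tmp/libvirtd.log"
EOF
```

After editing the real file, restart the daemon as in step 2 (`systemctl restart libvirtd`), and remember to revert the settings once the test is finished.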

Comment 6 Jeff Dexter 2015-01-06 13:42:41 UTC
Created attachment 976809 [details]
compute node 46a, original node

Comment 7 Jeff Dexter 2015-01-06 13:47:19 UTC
Created attachment 976810 [details]
node 47a 2nd node in chain

Comment 9 Nikola Dipanov 2015-01-09 16:12:38 UTC
So I have several more questions, since I was not able to fully reproduce the issue in my test environment.

1) Where did you run the SSH client? On one of the nodes involved in the deployment, or on a laptop/workstation?
2) What would be really useful is the output of 'neutron net-list', and of 'neutron net-show' for all the networks, including the instance's networks.
3) Finally, as a workaround, would it be possible to try tweaking ServerAliveInterval and ServerAliveCountMax on the client (or ClientAliveInterval and ClientAliveCountMax in the instance), and see if that helps?
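A hedged sketch of the keepalive workaround suggested in 3); the 30/10 values are illustrative assumptions, not taken from the bug. The snippet writes to a temporary file; on a real client these lines belong in ~/.ssh/config, and the ClientAlive* counterparts go in the instance's /etc/ssh/sshd_config.

```shell
# Demo file; on a real client this would be ~/.ssh/config.
cfg=$(mktemp)

cat > "$cfg" <<'EOF'
Host *
    ServerAliveInterval 30
    ServerAliveCountMax 10
EOF
```

With these values the client probes every 30 seconds and gives up after 10 missed probes, so a dead connection is detected in roughly five minutes rather than hanging indefinitely.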

Comment 10 Nikola Dipanov 2015-01-09 16:24:18 UTC
So regarding 2) in comment #9 - please see the following link for a handy way to provide all the interesting info on this

https://kashyapc.fedorapeople.org/virt/openstack/debugging-neutron.txt

Comment 11 Dr. David Alan Gilbert 2015-01-14 11:01:52 UTC
If this is just in-flight SSH connections dropping, I wonder if this could be a case of:

https://bugzilla.redhat.com/show_bug.cgi?id=1081461

That bug is partly to do with the order in which network interfaces are brought 'up', so the interactions are a bit weird.

Comment 12 Kashyap Chamarthy 2015-01-14 11:58:39 UTC
(In reply to Dr. David Alan Gilbert from comment #11)
> If this is just in-flight SSH connections dropping, I wonder if this
> could be a case of:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1081461
> 
> That bug is partly to do with the order in which network interfaces
> are brought 'up', so the interactions are a bit weird.

Not sure if people will take note of the "Doc Text" field of the above bug, written by Laine Stump, so posting it here for convenience:

  "In previous versions of libvirt, when migrating a guest that used
  macvtap ("direct") network interfaces, network connections to the 
  guest would often be reset as soon as the migration started,
  potentially leaving the guest unreachable until migration was
  finished. This was due to the destination host beginning to send out
  packets with the guest's MAC address while the guest was still running
  on the source host.  libvirt now assures that the destination host
  keeps the guest's macvtap devices inactive until the guest has been
  stopped on the source host, thus eliminating any interruptions in
  guest connectivity"

Comment 13 Laine Stump 2015-01-14 19:23:39 UTC
To further aid in deciding if Bug 1081461 is the source of the problem here, note that:

1) if the guest is not using macvtap, then it most definitely isn't the source of the problem (for other types of interfaces, only the guest itself will send packets with the MAC address used by the guest)

2) in the case described in that bug, libvirt did not log any errors (and none would be expected). So any errors in the log that seem to be related to the problem (the errors in Comment 5 appear unrelated to me) would be a vote against Bug 1081461 being the cause.

Comment 18 Dave Maley 2015-04-28 14:07:02 UTC
We did not hear back from the customer after suggesting they set the AliveInterval value to non-zero, and the support case has since been closed. We'll re-open this bug if the customer re-opens the support ticket.