Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1179934

Summary: Repeated Live Migration can cause instance sessions to hang.
Product: Red Hat OpenStack Reporter: Lon Hohberger <lhh>
Component: openstack-nova    Assignee: Nikola Dipanov <ndipanov>
Status: CLOSED INSUFFICIENT_DATA QA Contact: nlevinki <nlevinki>
Severity: high Docs Contact:
Priority: high    
Version: 5.0 (RHEL 7)    CC: benglish, berrange, dasmith, dmaley, eglynn, jtrowbri, kchamart, ndipanov, nlevinki, pbrady, sbauza, sferdjao, sgordon, vromanso, yeylon
Target Milestone: z4    Keywords: ZStream
Target Release: 6.0 (Juno)   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1167905 Environment:
Last Closed: 2015-04-28 14:08:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1167905, 1168676    

Description Lon Hohberger 2015-01-07 21:03:44 UTC
+++ This bug was initially created as a clone of Bug #1167905 +++

Description of problem:

We are able to do a Live Migration of an instance from compute node 1 to compute node 2 without dropping an active SSH session to the instance.

However, if we then Live Migrate the same instance back to compute node 1, the SSH session hangs.

Version-Release number of selected component (if applicable):

openstack-nova-compute-2014.1.2-1.el7ost.noarch


How reproducible:
GSS has not reproduced the issue; Lee Yarwood was working on reproducing it.

The customer can reproduce the issue reliably.

Steps to Reproduce:
1. Setup an OpenStack environment capable of Live Migration.
2. Launch an instance, and establish an SSH connection.
3. Live Migrate the instance. (The SSH connection will not drop)
4. Live Migrate the instance back to the compute node it came from.
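The steps above can be sketched with the Nova CLI. Instance, image, flavor, and host names here are placeholders; the precise invocation used in the customer environment is not confirmed in this bug (see comment from Kashyap below asking for it):

```console
# Hypothetical reproduction sketch; names are placeholders.
nova boot --flavor m1.small --image rhel7 demo
ssh cloud-user@<demo-ip>        # keep this session open throughout

# Step 3: migrate off the original host (SSH session survives)
nova live-migration demo compute-2

# Step 4: migrate back to the original host (SSH session hangs)
nova live-migration demo compute-1
```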

Actual results:

SSH connection hangs, eventually leading to "broken pipe".

Expected results:

SSH connection should not hang or drop when moving back to the original compute node.

Additional info:

We narrowed the issue down to precisely the case of moving the instance back to its original compute node. The customer environment has 3 compute nodes; we could start on any of the three and move the instance to each of the other two without losing SSH. It is only when moving back to the original node that the issue occurs.

--- Additional comment from Kashyap Chamarthy on 2014-12-02 15:07:19 EST ---

I just extracted the tar file. Looking at the Neutron configs, the
setup seems to involve GRE tunneling with the Neutron Open vSwitch
plugin and RabbitMQ. The environment involves 3 Nova Compute nodes
and a Controller node.

Also, some more questions to help diagnose the issue:

  - How was live migration performed? If via Nova CLI, 
    precise invocation would be useful to know. 

  - What kind of live migration? The environment involves shared
    storage, correct?

  - What is the exact libvirt error?

  - Just to ensure I understand the issue correctly from your original
    description: Live migration fails in this case only when you try to
    migrate the Nova instance back to the Compute node where the
    instance was *originally* launched, right?


If possible to perform the live-migration test again, some log files
that can be useful:

  - Sanitized Nova debug logs (preferably contextual on the source and
    destination hosts): compute.log, conductor.log, scheduler.log
 
  - /var/log/libvirt/qemu/instance.log -- For both compute nodes involved. 
    This will give us the QEMU CLI that's being used by libvirt

    If the test can be repeated, it'll be useful to enable debug logging
    for libvirt on the Compute nodes involved, as libvirt is the
    component doing the heavy lifting under the hood. These logs give
    us the detailed interactions between libvirt and QEMU on the
    source and destination Compute nodes. To enable:

     1. In /etc/libvirt/libvirtd.conf, have these two config attributes:

        . . .
        log_level = 1
        log_outputs="1:file:/var/tmp/libvirtd.log"
        . . .

       NOTE: Do *not* forget to turn off this debugging control once the
       test is finished -- otherwise it continues to log heavily and can
       fill up the disk.
  
     2. Restart libvirtd:

        $ systemctl restart libvirtd

     3. Repeat your test.

     4. Capture the libvirt logs and attach them as plain text to the
        bug.
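Steps 1-2 above can be scripted. A minimal sketch, run here against a scratch file so it is safe to try anywhere; on a real compute node, point `conf` at /etc/libvirt/libvirtd.conf and restart libvirtd afterwards:

```shell
# Demo against a scratch copy; on a compute node use /etc/libvirt/libvirtd.conf.
conf="$(mktemp)"

# Step 1: add the two debug settings if they are not already present.
grep -q '^log_level' "$conf" || cat >> "$conf" <<'EOF'
log_level = 1
log_outputs="1:file:/var/tmp/libvirtd.log"
EOF

# Verify the settings landed.
grep '^log_' "$conf"
```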

Comment 2 Nikola Dipanov 2015-01-09 15:02:39 UTC
I will need to ask for more clarification on the original bug. Will link the relevant comments back here.

Comment 5 Dave Maley 2015-04-28 14:08:55 UTC
The customer case that led to the original bug has been closed after we suggested setting the AliveInterval to a non-zero value. We'll re-open this bug if the customer re-opens the support ticket.
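For reference, the comment above does not say which side's keepalive option was changed; presumably it is one of OpenSSH's alive-interval settings. A server-side sketch, assuming sshd_config's ClientAliveInterval with illustrative values:

```
# /etc/ssh/sshd_config (server side): send a keepalive probe every 30 s,
# drop the session after 3 unanswered probes. Values are illustrative.
ClientAliveInterval 30
ClientAliveCountMax 3
```

The client-side analogue would be ServerAliveInterval in ~/.ssh/config. With a non-zero interval, keepalive traffic either refreshes network state after the migration or turns the indefinite hang into a prompt disconnect.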